001 , Shortle
002 , Lomize-Andrei
003 , Gerloff
004 , Harold-Scheraga
012 , Levitt
018 , Raghava-GPS
027 , SHESTOPALOV
028 , Ram-Samudrala
031 , BioInfo.PL
032 , Wolynes
035 , Rose-Group
039 , Sandia-USF
040 , Brooks-Feig
046 , Avbelj-Franc
055 , Bystroff
058 , Harrison-Weber
065 , Torda-Andrew
080 , Skolnick-Kolinski
090 , Hogue-Feldman
115 , Baldi
139 , Dill-Ken-A
150 , Chandonia-Cohen
151 , Pred2ary/Chandonia
152 , Yoon
155 , TUDELFT
214 , Aberystwyth
221 , osgdj
223 , Braun-UTMB
225 , Rokko
251 , HenryS
252 , Akiyama
281 , Jones-Ab
309 , roland-luethy
325 , DChu11
352 , zhu
354 , baker
375 , Ho-Kai-Ming
383 , HeadGordon-Teresa
384 , Murzin
385 , Prof-server
392 , CIRB
402 , PROF/Rost
414 , Friesner
424 , PDG-contact-pred/Valencia
439 , bme-mathatiitb
446 , Xia
453 , Noguchi
469 , fain
487 , GNM-AB
489 , FCLD
498 , Kollman-Baker
510 , PSSP
512 , ELAN
522 , Beverdige
FRAGFOLD: Ab initio predictions made by assembling fragments of supersecondary
structure Brunel University
Doublet Code of Protein Secondary Structure and its Application for Secondary
Structure Prediction and Fold Recognition Institute of Cytology of Russian Academy of Sciences
Protein Secondary Structure Prediction Using Nearest Institute of Microbial Technology
Secondary structure prediction from threading results Australian National University
Pediction of protein tertiary structure based on dihedral Institute of Microbial Technology
MOLECULAR DYNAMICS SIMULATION OF HYDROPHOBIC COLLAPSING FROM A MODEL Delft University of Technology
The Prof protein secondary structure prediction server University of Wales, Aberystwyth
A Combined Sequence Homology and Secondary Structure Prediction Methodology University of Wales, Aberystwyth
PROF: profile-based neural network prediction Columbia Univ
ORFeus, protein structure prediction strategies International Institute of Molecular and Cell Biology
Ab Initio Prediction of Protein Structure based on Modeling the Denatured State
as an Ensemble of Overlapping Gapped Fragments The Johns Hopkins University School of Medicine
Predicting Protein Tertiary Structure using a Global Optimization Lawrence Berkeley National Laboratory
Stanford University
A segment matching and folding algorithm for Stanford University
Prediction of the Three-Dimensional Structure of Proteins National Institute of Chemistry
Prediction and Refinement Protein Structures by Indian Institute of Technology, Bombay.
Ab initio protein folding using threading based, tertiary restraints Danforth Plant Science Center
THREADNNSSP - Secondary Structure Prediction Combining Local Threading and
Salamov/Solovyev NNSSP. Amgen, Inc.
Solvent-induced multi-body force field for ab-initio protein structure prediction
Kobe University
Prediction of contact maps with neural networks and correlated mutations. CIRB and Dept. of Biology University of Bologna
Physical Model Ab Initio Protein Folding University of Bath
Assembly of protein cores from regular secondary structures: College of Pharmacy, University of Michigan
Fully Automatic Protein Structure Prediction Using Evaluation Functions Independent Contributor
SSpro, a web server for protein secondary structure University of California, Irvine
Prediction of Protein Secondary Structure Using the Threading Algorithm and
Local Sequence Homology Electrotechnical Laboratory
Ab initio predictions by residue-residue contact matrices Washington University
Protein Structure Prediction using Trajectory Information Samuel Lunenfeld Research Institute, Mount Sinai Hospital
Randomized and Multiple Model Approaches to Homology Modeling and Ab Initio
Modeling. TJU
Predicting Secondary Structure with Ensembles of Decision Trees Sandia-USF
Assmebly of protein tertiary structures from Bayesian blocks Amgen, INC.
Rosetta/AMBER Hierachy Method, Molecular Dynamics refinement of Rosetta
predictions UCSF
Jones-Ab , 281
number of submitted models: 88
David T. Jones
email: David.Jones@brunel.ac.uk
For CASP4 targets which we could not predict using fold recognition
methods, our FRAGFOLD (D.T. Jones, PROTEINS. Suppl. 1, 185-191)
method was used to generate up to 5 ab initio structures. This approach to
protein tertiary structure prediction is based on the assembly of recognized
supersecondary structural fragments taken from highly resolved protein
structures using a simulated annealing algorithm.
For all targets (including CM and FR targets), secondary structures were predicted
using PSIPRED (D.T. Jones, J. Mol. Biol. 292, 195-202, 1999). PSIPRED predictions in
CASP4 (as opposed to CAFASP2) were generated with a database updated at the CASP
deadline rather than the CAFASP2 deadline and with alpha/beta structure weights set
to
In detail, our 3-D submissions were calculated using the following procedure:
1. Selection of fragment library. At each sequence position a list of 10
supersecondary structural fragments is generated. These fragments (alpha
hairpins, beta hairpins, alpha corners, beta corners, beta-alpha units, beta-
alpha-beta units and split-beta-alpha-beta units) are taken from 200 highly
resolved protein chains with no chain breaks. The selection process
involves ranking the fragments in order of potential energy Z-scores (an
ungapped alignment is used for this ranking), though excluding any
fragments which had a secondary structure different to the PSIPRED
secondary structure prediction (only where the PSIPRED confidence was
>= 0.8).
2. Simulation. A classic Metropolis scheme is employed in running the
simulation. Random moves are made by selecting either one of the
preselected 10 fragments at a randomly chosen sequence position, or a free
choice is made from all 3-5 residue fragments from the entire fold library.
These moves are first tested to ensure than the generated structure is
physically possible (steric checks) and then accepted if the Metropolis
criterion is met. The starting temperature for the simulation is selected by
making 500 random moves to the starting conformation and calculating the
largest absolute energy change between any two moves. The simulation is
started at a temperature corresponding to 10 times this delta-E, and the
temperature is halved after either 5000 random moves have been accepted
by the Metropolis criterion, or a total of 50000 sterically allowable moves
have been tested.
3. Potentials. The THREADER V3.0 potentials were used. These are
distance-dependent potentials of mean force compiled from a non-
redundant set of protein chains with resolutions < 2.6 Angstroms. For the
single cyclic peptide target a simple harmonic bond constraint was added to
draw together the N and C-termini. Predicted secondary structure was not
incorporated in the objective function. Energies were summed over homologous
sequences found by running the target sequence through one iteration of
PSI-BLAST. In addition to the mean force terms, simple terms were added to
take account of hydrogen bonding in beta-sheets, chain compactness and steric
hindrance (tables of minimum CA-CA, CA-CB and CB-CB distances were used).
4. Final model selection. 20 separate simulations were run in parallel with
different random seed values on a farm of 12 dual-CPU Linux machines.
The 20 final structures were clustered using the NMRCLUST program
(L.A. Kelley et al., Protein Engineering, 9, 1063-
1065, 1996) and the representatives of the highest ranked clusters were
submitted (up to the CASP maximum of 5) as final predictions.
SHESTOPALOV , 027
number of submitted models: 152
Shestopalov Boris V
email: shest@mail.cytspb.rssi.ru
The problem of the protein three-dimensional structure prediction has not
yet resolved. We propose to resolve this problem using the Linderstrom-Lang
hierarchial model of the protein three-dimensional structure formation [1].
The first step of this process is the secondary structure formation. Then the
local folds are formed - the supersecondary structure stage. The final stage
is the tertiary structure formation.
We state that all the information on these stages is coded and contained in
the previous levels of structure. The protein secondary structure code for
water-soluble proteins is now determined (the preliminary versions are
described in [2] and [3], some modifications are done for CASP4). The code is
doublet one. The alpha-helices are coded by the amino acid residue pairs (i, i+4),
the beta-structures - by the pairs (i, i+2), the coil regions - by the
pairs (i, i+1). The code is overlapping one and the overlapping is resolved by
the selection rule aiming to keep the most number of codons after selection.
During the CASP4 experiment the protein secondary structure code has been used
for the secondary structure prediction and the protein fold recognition. For
secondary structure prediction the homologous sequences information has been
used. The homologous sequences were searched by BLAST2 [4](EMBL service),
PSI-BLAST and Conserved Domain Database [5] , NCBI service and PRODOM [6]. The
most deverse subset has been used. Then the predicted secondary structure has
been confronted with the secondary structures from Protein Data Bank in search
of similar sequences of secondary structure elements. The results obtained
have been used for the fold recognition. In the case of helix-turn-helix motif
our method has been used [7]. In some cases, when it was possible and useful,
the expert considerations have been used. After the fold recognition the
secondary structure prediction is corrected using the secondary structure
alignment of the predicted secondary structure and secondary structures for
proteins from PDB, recognized as most similar to the predicted protein.
Alignment has been constructed manually, using, when it is possible, Yale
structural alignments for PARENT and its homologues [8].
Evidently the result of the fold recognition depends on the quality of the
secondary structure prediction and the final secondary structure prediction
depends on the quality of the fold recognition. The main restriction of the
method used is the application of the doublet code of the secondary structure
based on the middle interactions only, excluding long ones. The use of the
homologous sequence information, as it is known, does not garantee correct
secondary structure prediction as well as the use of the fold recognition
results.
We hope to correct this situation after the completion of our theory
of the protein three-dimensional structure, now in development. Then all the
formal and expert ad hoc schemes developed for CASP4 will became unnecessary.
We hope that in future it will be possible to construct the code tables using
only pure physical considerations without statistical analysis of PDB data as
now.
Five models for secondary structure prediction have been constructed.
MODEL A is single sequence prediction (SSP), obtained by DOUBLET CODE METHOD
as model 3 in CASP3 [3], with slightly modified code tables, MODEL B is
obtained from MODEL A by transforming ambiguous and undetermined regions into
COIL. MODEL C is multiple sequence prediction (MSP), obtained by application
of DOUBLET CODE and PSI-BLAST, MODEL D is variant of MODEL C, obtained using
Prodom, MODEL E is MSP with more expert intervention, including the using of
the fold recognition results, described above. MODEL 1 is MODEL E, if absent -
MODEL D, if absent - MODEL C, if absent - MODEL B. For the fold recognition it
has been used MODEL D, if absent - MODEL C, if absent - MODEL B.
References
1. Linderstrom-Lang K.V. (1952) Proteins and enzymes, Stanford Univ. Press,
Stanford, California.
2. Shestopalov B.V. Prediction of protein secondary structure by doublet code
method. Mol. Biol., Moscow, Engl. transl.,24/4, p.900-907.
3. Shestopalov B.V.,CASP3, submitted.
4. Yan P. Yuan, Eulenstein, O., Vingron, M. & Bork, P. 1998. Towards detection
of orthologues in sequence databases. Bioinformatics, 14, 285-289
5. Altschul S.F., Madden T.L., Schaffer A.A., Zhang J., Zhang Z, Miller W &
Lipman D.J. /1997/. Nucl. Acid. Res. v. 25, pp 3389-3402.
6. http://www.toulouse.inra.fr/prodom.html.
7. Shestopalov B.V. Amino-acid sequence template useful for
alpha-helix-turn-alpha-helix prediction, FEBS Lett. 233: (1) 105-108 JUN 6 1988.
8. Krebs W., Gerstein M. http://bioinfo.mbb.yale.edu/align/server.cgi.
PSSP , 510
number of submitted models: 32
Neighbor and Neural Network Approach
G P S Raghava
email: raghava@imtech.ernet.in
Author developed a method for predicting secondary structure of protein from
its amino acid sequence. This method is based on artificial intelligence
techniques (AI) called i) Nearest Neighbor approach; and ii) Neural network
approach. In our approach four steps are in involved in secondary structure
prediction. In first step, secondary structure (SS) is predicted using
nearest neighbor method. In second step, SS is predicted using neural
network. In third step, SS is predicted using previous two steps based on
probability of correct prediction. In fourth and final step, predicted SS is
refined using structure to structure prediction approach. This method allows
to predict SS from single sequence as well as from multiple alignment. Detail
description of these steps is given below.
1. Nearest Neighbor Method
-----------------------
The basic idea of nearest neighbor methods is to use the examples closely
related to the test instance to determine the secondary structure of the test
instance. The success of prediction of this approach is directly depend upon
the closely related known examples corresponding to a test instance. The
nearest neighbor methods outperform the neural network if there are similar
or identical examples corresponding to a test instance but perform poorly in
absence of closely related examples. The number of known examples is
increasing with time, because the protein structure is continuously
increasing.
Author optimizes the parameters used in nearest neighbor method to improve the
accuracy of prediction of standard nearest neighbor method. In past the database
of known examples was limited because these examples were generated from
limited set of proteins (around 126). To overcome this limitation, author generates
a database of known examples from all proteins in PDB (PDB, 1998), which
maximize the number of examples in database. This improves the accuracy of
prediction because performance of nearest neighbor method is directly
proportional to number of know examples. Author also estimates the probability of
correct prediction of each amino acid as well as distribution over the three states.
One of the major drawback of nearest neighbor method is its speed. This method
is very slow and time taken to predict SS of a protein is directly
proportional to number of known examples, because it compares query pattern
with all known patterns. This method becomes nearly impractical when we
consider all known examples in PDB. To overcome this problem, we modified the
standard nearest neighbor approach. In this approach closely related patterns
were searched by comparing query patterns to only those known patterns, who
have three central residues (One central and two adjacent residues) identical
to query pattern. This makes our modified approach nearly 800 (20*20*20)
times faster than standard approach. It was observed that this decreases the
performance of method marginally.
2. Neural Network Method
---------------------
In this study we implement standard feedforward neural network method to
predict SS of protein. Our network consists 75 hidden units and an input window
of 17 amino acids. We trained network on all protein in PDB. To overcome
problem of overfitting, divided our data in two sets, where one was used for
training and other for checking overfitting. We also compute the probability of
correct prediction based on score.
3. Combined Prediction
-------------------
Performance of two approaches depends upon known examples. If known
examples are there than nearest neighbor can outperform neural network but in
absence of known example its performance is poor. Predicted SS from these
methods were combined to get the best accuracy, which was based on probability
of correct prediction.
4. Structure to Structure Prediction
---------------------------------
Structure to structure prediction approach has been used to refined the predicted
secondary structure. In this approach, authors develop a neural network method
for predicting secondary structure of a residue from predicted SS. This help in
producing more realistic results by, for instance, suppressing helix and strands of
length one.
One of the challenge in area of Bioinformatics is to provide wide accessibility
to the methods /techniques /information in field of biosciences. Author
developed a dynamic web server allow user to predict secondary structure of
their proteins from its amino acid sequence, based on the above approach. The
common gateway interface (CGI )script of this dynamic web server is written
in programming language PERL. This web sever is accessible via Internet:
http://imtech.chd.nic.in/raghava/pssp or
http://mail.imtech.res.in/raghava/pssp/ or
http://www.imtech.res.in/raghava/pssp/
Torda-Andrew , 065
number of submitted models: 93
Abraham, M, Ayers, D, Dosztanyi, Z, Huber, T,Procter, JB,
Russell, AJ, Torda, AE
email: Andrew.Torda@anu.edu.au
Alignments were calculated and models ranked using the sausage
program [1]. Sidechains were fitted using a self-consistent
mean-field method [2].
Three force fields were used in three different steps
1. Sequence to structure alignments used a score function
which used the identity of only one interaction partner
[5]. This allowed us to use the Gotoh method [4] for speed,
while avoiding the frozen approximation or double dynamic
programming.
2. Ranking of models used a z-score optimised force field [3]
3. Fed by unbounded optimism or perhaps pure faith,
side-chains were placed on the models using a more
conventional, physically based, molecular mechanics style
force field.
The first two force fields may be knowledge-based, but they
were built in complete ignorance of Boltzmann
statistics. Instead, the parameters are optimised so as to
distinguish native coordinates from a mass of misfolded
structures.
A second series of optimisation calculations allowed us to
find weights for additional terms for secondary structure
predictions [6], sequence similarity and gap penalties.
Finally, the library of templates consisted not of simple
protein coordinates, but rather of precalculated fields due to
averaging over similar structures.
The alignment code and methodology is undisputably fast. It
may occasionally be correct.
For the last few targets, secondary structure predictions were
made using a neural net fed on the sausage alignment
calculations.
--------------------
[1] Huber T, Russell AJ, Ayers D, Torda AE (1999)
Bioinformatics, 15, 1064-1065.
Sausage: protein threading with flexible force fields.
[2] Huber T, Torda AE, van Gunsteren WF (1996), Biopolymers,
39, 103-114.
Optimization methods for conformational sampling using a
Boltzmann-weighted mean field approach.
[3] Huber, T and Torda, AE (1999) Protein Sci, 7, 142-149.
Protein fold recognition without Boltzmann statistics or
explicit physical basis.
[4] Gotoh, O. (1982) J Mol Biol, 162, 705-708.
An improved algorithm for matching biological sequences.
[5] Huber T, Torda AE (1998) J Comput Chem, 15, 1455-1467.
Protein sequence threading, the alignment problem, and a
two-step strategy.
[6] Rost B and Sander C. (1993) J Mol Biol, 232, 584-599.
Prediction of protein secondary structure at better than 70%
accuracy.
Raghava-GPS , 018
number of submitted models: 124
angles and secondary structure prediction
G P S Raghava
email: raghava@imtech.ernet.in
A method is developed to compute tertiary structure of a protein from its
amino acid sequence. This method uses the predicted dihedral angles and
secondary structure to build the model. Following steps are involved in
this approach i) dihedral angles of protein backbone is predicted using
method PDAP developed during this study; i) secondary structure is predicted
using standard secondary structure prediction method, to refined the dihedral
angles; iii) backbone dependent rotamer library is used to predict dihedral
angles of side-chains; iv) tertiary structure is build from dihedral angles
of protein backbone and side-chains. v) structure was refined using PROCHECK
by removing the unusual bond lengths and angles. Following is the description
of procedure
Prediction of dihedral angles
-----------------------------
Author developed a method to predict the dihedral angles of protein backbone
from its amino acid sequence. The Protein backbone conformation is divided in
seven states where each conformation state represents the allowed
conformations of isolated peptide as described by Rooman et al. (1991) (J.
Mol. Biol. 221:961-79). Each state can be represent by single value of the
dihedral angles (phi,psi,omega). Author developed a seven state prediction
method using neural network, which allows to predict, conformation state of
residues in protein. The Neural Network used in this study is a 75 hidden
units, feedforward network with backpropagation learning. The protein
backbone dihedral angles were assigned for each residue from predicted
conformation state of residue.
Refinement of dihedral angles
-----------------------------
In second step, secondary structure of residues in protein was predicted using
PSSP. The backbone dihedral angles of residues whose secondary structure was
predicted as helix and strand by PSSP were substituted by new dihedral
angles (--65.30,-39.40) and (-117.00, 142.00) respectively. Residues who
were assigned as coils by PSSP were kept unchanged (as assigned in step 1).
Dihedral angles of side-chain
-----------------------------
The dihedral angles of side-chains of each residue were assigned using BBDEP
program. This program uses the backbone dependent library to assign dihedral
angles of side-chains.
Building tertiary model from dihedral angle
-------------------------------------------
The dihedral angle information of backbone and side-chain predicted by above
procedure was used to build tertiary structure. A computer program was
developed which uses the standard bond length and angles and predicted
dihedral angles to generate the tertiary structure of protein in Cartesian
coordinate.
Final Refinement by PROCHECK
----------------------------
To refined the tertiary structure of protein, we applied PROCHECK validation
suite. We clean the structure by taking resolution criteria 2.5 angstrom. New
file generated by PROCHECK was used as final structure.
TUDELFT , 155
number of submitted models: 44
IN EXTENDED STATE
Jaap A. Flohil, Simon W. de Leeuw,
email: j.a.flohil@tn.tudelft.nl
Protein synthesis is a complex, multistep process with main stages of initiation,
elongation and termination (Wimberly et al, 2000; Nature Vol 407 pp. 327-339).
This process is mimicked by a protocol in which the early events of folding
are simulated by gradually releasing a target polypeptide in extended state from
its restraints. An initial simulation model was created by mutation of a fully
extended polyglycine into the target sequence. GROMACS (Berendsen et al,
1995; Comp. Phys. Comm. 95 pp. 43-56) with GROMOS96 force field (improved
alkane dihedrals) was used for all simulations, and among applied parameters
were periodic boundary conditions, temperature coupling and long range
electrostatics. For each target, the system was initially built up by a shell
of explicit water of 0.6 nm added around the stretched model, and the system
was placed in a 12x12x150 nm box. Before each of the following simulations an
energy minimization was performed. After adding water or renewing the water,
a 10 ps run with position restraints on the protein was done to equillibrate
the water. A main collapsing run of 1 ns was performed. This run started with
all amino acid positions harmonically restrained, releasing each 10 ps a
consecutive residue from its restraints until the complete chain was able to
remove free. The first residue restraint was released at the N-terminal side,
the last restraint at the C-terminal side. For some targets, e.g. T0088, the
residues from the second model were released from their restraints in reversed
order. In some cases (T0086) no restraints were applied.
If the residue releasing procedure was completed
before the end of the simulation, then the remaining time was used to continue
without restraints. From the 1 ns trajectory, each 1 ps a snapshot was recorded,
and for each snapshot the radius of gyration about the principal axes of the
protein atoms were computed.
During the collapsing run, most polypeptides folded into a low density globular form.
The speed and the efficiency of the collapse depended on the used protocol.
The frame with the conformation in the most compact state
was selected for further refinement. Drifting water molecules having no contact
with water shell or protein were removed from the box, and the dimensions of
the box were maximally reduced, and empty holes in the box were filled with
water molecules.
For some of the targets, for example T0102, additional information about the
structure was used to steer the simulation by external forces. T0102 is a
cyclic polypeptide, and the secondary structure was known a priory. The helix
structures were formed by applying a distance depended potential on the
internal helix hydrogens bonds. The bond formation and ring closure was
mimicked by adding a distance dependent potential to the MD hamiltonian. A
penalty is added to a linear potential, as long as the distance between the
terminal atoms 1 N MET and 70 C TRP exceeds the treshold value of the regular
peptide bond distance. During a 3 ns simulation, following after the
collapsing run, early formation of secondary structure is visible. The first
ns shows a sharp decrease of coil, but an increase in bended structure.
Formation of B-bridge and B-sheet was visible in about 10 0f the residues, and
remains at constant level during the second ns. Formation of helix like
structures occurred very rare. based on the evolution of forming secondary and
tertiary segments, as well as backbone-backbone contacts and free energy of
the water, the best model conformation was selected for submission.
Although a ns simulation is at least 6 orders of magnitude shorter then
required to obtain sufficient folding statistics, a rather stable intermediate
conformation maybe found and might serve as starting structure for fold recognition
or folding acceleration methods.
Prof-server , 385
number of submitted models: 42
Ross D. King
Mohammed Ouali
email: rdk@aber.ac.uk
The protein secondary structure prediction program Prof (Ouali & King,
2000) was used to predict all sequences. All predictions were made
automatically using the default server (see URL below). Prof predicts
structure by cascading multiple neural networks and linear
discrimination classifiers. The accuracy of Prof is estimated at
76.7% (using leave-one-out cross-validation) on a non-redundant
dataset of 496 non-homologous sequences (obtained from G.J. Barton and
J.A. Cuff). This database was especially designed to train and test
protein secondary structure prediction methods, and it uses a more
stringent definition of homologous sequence than in previous studies.
A major strength of Prof is that it can be tuned to discriminate the 3
classes (H, E, C) with an accuracy of up to 78% for beta-strands.
Prof Server: http://www.aber.ac.uk/~phiwww/prof/
Ouali, M., & King, R.D. (2000) Cascaded multiple classifiers for
secondary structure prediction. Prot. Sci 9, 1162-1176
Aberystwyth , 214
number of submitted models: 41
Ross D. King
Mohammed OualiAndreas Karwath
email: rdk@aber.ac.uk
The following combined methodology was applied to predict the secondary
structure of all sequences.
1) For all sequences the HI sever (URL below) was first run to
identify homologous sequences of known 3D structure. HI is a sequence
homology program which uses machine learning to bootstrap sequence
similarity searches. Hi learns rules that are true for sequences
which are clearly homologous to the target sequence, and uses these
rules to discriminate homologous sequences in the twilight zone.
2a) If no homologous sequence was detected of known structure then the
Prof server (Ouali & King, 2000: URL below) was used to predict
secondary structure.
2b) If there was a homologous sequence of known structure, then the
sequence of strongest homology was chosen and aligned to the target
and then used as a template to predict secondary structure.
Occasionally the homology predictions of HI were overruled by hand
and Prof used instead.
HI server: http://www.aber.ac.uk/~phiwww/hi/
Prof server: http://www.aber.ac.uk/~phiwww/prof/
Ouali, M., & King, R.D. (2000) Cascaded multiple classifiers for
secondary structure prediction. Prot. Sci 9, 1162-1176
PROF/Rost , 402
number of submitted models: 43
Burkhard Rost
email: rost@columbia.edu
PROF extends PHD in various ways.
First, the neural network system was retrained on a larger
data set of more than 1000 proteins.
Second, the profiles input to the networks were generated
with an iterated PSI-BLAST search against SWISS+TREMBL+PDB.
Third, particular aspects were added to the input units
neural networks.
BioInfo.PL , 031
number of submitted models: 97
Janusz M Bujnicki, Leszek Rychlewski
email: leszek@bioinfo.pl
Several features of a protein can be inferred based on sequence
similarity or assumed homology with other proteins. These include 3D
structure or general functional description. Sequence alignment is the
most common approach used for the assertion of homology. The
predictive power and utility of homology based prediction methods
increases with the continuously growing database of proteins with
annotated structure or functions. Additional increase in predictive
power can be attributed to the improving accuracy and sensitivity of
sequence comparison methods. Consideration of sequence information
deduced from the family of proteins closely related to the query
protein, as performed by PSI-Blast, enabled a dramatic boost in the
predictive power of sequence comparison methods. The approach of
amplification of sequence information by incorporation of evolutionary
related sequence is being pursued further in a bilateral fashion. The
family of BASIC (Bilaterally Amplified Sequence Information
Comparison) programs represents prediction methods utilizing the
evolutionary information on both sides of the comparison: the query
and the template. ORFeus, the current descendant of this approach was
used during CASP4. The main difference between such prediction
methods and threading approaches is the ability the possibility of a
sensitive exploration of the protein sequence space. The target protein
and its family based profile is compared with more than 100,000 other
profiles. This comparison is expanded in an iterative fashion until a
profile of a family with known structural information is found.
In CASP4, the sequence analysis and the final structural predictions
were supported by the utilization of modern threading approaches,
which were carefully benchmarked before (http://bioinfo.pl/livebench).
The Structure Prediction Meta Server was used for this additional
analysis (http://bioinfo.pl/meta).
Shortle , 001
number of submitted models: 29
David Shortle and Michael Ackerman
email: shortle@welchlink.welch.jhu.edu
The method attempts to predict protein structure at low resolution by crudely
modeling the denatured state as an ensemble of overlapping fragments. The
free energy of fragment ensembles is very roughly approximated by weighting
both the energies of individual fragments and an ensemble “pseudo-entropy”
term, which contains the frequency of occurrence of secondary structure, the
handedness of connections and cross-overs, proximity between pairs of segments
of secondary structure, and more global features of topology. The central
working assumption is that the denatured state ensemble is positioned in the
widest free energy well in compact conformation space (Shortle et. al, 1998).
In the first step, the sequence of the target protein is divided into fragments
(ungapped 15 - 35 residues, gapped 35 -45 residues) that extend from the end of
one turn to the beginning of a second turn. Initially, turns are localized by
computing turn propensities (Chou and Fasman, 1980), and subsequently their
positions are refined from turn locations in low energy fragment ensembles.
These fragments of sequence are threaded through a set of 1600 proteins (the
FSSP database of Holm and Sander, 1996), with evaluation of both the empirical
pair potential of Bryant and Lawrence (1993) and the secondary structure
probability from the helix, strand, and turn propensities of Chou and Fasman.
Secondary structure is then predicted from profiles of the frequencies of
turns, helices and strands displayed by overlapping sets of 10 to 30 lowest
energy fragments that do not exceed a specified level of secondary structure
improbability.
Using the predicted central residues of helices, strands,
and turns, the PDB is searched again for larger gapped fragments (+/- 3
residues in each predicted turn) with the correct secondary structure and the
lowest energy. Sets of overlapping fragments were superposed to identify the
most common topological relationship between segments of secondary structure
inside and outside the region of overlap. For the predictions submitted to
CASP4, visual inspection was the principal method used to identify patterns
among helices and strands, and then all of these features were put into a
consistent schematic model of the overall topology. In this process, the
empirical rules described Jane Richardson and others were used to resolve
uncertainties. The initial model, usually built manually by docking sets of
fragments from more than one protein, contained gaps of +/- 3 residues in all
turns; these were patched using the lego_loop function of the O software
(Alwyn Jones). In work since the end of CASP4, superpositioning based on
distance matrices and rapid coordinate calculations is being used to build
gapped models with little or no human intervention.
If the schematic model resembled a known structural class, protein structures
taken from this class were used to build a low energy model. Thus for some
targets, the method serves for fold identification. In addition, the
secondary structure from fragment ensembles can be used to aid alignment to
known structural homologues when there are too few homologues to nail down the
secondary structure with confidence using methods such as PHD or PSIPRED.
Because fragment assembly into a model was based on human intervention, the
method was not immune to additional sources of information and input of the
pattern-recognition type.
Shortle, D., Simons, K.T. and Baker, D. Proc. Natl. Acad. Sci. USA 95: 11158 (1998).
Bryant, S.H. and Lawrence, C.E. PROTEINS 16:92 (1993).
Fasman, G.D. Prediction of Protein Structure and the Principles of Protein
Conformation, Plenum Press, (1989), page 282.
Holm, L. and Sander, C. Science 273:595 (1996).
HeadGordon-Teresa , 383
number of submitted models: 21
Algorithm
E. Eskow, B. Bader, V. Lamberti, R. H. Byrd, and R.B. Schnabel
Computer Science Department, University of Colorado at BoulderCampus Box 430, Boulder,
CO 80309, USA.andSilvia Crivelli and T. Head-GordonPhysical Biosciences, NERSC, and
Life Sciences DivisionsLawrence Berkeley National Laboratory, Berkeley, CA 94720,
USA
email: TLHead-Gordon@lbl.gov
The stochastic/perturbation with soft constraints method
(henceforth, the SPSC method) for predicting tertiary structure
of proteins combines knowledge about secondary structure with a
large-scale global optimization method and a full-scale local
minimization algorithm (1-4). It uses secondary structure
information both to guide the search and to reduce the size of
the search tree through the formulation of soft constraints.
The SPSC algorithm is composed of two phases.
Phase I locates some good local minimizers through local
minimizations with soft constraints. These minimizations are done
using a limited-memory BFGS local minimization algorithm varying
all the cartesian coordinates of the atoms in the protein. The
soft constraints are derived from predictions of secondary structure
obtained from neural networks; we primarily relied on the Psi-pred
server for these predictions, although we used J-pred predictions
for one protein. Given the primary sequence, the neural networks
predict whether the secondary structure of each amino acid
should be alpha-helix, beta-sheet, or coil, and provides
an indicator of the strength of each prediction. Phase I begins
with a starting structure that is the minimum closest to the fully
extended chain with all the backbone dihedral angles phi=180° and
psi=-180°. The local minimizations encourage the formation of
alpha-helices and beta-sheets through the use of penalty (reward)
functions; the strength of a penalty (reward) function depends on
the strength of a neural network prediction. For the case of
beta-sheets, i,j pairs were predicted using the algorithm
described in H. Zhu & W. Braun, Protein Science (1999) V.8,
326-342, and we used several most probable pair lists. The
local minimizers that result from Phase I therefore contain
predicted secondary structure but do not contain any correct
tertiary structure.
Phase II improves those minimizers through expensive global
minimizations in a sub-space of the torsion angles corresponding
to amino acids predicted to be coil. The global optimization for
finding the optimal dihedral angles of coil residues uses a method
introduced by Rinnooy-Kan. This global optimization approach is one
of the few that provides a (weak) theoretical guarantee of finding
global optimum. A brute-force search is avoided by *selectively*
doing a local minimization based on whether a new proposed start
structure lies within a certain distance metric, and whether its
energy is lower than other existing structures; if a new start
structure lies within the distance metric of another structure, and is
higher in energy, it is assumed to lie within an existing basin of attraction,
and is rejected from further computational consideration. Because the
probabilistic theoretical guarantee is higher for small dimensional problems,
the idea is to select a subset of ~6-10 variables from the space of torsion
angles predicted to be coil. A small scale global optimization is performed
on the selected torsion angles as variables while keeping the rest
temporarily fixed at their current values. This algorithm is general
in the sense that subspaces of arbitrary dimension can be explored.
However, in practice, the amount of work required to reach the
theoretical guarantee is prohibitive for large subspaces. The small-scale
optimization produces a number of local minimizers in the subspace
of torsion angles chosen. A number of those conformations with low
energy values are selected for local minimizations in the full
variable space. The new minimizers obtained from the local minimizations
are merged into the current list, are clustered and ordered by energy value
(see below) and the second phase starts again. The process repeats for a
number of iterations, until no further progress is made according to a
stopping criteria we describe below.
To guarantee that the search through the space of possible
solutions is well balanced in breadth and depth of the tree, a
heuristic is used to determine which local minimizer to pick
from the list of local minimizers for the next round of phase II.
In a balancing stage, the work is balanced over a fixed number of
trees. A tree consists of an initial minimizer and all the minimizers
generated from it by applying a global optimization on a small
subspace of torsion angles followed by a local minimization
over the entire space. At each iteration of the balancing stage,
the tree with the least amount of work performed on its members
so far is selected, and the best configuration in this tree that
has not already been used is chosen. After the fixed number of
iterations of the balancing stage have been performed, the
remaining iterations of Phase II correspond to the non-balancing
stage. In this stage, the best configuration is selected, regardless
of which tree it comes from. We have found that the combination of
breadth and depth in the search of the configuration space
contributes to the success of this method.
The SPSC algorithm uses the AMBER molecular mechanics energy
function, EAMBER, to represent the protein-protein interactions.
We have added an empirical solvation free energy term that models
the hydrophobic effect as a two-body interaction between all
aliphatic carbon centers. This description is motivated by our
recent experimental, theoretical and simulation work to determine
the role of hydration forces of model protein systems [5-7]. The
Esolv interaction potential has two minima: one for the carbons in
contact, and one for the carbons seperated by a distance of one
molecular layer of water, with a barrier in between. The benefits
of this form are that (1) we introduce a stabilizing force for
forming hydrophobic cores, (2) it is a well-defined model of the
hydrophobic effect, and (3) it can be described as a continuous
potential that is computationally tractable. We tested the effect
of the solvation energy function on conformations of the 70 amino
acid protein uteroglobin (2utg A) and the 72 amino acid DNA
binding protein (1pou), initially using parameter values based
on the experimental/theoretical work in [5-7]. We found good
agreement between misfolded structures and energies values
using this form of potential, but agreement was better using
parameter values that exaggerate the stability of the contact
and solvent-separated minima. In addition, it proved beneficial
to model the screening of electrostatic interactions by scaling
with a dielectric constant of four, which is typical of a
dielectric for a protein environment. In extensive testing of
Esolv using these parameters with three different alpha-helical
proteins, we compared the energies of the crystal structures
with approximately 40,000 misfolded conformations generated by
our global optimization algorithm. These structures had roughly
correct secondary structure but incorrect tertiary structure.
For each of the proteins the potential function with Esolv
correctly gave the crystal structure as the lowest energy
relative to the 40,000 misfolds.
The global optimization algorithm runs on the T3E at NERSC,
where for CASP4 we used between 64 and 128 processors. At the
end of each Phase II run the algorithm returns between 60-80 of
the best (lowest energy) configurations found thus far. These
60-80 structures are clustered into groups in which members of
a given cluster are within 5-10Å r.m.s.d. of each other (lower
bound for small proteins (under 100 residues), upper bound for
large proteins) and the members of each cluster are energy
ranked. Typically we see on the order of 10-20 clusters in
early Phase II. The best configuration from each cluster is
used as a starting point for the next round of Phase II (whether
balancing or un-balancing). Our experience is that a good sign
of convergence is the contraction of the number of distinct
clusters to one or two, and an energy value which is no longer
changing. We always submitted structures that were our lowest
in energy as the first prediction, the second prediction generally
as the lowest energy of the member of the second lowest energy
cluster, etc.
1. S. Crivelli, T. Head-Gordon, R. H. Byrd, E. Eskow, R. Schnabel
(1999). A hierarchical approach for parallelization of a global
optimization method for protein structure prediction. Lecture Notes in
Computer Science, Euro-Par '99, P. Amestoy, P. Berger, M. Dayde, I.
Duff, V. Fraysse, L. Giraud, D. ruiz (eds.), pg. 578-585.
2. S. Crivelli, T. M. Philip, R. Byrd, E. Eskow, R. Schnabel, R. C. Yu,
T. Head-Gordon (2000). A global optimization strategy for predicting
protein tertiary structure: a-helical proteins. Computers & Chemistry
24, 489-497.
3. A. Azmi, R. H. Byrd, E. Eskow, R. Schnabel, S. Crivelli, T. M.
Philip, T. Head-Gordon (2000). Predicting protein tertiary structure
using a global optimization algorithm with smoothing. Optimization in
Computational Chemistry and Molecular Biology: Local and Global
Approaches, C. A. Floudas and P. M. Pardalos, editors (Kluwer Academic
Publishers, Netherlands), 1-18.
4. S. Crivelli & T. Head-Gordon (2000). A Hierarchical Approach for
Parallelization of Large Tree Searches. Submitted to J. Parallel &
Distributed Computing.
5. A. Pertsemlidis, A. K. Soper, J. M. Sorenson & T. Head-Gordon (1999).
Evidence for microscopic, long-range hydration forces for a hydrophobic
amino acid. Proc. Natl. Acad. Sci. 96, 481-486.
6. J. M. Sorenson, G. Hura, A, K, Soper, A. Pertsemlidis & T.
Head-Gordon (1999). Determining the role of hydration forces in protein
folding. Invited Feature Article for J. Phys. Chem. B, 103 5413-5426.
7. G. Hura, J. M. Sorenson, R. M. Glaeser & T. Head-Gordon (1999).
Solution x-ray scattering as a probe of hydration-dependent structuring
of aqueous solutions. Perspectives in Drug Discovery and Design 17,
97-118.
Xia , 446
number of submitted models: 35
Yu Xia, Michael Levitt, Ram Samudrala
email: yuxia@csb.stanford.edu
We present a combined approach to construct low resolution models of
protein structure from sequence information alone [1,2]. Starting from
protein sequence, we uniformly sample protein conformational space by
complete enumeration on a simple lattice and have a pool of up to 10
billion structures. We then use a variety of selection and
reconstruction techniques to both reduce the size of candidates and
push the distribution of candidate structures to more native-like,
till a single structure model of the protein sequence is generated. A
detailed description follows:
First, we exhaustively enumerate all possible compact bounded lattice
walks on a tetrahedral lattice to capture the overall protein topology
[3,4]. The maximum walk length in our approach is 50; hence each
vertex can represent more than one residue for a protein of size up to
200 residues. To obtain a model, a lattice walk is threaded with the
target protein sequence and the score is evaluated and minimized using
a residue-residue contact function. The 20,000 best scoring structures
and their mirror images are selected using a simple lattice-based
scoring function. At this stage only protein tertiary topology is
captured; there is no secondary structure or side chain information in
these structures. To build all-atom models, we fit predicted secondary
structures to the selected lattice conformations using the consensus
of a combination of methods based on the output produced by the CAFASP
server. The consensus predictions are fit to the lattice structures
using an off-lattice four-state phi/psi model and a sequential
build-up algorithm [5]. Side chains were predicted by simply using the
most frequently observed rotamer in a database of protein structures
[6]. The all-atom conformations are minimised using ENCAD and scored
using a combination of scoring functions that hierarchically reduces
the total number of conformations output produced to five which are
used for the final submissions.
The scoring functions used for the final filtering include an all-atom
distance-dependent conditional probability discriminatory function
(rapdf) [7], a hydrophobic compactness function (hcf) [2], a simple
residue-residue contact function (Shell) [2], a density-scoring
function that is based on the distance of a conformation to all its
relatives in the conformation pool, a secondary structure based
scoring function that evaluates the match between the predicted
structure and the secondary structure of a final energy-minimised
conformation, and standard physics-based electrostatics and Van der
Waals terms.
We have tested our prediction procedure on numerous proteins,
including making predictions at CASP3, with reasonably positive
results [1,2,3]. Unlike Monte Carlo and molecular dynamics, our
approach is largely deterministic and provides a uniform sampling of
structure space. The weakest part in our approach is the
low-resolution nature of the tetrahedral lattice model: structural
features such as sharp turns cannot be accurately represented. In
general, our procedure is more effective for alpha and mixed proteins
than beta proteins. Our main change in this effort since CASP3 has
been the use of improved scoring functions to achieve the best
discrimination. We are thus able to select conformations that are
similar even if the secondary structures are predicted completely
inaccurately (as in the case of T102, where the consensus secondary
structure prediction predicts two sheets instead of helices which we
used "as is" since there is no manual intervention in this method).
[1] Samudrala R, Xia Y, Levitt M. Huang ES. Proteins, S3: 194-198,
1999.
[2] Xia Y, Huang ES, Levitt M, Samudrala R. J. Mol. Biol., 300:
171-185, 2000.
[3] Hinds DA, Levitt M. Proc. Natl. Acad. Sci. USA, 89:2536-2540,
1992.
[4] Hinds DA, Levitt M. J. Mol. Biol., 243:668-682, 1994.
[5] Park B, Levitt M. J. Mol. Biol., 249:493-507, 1995.
[6] Samudrala R, Huang ES, Koehl P, Levitt M. Prot. Eng., 7: 453-457,
2000.
[7] Samudrala R, Moult J. J. Mol. Biol., 275:895-916, 1997.
Ram-Samudrala , 028
number of submitted models: 207
ab initio structure prediction
Ram Samudrala and Michael Levitt
email: ram@csb.stanford.edu
Our general paradigm for predicting structure involves sampling the
conformational space (or generating "decoys") such that native-like
conformations are observed, and then selecting them using a
hierarchical filtering technique using many different scoring
functions. Our goal was to devise a method that would combine the
best aspects of the more successful ab initio methods at CASP3. There
are three stages to our approach, which is completely automated:
1. Generation of secondary structure prediction
The consensus of the secondary structure predictions from the various
servers at the CAFASP meta-server was used as the secondary structure
prediction.
2. Searching of conformational space
We initially start with an all-atom conformation where residues
predicted to be in helix/sheet by the consensus secondary structure
prediction are set to idealised helix and sheet values. The remaining
phi/psis are set in an extended conformation. Side chain conformations
are predicted by simply using the most frequently observed rotamer in
a database of protein structures [1]. New conformations are generated
by replacing three phi/psi values for three residues with identical
sequence which are obtained from a database of known structures. The
scoring function used is primarily a combination of an all-atom
distance-dependent conditional probability discriminatory function
(rapdf) and a hydrophobic compactness function (hcf) [2]. The
conformations are modified based on the fragment insertion approach
and and the energies are minimised by using two different protocols: a
straight-forward monte carlo/simulated annealing approach similar in
spirit to that of Baker and colleagues [3], and a conformational space
annealing approach developed by Scheraga and colleagues [4]. A
combination of minimisation parametres and scoring functions were
used to generate a large pool of conformations.
3. Selection of conformations
The conformations generated were minimised using ENCAD and scored
using a combination of scoring functions that hierarchically reduces
the total number of conformations produced to five which are used for
the final submissions. The scoring functions used for the final
filtering include a simple residue-residue contact function (Shell), a
density-scoring function that is based on the distance of a
conformation to all its relatives in the conformation pool, a
secondary structure based scoring function that evaluates the match
between the predicted structure and the secondary structure of a final
energy-minimised conformation, and standard physics-based
electrostatics and Van der Waals terms.
This work is an attempt at combining three prediction methods from
CASP3, the conformational space annealing method by Scheraga [3], the
use of fragments to sample conformational space by Baker [4], and
using scoring functions of Samudrala et al [1] to not only drive the
search method, but also to make the final selections. In addition,
there are components that are unique to this approach primarily in the
form of the heirarchical filtering methodology employed and in subtle
variations of each of the search methods.
[1] Samudrala R, Huang ES, Koehl P, Levitt M. Prot. Eng., 7: 453-457,
2000.
[2] Xia Y, Huang ES, Levitt M, Samudrala R. J. Mol. Biol., 300:
171-185, 2000.
[3] Simons KT, Kooperberg C, Huang E, Baker D. J. Mol. Biol., 268:
209-225, 1997.
[4] Lee J, Liwo A, Scheraga HA. Proc. Natl. Acad. Sci. USA, 96:
2025-2030, 1999
Avbelj-Franc , 046
number of submitted models: 25
Using the Electrostatic Screening Model and Hierarchic Condensation
F. Avbelj
email: francl@sg3.ki.si
The method for predicting three-dimensional (3D) structure of proteins
is based on the electrostatic screening model of amino acid residue
backbone conformational preferences (ESM) (1-4). According to the ESM,
the stability of backbone conformations is primarily determined by the
balance of strengths between local (E-local) and nonlocal (E-nonlocal)
main-chain electrostatic interactions. These strengths depend on the
amino acid side-chains because the local and nonlocal electrostatic
interactions are screened to a different degree by the solvent. The
hypothesis is based on an analysis of potentials of mean force obtained
from high resolution experimental protein structures (5,1,6-7). The
strongest support for the ESM is provided by recent experimental studies,
which demonstrated that an enthalpic factor is involved in determining
the preferences for alpha-helices and beta-strands (8-9). Further
support for the ESM is provided by the NMR studies of denatured
proteins (10). The ESM has been used for predicting secondary (2; ESM
implemented in the Lifson-Roig theory, Q3 is 69%) and 3D structure
of peptides and proteins (6-7). The 3D structure of proteins is predicted
from sequence alone. The free energy of a protein as a function of its
conformation is obtained from the potentials of mean force analysis of
high resolution x-ray protein structures (6,7). The free energy function
is simple and contains a small number of fitted coefficients (~50). Various
different models for the most critical hydrophobic interactions are used.
The minimization of the free energy is performed by the torsion space
Monte Carlo simulations (6,7). In the first phase of the minimization
procedure only short-range interactions are activated. Short-range
interactions are interactions between amino acids less than four residues
apart in the sequence. The secondary structures are formed during this
initial phase. The long-range interactions are gradually activated in
later phases which is causing the hierarchic condensation of
alpha-helices and beta-strands into super-secondary and the larger compact
structures. Long-range interactions are interactions between the residues
distant in the sequence. All heavy atoms including polar hydrogens are
included in simulations. The procedure is described in details in
references 6 and 7.
References:
1. F. Avbelj and J. Moult, Biochemistry, 34, 755-764 (1995).
2. F. Avbelj and L. Fele, J. Mol. Biol., 279, 665-684 (1998).
3. F. Avbelj, J. Mol. Biol., 300, 1337-1361 (2000).
4. F. Avbelj, P. Luo, R. L. Baldwin, Proc. Natl. Acad. Sci. USA, 97, 10786-10791 (2000).
5. F. Avbelj, Biochemistry, 31, 6290-6297 (1992).
6. F. Avbelj and J. Moult, Proteins: Struc., Funct., Genet., 23, 129-141 (1995).
7. F. Avbelj and L. Fele, Proteins: Struc., Funct., Genet., 31, 74-96 (1998).
8. P. Luo and R. L. Baldwin, Proc. Natl. Acad. Sci. U.S.A., 96, 4930-4935 (1999).
9. M. Lorch, J. M. Mason, R. B. Sessions, A. R. Clarke, Biochemistry, 39, 3480-3485 (2000).
10. M. Hennig, W. Bermel, A. Spencer, C. M. Dobson, L. J. Smith, H. Schwalbe,
J. Mol. Biol., 288, 705-723 (1999).
bme-mathatiitb , 439
number of submitted models: 5
Nonparametric Regression and Heuristic Constraints
Jyothi S. and Rajani R. Joshi
email: jyothi@math.iitb.ernet.in
We estimate the short and medium range correlations between the primary
and tertiary C-alpha distances (dij , |j-i| < 5) using nonparametric
discriminant analysis of some features of the primary chain. These
features include the global as well as the local measures of the sequence
in terms of certain physico-chemical properties of the amino acid residues
in the selected segments of the primary chain.
The heuristics derived from these correlations are then used in a
nonparametric regression model to estimate the short and medium range
C-alpha distance intervals [1]. A large sample of nonhomologous, high
resolution proteins of sizes between 75-150 residues, obtained from the
PDB, was used as the training sample in model estimation and validation.
The model was further validated on a different set of proteins with known
structures. The average accuracy of estimation was found to be above 90%.
Long range distance intervals were then estimated from the sequence by
imposing certain compactness and globular constraints for a well defined
hydrophobic core. A distance geometry program [2] was used to generate the
C-alpha trace of the protein using these distance estimates. About 20 to 30
compatible structures were generated which are then ranked on the basis of
the number of globular constraint and bump distance threshold violations.
We have used our method to determine the structures of various classes of
proteins and have found that the method is well suited (with the Root Mean
Square Error ranging between 1A to 3.0A only) for the prediction of local
folds. Moreover these local structures are stable and are in good agreement
with those computed by the threading methods of [3]. The validation runs on
the set of proteins used by [4] have further shown the superiority of our
method as compared to DRAGON and XPLOR in terms of accuracy and computa-
tional efficiency. For more details of this study see [1].
We have used the method described above to predict the 3d-structures of
the CASP4 targets T0097, T0110 and T0091 --- the main criteria for the
selection of targets being the number of residues in the chain. The atomic
coordinates for these proteins were then obtained from the best ranked
C-alpha structure using Maxsprout [5]. We then refine the backbone for good
stereochemistry through the minimization of a distance based shape
function derived by us. No homology modelling or standard energy function
minimization is used in our method.
References:
[1] Jyothi, S. and Joshi, R. R. "Protein Structure Determination by Non-
parametric Regression and Knowledge Based Constraints". To be published
in Computers and Chemistry
[2] More, J.J. and Zhijun Wu "Distance Geometry optimization for protein
structures." (1999) J. Global Optim. 15(3), 219-234.
[3] Sippl, Manfred "Calculation of conformational ensembles from potentials
of mean force." (1990) J. Mol. Biol. 213, 859-883.
[4] Aszodi, A., Gradwell, M.J., and Taylor, W.R. "Global fold
determination from a small number of distance restraints." (1995) J. Mol.
Biol., 251, 308-326.
[5] Holm, L. and Sander, C. "Database algorithm for generating protein
protein backbone and side-chain coordinates from C_alpha trace." (1991)
J. Mol. Biol. 218, 183-194.
Skolnick-Kolinski , 080
number of submitted models: 142
Andrzej Kolinski, Daisuke Kihara, Marcos Betancourt, Piotr Rotkiewicz,
Michal Boniecki, and Jeffrey Skolnick
email: skolnick@danforthcenter.org
Structure prediction is done using a hierarchical, ab initio folding method
that has potentials constructed using predicted tertiary and secondary
restraints extracted from weakly significant fragments generated from
PROSPECTOR (1), a new threading algorithm, predicted secondary structure,
generic protein like potentials, burial terms and protein specific pair
potentials. First, using a side chain, center of mass based lattice model
(SICHO) of Kolinski and Skolnick (2), initial structures are generated as
follows. Using a prediction of secondary structure, gapless threading of
structures of comparable size is performed using the matching fractions of
the predicted secondary structure to the actual secondary structure of the
templates as a scoring function. Fifty lattice chains were built using the 50
best scoring structures as templates. Then, in the preliminary simulation
runs, fifty replicas were used. The second iterations used the top 20 (20
lowest energy replicas) as the input pool. The simulation results from the
last iteration of the lattice-folding algorithm were subject to a clustering
procedure (3) that was also used to make the final fold selection. All folds
are locally relaxed using a more detailed off- lattice model comprised of the
alpha carbons and a one or two center description of the side chains that
depends on the side chain size; this tends to improve the secondary structure
of the models. Atomic detail is then added and the resulting structures are
reported.
1. J. Skolnick and D. Kihara, Defrosting the frozen approximation: PROSPECTOR:
A new approach to threading ,Proteins in press (2000).
2. A. Kolinski and S. A., Assembly of protein structure from sparse experimental
data: An efficient Monte Carlo Model ,Proteins 32 475-494 (1998).
3. M. Betancourt and J. Skolnick, Finding the needle in a haystack: Educing protein
native folds from ambiguous ab initio folding predictions ,J. Comp. Chem
in press (2000).
HenryS , 251
number of submitted models: 43
Henry Schultz and Michael Bass
email: hschultz@amgen.com
The Salamov/Solovyev NNSSP Secondary Structure Prediction algorithm ((1995)
J.Mol.Biol, 247,11-15) is perhaps the best of the current crop of single method
prediction algorithms (King, et al, (2000) Protein Engineering , 13, 15-19). To help
correct the known deficiencies of this algorithm we combine local thread predictions
with NNSSP to improve overall accuracy.
A reasonable window size is selected and the window is slid across the
target protein one AA at a time. Each window is threaded against a 35
on-redundant collection of 3D structures from PDB. The threading Z-scores are
used to select windows of optimal prediction and the secondary structure
predictions thus obtained are compared to NNSSP results by rules using both
the threading information and the NNSSP alpha and beta probabilities. There
are further rules for handling multiple overlaps and conflicts. The algorithm
generates an initial prediction automatically. Some manual editing may be
done afterward to resolve difficult situations using the multiplicity of
window thread data. After CASP4, we will develop rules to do this and make
the predictor completely automatic.
The algorithm is intended for
predicting the secondary structure of novel proteins so no other information
about biology, alignments or structural similarity are used.
Rokko , 225
number of submitted models: 30
Yoshimi Fujitsuka, Hiroaki Fukunishi, Wenzhen Jin, and Shoji Takada
email: stakada@kobe-u.ac.jp
We utilize a coarse-grained model of proteins that take into account
solvent effects and perform folding simulation of which final structures after
quenching to zero temperature are submitted as predicted structures. The most
important characteristics of our approach is to combine basically
physico-chemical potential energy function with knowledge-based parameter
optimization. The latter is base on the energy landscape theory of protein
folding.
The protein chain is represented with 4 united atoms per residue;
three for backbone atoms and one for side-chain centroid. The side-chain
centroid can move among several positions, each of which corresponds to major
classified rotamer. All the bond length and bond-bond angles are fixed to the
standard values. The interaction potential includes a usual local potential
around dihedral angles, the Ramachandran potential, the hydrophobic (HP)
interaction, the electrostatic interaction that includes the hydrogen bond
(HB), and the van der Waals (vdW) interactions.
One major feature of our model
is the hydrogen bonding that depends on local dielectric constant (Takada,
Luthey-Schulten, and Wolynes (1999) J. Chem. Phys. Vol.110 11616). Namely, the
HB in the protein core is stronger than that at surface. Such a
context-dependence is represented in a multi-body functional form. The same
kind of dependence is taken into account for other electrostatic interaction
(Takada, (2000) Proteins, in press). Physically, this feature is based on
solvent and thus we call this interaction model the "solvent induced
multi-body force field" (SIMFOLD). Amino acid specificity mainly comes in via
the hydrophobic interactions and the vdW interaction between side-chains. The
former depends on the nature of a residue, while the latter depends on the
nature of a pair of residues.
The model includes many energetic parameters
that sensitively affect the prediction power. Thus, we optimize these
parameters based on available protein structural database. Here, we are based
on the so-called energy landscape theory of protein folding. Parameters are
optimized so that the TF/TG for protein libraries are maximized. It is
interesting that optimized energy parameters are significantly correlated to
experimentally obtained energetic scale for the hydrophobic interaction. This
implies potential advantage of combining physico-chemical energy function with
knowledge-based parameter optimization.
Another aspect of the model is to use knowledge-based bias. In particular,
we rely on the secondary structure prediction (PHD server is used) and
introduce an additional bias to the Ramachandran potential. This bias makes
the predicted structure much more secondary structure rich. Especially, the
beta sheet protein is difficult and is time consuming without this type of
bias.
Folding simulation of the standard simulated annealing method is
performed from random coil structures. With one time step 0.03ps, folding up
to 0.4 microseconds is taken place, during which the system is cooled from
450K to 350K. Based on eye inspection, sometimes we repeat these annealing
runs few times for better sampling. Out of several predicted structures, we
choose the final structure mostly by the content of secondary structures,
compactness of the overall shape, and the total energy scoring.
References
Shoji Takada, Zaida A. Luthey-Schulten, and Peter G. Wolynes, Folding dynamics with
nonadditive forces: A simulation study of a designed helical protein and a random
heteropolymer, Journal of Chemical Physics, 110, 11616-11629 (1999)
Shoji Takada, Protein folding simulation with solvent-induced force field:
Folding pathway ensemble of three-helix-bundle proteins, PROTEINS, in press.
CIRB , 392
number of submitted models: 40
Piero Fariselli, Osvaldo Olmea, Alfonso Valencia and Rita Casadio
email: casadio@alma.unibo.it
Our predictor is based on neural networks which were trained to learn the association
rules between the covalent structure of each protein of a selected data base
and its contact map. The input coding includes evolutionary information,
sequence conservation, correlated mutations, and predicted secondary
structures.
The database
For training the network, we use a large set of non-homologous proteins of known
3D structure. The list includes all proteins in the PDB-select list of non
sequence-redundant protein structures whose chain was not interrupted and for
which alignments with more than 15 sequences were obtained: in total our set
includes 173 proteins.
Computing sequence conservation and correlated mutations
Sequence variability was taken from the HSSP database (Sander and Schneider, 1993).
In the HSSP definition, variability is 0 when positions in the multiple sequence
alignments are completely conserved and it increases proportionally to the
number of amino acid changes occurring at that position. Correlated mutations
are calculated as previously described (Göbel, et.al., 1994). Briefly, a
distance array is used to codify each position in the alignment. This
position-specific array contains all the residue-residue distances between all
the possible pairs of sequences at that position. The correlation value
between each pair of positions in the alignment is computed as the correlation
of the two arrays for each possible residue pair. Corresponding elements in
the arrays contain the distance between the same two sequences in the two
positions under comparison. The scoring matrix of McLachlan (McLachlan, 1971)
defines the distances between residues. Positions with a percentage of
gaps>10% are set at a correlation value of -1 and completely conserved
positions are set at a correlation value of 0. The similarity value of gaps is
set to a dummy value of 0.
Neural network architecture and input codings
Neural networks have been proven to be one of the most successful methods for
prediction of contact maps of proteins (Fariselli and Casadio 1999). In this
work we implement a neural network architecture which is similar to that
described before. This topology is the best performing one with the problem at
hand. A single output neuron codes for contact (output value close to 1) and
non contact (output value close to 0). The hidden layer consists of 8 hidden
neurons. A new type of input coding was previously introduced (Fariselli and
Casadio 1999) and it is also used here. Each residue pair in the protein
sequence is coded as an input vector containing 210 elements (20x (20+1)/2),
representing all the possible ordered couples of residues (considering that
each residue couple and its symmetric are coded in the same way). This is done
in order to reduce the number of weight junctions. When single sequence is
used, the input neuron coding for the ordered couple of amino acidic residues
at positions i and j is set to 1, while the remaining 209 are set to 0. In
order to take into account the sequence neighbours we use a 3-residue long
input window, considering both parallel and anti-parallel pairing of the two
segments centred at positions i and j, respectively. This procedure requires
1050 (210x5) input neurons. In order to include evolutionary information the
binary input is changed in a frequency-based one. This means that for any two
positions i and j the frequency of each type of ordered couple obtained from
the same sequence in the alignment is computed (Fariselli and Casadio 1999).
The neural network takes into account for each positions in the sequence
alignment information derived from sequence conservation and correlated
mutations as previously computed (Olmea and Valencia 1997). Moreover the
predicted secondary structures are also added as input to the network.
The network is trained using the back-propagation algorithm and a balancing
procedure (Fariselli et al., 1993) in order to avoid deterioration of the
performance due to the large imbalance between contacts (less abundant) and
non contacts (more abundant). The weight junctions are randomly initialised in
the range [-0.01, 0.01]; the learning rate and the momentum term are set to
0.1 and 0.9, respectively. A sorting procedure based on the network output
values is adopted. Contacts are defined as the highest Lp prediction values
for a protein of length equal to Lp and are routinely characterised by output
activation values greater than 0.75-0.80.
The filtering procedure
To avoid contact overprediction, the predicted pairs are filtered taking into
account the amount of contacts that each residue type can make (Olmea and
Valencia, 1997). The filtering procedure is based on the occupancy data (or
residue-coordination numbers) of each residue. This value is statistically
derived from the set of protein structures of the data base and takes into
account the secondary structure type and the solvent exposition of each
residue. By this, the number of predicted contacts of a residue becomes a
function of its structural environment. The occupancy can be therefore
considered an estimate of the maximal number of contacts that each residue can
make and is used to limit the number of contacts predicted for each residue.
Fariselli P, Compiani M & Casadio R (1993). Predicting secondary structures of
membrane proteins with neural networks. Eur Biophys J 22: 41-51.
Fariselli, P. & Casadio, R. (1999). Neural network based predictor of residue
contacts in proteins. Prot. Engng., 12, 15-21.
Goebel,U., Sander,C., Schneider,R. & Valencia,A. (1994). Correlated mutations and
residue conyùtacts in proteins. Proteins, 18, 309-317.
Mclachlan, A. (1971). Test for comparing related aminoacid sequences. J. Mol. Biol.,
61, 409-424.
Sander, C. & Schneider, R. (1993). The HSSP data base of protein structure-sequence
alignments. Nucleic Acids Res., 21, 3105-3109.
Olmea,O. & Valencia,A. (1997). Improving contact predictions by the combination of
correlated mutations and other sources of sequence information. Fold. Des.,
2, S25-32.
osgdj , 221
number of submitted models: 56
D.J. Osguthorpe
email: djosg@mgu.bath.ac.uk
The Reduced Representation Model and Force Field
Simplified Geometry Model
The model involves representing the backbone of each residue by
one sphere, or 'atom', and the sidechains by up to 3 'atoms'. The
side chains of Ala, Val, Ile, Ser, Thr and Pro are represented by 1
sphere, Leu, His, Asp, Glu, Asn, Gln, Cys and Met by two spheres and
Phe, Tyr, Trp, Lys and Arg by three spheres. The different number of
spheres reflects the anisotropic nature of the average shape of the
corresponding side chains. It also enables assigning different
characteristics to parts of the side chain of a residue, for example,
the side chain of Arg includes a hydrophobic chain and a polar/charged
end. Although in this representation many residues have the same
number of atoms, they do not lose their unique identity since they
have different parameters.
Simplified Potentials
The potentials required can be split into three major groups, the
virtual internal potentials which stabilise the geometry of the
protein, secondary structure stabilisation potentials and the global
potentials, which deal with the effects of the environment but do not
require the environment to be modeled explicitly.
The potential energy function for the model is defined as:
E total = E Internal + E Secondary Structure + E van der Waals + E
Global
Internal Potentials
The values of the parameters were derived by fitting observed
distributions of the corresponding internals in experimental
structures and by emulating the energy surface calculated using a full
atom model.
The internal energy is defined in terms of virtual bond, angles and
torsions (or out of plane). A number of functional forms are used,
the standard full-atom model harmonic terms, quadratic functions and
Gaussian functions plus combinations of these terms. Additionally an
out of plane-virtual valence angle cross-term is defined.
E Internal = E V. bond + E V. angle + E V. torsion + E V. oop + E V.
oop X V. angle
Virtual angle - virtual angle - virtual torsion angle (theta-theta-
phi) cross-terms are defined for dealing with correlations between the
two internal valence angles of a torsion angle in the backbone. These
are particularly important for turn conformations.
Secondary structure energy/Backbone Hydrogen bonding Potentials
With the simplified geometry model only C alpha atoms exist for the
backbone and yet backbone hydrogen bonding is very important in the
stabilisation of the standard secondary structures. However, the
standard secondary structures have a fixed and specific set of
distances between the C alpha atoms. Hence the basic approach was to
determine the equilibrium distances between C alpha atoms in 3-10,
alpha-helices and parallel and anti-parallel beta-sheets and to use
Gaussian functions to stabilise these distances.
E Secondary Structure = E Helix + E Sheet
For the beta-sheets it was also necessary to include some vector terms
as well to ensure only when the two strands were aligned was the
potential strong. Further improvements were necessary to the sheet
potentials as from trial folding runs it became clear that additions
were needed to remove conformations created that are never seen in
real proteins.
It should be noted that in all cases the secondary structure
potentials merely stabilise distances that are found, this is not a
pre-imposition of secondary structure. The beta-sheet potentials do a
full search of all residue pairs to find any that are close enough to
form sheets in each energy calculation.
Secondary structure "prediction" energy
This is a new term added since CASP2 which is required in this model
as the basic preference of residues for a secondary structure is due
to local interactions between the side chain atoms and the backbone
atoms which are missing in this model. Ala, Lys, Arg, Glu, Gln, Leu
and Met are assigned a helix preference, while Val, Ile, Thr, His,
Phe, Tyr, Trp and Cys are assigned a strand preference. As individual
residue conformations only affect the virtual valence angle, the
overall preference is specifically increased only for contiguous pairs
of residues which both prefer the helical conformation or both prefer
the strand conformation.
E Secondary Structure Prediction = E Turn + E Strand
Global/Solvation Potentials
The remaining potentials are used to represent the non-bonded
interactions of the residues with each other and the interactions with
solvent. The fundamental idea behind the solvation potentials was to
use fast approximations to the physical forces involved in real
protein structures.
Physical Model Solvation Potentials.
In this potential model the physical forces of solvation were included
using simple potential models. The main idea was that most protein
atoms should not have an attractive interaction with other protein
atoms, reflecting the fact that the real interactions with protein
atoms would be replaced by solvent interactions if the atom became
exposed, hence its overall energy would not change depending on
whether it was buried or exposed. However, the atoms should still
have excluded volume so a repulsion potential is required at short
distances. An offset Lennard-Jones potential is used, where the well-
depth is offset to 0 at the Lennard-Jones radius and the energy is set
to 0 for distances between atoms greater than the Lennard-Jones
radius. This potential is used for most atoms, in particular the C
alpha backbone atoms and any atom which does not have a specific
Lennard-Jones potential.
E van der Waals = E Offset Lennard-Jones
Physical Model Solvation Potentials - Hydrophobicity
The next effect to consider is the "hydrophobic" effect. I consider
this to be associated with two parts, the Van der Waal's potential
between atoms (which is attractive) and effects due to the
interactions with water. When side-chains are buried in the
hydrophobic core of a protein the only interactions available are the
standard Van der Waal's interactions, as there is no water present.
Hence side-chain atoms of hydrophobic groups were given a standard
Lennard-Jones potential with the energy between the same atom types
the enthalpy of vapourisation of the most similar hydrocarbon. This
would reproduce the energy of the hydrophobic core when hydrophobic
side-chains are buried.
This determined the potential between the same side-chain atom types.
For dissimilar side-chain atom types an analysis of the distribution
of side-chain atoms around an atom in known protein structures showed
to a first approximation little difference in preference between the
atoms. This distribution is not that which is created by rules such
as the geometric mean rules. A function was created which would give
such a distribution and this was used to generate the mixed terms for
the Lennard-Jones parameters of hydrophobic side chain atoms.
Having accounted for the potential of hydrophobic side-chains when
buried and away from water, a potential for hydrophobic atoms when
exposed to water is required. It is only this term which I consider
to be truly the "hydrophobic" part, in the sense it reflects the
effect of hydrophobic groups on water structure. This was done by
introducing a hydrophobic sigmoid potential. (The initial folding
work of Levitt had used a sigmoid potential for hydrophobic residues.)
For CASP4 this sigmoid potential is only used between the aliphatic
side-chain atoms.
A final adjustment to the "hydrophobicity" potential was to give
certain groups in residues not normally considered hydrophobic a non-
zero Lennard-Jones function so that an interaction existed between
them and hydrophobic groups. These groups were not included in the
sigmoid potential. Such groups were the Ala C beta, the Thr C beta
(because of the methyl group), the C beta of the charged amino-acids
Asp, Glu, Lys and Arg and Asn and Gln. It also included the C gamma
atom of Lys and Arg. Observations of experimental structures and
surface accessibility calculations show that these groups are as
buried as any of the atoms in the classic hydrophobic side-chains.
E Global = E van der Waals + E hydrophobic sigmoid
Physical Model Solvation Potentials - Electrostatics
An "inverse" Kirkwood-Tanford model is used for electrostatics, in
which the interactions between the charged groups are varied according
to their local dielectric environment, defined by counting how many
non-polar groups are surrounding them, using a sigmoid function as
before. To take into account of ionic strength effects, which are
assumed to have an affect at large distances between charges but not
at short range (as the Debye-Huckel theory on which this aspect is
based assumes an averaged ionic atmosphere around each charge which is
certainly not true for charges on the surface of a folded protein), a
distance dependent dielectric of the distance squared was used.
Electrostatic interactions were computed using a distance cubed term
and the same term scaled by the sigmoid function of surrounding non-
polar groups. Note that the scaled term accounts for salt-bridges
automatically, as interactions between charged pairs not surrounded by
non-polars (high dielectric) will be weak but strong when surrounded
by non-polars (low dielectric).
E Global = E Electrostatic + E scaled Electrostatic
The other feature of electrostatics that needs to be covered is the
difficulty of burying charges. It is actually a much stronger rule of
proteins that the charged group of charged residues is exposed than
that the sidechains of hydrophobic residues are buried. The simple
electrostatic explanation for this is the self-energy of a charge
which says it requires a lot of energy to move a charge from a high
dielectric region into a low dielectric region.
The same sigmoid function counting the number of non-polar groups
surrounding a charge is used as before, scaled by a potential constant
which gives a positive energy for burying a charge.
E Global = E charge-non-polar sigmoid
Physical Model Solvation Potentials - Scaling
In the low dielectric environment of the folded protein the stability
of the backbone-backbone hydrogen bonds is significantly enhanced as
these hydrogen bonds are excluded from solvent and a hydrogen bond is
essentially an electrostatic interaction. In the unfolded protein the
stability of backbone-backbone hydrogen bonds is likely to be similar
to that of backbone-water hydrogen bonds, hence there should be no
energy stabilising backbone hydrogen bonds. This effect has been
included by scaling the backbone hydrogen bond energy term (E Helix
and E Sheet) by a sigmoid function counting the number of surrounding
non-polar groups.
Folding Simulations - Simulated Annealing procedure
The starting conformation was an all-extended structure using a
rigid geometry procedure based on a standard geometry for the RR
model. A random Maxwell-Boltzmann distribution was used to assign
initial velocities. The initial temperature was set such the average
temperature initially was around 340-350K. 84000x2 steps were run
before starting cooling. The annealing protocol was first to reduce
the total energy by the energy equivalent to 25 degrees of temperature
in 84000 steps followed by 84000 steps at constant total energy. This
was repeated three times. 84000 steps at constant total energy
followed and then the energy was reduced by 12.5K in 84000 steps. The
total energy was then continuously reduced by 6.25K in 84000 steps for
30 runs. The reduction was increased to 12.5K for 5 runs followed by
5 runs at a constant temperature of 100K. Final annealing to close to
0K was done in a single run of 84000 steps.
Lomize-Andrei , 002
number of submitted models: 28
ab initio and fold recognition techniques.
Andrei L. Lomize, Irina D. Pogozheva, and Henry I. Mosberg
email: almz@umich.edu
3D models of protein cores (complexes of several interacting
alpha-helices and beta-sheets, excluding nonregular loops) have
been generated for 19 CASP4 targets with no detectable sequence
homology to proteins of known structure. The partially automatic
procedure described below reproduces main blocks of a large software
package that is under development in our group to test the validity
of the entire approach and its specific parts. The procedure
includes the following three steps.
STEP 1. Ab initio prediction of secondary and supersecondary
structure using two different methods:
(a) calculation of alpha-helices, alpha-hairpins, and beta-hairpins
in hydrophobically collapsed protein using the program Framework [1];
(b) identification of alpha-helices and beta-strands based on
hydrophobicity patterns in multiple sequence alignments [2];
Possible beta-sheet topologies and the structural class of the
target (beta-sandwich, beta-barrel, beta-helix, beta-prism, different
alpha+beta and alpha/beta structures, alpha-superhelix, or alpha-bundle)
were suggested based on a qualitative analysis of results produced by
both methods.
STEP 2. Fold recognition. The procedure included the following
three parts.
(1) Identification of related PDB structures using a library of
"supersecondary nuclei" in proteins [3], and the following criteria:
(a) similar secondary structures of the target and template,
including number, order, and lengths of alpha-helices and beta-
strands, and identical beta-sheet topologies,
(b) similar biological functions;
Twelve of the nineteen targets considered (T0088,T0094,T0098,T0100,
T0101,T0102,T0104,T0107,T0108,T0109,T0118, and T0126) satisfied
these criteria, and therefore were designated for fold recognition.
(2) Finding optimal alignment of secondary structures in the
target and template that maximizes formation of aliphatic, aromatic,
and polar clusters and burial of nonpolar side-chains.
(3) Adjustment of side-chain conformers and the spatial positions
of entire alpha-helices to improve close packing, burial of nonpolar
groups, and hydrogen bonding.
STEP 3. Ab initio assembly of 3D cores from alpha-helices and
beta-sheets - for targets that could not be assigned to any known
protein fold in STEP 2 (T0091, T0095, T0097, T0105, T0106, T0110,
and T0114). The docking of regular secondary structures (using
QUANTA and our unpublished software) sought to optimize burial
of nonpolar side-chains, segregation of aliphatic, aromatic, and polar
groups into separate clusters, close packing, and hydrogen bonding
in simultaneously constructed models of several homologous proteins
from the target family. Two different assembly strategies were tested
for all-alpha-helical domains: stepwise building of the core from
gradually growing structures (T0106 and models 2 of T0095 and T0097),
and formation of a nearly complete core (models 1 of T0095 and T0097).
[1] A.L.Lomize and H.I. Mosberg (1997) Thermodynamic model of
secondary structure for alpha-helical peptides and proteins.
Biopolymers, v.42, pp. 239-269
[2] A.L.Lomize, I.D. Pogozheva, and H.I. Mosberg (1999) Prediction
of protein structure: the problem of fold multiplicity. Proteins,
Suppl.3, pp.199-203
[3] A.L.Lomize, I.D. Pogozheva, and H.I. Mosberg (1999) Protein
structure assembly pathways. Protein Sci., v. 8, Suppl.1, p.86
DChu11 , 325
number of submitted models: 43
David Chu
email: davidchu@attglobal.net
A wide variety of numerical and statistical methods are utilized to determine the
most optimal protein conformation for the prediction of protein secondary structure.
At the heart of our technology is the ability to predict protein secondary
structure using evaluation functions. Each evaluation function is an
algorithm that measures the \223goodness\224 of a given protein conformation.
Conformations with positive values are good candidates for protein\222s
secondary structure. Conversely conformations with negative values are
rejected. Since no single evaluation function alone works the best under all
circumstances, we have constructed many evaluation functions to meet all the
situations we can think of. The evaluation functions take into account all
single and residue pair propensity to be within and at boundaries of loop,
strand, and helix. Also taken into account is [protein sequence] to
[structure pattern] evaluation function obtained from neural network training
and multiple sequence alignment of homologous sequences. Charge distribution,
super secondary structure patterns, degree of sequence similarity to sequences
of known structures from the PDB are all taken into considerations. Our
system is fully automatic and operates totally without human intervention; one
simply inputs a protein sequence, and it outputs a secondary structure
prediction.
Baldi , 115
number of submitted models: 43
prediction based on recurrent neural networks
Gianluca Pollastri, Pierre Baldi
email: gpollast@ics.uci.edu
SSpro is a fully automated system for the prediction of protein secondary
structure. The system is based on an ensemble of bidirectional recurrent
neural networks (BRNNs) [1, 2]. BRNNs are graphical models that learn from
data the transition between an input and an output sequence of variable
length. The model is based on two hidden Markov chains, a forward and a
backward chain, that transmit information in both directions along the
sequence, between the input and the output sequences. Three neural networks
model respectively the forward state update, the backward state update and the
input and hidden states to output transition. BRNNs are trained in a
supervised fashion using the gradient descent algorithm. The error signal is
propagated through the model using the BPTS (backpropagation through
structure) algorithm [3], an extension of BPTT (backpropagation through time),
used in unidirectional recurrent neural networks.
The system is trained on a set of 1180 structures and tested on a set of 126
structures. The test set is the same on which the first version of the server
PHD [4] was trained. The training set is extracted from the Protein Data Bank
that was online in April 1999. The structures obtained using NMR or with a
resolution worse than 2.0 Angstoms are first removed from the set, then an
all-against-all redundancy reduction procedure is run using a rigorous
Smith-Waterman algorithm with the Pam120 matrix for pairwise alignments,
discarding a sequence if it shows more than 25 0dentity to any sequence in the
test set. The same threshold holds for each pair of sequences in the test set.
A second all-against-all redundancy reduction procedure is then run on the set
thus obtained using a threshold of 50equence identity. The target secondary
structure assignments are compiled with the program DSSP [5]. We assign to the
class Helix the alpha-helix (H) and 310-helix (G) DSSP classes, to Strand the
classes extended strand (E) and beta bridge (B), to Coil the other four
classes, consistently with the CASP classification. The system takes as input
a profile obtained from a multiple alignment of protein sequences. The
multiple alignments are compiled with the program BLAST [6], using default
parameters. The database of sequences adopted is the NR that was online in
October 1999 (roughly 420,000 sequences). No further check or filter is run on
the database. Every sequence in the alignment is assigned a weight
proportional to the information the sequence carries with respect to the
unweighted profile. A weighted profile is then compiled and used as input for
the system.
A set of 11 bidirectional recurrent neural networks is trained on
the dataset. For details on the implementation, see [1,2]. The networks
contain roughly 70,000 adjustable weights, have normalised exponentials on the
outputs and are trained using the relative entropy between the target and
output distributions.
The final predictions are obtained averaging the network
outputs for each residue. A performance of approximately 76.5orrect residue
classification is observed on our independent test set (roughly 800n the
training set). SSpro is implemented into a web server that can be found online
at the address: http://promoter.ics.uci.edu/BRNN-PRED/
[1] P. Baldi, S. Brunak, P. Frasconi, G. Pollastri, and G. Soda. Bidirectional
Dynamics for Protein Secondary Structure Prediction, Proceedings of the
Sixteenth International Joint Conference on Artificial Intelligence (IJCAI99),
Stockholm, Sweden (1999).
[2] P. Baldi, S. Brunak, P. Frasconi, G. Pollastri and G. Soda. Exploiting the
Past and the Future in Protein Secondary Structure Prediction. Bioinformatics,
15:937-946, (1999).
[3] P. Frasconi, M. Gori, A. Sperduti. A General Framework for Adaptive
Processing of Data Structures. IEEE Trans. on Neural Networks, 9, 5:768-786, (1998).
[4] B. Rost, C. Sander. PHD - An automatic mail server for protein
secondary structure prediction. Comput. Appl. Biosci., 10(1):53-60, (1994).
[5] W. Kabsch, C. Sander. Dictionary of protein secondary structure: pattern
recognition of hydrogen bonded and geometrical features. Biopolymers,
22:2577-2637, 1983.
[6] S. F. Altschul, T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W.
Miller, D. J. Lipman, Gapped BLAST and PSI-BLAST: a new generation of protein
database search programs, Nucleic Acids Res. 25:3389-3402 (1997).
Noguchi , 453
number of submitted models: 58
Tamotsu Noguchi
email: tnoguchi@etl.go.jp
The SSThread [1]: a method for prediction of protein secondary structure using
the threading algorithm relied totally on the global aspect of a protein. We
have improved the method by introducing weights of the local sequence
homology and the 3D-1D compatibility score (new SSThread). The weight of the
local sequence homology gives higher priority to secondary structures in some
sequence region whose homology score is high in the structural alignment
between the query protein and a protein in the library. And the weight of the
3D-1D compatibility score gives higher priority to all secondary structures
in the sequence whose 3D-1D compatibility score is high. A structural library
for the threading was reconstructed for predictions of CASP4 targets. The
library contains 1692 protein chains, which were selected from PDB at July
11, 2000 by using the system for PDB-REPRDB [2]: a database of representative
protein chains from the PDB at the PAPIA (PArallel Protein Information
Analysis) WWW server (http://www.rwcp.or.jp/papia/). All selected protein
chains are of < 30 identity with one another. On the other hand, a target
protein: T0116 of length >500 residues was applied a method that the protein
was divided into two sequences to overlap 100 residues each other, and the
prediction results of each sequence by the new SSThread were jointed at the
overlap region.
We predicted the protein secondary structure of CASP4
targets by the new SSthread and the New Joint Method [3] using the five
methods: Qian-Sejnowski, SSThread, Ptitsyn-Finkelstein, Nisikawa-Ooi,
Gibrat-Garinier-Robson. The New Joint Method is available from the PAPIA WWW
server. Finally, the results of secondary structure prediction for each
protein were also jointed as giving a priority to the prediction by each
method manually according to the 3D-1D compatibility score. The new SSThread
takes a priority of the prediction, in case that the target size is longer
than 100 residues and a similar protein with a target, whose total
compatibility score is better than -2.8, is found in the structural library
by the 3D-1D compatibility search. In these cases, we also submitted the
structural alignment between a target protein and the highest score protein.
References
[1] Ito, M., Matsuo, Y. and Nishikawa, K. (1997) "Prediction of
protein secondary structure using the 3D-1D compatibility algorithm", CABIOS,
13, 415-423.
[2] Noguchi, T., Onizuka, K., Ando, M., Matsuda, H. and Akiyama,
Y. (2000) "Quick Selection of Representative Protein Chain Sets Based on
Customizable Requirements" Bioinformatics, 16, 520-526.
[3] Nishikawa, K. and Noguchi, T. (1991) "Predicting Protein Secondary
Structure Based on Amino Acid Sequence", Methods in Enzymology, 202, 31-44.
GNM-AB , 487
number of submitted models: 17
Galaktionov S., Nikiforovich G.V., Marshall G.R.
email: gregory@ibc.wustl.edu
The cornerstone of our procedure is the prediction of the residue-residue
contact matrices, A, based on restoring the elements of their eigensystems,
namely, the eigenvectors, Y, associated with the three largest eigenvalues.
We have shown that on the basis of the first three terms of eigenvalue
decomposition, it is possible to make a reasonable reconstruction of the
contact matrix and an excellent delineation of the regions of disallowed
contacts. Specific properties of these eigensystems are discussed in the
poster presentation (Galaktionov S., Nikiforovich G.V., Marshall G.M., 2000,
the CASP-4 Meeting). The set of the eigenvalues of a contact matrix depends
only on the size of the protein and can be predicted with good accuracy. The
elements of the first eigenvector, Y1, are tightly correlated with the
coordination numbers for each residue. Therefore, it is possible to obtain
estimations of the elements of Y1 from the vector of the coordination numbers.
Combining the first eigenvector, Y1, with the constant part of A (contacts
in positions ai,i+1, ai,i+2; standard contacts within alpha-helices and
beta-strands; absence of contacts between the ends of longer elements of
alpha-helices and beta-strands), it is possible to obtain a first
approximation of the contact matrix, Ao, and of the \223non-contact\224
matrix, Aon, consisting of the largest and the smallest elements,
respectively, corresponding to the areas with a very high or low probability
of contact.
The most important element of the procedure is the
evaluation of the second and third eigenvectors, Y2 and Y3. We have shown
that for smaller proteins (< 150 residues) the eigenvectors have a
\223block\224 structure, consisting of 3 to 5 alternating positive and
negative \223blocks\224 of the elements along the sequence. The absolute
values of these elements are correlated with the corresponding elements of Y1.
Based on this observation, all possible Y2 and Y3 eigenvectors can be
generated by systematic combination of all possible block positions; the most
feasible eigenvectors can be selected by checking their orthogonality to Y1
and to each other, conformity with Ao and Aon, etc. For targets we have
submitted (< 130 residues), only 1 \226 3 pairs of the Y2 and Y3 met the above
requirements and were used for prediction of the new contact and non-contact
matrices, A1 and A1n. These were used for the iterative refinement of the
elements of Y2 and Y3 until the procedure converged.
The computational protocol began with the prediction of elements of regular
secondary structure (alpha-helices and beta-strands) by a consensus of
statistical methods available at the URL addresses
http://insulin.brunel.ac.uk/psiform.html and
http://pbil.ibcp.fr/cgi-bin/npsa_automat.pl?page=npsa_server.html. (For
T0102, the assignment of helical fragments was assumed to be that of Langdon
G.M. et al. (1998) J. Biomol. NMR, v.12, pp.173-175.) Coordination numbers
for the protein sequence were predicted in general accordance with Rodionov
M.A. & Galaktionov S.G. (1992) Molecular Biology, v.26, pp. 777-83. These
were used to predict the contact and non-contact matrices according to the
procedure described above. These matrices were further utilized in the
build-up procedure, which generated the self-avoiding backbone structures in
compliance with the matrices (Nikiforovich G.V., unpublished). For most
targets, the resulting structures were too loose, i.e., not satisfying the
expected average interresidue C-alpha - C-alpha distance
Hogue-Feldman , 090
number of submitted models: 22
Howard J Feldman and Christopher W V Hogue
email: feldman@mshri.on.ca
The predictions (T0091, T0097, T0102) were done ab initio, starting
with a secondary structure prediction from PHDsec. In the case of T0102
the secondary structure from (1) was used. The prediction of two very
long helices in T0091 was suggestive of a coiled coil structure.
Using a map of the conformational space available at each residue, based
on the observed conformations of helical, sheet and coil residues in the
PDB, we came up with probability distributions for the conformation of
each residue based on residue type and biased by 3-state secondary
structure prediction, called "trajectory distributions". These were used
as input to a modified version of the FOLDTRAJ algorithm (2) which
performed a random walk in Ramachandran space rather than alpha carbon
space.
Approximately 200,000 structures were generated using the trajectory
distributions to build the backbone. Sidechains are placed
probabilistically using a backbone dependent rotamer library(3).
All residues are chirally and sterically valid, have a minimum of
non-hydrogen van der Waal collisions. For T0091 several distance
constraints from a known coiled coil structure (1A93) were used to bias
the random walk as well.
To account for the cyclic nature of T0102, we observed that a helix was
predicted to pass through residues 1 and 70, so rather than generating
starting at residue 1, our N-to-C random walk was started at residue 6
(through 70, followed by 1-5) at a loop between two helices. After
generation, those for which the N and C termini were more than a few
Angstroms apart were discarded, leaving a few hundred structures. The
N and C termini were then "spliced" together in these structures and the
residues renumbered properly, the Met being residue 1. No attempt was
made to correct the geometry at the splice site, so some minor error is
expected in this region.
From the pool of generated structures, various statistics were gathered,
and the best models were chosen based on their energy
scores(4)(5), radii of gyration, topology, secondary structure content, and
whether they actually satisfied the distance constraints in the case of
T0091. Energy scores were ignored for T0102 since it is a membrane
protein, and the energy scores used were derived for aqueous globular
proteins. Only this final step of choosing the "best" models was
subjective and non-automated.
REFERENCES
1. Langdon GM, Bruix M, Galvez A, Valdivia E, Maqueda M and Rico M.
(1998) Sequence-specific 1H assignment and secondary structure
of the bacteriocin AS-48 cyclic peptide. Journal of Biomolecular
NMR. 12: 173-175.
2. Feldman HJ and Hogue CWV. (2000) A Fast to Sample Real
Protein Conformational Space. Proteins. 39(2): 112-131.
3. Dunbrack RLJ and Karplus M. (1993) Backbone-dependent rotamer
library for proteins. Application to sidechain prediction.
J. Mol. Biol. 230: 543-574.
4. Zhang C, Vasmatzis G, Cornette JL and DeLisi C. (1997)
Determination of Atomic Desolvation Energies From the
Structures of Crystallized Proteins. J. Mol. Biol. 267:
707-726.
5. Bryant SH and Lawrence CE. (1993) An Empirical Energy
Function for Threading Protein Sequence Through the Folding
Motif. Proteins. 16: 92-112.
Harrison-Weber , 058
number of submitted models: 91
Ivan Y. Torshin, Irene T. Weber and Robert W. Harrison
email: robert.harrison@acm.org
Molecular modeling is a combinatorial, multiple minimum optimization problem.
In homology modeling, the known homolog serves as a good starting point for the
search, while in ab initio folding there are only limited geometric data.
Two complimentary classes of algorithms were explored in our CASP-4 predictions:
randomized algorithms, and multiple modeling algorithms.
Randomized algorithms, either based on the Kohonen self-assembling neural
network or an analytic solution for simultaneous circular equations, were used
to explore conformational space and delineate regions of allowed molecular
geometry. These algorithms are computationally efficient; it was possible to
fold most of the CASP-4 ab initio targets several hundred times in a few CPU
hours. Multiple models from independent runs of the randomized procedures
were used to extract conformations that occurred repeatedly, as this improved
the reliability in tests. Hundreds of models were used for ab initio
predictions and ten models for homology modeling. AMMP (Harrison, 1999) was
used to predict 12 ab initio targets and 30 homology modeling targets.
Randomized Algorithms
Our major focus has been to explore new algorithms for building molecular
models and searching conformation space. One general class of algorithms,
randomized algorithms, is especially interesting because these algorithms can
efficiently find or approximate the solutions to combinatorial and geometric
problems (Hertz 1991, de Berg 1997) and can be implemented efficiently on a
parallel computer (JaJa 1992). The general idea behind randomized algorithms
is to use a set of independent identically distributed random variables to
limit the solution to an acceptably small range. Rather than attempt to
converge to an “exact” solution of a mathematical problem, which may not exist
or may not be meaningful in the context of protein structure, randomized
approaches define a sequence of ever-closer bounds on the ranges of solutions.
Two randomized approaches were tested, these were a modified Kohonen neural
network with a distance metric and a randomized analytic solution to distance
restraints (Harrison 1999). Multiple models were constructed using the
distance restraints that were derived from homologous structures or sequences.
Averages over the models were then used to develop a single model for
submission.
Homology Modeling
Protein folds were recognized using the FFAS server (Rychlewski et al. 2000),
the 3D-PSSM server (Kelley et al. 2000), and the screening method we used for
ab initio folding. Clustal (Thompson et al. 1994) was used for multiple
sequence alignments when possible. The thirty targets
86-90,92,93,99-101,103,104,106,107,109,111-113, 115-123, 125,127, and 128 were
modeled. Ten models were generated from each template using either the Kohonen
algorithm or the analytic approach (Harrison 1999) coupled with energy
minimization and a short run of molecular dynamics. The averaged model was
energy minimized to generate the final model. The final models were subjected
to 3ps runs of molecular dynamics, which may degrade the accuracy for the high
homology examples. The variation among the models was calculated for each atom
and used as an estimate of the uncertainty in the positions.
Ab Initio Folding
Ab Initio folding was used for targets
91,94,95,96,97,98,102,105,108,110,114,124 and 126. A simple
hydrophobicity-electrostatics potential was supplemented by a
sequence-specific empirical potential to improve the stereochemistry of the
prediction. Inter-residue distances were estimated by searching the protein
database for short stretches of homology from different and unrelated
proteins. Simply finding the best local fit for each overlapping window of
amino acids does not result in a good self-consistent set of distances.
However, when the requirement for chain continuity is enforced, the problem of
identifying a self-consistent set of inter-residue distances becomes akin to a
convolutional error correcting code which is readily solvable by dynamic
programming (Viterbi 1967). This continuity condition is an inherent property
of all polymers and provides a significant gain in prediction accuracy.
Potential templates were identified for homology modeling by using the
proteins that had the most fits for each sequence. The models were generated
in three steps. 1) 200 models were generated with the original potential
functions, using C-alpha-only models. Then inter-residue distances
(C-alpha-C-alpha) were averaged over all the models. Those distances where the
standard deviation was less than 2 angstroms were extracted. 2) A single model
was generated that both satisfied the new distance information and minimized
the hydrophobicity-electrostatics potential. This model was achiral and can
represent either the left or right-handed solution. Secondary structure was
identified visually and used to define additional distance restraints
(published experimental data on helical locations were used for target 102).
3) All-atom models were built for both the right and left-handed solutions.
The best models had right-handed helices.
References
de Berg M., van Kreveld M., Overmars M., and Schwarzkopf, O. (1997) “
Computational Geometry” Springer-Verlag
Harrison, R.W (1999), A Self-Assembling Neural Network for Modeling
Polymers J. Math. Chem. 26,125-137
Hertz J., Krogh, A., Palmer R.G. (1991) Introduction to the theory of
neural computation, Sante Fe Institute studies in complexity lecture
notes vol. 1. Addison-Wesley pp244-246,
JaJa J. (1992) An Introduction to Parallel Algorithms, Addison-Wesley
pp 433-484,
Kelley LA, MacCallum RM, Sternberg MJ (2000), Enhanced genome annotation
using structural profiles in the program 3D-pssm, J Mol Biol 299(2):499-520
Rychlewski L, Jaroszewski L, Li W, Godzik A (2000),Comparison of sequence
profiles. Strategies for structural predictions using sequence information
Protein Sci 9(2):232-41
Thompson JD, Higgins DG, Gibson TJ (1994), CLUSTAL W: improving the
sensitivity of progressive multiple sequence alignment through sequence
weighting, position-specific gap penalties and weight matrix choice,
Nucleic Acids Res 22(22):4673-80
Viterbi A.J. (1967) Error bounds for convolutional codes and an asymptotically
optimum decoding algorithm IEEE trans inf theory IT-13,260-269.
Sandia-USF , 039
number of submitted models: 41
L.O. Hall and W.P. Kegelmeyer and C. Springer and K.W. Bowyer and N. Chawla
and T. Moore, Jr.
email: avatar@ca.sandia.gov
The Sandia-USF approach to the secondary structure prediction of
proteins was a purely pattern recognition based approach. The entire
PDB as of July 11, 2000 was used as our training set. Structures
between 2.5 and 3.0 A resolution deposited before April 1998 were not
included due solely to a data handling glitch. This provided us with a
training set of over 19,000 protein chains consisting of more than 3.6
million amino acids for secondary structure prediction. The secondary
structure was determined by DSSP from the PDB coordinates.
We used the 20 x M matrix output from the -Q option of psi-blast to
generate log-probability related values between -17 and 17 that each
amino acid would be at a given position. Our approach is similar to
that of Jones in CASP3, except that we used the "nr" (non-redundant)
database without any filtering. Then, to predict the secondary
structure of a given amino acid, a window of size 17 was
used with the amino acid being predicted in the center of the
window. Hence, there were 160 values to the left of the amino acid
being predicted and 160 values to the right of it. These 160 values
were made up of the 20 predictions of likelihood that an amino acid
was in each of the eight positions to the left or right respectively
of the amino acid whose structure was being predicted. Therefore, a
feature vector describing a single amino acid in our training set
consisted of 340 values ranging between -17 and 17 plus the class of
H, C, or E.
For amino acids at the edge of a chain, values of 50 were used as
padding. Proteins made up of more than one chain were predicted chain
by chain.
Our base prediction engine was a set of decision trees generated by
the C4.5 algorithm, a version of which was modified to run efficiently
on the ASCI Red supercomputer. Learning algorithms typically require
that all data fit in memory, which would require about 3 gigabytes of
memory in this case. While it is possible to find machines with the
required amount of main memory, such machines are rare. And though
there are no difficulties with "convergence" when training with a
decision tree, growing a single tree on all of the data requires a
prohibitive amount of time.
Hence, we took a distributed approach. We were able to fit 1/8th of
the data on one ASCI Red processor which had 256 MB of memory. Eight
decision trees could be learned from a disjoint partition of the
data. Uniform voting in which each tree gets to contribute a single
vote was found to provide the best predictions.
As CASP4 progressed and time permitted we used different disjoint
partitions of the data to produce sets of eight trees. In order to
track our performance as we made modifications to our decision
process, we created a validation set which consisted of all x-ray
crystal structures added to the PDB between July 11 2000 and July 28
2000 with resolution better than 3.0A. This set consisted of 146
chains.
We found that higher prediction performance on our validation set
could be obtained with 32 trees, that is, our top four sets of eight
trees. (Again, each set represents a different disjoint partition of
the data and all 32 sets together represent a four times resampling of
the PDB). We incrementally added trees as they were obtained during
the prediction process. The full 32 trees were used only in the last
week.
Finally, we post-processed our predictions, in order to make use of
the fact that, most generally, an amino acid's secondary structure
will match that of its neighbors.
As a first step, we statistically examined all of the training data. A
useful nugget of information, that singleton H's almost never occur,
was used in a process we call "H-stomping." A singleton H is removed
and replaced by the class which is shared by the majority of its
neighbors in window of size 5.
Further, we used a smoothing algorithm to attempt to correct errors in
a sequence of predictions. A window size of 5, centered around the
amino acid whose secondary structure was being predicted, was used.
The prediction strength at each position was the percentage of trees
which predicted the class of H, C or E. The class with the highest
prediction percentage at the center of the window was compared with a
threshold. If the prediction for that class was below the threshold,
and if the delta (difference between the prediction percentages of the
top two classes) was below a second threshold then it was assigned the
majority class of the window. The purpose of the thresholds was to
insure that we changed the original prediction only in the cases where
our decision tree engine was particularly uncertain about its own
accuracy. The smoothing algorithm was updated several times over the
course of the contest with the one described here used on the last
day.
Thus the process for a target chain is: a) expand the chain into
windows of log-probability around each amino acid, b) run each amino
acid window through the currently available n trees and obtain the
prediction percentages for each class, c) smooth the predicted class
data and use the smoothed class as the predicted target class, d)
H-stomp the smoothed chain.
The elements of our approach that are likely to be distinctive
are (1) the pure statistical pattern recognition orientation,
(2) the emphasis on using all available training data, and (3)
the incorporation of oversampling methods.
REFERENCES:
@article{qui87,
author = "J.R. Quinlan",
title = "Simplifying Decision Trees",
journal = "International Journal of Man Machine Studies, V.27",
year = "1987",
pages = "227-248"
}
@book{quib,
author = "J.R. Quinlan",
title = "C4.5: Programs for Machine Learning",
publisher = "Morgan Kaufmann",
address = "San Mateo, CA",
year = "1992",
}
@book{mitch,
author = "T.M. Mitchell",
title = "Machine Learning",
publisher = "McGraw-Hill",
address = "New York, NY",
year = "1997",
}
@Article{Jones1,
author = "David T Jones",
title = "Protein Secondary Structure Prediction Based on
Position-Specific Scoring Matrices",
journal = "Journal of Molecular Biology",
year = 1999,
volume = 292,
pages = "195--202",
note = "www.ideallibrary.com"
}
The SANDIA-USF AVATAR web site:
http://morden.csee.usf.edu/~chawla
zhu , 352
number of submitted models: 54
Jun Zhu
email: junz@amgen.com
There are two main forces that drive protein folding, local and
non-local interactions. Local interactions are highly
sequence-dependent. Sensitive sequence similarity search method
will find favorable local structures. Non-local interactions will
stablize local structures in optimal arrangements. This program
is to simulate this folding process and devided in two steps:
collecting locally similar blocks and assembling those blocks to
achieve the minimum engery state.
First, local structure similarities were predicted based on
sequence information. Bayesian model (Zhu et al, 1999) was used
as sequence similarity search method because it is a natural
block-based method, which only aligns regions of high confidence.
And also each alignment has a confidence value that can
be explored during global energy minimization. Each target
sequence was searched against NR database. Similar sequences
were collected and used in iteratively training a Bayesian
model. Then, the model was searched against PDB45 database.
Aligned blocks that met the following conditions were collected:
(1)the Bayesian evidence of the pair is less than 0.1, (2) the
highest confidence level of residues in the aligned block is
higher than 0.5, (3) the length of consecutive high confidence
region is larger than 10. Long blocks were broken down to small
overlapped blocks of 10~13 AAs.
After obtaining structure blocks, blocks were assembled using
simulated annealing method to miniminize energy functions. The
energy functions were derived similarly as Simon et al. (1999).
We investigated sequence-independent forces (compactness and
secondary structure elements interaction), first order
sequence-dependent forces (environment force, non-specific pair
distance function), second order sequence-dependent forces (
environment pair interactions, sequence specific pair distance
function). The relative weight of each function were estimated
using randomly generated structures. We found that compactness
expressed in term of radius of gyration, beta-strand
interactions and first order environment force have large
contribution to the fininal energy function and used in fininal
structure predictions.
Monte Carlo simulated annealing process was used to generate
structures and minimize energy functions. 5000 iterations were
performed in each trial. Temperature was scheduled to decrease
linearly. 100 trials were conducted for each target and the
resulted predictions were ordered by the same energy functions.
The top fives were submitted for evaluation.
References:
(1)J. Zhu, R. Luethy and C.E. Lawrence (1999) ISMB99 297-305.
(2)K.T. Simons, I. Ruczinski, C. Kooperberg, B.A. Fox,
C. Bystroff and D. Baker (1999) Proteins 34:82-95.
Kollman-Baker , 498
number of submitted models: 26
Matthew R. Lee
David BakerPeter A. Kollman
email: mrlee@cgl.ucsf.edu
For Rosetta, 3 and 9 residue fragments consistent with the query sequence's
profile and se