002 , Lomize-Andrei
003 , Gerloff
012 , Levitt
023 , Jones
027 , SHESTOPALOV
028 , Ram-Samudrala
032 , Wolynes
042 , Honig-Barry
044 , Walts-Wondrous-Wizards
045 , Del-Carpio-Yoshimori
052 , MRIT-Onizuka
065 , Torda-Andrew
073 , Holm
076 , Weng
086 , Bass-Michael
088 , ORNL-PROSPECT
094 , SAM-T2K
103 , Fugue-Cam
111 , SAM-T99
126 , Sternberg
133 , CBC-FOLD
137 , Zhou-HX
150 , Chandonia-Cohen
162 , Valencia-CNB
173 , Barton
187 , SDSC2:Reddy-Bourne
191 , Lee-Jung
197 , Godzik
204 , Finkelstein
220 , valencia-cnb-pred
223 , Braun-UTMB
229 , UCLA-DOE
248 , BMERC
255 , BinToHes
274 , Tsigelny
278 , Flake&mates
280 , Elber-Meller-2000
328 , Gibrat-Marin
329 , Tatsuya
331 , Levy
344 , PDB-ISL
357 , Fischer-Daniel
361 , GMD-SCAI
363 , Moult
375 , Ho-Kai-Ming
381 , SBfold
382 , SBauto
384 , Murzin
389 , 123D+
390 , Taylor
393 , Skolnick-Kolinski-THD
401 , Reva-Boris
414 , Friesner
473 , Mushegian
492 , Knapp
536 , Fox-Sheppard
Fold recognition using THREADER and GenTHREADER Brunel University
Doublet Code of Protein Secondary Structure Institute of Cytology of Russian Academy of Sciences
Intermediate sequence search EMBL-EBI (European Molecular Biology Laboratory's outstation the European
Bioinformatics Institute)
Fold Recognition Using the BioInBgu server Ben Gurion University
Critical Assessment of techniques for protein structure prediction Supercomputer Facility,
Australian National University
Fold recognition and sequence to structure Australian National University
PSI-BLAST, MACAW, SWISSMOD and nothing much more Akkadix Corporation
Handling interconnected structural changes in comparative modelling of Stanford University
Generalized Comparative Modelling: A combined threading Danforth Plant Science Center
Discrete State-Space Models Method BMERC, Boston University
Fold Recognition using structural profiles (3D-PSSMs) and textual information
(SAWTED) Imperial Cancer Research Fund
Fold recognition with 123D+ server. NCI-FCRF
Assembly of protein cores from regular secondary structures: College of Pharmacy, University of Michigan
Recognition of protein structure by threading Novartis Institute for Biomedical Research
Protein threading based on multiple protein structure alignment Human Genome Center, University of Tokyo
Recognition of protein structure by threading Institute of Protein Research RAS, Institute of Theoretical & Experimental Biophysics RAS
Playing protein fold charades Ceres, Inc
Use of several filters to improve the sensitivity and specificity of fold
recognition methods Institut National de la Recherche Agronomique
FFAS fold prediction The Burnham Institute
HIDDEN MARKOV MODELS BASED SYSTEM (HMMSPECTR) FOR DETECTING STRUCTURAL
HOMOLOGIES ON THE BASIS OF SEQUENTIAL INFORMATION University of California, San Diego
Secondary structure and function based protein fold recognition Institute of Computer Science V, University of Mannheim
A sequence-based method of homolog detection using two rounds of BLAST ZymoGenetics, Inc.
Prediction of Protein Structure: a Cooperative Approach National Institute for Medical Research
The threading using Multi-dimensional Singleton Mean-force Potentials and the
sequence-fragment to structure-fragment alignment with continuation-bonus scoring Matsushita Research Institute Tokyo Inc.
Fold recognition with ToPLign/123D and ToPLign/RDP GMD-SCAI
SBfold's procedures for fold recognition. SmithKline Beecham Pharmaceuticals
SBauto's procedures for fold recognition. SmithKline Beecham Pharmaceuticals
FUGUE: sequence-structure homology recognition using environment-specific
substitution tables and structure-dependent gap penalties Department of Biochemistry, University of Cambridge
Distant Homology Recognition and Fold Prediction by a knowledge-based approach
using SCOP and Pfam Centre for Protein Engineering, Cambridge, UK
3-D MODELING OF PROTEIN TARGETS FOR THE CRITICAL ASSESSMENT University of Texas Medical Branch at Galveston
World-Wide Server Fold Recognition and Automatic Modeling Stanford University
Jones , 023
number of submitted models: 56
David T. Jones
email: David.Jones@brunel.ac.uk
THREADER3 is the latest incarnation of our well-known threading
program (D.T. Jones et al. Nature 358, 86-89, 1992) and although
it now incorporates a number of new features (in particular the use of PSI-BLAST
profiles), and a more refined set of potentials, the overall concept of the method
remains more or less unchanged since CASP2. Firstly, a library of unique,
continuous protein domain folds is derived from the database of protein
structures. The fold library used throughout CASP4 was based on the
domains found in SCOP V1.50 (A.G. Murzin et al. J. Mol. Biol. 247, 536-540, 1995).
Each fold is considered as a chain tracing through space with the original
sequence either being ignored completely (for fold recognition predictions)
or weighted into the scoring function (for comparative modelling targets).
The test sequence is then optimally fitted to each library fold (allowing for
relative insertions and deletions in loop regions), using a double dynamic programming
algorithm, with the 'energy' of each possible fit (or threading) being
calculated by summing the proposed pairwise interactions and solvation
parameters.
Unlike in previous years, THREADER3 was used to make fold
assignments without any reference to functional information. This was
partly due to a lack of time and partly because we wished to test a new idea
for automatic post-processing of threading predictions. For CASP4, the
raw threading output was evaluated using a neural network (similar to that used
in GenTHREADER) trained to discriminate between correct and incorrect fold recognition
matches. This method is still very experimental, but it was used for all "non-obvious"
prediction targets. Final predictions were based on the final neural
network output. Predictions for targets where the neural network output
(range 0-1) of the top match was < 0.5 were not submitted (but were
selected for ab initio prediction if the size permitted). Only a single
prediction was submitted for each target, unless either a second fold had an
equal score to the top hit or in a few cases where more than one alignment
was generated with and without secondary structure prediction inputs.
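The submission rule described above can be sketched as follows. This is only an illustration under stated assumptions, not THREADER3 code: the `Match` record, the function name and the `ab_initio_max_length` cut-off are hypothetical (the text says only "if the size permitted"), while the 0.5 threshold and the tie rule are taken from the paragraph above.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Match:
    fold: str     # library fold identifier (hypothetical naming)
    score: float  # neural network output in the range 0-1


def choose_submission(matches: List[Match], target_length: int,
                      ab_initio_max_length: int = 150) -> Tuple[str, List[str]]:
    """Decide what to do with one target.

    ab_initio_max_length is an assumed size limit; the abstract gives no number.
    """
    ranked = sorted(matches, key=lambda m: m.score, reverse=True)
    top_score = ranked[0].score if ranked else 0.0
    if top_score < 0.5:
        # No confident fold: fall back to ab initio only for small targets.
        if target_length <= ab_initio_max_length:
            return "ab_initio", []
        return "skip", []
    # Submit the top fold, plus any fold whose score equals the top score.
    return "submit", [m.fold for m in ranked if m.score == top_score]
```

For example, `choose_submission([Match("a", 0.9), Match("b", 0.4)], 200)` selects fold `a` for submission.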
Remote homology targets were predicted using GenTHREADER/mGenTHREADER as submitted
to the CAFASP2 prediction section. However, in making CASP4 submissions, where
GenTHREADER was able to make a confident prediction (generally in
cases where a clear evolutionary link is apparent between the target
protein and an entry in the fold library), this fold was assumed correct
and THREADER3 was simply used to generate the final alignment (though with
appropriate sequence and secondary structure weighting options).
SHESTOPALOV , 027
number of submitted models: 152
and its Application for Secondary Structure Prediction and Fold Recognition
Shestopalov Boris V
email: shest@mail.cytspb.rssi.ru
The problem of the protein three-dimensional structure prediction has not yet
been resolved. We propose to resolve this problem using the Linderstrom-Lang
hierarchical model of the protein three-dimensional structure formation [1].
The first step of this process is the secondary structure formation. Then the
local folds are formed - the supersecondary structure stage. The final stage
is the tertiary structure formation. We state that all the information on
these stages is coded and contained in the previous levels of structure.
The protein secondary structure code for water-soluble proteins is now
determined (the preliminary versions are described in [2] and [3], some
modifications were made for CASP4). The code is a doublet one. The alpha-helices
are coded by the amino acid residue pairs (i, i+4), the beta-structures by
the pairs (i, i+2), the coil regions by the pairs (i, i+1). The code is
an overlapping one, and the overlap is resolved by a selection rule that aims
to keep the greatest number of codons after selection. During the CASP4
experiment the protein secondary structure code has been used for the
secondary structure prediction and the protein fold recognition. For secondary
structure prediction, homologous sequence information has been used. The
homologous sequences were searched for with BLAST2 [4] (EMBL service), PSI-BLAST and
the Conserved Domain Database [5] (NCBI service), and PRODOM [6]. The most diverse
subset has been used. Then the predicted secondary structure has been
compared with the secondary structures from the Protein Data Bank in search of
similar sequences of secondary structure elements. The results obtained have
been used for the fold recognition. In the case of helix-turn-helix motif our
method has been used [7]. In some cases, when it was possible and useful, the
expert considerations have been used. After the fold recognition the secondary
structure prediction is corrected using the secondary structure alignment of
the predicted secondary structure and secondary structures for proteins from
PDB, recognized as most similar to the predicted protein. Alignment has been
constructed manually, using, when it is possible, Yale structural alignments
for PARENT and its homologues [8]. Evidently the result of the fold
recognition depends on the quality of the secondary structure prediction and
the final secondary structure prediction depends on the quality of the fold
recognition. The main restriction of the method used is the application of the
doublet code of the secondary structure based on medium-range interactions only,
excluding long-range ones. The use of the homologous sequence information, as is
known, does not guarantee a correct secondary structure prediction, nor does the
use of the fold recognition results. We hope to correct this situation after
the completion of our theory of the protein three-dimensional structure, now
in development. Then all the formal and expert ad hoc schemes developed for
CASP4 will become unnecessary. We hope that in future it will be possible to
construct the code tables using only pure physical considerations without
statistical analysis of PDB data as now. Five models for secondary
structure prediction have been constructed. MODEL A is single sequence
prediction (SSP), obtained by DOUBLET CODE METHOD as model 3 in CASP3 [3],
with slightly modified code tables, MODEL B is obtained from MODEL A by
transforming ambiguous and undetermined regions into COIL. MODEL C is multiple
sequence prediction (MSP), obtained by application of DOUBLET CODE and
PSI-BLAST, MODEL D is variant of MODEL C, obtained using Prodom, MODEL E is
MSP with more expert intervention, including the use of the fold recognition
results described above. MODEL 1 is MODEL E; if absent, MODEL D; if absent,
MODEL C; if absent, MODEL B. For the fold recognition, MODEL D has been used;
if absent, MODEL C; if absent, MODEL B.
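Two pieces of the description above lend themselves to a short sketch: the residue pairs read by the doublet code (offsets 4, 2 and 1 for helix, strand and coil) and the fallback order used to pick MODEL 1 and the fold recognition input. The code tables and the overlap selection rule are not given in this abstract and are not reproduced here; the function and constant names are hypothetical.

```python
from typing import Dict, List, Optional, Sequence, Tuple

# Pair offsets stated in the text: (i, i+4) helix, (i, i+2) strand, (i, i+1) coil.
OFFSETS = {"helix": 4, "strand": 2, "coil": 1}

# Fallback orders stated in the text.
MODEL_1_ORDER = ("E", "D", "C", "B")
FOLD_RECOGNITION_ORDER = ("D", "C", "B")


def doublets(sequence: str, state: str) -> List[Tuple[str, str]]:
    """List the residue pairs (i, i+k) that a code table for `state` would read."""
    k = OFFSETS[state]
    return [(sequence[i], sequence[i + k]) for i in range(len(sequence) - k)]


def first_available(models: Dict[str, Optional[str]],
                    order: Sequence[str]) -> Optional[str]:
    """Return the name of the first model in `order` that was actually built."""
    for name in order:
        if models.get(name):
            return name
    return None
```

With only models A and B built, `first_available(models, MODEL_1_ORDER)` returns `"B"`, matching the "if absent" chain in the text.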
References
1. Linderstrom-Lang K.V. (1952) Proteins and enzymes, Stanford Univ. Press,
Stanford, California.
2. Shestopalov B.V. Prediction of protein secondary structure by doublet code
method. Mol. Biol., Moscow, Engl. transl.,24/4, p.900-907.
3. Shestopalov B.V.,CASP3, submitted.
4. Yan P. Yuan, Eulenstein, O., Vingron, M. & Bork, P. 1998. Towards detection
of orthologues in sequence databases. Bioinformatics, 14, 285-289
5. Altschul S.F., Madden T.L., Schaffer A.A., Zhang J., Zhang Z, Miller W &
Lipman D.J. /1997/. Nucl. Acid. Res. v. 25, pp 3389-3402.
6. http://www.toulouse.inra.fr/prodom.html.
7. Shestopalov B.V. Amino-acid sequence template useful for alpha-helix-turn-alpha-helix
prediction, FEBS Lett. 233: (1) 105-108 JUN 6 1988.
8. Krebs W., Gerstein M. http://bioinfo.mbb.yale.edu/align/server.cgi.
Holm , 073
number of submitted models: 21
Jong Park, Sabine Dietmann, Andreas Heger, Liisa Holm
email: holm@ebi.ac.uk
We tried to predict those targets for which there were no known
homologous structures. The basic method systematically applied
in each case was PDB-ISL [1], an intermediate sequence search.
In some cases, the alignment was improved by manual optimization
of atomic solvation preference [2]. A large number of targets
had no obvious hit by PDB-ISL, and their prediction was based
merely on human judgement. Judgement included functional
considerations, if there was a nonsignificant hit which might
be a remote homologue of a known family, and solvation
preference analysis. We did not predict structures which
looked like coiled coil from secondary structure composition.
The most fun prediction was that for T102, where we did not
spot the analogy to nk-lysin. PDB-ISL gave no hits, so we used
pencil and paper. The prediction was based on correlated
residues (implying contacts), which were identified from a
multiple alignment, between residues 15-23, 16-52 and 18-34,
and secondary structure elements known from NMR plus the
information that the ends are covalently linked. Visual
screening of helical structures rejected many as topologically
impossible given helix handedness and the contact constraints,
and resulted in three templates being selected, one of which
required reversing the chain direction. The solvation
preference profile was quite good for the t102-2tct model.
Solvation preference analysis is dangerous because it only
gives really good values for the native sequence-structure
alignment and we had to optimize alignments manually and
compensate for template divergence and gaps by guessing. Some
PDB-ISL predictions were rejected due to bad solvation
preference. Thus t109 was not predicted to be similar to the
top PDB-ISL hit 1kfs, because we only found out too late that
an optimized alignment gave a good solvation preference.
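The general idea of an intermediate sequence search can be sketched as below. This is a toy illustration and not the PDB-ISL implementation described in [1]: the `direct_hits` table stands in for the output of any pairwise sequence search, and all names are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def intermediate_sequence_search(query: str,
                                 direct_hits: Dict[str, Set[str]],
                                 structures: Set[str]) -> List[Tuple[str, str]]:
    """Return (intermediate, structure) pairs that link `query` to a known structure.

    direct_hits[s] is the set of sequences that a pairwise search finds for s.
    structures is the set of sequences with known 3D structure.
    """
    links = []
    for mid in sorted(direct_hits.get(query, set())):
        if mid in structures:
            continue  # a direct hit to a structure needs no intermediate
        for hit in sorted(direct_hits.get(mid, set()) & structures):
            links.append((mid, hit))
    return links
```

Here a query that hits sequence `X`, which in turn hits a structure, is linked to that structure through `X` even though the query itself does not hit it directly.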
References:
[1] Teichmann SA, Chothia C, Church GM, Park J. (2000) Fast
assignment of protein structures to sequences using the
intermediate sequence library PDB-ISL. Bioinformatics 16:117-124.
[2] Bork P, Holm L, Koonin EV, Sander C. (1995) The
cytidylyltransferase superfamily: identification of the
nucleotide-binding site and fold prediction. Proteins 22:259-266.
Fischer-Daniel , 357
number of submitted models: 156
Fischer, Siew, Esterman and Mishalia
email: dfischer@cs.bgu.ac.il
We have submitted predictions for the CASP4 experiment
based on the results from the bioinbgu server. The goal
was to submit the exact same output as the server,
except for the cases where the server's result appeared
to be weak. In such cases, an educated guess based on
the top hits was applied, in combination with alternative
searches performed on parts of the sequence and on homologues.
An attempt to perform basically "computable" tasks was carried
out. When all failed, a coin was flipped. We avoided using
any biological information.
The bioinbgu server has been described in Fischer, D,
Proceedings of the Pacific Symposium in Biocomputing,
Hawaii, 119-130, January 2000, and its abstract follows:
Recent assessments of structure prediction
have demonstrated that
(i) although fold recognition methods can
often identify remote similarities when standard sequence search methods
fail, the score of the top-ranking fold is not always significant
enough to allow a confident prediction;
(ii) the use of structural information such as secondary structure increases
recognition accuracy;
(iii) modern sequence-based methods incorporating
evolutionary information from neighboring sequences can often identify
very remote similarities;
(iv) there is no one single method
that is superior to other methods when evaluated over a wide range of targets,
and
(v) extensive human-expert intervention is usually required for the most
difficult prediction targets.
Here, I describe a new, hybrid fold recognition method that incorporates
structural and evolutionary information into a single fully
automated method. This work is a first attempt towards the automation
of some of the processes that are often applied by human predictors.
The method is tested
with two fold-recognition benchmarks demonstrating a superior performance.
The higher sensitivity and selectivity enable the applicability
of this method at genomic scales.
Flake&mates , 278
number of submitted models: 215
in the Cassandra package
T. Huber, M.J. Abraham, D.J. Ayers, Z. Dosztanyi,
J.B. Procter, A.J. Russell, A.E. Torda, S. Flake
email: Thomas.Huber@anu.edu.au
In previous CASP experiments, methods based on divine inspiration,
biochemical knowledge and predictions from publicly available servers
have been remarkably successful. One may even think it fool-hardy to
submit automatic guesses from a home-built prediction package.
Following Eddie Edwards (Edwards 1990), one may, however, also treat
the experiment as a critical assessment of techniques for protein
structure prediction and use the opportunity to demonstrate
reproducible strengths and weaknesses of one's own method. In this
spirit, the game was not played by instinct. It was regarded as an art
learnt by obedience to instruction and a complete disregard of self.
Our calculations were performed with the Cassandra package, a locally
written protein structure prediction package.
Alignments of target sequences to a library of protein folds were
generated by a two step threading approach (Huber & Torda 1999) with
scoring terms based on terms from z-score optimised force fields
(Huber & Torda 1998), PhD secondary structure predictions (Rost &
Sander 1993) and PAM250 sequence similarity scores. The weights of the
different scoring contributions and penalties for introducing gaps
into alignments have been rigorously optimised against a large,
statistically significant set of structurally aligned proteins with
low sequence similarity.
The template fold library was a set of only 893 protein structures.
The coordinates of the structures, however, were optimised so
as to score well with sequences of structurally similar
proteins and simultaneously penalise inappropriate sequences.
After ranking the predictions, side chains were placed using self-
consistent mean field optimisation (Huber, Torda & van Gunsteren
1994,1996).
Acknowledgement: We would like to thank the inspirational Laurie Nichols.
References:
1) Eddie Edwards (The Eagle) (1990),
"I think what my Olympic participation shows is that you don't have
to be the best in the world to be popular."
2) Huber, T. and Torda, A.E.,
Prot. Sci. 7 (1998) 1-8.
3) Huber, T. and Torda, A.E.,
J. Comp. Chem. 20 (1999) 1455-1467.
4) Rost B. and Sander C., JMB 232 (1993), 584-599.
5) van Gunsteren W.F., Huber T., Torda A.E.,
Proceedings of the 1st European Conference on Computational Chemistry,
American Institute of Physics Conf. Proc., New York (1994).
6) Huber T., Torda A.E. and van Gunsteren, W.F.,
Biopolymers 39 (1996), 103-114.
Torda-Andrew , 065
number of submitted models: 93
alignments without Boltzmann-based force fields
Abraham, M, Ayers, D, Dosztanyi, Z, Huber, T,
Procter, JB, Russell, AJ and Torda, AE
email: Andrew.Torda@anu.edu.au
Alignments were calculated and models ranked using the sausage
program [1]. Sidechains were fitted using a self-consistent
mean-field method [2].
Three force fields were used in three different steps
1. Sequence to structure alignments used a score function
which used the identity of only one interaction partner
[5]. This allowed us to use the Gotoh method [4] for speed,
while avoiding the frozen approximation or double dynamic
programming.
2. Ranking of models used a z-score optimised force field [3]
3. Fed by unbounded optimism or perhaps pure faith,
side-chains were placed on the models using a more
conventional, physically based, molecular mechanics style
force field.
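Step 1 can use the Gotoh method because the score for placing a residue at a template site depends only on that residue and that site, so it can be tabulated in advance. A minimal sketch of the affine-gap recursion over such a table follows. It is not the sausage code: the score table and gap penalties are placeholders, and it returns only the best global score.

```python
from typing import List

NEG = float("-inf")


def gotoh_score(score: List[List[float]], gap_open: float, gap_extend: float) -> float:
    """Best global alignment score with affine gap penalties.

    score[i][j] is the score of placing sequence residue i at template site j.
    A gap of length k costs gap_open + (k - 1) * gap_extend.
    """
    n = len(score)
    m = len(score[0]) if n else 0
    M = [[NEG] * (m + 1) for _ in range(n + 1)]  # residue i aligned to site j
    X = [[NEG] * (m + 1) for _ in range(n + 1)]  # residue i opposite a gap
    Y = [[NEG] * (m + 1) for _ in range(n + 1)]  # site j opposite a gap
    M[0][0] = 0.0
    for i in range(1, n + 1):
        X[i][0] = -(gap_open + (i - 1) * gap_extend)
    for j in range(1, m + 1):
        Y[0][j] = -(gap_open + (j - 1) * gap_extend)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            M[i][j] = score[i - 1][j - 1] + max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            X[i][j] = max(M[i - 1][j] - gap_open, X[i - 1][j] - gap_extend, Y[i - 1][j] - gap_open)
            Y[i][j] = max(M[i][j - 1] - gap_open, Y[i][j - 1] - gap_extend, X[i][j - 1] - gap_open)
    return max(M[n][m], X[n][m], Y[n][m])
```

Three tables are enough because each cell only needs to know whether the alignment so far ends in a match or in a gap on either side, which keeps the cost proportional to the product of the two lengths.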
The first two force fields may be knowledge-based, but they
were built in complete ignorance of Boltzmann
statistics. Instead, the parameters are optimised so as to
distinguish native coordinates from a mass of misfolded
structures.
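The quantity that such an optimisation targets can be illustrated with a z-score of the native energy against a set of misfolded structures. The sketch below only evaluates the z-score for given energies; the parameter optimisation itself is described in [3] and is not reproduced. The sign convention (lower energy is better, so a more negative z-score means better discrimination) is an assumption.

```python
from statistics import mean, pstdev
from typing import Sequence


def z_score(native_energy: float, decoy_energies: Sequence[float]) -> float:
    """Distance of the native energy from the decoy mean, in decoy standard deviations."""
    return (native_energy - mean(decoy_energies)) / pstdev(decoy_energies)
```

For decoy energies of -2 and 2 (mean 0, standard deviation 2), a native energy of -10 gives a z-score of -5.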
A second series of optimisation calculations allowed us to
find weights for additional terms for secondary structure
predictions [6], sequence similarity and gap penalties.
Finally, the library of templates consisted not of simple
protein coordinates, but rather of precalculated fields due to
averaging over similar structures.
The alignment code and methodology are indisputably fast. They
may occasionally be correct.
For the last few targets, secondary structure predictions were
made using a neural net fed on the sausage alignment
calculations.
--------------------
[1] Huber T, Russell AJ, Ayers D, Torda AE (1999)
Bioinformatics, 15, 1064-1065.
Sausage: protein threading with flexible force fields.
and
http://www.rsc.anu.edu.au/~torda/sausage.html
[2] Huber T, Torda AE, van Gunsteren WF (1996), Biopolymers,
39, 103-114.
Optimization methods for conformational sampling using a
Boltzmann-weighted mean field approach.
[3] Huber, T and Torda, AE (1998) Protein Sci, 7, 142-149.
Protein fold recognition without Boltzmann statistics or
explicit physical basis.
[4] Gotoh, O. (1982) J Mol Biol, 162, 705-708.
An improved algorithm for matching biological sequences.
[5] Huber T, Torda AE (1999) J Comput Chem, 20, 1455-1467.
Protein sequence threading, the alignment problem, and a
two-step strategy.
[6] Rost B and Sander C. (1993) J Mol Biol, 232, 584-599.
Prediction of protein secondary structure at better than 70%
accuracy.
Mushegian , 473
number of submitted models: 4
A. Mushegian
email: mushegian@akkadix.com
Detection of remote homologs of known structure and proper alignment of the
target and the template are the crucial, if not the only, determinants of successful fold
recognition. I asked how far one can get in protein fold recognition by using the
standard publicly available tools of database search and sequence alignment, PSI-BLAST
(Altschul et al., 1997) and MACAW (Schuler et al., 1991), and the on-line modeling service,
SWISS-MOD. I entered the competition on Aug. 15th and ignored the targets that had
expired or were annotated as having homologs with known structure. Additionally,
I disregarded a few short sequences and two targets which were predicted to consist
mostly of the long coiled coils sensu Lupas. This left 16 targets. Seven of those
turned out to have remote homologs with known structure, as judged by
1. using PSI-BLAST with the cutoff 0.05 to convergence, or
2. by collecting the homologs found in (1.) and using them as queries in new rounds of
PSI-BLAST or
3. using the checkpoint profile built in (1.) and searching the PDB-fasta.
Seven models were attempted based on the alignment of targets to these templates,
and the four best ones were submitted.
Ram-Samudrala , 028
number of submitted models: 207
proteins using a statistical scoring function, graph theory, and exhaustive enumeration
techniques
Ram Samudrala and Michael Levitt
email: ram@csb.stanford.edu
The interconnected nature of interactions in protein structures,
thorough sampling of side chain and main chain conformations, and
devising a discriminatory function that can distinguish between
correct and incorrect conformations are the major hurdles preventing
the construction of accurate homology models. We present an algorithm
that uses graph theory to handle the problem of
interconnectedness. Sampling of side chain and main chain
conformations is accomplished by exhaustively enumerating all possible
choices using a discrete state model, including fragments from a
database of protein structures. The optimal combination of these
possibilities is selected using an all-atom scoring function aided by
the graph-theoretic approach.
Following is a brief description of the components and steps of this
method, which can be divided into: discriminatory function,
identification of template and generation of alignment, initial model
building, construction of variable main chain and side chain regions,
and moving models closer to the native conformation.
0. DISCRIMINATORY FUNCTION: the function used throughout generally is
an all-atom distance-dependent conditional probability discriminatory
function based on a statistical analysis of known protein
structure. The negative log of the conditional probability of
observing two atoms interact given a particular distance is used as a
``pseudo-energy'' term. Reference: J Mol Biol 275: 893-914 (1998).
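As a rough sketch of how such a pseudo-energy is evaluated, the function below sums the negative logs of tabulated conditional probabilities over a list of atom-pair contacts. The table layout, the distance binning and all names are hypothetical; how the probabilities are estimated and normalised is given in the cited reference and is not reproduced here.

```python
import math
from typing import Dict, Iterable, Tuple

# Hypothetical key: (atom type a, atom type b, distance bin index).
Contact = Tuple[str, str, int]


def pseudo_energy(contacts: Iterable[Contact],
                  cond_prob: Dict[Contact, float]) -> float:
    """Sum of -ln P over all observed atom-pair contacts; lower is better."""
    return sum(-math.log(cond_prob[c]) for c in contacts)
```

Two contacts with tabulated probabilities 0.5 and 0.25 contribute ln 2 + ln 4 = ln 8 in total.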
1. IDENTIFICATION OF TEMPLATE AND GENERATION OF ALIGNMENT: The CAFASP
meta-server data were used to identify the proteins that a given
target sequence was related to (based on a consensus of all the hits
produced by the different servers). The alignments generated by the
different servers were then used to construct initial models. The
initial models were then ranked by our discriminatory function and the
models that ranked highest were used for further model-building.
2. INITIAL MODEL BUILDING: Following the sequence alignment, for each
parent structure, an initial model was generated by copying atomic
coordinates for the main chain (excluding any insertions) and for the
side chains of residues that are identical in the target and parent
structures. Residues that differ in type were constructed using a
minimum perturbation technique. The MP method changes a given amino
acid to the target amino acid preserving the values of equivalent chi
angles between the two side chains, where available. The other chi
angles are constructed by the MP method using an internally developed
library based on residue type.
3. CONSTRUCTION OF VARIABLE MAIN CHAIN AND SIDE CHAIN REGIONS:
Main chain sampling is performed using an exhaustive enumeration
technique based on discrete states of phi/psi angles. For longer main
chain regions, we use fragments (3-tuples) from a database of protein
structures to generate the discrete phi/psi angles.
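Exhaustive enumeration over discrete phi/psi states can be sketched as follows. The three states listed are placeholders chosen for illustration, since the states used for this step are not specified in the text.

```python
from itertools import product
from typing import Iterator, Sequence, Tuple

Angles = Tuple[float, float]

# Hypothetical (phi, psi) states in degrees; the real state set is not given here.
STATES: Sequence[Angles] = ((-60.0, -45.0), (-120.0, 130.0), (60.0, 45.0))


def enumerate_conformations(n_residues: int,
                            states: Sequence[Angles] = STATES) -> Iterator[Tuple[Angles, ...]]:
    """Yield every assignment of one discrete state to each of n_residues."""
    return product(states, repeat=n_residues)
```

The number of conformations is the number of states raised to the power of the region length (27 for three residues and three states), which is why exhaustive enumeration is practical only for short regions and longer ones draw on database fragments.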
Side chain possibilities are generated by selecting the most probable
side chain rotamers based on the interactions of a given rotamer with
the local main chain (evaluated using the discriminatory function
above). Reference: Samudrala R, Moult J. Prot. Eng. 11: 991-997,
1998.
We then use a graph-theoretic approach to assemble the sampled side
chain and main chain conformations together in a consistent manner.
Each possible conformation of a residue is represented using the
notion of a node in a graph. Each node is given a weight based on the
degree of the interaction between its side chain atoms and the local
main chain atoms. The weight is computed using an all-atom conditional
probability discriminatory function. Edges are then drawn between
pairs of residues/nodes that are consistent with each other (i.e.,
clash-free and satisfying geometrical constraints). The edges are also
weighted according to the probability of the interaction between atoms
in the two residues. Once the entire graph is constructed, all the
maximal sets of completely connected nodes (cliques) are found using a
clique-finding algorithm. The cliques with the best probabilities
represent the optimal combinations of mixing and matching between the
various possibilities, taking the respective environments into
account. Reference: J Mol Biol 279:287-302 (1998). Clique-finding is
accomplished using the Bron and Kerbosch algorithm. Reference:
Communications of the ACM, 16: 575-577 (1973).
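A compact sketch of the basic Bron and Kerbosch recursion (without pivoting) is given below. Node names such as `A1` stand for "conformation 1 of residue A" and are purely illustrative; the node and edge weighting described above, and the choice of the best clique, are not included.

```python
from typing import Dict, Iterator, Set


def bron_kerbosch(graph: Dict[str, Set[str]]) -> Iterator[Set[str]]:
    """Yield every maximal clique of an undirected graph.

    graph[v] is the set of neighbours of v; it must be symmetric and
    contain no self-loops.
    """
    def expand(r: Set[str], p: Set[str], x: Set[str]) -> Iterator[Set[str]]:
        if not p and not x:
            yield set(r)  # r cannot be extended, so it is maximal
            return
        for v in list(p):
            yield from expand(r | {v}, p & graph[v], x & graph[v])
            p = p - {v}   # v has been fully explored
            x = x | {v}   # exclude v from later cliques at this level

    yield from expand(set(), set(graph), set())
```

In the modelling context each maximal clique is a set of mutually compatible residue conformations, which can then be ranked by the scoring function.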
All models used were refined using ENCAD.
5. MOVING MODELS CLOSER TO THE NATIVE CONFORMATION:
Once we had generated a final model for each parent, we used
an off-lattice fourteen-state phi/psi model and a sequential
build-up algorithm to generate structures around the conformational
space of the final model. We then used our scoring function to select
the best ranking ones. The hope here is that some of the conformations
sampled will actually be closer to the native conformation and that
our scoring function will be able to select them.
We test how the above approach works in a comparative-modelling
scenario and assess the predictive power of this method by applying it
to properly controlled blind tests as part of the fourth meeting on
the Critical Assessment of protein Structure Prediction methods
(CASP4). Compared to CASP2 and CASP3, where a similar approach was
used, we have improved the method used to sample main chains and have
made minor enhancements to the other components of this approach
including the scoring function. The biggest change is in our attempt
to move models closer to the final answer. It remains to be seen how
the improvements in methodology correlate with model accuracy.
Skolnick-Kolinski-THD , 393
number of submitted models: 74
-refinement approach to structure prediction
A. Kolinski, D. Kihara, M. Bettancourt, Piotr Rotkiewicz,
M. Boniecki, and Jeffrey Skolnick
email: skolnick@danforthcenter.org
A hierarchical, generalized comparative
modeling method has been applied to predict the
tertiary structure. First, PROSPECTOR (1), a new
threading algorithm, is used to select the template
structure. Threading also provides predicted
secondary structure and tertiary contacts that are not
restricted to the template structure but can be
extracted from other structures. This allows the
possibility of fold prediction in those regions absent
in the alignment of the probe sequence to the
template structure. Next, the aligned parts of the
probe sequence were fitted to the template, and
pieces of the lattice chain were built by taking into
consideration the excluded volume of the model
chain and the necessity of "stretching" the chain
between the gaps in the template. Then, starting
from the shortest loop, the loops and nonaligned
chain ends were randomly inserted, again taking
into account the excluded volume. The proper
geometry of the model chain (avoiding non-physical
distances between side-groups close along the
chain) was preserved during the chain-building
procedure. Then, using the side chain, center of
mass based lattice model (SICHO) of Kolinski and
Skolnick (2), the structure is refined in the
neighborhood of the template fold; an early variant
of this method has been described previously (3),
but now the template is treated in a more permissive
manner. From a series of folding/structure
refinement simulations that employs parallel
tempering to explore conformational space, the
lowest energy structures are extracted and subjected
to a two-part structure selection protocol. First, the
structures are clustered, and the resulting clustered
folds are selected to provide a set of predicted
structures. In parallel, distance geometry is used
to generate alternative representative structures. All
folds are locally relaxed using a more detailed off-
lattice model comprised of the alpha carbons and a
one or two center description of the side chains that
depends on the side chain size. Atomic detail is then
added and the resulting structures are reported.
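The parallel tempering step mentioned above relies on exchanging conformations between simulations run at different temperatures. A minimal sketch of the standard exchange criterion follows. It is not the SICHO implementation: energies and temperatures are in reduced units with the Boltzmann constant set to 1, and the function names are hypothetical.

```python
import math
import random
from typing import List, Sequence


def swap_probability(e_i: float, t_i: float, e_j: float, t_j: float) -> float:
    """Metropolis acceptance probability for exchanging two replicas."""
    delta = (1.0 / t_i - 1.0 / t_j) * (e_i - e_j)
    return 1.0 if delta >= 0.0 else math.exp(delta)


def sweep(energies: Sequence[float], temperatures: Sequence[float],
          rng: random.Random) -> List[float]:
    """Attempt one exchange between each pair of neighbouring temperatures.

    Returns the energies after exchange; entry k is the energy of the
    conformation now held at temperatures[k].
    """
    result = list(energies)
    for k in range(len(result) - 1):
        p = swap_probability(result[k], temperatures[k], result[k + 1], temperatures[k + 1])
        if rng.random() < p:
            result[k], result[k + 1] = result[k + 1], result[k]
    return result
```

An exchange that moves the lower-energy conformation to the colder replica is always accepted, while the reverse move is accepted with a Boltzmann-like probability, so low-energy structures collect at low temperature without trapping the hot replicas.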
1. J. Skolnick and D. Kihara, Defrosting the frozen approximation: PROSPECTOR: A new
approach to threading, Proteins in press (2000).
2. A. Kolinski and J. Skolnick, Assembly of protein structure from sparse experimental
data: An efficient Monte Carlo Model, Proteins 32 475-494 (1998).
3. A. Kolinski, P. Rotkiewicz, B. Ilkowski and J. Skolnick, A method for the
Improvement of threading-based protein models, Proteins 37 592-610 (1999).
BMERC , 248
number of submitted models: 42
Jadwiga R. Bienkowska , Honxian He and Temple F. Smith
email: jadwiga@darwin.bu.edu
The methods used in the CASP4 prediction contest combined
two algorithms [Bienkowska et al. 2000] and [Das and
Smith. 2000]. The first algorithm takes into account the
high probability of sub-optimal sequence-to-structure
alignments in the fold recognition method. The second
algorithm is a profile based multiple sequence alignment
method. It was applied in the search for the best
sequence-to-structure alignment.
I. Fold recognition approach
Our approach is based on the DSM representation of protein
structures (Stultz et al. 1993). Mathematically a DSM is
represented as an HMM, with a distinction that DSMs are
designed rather than trained HMMs. The design of a
structural DSM relies on a prior knowledge about protein
structure and attempts to introduce a minimal bias among
alternative realizations of the same structural fold.
Each DSM for a fold represented by a PDB structure is
constructed hierarchically out of a set of substructure
models. These sub-models represent standard secondary
structures and include, via residue position and solvent
exposure parameterization, implied three dimensional
packing information. The secondary structure sub-models
are joined by loop/turn sub-models following the observed
arrangement of the secondary-structure elements along the
protein sequence. Each sub-model, or a DSM plex, is
constructed on the basis of a secondary-structure
assignment made for the parent structure by DSSP. The
secondary-structure plex is represented by a hidden Markov
chain that allows for the anticipated variation and DSSP
assignment uncertainty in length of plus or minus one
residue at both ends. Assignment of amino acid
probabilities to each hidden state in a helix or strand is
based on position, secondary structure and solvent
exposure. The states that correspond to loops are assigned
independently of the solvent exposure of the residues
observed in loops. Thus, only structural information
about the model protein is used in this approach. The
anticipated variation in the protein structure is
represented in the variation encoded in each secondary
structure plex and also in the C-terminal and N-terminal
loop plexes that allow an additional amphipathic helix to
be added at both ends of a structural domain.
The HMM representation of the protein 3D structures allows
the use of the forward-backward (Rabiner 1989) or filtering
(White 1988) algorithm for calculation of the total
probability (Bienkowska et al. 2000) that any given model
could have generated any given sequence. The filtering
algorithm is more sensitive than dynamic programming (or
Viterbi) algorithms, commonly used in sequence-sequence and
sequence-structure comparison methods, that calculate only
the most probable path through the model (optimal
sequence-model alignment). The filtering algorithm
calculates P(Seq|Model), from which one can then calculate
P(Model_i|Seq) using Bayes' rule. The posterior
probability is then relative to the entire set of competing
models. The DSM library contained 539 competing models that
represent 305 SCOP superfamilies. We employed a binary
decision approach and called a prediction when the
posterior probability of a model was greater than 0.5.
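The total-probability calculation and the 0.5 decision rule described above can be illustrated with a minimal sketch. It is not the DSM software itself: the function names, the dictionary-based HMM layout, and the uniform prior over models are assumptions made for illustration only.

```python
import math


def forward_log_likelihood(obs, start, trans, emit):
    """Return log P(obs | model), summed over all hidden-state paths.

    start[s], trans[s][t] and emit[s][symbol] are ordinary probabilities.
    obs must contain at least one symbol.
    """
    states = list(start)
    # Initialise with the first observed symbol.
    alpha = {s: start[s] * emit[s].get(obs[0], 0.0) for s in states}
    log_scale = 0.0
    for symbol in obs[1:]:
        total = sum(alpha.values())
        if total == 0.0:
            return float("-inf")
        # Rescale so that long sequences do not underflow.
        log_scale += math.log(total)
        prev = {s: a / total for s, a in alpha.items()}
        alpha = {
            t: emit[t].get(symbol, 0.0)
            * sum(prev[s] * trans[s].get(t, 0.0) for s in states)
            for t in states
        }
    total = sum(alpha.values())
    return float("-inf") if total == 0.0 else log_scale + math.log(total)


def model_posteriors(log_likelihoods, priors=None):
    """Bayes' rule over a closed set of competing models."""
    names = list(log_likelihoods)
    if priors is None:
        # Assumption: every model is equally likely a priori.
        priors = {n: 1.0 / len(names) for n in names}
    log_joint = {n: log_likelihoods[n] + math.log(priors[n]) for n in names}
    peak = max(log_joint.values())
    if peak == float("-inf"):
        raise ValueError("no model can generate the sequence")
    norm = sum(math.exp(v - peak) for v in log_joint.values())
    return {n: math.exp(log_joint[n] - peak) / norm for n in names}


def call_prediction(posteriors, threshold=0.5):
    """Return the best model only if its posterior exceeds the threshold."""
    best = max(posteriors, key=posteriors.get)
    return best if posteriors[best] > threshold else None
```

Because the posterior is normalized over the whole library, at most one model can exceed 0.5, which is what makes the binary decision unambiguous.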
II. Alignment generation
The query sequences and the sequences representing the
functional family of the selected model structure were
subsequently submitted to the multiple alignment software
PIMA. The PIMA algorithm uses the combined probability of the
amino acids aligned among the homologous sequences and the
prior probabilities of observing each pair of amino acids in
an alignment.
Bienkowska J.R., Yu L., Zarakhovich S., Rogers Jr R. and
Smith T.F. Protein Fold Recognition by Total Alignment
Probability. Proteins: Structure, Function and Genetics,
40(3): 451-464, 2000.
Das S. and Smith T. F. Identifying nature's protein lego
set. Advances in Protein Chemistry ed. Peer Bork vol 54:
159-183 Academic Press 2000.
Rabiner L. R. A tutorial on hidden Markov models and
selected applications in speech recognition. Proceedings
IEEE, 77:257-286, 1989.
Stultz C.M., White J.V., and Smith T.F. Structural analysis
based on state-space modeling. Protein Science, 2:305-314,
1993.
White J. V. Bayesian analysis on time series and dynamic
models, pages 255-283. Marcel Dekker, New York, NY USA,
1988.
Sternberg , 126
number of submitted models: 45
Kelley LA, MacCallum RM & Sternberg MJE
email: kelley@icrf.icnet.uk
The primary method used was that described in (Kelley et al., 2000),
using the program 3D-PSSM. One of the key features of this technique is the use
of multiple structural alignments of remote homologues to create extended sequence
profiles (3D-PSSMs). These profiles can capture the sequence characteristics of an entire
structural superfamily, and extend the range of profiles generated from sequence
similarity alone (e.g. PSI-Blast).
The method involves a three-pass dynamic programming algorithm against a library
of known folds taken from SCOP and the PDB. Each of the three passes of dynamic
programming uses sequence, secondary structure and solvation terms. Secondary
structure is matched between a known library structure and the predicted
secondary structure for the query. Secondary structure prediction was done
using PSI-Pred (Jones, 1999). Our solvation model is knowledge-based and similar
to that of (Jones et al., 1992).
Each of the three passes differs in the sequence profile used. Two
of the sequence profiles (or PSSMs) are taken from PSI-Blast: i) the PSSM
for the query sequence and ii) the PSSM for the library structure. The third
sequence profile is generated from multiple structural alignments and so we
call these 3D-PSSMs. We use structural alignments of homologous proteins of
similar three-dimensional structure in the SCOP database to obtain a structural
equivalence of residues. These equivalences are used to extend multiply aligned
sequences obtained by PSI-Blast. The resulting large superfamily-based
multiple alignment is converted into a (3D)PSSM.
The final alignment produced by the algorithm is the highest scoring of these
three passes. Our web server (http://www.bmm.icnet.uk/~3dpssm) reports the top
20 highest scoring structures in our library, as calculated by 3D-PSSM. In addition
to the alignment score, we have incorporated a textual component to aid in functional
assignment. The program used is SAWTED for Structure Assignment With Text Description
(MacCallum et al., 2000; http://www.bmm.icnet.uk/~sawted). This method compares the
comments and keywords between SWISS-PROT homologues of the query sequence and the
library sequence. Confident SAWTED scores are combined with the 3D-PSSM alignment
scores to reflect potential functional similarity between query and template. This
is intended to mimic, to a small degree, the human assessor's ability to gauge the
likelihood of a correct fold or superfamily assignment based on his or her knowledge
of the function of the query and template.
The above techniques were applied automatically in both the CASP4 and CAFASP-2
evaluations. However, for the CASP4 evaluation, we have additionally used manual
intervention in many cases.
For orphan targets, or targets with few varied homologous sequences in the sequence
database, we would run tblastn at the NCBI against unfinished microbial genomes
(http://www.ncbi.nlm.nih.gov/Microb_blast/unfinishedgenome.html). This would sometimes
supply us with a sufficiently large and diverse set of sequences to improve secondary
structure prediction accuracy, and a more powerful sequence profile.
When we suspected a secondary structure prediction may be erroneous, we would
send the target sequence to the Jpred server (http://jura.ebi.ac.uk:8888/submit.html)
and look for consensus and, if necessary, run the 3D-PSSM server on a separately compiled
secondary structure prediction.
In cases where no confident 3D-PSSM hits could be found for a given target, we would
often use PFAM (http://pfam.wustl.edu/) alignments to either generate an alternative
secondary structure prediction, or analyse alternative sequences from the same PFAM family
as the target, using the protocol above.
In many cases, the automatic alignments produced by the 3D-PSSM server were manually
adjusted to meet a variety of criteria:
a) Maintenance of a hydrophobic core based on three-dimensional models generated
from the alignments.
b) Equivalencing of known core residues (as pre-calculated using a mutual
contact algorithm) with hydrophobic residues in the target.
c) Preservation of the continuity of secondary structure elements.
d) Maintenance of the spatial arrangements of residues suspected to form the active site.
e) Alignment of known motifs (such as the Walker A and B motifs in P-loops, or known
conserved residue types in OB-folds, e.g. (Bycroft et al., 1997)).
f) Maintenance of the spatial distances between cysteine residues believed
to form disulphide bridges.
Bycroft M., Hubbard T. J. P., Proctor M., Freund S. M. V., Murzin A. G. (1997).
The solution structure of the S1 RNA binding domain: a member of an ancient nucleic
acid-binding fold. Cell, 88, 235-242.
Jones D. T., Taylor W. R., Thornton J. M. (1992). A new approach to fold recognition.
Nature, 358, 86-89.
Jones D. T. (1999). Protein secondary structure prediction based on
position-specific scoring matrices. J. Mol. Biol. 292, 195-202.
Kelley LA, MacCallum RM & Sternberg MJE (2000). Enhanced Genome Annotation
using Structural Profiles in the Program 3D-PSSM. J. Mol. Biol. 299(2), 501-522.
MacCallum, R.M., Kelley, L.A. and Sternberg, M.J.E. (2000) Bioinformatics 16(2),
125-129.
123D+ , 389
number of submitted models: 214
Nickolai N. Alexandrov
email: nicka@ncifcrf.gov
The 123D+ server compares a target sequence with a set of protein domains from the ASTRAL
non-redundant set (version 1.50, 50% identity list). For every residue in the
domain, the following information is derived from the PDB files: (i) residue
type (amino acid in SEQRES field), (ii) secondary structure, assigned by
Stride, and (iii) the number of contacts with other residues. Domain profiles
are created by psi-blast runs against the NR database. Similarly, a psi-blast profile
is also created for the target sequence. Secondary structure of the target is
predicted by a probabilistic approach from statistics of amino acid pairs in a
sliding window of 17 residues. The similarity score between position i in the target
and position j in the domain is computed as: log((Paa*Pss*Pcc)/(P'aa*P'ss*P'cc)),
where Paa is the probability of having the same amino acid in i and j, computed
from the psi-blast profiles; Pss is the probability of having the same secondary
structure; and Pcc is the probability of having the same number of contacts,
computed from the contact capacity potentials for every residue type. P'aa,
P'ss, and P'cc are the corresponding expected probabilities. 123D+ uses dynamic
programming to find an optimal sequence-structure alignment. In addition to
the standard events of match, deletion, and insertion, the algorithm features a
choice of residues not to be aligned, which helps to deal with different loop
conformations. The default alignment mode, "fit", was used, in which the whole domain
is required to be aligned with a part of the target sequence. 123D+ was
benchmarked with the ASTRAL set of domains and outperformed psi-blast in fold
recognition. 123D+ is available at
http://www-lmmb.ncifcrf.gov/~nicka/run123D+.html.
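The per-position similarity score quoted above is a log-odds ratio of three observed probabilities against their expected counterparts. The sketch below only evaluates that formula; the function name and argument names are hypothetical, and how 123D+ actually derives each probability is not shown here.

```python
import math


def position_score(p_aa, p_ss, p_cc, exp_aa, exp_ss, exp_cc):
    """Log-odds score log((Paa*Pss*Pcc) / (P'aa*P'ss*P'cc)) for one (i, j) pair.

    p_*   : observed probabilities (amino acid, secondary structure, contacts)
    exp_* : the corresponding expected (background) probabilities
    """
    observed = p_aa * p_ss * p_cc
    expected = exp_aa * exp_ss * exp_cc
    if observed <= 0.0 or expected <= 0.0:
        raise ValueError("probabilities must be strictly positive")
    return math.log(observed / expected)
```

A positive score means the three features agree more often than expected by chance, so a dynamic programming alignment that maximizes the summed score favours such positions.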
Lomize-Andrei , 002
number of submitted models: 28
ab initio and fold recognition techniques.
Andrei L. Lomize, Irina D. Pogozheva, and Henry I. Mosberg
email: almz@umich.edu
3D models of protein cores (complexes of several interacting
alpha-helices and beta-sheets, excluding nonregular loops) have
been generated for 19 CASP4 targets with no detectable sequence
homology to proteins of known structure. The partially automatic
procedure described below reproduces the main blocks of a large software
package that is under development in our group to test the validity
of the entire approach and its specific parts. The procedure
includes the following three steps.
STEP 1. Ab initio prediction of secondary and supersecondary
structure using two different methods:
(a) calculation of alpha-helices, alpha-hairpins, and beta-hairpins
in hydrophobically collapsed protein using the program Framework [1];
(b) identification of alpha-helices and beta-strands based on
hydrophobicity patterns in multiple sequence alignments [2];
Possible beta-sheet topologies and the structural class of the
target (beta-sandwich, beta-barrel, beta-helix, beta-prism, different
alpha+beta and alpha/beta structures, alpha-superhelix, or alpha-bundle)
were suggested based on a qualitative analysis of results produced by
both methods.
STEP 2. Fold recognition. The procedure included the following
three parts.
(1) Identification of related PDB structures using a library of
"supersecondary nuclei" in proteins [3], and the following criteria:
(a) similar secondary structures of the target and template,
including number, order, and lengths of alpha-helices and beta-
strands, and identical beta-sheet topologies,
(b) similar biological functions;
Twelve of the nineteen targets considered (T0088, T0094, T0098, T0100,
T0101, T0102, T0104, T0107, T0108, T0109, T0118, and T0126) satisfied
these criteria, and therefore were designated for fold recognition.
(2) Finding the optimal alignment of secondary structures in the
target and template that maximizes formation of aliphatic, aromatic,
and polar clusters and burial of nonpolar side-chains.
(3) Adjustment of side-chain conformers and the spatial positions
of entire alpha-helices to improve close packing, burial of nonpolar
groups, and hydrogen bonding.
STEP 3. Ab initio assembly of 3D cores from alpha-helices and
beta-sheets - for targets that could not be assigned to any known
protein fold in STEP 2 (T0091, T0095, T0097, T0105, T0106, T0110,
and T0114). The docking of regular secondary structures (using
QUANTA and our unpublished software) sought to optimize burial
of nonpolar side-chains, segregation of aliphatic, aromatic, and polar
groups into separate clusters, close packing, and hydrogen bonding
in simultaneously constructed models of several homologous proteins
from the target family. Two different assembly strategies were tested
for all-alpha-helical domains: stepwise building of the core from
gradually growing structures (T0106 and models 2 of T0095 and T0097),
and formation of a nearly complete core (models 1 of T0095 and T0097).
[1] A.L.Lomize and H.I. Mosberg (1997) Thermodynamic model of
secondary structure for alpha-helical peptides and proteins.
Biopolymers, v.42, pp. 239-269
[2] A.L.Lomize, I.D. Pogozheva, and H.I. Mosberg (1999) Prediction
of protein structure: the problem of fold multiplicity. Proteins,
Suppl.3, pp.199-203
[3] A.L.Lomize, I.D. Pogozheva, and H.I. Mosberg (1999) Protein
structure assembly pathways. Protein Sci., v. 8, Suppl.1, p.86
Reva-Boris , 401
number of submitted models: 192
with averaging energies over homologs.
B.A.Reva, A.V.Finkelstein, D.S.Rykunov, M.Yu.Lobanov.
email: boris.reva@pharma.novatis.com
To compute the energy of a protein chain with loops in an external
field, we developed a model in which
(i) the 3D position of any amino acid residue is given by the
position of its Ca atom; each of the amino acids of a target
sequence occupies a position either on a template or in a "loop",
i.e. a non-aligned region of the sequence (loop structures are not
defined; the energy of a loop depends on the template positions of
its ends, on the types of residues in the loop, and on the
number of residues in the loop);
(ii) a template is a limited set of positions for Ca-atoms
in 3D space; only backbones of proteins from the PDB are used
as structural templates;
(iii) an energy of interaction between a residue and a template
is given by the potential of an "external field" that depends
on the type of residue and on its position on the template;
(iv) residues of a target sequence can occupy any position
on a template.
For each target sequence all available PDB structures were used
as templates.
An external field acting on a residue is produced by summing
up all interactions of a given residue with residues of a template
structure. Local interactions between neighbor residues are
calculated explicitly. (Local interactions include interactions
between neighbors separated by 1, 2, 3 residues along a chain,
bending energy that depends on a type of a residue between
two interacting residues, and also chiral energy of a backbone
[1].)
A sequence-to-structure alignment is computed by dynamic
programming. The energy of the obtained structure is computed
and used to rank templates.
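The alignment and ranking step can be sketched as a small dynamic program over an external-field energy table. This is an illustrative simplification, not the authors' program: the constant loop and skip penalties stand in for the loop energy described above, local interactions are omitted, and all names are hypothetical.

```python
def thread_by_dp(field, loop_penalty=1.0, skip_penalty=1.0):
    """Minimum-energy alignment of a target sequence onto a template.

    field[i][j] is the external-field energy of target residue i placed at
    template position j. A residue left in a loop costs loop_penalty and an
    unused template position costs skip_penalty (both are assumptions).
    Returns (energy, [(residue_index, template_position), ...]).
    """
    n = len(field)
    m = len(field[0]) if n else 0
    inf = float("inf")
    best = [[inf] * (m + 1) for _ in range(n + 1)]
    move = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i and j:
                cand = best[i - 1][j - 1] + field[i - 1][j - 1]
                if cand < best[i][j]:
                    best[i][j], move[i][j] = cand, "place"
            if i:
                cand = best[i - 1][j] + loop_penalty
                if cand < best[i][j]:
                    best[i][j], move[i][j] = cand, "loop"
            if j:
                cand = best[i][j - 1] + skip_penalty
                if cand < best[i][j]:
                    best[i][j], move[i][j] = cand, "skip"
    # Trace back from the corner to recover the aligned pairs.
    pairs, i, j = [], n, m
    while i or j:
        step = move[i][j]
        if step == "place":
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif step == "loop":
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return best[n][m], pairs


def rank_templates(fields, **penalties):
    """Order template names from lowest to highest threading energy."""
    scored = [(thread_by_dp(f, **penalties)[0], name) for name, f in fields.items()]
    return [name for _, name in sorted(scored)]
```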
Averaging energy over homologs [2] was applied in energy
calculations described above. To this end, we computed sequence
alignment [3] of a target sequence and the corresponding
homologs extracted from the non-redundant sequence database.
Only homologs with low sequence similarity were used in
energy averaging.
To check the quality of obtained sequence-to-structure alignments
we computed Z-scores using fragmentarian gapless threading [4].
Typically the lowest-energy structures give low Z-scores;
however, when the energy-based selection was ambiguous we used
the Z-score-based selection for the final submissions.
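As a rough illustration of the Z-score check, the sketch below standardizes one alignment energy against a set of reference energies. The reference set is assumed to come from gapless threading of fragments, which is not implemented here; the function name and the use of the population standard deviation are assumptions.

```python
from statistics import mean, pstdev


def energy_z_score(energy, reference_energies):
    """Z-score of an alignment energy relative to reference (decoy) energies.

    More negative values mean the alignment is further below the decoy mean.
    """
    spread = pstdev(reference_energies)
    if spread == 0.0:
        raise ValueError("reference energies have zero spread")
    return (energy - mean(reference_energies)) / spread
```

Under this convention a well-threaded template has both a low energy and a strongly negative Z-score, so the two criteria usually agree.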
References
1.Reva,B., Finkelstein,A., Skolnick,J. Derivation and testing
residue-residue mean-force potentials for protein structure
recognition. In: Methods in Molecular Biology, vol.143;
Protein Structure prediction: Methods and Protocols. 2000, pp. 155-174.
2.Reva B.A., Skolnick J., Finkelstein A.V. -
Averaging of interaction energies over homologs improves
protein fold recognition in gapless threading. -
Proteins, 35: 353-359, 1999.
3.Altschul S.F., Madden T.L., Schäffer A.A., Zhang J.,
Zhang Z., Miller W., Lipman D.J. - Gapped BLAST and
PSI-BLAST: a new generation of protein
database search programs -
Nucleic Acids Res. 25: 3389-3402, 1997.
4.Reva, B.A., Topiol, S.
Recognition of protein structure: determining the relative
energetic contribution of beta-strands, alpha-helices and loops.
Biocomputing. Proceedings of the Pacific Symposium 2000;
World Scientific Publishing Co. Pte. Ltd. pp.168-178.
Tatsuya , 329
number of submitted models: 115
Tatsuya Akutsu, Morihiro Hayashida, Yuichiro Horai, Kenta Nakai
email: takutsu@ims.u-tokyo.ac.jp
We used protein threading in which structure alignment results were used
as profiles. The prediction method was not automatic. The outline of the
prediction method is as follows:
(1) Candidate structures are obtained using several tools:
SSEARCH33 (Smith-Waterman algorithm), PSIBLAST, PHD, a PSIBLAST-based
search tool using intermediate sequences (developed by us), and CAFASP
results.
(2) Structures similar to each candidate are retrieved from the PDB, using
STRALIGN (a pairwise structure alignment program developed by us) and the
SCOP/ASTRAL database.
(3) For each candidate, a multiple structure alignment is computed
from pairwise structure alignments for similar structures
by using a method similar to the center star method.
(4) A protein threading (i.e., an alignment between a sequence and a
structure) is computed by using CLUSTALW and PSIBLAST. Then, candidate
structures are ranked based on human knowledge.
In order to compute pairwise structure alignment, we used STRALIGN [1].
STRALIGN computes a structure alignment between two C-alpha chains
by using dynamic programming, least-squares fitting (RMS fitting)
and iterative improvement.
In order to compute a multiple structure alignment from pairwise structure
alignments, we used the center star method, which is well known for
constructing a multiple sequence alignment from pairwise sequence alignments.
To apply the center star method, the center structure must be
determined. Since a candidate structure was given in our case, we used
the candidate structure as the center structure. Unlike the standard
center star method, we did not allow insertions in the center structure.
Details of the computation of the multiple structure alignment are described in [2].
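The center star variant described above, with no insertions allowed in the center, can be sketched on gapped strings. This is only an illustration: it assumes each pairwise structure alignment has already been reduced to two equal-length rows of residue letters, and the function names are hypothetical.

```python
def project_onto_center(center_row, other_row, gap="-"):
    """Keep only the columns in which the center structure has a residue."""
    if len(center_row) != len(other_row):
        raise ValueError("aligned rows must have equal length")
    return "".join(o for c, o in zip(center_row, other_row) if c != gap)


def center_star_no_insertions(center, pairwise, gap="-"):
    """Merge pairwise alignments that all share the same center sequence.

    pairwise is a list of (center_row, other_row) gapped strings. Columns that
    would insert a gap into the center are dropped, so every output row has
    exactly len(center) columns.
    """
    rows = [center]
    for center_row, other_row in pairwise:
        if center_row.replace(gap, "") != center:
            raise ValueError("pairwise alignment does not contain the center")
        rows.append(project_onto_center(center_row, other_row, gap))
    return rows
```

Dropping the insertion columns loses residues of the neighbouring structures that have no counterpart in the candidate, which is acceptable when the result is used as a profile indexed by candidate positions.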
Threadings were computed based on sequence vs. profile alignment (i.e.,
alignment between the target sequence and the multiple structure alignment).
CLUSTALW and PSIBLAST (using -B option) were used for computing threadings.
[1] Tatsuya Akutsu, Protein structure alignment using dynamic programming
and iterative improvement, IEICE Trans. on Information and Systems,
E79-D:1629--1636, 1996.
[2] Tatsuya Akutsu and Kim Lan Sim, Protein threading based on multiple
protein structure alignment, in: Genome Informatics 1999 (Universal
Academy Press, Tokyo), 23--29, 1999.
Finkelstein , 204
number of submitted models: 31
with double averaging energies over target sequence homologs and structural neighbors.
D.S.Rykunov, B.A.Reva, M.Yu.Lobanov, A.V.Finkelstein.
email: rykunov@alpha.protres.ru
Our group (Finkelstein) and Dr. Reva's group (Reva-Boris) use the same core
threading program, but different template libraries and different post-threading
processing of predictions. Here we describe the core approach, our (Finkelstein)
template definition and final decision step.
To compute the energy of a protein chain threaded onto a template fold, we developed a model in which
(i) Each residue of the target sequence either occupies some position on the
template (then its 3D position is given by the template's Ca atom), or it is
in the "loop", i.e. in the non-aligned region of the sequence. The "loop"
structure is not defined; its energy depends on the template positions of its
ends, and on the number and the types of the loop residues.
(ii) The energy of interaction between each residue of the target and the
template is determined by the potential of an "external field", acting in the
given template position at the residue of a given type. Summing up all
interactions of the target residue with surrounding residues of the template
produces the external field potential. The local interactions between close
target residues are calculated explicitly from their types and coordinates at
the template (they include contact terms, the bend terms and the chiral terms) [1].
The optimal sequence-to-structure alignment is computed by dynamic
programming. The energy of the obtained structure is used to rank templates
for the target sequence. Energy averaging over homologs [2] is applied to
the energy calculations described above. To this end, we use BLAST [3] to
obtain the target sequence homologs and to build the multiple sequence
alignment. Only homologs with low or medium sequence similarity are
used for the energy averaging. The representative set of the SCOP (rel. 1.50)
[4] domains is used as the templates for each target sequence. The domains
have been clustered using the package STAMP [5], and the energies obtained for
the target sequence threaded onto each cluster member are averaged as
described in our recent paper [6]. Human expert evaluation of about 20 lowest-energy
structures is performed as the final step of the prediction process. It is
based on visual evaluation of the compactness of the predicted structure, the
distribution of the hydrophobic/polar residues on the core/surface of the
template, the comparison of the secondary structure predicted by the obtained
sequence-to-structure alignment with the predictions obtained with the
independent secondary structure prediction tools (PHD [7] and ALB [8]) and on
any extra literature data on the target function/active site position, if
available.
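The double averaging over sequence homologs and structural cluster members can be illustrated as follows. The nested-dictionary layout, the names, and the equal weighting of all homologs and members are assumptions; the weighting actually used is described in [6].

```python
from statistics import mean


def double_averaged_energy(energies):
    """Average threading energy over homologs and over cluster members.

    energies[homolog][member] is the energy of one homologous sequence
    threaded onto one member of a structural cluster.
    """
    return mean(mean(per_member.values()) for per_member in energies.values())


def rank_clusters(cluster_energies):
    """Order cluster names from lowest to highest double-averaged energy."""
    return sorted(cluster_energies, key=lambda c: double_averaged_energy(cluster_energies[c]))
```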
Acknowledgements. This work was supported by the Russian Foundation for Basic Research
grant 98-04-49303, by the INTAS grant 99-01476 and by an International Research Scholar's
Award to A.V.F. from the Howard Hughes Medical Institute.
REFERENCES
1.Reva,B., Finkelstein,A., Skolnick,J. Derivation and testing residue-residue mean-force
potentials for protein structure recognition. In: Methods in Molecular Biology, vol.143;
Protein Structure prediction: Methods and Protocols. 2000, pp. 155-174.
2.Reva B.A., Skolnick J., Finkelstein A.V. - Averaging of interaction energies over homologs
improves protein fold recognition in gapless threading. - Proteins, 35: 353-359, 1999.
3.Altschul S.F., Madden T.L., Schäffer A.A.,Zhang J., Zhang Z., Miller W., Lipman D.J. -
Gapped BLAST and PSI-BLAST: a new generation of protein database search programs - Nucleic
Acids Res. 25: 3389-3402, 1997.
4.Murzin A. G., Brenner S. E., Hubbard T., Chothia C. - SCOP: a structural classification
of proteins database for the investigation of sequences and structures. - J. Mol. Biol. :
247, 536-540, 1995.
5.Russell, R.B., Barton, G.J. STAMP: Structural Alignment of Multiple Proteins.
Proteins 14: 309-323, 1991.
6.Rykunov D.S.,Finkelstein A.V., Lobanov V.Yu. - Search for the most stable folds of
protein chains. III. Improvement in fold recognition by averaging over homologous
sequences and 3D structures. - Proteins, 40: 494-501, 2000.
7.Rost B., Sander C. - Combining evolutionary information and neural networks to predict
protein secondary structure. - Proteins, 19: 55-72, 1994.
8.Ptitsyn O.B., Finkelstein A.V. - Theory of protein secondary structure and algorithm
of its prediction. - Biopolymers, 22: 15-25, 1983.
Walts-Wondrous-Wizards , 044
number of submitted models: 172
N. Alexandrov, V. Brover, M. Troukhan, W. Volkmuth
email: nicka@ceres-inc.com
Our prediction process consists of two steps: selecting a template structure
and making an alignment.
1. Template selection.
All target sequences were compared with a set of structural domains using the
123D+ program, which combines sequence similarity, secondary structure
prediction and contact capacity potentials to compute a similarity score. If
there was a hit with Z-score > 6, we made the selection based on the strongest
hit. When the hit covered only a part of the target sequence, we cut out the
remaining part and repeated the run. If 123D+ did not detect an obvious hit,
we predicted the fold anyway, because sampling of a random set of recently
predicted structures indicates that approximately 90% of them are structurally
similar to already known folds, even if there is no strong sequence
similarity. Without a strong 123D+ hit, we used other available associative
information in an attempt to link the target with a protein with known
structure. We used literature search, known metabolic pathways, gene
expression data, position on the chromosome, operons, distribution of folds in
the organism, secondary structure prediction, predictions of transmembrane
helices and coiled coils. We demonstrated that there is a correlation between
protein folds and gene expression and between protein folds and location in
the chromosome. All this additional information gave us quite weak signals.
However, when consistent, these signals resulted in rather confident
predictions. This part of the prediction is analogous to playing charades,
where one discovers an unknown word using many indirect, independent hints.
Interestingly, we can compare the effectiveness of such an approach versus a
purely automated method, as the 123D+ server also participated in the CAFASP section
of CASP4.
2. Alignment
Alignments were computed with 123D+ program and were in some cases manually
corrected. Manual intervention was limited to (i) placing deletions within the
target sequence so that their edges are close in space in 3D structure and
(ii) moving insertions in the target sequence to the surface of protein
structure.
Gibrat-Marin , 328
number of submitted models: 48
Antoine Marin, Joël Pothier, Karel Zimmermann, Jean-François Gibrat
email: gibrat@versailles.inra.fr
The success of BLAST is due to a large extent to the associated statistical
processing of the raw results that provides an objective way of judging the
significance of a match. It seems to us that current threading methods do not
pay sufficient attention to this problem of significance. We have developed
a threading method in which we have tried to address specifically the problem of
the significance of a match.
Method:
Like most threading techniques, our method consists of 4 elements: a library of folds,
a score function, an algorithm to obtain the best sequence/structure alignment and
a measure of the significance of the best sequence/structure alignment score.
Database of folds:
We consider all complete proteins of the PDB having less than 35% sequence identity.
We do not divide proteins into structural domains. The core of the 3D structures consists
of conserved secondary structure elements. We require that residues of the query sequence
be aligned with residues in the core.
Score function:
The algorithm is a two-stage procedure. The first stage (1D stage) uses a score
function that involves only 1 site of the template fold. Each site
(corresponding to the Ca position of a residue in the 3D structure) is
characterized by the residue that occupies this position in the template 3D
structure and by its structural state. A structural state is defined as the
combination of a secondary structure type (helix, strand or coil) and the
fact of being buried or exposed. We have developed BLOSUM-like substitution
matrices that take into account the structural state of the residues. The
second stage (3D stage) uses a score function that involves 2 sites of the
template fold in contact (i.e., positions in the template 3D structure whose
sidechains are in contact). The score function is made of 2 terms. The first
one measures the likelihood of replacing a pair of residues in contact in the
template 3D structure by a pair of residues of the query sequence (it is
similar to a sequence comparison algorithm but in 3 dimensions). The second
term measures the likelihood of positioning a pair of residues of the query
sequence at 2 sites in contact in the template 3D structure characterized by
given structural states. For instance this term measures the likelihood of
aligning a pair (Gly,Asp) at 2 sites in contact that are, say, for the first
a buried helix and for the second a buried strand.
Alignment method:
For the first stage, since the score function depends only on 1 site, we use a
modified dynamic programming algorithm. For the second stage, the only exact
method to find the best alignment is a branch and bound algorithm. We have
implemented Lathrop's branch and bound algorithm. However, this algorithm
is too time-consuming for the biggest cores in the database, so we also
developed a heuristic algorithm based on a stochastic method.
Significance of the score:
The raw score value is in general useless to judge the significance of the
sequence/structure alignment because the value obtained depends on the length
of the query sequences, on the number of sites in the template core and,
above all, on the particularities of the 3D structure. In order to normalize
this raw score, for each template core we align N test sequences having the
same length as the query sequence. No pair of sequences amongst these N
sequences has more than 25% sequence identity and none of these test sequences
has more than 25% identity with the query sequence. These N sequences are real
protein sequences extracted from protein sequence databases because we fear
that shuffled sequences may lack the subtle characteristics of true proteins
(this might be especially true for the 3D term of the score function). These N
(N=100) sequences define a distribution of scores for aligning real proteins
of length L with a particular fold. Since this distribution has been obtained
empirically we do not know its analytical form. To avoid making unwarranted
assumptions about the form of the distribution we normalize the query score as
follows. First we determine the distance between the score of the 25% quantile
and the score of the 75% quantile. Then we divide the distance between the
score of the 25% quantile and the query score by this first calculated
distance. We use the 25% and 75% quantiles because we consider that the scores of
proteins before the 25% quantile or after the 75% quantile can be biased (for
instance proteins whose score appears after the 75% quantile might be related
to the core we are considering). Once the scores of the query sequence aligned
with all the cores have been normalized in this way we can rank the cores by
decreasing normalized distance.
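The quantile-based normalization described above can be sketched with the standard library. The function name is hypothetical, and two points are assumptions rather than statements from the text: higher raw scores are taken to be better, and the "inclusive" quantile interpolation is used.

```python
from statistics import quantiles


def normalized_distance(query_score, reference_scores):
    """Distance of the query score from the lower quartile of the reference
    distribution, in units of the interquartile range.

    reference_scores are the raw scores of the N test sequences aligned with
    the same core.
    """
    q25, _median, q75 = quantiles(reference_scores, n=4, method="inclusive")
    spread = q75 - q25
    if spread == 0:
        raise ValueError("reference scores have zero interquartile range")
    return (query_score - q25) / spread
```

Using only the central half of the empirical distribution avoids assuming any analytical form and keeps outlying test sequences, such as ones related to the core, from inflating the scale.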
Database of test:
It is crucial to be able to test our method under realistic conditions. For
this purpose we created a database that consists of pairs of proteins having
similar 3D structures but very different sequences. The pairs of structurally
similar proteins were obtained running VAST on the set of 1175 PDB proteins
with less than 35% sequence identity. We considered only protein pairs having
similar lengths (showing at most a variation in length of 25%) and for which
at least 65% of the residues were included in the 3D alignment. The FASTA
program was run with the proteins of the selected pairs above and only those
pairs for which the FASTA expected value was greater than 1 were retained. We
thus obtained 334 pairs corresponding to 291 individual proteins. These 334
pairs include homologous proteins but also pairs of proteins for which no
evolutionary relationship has been demonstrated. This database allows us to
test our method under realistic conditions, since it contains pairs of
structurally similar proteins whose relationship cannot be found by the usual
sequence comparison methods.
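The three selection criteria can be written as a simple filter. The sketch below is hypothetical: the `StructuralPair` record and `keep_pair` function are illustrative names, and measuring both the length variation and the aligned fraction relative to the longer protein is an assumption, because the abstract does not state the reference length.

```python
from dataclasses import dataclass

@dataclass
class StructuralPair:
    length_a: int
    length_b: int
    aligned_residues: int  # residues included in the 3D (VAST) alignment
    fasta_evalue: float    # FASTA expected value for the pair

def keep_pair(pair, max_length_variation=0.25,
              min_aligned_fraction=0.65, min_evalue=1.0):
    longer = max(pair.length_a, pair.length_b)
    shorter = min(pair.length_a, pair.length_b)
    # Assumption: length variation is measured relative to the longer protein.
    similar_length = (longer - shorter) / longer <= max_length_variation
    # Assumption: the aligned fraction is also measured on the longer protein.
    well_aligned = pair.aligned_residues / longer >= min_aligned_fraction
    # Keep only pairs that FASTA does not detect (expected value above 1).
    undetected = pair.fasta_evalue > min_evalue
    return similar_length and well_aligned and undetected
```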
Results:
We used a subset of the 291 proteins (209 proteins with a length less than
250). Each query protein was run against the database of 1175 cores and the
cores ranked according to the normalized distance as explained in Method. The
1D stage is used as a filter for the 3D stage, in that only the first 10 cores
in the ordered 1D list are subjected to the 3D filter. The rank and normalized
score of the first true positive and first false positive found in the list
were recorded. For the 209 proteins we obtained the results described in
tables I and II.
Table I: Rank of the first true positive found in the list (cumulative percentage of queries)
               1       5      10      15      20    more
Rank 1D:   61.2%   75.6%   82.8%   86.1%   89.5%  100.0%
Rank 3D:   53.6%   74.2%   81.8%   81.8%   81.8%  100.0%
Table II: Distribution of the true and false positives as a function of the normalized distance
(each cell gives the count followed by the cumulative percentage)
Normalized dist    1D True Pos.   1D False Pos.   3D True Pos.   3D False Pos.
---------------    ------------   -------------   ------------   -------------
]+inf - 6.0]        19    9.1%      0    0.0%       3    1.4%      0    0.0%
] 6.0 - 5.5]        10   13.9%      0    0.0%       3    2.9%      0    0.0%
] 5.5 - 5.0]        11   19.1%      0    0.0%       8    6.7%      0    0.0%
] 5.0 - 4.5]        15   26.3%      0    0.0%       8   10.5%      1    0.5%
] 4.5 - 4.0]        17   34.4%      2    1.0%       6   13.4%      0    0.5%
] 4.0 - 3.5]        17   42.6%     19   10.0%      12   19.1%      3    1.9%
] 3.5 - 3.0]        24   54.1%     41   29.7%      19   28.2%      9    6.2%
] 3.0 - 2.5]        28   67.5%     84   69.9%      33   44.0%     16   13.9%
] 2.5 - 2.0]        20   77.0%     53   95.2%      20   53.6%     58   41.6%
] 2.0 - 1.5]        25   89.0%      8   99.0%      19   62.7%     74   77.0%
] 1.5 - 1.0]        15   96.2%      2  100.0%      18   71.3%     37   94.7%
] 1.0 - 0.5]         5   98.6%      0  100.0%      16   78.9%      8   98.6%
] 0.5 - 0.0]         2   99.5%      0  100.0%       5   81.3%      0   98.6%
] 0.0 - -inf[        1  100.0%      0  100.0%      39  100.0%      3  100.0%
Conclusion:
Table I shows that if there is a true positive in the database (in our case
this is always true), it has about a 60% (resp. 50%) chance of appearing in first
position in the 1D stage (resp. 3D stage). However the rank does not
constitute a good criterion for judging the significance of a sequence/
structure alignment since when there is no similar 3D structure in the
database there is still a core ranked first, a core ranked second, etc. The
normalized distance provides a far better criterion. According to table II, if
the normalized distance is above 4 there is a 1% chance of having a false positive
but a 35% chance of having a true positive for the 1D stage. The 3D stage gives
slightly worse results (for a false positive rate of 1% the coverage is
less than 20%). However, we can combine the results of the 2 filters. Instead of
a single normalized distance the alignment of a core with the query sequence
is characterized by 2 normalized distances (1D, 3D). In 2 dimensions it is
possible to define a polygon where the number of false positives is less than
a given percentage. For instance plotting the 1D normalized distances along
the x axis and the 3D normalized distances along the y axis we can define a
polygon by the line x = 0, the line y = 0, the line x = 4.2, the line y = 4.0
and the line y = -x + 6.5, i.e., a square (more or less) with an upper right
corner that has been 'cut'. Outside this polygon there are 43% true positives
and 0.5% false positives. We think that this idea can be generalised to
more than 2 filters. The rationale behind this approach is that a false
positive can, just by chance, get a good score for a particular filter but it
is less likely to get a good score for several different filters. On the other
hand a true positive can occasionally get a bad score for a given filter but
should fare better, on average, for the other filters.
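The two-filter decision rule can be sketched as a point-in-region test. The function name is hypothetical, and one interpretation is assumed: "outside the polygon" is taken to mean beyond its upper or right boundary, so pairs with negative normalized distances are not counted as significant.

```python
def beyond_polygon(d1, d3, x_max=4.2, y_max=4.0, diagonal=6.5):
    """True if the point (1D distance, 3D distance) lies beyond the upper or
    right boundary of the polygon bounded by x = 0, y = 0, x = x_max,
    y = y_max and y = -x + diagonal."""
    # The line y = -x + diagonal is equivalent to x + y = diagonal.
    return d1 > x_max or d3 > y_max or d1 + d3 > diagonal
```

With the values quoted in the text, a pair of distances (3.5, 3.5) is accepted because it lies above the cut corner, whereas (3.0, 3.0) is not.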
Godzik , 197
number of submitted models: 158
L.Jaroszewski, A.Godzik
email: adam@ljcrf.edu
Abstract
We applied an identical procedure to homology modeling targets and fold
recognition targets. It consists of three steps: A: Selection of the
template(s), B: Generation of suboptimal alignments, C: Model building and
evaluation. When the FFAS z-score indicated that the similarity
between the template and query was strong (z-score higher than 15),
step B was usually skipped and the model was built from the FFAS
alignment. This was the case for many of the homology modeling targets. The
prototype of this procedure called "Multiple Model Approach" was described and
evaluated in (4-5).
A. Selection of the template(s) - Fold & Function Assignment System (1,2).
An FFAS profile-profile search was performed against the PDB database. FFAS is
based on sequence profile-profile matching with dynamic programming. The multiple
alignment is prepared from the PSI-BLAST (8) output. A non-redundant
database of protein sequences was used for profile calculation. FFAS uses
sequences from the PSI-BLAST output with E-value below 0.01 and an elaborate
weighting scheme for the sequences included in the profile (1). Weights are
assigned based on the dissimilarity of each sequence with respect to the other
sequences in the family. In addition, FFAS normalizes the
matrix containing the comparison scores between all positions of both aligned
profiles before the best path is searched for with the Smith-Waterman
dynamic programming algorithm (7).
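As an illustration of the final step, the sketch below runs a Smith-Waterman local alignment over a precomputed position-by-position score matrix. It is not the FFAS implementation: the profile comparison scores, their normalization and the gap model used by FFAS are not given here, so the linear gap penalty and the function name are assumptions.

```python
def smith_waterman(scores, gap_penalty=1.0):
    """Best local alignment through scores[i][j], the comparison score between
    position i of one profile and position j of the other.
    Returns (best score, list of aligned (i, j) position pairs)."""
    n, m = len(scores), len(scores[0])
    H = [[0.0] * (m + 1) for _ in range(n + 1)]
    best, best_cell = 0.0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            H[i][j] = max(
                0.0,                                      # start a new local alignment
                H[i - 1][j - 1] + scores[i - 1][j - 1],   # align position i with j
                H[i - 1][j] - gap_penalty,                # gap in the second profile
                H[i][j - 1] - gap_penalty,                # gap in the first profile
            )
            if H[i][j] > best:
                best, best_cell = H[i][j], (i, j)
    # Trace back from the best cell until the score drops to zero.
    path = []
    i, j = best_cell
    while i > 0 and j > 0 and H[i][j] > 0:
        if H[i][j] == H[i - 1][j - 1] + scores[i - 1][j - 1]:
            path.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] - gap_penalty:
            i -= 1
        else:
            j -= 1
    path.reverse()
    return best, path
```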
B. Calculation of suboptimal alignments.
A set of suboptimal (alternative) alignments was generated for the query
sequence and the template structure(s) selected from the PDB database in
step A. After the calculation of the initial alignment based on the
profile-profile FFAS method, a similarity matrix was recalculated using
several combinations of threading terms (burial and local conformation terms
are used). The threading energy was calculated for the sequence profile
rather than for a single sequence, as is done in classical
threading. Several gap penalty values were also explored. Gap penalties were
set higher within the secondary structure elements, defined with the method
described in a separate publication (3). The resulting alignments were
clustered to avoid redundancy.
C. Model building and evaluation. The models based on the alignments
calculated in step B were built and evaluated. We used the MODELLER (6) program
developed in the A. Sali lab for model building. Model evaluation is based on the
threading energy using a statistical potential and evolutionary information
encoded in sequence profiles (the threading energy was calculated for the
sequence profile rather than for a single sequence, as is done in
classic threading, for example in the MatchMaker program). The threading energy
per residue was the final criterion of model quality.
References
1. Rychlewski, L., Jaroszewski, L., Li, W. & Godzik, A. (2000).
"Comparison of sequence profiles. Strategies for structural predictions using
sequence information". Protein Science 9, 232-241
2. Jaroszewski, L., Rychlewski, L. & Godzik, A. (2000).
"Improving the quality of twilight-zone alignments". Protein Science, 9, 1487-1496
3. Jaroszewski, L. & Godzik, A. (2000). Search for a New Description of
Protein Topology and Local Structure. ISMB 2000 - 8-th International
Conference on Intelligent Systems for Molecular Biology, San Diego 2000
4. Jaroszewski, L., Pawlowski, K. & Godzik, A. (1998).
"Multiple model approach: an extension of comparative modelling". Journal of
Molecular Modelling 4, 294-309
5. Pawlowski, K., Jaroszewski, L., Bierzynski, A. & Godzik, A. (1997).
"Multiple model approach - dealing with alignment ambiguities in comparative
protein modeling". In Biocomputing, 97 (Altman, R. B., Dunker, A. K., Hunter,
L. & Klein, T. E., eds.), pp. 328-339. World Scientific, Singapore.
6. Sali, A. and Blundell, T. L. (1993).
"Comparative protein modelling by satisfaction of spatial restraints". J. Mol.
Biol. 234, 779-815
7. Smith, T.F. and Waterman, M.S. (1981) "Identification of common molecular
subsequences". J Mol Biol 147:195-7
8. Altschul, S.F. et al. (1997) "Gapped BLAST and PSI-BLAST: a new generation
of protein database search programs". Nucleic Acids Res 25:3389-402
Tsigelny , 274
number of submitted models: 210
Igor F. Tsigelny 1,2*, Yuriy Sharikov 2, Palmer Taylor 1, Lynn F. Ten Eyck 2
(1 Department of Pharmacology, 2 San Diego Supercomputer Center)
email: itsigeln@ucsd.edu
A system based on Hidden Markov Models (HMMSPECTR) for finding structural
homologs of proteins on the basis of their sequences is described. The system
receives a single probe sequence or a sequence alignment and uses it
to search a library of Hidden Markov Models (HMMs) built from
structural alignments. The initial library of fold superfamilies was
constructed using the SCOP fold classification [1]. We created structural
alignments, using the CE algorithm [2], for each superfamily for six main classes
of folds: all alpha proteins, all beta proteins, alpha and beta proteins
(alpha/beta), alpha and beta proteins (alpha+beta), multi-domain proteins
(alpha and beta), small proteins. For the following folds we created
alignments for each family: EF hand-like (alpha), PHGase F-like (beta),
Supersandwich (beta); NAD(P)-binding Rossmann-fold domains (alpha/beta);
Thioredoxin fold (alpha/beta); Pyruvate-ferredoxin oxidoreductase (PFOR) domain
III (alpha/beta), IL8-like (alpha+beta), Zincin-like (alpha+beta). For the
fold Globin-like (alpha) we created alignments for all protein domains (2
families: Globins and Phycocyanins). In all cases such a division was needed
so that the alignments cover all SCOP proteins of the specific subdivision. The
overall number of structural alignments created was about 1500. The average
number of proteins in each of these alignments is about 250. For each alignment we
produced 4 HMMs using the HMMER package [3]: one built directly from
the initial alignment and three differently trained HMMs. HMMSPECTR
works in three modes using three different libraries of alignments and HMMs.
The modes are applied in the following order as the score of the probe protein
decreases.
(1) After finding the HMM with the best score, the system starts a further
analysis of the alignment underlying that HMM. It creates a subset of ‘cluster’
alignments of protein sequences having close scores when aligned with the probe
sequence. On the basis of these sets of alignments the system creates a new set
of HMMs, which is then analyzed. Then it extracts a ‘dominant’ protein in each
HMM, i.e. the protein having the best sequence alignment to the consensus
sequence of the corresponding HMM.
(2) There are some cases when a target protein is represented in some of the HMMs
but does not have enough relatives with solved structures. In such cases dividing
large HMMs into smaller parts is very effective. In this mode the system
uses a library of HMMs constructed on the basis of partial structural
alignments of folds.
(3) If the scores remain very low even in the second mode, the system
switches to the third mode. This mode uses the principles of self-organization
in the process of decision making. The system uses a library of HMMs created
with different gaps/letters ratios and different numbers of participating
protein sequences. The decision is made by optimizing the
results over all parameters used in the calculations.
REFERENCES
1. Murzin A. G., Brenner S. E., Hubbard T., Chothia C. (1995). SCOP: a
structural classification of proteins database for the investigation of
sequences and structures. J. Mol. Biol. 247, 536-540.
2. Shindyalov I N, Bourne P E. (1998). Protein structure alignment by
incremental combinatorial extension (CE) of the optimal path. Protein
Engineering 11:739-747.
3. Eddy S. (1995). Hidden Markov Models of Proteins and DNA Sequence.
Washington Univ. School of Medicine.
BinToHes , 255
number of submitted models: 43
Eckart Bindewald, Silvio Tosatto, Jochen Maydt, Achim Trabold, Juergen
Hesser, Reinhard Maenner
email: bindewald@ti.uni-mannheim.de
If no proteins with known structure and homologous sequence were found for the
target protein (using PSI-BLAST [1]) we used our fold recognition system
MANIFOLD to suggest possible template proteins. It uses a database of
profiles of a set of 2083 representative protein structures of the FSSP
database [2]. For each database protein the sequence, secondary structure,
accessibility (defined with DSSP [3]), FSSP structure code and (if applicable)
enzyme code is stored. A similar profile is prepared for each CASP-4 target
protein, using the predicted secondary structure (based on the secondary
structure prediction programs JPRED [4], PHD [5] or SSPRO [6]) and predicted
accessibility (taken from the output of PHD [7]). The enzyme code was taken
from SWISS-PROT [8].
As the first step, structure-function rules are applied in order to exclude
template proteins which have a structure that is assumed to be incompatible
with the function of the target protein (this applies in our implementation
only to proteins with an enzyme code). In order to derive such rules, we made
a cross analysis between the FSSP structure codes and the enzyme codes of all
proteins which are representative structures in the FSSP database. If a set
of structures among the FSSP representatives was at least ten times, and with
no exception, associated with the same set of functions, it was used as a rule
that this association must hold also for proteins with unknown structure.
Possible template proteins that passed this set of rules were subsequently
ranked according to the following criteria. The main ordering criterion for the
fold recognition process is the secondary structure "block" similarity. The
length of helix, strand or loop regions is ignored; each helix or strand
region is represented by a single letter. The number of mutations needed to
convert the secondary-structure block string of the target into that of the
template yields a score, which is the main ordering criterion for the template
structures. The template proteins were further sorted according to a "jury" of
other criteria.
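A minimal sketch of the block similarity follows. It is not MANIFOLD's code: a three-state alphabet (H helix, E strand, anything else loop) is assumed, loop regions are dropped from the block string, and "number of mutations" is taken to be the Levenshtein edit distance. All three are assumptions, and the function names are hypothetical.

```python
from itertools import groupby

def block_string(secondary_structure, keep=("H", "E")):
    """Collapse a per-residue string such as 'CCHHHHCCEEEC' to one letter per
    helix or strand region ('HE'); region lengths and loops are ignored."""
    return "".join(state for state, _run in groupby(secondary_structure)
                   if state in keep)

def edit_distance(a, b):
    """Levenshtein distance: substitutions, insertions and deletions cost 1."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + (char_a != char_b)  # match or substitute
            ))
        previous = current
    return previous[-1]

def block_score(target_ss, template_ss):
    """Lower is better: edits needed to turn one block string into the other."""
    return edit_distance(block_string(target_ss), block_string(template_ss))
```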
The additional similarity criteria are:
1. the relative length (number of amino acids) similarity,
2. the score of a Needleman-Wunsch alignment of the combined secondary
structure/accessibility,
3. functional similarity as reflected in the enzyme-code (if applicable).
The functional similarity is taken to be the number of identical enzyme code
hierarchy levels. For each of these additional criteria a ranking list is
created. For each criterion a "jury point" is given if the template structure
is found among the top ten considering this criterion alone, and another point
if the template structure is among the top five structures. The top scoring
template proteins were inspected. The template protein finally used was chosen
according to the most convincing sequence alignment (using ClustalW [9]), combined
with a visual inspection of a manual alignment of the secondary structure. The
subsequent modeling of the protein structure given the template structure is
described in the CASP-4 comparative modeling abstract "Ab initio loop modeling
with precalculated synthetic loops and sidechain placement" (Tosatto,
Bindewald et al).
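The jury scoring and the enzyme-code similarity can be sketched as below. The function names and data layout are hypothetical, and counting only identical leading levels of the enzyme code (stopping at an unassigned '-' level) is an assumption about how "identical hierarchy levels" were counted.

```python
def enzyme_code_similarity(ec_a, ec_b):
    """Number of identical leading hierarchy levels of two enzyme codes,
    e.g. '1.1.1.1' and '1.1.2.3' share 2 levels."""
    shared = 0
    for level_a, level_b in zip(ec_a.split("."), ec_b.split(".")):
        # Assumption: an unassigned level ('-') never counts as a match.
        if level_a != level_b or level_a == "-":
            break
        shared += 1
    return shared

def jury_points(rankings):
    """rankings maps each criterion name to a list of template ids, best first.
    A template earns one point per criterion for a top-ten place and a second
    point for a top-five place."""
    points = {}
    for ranked_templates in rankings.values():
        for position, template in enumerate(ranked_templates[:10]):
            points[template] = points.get(template, 0) + (2 if position < 5 else 1)
    return points
```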
References:
[1] S. Altschul, T. Madden, A. Schaffer, J. Zhang, Z. Zhang, W. Miller, D. Lipman:
Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.
Nucleic Acids Research, 25(17): 3389-3402 (1997).
[2] L. Holm, C. Sander.
The FSSP database of structurally aligned protein fold families.
Nucleic Acids Research 22(17): 3600-3609. (1994).
[3] W. Kabsch, C. Sander:
Dictionary of protein secondary structure:
Pattern recognition of Hydrogen-Bonded and Geometrical Features.
Biopolymers 22, 2577-2637 (1983).
[4] J. A. Cuff, M. E. Clamp, A. S. Siddiqui, M. Finlay, G. J. Barton:
Jpred: a consensus secondary structure prediction server.
Bioinformatics 14:892-893 (1998).
[5] B. Rost, C.Sander:
Prediction of protein secondary structure at better than 70% accuracy.
J.Mol.Biol. 232: 584-599.(1993).
[6] P. Baldi, S. Brunak, P. Frasconi, G. Pollastri, G. Soda:
Exploiting the past and the future in protein secondary structure prediction.
Bioinformatics 15:937-946 (1999).
[7] B. Rost,C. Sander:
Conservation and Prediction of Solvent Accessibility in Protein Families.
Proteins 20: 216-226 (1994).
[8] A. Bairoch, R. Apweiler:
The SWISS-PROT protein sequence data bank and its supplement TrEMBL in 1998.
Nucleic Acids Research 26(1) 38-42 (1998).
[9] J.D. Thompson, D.G. Higgins, T.J. Gibson:
CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment
through sequence weighting, position-specific gap penalties and weight matrix choice.
Nucleic Acids Res. 22: 4673-4680(1994).
Fox-Sheppard , 536
number of submitted models: 17
Brian Fox and Paul Sheppard
email: bfox@zgi.com
METHOD: The target sequence was searched with BLAST against all known proteins,
and each reasonable hit was then searched against the sequences of all the
proteins in the PDB. The top reasonable hits in the PDB were used as a model.
Manual alignment between the PDB sequence and the target was used
to try to place the gaps in the sequence near the loops of the
3D structure. This method can be effective in finding more remote homologs than
using only a single round of BLAST.
Taylor , 390
number of submitted models: 56
William R. Taylor, Kuang Lin, Delmiro Fernandez
email: kxlin@nimr.mrc.ac.uk
Three-dimensional (3D) models of CASP4 targets were generated using
a cooperative approach. The probe sequences were first used in
iterative sequence databank searches with templates and multiple
alignment.
With predicted secondary structures from the multiple sequence
alignments, the backbone structures were then built using the
multiple sequence threading program MST.
After manual selection, we used MaxSprout to construct full atom
structure models.
A simple 1D/3D matching program TUNE helped in the selection of
target structures.
Reference:
MULTAL W.R.Taylor (1988) J. Molec. Evol., 28:161--169
DOMS W.R.Taylor (1999) Prot. Engng., 12:203--216
QUEST W.R.Taylor (1998) J. Molec. Biol., 280:375--406
MST W.R.Taylor (1997) J. Molec. Biol., 269:902--943
MaxSprout L.Holm and C.Sander (1991) J. Mol. Biol. 218:183-194.
PREDATOR D.Frishman and P.Argos (1997) Proteins, 27, 329-335.
PSIPRED D.T.Jones (1999) J. Mol. Biol. 292:195-202.
CATH C.A.Orengo et al. (1997) Structure. 5. 8. 1093-1108.
MRIT-Onizuka , 052
number of submitted models: 235
Kentaro ONIZUKA
email: onizuka@mrit.mei.co.jp
The method used to identify the fold type of the given protein sequences
in CASP4 has the following features.
1. Threading using mean-force potentials of the Sippl type.
2. The mean-force potentials are multi-dimensional.
Actually they are 3D: not only the residue-residue distance but also the
direction is taken into account. To avoid an explosion in the number of
potential-describing coefficients, a linear compression technique is applied.
3. The mean-force potentials are not pairwise but singleton. The
energy of a pair is defined with respect to the residue type of only one
residue of the pair; the other residue type is set to 'any.'
4. The template protein structures are compiled into sequence profiles where
the energy value with respect to each position and each amino residue-type
is assigned to each cell of the sequence profiles, summing up over the
energy of all possible interactions to the residue position.
5. The given amino-residue sequence of the prediction target is
threaded to the template protein sequence profiles, by the dynamic
programming algorithm.
6. The match score of each position is not the single residue to single
position match score but the windowed fragment-fragment match score.
The window width is 17 residues (-8 to +8 around the residue to match).
7. The gap scoring scheme has two components: 1) a continuation bonus,
and 2) an extension penalty.
In this abstract, I briefly explain the features that have not been
published.
[1] The Singleton Multi-dimensional Mean-Force Potentials
The mean-force potentials in general are pairwise potentials which
depend on the residue-types of both residues involved in the
interaction. The pairwise potentials, however, raise a very difficult
problem when applied to gapped structure-sequence threading. The
problem is how to find the structure-sequence alignment that
gives the best score. As a combinatorial optimization problem, this
structure-sequence alignment problem, optimizing the summation of
pairwise potentials, has no rapid algorithm that gives the exact
optimal solution. There are two approximation techniques: 1) the frozen
approximation, and 2) defining singleton potentials. The frozen
approximation looks promising, but when the sequences are remote the score
is quite unreliable. If the potentials are singleton, depending on only
one of the residue-types of the pair with the other residue set to 'any',
the optimization can be done by dynamic programming, as with the
frozen approximation.
I assessed whether the pairwise potentials are superior to the
singleton ones in terms of the self-recognition ratio under the so-called
Sippl test (gapless self-structure recognition test). In the case of
single-dimensional potentials, the self-recognition ratios for
pairwise potentials are slightly better than for singleton potentials,
while in the multi-dimensional case (3D) the singleton
potentials almost always achieved slightly better recognition ratios
than pairwise potentials.
The multi-dimensional mean force potentials are defined with respect
to, other than sequential separation and residue-type, at most, the
six degrees of freedom of the relative configurations (relative
positions and relative orientations) between a pair of residues; i.e.,
not only the distance between the residues, but also the relative
directions and orientations. In CASP4 case, I adopted 3D potential
with respect to the distance and the direction.
The greatest difficulty in multi-dimensional statistical analysis is
the explosion in the number of coefficients to represent the
potential. To overcome the explosion of coefficients, the author
applied a linear compression technique to the multi-dimensional
distribution. Here the multi-dimensional distribution is linearly
expanded by a series of orthonormal bases into a set of coefficients,
where the flexible choice of the maximal order for the expansion in
each dimension controls the total number of coefficients representing
the distribution.
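The idea of compressing a distribution by a truncated orthonormal expansion can be illustrated in one dimension. The abstract does not name the basis; the cosine basis used below is an assumption chosen for simplicity, and the function names are hypothetical. The maximal order plays the role of the per-dimension cut-off described above.

```python
import math

def cosine_basis(order, n_bins):
    """Orthonormal cosine basis function of the given order on n_bins points."""
    scale = math.sqrt((1.0 if order == 0 else 2.0) / n_bins)
    return [scale * math.cos(math.pi * (n + 0.5) * order / n_bins)
            for n in range(n_bins)]

def compress(histogram, max_order):
    """Project a binned distribution onto basis functions 0..max_order,
    keeping max_order + 1 coefficients instead of len(histogram) values."""
    n_bins = len(histogram)
    return [
        sum(value * phi for value, phi in zip(histogram, cosine_basis(k, n_bins)))
        for k in range(max_order + 1)
    ]

def reconstruct(coefficients, n_bins):
    """Rebuild the (smoothed) distribution from its expansion coefficients."""
    values = [0.0] * n_bins
    for k, coefficient in enumerate(coefficients):
        for n, phi in enumerate(cosine_basis(k, n_bins)):
            values[n] += coefficient * phi
    return values
```

Keeping all orders reproduces the histogram up to rounding; lowering `max_order` trades detail for fewer coefficients. For a multi-dimensional distribution the same projection would be applied along each dimension with its own maximal order.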
[2] The Dynamic-Programming-based Threading
Using Fragment-fragment Match and Continuation Bonus Scoring
It is relatively easy to obtain optimal alignment when the mean-force
potentials are singleton depending on the type of only one residue
involved in the interaction. After statistical analysis is done by
compiling the coefficients of mean force potentials from the learning
data-set, each template structure can be compiled into the sequence
profile where the potential value (summation of the
mean-force-potentials involving the residue at the position) for each
amino residue-type is assigned. And then each given protein sequence
is threaded to the sequence profile by the dynamic programming
algorithm.
The gap-scoring technique here adopted is a combination of 1) giving a
continuation bonus when gaps are not inserted, and 2) giving an
extension penalty when gaps extend. The continuation-bonus scoring is
beneficial because the total continuation-bonus depends on the length
of the shorter one of either the given sequence or the template
structure profile.
Regarding the match score, when residue 'a' of the given sequence is
matched to the position 'i' in the template structure profile, not
only the energy value of the residue type 'a' at the position 'i' but
the scores of neighboring position are considered. Thus the potential
value is not the residue-position match-score but the score of
sequential fragment to the structure fragment of the same length. In
the CASP4 case, the fragment length was 17 (-8 to +8). This
modification to the mean-force-potential-based threading contributes
greatly to improve the self-structure recognition ratio.
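The windowed match score can be sketched as follows. This is an illustration only: the profile is represented as a list of per-position dictionaries (residue type to energy), the window contributions are summed without weights, and offsets that run past either end are skipped. These choices and the function name are assumptions.

```python
def windowed_match_score(profile, sequence, position, residue_index, half_width=8):
    """Score for matching sequence[residue_index] to template position
    `position`, summed over an ungapped window of 2 * half_width + 1 residues.
    profile[i][aa] is the energy of residue type aa at template position i."""
    total = 0.0
    for offset in range(-half_width, half_width + 1):
        i = position + offset
        j = residue_index + offset
        # Assumption: offsets outside the template or the sequence are skipped.
        if 0 <= i < len(profile) and 0 <= j < len(sequence):
            total += profile[i][sequence[j]]
    return total
```

With `half_width=8` the window covers the 17 residues (-8 to +8) quoted in the text; the resulting value would replace the single-position score inside the dynamic programming recursion.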
[Result]
The threading method used in CASP4 was found to be very sensitive
to structural diversity. Thus, the self-structure recognition
ratio in gapped threading exceeded 70% in the Sippl test,
while in the test to recognize a different protein belonging to
the same SCOP fold class, the recognition ratio was very
disappointing. In CASP4, many targets are predicted to have
2.75 and 3.9 of SCOP fold class.
[Reference]
Sippl, 1990: Sippl M.J. (1990) Calculation of
Conformational Ensembles from Potentials of Mean Force: An Approach to
the Knowledge-based Prediction of Local Structure in Globular
Proteins. J. Mol. Biol., 213,859-883.
Matsuo et al., 1995: Matsuo Y., Nakamura H., Nishikawa
K. (1995) Detection of Side-Chain Packing and Electrostatic
Interactions. J. Biochem. 118,137-148
Alexandrov et al. 1996: Alexandrov N.N. Nussinov R. Zimmer R.M. (1996)
Fast Protein Fold Recognition via Sequence to Structure Alignment and
Contact Capacity Potentials. Proc. PSB 1996 53-72
Hendlich et al., 1990: Hendlich M., Lackner P., Weitckus S.,
Floeckner H., Froschauer R., Gottsbacher K., Casari G., Sippl
M.J. (1990) Identification of Native Protein Folds Amongst a Large
Number of Incorrect Models, The calculation of Low Energy Conformation
from Potentials of Mean Force. J. Mol. Biol.,216,167-180
Wallace, 1991: Wallace G.K. (1991) The JPEG Still
Picture Compression Standard. CACM 34, 30-44
GMD-SCAI , 361
number of submitted models: 98
Ralf Zimmer, Theo Mevissen, Ingolf Sommer, Thomas Lengauer
email: ralf.zimmer@gmd.de
A reduced candidate list is produced via ToPLign [1]
sequence/profile alignments, 123D [2] and RDP [3].
The top scoring candidates from this list and the
corresponding refined alignments have been produced
with RDP [3] using newly derived modifications of
contact capacity [2] and pair interaction potentials.
References:
[1] Heinz Mevissen and Ralf Thiele and Ralf Zimmer and Thomas Lengauer:
"Analysis of Protein Alignments -- The software environment ToPLign"
GMD, 1994-98 (http://cartan.gmd.de/ToPLign.html)
[2] Nick Alexandrov and Ruth Nussinov and Ralf Zimmer:
"Fast Protein Fold Recognition via Sequence to Structure Alignment
and Contact Capacity Potentials",
Pacific Symposium on Biocomputing'96, World Scientific Publ. Co.,
1996, 53--72.
[3] Ralf Thiele and Ralf Zimmer and Thomas Lengauer:
"Recursive Dynamic Programming for Adaptive Sequence
and Structure Alignment",
ISMB'95, C. Rawlings et al. (Eds.), AAAI Press, 384--392.
[4] Ralf Thiele and Ralf Zimmer and Thomas Lengauer:
"Protein Threading by Recursive Dynamic Programming",
JMB, 290, 757--779, 1999.
[5] Ralf Zimmer and Marko Woehler and Ralf Thiele:
"New Scoring Schemes for Protein Fold Recognition
based on Voronoi Contacts",
Bioinformatics, 14, 3, 295--308, 1998.
[6] Alexander Zien and Ralf Zimmer and Thomas Lengauer:
"A simple iterative approach to parameter optimization",
JCB, 2000, in press.
SBfold , 381
number of submitted models: 68
Kristin K Koretke, Robert Russell, Autumn L Sutherlin, Craig Volker, Michael J Bower,
Ajita Bhat, Maxwell D Cummings, and Andrei N Lupas
email: Andrei_N_Lupas@sbphrd.com
Summary
All CASP4 targets were submitted to the sensitive search routine program
SENSER, as described in detail in the abstract for the SBauto submission.
Secondary structure predictions were gathered from the JPred server.
Additional sequence searches were done using regular expression patterns and
HMMs. If a protein of known structure appeared to match the properties of the
target, alignments were generated using MACAW or HMMer.
Details
Details on the operation of SENSER are given in the SBauto abstract.
If SENSER identified a potential template structure, its match with the target
was evaluated using predicted secondary structure, the occurrence of sequence
patterns, and biochemical information. The alignment was generated using MACAW
or HMMer.
If SENSER did not identify a potential template structure, regular expression
patterns, predicted secondary structure, and biochemical information were used
to search for possible templates. In addition, in cases where the target was
only a fragment of a larger protein, the entire protein was used in sequence
searches. If a template was judged to match the properties of the target, an
alignment was produced using MACAW, HMMer, Clustal, or a combination of these
methods, to produce the alignment that seemed most plausible to us based on
conserved residues, hydrophobicity, and secondary structure.
SBauto , 382
number of submitted models: 74
Kristin K Koretke, Robert Russell, Autumn L Sutherlin, Craig Volker,
Michael J Bower, Ajita Bhat, Maxwell D Cummings, and Andrei N Lupas
email: Kristin_K_Koretke@sbphrd.com
Summary
All CASP4 targets were submitted to the sensitive search routine program
SENSER, which is based on PSI-Blast and HMMer. SENSER runs through three
different search strategies, using PSI-Blast as its search engine, to
identify a relationship with a sequence of known structure. As soon as a fold
is identified, an alignment between the CASP target sequence and the sequence
with a known fold is generated using HMMer. If a relationship between the
CASP target and a sequence with a known structure was not identified, a
prediction of "novel fold" was submitted.
Details
In the first step SENSER performs a PSI-Blast search with the target
sequence. Proteins identified in the search are divided into a significant
sequence space, containing those sequences with an E value lower than 10^-3,
and a 'trailing end' of sequences between 10^-3 and 10. Because some of the
proteins detected may contain unrelated domains, all proteins are trimmed to
the actual region detected in the PSI-Blast run.
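The split of the first search into the two sets can be sketched as below. This is not SENSER code: the function name, the (identifier, E-value) input format and the treatment of values exactly at the cut-offs are assumptions.

```python
def partition_hits(hits, significant_cutoff=1e-3, trailing_cutoff=10.0):
    """Split (sequence_id, e_value) hits into the significant sequence space
    (E value below 10^-3) and the trailing end (between 10^-3 and 10)."""
    significant, trailing_end = [], []
    for sequence_id, e_value in hits:
        if e_value < significant_cutoff:
            significant.append(sequence_id)
        elif e_value <= trailing_cutoff:
            # Assumption: a hit exactly at either cut-off counts as trailing end.
            trailing_end.append(sequence_id)
    return significant, trailing_end
```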
In the second step transitive searches are used to expand the significant
sequence space. Only proteins within the significant sequence space that have
less than 25% identity to the target sequence are used as starting points for
further PSI-Blast searches, in order to avoid redundant searches, i.e. those
that produce similar profiles and sequence spaces. This value was chosen as it
is a frequently quoted threshold for the 'twilight zone', below which
sequences cannot be confidently said to be homologous.
In the third step trailing-end sequences are tested for their ability to
back-validate, i.e. detect any sequence of the significant sequence space of
the target in PSI-Blast. Because several PSI-Blast searches were performed to
establish the significant sequence space, trailing-end sequences are pooled
and ranked first by number of occurrences and second by E-value, before being
tested. If a trailing-end sequence back-validates, its significant sequence
space is added to that of the target. The process is then repeated until no
further sequences are detected.
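Pooling and ranking the trailing-end sequences before back-validation can be sketched as follows. The function name and input layout are hypothetical, and using the best (lowest) E-value when a sequence is found by several searches is an assumption; the back-validation search itself is not shown.

```python
from collections import defaultdict

def rank_trailing_end(searches):
    """searches holds one list of (sequence_id, e_value) trailing-end hits per
    PSI-Blast search. Returns sequence ids ranked first by number of
    occurrences (most first) and second by E-value (lowest first)."""
    occurrences = defaultdict(int)
    best_e_value = {}
    for hits in searches:
        for sequence_id, e_value in hits:
            occurrences[sequence_id] += 1
            # Assumption: keep the lowest E-value seen for a repeated sequence.
            best_e_value[sequence_id] = min(
                e_value, best_e_value.get(sequence_id, e_value))
    return sorted(occurrences,
                  key=lambda seq: (-occurrences[seq], best_e_value[seq]))
```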
The steps above can connect proteins that are far apart in sequence space;
however, beyond the first PSI-Blast search, they do not directly provide an
alignment of the target to the sequences detected. Moreover, even for
sequences detected in the first step, PSI-Blast generally provides only
partial alignments. For these reasons, we introduced an alignment strategy
based on HMMer. After the first PSI-Blast search, we build a target HMM from
the proteins in the significant sequence space, as aligned by PSI-Blast. Any
sequence detected at this step is aligned to the target sequence using the
target HMM. Any sequence detected at a subsequent step is aligned in a
five-part process: (1) a PSI-Blast search is run for the untrimmed sequence, (2) a
multiple alignment is extracted, (3) this alignment is combined with the
sequences of the target HMM to produce a global alignment, using the target
HMM as a template, (4) a final HMM is built from this global alignment, and
(5) this HMM is used to align the detected sequence to the target.
Fugue-Cam , 103
number of submitted models: 215
Jiye Shi, Tom L. Blundell and Kenji Mizuguchi
email: kenji@cryst.bioc.cam.ac.uk
We have attempted to predict the structures of all the CASP4
targets using the fully automatic method FUGUE (Shi et al.,
submitted). FUGUE has been developed for recognizing distant
homologues by sequence-structure comparison and producing
reliable alignments. It has three key features: (1) Improved
environment-specific substitution tables (Johnson et al., 1993;
Overington et al., 1990). Substitutions of an amino acid in a
protein structure are constrained by its local structural
environment, which can be defined in terms of secondary
structure, solvent accessibility, and hydrogen bonding status
(Burke et al., 1999; Mizuguchi et al., 1998a). The environment-
specific substitution tables have been derived from 177
structural alignments in the HOMSTRAD database (Mizuguchi et
al., 1998b). (2) Automatic selection of alignment algorithm
with detailed structure-dependent gap penalties. FUGUE uses the
global-local algorithm to align a sequence-structure pair when
they greatly differ in length and uses the global algorithm in
other cases. The gap penalty at each position of the structure
is determined according to its solvent accessibility, its
position relative to the secondary structure elements (SSEs)
and the conservation of the SSEs. (3) Combined information from
both multiple sequences and multiple structures. FUGUE is
designed to align multiple sequences against multiple structures
to enrich the conservation/variation information. For a given
query sequence, FUGUE calls PSI-BLAST to collect sequence
homologues from the NCBI non-redundant sequence database and
calculates a sequence profile from refined PSI-BLAST alignment.
This sequence profile is then used to search against a
structural profile library, which is derived from the HOMSTRAD
structural alignments using environment-specific substitution
tables and structure-dependent gap penalties. A Z-score is
calculated to evaluate the similarity of each sequence-structure
pair.
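The abstract does not say how the Z-score is computed. A common convention,
assumed here purely for illustration, is to compare the raw alignment score
with the mean and standard deviation of a background distribution of scores
(for example, scores obtained with shuffled sequences). The function name and
the use of the population standard deviation are choices made for this sketch.

```python
import statistics


def z_score(raw_score, background_scores):
    """Z-score of raw_score relative to a background score distribution.

    Assumes a non-empty background with non-zero spread.
    """
    mean = statistics.mean(background_scores)
    sd = statistics.pstdev(background_scores)
    if sd == 0:
        raise ValueError("background scores have zero variance")
    return (raw_score - mean) / sd
```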
References:
Burke, D. F., Deane, C. M., Nagarajaram, H. A., Campillo, N.,
Martin-Martinez, M., Mendes, J., Molina, F., Perry, J., Reddy,
B. V., Soares, C. M., Steward, R. E., Williams, M., Carrondo,
M. A., Blundell, T. L. & Mizuguchi, K. (1999). An iterative
structure-assisted approach to sequence alignment and
comparative modeling. Proteins Suppl(3), 55-60.
Johnson, M. S., Overington, J. P. & Blundell, T. L. (1993).
Alignment and searching for common protein folds using a data
bank of structural templates. J Mol Biol 231(3), 735-52.
Mizuguchi, K., Deane, C. M., Blundell, T. L., Johnson, M. S. &
Overington, J. P. (1998a). JOY: protein sequence-structure
representation and analysis. Bioinformatics 14(7), 617-23.
Mizuguchi, K., Deane, C. M., Blundell, T. L. & Overington,
J. P. (1998b). HOMSTRAD: a database of protein structure
alignments for homologous families. Protein Sci 7(11), 2469-71.
Overington, J., Johnson, M. S., Sali, A. & Blundell, T. L.
(1990). Tertiary structural constraints on protein evolutionary
diversity: templates, key residues and structure prediction.
Proc R Soc Lond B Biol Sci 241(1301), 132-45.
Murzin , 384
number of submitted models: 21
Alexey G. Murzin and Alex Bateman
email: agm@mrc-lmb.cam.ac.uk
Since our team’s last performance in CASP2 four years ago, we have been
working on the methods that could extend the superfamilies of known structure
in SCOP to the sequence families of unknown structure in Pfam and other
sequence libraries. We entered CASP4 hoping that this prediction experiment
would provide an opportunity to test our new methods. Systematic work on the
extension of SCOP superfamilies has already resulted in the structural
assignment of many sequence families of unknown structure and, often, unknown
function. Indeed, in CASP3, there were at least three targets predictable by
this approach. Disappointingly, however, none of the CASP4 targets turned out
to be in our list of protein families with already assigned structures.
Therefore, in CASP4 we used essentially the same approach as developed for
CASP2 (Murzin A.G. and Bateman A. Distant homology recognition using
structural classification of proteins. Proteins, Suppl. 1:105-112, 1997). We
searched for probable homologues of the target sequences and available
biochemical information on the target protein and/or its sequence family and
used the predicted secondary structure to shortlist the SCOP superfamilies, to
which each attempted target may belong. Predictions were based on the
discovery of superfamily specific characters. The experience and expertise
gained from our working on SCOP and Pfam databases were of a great help in
this knowledge-based approach. Also, we tried our knowledge-based approach in
the two other prediction categories. We used superfamily specific features to
improve the alignments in some of the comparative modelling targets. For
several targets, predicted by our approach to be not related to any of the
SCOP superfamilies, we attempted the fold prediction using the conservation
patterns in the target sequence families, the available biochemical data
and/or the empirical folding rules derived from known protein structures.
The choice of prediction format, TS, and the target selection were influenced
by the CASP3 Fold Recognition assessment experience (Murzin A.G. Structure
Classification-Based Assessment of CASP3 Predictions for the Fold Recognition
Targets. Proteins Suppl. 3:88-108, 1999). To ensure the detection of (partly)
correct predictions by both sequence-dependent and sequence-independent
numerical evaluation procedures, each of our predictions was composed of the
regions of confident structure and alignment, the regions of confident
structure but tentative alignment, and the regions of tentative structure. The
3D coordinates for most of the target atoms were the best way to represent
this structural mosaic in a single format. As one of us was strongly opposed to
the NONE prediction, this option was not used. Therefore, in the absence of
predicted homologous structure, we either built a 3D model of our prediction
‘ab initio’, or had it dropped. Only one model was submitted for each of the
completed predictions. Apart from the two targets whose structures were known
to us before they were submitted to CASP4, we did not attempt the large,
presumably multi-domain targets without apparent domain boundaries. Because of
time limitations, we also ignored late comparative modelling targets including
all but one of the predicted members of the P-loop hydrolase superfamily. Due
to the presence of characteristic P-loop motifs in their sequences, their
homology recognition seemed straightforward, and the actual challenge was the
alignment. All other targets were attempted but six or so of them were dropped
eventually. In total, we submitted predictions for 21 targets. These include
four Comparative Modelling targets, T0090, T0092, T0093(!) and T0103; ten
Distant Homology Recognition targets, T0088, T0096_1, T0098, T0100, T0101,
T0104, T0108, T0109, T0118 and T0121_2; three targets with predicted known
folds (there may or may not be a distant homology), T0095, T0102 and T0114;
and four targets with predicted (probably) novel folds, T0086, T0091, T0094
and T0110.
Many of the Distant Homology Recognition predictions were based on the result
of previous analysis of SCOP superfamilies, for example the pectate lyase
beta-helix fold of T0100 and T0101 (Chothia C. and Murzin A.G. New folds for
all-beta proteins. Structure 1, 217-222, 1993). There were several cases of
déjà vu. T0108 had the same characteristic feature as the CASP2 target T0038
and was modelled on the experimental structure of the latter. In T0121_2,
there was the OB-fold signature similar to one we derived for the prediction
of T0004. For the fold prediction of T0102, we used the same pseudo ‘ab
initio’ approach as we used for the CASP2 target T0042. Incidentally, the
predicted fold of T0102 was found to be similar to the experimental fold of
T0042. In T0086, there was a probable tandem repeat of two
(alpha)-alpha-beta-beta-beta motifs, detected by the analysis of its extended
sequence family, analogous to the approach that detected the internal
duplication in T0002_2. Similarly, a tandem repeat of two
beta-alpha-beta-alpha-beta motifs was detected in the extended T0094 sequence
family. Unlike T0002_2, there was no SCOP superfamily assigned for either
T0086 or T0094. Both target structures were modelled ‘ab initio’.
One of our CASP2 techniques, not credited properly at the time because it had
been used only for the late target T0026, was in great use through most of our
CASP4 predictions. For almost every target predicted to belong to a large
superfamily with many known structures, a composite template structure was
assembled from different fragments of several superfamily structures
superimposed onto their common fold. It allowed the selection of the most
suitable parts from different structures. In particular, the predicted
structure of the P-loop hydrolase T0104 was assembled from the fragments of
several topologically distinct members of this very diverse superfamily to
generate a novel topological variant. For a number of our predictions, we
also created hybrid templates including fragments of non-homologous structures
to model the ‘missing’ parts in the parent structure or even to construct the
whole fold. Then we used Modeller to generate the 3D coordinates,
automatically sealing the gaps and fixing the stereochemistry of the joints.
Braun-UTMB , 223
number of submitted models: 104
OF STRUCTURE PREDICTION COMPETITION (CASP4)
V.S. Mathura, K.V. Soman, C.H. Schein, Y. Xu and
W. Braun
email: ksoman@nmr.utmb.edu
The Human Genome Project has revealed many proteins of unknown
function. Classification of these sequences can best be done by
accurate prediction of their structures, and concurrent assignment to
families of known function. We have developed a set of tools for
homology modeling of proteins(1,2), based on self-correcting distance
geometry (DIAMOD)(3,4,5), multiple sequence alignment (MASIA(6)) and
energy minimization (FANTOM(7)), that can be used even when the identity
to the target is very low(8) (30% or less(9)). CASP4 provided us with an
opportunity to evaluate our methods impartially and objectively. We
submitted a total of 100 models for 27 of the 43 targets, with 15 based
on sequence homology. Models for five targets were generated ab initio.
The rest used a combination of fold recognition with multiple alignment
to improve the sequence register between the target and selected
template.
Homology or comparative modeling (CM)
When a suitable template was identified in the Protein Data Bank for a
target, our comparative modeling procedure was to: (1) Align the target
sequence with one or more template sequences using the program CLUSTALW
or alignments suggested by the fold recognition servers (CAFASP) with
minimal manual adjustment; (2) extract distance and dihedral constraints
with our in-house program EXDIS; (3) build initial models with DIAMOD;
and (4) energy minimize using the FANTOM program. For T90, a consensus
alignment was prepared manually from the 3D-PSSM, BIOINBGU, FUGUE,
GENTHREADER, and Karplus HMM98 and SAM99 results. FANTOM energy
contributions and exposed apolar surface areas calculated with the
program GETAREA were used for ranking multiple models for the same
target. Where information was available for important residues in the
template, such as those within the active site or areas of substrate
binding, we compared their location visually in the model structure.
Fold recognition (FR)
When there was not high enough sequence homology with any protein of
known structure, threading (fold recognition) was attempted, using the
web servers mentioned above and others (PSI-BLAST, 123D and FFAS). Where
several methods suggested the same template, a consensus alignment was
prepared manually. Manual corrections/adjustments were also used to
ensure that secondary structures and active site or other critical
residues were aligned. For T91, an alignment from 3D-PSSM was manually
edited to improve the sequence alignment. We also used multiple sequence
alignment of protein families where a fold seemed clear cut. For
example, fold recognition identified T88 as a probable Greek key fold
and selected yeast killer toxin (1wkt) as a template. Another template
structure, 1A45, that more closely resembled T88, was selected from a
multiple alignment with 57 beta/gamma-crystallins. The indicated gapping
pattern from the multiple alignment was used to generate a model.
Ab initio modeling
When a suitable template could not be identified based on homology or
threading, but there were clear indications of conserved secondary
structure elements based on sequence alignments with related proteins,
we prepared ab initio models. The steps for generating ab initio models
for T88, T91, T97, T104, and T106 were: (1) Predict secondary structures
and exposed/buried residues of the protein from aligned sequences with
JPRED and MASIA; (2) convert this information into distance and dihedral
angle constraints using the program TRANSLATE; (3) add other constraints
derived from any available experimental data for the protein; (4) build
models from constraints with DIAMOD; (5) refine initial models by energy
minimization with FANTOM. We also submitted models based on fold recognition
methods for T88 and T91.
Ab initio constraints were used in several other models where
appropriate. Disulfide bond constraints were added during the modeling
of T123 and T125. In another example, for T86, a monomer, a trimeric
template of very low identity was identified based on functional
similarity and conservation of key active site residues. A multiple
alignment with target homologs was used to place probable gaps between
the template and target sequences and constraints were extracted from
the template according to our usual methods. Ab initio constraints were
added at the C-terminal to replace inter-subunit contacts present in the
trimer.
Multiple alignments help in FR and CM
We combined these techniques in preparing alignments where the identity
between the target and template was very low (such as T86 and T88), when
the target had a clear sequence relationship to several templates, or
when several sequences related to the target were known. For T101, which
had about the same degree of sequence identity/similarity to 6 known
protein structures (12-18%), a CLUSTALW multiple alignment of related
proteins of the pectate/pectin lyase family was used. This agreed with
the DALI alignment of the pectate lyases but not of a structurally
related protein, chondroitinase (1DBG). We made models based on the B.
subtilis pectate lyase (1BN8) using the multiple alignment to adjust
gapping. Other models were based on the fold recognition results for
1DBG (where there was no real consensus for most of the protein).
In keeping with our efforts to use genomic data efficiently in modeling,
we used the homologous sequences available for templates or targets to
improve the alignment. For T118, PDB-BLAST detected similarity of the
C-terminus with 1DDQ-A. The 1DDQ-A sequence and related bacterial and
fungal polymerase alpha-factors were aligned with T118 to obtain the
gapping used in the submitted alignment. PDB-BLAST also recognized a weak
pattern of identity between T126 and 1DMS and 1EG9. Individual multiple
alignments of T126 with other olfactory factors and these templates were
used to generate the alignments submitted.
1 Soman, K.V., Midoro-Horiuti, T., Ferreon, J.C., Goldblum, R.M.,
Brooks, E.G., Kurosky, A., Braun, W. and Schein, C.H. (2000) Biophysical
Journal 79:1601-1609
2 Soman, K.V., Schein, C.H., Zhu, H. and Braun, W.A. (2000) Homology
Modeling and Simulations of Nuclease Structures. In Methods in Molecular
Biology (Humana Press, Totowa, N.J.; editor C.H. Schein) 160(in press
for December, 2000).
3 Zhu, H., Schein,C.H. and Braun,W. (1999). J. Mol. Modeling, 5,302-316.
4 Mumenthaler, Ch. and Braun, W. (1995) Protein Science 4, 863-871
5 Zhu, H. and Braun, W. Protein Sci. 1999, 8, 326-342
6 Zhu, H., Schein,C.H. and Braun,W. (2000) MASIA: a program to recognize
common patterns and properties in multiple aligned protein sequences.
Bioinformatics 16: in press
7 Fraczkiewicz, R. and Braun, W. (1998) J. Comp. Chem. 19, 319-333.
8 Mumenthaler, Ch., Schneider, U., Buchholz, Ch.J., Koller, D., Braun,
W. and Cattaneo, R.(1997 ).Protein Sci 6, 588-597.
9 Buchholz, C.J., Koller, D., Devaux, P., Mumenthaler, Ch.,
Schneider-Shaulis, J., Braun, W., Gerlier, D. and Cattaneo, R. (1997).
J. Biol. Chem. 272, 22072-22079
Levitt , 012
number of submitted models: 180
Michael Levitt
email: michael.levitt@stanford.edu
The methods used for Comparative modeling and Fold-Recognition were the same
and what follows is the same in both abstracts. This work was greatly aided
by the availability of the output of all the 30 or so servers participating in
CAFASP on the CAFASP web site at http://cafasp.bioinfo.pl/target. In general
these results were available within hours of the target sequence announcement
and we never felt the need to consult the original servers in any way.
We first used the freeware program "wget" to download all the files for any
new targets. Then we parsed all these files using a large Perl script. This
script collected together the results from all the servers to give consensus
secondary structure predictions, consensus fold-recognition results and every
alignment produced. The script also converted all the proteins recognized by
the different servers into SCOP version 1.50 superfamily codes and then counted
how often the different codes occurred. Initially, we used the results for
over 20 servers but then found it more accurate to concentrate on eight that
seemed to perform most consistently. These were: ffas, foldfit, fugue,
genthreader, inbgu, mgenthreader, pdbblast, and target99. As may have been
expected, the groups behind each of these eight servers were generally the
experts who had done well in fold-recognition at previous CASP events (Godzik,
FFAS and PDB-Blast; Sternberg, foldfit or 3D-PSSM; Mizuguchi/Blundell, FUGUE;
Fischer, INBGU; Jones genTHREADER and mGenTHREADER; and Karplus, SAM-T99 or
target99). Unlike the CAFASP compilation released on the web by Danny Fischer
(http://www.cs.bgu.ac.il/~dfischer/CAFASP2/summaries/), no manual intervention
was used in parsing these raw results. For each target we produced a summary
file that listed:
(1) The fold recognition hits in decreasing order of significance with the PDB
entry name, the significance scores and the SCOP 1.50 ID. In some cases the
raw significance score given by the server was modified so that scores were on
the same scale (-100 for highest significance to small positive numbers for no
significance). For example:
T0099_ffas_hit_1 1bu1a -33.2 2.32.2
T0099_ffas_hit_2 1ark -30.7 2.32.2
(2) All the alignments produced by each method together with information on
the sequence match. For example:
T0099_ffas_al_2-a.mas_1ark 2.32.2 EFIAIYDYKAETEEDLTIKKGEKLEIIEK-EGDWWKAKAIGSGEIGYIPANYIAAA
T0099_ffas_al_2-b.sla_1ark 2.32.2 IFRAMYDYMAADADEVSFKDGDAIINVQAIDEGWMYGTVQRTGRTGMLPANYVEAI
T0099_ffas_al_2-x.par_1ark 2.32.2 nMAT=55, pID=28, nDEL=1, nINS=0, nCov=55/56, spaci=-99.000
(3) A Consensus summary allowing the fold to be recognized. For each SCOP
superfamily we collect the number of hits, the mean significance score, the
method and rank, the SCOP title and the PDB domain names with their SPACI
scores (Brenner, Koehl and Levitt, 2000). For example:
%T0099 4.77.1 -78.4 3 genthreader_1 mgenthreader_2 pdbblast_9
%T0099 (Alpha and beta (a+b),SH2-like,SH2 domain)
%T0099 1fmk 0.578, 2src 0.540,
%T0099 4.123.1 -59.9 6 genthreader_1 mgenthreader_2 pdbblast_6 pdbblast_7
%T0099 (Alpha and beta (a+b),Protein kinase-like (PK-like))
%T0099 1fmk 0.578, 2src 0.540, 1qcfa 0.431, 1ad5a 0.258,
%T0099 2.32.2 -45.9 60 ffas_1 ffas_2 ffas_3 ffas_4 ffas_5 ffas_6 ffas_7
%T0099 (All beta,SH3-like barrel,SH3-domain)
%T0099 1ckaa 0.665, 1fmk1 0.578, 2src 0.540,
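The grouping behind such a consensus summary can be sketched as follows. The
original work used a Perl script that is not shown in this abstract; the
Python below is only an illustration. It assumes hit lines in the four-column
layout of example (1) above (label, PDB entry, score, SCOP code), derives the
method and rank from the label, and orders superfamilies by mean score with
the most negative (most significant) first. All function and field names are
hypothetical.

```python
from collections import defaultdict


def consensus_by_superfamily(hit_lines):
    """Group fold-recognition hits by SCOP superfamily code.

    Each line is assumed to look like:
        T0099_ffas_hit_1 1bu1a -33.2 2.32.2
    """
    groups = defaultdict(list)
    for line in hit_lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # skip lines that do not match the assumed layout
        label, pdb_id, score, scop = fields
        parts = label.split("_")
        method, rank = parts[1], parts[-1]
        groups[scop].append((float(score), f"{method}_{rank}", pdb_id))

    summary = []
    for scop, hits in groups.items():
        scores = [s for s, _, _ in hits]
        summary.append({
            "scop": scop,
            "n_hits": len(hits),
            "mean_score": sum(scores) / len(scores),
            "methods": [m for _, m, _ in hits],
            "pdb": sorted({p for _, _, p in hits}),
        })
    # More negative scores are more significant, so sort ascending.
    summary.sort(key=lambda row: row["mean_score"])
    return summary
```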
For more complete results see our "private" site at
http://csb.stanford.edu/levitt/casp1234. During the CASP event, information
contained in that site was updated regularly by Levitt and shared with the
different CASP4 groups in my lab headed by Samudrala, Xia, Fain and Koehl
respectively. This is the only information that was shared. Each group then
went on to make their own comparative models (Samudrala, Koehl and Levitt)
and/or ab initio models (Fain, Levitt, Samudrala, and Xia). There was no
comparison of models, as each individual preferred to use CASP as an
opportunity to perfect their methods rather than to "win" CASP.
Overall we felt very confident (perhaps wrongly so) about recognizing an
appropriate template in the comparative modeling and fold recognition parts of
CASP4. We considered 17 targets to be Comparative Modeling targets (T0089,
T0090, T0092, T0099, T0101, T0103, T0111, T0112, T0113, T0117, T0119, T0121,
T0122, T0123, T0125, T0127, T0128) and did them all. Of the remaining 26
targets, we considered 18 to be Fold-Recognition targets and 8 to be Ab Initio
targets. For those targets that we considered to be fold-recognition targets,
9 were considered easy as there was very clear sequence similarity (T0087,
T0088, T0093, T0096, T0098, T0100, T0104, T0109, T0116), and 7 were considered
difficult and could not have been done without the consensus use of the
servers participating in CAFASP (T0094, T0095, T0107, T0108, T0115, T0118,
T0126), and 2 were considered to have no recognizable fold (T0120, T0124).
They were also too large for ab initio modeling so no results were submitted
for these. In all cases we submitted all-atom models.
In the predictions done by the Levitt group, all the alignments for targets
submitted after 15 August were re-aligned using the structure of the template
to modify normal dynamic programming. This was done as follows: (a) The cost
of deleting residues from the template was proportional to the distance across
the gap in three-dimensions (measured between the CA atoms adjacent to the
gap). (b) The cost of inserting residues depended on how buried the residues
adjacent to the insertion were. (c) Buried residues were given greater weight
in the scoring. Each of these measures has an associated weight. Not having
time to optimize these weights on known structural alignments, we
used 25 combinations of parameters and generated alignments for every one.
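Measures (a) and (b) can be illustrated with the minimal Python sketch below.
Only the proportionality of the deletion cost to the CA-CA distance across
the gap comes from the text; the functional form of the insertion cost, the
burial scale (0 for exposed, 1 for fully buried) and the default weights are
assumptions made for this example.

```python
import math


def deletion_cost(ca_before, ca_after, weight=1.0):
    """Cost of deleting template residues between two flanking CA atoms.

    ca_before and ca_after are (x, y, z) coordinates of the CA atoms
    adjacent to the gap; the cost is proportional to their distance.
    """
    return weight * math.dist(ca_before, ca_after)


def insertion_cost(burial_before, burial_after, weight=1.0):
    """Cost of inserting residues between two template positions.

    Uses the mean burial of the flanking residues (assumed form), so
    insertions next to buried residues are penalized more heavily.
    """
    return weight * 0.5 * (burial_before + burial_after)
```

Under these assumptions, closing a gap whose flanking CA atoms are 5 units
apart costs five times the deletion weight, and an insertion between two
exposed residues costs nothing.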
All the alignments, whether taken from CAFASP before 15 August or re-aligned as
described above, were then used with our well-established automatic modeling
methods, SegMod and Encad, to generate stereochemically acceptable all-atom
models for each alignment (see Levitt, M. Accurate Modelling of Protein
Conformation by Automatic Segment Matching. J. Mol. Biol. 226, 507-533 (1992)
and Levitt, M. Energy Refinement of Hen Egg-White Lysozyme. J. Mol. Biol. 82,
393-420 (1974)).
Finally the best models were selected as follows. Use the rapdf probability
score (Samudrala, R & Moult, J. An All-atom Distance-dependent Conditional
Probability Discriminatory Function for Protein Structure Prediction. J. Mol.
Biol., 275: 893-914, (1998)) to choose the best 1000 models (if there are that
many). Cluster all these 1000 or fewer models into 10 clusters (using
bottom-up hierarchical clustering based on inter-structure CA coordinate RMS
deviation). For each model we use the rapdf score, Samudrala's HCF
hydrophobic compactness score, Keasar's surface energy, and the number of
hydrogen bonds to rank the conformations in each cluster. Finally choose the
five lowest energy models never including more than one model from a given
cluster. Occasionally manual intervention was used in deciding the rank of the
models in the official submission to CASP. For this we viewed the models to
judge general protein like shape and also used the coverage. For example, a
model with a less favorable energy score may be ranked above a model with
better score if the first model covered more of the target sequence