003 , Gerloff
012 , Levitt
017 , Yang-Ansuei
022 , InforMax
023 , Jones
028 , Ram-Samudrala
032 , Wolynes
042 , Honig-Barry
044 , Walts-Wondrous-Wizards
047 , kitasato-univ.
058 , Harrison-Weber
065 , Torda-Andrew
088 , ORNL-PROSPECT
090 , Hogue-Feldman
095 , blundell-tl
126 , Sternberg
133 , CBC-FOLD
155 , TUDELFT
169 , Dunbrack
173 , Barton
186 , SDSC1
187 , SDSC2:Reddy-Bourne
197 , Godzik
218 , LAMBERT-Christophe
223 , Braun-UTMB
237 , Sali-Andrej
241 , Vajda
255 , BinToHes
273 , WXW
330 , Zemla-Joanna
342 , SBI-AT
354 , baker
363 , Moult
381 , SBfold
382 , SBauto
384 , Murzin
389 , 123D+
406 , VENCLOVAS
414 , Friesner
429 , CHEN-WENDY
444 , MOE-CCG
447 , MSI
447 , MSI
457 , SBI-GR
465 , YASARA
482 , MSI-GA
486 , Shoshana-Wodak
501 , Hovmoeller-Zhou
520 , Scheib-Holger
526 , Ginalski
535 , shankari
Automatic generation of alignments for comparative modelling using THREADER Brunel University
Sequence to structure alignments without Australian National University
Homology modeling of CASP4 targets T0119 and T0123 GlaxoWellcome Experimental Research S.A.
Comparative modeling of protein tertiary structures based on secondary structure alignment WXW Info., Inc
Handling interconnected structural changes in comparative modelling of Stanford University
ESyPred3D: an Expert System for the Prediction of the protein 3D structure University of Namur, Belgium
COMPLETION AND REFINEMENT OF 3D HOMOLOGY MODELS WITH RESTRICTIVE Delft University of Technology
Comparative protein structure modeling by Modeller-6 The Rockefeller University
Sequence-structure alignment with 123D+ server. Ceres, Inc.
Sequence-structure alignment selection by 3D structure evaluation Lawrence Livermore National Laboratory
On the knowledge of comparative modeling The Scripps Research Institute
Comparative Modeling using FAMS -Full Automatic Modeling System Kitasato University
Model Building by Comparison: Selecting and Improving Algorithms via Imperial Cancer Research Fund
Comparative Modelling using Maps of Conformational Space Samuel Lunenfeld Research Institute, Mount Sinai Hospital
A Homology Modeling Algorithm for Protein Tertiary Structure Prediction InforMax Inc.
Comparative Modeling of Selected CASP4 Target Proteins Department of Biophysics, Institute of Experimental Physics, University of Warsaw
Ab initio loop modeling with precalculated synthetic loops and sidechain placement Informatik V, Uni Mannheim
Comparative modeling with PSI-BLAST, Modeller, and SCWRL Fox Chase Cancer Center
Playing protein fold charades Ceres, Inc
FFAS+ server for homology modeling The Burnham Institute
Randomized and Multiple Model Approaches to Homology Modeling and Ab Initio Modeling. TJU
PrISM: Protein Informatics System for modeling Columbia University, Dept of Pharmacology
Comparative modelling incorporating structural Department of Biochemistry, Cambridge, UK.
Homology modeling method for CASP4 Universite Libre des Bruxelles
Structure prediction using sequence profiles and predicted secondary structure SBI-AT
Distant Homology Recognition and Fold Prediction by a knowledge-based approach using SCOP and Pfam Centre for Protein Engineering, Cambridge, UK
Comparative Modeling by Building Many Alternative All-Atom Models Stanford University
Comparative Modeling using a Combination of Threading and Restrained Energy Minimization Columbia University
Torsion angles for protein folding prediction Structural Chemistry Stockholm University
Comparative modeling with multiple models Boston University
Methods for Comparative Modeling for group SBAuto SmithKline Beecham Pharmaceuticals
Methods for Comparative Modeling for group SBFold SmithKline Beecham Pharmaceuticals
A Prediction Experience in CASP4 Using the PROSPECT Prediction Program Oak Ridge National Laboratory
3-D MODELING OF PROTEIN TARGETS FOR THE CRITICAL ASSESSMENT OF STRUCTURE University of Texas Medical Branch
Comparative Modeling using Rosetta University of Washington
Intelligent approach to structure prediction Columbia University
Modeling proteins with a transgenic algorithm Center of Molecular and Biomolecular Informatics
Protein modeling based on homology and other methods Structural Bioinformatics Inc
Comparative Modeling Using Funneled Energy Functions University of Illinois - Urbana-Champaign
Automated Homology Modeling Methods I. Protein modeling Molecular Simulations Inc.
Homology-Based Model Building of Proteins. Structural Bioinformatics Inc.
Automated Homology Modeling Methods II. Protein modeling Molecular Simulations Inc.
Structure Prediction by Methods of Comparative Modeling and Fold Recognition IBM T. J. Watson Research Center
Multiple Structure Alignment and Homology Modelling Chemical Computing Group Inc
Identification of template structures and EMBL - European Bioinformatics Institute
Improving alignments in HM protocol with intermediate sequences San Diego Supercomputer Center
Comparative modeling using new tools: MAPA and LGA independent participant
Incorporation of human-derived constraints from active/functional site models in protein tertiary structure assembly University of Edinburgh
Procedure used for Modeling CASP4 targets using CE defined similar structures. San Diego Supercomputer Center (SDSC-2)
Methods for Comparative Modeling and Fold Recognition Center for Advanced Research in Biotechnology (CASP)
Homology Modelling of the Alpha Subunit of Pyrococcus furiosus Tryptophan Synthase Molecular Simulations
Jones , 023
number of submitted models: 56
David T. Jones
email: David.Jones@brunel.ac.uk
In overview, our comparative modelling procedure involves the use of
THREADER3 (in comparative modelling mode) to generate an alignment
between the target sequence and a single selected template structure. This
alignment is then fed into the MODELLER4 program (A. Sali and T.
Blundell, J.Mol.Biol. 234, 779, 1993) for the final modelling stage. In
detail, the procedure is as follows:
1. Template selection. The GenTHREADER & mGenTHREADER (D.T.
Jones, J. Mol. Biol. 287, 797-815, 1999) results stored on the CAFASP2
server were searched for the template structure which produced the
highest score. Where the confidence of the best match was neither
HIGH nor CERT, the target was classed as a fold recognition target and
was not processed further in this category.
2. Alignment generation. PSIPRED (D.T. Jones, J. Mol. Biol. 292, 195-
202, 1999) secondary structure predictions were generated and
THREADER3 was used with both sequence and secondary structure
weighting. Alignments were generated with different sequence similarity
weights (-S option) in the range 50-400. Depending on the degree of
sequence similarity reported, 1-3 alignments were selected on the
basis of threading energy Z-scores (i.e. only 1 was selected if the
sequence identity was > 50% and 3 were selected if it was < 35%).
3. Each of the alignments (or just one in the case of trivial homologues)
was then fed into the MODELLER4 program and the final structures
were evaluated in terms of overall threading energy Z-scores using 100 sequence shuffles.
The model with the highest combined pairwise/solvation energy Z-score was submitted.
Although no significant human intervention was used at any
point, some hand editing of the input files was sometimes
needed in order to get MODELLER4 to run (chain labelling issues for example).
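The selection rules in steps 2 and 3 can be sketched as follows. This is a
minimal illustration, not the THREADER3 implementation: energy_fn and
toy_energy are hypothetical stand-ins for the threading energy, the choice of
2 alignments for identities between 35% and 50% is an assumption (the text
gives only the two extremes), and the sign convention (a higher Z-score is
better) is assumed.

```python
import random
import statistics


def n_alignments_to_keep(seq_identity_pct):
    """Number of alignments passed on to model building.

    The abstract states 1 for >50% identity and 3 for <35%;
    the value 2 for the range in between is an assumption.
    """
    if seq_identity_pct > 50:
        return 1
    if seq_identity_pct < 35:
        return 3
    return 2


def shuffle_z_score(sequence, energy_fn, n_shuffles=100, seed=0):
    """Z-score of the real sequence against shuffled sequences.

    energy_fn is a hypothetical callable (sequence -> energy, lower is
    better). A positive Z-score means the real sequence scores better
    than the shuffled background.
    """
    rng = random.Random(seed)
    native = energy_fn(sequence)
    background = []
    for _ in range(n_shuffles):
        residues = list(sequence)
        rng.shuffle(residues)
        background.append(energy_fn("".join(residues)))
    spread = statistics.pstdev(background)
    if spread == 0:
        return 0.0
    return (statistics.mean(background) - native) / spread


def toy_energy(seq, reference="ACDEFGHIKL"):
    # Toy energy: number of positions that differ from a fixed reference.
    return sum(a != b for a, b in zip(seq, reference))
```

With the toy energy, shuffle_z_score("ACDEFGHIKL", toy_energy) is positive,
because the unshuffled sequence has the lowest possible energy. In the real
procedure the model with the highest such Z-score would be the one submitted.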
Torda-Andrew , 065
number of submitted models: 93
Boltzmann based force fields
Abraham, M, Ayers, DJ, Dosztanyi, Huber, T,
Procter, JB, Russell, AJ and Torda, AE
email: Andrew.Torda@anu.edu.au
Alignments were calculated and models ranked using the sausage
program [1]. Sidechains were fitted using a self-consistent
mean-field method [2].
Three force fields were used in three different steps:
1. Sequence to structure alignments used a score function
which used the identity of only one interaction partner
[5]. This allowed us to use the Gotoh method [4] for speed,
while avoiding the frozen approximation or double dynamic
programming.
2. Ranking of models used a z-score optimised force field [3]
3. Fed by unbounded optimism or perhaps pure faith,
side-chains were placed on the models using a more
conventional, physically based, molecular mechanics style
force field.
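The point made in step 1 can be illustrated with a short sketch. When the
score for placing residue i on template site j depends only on that one
pairing, an ordinary affine-gap dynamic programme in the style of Gotoh [4]
is exact. The gap penalties and the identity-based score below are
hypothetical placeholders and are not the sausage force field.

```python
def gotoh_score(n_res, n_sites, score, gap_open=3.0, gap_extend=1.0):
    """Global alignment score with affine gap penalties.

    score(i, j) is the score for placing residue i on template site j.
    Because it depends on this single pairing only, plain dynamic
    programming gives the optimum.
    """
    neg = float("-inf")
    # m: residue i aligned to site j
    # x: residue i opposite a gap; y: site j opposite a gap
    m = [[neg] * (n_sites + 1) for _ in range(n_res + 1)]
    x = [[neg] * (n_sites + 1) for _ in range(n_res + 1)]
    y = [[neg] * (n_sites + 1) for _ in range(n_res + 1)]
    m[0][0] = 0.0
    for i in range(1, n_res + 1):
        x[i][0] = -(gap_open + (i - 1) * gap_extend)
    for j in range(1, n_sites + 1):
        y[0][j] = -(gap_open + (j - 1) * gap_extend)
    for i in range(1, n_res + 1):
        for j in range(1, n_sites + 1):
            m[i][j] = score(i - 1, j - 1) + max(
                m[i - 1][j - 1], x[i - 1][j - 1], y[i - 1][j - 1])
            x[i][j] = max(m[i - 1][j] - gap_open,
                          x[i - 1][j] - gap_extend,
                          y[i - 1][j] - gap_open)
            y[i][j] = max(m[i][j - 1] - gap_open,
                          y[i][j - 1] - gap_extend,
                          x[i][j - 1] - gap_open)
    return max(m[n_res][n_sites], x[n_res][n_sites], y[n_res][n_sites])


def identity_score(seq, template, match=2.0, mismatch=-1.0):
    # Hypothetical pair score used only for this illustration.
    return lambda i, j: match if seq[i] == template[j] else mismatch
```

For example, gotoh_score(3, 2, identity_score("ACD", "AD")) aligns A and D
and leaves C opposite a gap. The run time grows with the product of the two
lengths, which is what makes this step fast.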
The first two force fields may be knowledge-based, but they
were built in complete ignorance of Boltzmann
statistics. Instead, the parameters are optimised so as to
distinguish native coordinates from a mass of misfolded
structures.
A second series of optimisation calculations allowed us to
find weights for additional terms for secondary structure
predictions [6], sequence similarity and gap penalties.
Finally, the library of templates consisted not of simple
protein coordinates, but rather of precalculated fields due to
averaging over similar structures.
The alignment code and methodology is indisputably fast. It
may occasionally be correct.
For the last few targets, secondary structure predictions were
made using a neural net fed on the sausage alignment
calculations.
-------------------
[1] Huber T, Russell AJ, Ayers D, Torda AE (1999)
Bioinformatics, 15, 1064-1065.
Sausage: protein threading with flexible force fields.
[2] Huber T, Torda AE, van Gunsteren WF (1996), Biopolymers,
39, 103-114.
Optimization methods for conformational sampling using a
Boltzmann-weighted mean field approach.
[3] Huber, T and Torda, AE (1999) Protein Sci, 7, 142-149.
Protein fold recognition without Boltzmann statistics or
explicit physical basis.
[4] Gotoh, O. (1982) J Mol Biol, 162, 705-708.
An improved algorithm for matching biological sequences.
[5] Huber T, Torda AE (1998) J Comput Chem, 15, 1455-1467.
Protein sequence threading, the alignment problem, and a
two-step strategy.
[6] Rost B and Sander C. (1993) J Mol Biol, 232, 584-599.
Prediction of protein secondary structure at better than 70%
accuracy.
Scheib-Holger , 520
number of submitted models: 2
Holger Scheib
email: hys14462@glaxowellcome.co.uk
The target sequence for T0119 (Benzoate Dioxygenase Reductase) was found in
SwissProt (Accession Code P07771). Initially, the SwissModel First Approach
Mode (http://www.expasy.ch/swissmod/SWISS-MODEL.html) without preselected
template files was applied with the resulting model structure containing 2
domains with a missing linker sequence from K103 to A114. Also, the optional
WhatCheck and Predict Protein reports were obtained as well as results from
3D-PSSM prediction. In the 3D model structure, domain 1 consists of a 2Fe-2S
ferredoxin and domain 2 of a ferredoxin-NAD reductase. Domain 2 can be further
subdivided into 2 subdomains. The SwissModel result was analyzed by comparing
the model structure to the WhatCheck report and PredictProtein results. Using
PredictProtein, the 3D-structure was checked for secondary structure and
accessibility. Although the overall model seemed to be reasonable, the side
chain positions of the following residues were manually modified to reduce
steric repulsion or to increase hydrophobic or polar contacts:
Domain 1: E24
Domain 2: F139, S186, K235, E279.
In domain 1, the loop between E67 and A75 was created applying the SwissModel
loop building tool. Finally, both domains were manually brought into close
proximity to suggest a putative 3D orientation of the connected domains.
For target T0123 (b-lactoglobulin), the target sequence was sent to SwissModel
(http://www.expasy.ch/swissmod/SWISS-MODEL.html). SwissModel First Approach
Mode was performed without preselected template files. The results retrieved
besides the 3D model structure included both the WhatCheck and PredictProtein
report, and 3D-PSSM prediction data.
The 3D structural model was built with 3BLG, 1BSOA, 1CJ5A, 1BST, and 2BLG as
template structures. The templates differed more in structure than in
sequence, since the template structures were solved at various
pH values. From the additional information coming along with the T0123 target,
one could extract that the crystallization was carried out at pH 3.2
indicating a closed conformation of the loop between residues 85 and 90.
Also, from the respective publication of Qin et al. in Biochemistry (see
reference below), one could conclude that pig b-lactoglobulin is most likely
monomeric.
On the basis of sequence identity to T0123, PredictProtein suggested the
template structures 1B0O, 1QAB, 1MUP, 1AQB, 1FEN, and 1WDC. Due to the
differences between template structures selected by SwissModel and
PredictProtein, the results for secondary structure prediction were ignored,
since there is no correlation at all.
The resulting alignment generated by SwissModel was manually corrected at the
C-terminus by shifting residues P151, A152, and Q153 two positions downstream
placing the two residue gap in the alignment (of the target sequence with the
template structure sequences) between L150 and P151. A loop was built using
L150 and C158 as anchor points by scanning the SPDBViewer loop database. Energies
of all loops found were calculated using the GROMOS force field by van
Gunsteren and coworkers. The lowest-energy loop was selected; it originates from
4GCR (1.47 Å resolution) and has the sequence pattern ITDDCPS. Its energy
was calculated to be 3542.2 kJ.
The following residues were found to clash either with other side chains or
the T0123 backbone:
P5, K35, S37, K40, R61, Q91, F93, L94, H102, L105, L126, V128, D130, I132,
R133, P146, P151, and E155.
The side chains of the model structure were then energy-minimized using
the simulated annealing algorithm implemented in SPDBViewer with the
following parameters:
Heating: 20 steps, initial T: 1000 K, final T: 1000 K
Annealing: 200 steps, initial T: 1000 K, final T: 300 K
Equilibration: 100 steps, initial T: 300 K, final T: 300 K
Random seed: 0.000
After applying the Simulated Annealing algorithm to the model structure, the
side chains of residues H102, L105, and R159 were manually modified, since
they still repulsively interacted with their environment.
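The three-phase temperature protocol listed above can be written out
explicitly. This is only a sketch of the schedule, not of the SPDBViewer
implementation: the step counts and temperatures are taken from the parameter
list, while the linear interpolation within each phase is an assumption.

```python
def linear_ramp(n_steps, t_start, t_end):
    """Temperatures for one phase, interpolated linearly (assumption)."""
    if n_steps == 1:
        return [float(t_start)]
    return [t_start + (t_end - t_start) * k / (n_steps - 1)
            for k in range(n_steps)]


def annealing_schedule():
    """Full schedule: heating, annealing, equilibration (temperatures in K)."""
    phases = [
        (20, 1000.0, 1000.0),   # heating
        (200, 1000.0, 300.0),   # annealing
        (100, 300.0, 300.0),    # equilibration
    ]
    temperatures = []
    for n_steps, t_start, t_end in phases:
        temperatures.extend(linear_ramp(n_steps, t_start, t_end))
    return temperatures
```

The schedule has 320 steps in total, stays at 1000 K during heating, and ends
at 300 K.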
From the resulting model, the following characteristics could be extracted
coinciding with literature data by Qin and coworkers:
1. the structure must be in closed conformation, since crystallization was
carried out in acidic milieu.
2. disulfide bonds occur between C66 and C158 with Ca-Ca distance of 6.04 Å
(as compared to 5.91 Å from literature) and C106 and C119 with Ca-Ca
distance of 3.90 Å (as compared to 3.83 Å), respectively.
3. a salt bridge is possible between V1 and E108
4. the putative substrate binding site consists of L10, T12,V15, A41 ,V43,
L46, L54, I56, L58, L71, A73, A80, F82, I84, L92, L94, L103, L105, M107
differing in 6 positions from bovine b-lactoglobulin but not in the four
highly conserved positions among lipocalins (L10, L54, L58, F82).
5. in the closed conformation, an H-bond is likely to occur between OE2 of
E89 which is the key residue in loop movement, and O of S116.
Reference for bovine b-lactoglobulin structure
Qin, B.Y., Bewley, M.C., Creamer, L.K., Baker, H.M., Baker, E.N. and Jameson,
G.B. (1998) "Structural basis of the Tanford transition of bovine
beta-lactoglobulin" Biochemistry 37:14014-14023.
References for SwissProt
Bairoch A. and Apweiler R. (2000) "The SWISS-PROT protein sequence database
and its supplement TrEMBL in 2000". Nucleic Acids Res. 28:45-48.
Bairoch A. and Apweiler R. (1997) "The SWISS-PROT protein sequence database:
its relevance to human molecular medical research". J. Mol. Med. 75:312-316.
Apweiler R., Gateau A., Contrino S., Martin M.J., Junker V., O'Donovan C.,
Lang F., Mitaritonna N., Kappus S. and Bairoch A. (1997) "Protein sequence
annotation in the genome era: the annotation concept of SWISS-PROT + TREMBL".
In: ISMB-97; Proceedings 5th International Conference on Intelligent Systems
for Molecular Biology, pp33-43, AAAI Press, Menlo Park, CA, USA.
Bairoch A. (1997) "Proteome databases". In: Proteome research: new frontiers
in functional genomics, Wilkins M.R., Williams K.L, Appel R.D., Hochstrasser
D.H, Eds., pp93-132, Springer Verlag, Heidelberg. ISBN: 3-540-62775-8.
Moller S., Leser U., Fleischmann W. and Apweiler R. (1999) "EDITtoTrEMBL: a
distributed approach to high-quality automated protein sequence annotation".
Bioinformatics 15:219-227.
Fleischmann W., Moller S., Gateau A. and Apweiler R. (1999) "A novel method
for automatic functional annotation of proteins". Bioinformatics 15:228-233.
O'Donovan C., Jesus Martin M., Glemet E., Codani J.J. and Apweiler R. (1999)
"Removing redundancy in SWISS-PROT and TrEMBL". Bioinformatics 15:258-259.
References for SwissModel
Peitsch MC (1995) "ProMod: automated knowledge-based protein modelling tool."
PDB Quarterly Newsletter 72:4.
Peitsch MC (1995) "Protein modelling by E-Mail". Bio/Technology 13:658-660.
Peitsch MC (1996) "ProMod and Swiss-Model: Internet-based tools for automated
comparative protein modelling". Biochem. Soc. Trans. 24:274-279.
Peitsch MC and Guex N (1997) "Large-scale comparative protein modelling". In:
Proteome research: new frontiers in functional genomics, p 177-186, Wilkins
MR, Williams KL, Appel RO, Hochstrasser DF eds., Springer Verlag, Heidelberg.
ISBN: 3-540-62775-8.
Guex N and Peitsch MC (1997) "SWISS-MODEL and the Swiss-PdbViewer: An
environment for comparative protein modelling". Electrophoresis 18:2714-2723.
Guex N and Peitsch MC (1999) "Molecular modelling of proteins". Immunology News 6:132-134.
Guex N, Diemand A and Peitsch MC (1999) "Protein modelling for all". TiBS 24:364-367.
References for Swiss-PDBViewer
Guex N, Diemand A and Peitsch MC (1999) "Protein modelling for all". TiBS 24:364-367.
Guex, N. and Peitsch, M.C. (1997) "SWISS-MODEL and the Swiss-PdbViewer: An
environment for comparative protein modeling". Electrophoresis 18, 2714-2723.
Guex, N. and Peitsch, M.C. (1996) "Swiss-PdbViewer: A Fast and Easy-to-use PDB
Viewer for Macintosh and PC". Protein Data Bank Quarterly Newsletter 77, 7.
Guex, N.(1996) "Swiss-PdbViewer: A new fast and easy to use PDB viewer for the
Macintosh". Experientia 52, A26.
Reference for WhatCheck
Vriend G. http://www.sander.embl-heidelberg.de/whatcheck/
References for Predict Protein
Rost, B. (1996) "PHD: predicting one-dimensional protein structure by profile
based neural networks". Methods Enzymol., 266:525-539.
Rost, B. and Sander, C. (1993) "Prediction of protein secondary structure at
better than 70% accuracy". J. Mol. Biol., 232:584-599.
Rost, B. and Sander, C. (1994) "Combining evolutionary information and neural
networks to predict protein secondary structure". Proteins, 19:55-77.
Rost, B. and Sander, C. (1994) "Conservation and prediction of solvent
accessibility in protein families". Proteins, 20:216-226.
Reference for 3D-PSSM
Kelley LA, MacCallum RM & Sternberg MJE (2000) "Enhanced Genome Annotation
using Structural Profiles in the Program 3D-PSSM". J. Mol. Biol. 299(2),
501-522.
Reference for Gromos
van Gunsteren, W.F. and Berendsen, H.J.C. (1987) "Groningen molecular
simulation (GROMOS) library manual". Groningen. biomos.
WXW , 273
number of submitted models: 139
Xiongwu Wu
email: wxw@giccs.georgetown.edu
The target protein sequence is aligned with the protein secondary structural
segments taken from the proteins in the selected protein list provided
by Dr. Uwe Hobohm. Secondary structures are derived for each residue from the
best-aligned sequences.
Based on the secondary structures derived by the above approach, the target
sequence is aligned to the selected proteins based on secondary structural
identity, and the best match is used as the template for comparative modeling.
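The template choice described above can be sketched in a few lines. This is a
hedged illustration only: it assumes the secondary structure strings (H, E, C)
are already aligned and of equal length, it ignores gaps, and the function and
template names are hypothetical.

```python
def ss_identity(predicted, template):
    """Fraction of aligned positions with the same secondary structure state."""
    if len(predicted) != len(template):
        raise ValueError("strings must be aligned to the same length")
    if not predicted:
        return 0.0
    return sum(a == b for a, b in zip(predicted, template)) / len(predicted)


def pick_template(predicted, candidates):
    """Return the name of the candidate with the highest identity.

    candidates maps a template name to its aligned secondary
    structure string.
    """
    return max(candidates,
               key=lambda name: ss_identity(predicted, candidates[name]))
```

For the predicted string "HHHHCCEEEE", a candidate "HHHHCCEEEC" (identity 0.9)
would be preferred over "CCCCCCEEEE" (identity 0.6).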
Ram-Samudrala , 028
number of submitted models: 207
proteins using a statistical scoring function, graph theory, and exhaustive enumeration techniques
Ram Samudrala and Michael Levitt
email: ram@csb.stanford.edu
The interconnected nature of interactions in protein structures,
thorough sampling of side chain and main chain conformations, and
devising a discriminatory function that can distinguish between
correct and incorrect conformations are the major hurdles preventing
the construction of accurate homology models. We present an algorithm
that uses graph theory to handle the problem of
interconnectedness. Sampling of side chain and main chain
conformations is accomplished by exhaustively enumerating all possible
choices using a discrete state model, including fragments from a
database of protein structures. The optimal combination of these
possibilities is selected using an all-atom scoring function aided by
the graph-theoretic approach.
Following is a brief description of the components and steps of this
method, which can be divided into: discriminatory function,
identification of template and generation of alignment, initial model
building, construction of variable main chain and side chain regions,
and moving models closer to the native conformation.
0. DISCRIMINATORY FUNCTION: the function used throughout generally is
an all-atom distance-dependent conditional probability discriminatory
function based on a statistical analysis of known protein
structure. The negative log of the conditional probability of
observing two atoms interact given a particular distance is used as a
``pseudo-energy'' term. Reference: J Mol Biol 275: 893-914 (1998).
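A sketch of how such a pseudo-energy table might be derived from observed
counts is given here. It is an illustration under stated assumptions, not the
published implementation: the counts layout is hypothetical, and dividing by
the pooled distance distribution as a reference state is an assumption made
for this example.

```python
import math


def pseudo_energies(counts):
    """Convert contact counts into pseudo-energies.

    counts maps an atom-type pair to a list of observed counts, one per
    distance bin. For each pair and bin the score is
        -ln( P(bin | pair) / P(bin) )
    where P(bin) is pooled over all pairs (assumed reference state).
    Bins never observed for a pair get an infinite score.
    """
    n_bins = len(next(iter(counts.values())))
    bin_totals = [sum(c[b] for c in counts.values()) for b in range(n_bins)]
    grand_total = sum(bin_totals)
    scores = {}
    for pair, c in counts.items():
        pair_total = sum(c)
        row = []
        for b in range(n_bins):
            if c[b] == 0:
                row.append(float("inf"))
                continue
            p_cond = c[b] / pair_total
            p_ref = bin_totals[b] / grand_total
            row.append(-math.log(p_cond / p_ref))
        scores[pair] = row
    return scores
```

A distance bin that is over-represented for a given atom pair receives a
negative (favourable) score; the score of a whole model would be the sum over
all of its atom pairs.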
1. IDENTIFICATION OF TEMPLATE AND GENERATION OF ALIGNMENT: The CAFASP
meta-server data were used to identify the proteins that a given
target sequence was related to (based on a consensus of all the hits
produced by the different servers). The alignments generated by the
different servers were then used to construct initial models. The
initial models were then ranked by our discriminatory function and the
models that ranked highest were used for further model-building.
2. INITIAL MODEL BUILDING: Following the sequence alignment, for each
parent structure, an initial model was generated by copying atomic
coordinates for the main chain (excluding any insertions) and for the
side chains of residues that are identical in the target and parent
structures. Residues that differ in type were constructed using a
minimum perturbation technique. The MP method changes a given amino
acid to the target amino acid preserving the values of equivalent chi
angles between the two side chains, where available. The other chi
angles are constructed by the MP method using an internally developed
library based on residue type.
3. CONSTRUCTION OF VARIABLE MAIN CHAIN AND SIDE CHAIN REGIONS:
Main chain sampling is performed using an exhaustive enumeration
technique based on discrete states of phi/psi angles. For longer main
chain regions, we use fragments (3-tuples) from a database of protein
structures to generate the discrete phi/psi angles.
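Exhaustive enumeration over discrete phi/psi states can be sketched as below.
The three states and the toy scoring function are hypothetical and chosen
only for illustration; the actual state set and the all-atom scoring function
are those described in the cited references.

```python
import heapq
from itertools import product

# Hypothetical coarse phi/psi states in degrees (illustration only).
PHI_PSI_STATES = [(-60.0, -45.0), (-120.0, 130.0), (60.0, 45.0)]


def enumerate_backbones(n_residues, states=PHI_PSI_STATES):
    """Yield every assignment of one discrete state to each residue."""
    return product(states, repeat=n_residues)


def best_conformations(n_residues, score_fn, keep=5, states=PHI_PSI_STATES):
    """Keep the lowest-scoring conformations from the full enumeration."""
    return heapq.nsmallest(
        keep, enumerate_backbones(n_residues, states), key=score_fn)


def toy_score(conformation):
    # Toy score that simply favours phi angles close to -60 degrees.
    return sum(abs(phi + 60.0) for phi, _psi in conformation)
```

The number of conformations is the number of states raised to the power of
the segment length (81 for four residues and three states), which is why
longer segments are built from database fragments instead.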
Side chain possibilities are generated by selecting the most probable
side chain rotamers based on the interactions of a given rotamer with
the local main chain (evaluated using the discriminatory function
above). Reference: Samudrala R, Moult J. Prot. Eng. 11: 991-997,
1998.
We then use a graph-theoretic approach to assemble the sampled side
chain and main chain conformations together in a consistent manner.
Each possible conformation of a residue is represented using the
notion of a node in a graph. Each node is given a weight based on the
degree of the interaction between its side chain atoms and the local
main chain atoms. The weight is computed using an all-atom conditional
probability discriminatory function. Edges are then drawn between
pairs of residues/nodes that are consistent with each other (i.e.,
clash-free and satisfying geometrical constraints). The edges are also
weighted according to the probability of the interaction between atoms
in the two residues. Once the entire graph is constructed, all the
maximal sets of completely connected nodes (cliques) are found using a
clique-finding algorithm. The cliques with the best probabilities
represent the optimal combinations of mixing and matching between the
various possibilities, taking the respective environments into
account. Reference: J Mol Biol 279:287-302 (1998). Clique-finding is
accomplished using the Bron and Kerbosch algorithm. Reference:
Communications of the ACM, 16: 575-577 (1973).
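The graph-theoretic assembly can be illustrated with a small sketch. It uses
the basic Bron and Kerbosch recursion without pivoting. The data layout is
hypothetical: nodes are (residue, conformation) tuples, weights are treated as
pseudo-energies where lower is better, and edges exist only between
compatible conformations of different residues.

```python
def bron_kerbosch(r, p, x, neighbours, out):
    """Collect all maximal cliques (basic recursion, no pivoting)."""
    if not p and not x:
        out.append(set(r))
        return
    for v in list(p):
        bron_kerbosch(r | {v}, p & neighbours[v], x & neighbours[v],
                      neighbours, out)
        p = p - {v}
        x = x | {v}


def best_clique(node_weight, edge_weight, n_residues):
    """Return the lowest-scoring clique that covers every residue.

    node_weight maps node -> weight; edge_weight maps (node, node) -> weight.
    Raises ValueError if no clique covers all residues.
    """
    neighbours = {v: set() for v in node_weight}
    for a, b in edge_weight:
        neighbours[a].add(b)
        neighbours[b].add(a)
    cliques = []
    bron_kerbosch(set(), set(node_weight), set(), neighbours, cliques)
    # One conformation per residue, so a full solution has n_residues nodes.
    complete = [c for c in cliques if len(c) == n_residues]

    def total(clique):
        score = sum(node_weight[v] for v in clique)
        score += sum(w for (a, b), w in edge_weight.items()
                     if a in clique and b in clique)
        return score

    return min(complete, key=total)
```

In a three-residue toy case with two candidate conformations for residues 0
and 2, the clique with the lowest summed node and edge weights is returned,
and combinations containing a clash never appear because the clashing pair has
no edge.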
All models used were refined using ENCAD.
5. MOVING MODELS CLOSER TO THE NATIVE CONFORMATION:
Once we had generated a final model for each parent, we used
an off-lattice fourteen-state phi/psi model and a sequential
build-up algorithm to generate structures around the conformational
space of the final model. We then used our scoring function to select
the best ranking ones. The hope is that some of the sampled
conformations will be closer to the native conformation and that
our scoring function will be able to select them.
We test how the above approach works in a comparative-modelling
scenario and assess the predictive power of this method by applying it
to properly controlled blind tests as part of the fourth meeting on
the Critical Assessment of protein Structure Prediction methods
(CASP4). Compared to CASP2 and CASP3, where a similar approach was
used, we have improved the method used to sample main chains and have
made minor enhancements to the other components of this approach
including the scoring function. The biggest change is in our attempt
to move models closer to the final answer. It remains to be seen how
the improvements in methodology correlate with model accuracy.
LAMBERT-Christophe , 218
number of submitted models: 45
C. Lambert , N. Léonard, K. de Fays and E. Depiereux
email: christophe.lambert@fundp.ac.be
The aim of our work is to propose a reliable automatic method for homology
modeling, especially when the protein of interest shares a low percentage of
identities (20-30%) with the chosen template.
Our strategy consists of the usual steps for homology modeling: template
search in databanks, target-template alignment, and model building. Currently, our
method does not provide any assessment of the model.
For the template search, we used four iterations of
PSI-BLAST[1] on the non-redundant protein database (nr) of the NCBI. All
sequences having an expected value lower than 0.001 are included in the profile
building. The template is chosen as the sequence of known structure (PDB) that
has the lowest expected value. The search in the nr databank also gives us a
large number of similar sequences.
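The filtering and template choice described above might look as follows in
code. This sketch does not parse real PSI-BLAST output: the Hit record and
its fields are hypothetical, and only the 0.001 cutoff, the limit of 50 best
hits mentioned for the first sequence set, and the lowest-expected-value rule
come from the text.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    name: str
    e_value: float
    has_structure: bool  # True if the hit is a PDB entry


def profile_members(hits, cutoff=0.001):
    """Hits with an expected value below the cutoff."""
    return [h for h in hits if h.e_value < cutoff]


def best_hits(hits, n=50, cutoff=0.001):
    """The n best hits below the cutoff, lowest expected value first."""
    return sorted(profile_members(hits, cutoff), key=lambda h: h.e_value)[:n]


def choose_template(hits):
    """Hit of known structure with the lowest expected value, or None."""
    structures = [h for h in hits if h.has_structure]
    if not structures:
        return None
    return min(structures, key=lambda h: h.e_value)
```

Note that the most significant hit overall need not be the template; only
hits with a known structure are eligible.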
As far as possible, two sets of sequences are built. The first one contains
the 50 best hits below the expected value cutoff of 0.001. The second one
contains a subset of these sequences, obtained by dropping overly redundant ones. This
approach aims to create different conditions for running the multiple alignment
programs and to extract different consensus alignments, in order to raise the
confidence of the sequence-structure alignment.
The two sets are then submitted to five alignment programs: ClustalW[7],
Dialign2[5], Match-Box[3], Multalin[2] and PRRP [4]. A pairwise alignment
between the target and template sequences is extracted from each multiple
alignment and the final sequence-structure alignment is obtained from the
consensus between all the pairwise alignments including the one provided by
PSI-BLAST. A three-dimensional model is built using MODELLER[6] version 4 on
this final alignment.
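One simple way to form a consensus from several pairwise alignments is a
majority vote over aligned residue pairs, sketched here. The voting rule and
the 0.5 threshold are assumptions made for illustration; the text does not
specify how the consensus is computed.

```python
from collections import Counter


def aligned_pairs(target_row, template_row):
    """Set of (target_index, template_index) pairs from two gapped rows."""
    pairs = set()
    i = j = 0
    for a, b in zip(target_row, template_row):
        if a != "-" and b != "-":
            pairs.add((i, j))
        if a != "-":
            i += 1
        if b != "-":
            j += 1
    return pairs


def consensus_pairs(alignments, min_fraction=0.5):
    """Residue pairs found in more than min_fraction of the alignments.

    alignments is a list of (target_row, template_row) tuples.
    """
    votes = Counter()
    for target_row, template_row in alignments:
        votes.update(aligned_pairs(target_row, template_row))
    needed = min_fraction * len(alignments)
    return sorted(pair for pair, n in votes.items() if n > needed)
```

With a strict majority, any two retained pairs occur together in at least one
input alignment, so the consensus cannot contain mutually inconsistent pairs.
Positions without majority support are simply left unaligned.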
For the purpose of the CASP experiment, two other models were built:
- one from the rough sequence-structure alignment provided by PSI-BLAST[1]
- one from the consensus of all alignment methods except PSI-BLAST.
1. Altschul SF, et al. (1997). Nucleic Acids Research 25(17): 3389-3402
2. Corpet F (1988) Nucl. Acids Res. 16:10881-10890.
3. Depiereux E, et al. (1997). Comput. Appl. Biosci. 13(3): 249-256.
4. Gotoh O (1996) J. Mol. Biol. 264:823-838
5. Morgenstern, B. (1999). Bioinformatics 15(3): 211-8.
6. Sali A and Blundell TL (1993). Journal of Molecular Biology 234(3): 779-815.
7. Thompson JD, et al. (1994). Nucleic Acids Research 22(22): 4673-4680.
TUDELFT , 155
number of submitted models: 44
MOLECULAR DYNAMICS
Jaap A. Flohil and Simon W. de Leeuw
email: j.a.flohil@tn.tudelft.nl
A method is presented to refine models built by homology by the use of
restrictive molecular dynamics (MD) techniques. The basic idea behind this
method is that structure validation software is used to determine for each
residue the likelihood that it is correctly modeled. This information is used to
determine restraints in the MD simulation, which is used for model refinement.
Residues that are likely to be positioned correctly according to the validation
software should be strongly constrained or restrained in the MD simulation,
whereas residues likely to be positioned incorrectly should be left free.
The BLAST2P (Altschul et al, 1990; J. Mol. Biol. 215:403-410) server at the
EMBL was used to find a template that shows at least 50% sequence identity to the
target sequence.
After side-chain modeling, artifacts of the modeling process have been
detected by automated procedures based on the structure validation modules of
WHAT IF (Vriend, 1990; J. Mol. Graph. 8, 52-56).
If the alignment procedure indicated that insertions had to be made, glycines
were inserted consecutively by a shrink-insert-expand procedure
(similar to the procedure used for T0058, but now with multiple loop residues),
followed by an energy minimization after each inserted glycine. After expansion
of the final glycine, the formed loop was mutated into the target sequence,
and added to the selection of freely moving residues.
The initial model was created after an energy minimization (EM) of 100 steps
steepest descent with GROMACS (Berendsen et al, 1995; Comp. Phys. Comm. 95
pp. 43-56), to remove Van der Waals overlap and to adapt to the GROMOS96 force
field.
The quality of residues according to the validation software is used to
determine the strength of restraints in the 1 ns restrained MD simulation.
Depending on the magnitude of the modeling errors reported in
the model checks, and the position of modeled indels, residues were selected to move
freely.
After adding water and a short additional EM, a 10 ps run with position
restraints (1000 kJ mol-1 nm-2) on the protein was done to equilibrate the
water. A 1-5 ns refinement run was performed in explicit water, with harmonic
restraints of 10000 kJ mol-1 nm-2 on all heavy atomic coordinates. Every
picosecond, a frame in the MD trajectory was analyzed for the formation of
non-bonded backbone-backbone contacts between free and restrained groups, and
internal protein contacts. All frames were clustered into RMSD groups, and the
average frame of each cluster was selected for submission. The model priority
was given by the ranking of interatomic contacts. B-factors were calculated
from RMS fluctuations along the trajectory.
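B-factors can be obtained from positional fluctuations with the standard
isotropic relation B = (8 pi^2 / 3) <|dr|^2>. The sketch below applies it to
a toy trajectory. It assumes the frames are already superposed and stored as
plain coordinate tuples (a hypothetical layout), and it is not claimed to be
the exact procedure used for the submitted models.

```python
import math


def b_factors(trajectory):
    """Per-atom isotropic B-factors from a superposed trajectory.

    trajectory is a list of frames; each frame is a list of (x, y, z)
    tuples, one per atom. The result is in the squared unit of the
    input coordinates (coordinates in nm give nm^2; multiply by 100
    to convert to square angstroms).
    """
    n_frames = len(trajectory)
    n_atoms = len(trajectory[0])
    result = []
    for atom in range(n_atoms):
        mean = [sum(frame[atom][k] for frame in trajectory) / n_frames
                for k in range(3)]
        # Mean square fluctuation about the average position.
        msf = sum(
            sum((frame[atom][k] - mean[k]) ** 2 for k in range(3))
            for frame in trajectory
        ) / n_frames
        result.append(8.0 * math.pi ** 2 / 3.0 * msf)
    return result
```

An atom that does not move gets a B-factor of zero, while an atom with a mean
square fluctuation of 1 (in squared coordinate units) gets 8 pi^2 / 3, about
26.3.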
Protein fragments seem to be sufficiently compliant to find a native-like
state for incorrectly modeled fragments in a strongly constrained framework. At
present, model refinement with molecular dynamics generally leads to a model
that is less like the experimental structure (Levitt et al, 1999; Nature
Structural Biology, Vol. 6, no. 2, February 1999), but extrapolating from the
results, a significant improvement in model quality might be possible in the near
future.
Sali-Andrej , 237
number of submitted models: 13
Andras Fiser, Marc Marti-Renom, Ash Stuart, Andrej Sali
email: fisera@rockefeller.edu
Template structures were identified primarily by programs Psi-Blast
(1) and MODELLER (2). In the difficult cases, additional information
was obtained by threading programs including GenThreader (3) and
3D-PSSM (4). In general, several template combinations were explored,
finally selecting those template combinations that resulted in the
best model, as assessed by ProsaII (5).
Initial alignments were generated with programs Psi-Blast (1), ALIGN
(2), Align2D (2), ClustalW (6),and ALIGN4D (7). ALIGN2D and ALIGN4D
align profiles with profiles. In the difficult cases, multiple
sequence profiles were generated manually for the target and the
templates, and then aligned by ALIGN2D. In the case of several
comparable templates, the multiple sequence alignment derived from the
structural superposition of the templates guided the profile for the
template structures. For structural superposition, either MALIGN3D (2)
or CE (8) was used. Alignments were generally hand-edited. In
addition, a new approach that automatically combines the best few
alignments using a self-consistent field method was applied to
optimize the alignments (9). In cases of low template-target sequence
identity (~10-15%), predicted secondary
structure (JPRED (10)) was taken into account to refine the final
alignment.
The models were built by Modeller-6. The input to the program was the
alignment between the target sequence and template structure(s). The
output obtained without any user intervention was a model with all
non-hydrogen atoms. When possible, additional restraints from ligand
binding were included. Insertions and deletions that required a
refinement were modeled by an automated ab initio loop modeling method
(11).
All models were checked by PROCHECK (12) for proper stereochemistry
and by PROSA (5) for energetically favorable non-bonded contacts. When
unfavorable stereochemistry or non-bonded contacts were identified,
the loop modeling protocol (11) was used to refine the offending
segments.
(1) S.F. Altschul, T.L. Madden, A.A. Schaffer, J. Zhang, Z. Zhang, W. Miller and
D.J. Lipman (1997) Nucl. Acids Res., 25,3389-3402
(2) A. Sali and T. L. Blundell, (1993) J. Mol. Biol. 234,779-815
(3) Jones, D.T. (1999) J.Mol Biol. 287: 797-815
(4) Kelley L.A, MacCallum R.M. and Sternberg M.J.E (2000). J. Mol. Biol. 299,
501-522
(5) M. J. Sippl, (1993) Proteins, 17,355-362
(6) J.D. Thompson, D.G. Higgins and T.J. Gibson (1994) Nucl. Acids
Res., 22, 4673-4680
(7) M. Marti-Renom, M. S. Madhusudhan and A. Sali, in preparation
(8) Shindyalov I.N. and Bourne P.E. (1998) Prot. Eng., 11, 739-747.
(9) A. Fiser and A. Sali, in preparation
(10) Cuff J.A., Clamp M.E., Siddiqui A.S., Finlay M., Barton G.J. (1998)
Bioinformatics, 14, 892-893
(11) A. Fiser, R.K.G. Do and A. Sali (2000) Prot. Sci. 9, 1753-1773
(12) R.A. Laskowski, M.W. MacArthur, D.S. Moss and J.M. Thornton (1993) J. Appl.
Cryst., 26, 283-291
123D+ , 389
number of submitted models: 214
Nickolai N. Alexandrov
email: nicka@ceres-inc.com
The 123D+ server compares a target sequence with a set of protein domains from the ASTRAL
non-redundant set (version 1.50, 50% identity list). For every residue in the domain,
the following information is derived from the PDB files: (i) residue type (amino
acid in the SEQRES field), (ii) secondary structure, assigned by Stride, and (iii) the
number of contacts with other residues. Domain profiles are created by a psi-blast
run against the NR database. Similarly, a psi-blast profile is created for the target
sequence. The secondary structure of the target is predicted by a probabilistic approach
from statistics of amino acid pairs in a sliding window of 17 residues. The similarity
score between position i in the target and position j in the domain is computed as
log((Paa*Pss*Pcc)/(P'aa*P'ss*P'cc)), where Paa is the probability of having the same
amino acid in i and j, computed from the psi-blast profiles; Pss is the probability
of having the same secondary structure; and Pcc is the probability of having the same
number of contacts, computed from the contact capacity potentials for every
residue type. P'aa, P'ss, and P'cc are the corresponding expected probabilities.
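
The log-odds score above can be written directly as a small function. This is a minimal sketch: the function and argument names are hypothetical, and how the six probabilities are derived from profiles and contact capacity potentials is not shown.

```python
import math

def similarity_score(p_aa, p_ss, p_cc, e_aa, e_ss, e_cc):
    """Log-odds similarity between target position i and domain position j.

    p_aa, p_ss, p_cc: observed probabilities (Paa, Pss, Pcc in the text).
    e_aa, e_ss, e_cc: expected probabilities (P'aa, P'ss, P'cc in the text).
    All values are assumed to be strictly positive.
    """
    return math.log((p_aa * p_ss * p_cc) / (e_aa * e_ss * e_cc))
```

A positive score means the three features agree more often than expected by chance; a score of zero means no better than chance.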
123D+ uses dynamic programming to find an optimal sequence-structure alignment.
In addition to the standard events of match, deletion, and insertion, the algorithm
features a choice of residues not to be aligned, which helps to deal with
different loop conformations. The default alignment mode used was fit, in which the
whole domain is required to be aligned with a part of the target sequence.
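
The fit mode can be illustrated with a generic semi-global dynamic programming recurrence in which unaligned target residues at either end are free, while every domain position must be consumed. This is a sketch under stated assumptions: a linear gap penalty is assumed, the extra "not aligned" event of 123D+ is not modeled, and the function name is hypothetical.

```python
def fit_align_score(score, gap=-1.0):
    """Best score for aligning the whole domain to a part of the target.

    score[i][j]: similarity of target position i and domain position j.
    gap: linear gap penalty (an assumption of this sketch).
    """
    n = len(score)       # target length
    m = len(score[0])    # domain length
    # dp[i][j]: best score for domain[:j] aligned so that it ends at target[:i].
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for j in range(1, m + 1):
        dp[0][j] = dp[0][j - 1] + gap  # domain residues with no partner
    # dp[i][0] stays 0: leading target residues cost nothing.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + score[i - 1][j - 1],  # match
                dp[i - 1][j] + gap,                      # skip a target residue
                dp[i][j - 1] + gap,                      # skip a domain residue
            )
    # Trailing target residues are free as well: take the best end point.
    return max(dp[i][m] for i in range(n + 1))
```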
123D+ was benchmarked on the ASTRAL set of domains and outperformed psi-blast in
fold recognition. 123D+ is available at
http://www-lmmb.ncifcrf.gov/~nicka/run123D+.html.
VENCLOVAS , 406
number of submitted models: 13
Ceslovas Venclovas
email: venclovas@llnl.gov
The comparative modeling method used to build models for CASP4 is a modification
of the one used at CASP3 and described in more detail in the special Proteins
issue (Venclovas et al., 1999). What follows is a brief description of the
major steps in this procedure.
Parent (template) selection
PDB templates were identified either using a Smith-Waterman (Smith &
Waterman, 1981) search against PDB for high homology targets, or using a
PSI-BLAST (Altschul et al., 1997) search against the non-redundant NCBI sequence
database. Usually more than one template was used to build models.
Sequence-structure alignments
Sequence-structure alignments were generated and tested at both the sequence
level and the 3D level. For high homology targets, where structural
template(s) were among closely related sequences, multiple sequence alignment
analysis was used first. This step consisted of producing a series of multiple
sequence alignments for the same set of sequences using systematic variation
of parameters. The regions where variation of the parameters did not affect
the alignment were tabulated, and the alignment within these regions was used to
build a model. In the case of distant homology targets, results of the initial
PSI-BLAST search were used in an intermediate sequence search procedure as a
first step towards generating a sequence-structure alignment. In this procedure,
a set of sequences that bridge the sequence space between the target sequence and
the template(s) were used as probes to search the non-redundant sequence
database. Target-template sequence alignments were extracted from the resulting
search data and their consistency was analyzed. For regions where one dominant
alignment variant was produced, this variant was used to build a model. If
there were several variants, all of them were tested by building models and
evaluating their consistency with the 3D structure. Alignments for some regions that
were expected to be structurally conserved, but could not be aligned by
PSI-BLAST, were derived manually using PSIPRED (Jones, 1999) secondary
structure predictions as a guide.
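
The idea of tabulating regions that do not change when alignment parameters are varied can be sketched as follows. The representation (one dictionary per alignment run, mapping a target residue index to its template partner) and the function name are assumptions made for illustration, not the format used by the author.

```python
def stable_positions(alignments):
    """Return target positions whose template partner is identical in all runs.

    alignments: list of dicts, one per parameter setting, mapping a target
    residue index to the aligned template residue index (or None for a gap).
    """
    first = alignments[0]
    return sorted(
        pos
        for pos, partner in first.items()
        if partner is not None
        and all(other.get(pos) == partner for other in alignments[1:])
    )
```

For example, if two runs agree on target residues 0 and 1 but shift residues 2 and 3 by one template position, only positions 0 and 1 are reported as stable.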
Selecting sequence-structure alignments by model evaluation
Final sequence-structure alignments were selected by building and evaluating
3D models. If sequence methods suggested several major alternative alignments
for a specific region, all of them were tested by building and evaluating the
corresponding models. In most cases models were built not only for the target protein but
also for its close homologs, in an attempt to make a better judgement
regarding the correct alignment in the questionable regions. Evaluation of 3D
models was done using ProsaII (Sippl, 1993) (comparing Z-scores of models for
several homologs generated using alternative alignments) and by visual
inspection with emphasis on detecting buried charged and hydrophilic side
chains.
Loop modeling
Regions that were expected to differ in target structure compared to the
template(s) were defined as loops. Coordinates for these regions were assigned
after suitable fragments from PDB structures were found. The preference was
given to the proteins related to the target. Otherwise the conformation which
was dominant in results of fragment search was assigned to the targeted
region.
Generating 3D structures
Models for evaluation purposes were built either with the Homology module
(InsightII) or with MODELLER (Sali & Blundell, 1993). The final models were
generated using MODELLER with subsequent side chain rebuilding using SCWRL
(Bower et al., 1997). The model structures were then verified with the Whatcheck
function of the WHATIF package (Vriend, 1990), and any severe steric clashes
detected were relieved. No energy minimization procedures were used.
References:
Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z., Miller,
W. & Lipman, D. J. (1997). Gapped BLAST and PSI-BLAST: a new generation of
protein database search programs. Nucleic Acids Res 25(17), 3389-3402.
Bower, M. J., Cohen, F. E. & Dunbrack, R. L., Jr. (1997). Prediction of
protein side-chain rotamers from a backbone-dependent rotamer library: a new
homology modeling tool. J Mol Biol 267(5), 1268-1282.
Jones, D. T. (1999). Protein secondary structure prediction based on
position-specific scoring matrices. J Mol Biol 292(2), 195-202.
Sali, A. & Blundell, T. L. (1993). Comparative protein modelling by
satisfaction of spatial restraints. J Mol Biol 234(3), 779-815.
Sippl, M. J. (1993). Recognition of errors in three-dimensional structures of
proteins. Proteins 17(4), 355-362.
Smith, T. F. & Waterman, M. S. (1981). Identification of common molecular
subsequences. J Mol Biol 147(1), 195-197.
Venclovas, C., Ginalski, K. & Fidelis, K. (1999). Addressing the issue of
sequence-to-structure alignments in comparative modeling of CASP3 target
proteins. Proteins Suppl. 3, 73-80.
Vriend, G. (1990). WHAT IF: a molecular modeling and drug design program.
J Mol Graph 8(1), 52-56, 29.
CHEN-WENDY , 429
number of submitted models: 14
Shu-wen W. Chen and Jean-Luc Pellequer
email: pelleque@scripps.edu
Our contribution to the CASP4 experiment consisted in testing the extent of
our structural knowledge underlying model building. Models presented at CASP4
were built using partially automated programs and computational graphics.
To identify structural templates, we used either a Blast search or fold
recognition programs. The most frequently used fold recognition program was
mGenTHREADER (Jones, 1999) but others found at the CAFASP2 experiment web site
were also used. When multiple structural templates were proposed, we used
known experimental data such as co-factor binding, substrate/product structure
to assist our selection. In case multiple structural templates were selected,
we manually aligned these structures onto each other. To select the final
template (we only mixed small regions from various templates) we looked at
crystallographic resolution, refinement statistics, and phi,psi dihedral
angles distribution.
Sequence alignments were initially performed by the BESTFIT or GAP program (GCG
Inc, Madison, WI) with the BLOSUM62 substitution matrix (Henikoff and
Henikoff, 1992). Gap weights were fitted to obtain the longest alignment with
the smallest number of gaps. Final alignments were manually adjusted by
examining available structural templates. Attention was focused on Gly
replacements, the presence of side-chain to main-chain hydrogen bonds, and
irregularities in secondary structure elements.
The structural template backbone was transferred to the target sequence. Side
chains were substituted using the top rotamer from the library of Tuffery et
al. (1991). Side chains of conserved residues were kept rigid. Substituted
side-chain conformations were optimized using a self-consistent rotamer search
procedure and subsequently refined in torsional angle space by the Nelder-Mead
simplex minimization algorithm (Chen and Pellequer, unpublished). The Charmm22
all atom force field parameters were used for scoring (Brooks et al., 1983).
Deletions and insertions were modeled using a self-consistent loop closure
algorithm with or without side chain flexibility (Chen and Pellequer,
unpublished). A discrete representation of the phi,psi conformational space
was used. Then, several cycles of manual rebuilding and refining were carried
out using Turbo-Frodo (Roussel and Cambillau, 1989). When possible, co-factors
and substrates were included in modeling (without attempt to refine them). To
further remove steric clashes and refine the final geometry, all side-chain
atoms were energy-optimized using Xplor 3.8 (Brünger, 1992) followed by a
brief all-atom minimization.
kitasato-univ. , 047
number of submitted models: 122
Mitsuo Iwadate, Kazuyoshi Ebisawa, Youji Kurihara, Mayuko Takeda-Shitaka and Hideaki Umeyama
email: umeyamah@pharm.kitasato-u.ac.jp
We introduce a method of homology modeling consisting of database searches and
simulated annealing. The method involves searches for homologous proteins,
alignment, construction of Ca atoms, construction of main-chain atoms, and the
construction of side-chain atoms. All processes after alignment are performed
automatically. Searches for homologous proteins and alignment are based on
PSI-BLAST raw output. The raw output is then modified taking the hydrophobic
core and secondary structure into account. In this method, main-chain
conformations are generated from the main-chain coordinates in the reference
protein. The weighting function is defined by the local space homology
representing the similarity of environmental residues in the reference protein.
Side-chain conformations are generated for constructed main-chain atoms by
database searches, and main-chain atoms are optimized for the fixed side-chain
conformations. These two processes, i.e., the side-chain generation and
main-chain optimization, are repeated several times. This type of
construction provides a structure similar to the X-ray structure, in
particular, main-chain and side-chain atoms in the residues belonging to the
structurally conserved regions (SCRs). To examine the accuracy of our method,
we predicted fourteen proteins whose structures are known. The average root
mean square deviation between models and X-ray structures was 2.29 Å for all
atoms, and the percentage of chi1 angles within 30 degrees was 72.6% for
SCR residues. Some models were in good agreement with their respective X-ray
structures, but not with the reference structures for homology modeling.
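
The chi1 statistic quoted above can be computed as in the following sketch. It assumes angles in degrees and treats "within 30 degrees" as an inclusive threshold on the periodic angular difference; the function name is hypothetical.

```python
def chi1_accuracy(model_chi1, xray_chi1, tolerance=30.0):
    """Percentage of chi1 angles within `tolerance` degrees of the X-ray value.

    Both inputs are equal-length lists of angles in degrees. Differences are
    taken on the circle, so 175 and -175 are 10 degrees apart.
    """
    hits = 0
    for m, x in zip(model_chi1, xray_chi1):
        diff = abs(m - x) % 360.0
        diff = min(diff, 360.0 - diff)  # periodic wrap-around
        if diff <= tolerance:
            hits += 1
    return 100.0 * hits / len(model_chi1)
```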
Sternberg , 126
number of submitted models: 45
Expert Knowledge
Paul A. Bates and Michael J.E. Sternberg
email: paul.bates@icrf.icnet.uk
Fully automated comparative model building procedures are generally less
accurate than procedures using some human intervention. Nevertheless, fully
automated procedures are essential for large-scale genome modeling. We are
trying to understand which algorithms are the best to use at each stage in the
model building process. Towards this aim a fully automatic model building
program called 3D-JIGSAW (http://ww.bmm.icnet.uk/3djigsaw) has been written
and entered into CAFASP2. This program is currently designed to work at
levels of no less than 40% sequence identity with the closest parent. The
program is modular with each module centring around a particular algorithm
required in the modeling process. The program produces intermediate files at
critical modeling steps. For the 16 targets modeled for CASP4 (targets easily
assigned to parents of known structure by the program PSI-BLAST; Altschul SF
et al., 1997, Nucl. Acids Res., 25, 3389-3402), intermediate files were inspected,
altered if thought not to be optimal, and the program restarted from the
appropriate point. In addition, one of the program modules, the critical
alignment module, was changed from that used in the fully automatic version of
3D-JIGSAW. The modules used in the model building process are similar to
those reported previously (Bates PA and Sternberg MJE ,1999, Proteins Suppl
3, 47-54) and are:
1. Selection of parents: Parent target sequences are selected from a local
sequence database (database consisting of the NCBI sequences, nr,
plus PDB sequences; annotated with data quality parameters such as
resolution and numbers of missing atoms) using the program PSI-BLAST.
Up to five parents are selected using a balance of sequence similarity
and data quality.
2. Extraction of relevant sequences: A selection of sequences are taken
between the target and parent sequences and hierarchically aligned
(Barton GJ and Sternberg MJE ,1987, J. Mol. Biol., 20,327-37).
3. Superpose parents: The selected parents are superimposed via a
multiple structure alignment algorithm.
4. Align target to parent sequences: The profile of sequences from step 2
is aligned to a profile of sequences from step 3. This strategy worked
quite well for alignments of target to parent above 40% sequence
identity, but as most of the targets were below this level a different
module was used that aligned the best parent PSSM (position-specific
scoring matrix; generated by PSI-BLAST) and target PSSM with
adjustments to the metric dependent on the local agreement of known
and predicted secondary structure. Predicted secondary structure for the
target was obtained from the program PSIPRED (Jones DT, 1999, J. Mol. Biol.,
292, 195-202).
5. Selection of loops to change: All loops are considered for replacement.
The boundaries of the loops are taken from the ends of the secondary
structure elements of the multiple structure alignment. All loops, and all
regions with incompatible backbone angles with the target sequence
were modeled via database fragment searches. Three databases were
searched in the order: (i) homologous/analogous structures, (ii) loop
classification database (Oliva B et al., 1997, J. Mol. Biol. 266, 814-830)
and (iii) non-redundant database, 600 protein chains (sequence similarity
of less than 25%, R-factor <= 2.5). Fragments were selected automatically
and were chosen on the basis of good sequence similarity with the target
and how well the fragments fitted to the take-off points both in terms of RMSD
fit on the Ca atoms used for the take-off points and the difference in C=O
angles of the backbones between parent and fragment. A number of loop
conformations were selected for each gap that joined all pairs of
superimposed parents.
6. Mean-field calculations on fragments: From an ensemble of secondary
structure elements and connecting loops a mean-field calculation is
performed to select a single element or loop for all sections of the
target. The algorithm used is a modification of the self consistent
mean field approach to gap closure (Koehl P and Delarue M ,1995, Nat.
Struct. Biol., 2,163-170) .
7. Selection of side-chain rotamers: Side-chains are built by tracing the
path of the parent side-chain. The maximum number of bond lengths,
angles and torsion angles are taken from the parent side-chain that are
compatible with the new side-chain. Additional internal co-ordinates to
complete the side-chain are taken from the secondary structure dependent
rotamer library (McGregor MJ et al., 1987, J. Mol. Biol., 198, 295-310).
After the replacement of all side-chains and the assignment of a single
rotamer for each, this parent rotamer plus rotamers from a side-chain
rotamer library (Tuffery P et al. ,1991, J. Biomol. Struct. Dyn. 8,
1267-1289) are built at each residue position. A second mean-field
calculation is performed to select the most probable rotamer (Koehl P and
Delarue M ,1994, J. Mol. Biol., 23, 249-275). The force field used in
the calculations consisted of a soft atom pair potentials term,
parameters taken from (Lee C and Subbiah S, 1991, J. Mol. Biol. 217,
373-388) and a hydrogen bonding potential term.
8. Energy refinement : Because the loops are modeled via database searches
they do not fit perfectly to the take-off points. Thus, torsion angles
were adjusted within the loop to give good geometry within the take-off
regions; A modification of the tweak algorithm (Shenkin PS et al., 1987,
Biopolymers, 26, 2053-2085) was used for this purpose. To remove the
small number of steric clashes remaining in the models 100 steps of
steepest descents energy minimization (unrestrained) were run using
the program CHARMM (Brooks BR et al., 1983, J. Comp. Chem., 4, 187-217).
Hogue-Feldman , 090
number of submitted models: 22
Howard J Feldman, Thanh-Van T Le, John J Salama and Christopher W V Hogue
email: feldman@mshri.on.ca
For targets identified as homology modelling targets, similar sequences
were identified through a BLAST search.
For T0123, we chose 1CJ5 as our template since it had the highest identity
(65%) to T0123 and only one gap. The angles between consecutive alpha
carbons in the template structure as well as virtual dihedrals between
sets of four carbons were recorded. For the area near the deletion
(LPAQ) the backbone was allowed greater conformational freedom.
For T0099, we chose 1SHF as our initial template since it had the highest
identity (64%) to T0099 of those found. However, an alignment using
ClustalX v1.81 showed that there was a single residue deletion
near residue 32. The angles between consecutive alpha carbons
in the template structure as well as virtual dihedrals between
sets of four carbons were recorded as above. For the area near the
deletion (EKEGD) the backbone from a different template was used (1LCK).
When placed on the same multiple alignment as the other SH3 domains,
1LCK does not have any indels near this turn.
These alpha-carbon "trajectory" angles were then plotted on "trajectory
distributions" and some Gaussian noise was added, effectively making the
backbone slightly flexible.
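
A much simplified view of adding Gaussian noise to recorded backbone angles is sketched below. It perturbs each angle independently and wraps the result back into the [-180, 180) degree range; the real method stores the angles as trajectory distributions, which is not reproduced here. The function name and the seed argument are assumptions for illustration.

```python
import random

def perturb_angles(angles_deg, sigma_deg, seed=None):
    """Add zero-mean Gaussian noise to each angle and wrap to [-180, 180).

    angles_deg: recorded angles in degrees.
    sigma_deg: standard deviation of the noise in degrees.
    seed: optional seed so that a run can be repeated.
    """
    rng = random.Random(seed)
    noisy = []
    for angle in angles_deg:
        shifted = angle + rng.gauss(0.0, sigma_deg)
        noisy.append((shifted + 180.0) % 360.0 - 180.0)
    return noisy
```

A larger sigma_deg near an indel would correspond to giving that part of the backbone greater conformational freedom.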
Next, using our FOLDTRAJ algorithm(1), approximately 200 structures were
generated by Calpha walk, using the coordinates recorded in the trajectory
distributions. The rest of the structure was built using the FOLDTRAJ
algorithm described in the above reference. Briefly, N, C and O are
placed to minimize errors in bond angles and bond lengths. Beta carbons
are placed according to a look up table dependent on residue and adjacent
alpha carbon positions. Sidechains are placed probabilistically using
Dunbrack's backbone dependent rotamer library(2). All residues are chirally
and sterically valid, and have a minimum of non-hydrogen van der Waals
collisions.
Finally, from the pool of generated structures (all very similar in
backbone but with different rotamer packings), various statistics
were collected including radius of gyration, exposed surface area,
exposed hydrophobic surface area, and empirical energy score according
to two different scoring functions: an atom-based one (3) and a
residue-based one (4).
The best structures were chosen based on their energy scores, radii
of gyration and exposed surface area. The latter two were expected to
be comparable to the same measures on the template structure(s). This
latter step along with BLAST and selection of the template were the only
non-automated, subjective steps.
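
One way to picture this selection step is the sketch below, which computes an unweighted radius of gyration and keeps only models whose value is close to that of the template before ranking by energy. The 10% tolerance, the dictionary keys, and the function names are assumptions; in the abstract this step was done by hand.

```python
import math

def radius_of_gyration(coords):
    """Unweighted radius of gyration of a list of (x, y, z) coordinates."""
    n = len(coords)
    center = [sum(c[k] for c in coords) / n for k in range(3)]
    mean_sq = sum(
        sum((c[k] - center[k]) ** 2 for k in range(3)) for c in coords
    ) / n
    return math.sqrt(mean_sq)

def select_models(models, template_rg, rg_tolerance=0.1):
    """Keep models with Rg near the template value, sorted by energy.

    models: list of dicts with keys "rg" and "energy" (hypothetical layout).
    rg_tolerance: allowed relative deviation from template_rg (assumed 10%).
    """
    kept = [
        m for m in models
        if abs(m["rg"] - template_rg) <= rg_tolerance * template_rg
    ]
    return sorted(kept, key=lambda m: m["energy"])
```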
REFERENCES
1. Feldman HJ and Hogue CWV. (2000) A Fast Method to Sample Real
Protein Conformational Space. Proteins. 39(2): 112-131.
2. Dunbrack RLJ and Karplus M. (1993) Backbone-dependent rotamer
library for proteins. Application to sidechain prediction.
J. Mol. Biol. 230: 543-574.
3. Zhang C, Vasmatzis G, Cornette JL and DeLisi C. (1997)
Determination of Atomic Desolvation Energies From the
Structures of Crystallized Proteins. J. Mol. Biol. 267:
707-726.
4. Bryant SH and Lawrence CE. (1993) An Empirical Energy
Function for Threading Protein Sequence Through the Folding
Motif. Proteins. 16: 92-112.
InforMax , 022
number of submitted models: 4
Feodor Tereshchenko, Nikolai Daraselia
email: feodor@informaxinc.com
The homology modeling starts from an alignment. This alignment must be done in
such a way that the segments of the target sequence aligned to the secondary structure
elements (alpha-helices and beta-strands) of the template are not interrupted
by gaps.
The 3D coordinates of the target backbone alpha-helices and beta-strands are
generated (copied) from the atomic coordinates of the template backbone. The
user may choose to copy loops of equal length or to model them ab initio.
Loops which connect secondary structure elements are modeled (ab initio)
using the downhill simplex minimization algorithm (1).
An energy function incorporating distance-dependent residue-residue potentials
(2), validity of the valence (bond) and dihedral angles formed between the last
C-terminal loop residue and the first N-terminal secondary structure residue, and the
distance between the last loop atom and the first secondary structure atom
allows the loop to close and minimizes its energy at the same time.
E = k1*SUM(R) + k2*D + alpha(D, v), where
R - residue-residue potential for each pair of contacting amino acids (2);
alpha - penalty function;
D - distance between the last C-terminal atom of the loop to be closed and the
first N-terminal atom of the next secondary structure element;
v - valence (bond) angle between the last bond of the loop and the first bond of the
secondary structure.
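
The energy expression above can be sketched as a function that a simplex minimizer would evaluate for each trial loop conformation. The abstract does not specify the form of the penalty function alpha, so the quadratic penalty below, its ideal values, and its weights are all assumptions made only to keep the example runnable; the names are hypothetical.

```python
def default_penalty(d, v, d_ideal=1.33, v_ideal=120.0, w_d=10.0, w_v=0.01):
    """Hypothetical penalty alpha(D, v): quadratic in the deviation of the
    closure distance (Angstrom) and closure angle (degrees) from assumed
    ideal values. The real functional form is not given in the text."""
    return w_d * (d - d_ideal) ** 2 + w_v * (v - v_ideal) ** 2

def loop_energy(contact_potentials, d, v, k1=1.0, k2=1.0, penalty=default_penalty):
    """E = k1*SUM(R) + k2*D + alpha(D, v).

    contact_potentials: residue-residue potential R for each contacting pair.
    d: distance between the last loop atom and the next element's first atom.
    v: angle between the last loop bond and the first bond of that element.
    """
    return k1 * sum(contact_potentials) + k2 * d + penalty(d, v)
```

Because the k2*D term and the penalty both fall as the gap closes, lowering E drives loop closure and contact optimization together.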
The next group of algorithms is used to place side-chains on the resulting
target backbone. The dihedral angles of the amino acid side-chains (except
those that have no Chi-angles) were extracted from PDB and are stored in a
backbone-dependent rotamer library (3). Each valid PhiPsi combination for each
amino acid has a corresponding distribution of probability of Chi1 or Chi2
side-chain dihedral angles. The distribution of Chi3 angles is set as a
function of Chi1Chi2 combination, and that of Chi4 - of Chi2Chi3 combination.
The placement of side-chain rotamers starts with those amino acids which are
identical in both the target and the template. Such rotamers are copied and
left unchanged if possible.
The dihedral angles of other side-chains are set the following way: the
selection of rotamers proceeds from low-index Chi-angles to higher index
Chi-angles (where available). The angles with the same indexes are set at the
same time for all side-chains which were not predicted at the previous step.
The decision to select any particular rotamer from the library is based on the
Chi-angles probability distribution. For each Chi-angle, a mode of the
distribution is selected. After setting up all side-chain Chi-angles, the
algorithm checks for clashes with the backbone and then with the neighboring
side-chains.
The clashes with the backbone are resolved first by selecting rotamers with lower
probability from the library or, in the case of a multimodal rotamer
distribution, rotamers from a different mode. The clashes between side-chains
are resolved after that in the same manner. If this procedure does not resolve
all clashes, the clashing rotamers are joined in a cluster and a complete
search of the backbone-dependent rotamer library is performed.
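
The first stage of this clash resolution, trying rotamers in order of decreasing probability, can be sketched as follows. The clash test is passed in as a callable because the actual geometric check is not described; the function name and data layout are hypothetical.

```python
def pick_rotamer(rotamers, clashes):
    """Return the most probable rotamer that does not clash, or None.

    rotamers: list of (chi_angle, probability) pairs for one residue.
    clashes: callable taking a chi angle and returning True on a clash
             (hypothetical; stands in for the real steric check).
    """
    for chi, _prob in sorted(rotamers, key=lambda r: r[1], reverse=True):
        if not clashes(chi):
            return chi
    return None  # caller would fall back to a cluster-wide search
```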
1. Nelder, J.A., Mead, R. (1965) A simplex method for function minimization.
Computer J., 7:308-313.
2. Bahar, I., Jernigan, R.L. (1997) Inter-residue potentials in globular
proteins and the dominance of highly specific hydrophilic interactions at
close separation. J. Mol. Biol., 266:195-214.
3. Bower M. J, Cohen F.E., Dunbrack R.L. Jr.(1997) Prediction of protein
side-chain rotamers from a backbone-dependent rotamer library: a new homology
modeling tool. J Mol. Biol., 267:1268-1282.
Ginalski , 526
number of submitted models: 4
Krzysztof Ginalski
email: kginal@icm.edu.pl
For the fourth round of Critical Assessment of Techniques for Protein
Structure Prediction (CASP4), four target proteins were modeled using the
comparative modeling technique: 1) beta-lactoglobulin from pig (target T0123;
64% sequence identity with template closest by sequence), 2) manganese superoxide
dismutase homolog from P. aerophilum (target T0128; 54% sequence identity), 3)
tryptophan synthase alpha subunit from P. furiosus (target T0122; 33% sequence
identity), 4) Sp18 from H. fulgens (target T0125; 18% sequence identity). This
set of target proteins (18-64% sequence identity) was chosen to represent
different levels of difficulty in comparative modeling. The main emphasis was
on generating sequence-to-structure alignments of target sequences with their
respective parent structures. As shown in previous rounds of CASP, this part
of the modeling procedure is the major source of errors.
Initially, related proteins with known structures were identified with
PSI-BLAST searches [1] performed against the non-redundant protein sequence
database until profile convergence. Additionally, homologous sequences that
matched the targets were also collected. The CLUSTAL W program [2] was used to
generate multiple sequence alignments for sets of sequences containing target,
templates and other related proteins with unknown structure. Opening and
extension gap penalties were systematically changed, and all of the obtained
alignments were inspected for both variability and violation of structural
integrity. Possible sequence-to-structure alignment variants were tested by
building 3D molecular models for the target sequences with the Homology module
of InsightII (MSI Inc., San Diego, CA, USA). Backbone conformation was taken
from the template structure closest by sequence, and only side-chains were
substituted. Modeling of insertion and deletion regions was skipped for the
structures that were built to test the fitness of different alignment
variants. Models were then subjected to detailed evaluation, mainly by visual
inspection of structural consistency and using ProsaII energy profiles [3].
Such a 3D evaluation procedure enabled selection of final
sequence-to-structure alignments.
Final models of target proteins were built using the MODELLER program [4].
Where possible, more than one template protein was used, after superimposition
of their molecular structures. In some cases after coordinates were assigned
to the target sequence, side-chains were rebuilt using the SCWRL program with
a backbone conformation-dependent rotamer library [5]. To preserve conserved
contacts, and maximize the electrostatic and hydrophobic interactions, the
positions of several side-chains were adjusted manually. Final models were
subjected to energy minimization (100 steps) to remove remaining steric
clashes and improve stereochemistry. All energy optimizations were performed
with the Amber force field [6] in the Discover module of InsightII, using steepest
descent and conjugate gradient methods. The overall quality of each modeled
structure was checked in detail with the WHAT_CHECK program [7].
[1] Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., Lipman,
D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search
programs. Nucleic Acids Res. 25, 3389-3402.
[2] Thompson, J.D., Higgins, D.G., Gibson, T.J. (1994) CLUSTAL W: improving the
sensitivity of progressive multiple sequence alignment through sequence weighting,
position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22,
4673-4680.
[3] Sippl, M.J. (1993) Recognition of errors in three-dimensional structures of proteins.
Proteins 17, 355-362.
[4] Sali, A., Blundell, T.L. (1993) Comparative protein modelling by satisfaction of spatial
restraints. J. Mol. Biol. 234, 779-815.
[5] Bower, M.J., Cohen, F.E., Dunbrack, R.L. Jr. (1997) Prediction of protein side-chain
rotamers from a backbone-dependent rotamer library: a new homology modeling tool.
J. Mol. Biol. 267, 1268-1282.
[6] Weiner, S.J., Kollman, P.A., Case, D.A., Singh, U.C., Ghio, C., Alagona, G., Profeta, S.
Jr., Weiner, P. (1984) A new forcefield for molecular mechanical simulation of nucleic
acids and proteins. J. Am. Chem. Soc. 106, 765-784.
[7] Hooft, R.W., Vriend, G., Sander, C., Abola, E.E. (1996) Errors in protein structures.
Nature 381, 272.
BinToHes , 255
number of submitted models: 43
Silvio Tosatto, Eckart Bindewald, Jochen Maydt, Achim Trabold, Juergen Hesser, Reinhard Maenner
email: silvio@rumms.uni-mannheim.de
We started the search for a template sequence by using PSI-BLAST [1]. The
resulting top candidates were inspected. If no significant hit was found we
applied the fold recognition method described in the CASP-4 fold recognition
abstract "Secondary structure and function based protein fold recognition"
(Bindewald, Tosatto et al). The most probable candidates were submitted to a
CLUSTALW [2] alignment. This alignment was manually modified to reduce the
impact of insertions and deletions. A raw model of the target without indels
was created using our program MOLEGO, which simply copies the backbone angle
information from the template protein according to the sequence alignment. The
insertions and deletions were subsequently modeled using NAZGUL, our fast
ab-initio loop modeling tool. We typically allowed between one and three
residues flanking an indel to be modified during loop modeling. The NAZGUL
algorithm evaluates a database of precalculated synthetic loops. This database
was created by recursively concatenating small polypeptide fragments, starting
from a Ramachandran distribution of phi and psi angles with rigid rod
geometry. Larger fragments are assembled from smaller ones by means of
geometric transformations. These are all stored according to loop length and
evaluated during the modeling step. Possible loops are evaluated and ranked
according to their geometric fit on single residue anchor regions. These are
then filtered for chain continuity, that is deviation from idealized bond
length and bond angle values, and inter-atomic clashes. The solution was
selected through visual inspection among the top scoring proposals. Insertions
at the beginning or end of the chain were modeled using MOLEGO's ab initio
method [3], which samples a discrete set of torsion angles using a
combinatorial search approach. As a final step the side chains were placed
with PESO, our implementation of the dead-end elimination algorithm [4] using
the AMBER [5] non-bonded potential and a set of backbone-independent rotamers.
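The raw-model step described above (copying backbone angles from the template according to the alignment) can be sketched as follows. This is only an illustration under stated assumptions: the function name, the gapped-string alignment format and the (phi, psi) tuple representation are hypothetical and are not taken from MOLEGO itself.

```python
from typing import List, Optional, Tuple

Angles = Tuple[float, float]  # (phi, psi) in degrees; representation is assumed


def copy_backbone_angles(
    target_aln: str,
    template_aln: str,
    template_angles: List[Angles],
) -> List[Optional[Angles]]:
    """Return one (phi, psi) pair per target residue.

    Target residues aligned to a template gap get None; these are the
    insertions that a separate loop-modeling step would have to build.
    """
    if len(target_aln) != len(template_aln):
        raise ValueError("aligned sequences must have equal length")
    result: List[Optional[Angles]] = []
    t_idx = 0  # index of the next unused template residue
    for tgt, tpl in zip(target_aln, template_aln):
        if tpl != "-":
            angles: Optional[Angles] = template_angles[t_idx]
            t_idx += 1
        else:
            angles = None
        if tgt != "-":
            result.append(angles)
        # A gap in the target means the template residue is deleted,
        # so its angles are simply skipped.
    return result


# Toy alignment: one insertion (C) and one deletion (G).
angles = copy_backbone_angles(
    "AC-DE",
    "A-GDE",
    [(-60.0, -45.0), (-120.0, 130.0), (-70.0, -40.0), (-65.0, -42.0)],
)
```

In this toy case the second target residue has no template counterpart and is left as None for later loop building.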
References:
[1] Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman
DJ. Gapped blast and psi-blast: a new generation of protein database search
programs. Nucleic Acids Research, 25(17):3389-3402, 1997.
[2] Thompson JD, Higgins DG, Gibson TJ. CLUSTAL W: Improving the sensitivity
of progressive multiple sequence alignment through sequence weighting,
position-specific gap penalties and weight matrix choice. Nucleic Acids
Research, 22(22): 4673-4680, 1994.
[3] Bindewald E, Hesser J, Männer R. Protein Structure Optimization using a
Combinatorial Search Algorithm. Proc. of the Int. Conf. on Mathem. and
Engineering Techniques in Medicine and Biological Sciences (METMBS'00),
233-238, 2000.
[4] Desmet J, DeMaeyer M, Hazes B, Lasters I. The dead-end elimination theorem
and its use in protein side chain positioning. Nature 356: 539-542, 1992.
[5] Weiner SJ, Kollman PA, Nguyen DT, Case DA. An all-atom force field for
simulations of proteins and nucleic acids. J Comput Chem 7:230-252, 1986.
Dunbrack , 169
number of submitted models: 20
J. Michael Sauder and Roland L. Dunbrack, Jr.
email: RL_Dunbrack@fccc.edu
We performed the following steps in modeling
comparative modeling targets in CASP4:
1) PSI-BLAST was used to identify homologous proteins in the PDB. This
was accomplished by using PSI-BLAST with the target sequence as query
on the non-redundant (nr) sequence database available from NCBI. This
database had been filtered of low-complexity sequences with the
program seg with a window size of 20 (higher than the default 12). The
PSI-BLAST matrix is saved every other iteration. Each PSI-BLAST matrix
was then used to search a database of PDB sequences that we derive
from PDB files (these sequences differ from what RCSB puts out).
2) We chose a parent from the PDB based on sequence identity,
length of alignment, relative paucity of gaps, and resolution.
3) The sequence alignments were examined in light of the
parent structure and some manual adjustments were made to
move gaps to the most likely coil regions. In some cases,
we also examined an alignment from the Threader program
of David Jones.
4) We used our program "blast2model" to take the alignment
and parent structure to produce a PDB file with the backbone
coordinates renumbered and residue type changed to the
target sequence, given the alignment. We preserved the
coordinates of residues that are identical in the parent
and target according to the alignment. blast2model also
outputs a sequence file that can act as input to the scwrl
program to predict sidechains (below).
5) In some cases, especially in very low sequence identity alignments,
we did not build insertions and deletion regions, but rather left the
parent backbone unaltered (just renamed and renumbered). We then used
the program SCWRL (Bower,Cohen,Dunbrack 1997; Dunbrack 1999) to
rebuild the missing sidechains (those different between parent and
target). SCWRL uses a backbone-dependent rotamer library, followed by
clustering of potentially clashing sidechains (which preserves the
lowest energy conformation of sidechains which do not clash; this is
effectively a dead-end elimination step). These clusters are solved
by a branch-and-bound algorithm. If clusters are too large to be
solved rapidly, one residue is identified that when removed will break
the cluster in two parts. Each part is solved separately for each
rotamer of this keystone residue, and then the energies of the two
clusters summed for each rotamer, and the lowest energy configuration
is the prediction. So for a 12 residue cluster, this means that
N=n**12 without breaking the cluster and N=n(n**6 + n**6) for the split
cluster, where N is the number of combinations and n the
number of rotamers per sidechain on average.
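To give a feel for the saving, the two expressions above can be evaluated directly. The helper names and the value n=3 are illustrative assumptions; the formulas are the ones quoted in the text.

```python
def combos_unsplit(n_rotamers: int, cluster_size: int) -> int:
    """Combinations to enumerate when the cluster is solved as a whole: n**size."""
    return n_rotamers ** cluster_size


def combos_split(n_rotamers: int, part_a: int, part_b: int) -> int:
    """Combinations when a keystone residue splits the cluster in two parts.

    For each of the n keystone rotamers, the two parts are solved
    independently: n * (n**part_a + n**part_b).
    """
    return n_rotamers * (n_rotamers ** part_a + n_rotamers ** part_b)


n = 3  # assumed average number of rotamers per sidechain, for illustration
whole = combos_unsplit(n, 12)   # 3**12
split = combos_split(n, 6, 6)   # 3 * (3**6 + 3**6)
```

With n=3 this gives 531441 combinations for the unbroken cluster against 4374 for the split one.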
The potential function is a statistical one for local
backbone/sidechain interactions in the form of -log(prot(phi,psi,res))
and a simple linear steric interaction for sidechain-sidechain and
non-local-backbone/sidechain interactions. prot is determined
from the Bayesian statistical analysis in the backbone-dependent
rotamer library (Dunbrack and Cohen, 1997). We believe
this kind of statistical potential function produces better
predictions when the backbone conformation is not precise (i.e.,
derived from another protein structure) than a molecular-mechanics
type function that is very sensitive to atom positions. Hence
for modeling purposes prediction rates are probably higher.
6) In a number of cases, especially for the last few submissions in
September, we used a loop prediction algorithm in the Modeller5
program developed by Andrej Sali and kindly provided by him in advance
of publication (it has now been published: Fiser,Do,Sali, 2000). We
first removed 2-3 residues on either side of each gap and let SCWRL
replace the sidechains for all mutated residues in the alignment. We
then let Modeller model each missing loop in turn. We then used SCWRL
to replace sidechains in the whole structure (not including conserved
residues).
References:
M. Bower, F. E. Cohen, and R. L. Dunbrack, Jr. Sidechain prediction
from a backbone-dependent rotamer library: A new tool
for homology modeling. J. Mol. Biol. 267, 1268-1282 (1997).
R. L. Dunbrack, Jr. and F. E. Cohen. Bayesian statistical analysis of
protein sidechain rotamer preferences. Protein Science 6, 1661-1681
(1997).
R. L. Dunbrack, Jr. Comparative modeling of CASP3 targets using
PSI-BLAST and SCWRL. Proteins: Structure, Function, Genetics,
Suppl. 3, 81-87 (1999).
A. Fiser, R.L. Do, A. Sali
Modeling of loops in protein structures.
Protein Sci. 9, 1753-73 (2000).
Walts-Wondrous-Wizards , 044
number of submitted models: 172
N. Alexandrov, V. Brover, M. Troukhan, W. Volkmuth
email: nicka@ceres-inc.com
Our prediction process consists of two steps: selecting a template structure
and making an alignment.
1. Template selection.
All target sequences were compared with a set of structural domains using the
123D+ program, which combines sequence similarity, secondary structure
prediction and contact capacity potentials to compute a similarity score. If
there was a hit with Z-score > 6, we made the selection based on the strongest
hit. When the hit covered only a part of the target sequence, we cut out the
remaining part and repeated the run. If 123D+ did not detect an obvious hit,
we predicted the fold anyway, because sampling of a random set of recently
predicted structures indicates that approximately 90% of them are structurally
similar to already known folds, even if there is no strong sequence
similarity. Without a strong 123D+ hit, we used other available associative
information in an attempt to link the target with a protein with known
structure. We used literature search, known metabolic pathways, gene
expression data, position on the chromosome, operons, distribution of folds in
the organism, secondary structure prediction, predictions of transmembrane
helices and coiled coils. We demonstrated that there is a correlation between
protein folds and gene expression and between protein folds and location in
the chromosome. All this additional information gave us quite weak signals.
However, when consistent, these signals resulted in rather confident
predictions. This part of the prediction is analogous to playing charades,
where one discovers an unknown word using many indirect, independent hints.
Interestingly, we can compare the effectiveness of such an approach versus a
purely automated method, as the 123D+ server also participated in the CAFASP section
of CASP4.
2. Alignment
Alignments were computed with the 123D+ program and were in some cases manually
corrected. Manual intervention was limited to (i) placing deletions within the
target sequence so that their edges are close in space in 3D structure and
(ii) moving insertions in the target sequence to the surface of protein
structure.
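The Z-score criterion used in template selection can be sketched as below. How 123D+ actually computes its Z-score is not described here, so this assumes the simplest definition (raw score minus the mean over all library domains, divided by the population standard deviation); the function name and the dictionary input are hypothetical.

```python
from statistics import mean, pstdev
from typing import Dict, Optional, Tuple


def strongest_hit(
    raw_scores: Dict[str, float], z_cutoff: float = 6.0
) -> Optional[Tuple[str, float]]:
    """Return (domain, Z-score) for the top-scoring domain if Z > z_cutoff.

    Returns None when no domain stands out, which is the case where other
    evidence (literature, expression data, ...) would have to be used.
    """
    values = list(raw_scores.values())
    sd = pstdev(values)
    if sd == 0:
        return None  # all scores identical: nothing stands out
    mu = mean(values)
    name = max(raw_scores, key=raw_scores.get)
    z = (raw_scores[name] - mu) / sd
    return (name, z) if z > z_cutoff else None


# Toy library: 99 background domains and one clear hit.
scores = {f"d{i}": 10.0 for i in range(99)}
scores["hit"] = 100.0
result = strongest_hit(scores)
```

A small library cannot produce a large Z-score under this definition, so the example uses 100 domains.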
Godzik , 197
number of submitted models: 158
L.Jaroszewski, A.Godzik
email: adam@ljcrf.edu
We applied identical procedures for homology modeling targets and fold
recognition targets. The procedure consists of three steps: A: Selection of the
template(s), B: Generation of suboptimal alignments, C: Model building and
evaluation. In the cases when FFAS z-score value indicated that the similarity
between the template and query is strong (z-score values higher than 15), the
step B was usually skipped and the model was built based on the alignment from
FFAS. This was the case for many of the homology modeling targets. The
prototype of this procedure called "Multiple Model Approach" was described and
evaluated in (4-5).
A. Selection of the template(s) - Fold & Function Assignment System (1,2).
An FFAS profile-profile search was performed against the PDB database. FFAS is
based on sequence profile-profile matching with dynamic programming. The
multiple alignment is prepared from the PSI-BLAST(8) output. A non-redundant
database of protein sequences was used for profile calculation. FFAS uses
sequences from the PSI-BLAST output with E-values below 0.01 and an elaborate
weighting scheme for the sequences included in the profile(1). Weights are
assigned based on the dissimilarity of each sequence with respect to the other
sequences in the family. In addition, FFAS normalizes the matrix containing
the comparison scores between all positions of both aligned profiles before
the best path is searched for with the Smith-Waterman dynamic programming
algorithm(7).
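The final dynamic programming step can be illustrated with a minimal Smith-Waterman scorer that works on a precomputed position-by-position comparison matrix. This is a sketch, not the FFAS implementation: the linear gap penalty and the function name are assumptions, and how the comparison matrix is filled and normalized is left outside the example.

```python
from typing import List


def smith_waterman_score(score: List[List[float]], gap: float = 1.0) -> float:
    """Best local alignment score over a comparison matrix.

    score[i][j] is the (already normalized) similarity between position i
    of the first profile and position j of the second. A linear gap
    penalty is assumed for simplicity.
    """
    n = len(score)
    m = len(score[0]) if n else 0
    # H[i][j] = best score of a local alignment ending at (i, j)
    H = [[0.0] * (m + 1) for _ in range(n + 1)]
    best = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            H[i][j] = max(
                0.0,                                      # start a new alignment
                H[i - 1][j - 1] + score[i - 1][j - 1],    # align i with j
                H[i - 1][j] - gap,                        # gap in second profile
                H[i][j - 1] - gap,                        # gap in first profile
            )
            best = max(best, H[i][j])
    return best


# Toy matrix: matching positions score 2, everything else -1.
toy = [[2.0 if i == j else -1.0 for j in range(3)] for i in range(3)]
best_score = smith_waterman_score(toy)
```

Because the recursion is floored at zero, a matrix with only negative entries yields a score of 0, i.e. no local alignment.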
B. Calculation of suboptimal alignments.
A set of suboptimal (alternative) alignments was generated for the query
sequence and the template structure(s) selected from the PDB database in the
step A. After the calculation of the initial alignment based on the
profile-profile FFAS method, a similarity matrix was recalculated using
several combinations of threading terms (burial and local conformation terms
are used). The threading energy was calculated for the sequence profile
rather than for a single sequence, as it had been done in the classical
threading. Several gap penalty values were also explored. Gap penalties were
set higher within the secondary structure elements defined with the method
described in the separate publication(3). The resulting alignments were
clustered to avoid redundancy.
C. Model building and evaluation.
The models based on the alignments calculated in the step B were built and
evaluated. We used the MODELLER(6) program developed in the A. Sali lab for model
building. Model evaluation is based on the threading energy using statistical
potential and evolutionary information encoded in sequence profiles (the
threading energy was calculated for the sequence profile rather than for a
single sequence, as it had been done in the classic threading - for example in
MatchMaker program). The threading energy per residue was the final criterion
of the model quality.
References
1. Rychlewski, L., Jaroszewski, L., Li, W. & Godzik, A. (2000).
"Comparison of sequence profiles. Strategies for structural predictions using
sequence information". Protein Science 9, 232-241
2. Jaroszewski, L., Rychlewski, L. & Godzik, A. (2000).
"Improving the quality of twilight-zone alignments". Protein Science, 9,
1487-1496
3. Jaroszewski, L. & Godzik, A. (2000). Search for a New Description of
Protein Topology and Local Structure. ISMB 2000 - 8-th International
Conference on Intelligent Systems for Molecular Biology, San Diego 2000
4. Jaroszewski, L., Pawlowski, K. & Godzik, A. (1998).
"Multiple model approach: an extension of comparative modelling". Journal of
Molecular Modelling 4, 294-309
5. Pawlowski, K., Jaroszewski, L., Bierzynski, A. & Godzik, A. (1997).
"Multiple model approach - dealing with alignment ambiguities in comparative
protein modeling". In Biocomputing, 97 (Altman, R. B., Dunker, A. K., Hunter,
L. & Klein, T. E., eds.), pp. 328-339. World Scientific, Singapore.
6. Sali, A. and Blundell, T. L. (1993).
"Comparative protein modelling by satisfaction of spatial restraints". J. Mol.
Biol. 234, 779-815
7. Smith, T.F. and Waterman, M.S. (1981) "Identification of common molecular
subsequences". J Mol Biol 147:195-7
8. Altschul, S.F. et al. (1997) "Gapped BLAST and PSI-BLAST: a new generation
of protein database search programs". Nucleic Acids Res 25:3389-402
Harrison-Weber , 058
number of submitted models: 91
Ivan Y. Torshin, Irene T. Weber and Robert W. Harrison
email: robert.harrison@acm.org
Molecular modeling is a combinatorial, multiple minimum optimization problem.
In homology modeling, the known homolog serves as a good starting point for
the search, while in ab initio folding there are only limited geometric data.
Two complementary classes of algorithms were explored in our CASP-4
predictions: randomized algorithms, and multiple modeling algorithms.
Randomized algorithms, either based on the Kohonen self-assembling neural
network or an analytic solution for simultaneous circular equations, were used
to explore conformational space and delineate regions of allowed molecular
geometry. These algorithms are computationally efficient; it was possible to
fold most of the CASP-4 ab initio targets several hundred times in a few CPU
hours. Multiple models from independent runs of the randomized procedures
were used to extract conformations that occurred repeatedly, as this improved
the reliability in tests. Hundreds of models were used for ab initio
predictions and ten models for homology modeling. AMMP (Harrison, 1999) was
used to predict 12 ab initio targets and 30 homology modeling targets.
Randomized Algorithms
Our major focus has been to explore new algorithms for building molecular
models and searching conformation space. One general class of
algorithms, randomized algorithms, is especially interesting because
these algorithms can efficiently find or approximate the solutions to
combinatorial and geometric problems (Hertz 1991, de Berg 1997) and
can be implemented efficiently on a parallel computer (JaJa 1992).
The general idea behind randomized algorithms is to use a set of
independent identically distributed random variables to limit the
solution to an acceptably small range. Rather than attempt to converge
to an exact solution of a mathematical problem, which may not exist or
may not be meaningful in the context of protein structure, randomized
approaches define a sequence of ever-closer bounds on the ranges of
solutions. Two randomized approaches were tested: a
modified Kohonen neural network with a distance metric and a
randomized analytic solution to distance restraints (Harrison 1999).
Multiple models were constructed using the distance restraints that
were derived from homologous structures or sequences. Averages over
the models were then used to develop a single model for submission.
Homology Modeling
Protein folds were recognized using the FFAS server (Rychlewski et al.
2000), the 3D-PSSM server (Kelley et al. 2000), and the screening
method we used for ab initio folding. Clustal (Thompson et al. 1994)
was used for multiple sequence alignments when possible. The thirty targets
86-90,92,93,99-101,103,104,106,107,109,111-113, 115-123, 125,127, and 128
were modeled. Ten models were generated from each template using either the
Kohonen algorithm or the analytic approach (Harrison 1999) coupled with
energy minimization and a short run of molecular dynamics. The averaged model
was energy minimized to generate the final model. The final models were
subjected to 3ps runs of molecular dynamics, which may degrade the accuracy
for the high homology examples. The variation among the models was calculated
for each atom and used as an estimate of the uncertainty in the positions.
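The averaging over models and the per-atom variation used as an uncertainty estimate can be sketched as follows. This assumes the models are already in a common frame with atoms in the same order, and it takes the root-mean-square deviation from the mean position as the measure of variation; both are assumptions, as are the names. It is not AMMP code.

```python
from math import sqrt
from statistics import mean
from typing import List, Tuple

Point = Tuple[float, float, float]


def average_and_spread(models: List[List[Point]]) -> Tuple[List[Point], List[float]]:
    """Return the averaged model and a per-atom spread.

    models[m][a] is the position of atom a in model m. The spread of an
    atom is the RMS distance of its copies from their mean position.
    """
    n_atoms = len(models[0])
    averaged: List[Point] = []
    spread: List[float] = []
    for a in range(n_atoms):
        copies = [model[a] for model in models]
        centre = tuple(mean(p[k] for p in copies) for k in range(3))
        msd = mean(sum((p[k] - centre[k]) ** 2 for k in range(3)) for p in copies)
        averaged.append(centre)
        spread.append(sqrt(msd))
    return averaged, spread


# Two toy models of a single atom, 2 A apart along x.
avg, unc = average_and_spread([[(0.0, 0.0, 0.0)], [(2.0, 0.0, 0.0)]])
```

In practice the averaged coordinates would still need energy minimization, as the text notes, since averaging can distort bond geometry.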
Ab Initio Folding
Ab Initio folding was used for targets
91,94,95,96,97,98,102,105,108,110,114,124 and 126. A simple
hydrophobicity-electrostatics potential was supplemented by a
sequence-specific empirical potential to improve the stereochemistry of the
prediction. Inter-residue distances were estimated by searching the protein
database for short stretches of homology from different and unrelated
proteins. Simply finding the best local fit for each overlapping window of
amino acids does not result in a good self-consistent set of distances.
However, when the requirement for chain continuity is enforced, the problem
of identifying a self-consistent set of inter-residue distances becomes akin
to a convolutional error correcting code which is readily solvable by dynamic
programming (Viterbi 1967). This continuity condition is an inherent property
of all polymers and provides a significant gain in prediction accuracy.
Potential templates were identified for homology modeling by using the
proteins that had the most fits for each sequence.
The models were generated in three steps.
1) 200 models were generated with the original potential functions, using
C-alpha-only models. Then inter-residue distances (C-alpha-C-alpha) were
averaged over all the models. Those distances where the standard deviation
was less than 2 angstroms were extracted.
2) A single model was generated that both satisfied the new distance
information and minimized the hydrophobicity-electrostatics potential. This
model was achiral and can represent either the left or right-handed solution.
Secondary structure was identified visually and used to define additional
distance restraints (published experimental data on helical locations were
used for target 102).
3) All-atom models were built for both the right and left-handed solutions.
The best models had right-handed helices.
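Step 1 above (keeping only those C-alpha distances that are consistent across the models) can be sketched like this. The population standard deviation is assumed, the models are plain lists of coordinates, and the function name is hypothetical.

```python
from itertools import combinations
from math import dist
from statistics import mean, pstdev
from typing import Dict, List, Tuple

Point = Tuple[float, float, float]


def consensus_distances(
    models: List[List[Point]], max_sd: float = 2.0
) -> Dict[Tuple[int, int], float]:
    """Average each C-alpha pair distance over all models and keep the
    pairs whose standard deviation is below max_sd (in angstroms)."""
    n_res = len(models[0])
    restraints: Dict[Tuple[int, int], float] = {}
    for i, j in combinations(range(n_res), 2):
        d = [dist(model[i], model[j]) for model in models]
        if pstdev(d) < max_sd:
            restraints[(i, j)] = mean(d)
    return restraints


# Two toy 3-residue models that agree only on the first pair.
model_a = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (20.0, 0.0, 0.0)]
model_b = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0)]
restraints = consensus_distances([model_a, model_b])
```

Distances are invariant to rotation and reflection, which is why the model built from them in step 2 cannot distinguish the two handedness solutions.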
References
de Berg M., van Kreveld M., Overmars M., and Schwarzkopf, O. (1997)
Computational Geometry Springer-Verlag
Harrison, R.W (1999), A Self-Assembling Neural Network for Modeling Polymers
J. Math. Chem. 26,125-137
Hertz J., Krogh, A., Palmer R.G. (1991) Introduction to the theory of neural
computation, Santa Fe Institute studies in complexity lecture notes vol. 1.
Addison-Wesley pp244-246,
JaJa J. (1992) An Introduction to Parallel Algorithms, Addison-Wesley pp 433-484,
Kelley LA, MacCallum RM, Sternberg MJ (2000), Enhanced genome annotation using
structural profiles in the program 3D-pssm, J Mol Biol 299(2):499-520
Rychlewski L, Jaroszewski L, Li W, Godzik A (2000),Comparison of sequence
profiles. Strategies for structural predictions using sequence information
Protein Sci 9(2):232-41
Thompson JD, Higgins DG, Gibson TJ (1994), CLUSTAL W: improving the
sensitivity of progressive multiple sequence alignment through sequence
weighting, position-specific gap penalties and weight matrix choice, Nucleic
Acids Res 22(22):4673-80
Viterbi A.J. (1967) Error bounds for convolutional codes and an asymptotically
optimum decoding algorithm IEEE trans inf theory IT-13,260-269
Yang-Ansuei , 017
number of submitted models: 127
An-Suei Yang
email: yanga@ps7yang3.cpmc.columbia.edu
All alignments and 3D structural models for the targets in CASP4
were produced with the PrISM program. PrISM (Protein Informatics System
for Modeling) is a sequence/structure analysis/modeling system that can be used
either interactively or automatically to produce 3D structures of
proteins from their amino acid sequences (Yang & Honig, Proteins
suppl. 3, 66-72 (1999), J. Mol. Biol. 301(3), 665-678, 679-690, 691-712 (2000)).
PrISM has been released to the public domain and can be downloaded
from the web site: http://www.columbia.edu/~ay1/.
PrISM consists of a variety of integrated computational modules and databases,
including the facilities to carry out structure topology analysis, sequence homology
search/alignment and statistics, structure-structure alignment,
multiple sequence/structure alignment, sequence/structure profile
analysis, fold recognition, comparative model building, sidechain and
loop modeling, and model structure assessment. At present, PrISM makes use
of the NCBI-nr and PDB as data resources. NCBI-nr is used without
further modification. PDB entries are divided into structural domains with
PrISM structure topology analysis tools to form structure domain libraries.
PrISM's sequence search and analysis tools, based on either the
Smith-Waterman or PSI-BLAST sequence comparison algorithms, in
conjunction with statistics based on the theory of extreme value
distribution, can perform pairwise sequence similarity searches,
pairwise or multiple sequence alignments, sequence family clustering,
and sequence profile searches over sequence databases. Structure search
and analysis tools use an algorithm which is built upon double dynamic
programming and rigid-body superimposition methods. This algorithm is
capable of performing pairwise structure alignments, multiple structure
alignments, structure similarity searches and clustering of similar
protein structures. The functions of the sequence and structure analysis modules
are to identify the most suitable structural template(s) and to predict
the best sequence-to-structure alignments, which are then used in the
protein structure modeling modules for model building.
Structure templates are recognized first by a dynamic programming
alignment score calculated with the BLOSUM 62 substitution matrix and
then normalized using the extreme value distribution theory. If the sequence
similarity score between a query sequence and a template is less
than the empirically determined cut-off of p-value=10E-6, the alignment and the
template are used to produce a homology model for the query sequence
with the PrISM structure modeling modules. PSI-BLAST is also used to determine
the most suitable structure templates. If sequence alignment methods fail to
relate a query sequence to any structure in the PDB, a fold recognition
procedure is applied. This procedure is started by constructing models
(backbone plus carbon beta) of the query sequence based on the
predicted alignments of the sequence to all possible templates in the
PDB using a sequence-to-structure mapping algorithm. The most likely
models for the sequence are then decided by a subsequent model ranking
procedure based on a structure fitness score. The structure fitness
score is a sum of individual residue scores which are calculated using
statistically derived parameters. These parameters are designed to
evaluate these simplified models based on secondary structure
propensities and the number and chemical properties of the contacting
neighbors of each residue.
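The extreme value normalization used for template recognition can be illustrated with the standard Gumbel tail probability. This is a sketch only: the location and scale parameters (mu, lam) would have to be fitted to a score distribution, the values below are arbitrary, and the cut-off is passed in as a parameter rather than taken from PrISM.

```python
import math


def evd_p_value(score: float, mu: float, lam: float) -> float:
    """P(S >= score) for a type I extreme value (Gumbel) distribution.

    P = 1 - exp(-exp(-lam * (score - mu))); expm1 keeps precision when
    the p-value is very small.
    """
    return -math.expm1(-math.exp(-lam * (score - mu)))


def is_homology_candidate(score: float, mu: float, lam: float, p_cutoff: float) -> bool:
    """True when the alignment score is significant at the given cut-off."""
    return evd_p_value(score, mu, lam) < p_cutoff


# Arbitrary illustrative parameters, not fitted to any database.
p_at_mode = evd_p_value(10.0, mu=10.0, lam=0.3)
strong = is_homology_candidate(100.0, mu=10.0, lam=0.3, p_cutoff=1e-6)
weak = is_homology_candidate(12.0, mu=10.0, lam=0.3, p_cutoff=1e-6)
```

At the mode of the distribution the p-value is 1 - 1/e, and it falls off roughly exponentially as the score increases.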
PrISM's structure modeling modules build protein structures using one
or more templates that are simultaneously aligned to the query
sequence. When more than one template is used, an automatic procedure
first divides templates into secondary structure segments, and then
selects the most suitable segment templates for model building, segment
by segment. Mainchains are built by using the template conformation
when possible. Insertion-deletion regions, usually loops, are then
rebuilt using ab initio methods. Sidechain torsion angles are either
taken from the templates or predicted based on the mainchain torsion
angles with a neural network algorithm. The model building and the
alignment procedures can iterate until a reasonable model structure is
obtained. Our model structures for targets in CASP4 have not been refined.
PrISM contains a model assessment module, which is used to assess the
quality of a predicted model as the experimental structure becomes
available. The assessment procedure is started by carrying out a
structure alignment to align the model and the experimental structure.
This is followed by the RMSD calculation, the evaluation of the predicted
alignment on which the model is built, and the evaluation of the
predicted mainchain and sidechain torsion angles. These results
provide statistical indicators for the quality of the predicted model.
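Two of the assessment quantities mentioned above, the RMSD and the torsion angle error, can be computed as in the following sketch. It assumes the model and the experimental structure have already been superimposed and matched residue by residue by the structure alignment; the function names are hypothetical and not part of PrISM.

```python
from math import sqrt
from typing import List, Tuple

Point = Tuple[float, float, float]


def rmsd(model: List[Point], experiment: List[Point]) -> float:
    """Root-mean-square deviation between matched, superimposed atoms."""
    if len(model) != len(experiment) or not model:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = sum(
        (p[k] - q[k]) ** 2 for p, q in zip(model, experiment) for k in range(3)
    )
    return sqrt(total / len(model))


def torsion_error(predicted: float, observed: float) -> float:
    """Absolute difference between two torsion angles in degrees,
    taking the 360-degree periodicity into account."""
    return abs((predicted - observed + 180.0) % 360.0 - 180.0)


shift = rmsd([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)], [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)])
wrap = torsion_error(170.0, -170.0)
```

The periodic wrap matters for sidechain and mainchain angles near +/-180 degrees, where a naive subtraction would report a 340-degree error instead of 20.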
Using PrISM, we have built one model for each of the 42 CASP4
targets. We did not make a prediction for target 116 (811 residues).
The modeling strategy varies from one target to another
because the protocol that is used depends on the amount and quality of
information extracted from the sequence and structure databases.
Overall, PrISM provides a flexible computational environment
which has been used in a wide range of modeling challenges.
blundell-tl , 095
number of submitted models: 23
features and environmental properties
David F. Burke, Nuria Campillo, Charlotte Deane, Paul de Bakker,
Lan Chen, Axel Innis, Simon Lovell, Joerg Mueller, Kenji Mizuguchi, H.G. Nagendra, Ricardo Nunez, Jiye Shi, Hiroki Shirai, Mark G Williams and Tom L. Blundell
email: dave@cryst.bioc.cam.ac.uk
Prediction of suitable structural homologues was performed,
using the program FUGUE, by searching position specific
environmental substitution tables generated from the HOMSTRAD
database of homologous structures (Mizuguchi et al. 1998).
These predictions were validated using a combination of visual
inspection of the resulting alignment using the program JOY
(Mizuguchi et al. 1998), comparisons of secondary structure
predictions and a survey of the literature. Identification of
homologues with known structure was also aided by PSI-BLAST
searches and results from the CAFASP servers. The predicted
alignment was then either rejected or manually edited if it was
thought necessary.
The 'core' structure of the target sequence was built using both
MODELLER and the new comparative modelling algorithm SCORE
(Deane et al. submitted) which builds segments of the structure
which it predicts to be structurally conserved. The
structurally variable regions were predicted using the programs
CODA (Deane and Blundell, submitted) and SLoop (Burke et al. 2000,
Rufino et al. 1997; Donate, et al. 1996). Sidechains were then
added using the program CELIAN.
Validation was performed by superimposing all of the predicted
models onto the initial template structures, using the
structural alignment program COMPARER. The models were
inspected for structural features that were seen to be
conserved among the template structures and suspect regions
were re-modelled.
Shoshana-Wodak , 486
number of submitted models: 16
Koji Ogata and Shoshana J. Wodak
email: koji@ucmb.ulb.ac.be
1. Selection of a template protein
To determine if the structure of a target protein can be predicted using
homology modeling methods, we carried out a PSI-BLAST search with the target
sequence against the sequences of proteins in the PDB. This was performed
using the tools and default settings available at the NCBI server. When sequence
similarity was detected with one or more protein entries in the PDB, homology
modeling was undertaken. For the 7 targets for which we performed homology
modeling, sequence identity levels ranged from 20 to 50%.
To align the target sequence to those of the candidate templates the following
procedure was used. The structures of all the PDB entries, identified as
displaying sequence similarity to the target, were aligned using structure
superposition procedures [1], applied to the backbone atoms. This structural
alignment was used to derive a multiple sequence alignment for the
corresponding proteins, to which the target sequence was then aligned. This
alignment was computed using the Smith-Waterman algorithm and the GONNET matrix,
applying a length-dependent gap penalty, with values of 12.0 and 1.0 for gap
creation and gap extension, respectively. Gaps positioned in secondary
structure elements were manually displaced, towards appropriate positions in
nearby loop regions. The template protein to be used for model building was
chosen from amongst the identified PDB entries as the protein with the highest
sequence similarity to the target.
2. Modeling regions with insertions and deletions
To model regions with insertion and deletions in the alignment, we searched
for suitable fragments from fragment databases. These databases contained all
overlapping fragments ranging in lengths from 5 to 16 residues, respectively,
and belonging to all proteins in the PDB with less than 90% sequence identity
[2]. Each of these databases included more than 450 000 fragments. For a
given indel region, suitable fragments were selected from these databases by
specifying the fragment length and requiring that the chosen fragment match
the backbone of the 2 residues preceding and following the indel region to
within 1.0Å rmsd. When a large number of candidate fragments were identified
for a given indel region, those whose spatial orientation was most similar to
that of the corresponding loop region in the template protein were selected.
When no candidate fragments were identified for a region, the alignment was
modified, and the fragment selection procedure repeated. Finally, the
selected suitable fragments were modeled into the template backbone.
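The construction of the fragment databases (all overlapping fragments of 5 to 16 residues) can be sketched as below. For brevity the chains are represented as plain sequences, whereas the real databases would hold backbone coordinates; grouping the fragments by length, the function name and the input format are all assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Fragment = Tuple[str, int, Sequence]  # (chain id, start index, residues)


def build_fragment_libraries(
    chains: Dict[str, Sequence], min_len: int = 5, max_len: int = 16
) -> Dict[int, List[Fragment]]:
    """Collect every overlapping fragment of each chain, grouped by length."""
    libraries: Dict[int, List[Fragment]] = defaultdict(list)
    for chain_id, residues in chains.items():
        for length in range(min_len, max_len + 1):
            # A chain shorter than `length` yields an empty range here.
            for start in range(len(residues) - length + 1):
                libraries[length].append(
                    (chain_id, start, residues[start:start + length])
                )
    return dict(libraries)


# A single 20-residue toy chain.
libs = build_fragment_libraries({"demo": "ACDEFGHIKLMNPQRSTVWY"})
total = sum(len(frags) for frags in libs.values())
```

Selecting candidates for an indel would then mean looking only at the library of the required length and comparing the flanking residues of each fragment with the anchor residues of the template.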
3. Side-chain modeling and optimization
For a given backbone structure side-chain conformations were selected using
the following procedure. First, lowest energy sidechain conformations were
selected from a library of conformations derived from known protein structures
[3]. This library typically contained many thousands of conformations for each
side chain type. The selection procedure was performed using the Metropolis
Monte Carlo sampling method coupled to the AMBER force field. In a second
step, the lowest energy conformation was subjected to Monte-Carlo sampling in
Cartesian space in order to eliminate residual strain. In a last step the
energy of the resulting structure was further relaxed using energy
minimization.
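The Metropolis Monte Carlo selection of side-chain conformations can be illustrated with the following sketch. It replaces the AMBER force field with an arbitrary user-supplied energy function, and the move set (re-drawing one residue's conformation at random) and all names are assumptions made for the example.

```python
import math
import random
from typing import Callable, List, Tuple


def metropolis_accept(delta_e: float, kT: float, rng: random.Random) -> bool:
    """Metropolis criterion: always accept downhill moves, accept uphill
    moves with probability exp(-delta_e / kT)."""
    if delta_e <= 0.0:
        return True
    return rng.random() < math.exp(-delta_e / kT)


def sample_conformations(
    n_choices: List[int],
    energy: Callable[[List[int]], float],
    steps: int,
    kT: float,
    seed: int = 0,
) -> Tuple[List[int], float]:
    """Return the lowest-energy assignment seen during the run.

    n_choices[i] is the number of library conformations for residue i;
    a state is a list of chosen conformation indices.
    """
    rng = random.Random(seed)
    state = [0] * len(n_choices)
    e = energy(state)
    best, best_e = list(state), e
    for _ in range(steps):
        pos = rng.randrange(len(state))
        old = state[pos]
        state[pos] = rng.randrange(n_choices[pos])  # trial move
        new_e = energy(state)
        if metropolis_accept(new_e - e, kT, rng):
            e = new_e
            if e < best_e:
                best, best_e = list(state), e
        else:
            state[pos] = old  # reject: restore previous conformation
    return best, best_e


# Toy energy with a single minimum at conformation index 2 for every residue.
def toy_energy(state: List[int]) -> float:
    return float(sum((s - 2) ** 2 for s in state))


best_state, best_energy = sample_conformations([4, 4, 4], toy_energy, steps=2000, kT=0.5)
```

Tracking the best state seen, rather than returning the final state, mirrors the text's use of the lowest-energy conformation as input to the subsequent relaxation steps.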
4. Modeling for T0119 (Special case)
Comparing the sequence of T0119 with that of Phthalate dioxygenase reductase
(E.C.1.18.1.) (PDB_ID 2PIA), we found that the N-terminal portion of T0119
(residues 1 to 90) was similar to the C-terminal portion of 2PIA (residues 228
to 321), and that the N-terminal portion of 2PIA (residues 1-227) was similar
to the C-terminal portion of T0119. The two proteins thus appeared to be
circular permutations of each other, with a short (10 residues) insertion
between the 2 domains in T0119 relative to 2PIA. Attempts to model the
insertion using our fragment databases failed, however, as no suitable
fragments could be found. We then performed a PSI-BLAST search using the
C-terminal domain of 2PIA as the probe sequence. This led to the
identification of Ferredoxin (PDB_ID 1QOA_A) as being similar to this portion
of 2PIA. The corresponding structure was superimposed onto the C-terminal portion
of 2PIA, substituted for it in the template, and used instead to search once
more for a suitable fragment bridging the 2 domains. This time a suitable
fragment could be identified in one of our fragment databases. Hence in
modeling the T0119 target a template consisting of a chimera of the 1QOA_A and
2PIA backbones was used.
References
[1] Russell, RB and Barton, GJ., Proteins, 14, 309-323, 1992.
[2] Ogata, K. and Umeyama, H., J. Mol. Graph. Model., 18, 258-72, 2000.
[3] Ogata, K. and Umeyama, H., Protein Eng. 10, 353-359, 1997.
SBI-AT , 342
number of submitted models: 68
Tomas Nordahl Petersen, Claus Lundegaard, Morten Nielsen, Anne Marie Munk Jørgensen, Henrik Bohr, Jakob Bohr, Søren Brunak, Garry P. Gippert, Ole Lund. Structural Bioinformatics Advanced Technologies A/S. Hørsholm, Denmark
email: olund@strubix.dk
The team of scientists working at SBI Advanced Technologies A/S (SBI-AT) is
developing novel technologies for protein structure prediction. The goal of
this work is to be able to make accurate tertiary structure models for as many
protein sequences as possible. The work covers diverse areas such as
prediction of protein secondary structure using neural networks, construction
of improved sequence profiles, hidden Markov models using sequence and
structure profiles, construction of non-redundant data sets, and construction
of novel force fields. An algorithm for predicting secondary structure of
proteins at 80% accuracy has been developed [1]. Accurate secondary structure
predictions significantly enhance the capability to make accurate protein
models. The secondary structure predictions can be used to recognize folds and
find templates for remote homology modeling by identifying other proteins with
the same composition and sequential order of secondary structure units. They
can also be used to increase alignment accuracy, as well as aid in finding
fragments for ab initio structure prediction. The secondary structure
predictions will enable SBI to make more accurate protein models and create
models for proteins that were previously too dissimilar to any known
protein structure. In turn, this will facilitate the search for small
molecules that will bind to these proteins. Combining up to 800 predictions
from differently trained neural networks in an efficient manner leads to a
more accurate prediction than that of any of the individual networks. A novel
technique, output expansion, which predicts the secondary structure for more
than one residue at a time, is also a key element of the new method. This
improves the prediction accuracy by teaching the
neural network about the structural context of its secondary structure
predictions. The method not only calculates the most likely secondary
structure for a given residue, but also calculates the probability that a
residue is in any of the three secondary structure conformations. This type of
output is much more useful as input to probabilistic methods such as hidden
Markov models. Using these new technologies the secondary structure of
proteins can be predicted with an unprecedented 80% accuracy rate, thus
improving the state-of-the-art in this very competitive field.
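As a minimal sketch of how per-residue probabilities from several networks might be combined, the example below simply averages the three-state outputs and then reads off the most likely state. Plain averaging, the state labels H/E/C, and all function and variable names are assumptions made for illustration; the text does not specify the actual combination scheme.

```python
from typing import List, Sequence

# Three-state labels (helix, strand, coil); the letters are an assumption.
STATES = ("H", "E", "C")


def combine_networks(per_network: Sequence[Sequence[Sequence[float]]]) -> List[List[float]]:
    """Average per-residue three-state probabilities over several networks.

    per_network[n][i] is the (P_H, P_E, P_C) triple that network n assigns
    to residue i. Plain averaging is an illustrative choice only.
    """
    n_networks = len(per_network)
    n_residues = len(per_network[0])
    averaged = []
    for i in range(n_residues):
        sums = [0.0, 0.0, 0.0]
        for net in per_network:
            for k in range(3):
                sums[k] += net[i][k]
        averaged.append([s / n_networks for s in sums])
    return averaged


def most_likely_states(probabilities: Sequence[Sequence[float]]) -> str:
    """Pick the highest-probability state for every residue."""
    return "".join(STATES[max(range(3), key=lambda k: p[k])] for p in probabilities)


# Two hypothetical networks scoring a three-residue peptide.
net_a = [[0.7, 0.1, 0.2], [0.4, 0.5, 0.1], [0.1, 0.2, 0.7]]
net_b = [[0.9, 0.0, 0.1], [0.6, 0.3, 0.1], [0.2, 0.2, 0.6]]
consensus = combine_networks([net_a, net_b])
predicted = most_likely_states(consensus)
```

Keeping the averaged probabilities, rather than only the winning state, preserves the kind of output that the abstract describes as useful input to probabilistic methods such as hidden Markov models.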
[1] Prediction of protein secondary structure at 80% accuracy. Petersen TN,
Lundegaard C, Nielsen M, Bohr H, Bohr J, Brunak S, Gippert GP, Lund O.
Structural Bioinformatics Advanced Technologies A/S, Hørsholm, Denmark.
tnordahl@strubix.dk. Proteins 2000 41: 17-20.
Murzin , 384
number of submitted models: 21
Alexey G. Murzin and Alex Bateman
email: agm@mrc-lmb.cam.ac.uk
As submitted in the Fold Recognition category
Since our team's last performance in CASP2 four years ago, we have been working
on methods that could extend the superfamilies of known structure in SCOP
to the sequence families of unknown structure in Pfam and other sequence
libraries. We entered CASP4 hoping that this prediction experiment would
provide an opportunity to test our new methods. Systematic work on the
extension of SCOP superfamilies has already resulted in the structural
assignment of many sequence families of unknown structure and, often, unknown
function. Indeed, in CASP3, there were at least three targets predictable by
this approach. Disappointingly, however, none of the CASP4 targets turned out
to be in our list of protein families with already assigned structures.
Therefore, in CASP4 we used essentially the same approach as developed for
CASP2 (Murzin A.G. and Bateman A. Distant homology recognition using
structural classification of proteins. Proteins, Suppl. 1:105-112, 1997). We
searched for probable homologues of the target sequences and available
biochemical information on the target protein and/or its sequence family and
used the predicted secondary structure to shortlist the SCOP superfamilies to
which each attempted target might belong. Predictions were based on the
discovery of superfamily-specific characters. The experience and expertise
gained from our work on the SCOP and Pfam databases were of great help in
this knowledge-based approach. We also tried our knowledge-based approach in
the two other prediction categories. We used superfamily specific features to
improve the alignments in some of the comparative modelling targets. For
several targets, predicted by our approach to be not related to any of the
SCOP superfamilies, we attempted the fold prediction using the conservation
patterns in the target sequence families, the available biochemical data
and/or the empirical folding rules derived from known protein structures.
The choice of prediction format, TS, and the target selection were influenced
by the CASP3 Fold Recognition assessment experience (Murzin A.G. Structure
Classification-Based Assessment of CASP3 Predictions for the Fold Recognition
Targets. Proteins Suppl. 3:88-108, 1999). To ensure the detection of (partly)
correct predictions by both sequence-dependent and sequence-independent
numerical evaluation procedures, each of our predictions was composed of the
regions of confident structure and alignment, the regions of confident
structure but tentative alignment, and the regions of tentative structure. The
3D coordinates for most of the target atoms were the best way to represent
this structural mosaic in a single format. As one of us was strongly opposed to
the NONE prediction, this option was not used. Therefore, in the absence of
predicted homologous structure, we either built a 3D model of our prediction
ab initio, or had it dropped. Only one model was submitted for each of the
completed predictions. Apart from the two targets whose structures were known
to us before they were submitted to CASP4, we did not attempt the large,
presumably multi-domain targets without apparent domain boundaries. Because of
time limitations, we also ignored late comparative modelling targets including
all but one of the predicted members of the P-loop hydrolase superfamily. Due
to the presence of characteristic P-loop motifs in their sequences, their
homology recognition seemed straightforward, and the actual challenge was the
alignment. All other targets were attempted but six or so of them were dropped
eventually. In total, we submitted predictions for 21 targets. These include
four Comparative Modelling targets, T0090, T0092, T0093(!) and T0103; ten
Distant Homology Recognition targets, T0088, T0096_1, T0098, T0100, T0101,
T0104, T0108, T0109, T0118 and T0121_2; three targets with predicted known
folds (there may or may not be a distant homology), T0095, T0102 and T0114;
and four targets with predicted (probably) novel folds, T0086, T0091, T0094
and T0110.
Many of the Distant Homology Recognition predictions were based on the result
of previous analysis of SCOP superfamilies, for example the pectate lyase
beta-helix fold of T0100 and T0101 (Chothia C. and Murzin A.G. New folds for
all-beta proteins. Structure 1, 217-222, 1993). There were several cases of
déjà vu. T0108 had the same characteristic feature as the CASP2 target T0038
and was modelled on the experimental structure of the latter. In T0121_2,
there was the OB-fold signature similar to one we derived for the prediction
of T0004. For the fold prediction of T0102, we used the same pseudo ab initio
approach as we used for the CASP2 target T0042. Incidentally, the predicted
fold of T0102 was found to be similar to the experimental fold of T0042. In
T0086, there was a probable tandem repeat of two (alpha)-alpha-beta-beta-beta
motifs, detected by the analysis of its extended sequence family, analogous to
the approach that detected the internal duplication in T0002_2. Similarly, a
tandem repeat of two beta-alpha-beta-alpha-beta motifs was detected in the
extended T0094 sequence family. Unlike T0002_2, there was no SCOP superfamily
assigned for either T0086 or T0094. Both target structures were modelled ab
initio.
One of our CASP2 techniques, not credited properly at the time because it had
been used only for the late target T0026, was used extensively in most of our
CASP4 predictions. For almost every target predicted to belong to a large
superfamily with many known structures, a composite template structure was
assembled from different fragments of several superfamily structures
superimposed onto their common fold. It allowed the selection of the most
suitable parts from different structures. In particular, the predicted
structure of the P-loop hydrolase T0104 was assembled from the fragments of
several topologically distinct members of this very diverse superfamily to
generate a novel topological variant. For a number of our predictions, we
also created hybrid templates including fragments of non-homologous structures
to model the missing parts in the parent structure or even to construct the
whole fold. Then we used Modeller to generate the 3D coordinates,
automatically sealing the gaps and fixing the stereochemistry of the joints.
Levitt , 012
number of submitted models: 180
Michael Levitt
email: michael.levitt@stanford.edu
The methods used for Comparative modeling and Fold-Recognition were the same
and what follows is the same in both abstracts. This work was greatly aided
by the availability of the output of all the 30 or so servers participating in
CAFASP on the CAFASP web site at http://cafasp.bioinfo.pl/target. In general
these results were available within hours of the target sequence announcement
and we never felt the need to consult the original servers in any way.
We first used the freeware program "wget" to download all the files for any new
targets. Then we parsed all these files using a large Perl script. This
script collected together the results from all the servers to give consensus
secondary structure predictions, consensus fold-recognition results and every
alignment produced. The script also converted all the proteins recognized by
the different servers into SCOP version 1.50 superfamily codes and then counted
how often the different codes occurred. Initially, we used the results for
over 20 servers but then found it more accurate to concentrate on eight that
seemed to perform most consistently. These were: ffas, foldfit, fugue,
genthreader, inbgu, mgenthreader, pdbblast, and target99. As may have been
expected, the groups behind each of these eight servers were generally the
experts who had done well in fold-recognition at previous CASP events (Godzik,
FFAS and PDB-Blast; Sternberg, foldfit or 3D-PSSM; Mizuguchi/Blundell, FUGUE;
Fischer, INBGU; Jones, genTHREADER and mGenTHREADER; and Karplus, SAM-T99 or
target99). Unlike the CAFASP compilation released on the web by Danny Fischer
(http://www.cs.bgu.ac.il/~dfischer/CAFASP2/summaries/), no manual intervention
was used in parsing these raw results. For each target we produced a summary
file that listed:
(1) The fold recognition hits in decreasing order of significance with the PDB
entry name, the significance scores and the SCOP 1.50 ID. In some cases the
raw significance score given by the server was modified so that scores were on
the same scale (-100 for highest significance to small positive numbers for no
significance). For example:
T0099_ffas_hit_1 1bu1a -33.2 2.32.2
T0099_ffas_hit_2 1ark -30.7 2.32.2
(2) All the alignments produced by each method together with information on
the sequence match. For example:
T0099_ffas_al_2-a.mas_1ark 2.32.2 EFIAIYDYKAETEEDLTIKKGEKLEIIEK-EGDWWKAKAIGSGEIGYIPANYIAAA
T0099_ffas_al_2-b.sla_1ark 2.32.2 IFRAMYDYMAADADEVSFKDGDAIINVQAIDEGWMYGTVQRTGRTGMLPANYVEAI
T0099_ffas_al_2-x.par_1ark 2.32.2 nMAT=55, pID=28, nDEL=1, nINS=0, nCov=55/56, spaci=-99.000
(3) A Consensus summary allowing the fold to be recognized. For each SCOP
superfamily we collect the number of hits, the mean significance score, the
method and rank, the SCOP title and the PDB domain names with their SPACI
scores (Brenner, Koehl and Levitt, 2000). For example:
%T0099 4.77.1 -78.4 3 genthreader_1 mgenthreader_2 pdbblast_9
%T0099 (Alpha and beta (a+b),SH2-like,SH2 domain)
%T0099 1fmk 0.578, 2src 0.540,
%T0099 4.123.1 -59.9 6 genthreader_1 mgenthreader_2 pdbblast_6 pdbblast_7
%T0099 (Alpha and beta (a+b),Protein kinase-like (PK-like))
%T0099 1fmk 0.578, 2src 0.540, 1qcfa 0.431, 1ad5a 0.258,
%T0099 2.32.2 -45.9 60 ffas_1 ffas_2 ffas_3 ffas_4 ffas_5 ffas_6 ffas_7
%T0099 (All beta,SH3-like barrel,SH3-domain)
%T0099 1ckaa 0.665, 1fmk1 0.578, 2src 0.540,
For more complete results see our "private" site at: http://csb.stanford.edu/levitt/casp1234 .
During the CASP event, information contained in that site was updated
regularly by Levitt and shared with the different CASP4 groups in my lab
headed by Samudrala, Xia, Fain and Koehl respectively. This is the only
information that was shared. Each group then went on to make their own
comparative models (Samudrala, Koehl and Levitt) and/or ab initio models
(Fain, Levitt, Samudrala, and Xia). There was no comparison of models, as
each individual preferred to use CASP as an opportunity to perfect their
methods rather than to "win" CASP.
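The consensus summary described in item (3) above can be sketched as a simple tally per SCOP superfamily. This is only an illustration under stated assumptions: the tuple layout, the function name, and the third hit's score are hypothetical, and the real Perl script is not reproduced here. The first two hits are the ffas examples quoted in item (1).

```python
from collections import defaultdict


def consensus_summary(hits):
    """Group fold-recognition hits by SCOP superfamily.

    hits: iterable of (server, rank, pdb_id, score, scop_superfamily),
    where more negative scores mean higher significance.
    Returns rows of (superfamily, mean_score, n_hits, ["server_rank", ...]),
    most significant superfamily first.
    """
    by_family = defaultdict(list)
    for server, rank, pdb_id, score, family in hits:
        by_family[family].append((server, rank, pdb_id, score))

    summary = []
    for family, members in by_family.items():
        mean_score = sum(m[3] for m in members) / len(members)
        labels = [f"{m[0]}_{m[1]}" for m in members]
        summary.append((family, mean_score, len(members), labels))

    # Most negative mean score (highest significance) comes first.
    summary.sort(key=lambda row: row[1])
    return summary


hits = [
    ("ffas", 1, "1bu1a", -33.2, "2.32.2"),
    ("ffas", 2, "1ark", -30.7, "2.32.2"),
    ("genthreader", 1, "1fmk", -80.0, "4.77.1"),  # hypothetical score
]
summary = consensus_summary(hits)
```

Sorting on the mean significance mirrors the ordering of the example summary, where the superfamily with the most negative mean score is listed first regardless of how many hits support it.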
Overall we felt very confident (perhaps wrongly so) about recognizing an
appropriate template in the comparative modeling and fold recognition parts of
CASP4. We considered 17 targets to be Comparative Modeling targets (T0089,
T0090, T0092, T0099, T0101, T0103, T0111, T0112, T0113, T0117, T0119, T0121,
T0122, T0123, T0125, T0127, T0128) and did them all. Of the remaining 26
targets, we considered 18 to be Fold-Recognition targets and 8 to be Ab Initio
targets. For those targets that we considered to be fold-recognition targets,
9 were considered easy as there was very clear sequence similarity (T0087,
T0088, T0093, T0096, T0098, T0100, T0104, T0109, T0116), 7 were considered
difficult and could not have been done without the consensus use of the
servers participating in CAFASP (T0094, T0095, T0107, T0108, T0115, T0118,
T0126), and 2 were considered to have no recognizable fold (T0120, T0124).
They were also too large for ab initio modeling so no results were submitted
for these.
In the predictions done by the Levitt group, all the alignments for targets
submitted after 15 August were re-aligned using the structure of the template
to modify normal dynamic programming. This was done as follows: (a) The cost
of deleting residues from the template was proportional to the distance across
the gap in three-dimensions (measured between the CA atoms adjacent to the
gap). (b) The cost of inserting residues depended on how buried the residues
adjacent to the insertion were. (c) Buried residues were given greater weight
in the scoring. Each of these measures has an associated weight. Not having
time to optimize these weights on known structural alignments, we used 25
combinations of parameters and generated alignments for every one.
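To make points (a) and (b) concrete, here is a minimal sketch of structure-dependent gap costs. The function names, the burial scale in [0, 1], the averaging of the two flanking burial values, and the zero cost for terminal deletions are all assumptions chosen for illustration; the weights actually used are not given in the text.

```python
import math


def deletion_cost(ca_coords, first, last, weight=1.0):
    """Cost of deleting template residues first..last (0-based, inclusive).

    The cost is proportional to the 3D distance between the CA atoms of the
    residues flanking the gap (first - 1 and last + 1).
    """
    if first <= 0 or last >= len(ca_coords) - 1:
        # Assumption: a terminal deletion has no flanking pair, so it is free.
        return 0.0
    return weight * math.dist(ca_coords[first - 1], ca_coords[last + 1])


def insertion_cost(burial, position, weight=1.0):
    """Cost of inserting residues between template residues position and position + 1.

    burial[i] is a hypothetical burial measure in [0, 1]; the cost grows
    with the mean burial of the two residues next to the insertion.
    """
    return weight * 0.5 * (burial[position] + burial[position + 1])


# Toy template: six CA atoms on a straight line, 3.8 Angstrom apart.
ca = [(3.8 * i, 0.0, 0.0) for i in range(6)]
burial = [0.1, 0.2, 0.9, 0.8, 0.3, 0.1]

gap_cost = deletion_cost(ca, 2, 3)   # flanking residues 1 and 4
ins_cost = insertion_cost(burial, 2)  # insertion between residues 2 and 3
```

In a dynamic programming alignment these functions would replace the constant gap-opening and gap-extension penalties, so that gaps bridging a short spatial distance or falling in exposed regions become cheaper.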
All the alignments, whether taken from CAFASP before 15 August or re-aligned
as described above, were then used with our well-established automatic modeling
methods, SegMod and Encad, to generate stereochemically acceptable all-atom
models for each alignment (see Levitt, M. Accurate Modelling of Protein
Conformation by Automatic Segment Matching. J. Mol. Biol. 226, 507-533 (1992)
and Levitt, M. Energy Refinement of Hen Egg-White Lysozyme. J. Mol. Biol. 82,
393-420 (1974)).
Finally, the best models were selected as follows. We used the rapdf probability
score (Samudrala, R & Moult, J. An All-atom Distance-dependent Conditional
Probability Discriminatory Function for Protein Structure Prediction. J. Mol.
Biol., 275: 893-914, (1998)) to choose the best 1000 models (if there were that
many). These 1000 or fewer models were grouped into 10 clusters (using
bottom-up hierarchical clustering based on inter-structure CA coordinate RMS
deviation). For each model we used the rapdf score, Samudrala's HCF
hydrophobic compactness score, Keasar's surface energy, and the number of
hydrogen bonds to rank the conformations in each cluster. Finally, we chose the
five lowest-energy models, never including more than one model from a given
cluster. Occasionally manual intervention was used in deciding the rank of the
models in the official submission to CASP. For this we viewed the models to
judge general protein like shape and also used the coverage. For example, a
model with a less favorable energy score may be ranked above a model with
better score if the first model covered more of the target sequence.
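The final selection step, choosing the five lowest-energy models with at most one per cluster, can be sketched as a greedy pass over the models sorted by energy. The sketch assumes a single combined energy per model and precomputed cluster labels; how the four scores were actually combined is not stated, and all names and values below are hypothetical.

```python
def pick_models(models, n_pick=5):
    """Greedy selection of the lowest-energy models, one per cluster.

    models: list of (name, energy, cluster_id); lower energy is better.
    Returns the names of the chosen models in order of increasing energy.
    """
    chosen = []
    used_clusters = set()
    for name, energy, cluster in sorted(models, key=lambda m: m[1]):
        if cluster in used_clusters:
            continue  # this cluster already contributed a model
        chosen.append(name)
        used_clusters.add(cluster)
        if len(chosen) == n_pick:
            break
    return chosen


# Hypothetical models: (name, combined energy, cluster label).
models = [
    ("m1", -10.0, 0), ("m2", -9.5, 0), ("m3", -9.0, 1), ("m4", -8.0, 2),
    ("m5", -7.5, 1), ("m6", -7.0, 3), ("m7", -6.0, 4), ("m8", -5.0, 5),
]
selected = pick_models(models)
```

Restricting each cluster to one representative keeps the five submitted models structurally diverse, so a single wrong alignment family cannot occupy every slot.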
Friesner , 414
number of submitted models: 150
An, Y., Eyrich, V.A., Gunn, J., Pincus, D.L., Standley, D.M., Friesner, R.A.
email: rich@chem.columbia.edu
We carried out comparative modeling in cases where an obvious homologue could
be identified using our fold recognition techniques. Based on the alignment,
we imposed geometrical constraints from the template in appropriate locations.
Some regions were unconstrained (e.g. insertions or deletions) and the final
geometries were determined in the course of running a tertiary folding
simulation. When different alignments or templates were possible, the tertiary
folding energy, similarity to the template, and template function were used to
select the structure that was finally submitted.
Following is a brief description of the core technology used to identify
homologues, build alignments, and carry out simulations:
A. Core Technologies
(1) Secondary structure prediction: predictions of the target sequence were
obtained from four public servers: PSIPRED, JPRED, SSPRO, and PHD.
(2) Alignment of the target sequence to sequences in the PDB: a dynamic
programming algorithm incorporating predicted secondary structure from step
(1) was used to produce a short list of proteins whose sequence
identity/secondary structure pattern indicated that they were plausible
candidates for remote homologues to the target. The scoring function used in
the alignment was optimized against a training-set from the PDB. All four
secondary structure prediction methods were used in this step, as well as
combinations of segments from various methods; the number of such combinations
depended upon the variability in the secondary structure prediction results.
The PDB sequences were decomposed into domains when possible; we also computed
the radius of gyration of the aligned part of the template in order to
eliminate alignments across domains in cases where a domain decomposition was
not conveniently available. In some cases, the effect of truncating the target
sequence at either end was investigated; in others, when a multiple domain
structure of the target was suspected (e.g. due to comments in the
literature), various partitions of the sequence were run independently. The
number of candidate homologues saved from this first stage depended upon the
type of protein. For mixed alpha-beta proteins, the number of sequences in the
PDB fitting the secondary structure pattern was typically rather small, with a
significant degradation in quality of agreement after the first ~50
candidates. For all-alpha and all-beta proteins, the number of reasonable
candidates was often much larger, on the order of 300-500. In some cases these
lists could be substantially truncated on the basis of known protein function
(e.g. carbohydrate binding proteins). Finally, in many cases it was necessary
to enumerate a significant number of different alignments between the target
and candidate template. This was accomplished on a segmental basis, i.e. by
forcing the pairings of various designated beta-strands and alpha-helices.
(3) Constraint generation: Constraints were generated from the high-ranking
alignments. C_alpha atoms in the target sequence were constrained to the
corresponding template values.
(4) Tertiary folding simulations: The objective was to select the correct
candidate homologue and alignment. The computational details of the tertiary
folding simulations are briefly described as follows:
(a) An off-lattice model containing backbone atoms plus a pseudo atom
representation of the side chain for each amino acid was employed.
(b) The geometrical variables in the simulation were the phi and psi angles in
the loop regions; angles in secondary structural regions were fixed to ideal
values (-57, -47 degrees for alpha helices, -139, 135 degrees for beta
strands).
(c) The potential was a function of the distance between the side chain pseudo
atoms and the identities of the interacting residues. The functional form was
a general cubic spline that allowed great flexibility along with rapid
computation of energies and gradients. In general, hydrophobic-hydrophobic
interactions were attractive, and hydrophilic-hydrophilic interactions were
repulsive, as in the statistical potential of Sippl [1] and coworkers.
However, the potential was designed to vary as a function of protein size; we
have found this modification to be essential for obtaining reasonable results
for test cases. The size dependence was implemented by collecting distance
statistics from proteins of a given size-group (the training set). The
potential function was optimized iteratively so as to render the training set
proteins stable (i.e., after local minimization), while maintaining the
smallest energy gap possible between native conformations and their locally
minimized counterparts.
(d) Simulations of protein structures were carried out via a Monte Carlo plus
minimization algorithm [2] along the lines proposed by Li and Scheraga [3],
with a number of modifications to improve efficiency. The Monte Carlo code has
been developed to run in parallel using the MPI protocol over a network of
inexpensive personal computers.
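The Monte Carlo plus minimization idea in step (d) can be illustrated on a toy problem: each trial move is a random perturbation followed by local minimization, and the Metropolis test is applied to the minimized energies. Everything below is an assumption made for illustration: a one-dimensional rugged function stands in for the protein potential, a crude gradient descent stands in for the minimizer, and no parallel (MPI) execution is attempted.

```python
import math
import random


def energy(x):
    # Toy rugged landscape, not a protein potential (illustrative assumption).
    return 0.1 * x * x + math.sin(3.0 * x)


def local_minimize(x, step=1e-2, iters=2000):
    """Crude gradient descent using a central-difference derivative."""
    for _ in range(iters):
        grad = (energy(x + 1e-5) - energy(x - 1e-5)) / 2e-5
        x -= step * grad
    return x


def monte_carlo_minimization(x0, n_steps=300, kick=1.5, temperature=0.5, seed=0):
    """Perturb, minimize locally, then accept or reject with the Metropolis rule."""
    rng = random.Random(seed)
    x = local_minimize(x0)
    e = energy(x)
    best_x, best_e = x, e
    for _ in range(n_steps):
        trial = local_minimize(x + rng.uniform(-kick, kick))
        e_trial = energy(trial)
        accept = e_trial <= e or rng.random() < math.exp(-(e_trial - e) / temperature)
        if accept:
            x, e = trial, e_trial
            if e < best_e:
                best_x, best_e = x, e
    return best_x, best_e


best_x, best_e = monte_carlo_minimization(4.0)
```

Because every accepted state is a local minimum, the walk moves between basins rather than through the barriers separating them, which is the main appeal of this family of methods.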
Tertiary folding simulations were carried out using the potential above
supplemented with C_alpha restraint sets. The resulting structures were
clustered and ranked according to total energy. The choice of the correct
homologue was made by considering the total energy (including the constraint
term), similarity to the template (as determined by CE [4]), and biological
function of the target.
References:
[1] Casari, G., Sippl, M.J. (1992). Structure-derived hydrophobic potential.
J. Mol. Biol. 224(3), 725-732.
[2] Eyrich, V. A., Standley, D. M. & Friesner, R. A. (1999). Prediction of
protein tertiary structure to low resolution: Performance for a large and
structurally diverse test set. Journal of Molecular Biology 288(4), 725-742.
[3] Li, Z. Q. & Scheraga, H. A. (1987). Monte-Carlo-Minimization Approach to
the Multiple-Minima Problem in Protein Folding. Proceedings of the National
Academy of Sciences of the United States of America 84(19), 6611-6615.
[4] Shindyalov IN, Bourne PE (1998) Protein structure alignment by incremental
combinatorial extension (CE) of the optimal path. Protein Engineering 11(9)
739-747.
Hovmoeller-Zhou , 501
number of submitted models: 5
email: svenh@struc.su.se
The secondary structure prediction for proteins is usually done on the 3
categories Helix, Sheet and Random (HSR). An alternative is to use the torsion
angles, as defined in the Ramachandran plots. Here too we get 3 categories, α,
β and other, but there is no one-to-one relation between these and the HSR
categories. Several consecutive amino acids with torsion angles in the area of
the Ramachandran plot typical for α-helices will correspond to HELIX in the
PDB. However, several consecutive amino acids with torsion angles in the β
region are not necessarily classified as SHEET. A straight strand is called a
sheet only if it has a partner in the form of another strand, parallel or
anti-parallel. Single strands not defined as SHEET in the PDB are not
uncommon.
When sorting on torsion angles we get just over 50% α, 42% β, 4.5%
left-handed helix and just 2.5% of the amino acids outside these three
regions. Most (70%) of the amino acids with α conformation are found in
helices, while most (55%) of the amino acids with β conformation are found in
the regions called Random coil. Just over 1% of the amino acids in HELIX and
SHEET do not have torsion angles in their expected α or β regions of the
Ramachandran plot. See the Table. A careful check of these residues shows that
most of them are due to mistakes in the PDB data sets. Such mistakes further
complicate protein folding prediction if it is based on the assumption that
the information in the PDB is always correct.
Percentage of all residues by PDB assignment (rows) and torsion-angle region
(columns):

Torsion angles:   alpha    beta   left alpha   other     Sum
HELIX in PDB       36.8     0.8          0.3     0.3    38.2
SHEET in PDB        0.6    18.0          0.1     0.4    19.2
Random coil        13.5    23.3          4.1     1.8    42.6
Sum                50.9    42.1          4.5     2.5   100.0
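The two fractions quoted in the text can be recomputed from the table. The short sketch below does this arithmetic; the dictionary layout and names are only illustrative.

```python
# Percentages of all residues, copied from the table above.
table = {
    "HELIX": {"alpha": 36.8, "beta": 0.8, "left": 0.3, "other": 0.3},
    "SHEET": {"alpha": 0.6, "beta": 18.0, "left": 0.1, "other": 0.4},
    "coil": {"alpha": 13.5, "beta": 23.3, "left": 4.1, "other": 1.8},
}


def column_total(region):
    """Sum one torsion-angle region over all three PDB classes."""
    return sum(row[region] for row in table.values())


# Fraction of alpha-region residues that sit in HELIX.
alpha_in_helix = table["HELIX"]["alpha"] / column_total("alpha")
# Fraction of beta-region residues that sit in random coil.
beta_in_coil = table["coil"]["beta"] / column_total("beta")
```

The first fraction comes out near 0.72 and the second near 0.55, in line with the rounded values of 70% and 55% given in the text.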
Except for glycine, about 90% or more of the residues of each amino acid fall
in the α or β regions. Thus, a prediction of torsion angles is essentially a
2-option prediction, as opposed to the 3 options HSR. That facilitates folding
prediction, and gives an additional 3D prediction for the backbone. On the
other hand, as mentioned above, the translation from an α/β/other prediction
to an HSR prediction is not totally straightforward.
We have made Ramachandran plots for each of the 20 amino acids, based on
all the 150 000 residues in our learning set. We have also made separate plots
for the residues defined as HELIX, SHEET or random coil in the PDB. These
plots show several interesting features. The torsion angles for amino acids in
HELIX are sharply focused near φ = -64°, ψ = -38°. This value is nearly the
same for all amino acids. However, the residues classified as random coil, yet
having torsion angles in the α-helix region, have a very different
distribution of angles; it is elongated and inclined by 45 degrees to the axes
of the Ramachandran plot. The torsion angles for amino acids in SHEETs are
shifted by about 70 degrees along φ, relative to those in random coils.
We believe these findings are important for the protein folding prediction,
but we have not exploited this information fully yet.
We have analyzed single amino acids, pairs, triples and so on, based on both
the HSR and the torsion angle scheme. Our learning set contains 560 different
protein subunits, containing 150 000 amino acids. The testing set contains 30
subunits, none of which are included in the learning set. The most striking
feature is that 88% of all the residues before proline have β conformation. For
most triples, the most common set of torsion angles is ααα. Most of these have
βββ as the second most probable set of torsion angles. The mixed combinations
αβα, ααβ and so on are usually very rare. Gly and Pro are the amino acids that
are most commonly found to break a sequence of ααα or βββ, but this can also
be achieved by the amino acids with polar groups close to the β-carbons: Asn,
Asp, Ser and Thr.
We are predicting protein folding based on a combination of HSR and torsion
angle predictions. In both cases we base our predictions on statistics from
our learning set of 560 subunits in the PDB.
After a preliminary assignment of the amino acids as being in H, S or R, the
program goes over the sequence looking for single H or S or pairs of H. Since
these are not allowed, they will be reconsidered.
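The check for disallowed short segments can be sketched as a scan over runs of identical states. The sketch reads "single H or S or pairs of H" as: an isolated H, an isolated S, or a run of exactly two H. That reading, the function name, and the example string are assumptions; what the program does when it reconsiders a flagged segment is not described, so it is not modelled here.

```python
from itertools import groupby


def disallowed_segments(hsr):
    """Find runs that are too short to be a real helix or strand.

    hsr: string over the alphabet H (helix), S (sheet), R (random).
    Returns (start, length, state) for every single H, single S or pair of H.
    """
    flagged = []
    pos = 0
    for state, run in groupby(hsr):
        length = len(list(run))
        too_short_helix = state == "H" and length <= 2
        too_short_strand = state == "S" and length == 1
        if too_short_helix or too_short_strand:
            flagged.append((pos, length, state))
        pos += length
    return flagged


# Hypothetical preliminary assignment.
assignment = "RRHRRSSSRHHRRSRHHHH"
flagged = disallowed_segments(assignment)
```

Reporting the start position and length of each run leaves the decision about how to reassign those residues to a later step.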
In summary, we reach very good predictions for helices, especially long ones,
but have more difficulties with the sheets. Part of this difficulty may stem
from the fact that two identical stretches of amino acids (equal sequence and
identical atomic co-ordinates) may in one case be called a SHEET but in
another protein it is considered as random coil, because it lacks a partner in
the form of a parallel or anti-parallel strand. We suspect that this problem
of classification may hamper the success of protein folding prediction.
Vajda , 241
number of submitted models: 22
Jahnavi Prasad, Michael Silberstein, David Gatchell and Sandor Vajda
email: vajda@bu.edu
The basic idea of the method is to generate a large number (preferably up to a
few thousand) of alignments, construct a homology model for each, and rank the
models according to their free energies.
The current implementation of the procedure starts with traditional template
selection using Blast and Psi-Blast. The Domain Profile Analysis developed in
Temple Smith's lab
(http://bmerc-www.bu.edu/bioinformatics/profile_request.html) has also been
consulted. One or (infrequently) several proteins have been selected as
templates for the comparative modeling. In the second step of the algorithm,
we generate multiple alignments between target and template sequences by
varying the alignment parameters (gap-opening, gap-extension, and scoring
matrix) for producing semi-global alignments by standard dynamic programming.
The blosum62 and gonnet matrices were used with gap opening penalty values 5,
6, 7, 8, 9, 10, 12, 14, 17, 20, 25, and gap extension penalty values 0.1, 0.2,
0.3, 0.5, 0.75, 1.0, 1.25, 1.6, 2.0, 2.5, 3, 4, 5, 7, 10. We produced only one
alignment for each set of parameters using a single trace-back path in the
dynamic programming matrix, thus resulting in 330 alignments for each
template-target pair. Any alignment was deleted if it was a duplicate, or if
less than 75% of the target residues were aligned to the template, generally
resulting in 80 to 150 retained alignments.
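The parameter grid and the filtering rule can be sketched as follows. The matrix names and penalty values are those listed above; the representation of an alignment as a tuple of (target index, template index) pairs and the function name are assumptions, and the alignment step itself is not shown.

```python
from itertools import product

matrices = ["blosum62", "gonnet"]
gap_open = [5, 6, 7, 8, 9, 10, 12, 14, 17, 20, 25]
gap_extend = [0.1, 0.2, 0.3, 0.5, 0.75, 1.0, 1.25, 1.6, 2.0, 2.5, 3, 4, 5, 7, 10]

# One alignment per parameter set: 2 x 11 x 15 combinations.
grid = list(product(matrices, gap_open, gap_extend))


def filter_alignments(alignments, target_length, min_coverage=0.75):
    """Drop duplicates and alignments covering too little of the target.

    Each alignment is a sequence of (target_index, template_index) pairs,
    with every target residue appearing at most once (an assumed layout).
    """
    seen = set()
    kept = []
    for pairs in alignments:
        key = tuple(pairs)
        if key in seen:
            continue  # duplicate of an alignment already considered
        seen.add(key)
        if len(key) / target_length >= min_coverage:
            kept.append(key)
    return kept


# Hypothetical alignments for a four-residue target.
full = [(0, 0), (1, 1), (2, 2), (3, 3)]
partial = [(0, 0), (1, 1)]
kept = filter_alignments([full, list(full), partial], target_length=4)
```

The grid size confirms the count of 330 alignments per template-target pair given in the text.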
In the third step, all alignments are used for model construction via the
MODELLER program developed by Sali and co-workers [1]. The resulting models
were minimized for 200 steps using the Charmm potential [2], and ranked by
using an empirical free energy function [3,4]. The function combines
molecular mechanics with empirical solvation/entropic terms to approximate the
free energy G of the system consisting of the protein and the solvent, the
latter averaged over its own degrees of freedom. The free energy is given by
G = Econf + Gsolv. The conformational energy Econf is calculated by Version
19 of the Charmm potential, Econf = Eelec + Eint, where the internal (bonded)
energy, Eint, is the sum of bond stretching, angle bending, torsional, and
improper terms, Eint = Ebond + Eangle + Edihedral + Eimproper. The
electrostatic energy, Eelec, is calculated using neutral side chains and the
distance-dependent dielectric ε = 4r. Gsolv is the solvation free energy,
obtained by the atomic solvation parameter model of Eisenberg and McLachlan
[5].
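The way the terms add up can be sketched with a small container class. This only mirrors the bookkeeping G = Econf + Gsolv, Econf = Eelec + Eint and Eint = Ebond + Eangle + Edihedral + Eimproper; it does not compute any term from coordinates, and the class, the helper function and the numerical values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class EnergyTerms:
    """Per-model energy terms (arbitrary illustrative units)."""
    e_bond: float
    e_angle: float
    e_dihedral: float
    e_improper: float
    e_elec: float
    g_solv: float

    @property
    def e_int(self) -> float:
        # Internal (bonded) energy: sum of the four bonded terms.
        return self.e_bond + self.e_angle + self.e_dihedral + self.e_improper

    @property
    def e_conf(self) -> float:
        # Conformational energy: electrostatics plus internal energy.
        return self.e_elec + self.e_int

    @property
    def g_total(self) -> float:
        # Empirical free energy used for ranking.
        return self.e_conf + self.g_solv


def rank_models(models):
    """Return model names sorted from lowest to highest free energy."""
    return sorted(models, key=lambda name: models[name].g_total)


# Hypothetical term values for two models.
models = {
    "model_a": EnergyTerms(1.0, 2.0, 3.0, 0.5, -10.0, -4.0),
    "model_b": EnergyTerms(1.0, 1.0, 1.0, 1.0, -5.0, -2.0),
}
ranking = rank_models(models)
```

Keeping the terms separate makes it easy to rank with a reduced function later, for example by dropping the internal energy term.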
Notice that the function does not include the van der Waals energy term. This
approximation is based on the concept of van der Waals cancellation which
assumes that the solute-solute and solute-solvent interfaces are equally well
packed, and hence the van der Waals contacts lost between solvent and solute
are balanced by new solute-solute contacts formed upon protein folding. This
cancellation is promoted by a procedure called van der Waals normalization,
prior to the free energy calculations. Van der Waals normalization implies
that all conformations are minimized for a moderate number of steps, the
structure with the lowest van der Waals energy is selected, and all other
structures are further minimized to attain the same van der Waals energy
value. The van der Waals cancellation implies that we can remove both the
solute-solvent and the solute-solute van der Waals terms from the free energy
function.
For completeness it is necessary to add that at the time of the CASP
submissions the algorithm had not yet been extensively tested. Recently we
made a number of improvements in the methodology. First, we now include all
possible trace-back paths for each dynamic programming matrix, thereby
substantially increasing the number of alignments. Second, we use the full
template sequence in the alignment and not only the fragment for which 3D
coordinates are available. Third, the most important change is that the free
energy function seems to provide more robust predictions without the internal
energy term and hence we restrict consideration to the function G = Eelec +
Gsolv.
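The reduced function G = Eelec + Gsolv can be illustrated with a minimal sketch. It assumes a plain Coulomb sum with the distance-dependent dielectric of 4r and a solvation term of the atomic solvation parameter form, sum(sigma_i * A_i). The conversion constant, the function names, and all numerical inputs are assumptions chosen for illustration; they are not the Charmm 19 parameters or the published solvation parameters, and bonded-pair exclusions are ignored.

```python
import math

# Conventional conversion factor to kcal/mol for charges in units of e and
# distances in Angstrom (an assumption of this sketch).
COULOMB_KCAL = 332.0636

def electrostatic_energy(charges, coords, dielectric_slope=4.0):
    """Coulomb sum over all atom pairs with dielectric eps = slope * r.

    Simplification: no exclusion of bonded pairs and no cutoff.
    """
    total = 0.0
    n = len(charges)
    for i in range(n):
        for j in range(i + 1, n):
            r = math.dist(coords[i], coords[j])
            eps = dielectric_slope * r
            total += COULOMB_KCAL * charges[i] * charges[j] / (eps * r)
    return total

def solvation_energy(areas, sigmas):
    """Atomic solvation parameter form: sum of sigma_i * accessible area_i."""
    return sum(s * a for s, a in zip(sigmas, areas))

def free_energy(charges, coords, areas, sigmas):
    """Reduced target function G = Eelec + Gsolv (no internal or vdW terms)."""
    return electrostatic_energy(charges, coords) + solvation_energy(areas, sigmas)
```

Candidate models built from different alignments would then be ranked by the value returned by free_energy, with lower values preferred.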
References:
1. R. Sanchez and A. Sali, Evaluation of comparative protein structure
modeling by MODELLER-3. Proteins, Suppl. 1:50-58, 1997.
2. B.R. Brooks, R.E. Bruccoleri, B.D. Olafson, D. J. States, S. Swaminathan,
and M. Karplus. CHARMM: A Program for Macromolecular Energy, Minimization, and
Dynamics Calculations, J. Comp. Chem., 4:187-217, 1983.
3. A. Janardhan and S. Vajda. Selecting near-native conformations in homology
modeling: The role of molecular mechanics and solvation terms. Protein
Science, 7:1772-1780, 1997.
4. Gatchell, D. Dennis, S., and Vajda, S. Discrimination of near-native
protein structures from misfolded models by empirical free energy functions.
Proteins. 41:518-534, 2000.
5. D. Eisenberg and A.D. McLachlan. Solvation energy in protein folding and
binding. Nature (London), 319:199-203, 1986.
SBauto , 382
number of submitted models: 74
Ajita Bhat
Michael J. Bower, Maxwell D. Cummings, Kristin K. Koretke, Andrei N. Lupas, Robert B. Russell, Craig Volker, Autumn L. Sutherlin
email: bowermj@mh.us.sbphrd.com
Sbauto
Summary
All CASP4 targets were submitted to the sensitive search routine program
SENSER, which is based on PSI-Blast and HMMer. SENSER runs through three
different search strategies, using PSI-Blast as its search engine, to identify
a relationship with a sequence of known structure. As soon as a fold is
identified, an alignment between the CASP target sequence and the sequence
with a known fold is generated using HMMer. If a relationship between the CASP
target and a sequence with a known structure was not identified, a prediction
of "novel fold" was submitted. Models were generated from the alignments with
as little human intervention as possible, using the ICM package, MODELLER as
implemented by MSI, and the MOE package from Chemical Computing Group.
Details
In the first step SENSER performs a PSI-Blast search with the target sequence.
Proteins identified in the search are divided into a significant sequence
space, containing those sequences with an E value lower than 10^-3, and a
'trailing end' of sequences between 10^-3 and 10. Because some of the proteins
detected may contain unrelated domains, all proteins are trimmed to the actual
region detected in the PSI-Blast run.
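The first step can be sketched as a simple partition of search hits by E value. The tuple layout for a hit, the 1-based inclusive match coordinates, and the function name are assumptions made for this illustration; the sketch does not reproduce SENSER itself or parse real PSI-Blast output.

```python
SIGNIFICANT_E = 1e-3   # hits below this E value form the significant sequence space
TRAILING_MAX_E = 10.0  # hits between 1e-3 and 10 form the 'trailing end'

def partition_hits(hits):
    """Split hits into (significant, trailing) lists and trim each sequence.

    hits: iterable of (seq_id, e_value, start, end, sequence), where start and
    end are assumed to be 1-based inclusive positions of the matched region.
    """
    significant, trailing = [], []
    for seq_id, e_value, start, end, sequence in hits:
        # Keep only the region actually detected, to drop unrelated domains.
        trimmed = sequence[start - 1:end]
        record = (seq_id, e_value, trimmed)
        if e_value < SIGNIFICANT_E:
            significant.append(record)
        elif e_value <= TRAILING_MAX_E:
            trailing.append(record)
        # Hits with E values above 10 are discarded.
    return significant, trailing
```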
In the second step transitive searches are used to expand the significant
sequence space. Only proteins within the significant sequence space that have
less than 25% identity to the target sequence are used as starting points for
further PSI-Blast searches, in order to avoid redundant searches, i.e. those
that produce similar profiles and sequence spaces. This value was chosen as it
is a frequently quoted threshold for the 'twilight zone', below which
sequences cannot be confidently said to be homologous.
In the third step trailing-end sequences are tested for their ability to
back-validate, i.e. detect any sequence of the significant sequence space of
the target in PSI-Blast. Because several PSI-Blast searches were performed to
establish the significant sequence space, trailing-end sequences are pooled
and ranked first by number of occurrences and second by E-value, before being
tested. If a trailing-end sequence back-validates, its significant sequence
space is added to that of the target. The process is then repeated until no
further sequences are detected.
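The pooling, ranking, and back-validation of trailing-end sequences might look like the sketch below. Several points are assumptions: the lowest E value is used when a sequence occurs in more than one search, the `search` callable is a hypothetical stand-in for a PSI-Blast run that returns the significant sequence space of a candidate, and the ranked candidate list is held fixed rather than re-pooled after each expansion.

```python
from collections import defaultdict

def rank_trailing(searches):
    """Pool trailing-end hits from several searches and rank them.

    searches: list of hit lists, each hit a (seq_id, e_value) pair.
    Ranking: first by number of occurrences (descending), then by E value
    (ascending, using the lowest E value seen for that sequence).
    """
    counts = defaultdict(int)
    best_e = {}
    for hits in searches:
        for seq_id, e_value in hits:
            counts[seq_id] += 1
            best_e[seq_id] = min(e_value, best_e.get(seq_id, e_value))
    return sorted(counts, key=lambda s: (-counts[s], best_e[s]))

def expand_significant_space(target_space, ranked_ids, search):
    """Add the space of every candidate that back-validates, until stable.

    search(seq_id) -> set of sequence ids in that candidate's significant
    sequence space (hypothetical callable).
    """
    space = set(target_space)
    changed = True
    while changed:
        changed = False
        for seq_id in ranked_ids:
            if seq_id in space:
                continue
            found = search(seq_id)
            if found & space:  # candidate detects a member of the target space
                space |= found | {seq_id}
                changed = True
    return space
```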
The steps above can connect proteins that are far apart in sequence space;
however, beyond the first PSI-Blast search, they do not directly provide an
alignment of the target to the sequences detected. Moreover, even for
sequences detected in the first step, PSI-Blast generally provides only
partial alignments. For these reasons, we introduced an alignment strategy
based on HMMer. After the first PSI-Blast search, we build a target HMM from
the proteins in the significant sequence space, as aligned by PSI-Blast. Any
sequence detected at this step is aligned to the target sequence using the
target HMM. Any sequence detected at a subsequent step is aligned in a
five-part process:
(1) a PSI-Blast search is run for the untrimmed sequence,
(2) a multiple alignment is extracted,
(3) this alignment is combined with the sequences of the target HMM to produce
a global alignment, using the target HMM as a template,
(4) a final HMM is built from this global alignment, and
(5) this HMM is used to align the detected sequence to the target.
In an effort to compare the quality of models generated by different
partially automatic methods, the single output alignment was used in up to
three different methods, these being MODELLER as implemented by MSI (Sali &
Blundell, J. Mol. Biol., 234: 779-815, 1993), ICM from MolSoft (Cardozo et
al., Proteins 23: 403-414, 1995), and the MOE package from the Chemical
Computing Group (http://www.chemcomp.com). Where these methods were applied
to the same target with a single alignment, we are able to compare methods
without the overwhelmingly confounding variable of different sequence
alignments.
In ICM, the amino acid sequence of the target was tethered to the parent
structure. Conformations of conserved residues were kept identical to those in
the parent. A Monte Carlo search was used to find the lowest energy
conformation of nonidentical residues. The conformations of insertions and
deletions were predicted using the MC method. The structure was minimized to
remove clashes and steric strain to obtain the final structure.
For MODELLER, the alignments were imported into InsightII/Homology. The
parent structure was loaded into Insight and the structure was linked to the
aligned sequence. This was submitted to MODELLER directly from Insight. Five
models were generated, and the model with the highest pdf score (lowest
-ln[pdf]) was chosen for submission.
In MOE, a homology model was built from the parent structures and alignments
using default methods, with library searching for insertions, deletions, and
loops, and ten output models. No minimization was done in the searching step.
The geometric average model, the default, was used as the final output
structure, except in cases with large insertion regions, where the best
intermediate model was used. Controlled minimization was performed on this
model using the AMBER89 forcefield.
SBfold , 381
number of submitted models: 68
Ajita Bhat
Michael J. Bower, Maxwell D. Cummings, Kristin K. Koretke, Andrei N. Lupas, Robert B. Russell, Craig Volker, Autumn L. Sutherlin
email: bowermj@mh.us.sbphrd.com
SBfold
Summary
All CASP4 targets were submitted to the sensitive search routine program
SENSER, as described in detail in the abstract for the SBauto submission.
Secondary structure predictions were gathered from the JPred server. Additional
sequence searches were done using regular expression patterns and HMMs. If a
protein of known structure appeared to match the properties of the target,
alignments were generated using MACAW or HMMer. Models were generated
from the alignments with as little human intervention as possible, using the ICM
package, MODELLER as implemented by MSI, and the MOE package from
Chemical Computing Group. Some models were also then generated or refined
using more manual methods.
Details
Details on the operation of SENSER are given in the SBauto abstract.
If SENSER identified a potential template structure, its match with the target was
evaluated using predicted secondary structure, the occurrence of sequence
patterns, and biochemical information. The alignment was generated using
MACAW or HMMer.
If SENSER did not identify a potential template structure, regular expression
patterns, predicted secondary structure, and biochemical information were used to
search for possible templates. In addition, in cases where the target was only a
fragment of a larger protein, the entire protein was used in sequence searches. If a
template was judged to match the properties of the target, an alignment was
produced using MACAW, HMMer, Clustal, or a combination of these methods,
to produce the alignment that seemed most plausible to us based on conserved
residues, hydrophobicity, and secondary structure.
In an effort to compare the quality of models generated by different
partially automatic methods, the single output alignment was used in up to three
different methods, these being MODELLER as implemented by MSI (Sali &
Blundell, J. Mol. Biol., 234: 779-815, 1993), ICM from MolSoft (Cardozo et al.,
Proteins 23: 403-414, 1995), and the MOE package from the Chemical
Computing Group (http://www.chemcomp.com). Where these methods were
applied to the same target with a single alignment, we are able to compare
methods without the overwhelmingly confounding variable of different sequence
alignments. Several of the models were also refined with the program SCWRL
(Bower et al., J. Mol. Biol., 267: 1268-1282, 1997) to replace sidechains. Other
models, with multiple parent structures, had a composite parent structure
constructed almost completely manually, before submission to MOE.
In ICM, the amino acid sequence of the target was tethered to the parent
structure. Conformations of conserved residues were kept identical to those in the
parent. A Monte Carlo search was used to find the lowest energy conformation of
nonidentical residues. The conformations of insertions and deletions were
predicted using the MC method. The structure was minimized to remove clashes
and steric strain to obtain the final structure.
For MODELLER, the alignments were imported into InsightII/Homology.
The parent structure was loaded into Insight and the structure was linked to the
aligned sequence. This was submitted to MODELLER directly from Insight.
Five models were generated, and the model with the highest pdf score (lowest
-ln[pdf]) was chosen for submission.
In MOE, a homology model was built from the parent structures and
alignments using default methods, with library searching for insertions, deletions,
and loops, and ten output models. No minimization was done in the searching
step. The geometric average model, the default, was used as the final output
structure, except in cases with large insertion regions, where the best
intermediate model was used. Controlled minimization was performed on
this model using the AMBER89 forcefield.
ORNL-PROSPECT , 088
number of submitted models: 215
D. Xu, O. H. Crawford, P. F. Locascio, and Y. Xu
email: xud@ornl.gov
We predicted and submitted structures for all the protein targets (43
in total) given by CASP4, mainly using our in-house threading program
PROSPECT (PROtein Structure Prediction and Evaluation Computer Toolkit)
[1]. PROSPECT is a computer program for finding an optimal alignment
between a protein sequence and a protein structural fold. Two unique
features of PROSPECT are (1) that it guarantees to find the globally
optimal sequence-structure alignment and does so in an efficient
manner, when using both alignment gap penalty and pairwise potential
between residues that are spatially close; and (2) that it guarantees
to find the globally-optimal alignment under various constraints on the
unknown protein specified by the user [2]. Currently PROSPECT allows
the following types of constraints: (a) disulfide bonds between
specified residues; (b) active sites involving a specified set of
residues; (c) long-range NOE (nuclear Overhauser effects) restraints
from NMR experiments; (d) secondary structures predicted by computer
programs or determined from chemical shift data of NMR experiments; and
(e) position-dependent profile based on multiple sequence alignment.
Significant improvements on PROSPECT have been made since CASP3 [3].
One of the main recent technical developments on PROSPECT is a
capability of assessing its prediction reliability for each threading
alignment through mapping each threading score to a value in the
range of [0, 1]. The closer a mapped score is to 1, the higher the
probability that the prediction gives a correct fold recognition and
a better sequence-fold alignment. We have trained a neural network to
accomplish this mapping. A number of parameters are used to help
accomplish such a mapping. These parameters include information
about the overall scoring distribution of each structure template,
the lengths of the query sequence and template protein, and a few
others. To train the neural net, we have used the following objective
function (which defines what value each threading score is desired
to map to). For each query-fold pair, we calculate the number of
structurally alignable residues between them defined by the SARF
program [4]. Then for each query-fold pair, the two proteins are
considered to be a "true" pair if their FSSP indices [5] share
the same first digit, i.e., they belong to the same fold family.
We then divide all pairs into bins, based on the numbers of their
alignable residues, e.g., the first bin consists of pairs
with 0-9 alignable residues, the second bin consists of pairs with
10-19 alignable residues, and so on. Then we calculate the frequency
of true pairs for each bin. That number is used as the desired value
to which each threading score is mapped. We have collected the statistics
on a training set consisting of 17,000+ false pairs and 708 true pairs.
Highly encouraging results have been achieved using this neural net
method to normalize threading scores, which played a significant
role in our CASP4 predictions.
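The construction of the training targets described above can be sketched as follows. The input layout (one record per query-fold pair, holding the number of structurally alignable residues and a flag for whether the pair is "true") and the function names are assumptions for illustration; computing the alignable residues with SARF and the true/false labels from FSSP is outside the scope of this sketch.

```python
from collections import defaultdict

def bin_true_frequencies(pairs, bin_width=10):
    """Return {bin_index: fraction of true pairs in that bin}.

    pairs: iterable of (n_alignable, is_true). Bin 0 holds 0-9 alignable
    residues, bin 1 holds 10-19, and so on.
    """
    totals = defaultdict(int)
    trues = defaultdict(int)
    for n_alignable, is_true in pairs:
        b = n_alignable // bin_width
        totals[b] += 1
        if is_true:
            trues[b] += 1
    return {b: trues[b] / totals[b] for b in totals}

def target_value(n_alignable, freqs, bin_width=10):
    """Desired network output for a pair with n_alignable residues.

    Bins that were never observed default to 0.0 (an assumption).
    """
    return freqs.get(n_alignable // bin_width, 0.0)
```

Each threading score in the training set is then paired with the target value of its bin, so the network learns to output an estimate of the probability that the pair is a true fold match.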
PROSPECT has been implemented on an IBM/SP3 supercomputer at Oak Ridge
National Laboratory, which currently has 184 nodes. The code distributes
alignments between a query sequence and different templates to different
nodes, and basically achieves a linear speedup. The program now has the
capability of threading ~100 protein sequences against our template database
of 2000+ structures per day. Over 4 million sequence-structure alignments
were calculated on the supercomputer for the CASP4 predictions and the
data used in the neural network training described above. A Web
interface has been built for the PROSPECT program on the supercomputer, which
is available at http://compbio.ornl.gov/structure/prospect_server/.
The server takes the input of a sequence and an email address. When
PROSPECT finishes the prediction, it sends an email back with the
information about the URL where the predictions are stored. A user can
view the alignments between the query protein and templates, as well as
the 3-D display of the templates if the user's Web browser has a plug-in
of PDB viewer.
A standard protocol for predicting each CASP4 target is described as
follows. We first ran PSI-BLAST [6] to see if there is
any obvious homolog in PDB. When there is no homolog in PDB, or the
alignment is ambiguous, we carried out threading using PROSPECT.
Secondary structures were predicted using PHD [7] as possible inputs
for threading. We also collected structural or functional information
in the SWISS-PROT database and MedLine. The information was used as
potential constraints during the threading process. Based on the neural
network assessment, we chose 5 templates to build 5 models, and assigned
a confidence level to each model. The atomic structures were constructed
by MODELLER [8]. Ten structures were generated for each alignment. We
then used structure assessment tools, such as WHATIF [9] to evaluate
the packing and backbone conformations, the inside/outside occupancies
of hydrophobic and hydrophilic residues, and stereochemical quality of
the predicted structure. We also checked the consistency between the
predicted secondary structures and secondary structure assignments of
the predicted structure. Based on the structure assessment, we picked the
best one among the ten structures for each model. In case none of the
ten structures was acceptable, we adjusted the alignment and rebuilt
the model. We also visually checked the structures and tuned the
alignment if necessary. When the neural network assessment did not give
a high confidence level, we also used structure assessments to rank the 5
models.
References
[1] Y. Xu and D. Xu. Protein threading using PROSPECT: Design and
evaluation. Proteins: Structure, Function, and Genetics. 40:343-354.
2000.
[2] Y. Xu, D. Xu. O. H. Crawford, and J. R. Einstein. A computational
method for NMR-constrained protein threading. Journal of Computational
Biology. 7:449-467. 2000.
[3] Y. Xu, D. Xu, O. H. Crawford, J. R. Einstein, F. Larimer, E. C.
Uberbacher, M. A. Unseren, and G. Zhang. Protein threading by PROSPECT:
a prediction experiment in CASP3. Protein Engineering. 12:899-907, 1999.
[4] N. N. Alexandrov. SARFing the PDB. Protein Engineering. 9:727-732. 1996.
[5] L. Holm and C. Sander. Mapping the protein universe. Science.
273:595-602, 1996.
[6] S. F. Altschul, T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W.
Miller, and D. J. Lipman. Gapped BLAST and PSI-BLAST: a new generation
of protein database search programs. Nucleic Acids Research.
25:3389-3402, 1997.
[7] B. Rost and C. Sander. Prediction of protein secondary structure at
better than 70% accuracy. Journal of Molecular Biology. 232:584-599,
1993.
[8] A. Sali and T. L. Blundell. Comparative protein modelling by
satisfaction of spatial restraints. Journal of Molecular Biology.
234:779-815, 1993.
[9] G. Vriend. WHAT IF: a molecular modelling and drug design program.
Journal of Molecular Graphics. 8:52-56, 1990.
Braun-UTMB , 223
number of submitted models: 104
PREDICTION. COMPETITION (CASP4):
V.S. Mathura, K.V. Soman, C.H. Schein, Y. Xu, and W. Braun
Sealy Center for Structural Biology, Department of Human Biological Chemistry & Genetics, University of Texas Medical Branch, Galveston, TX 77555-1157.
email: venkat@bohr.utmb.edu
The Human Genome Project has revealed many proteins of unknown
function. Classification of these sequences can best be done by
accurate prediction of their structures, and concurrent assignment to
families of known function. We have developed a set of tools for
homology modeling of proteins(1,2), based on self-correcting distance
geometry (DIAMOD)(3,4,5), multiple sequence alignment (MASIA(6)) and
energy minimization (FANTOM(7)), that can be used even when the identity
to the target is very low(8) (30% or less(9)). CASP4 provided us with an
opportunity to evaluate our methods impartially and objectively. We
submitted a total of 100 models for 27 of the 43 targets, with 15 based
on sequence homology. Models for five targets were generated ab initio.
The rest used a combination of fold recognition with multiple alignment
to improve the sequence register between the target and selected
template.
Homology or comparative modeling (CM)
When a suitable template was identified in the Protein Data Bank for a
target, our comparative modeling procedure was to: (1) Align the target
sequence with one or more template sequences using the program CLUSTALW
or alignments suggested by the fold recognition servers (CAFASP) with
minimal manual adjustment; (2) extract distance and dihedral constraints
with our in-house program EXDIS; (3) build initial models with DIAMOD;
and (4) energy minimize using the FANTOM program. For T90, a consensus
alignment was prepared manually from the 3D-PSSM, BIOINBGU, FUGUE,
GENTHREADER, and Karplus HMM98 and SAM99 results. FANTOM energy
contributions and exposed apolar surface areas calculated with the
program GETAREA were used for ranking multiple models for the same
target. Where information was available for important residues in the
template, such as those within the active site or areas of substrate
binding, we compared their location visually in the model structure.
Fold recognition (FR)
When there was not high enough sequence homology with any protein of
known structure, threading (fold recognition) was attempted, using the
web servers mentioned above and others (PSI-BLAST, 123D and FFAS). Where
several methods suggested the same template, a consensus alignment was
prepared manually. Manual corrections/adjustments were also used to
ensure that secondary structures and active site or other critical
residues were aligned. For T91, an alignment from 3D-PSSM was manually
edited to improve the sequence alignment. We also used multiple sequence
alignment of protein families where a fold seemed clear cut. For
example, fold recognition identified T88 as a probable Greek key fold
and selected yeast killer toxin (1wkt) as a template. Another template
structure, 1A45, that more closely resembled T88, was selected from a
multiple alignment with 57 β/γ-crystallins. The indicated gapping
pattern from the multiple alignment was used to generate a model.
Ab initio modeling
When a suitable template could not be identified based on homology or
threading, but there were clear indications of conserved secondary
structure elements based on sequence alignments with related proteins,
we prepared ab initio models. The steps for generating ab initio models
for T88, T91, T97, T104, and T106 were: (1) Predict secondary structures
and exposed/buried residues of the protein from aligned sequences with
JPRED and MASIA; (2) convert this information into distance and dihedral
angle constraints using the program TRANSLATE; (3) add other constraints
derived from any available experimental data for the protein; (4) build
models from constraints with DIAMOD; (5) refine initial models by energy
minimization with FANTOM. We also submitted models based on fold recognition
methods for T88 and T91.
Ab initio constraints were used in several other models where
appropriate. Disulfide bond constraints were added during the modeling
of T123 and T125. In another example, for T86, a monomer, a trimeric
template of very low identity was identified based on functional
similarity and conservation of key active site residues. A multiple
alignment with target homologs was used to place probable gaps between
the template and target sequences and constraints were extracted from
the template according to our usual methods. Ab initio constraints were
added at the C-terminal to replace inter-subunit contacts present in the
trimer.
Multiple alignments help in FR and CM
We combined these techniques in preparing alignments where the identity
between the target and template was very low (such as T86 and T88), when
the target had a clear sequence relationship to several templates, or
when several sequences related to the target were known. For T101, which
had about the same degree of sequence identity/similarity to 6 known
protein structures (12-18%), a CLUSTALW multiple alignment of related
proteins of the pectate/pectin lyase family was used. This agreed with
the DALI alignment of the pectate lyases but not of a structurally
related protein, chondroitinase (1DBG). We made models based on the B.
subtilis pectate lyase (1BN8) using the multiple alignment to adjust
gapping. Other models were based on the fold recognition results for
1DBG (where there was no real consensus for most of the protein).
In keeping with our efforts to use genomic data efficiently in modeling,
we used the homologous sequences available for templates or targets to
improve the alignment.
For T118, PDB-BLAST detected similarity of the C-terminal with 1DDQ-A.
The 1DDQ-A sequence and related bacterial and fungal polymerase
alpha-factors were aligned with T118 to obtain the gapping used in the
submitted alignment. PDB-BLAST also recognized a weak pattern of
identity between T126 and 1DMS and 1EG9. Individual multiple alignments
of T126 with other olfactory factors and these templates were used to
generate the alignments submitted.
1 Soman, K.V., Midoro-Horiuti, T., Ferreon, J.C., Goldblum, R.M.,
Brooks, E.G., Kurosky, A., Braun, W. and Schein, C.H. (2000) Biophysical
Journal 79:1601-1609
2 Soman, K.V., Schein, C.H., Zhu, H. and Braun, W.A. (2000) Homology
Modeling and Simulations of Nuclease Structures. In Methods in Molecular
Biology (Humana Press, Totowa, N.J.; editor C.H. Schein) 160(in press
for December, 2000).
3 Zhu, H., Schein, C.H. and Braun, W. (1999) J. Mol. Modeling 5, 302-316.
4 Mumenthaler, Ch. and Braun, W. (1995) Protein Science 4, 863-871
5 Zhu, H. and Braun, W. Protein Sci. 1999, 8, 326-342
6 Zhu, H., Schein, C.H. and Braun, W. (2000) MASIA: a program to recognize
common patterns and properties in multiple aligned protein sequences.
Bioinformatics 16: in press
7 Fraczkiewicz, R. and Braun, W. (1998) J. Comp. Chem. 19, 319-333.
8 Mumenthaler, Ch., Schneider, U., Buchholz, Ch.J., Koller, D., Braun,
W. and Cattaneo, R. (1997) Protein Sci. 6, 588-597.
9 Buchholz, C.J., Koller, D., Devaux, P., Mumenthaler, Ch.,
Schneider-Shaulis, J., Braun, W., Gerlier, D. and Cattaneo, R. (1997).
J. Biol. Chem. 272, 22072-22079
baker , 354
number of submitted models: 174
Dylan Chivian, Carol Rohl, Charlie EM Strauss, David Baker
email: chivian@u.washington.edu
We describe a method for comparative modeling that may accomplish
complete models of protein domains even in situations where homolog coverage
is incomplete. Models are generated using portions of homologs, ab initio
loops, and larger ab initio regions.
We begin by selecting a putative homolog, either a PSI-BLAST hit,
a consensus CAFASP result, or a representative from a SCOP class implied
by any available functional information. Sequence substitution profiles are
generated for both query and homolog using PSI-BLAST against a non-redundant
database. Additionally, we obtain predicted secondary structure from PSIPRED,
PHD, or SAM-T99. Lastly, family members of the putative homolog in the FSSP
database are used to determine obligate portions of the template topology.
An alignment is then generated using weighting from the secondary structure
(in the fashion of Fischer and Eisenberg) and profile-profile alignment
(using the approach of Godzik), constraints from the FSSP consensus
structural elements, and gap penalties that take into account the probability
the query residues are able to span the spatial gaps in the template.
The query sequence is then mapped onto the coordinates of the homolog.
Regions of high confidence alignment are treated as a fixed template,
and the coordinates of template regions are not varied from those of the
homologous structure. Gaps, insertions, and regions of low confidence
alignment are treated as variable loops. Models for loop regions
are built using a combination of database searching and a modified
version of the Rosetta ab initio structure prediction algorithm ([1], and see
ab initio Rosetta abstract). A library of initial loop conformations is
extracted from the protein structure database using a method similar to that
previously described for generating fragment libraries. Criteria for
evaluating loop conformations include sequence profile-profile similarity and
secondary structure similarity over the loop region, as well as similarity
between secondary structure, torsion angles, and end-to-end distance of the
adjacent template regions. Initial loop conformations selected randomly from
the library are built onto the fixed template. Co-factors and ligands present
in the homolog structure are included in the fixed template coordinates.
Variable N- and C- terminal regions are initially built as extended chains.
Variable termini and loops greater than seven residues in length are then
subjected to a Monte Carlo simulated annealing conformational search
using a move set of three and nine residue fragments culled from the
protein following a modified Rosetta protocol. Following fragment
insertions, random small changes in torsion angles at single sites
are made, and a term for loop closure is added to the potential
function. Loops seven residues in length and less are only subjected
to these random angle changes. Multiple independent simulations for
each template are carried out. From the family of resulting models,
loop conformations that did not close or which resulted in knots are
discarded. A final set of models is then generated by permuting the
remaining loop conformations.
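The final filtering and assembly step can be sketched as below, reading "permuting the remaining loop conformations" as taking every combination of one surviving conformation per loop. That reading, the dictionary layout of a loop candidate, and the function names are assumptions for illustration and are not taken from the Rosetta code.

```python
from itertools import product

FRAGMENT_INSERTION_MIN_LENGTH = 8  # loops longer than seven residues

def search_strategy(loop_length):
    """Choose the conformational search applied to a variable region."""
    if loop_length >= FRAGMENT_INSERTION_MIN_LENGTH:
        return "fragment_insertion_then_small_moves"
    return "small_moves_only"

def assemble_models(loop_candidates):
    """Combine surviving loop conformations into full models.

    loop_candidates: {loop_name: [{"id": str, "closed": bool, "knotted": bool}]}
    Returns a list of models, each a {loop_name: conformation_id} mapping.
    """
    kept = {
        name: [c["id"] for c in cands if c["closed"] and not c["knotted"]]
        for name, cands in loop_candidates.items()
    }
    names = sorted(kept)
    # One conformation per loop, over all combinations.
    return [dict(zip(names, combo)) for combo in product(*(kept[n] for n in names))]
```

If any loop has no surviving conformation, the product is empty and no model is produced for that template, which signals that more simulations are needed for that loop.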
Superior models are selected using clustering, the full atom
potential, and visual inspection as described for the pure ab initio
predictions. Side chains are added to the five models using simulated
annealing, the full atom potential, and the Dunbrack backbone dependent
rotamer library [2].
[1] Simons, K., Bonneau, R., Ruczinski, I., Baker, D. (1999) Ab initio structure
prediction of CASP III targets using ROSETTA. Proteins Suppl 3:171-176
[2] Kuhlman, B. & Baker, D. (2000) Native protein sequences are close to
optimal for their structures. PNAS 97(19):10383-10388
Honig-Barry , 042
number of submitted models: 62
B. Al-Lazikani, J. Jung, L. Xie, B. Honig
email: bhcasp@flash62.bioc.columbia.edu
The structure prediction protocol we used during CASP4 participation involved
the use of a combination of methods, some of which are publicly available and
some of which were developed in our lab. In addition, we often relied heavily
on insights or hints derived from research in the biological literature. The
first step in the evaluation of all targets was to check the query sequences
on the CAFASP server [1] and run through the PrISM[2] program that was
developed in our lab. Essentially the same protocol was followed for both
comparative modeling and fold recognition targets. If one or more possible
templates were identified, sequence alignments were obtained from a variety of
standard alignment methods or through PrISM's sequence to structure alignment
algorithm. In some cases alignments were modified manually and, in general,
models were built using a number of different alignments. Models were built
both with MODELLER[3] and with PrISM. For a given target, the use of
different modeling programs, different alignments and different templates
resulted in a variety of possible trial structures. Loops and side chains
were built for each model using new software[4] developed for this purpose in
our lab. The side chain program uses an extensive rotamer library that is
based on cartesian coordinates rather than dihedral angles. The loop modeling
procedure includes a novel method of accounting for the shape of the free
energy basin in a particular loop. A number of criteria were used to choose
the best model. We relied heavily on Verify3D[5] and on a novel physical
chemical based scoring function developed in our lab[6]. This is the first
function of its kind to account for electrostatic optimization in folded
proteins. In a number of cases we used functional features to choose among
models. Specifically, the electrostatic properties of the protein surface were
described with the GRASP[7] program and were correlated with any functional
information available about the query sequence. The predictions that we
submitted were primarily based on the results obtained with our scoring
function, but in a number of cases we allowed literature information to bias
our answer.
If no template could be identified with any confidence with the methods we
used, we created a subset of possible templates based on guesses derived from
reading the literature. In such cases the sequence to structure alignment
algorithm implemented in PrISM was exclusively used to align the target
sequence to the template. In some cases, especially in fold recognition
targets, segments in the alignment were manually adjusted based on the
conserved sequence and structure motifs. The same evaluation procedures
described in the previous paragraph were used to make the final prediction.
1. CAFASP. http://cafasp.bioinfo.pl/
2. A. S. Yang and B. Honig. J. Mol. Biol. 301, 665-678, 2000
3. A. Sali and T. L. Blundell. J. Mol. Biol. 234, 779-815, 1993
4. Z. Xiang and B. Honig. To be submitted.
5. J. U. Bowie, R. Luthy and D. Eisenberg. Science 253(5016), 164-170, 1991
6. D. Petrey and B. Honig. Prot. Sci. (in press).
7. A. Nicholls and B. Honig. Science 268(5214), 1144-1149, 1995
YASARA , 465
number of submitted models: 21
Elmar Krieger(1) and Gert Vriend(2)
Center for Molecular and Biomolecular Informatics, Toernooiveld 1, NL - 6525 ED Nijmegen, the Netherlands
(1) www.cmbi.nl/staff/elmar.shtml and www.yasara.com
(2) www.cmbi.nl/gv and www.cmbi.nl/whatif
email: elmar@cmbi.kun.nl
Genetic algorithms have been known for many years, and after the publication by
Unger and Moult in 1993 [1], they became a standard tool for protein fold
prediction. These algorithms are so called because they utilize the same
optimization procedures as natural evolution: mutation, crossover and selection
[2]. We added a new idea to these original concepts: genetic engineering.
The flow of the procedure is: 1) collect alignments from a series of programs; 2)
build models using WHAT IF [3]; 3) determine their quality with
WHAT_CHECK [4]; 4) use the newly developed program YASARA [5] to
combine the good parts of the models and run molecular dynamics simulations;
5) iterate steps 3 and 4 as part of a transgenic algorithm.
The program suite does not distinguish between ab initio protein folding,
threading or homology modeling. Here we concentrate on the latter topic.
We have implemented a genetic algorithm (GA) based protein modelling protocol.
In this GA, the genotypes are sets of inter-atomic distances, and the
phenotypes are the corresponding three-dimensional coordinates. Pairs of
models are combined (mating) to determine a set of distances (genotype or
genome) that describes the child. YASARA uses the inter-atomic distances,
generates three-dimensional coordinates (phenotype) and performs a molecular
dynamics run that mimics the "life" of the molecule. The child is then added
to the ensemble of models (gene pool). Normally, a genetic algorithm uses the
number of offspring (often encoded in an energy term) as the only
quality measure of a genome. Our GA, however, does not follow the normal
evolutionary principles because the inter-atomic distances (genome) of the
child protein are selected based on the 'quality' of the corresponding
residues in the parent molecules. In real evolution it is, of course, not
known which traits of the parents are good and which are bad, but the
WHAT_CHECK structure validation software gives us this knowledge, which we use
in a genetic-engineering-like manner to derive a new genotype. Therefore, a
better name for our GA would be TA, for "transgenic algorithm".
The whole protocol is a three-step procedure: 1) initiation; 2) optimization
using the transgenic algorithm (TA); 3) termination.
Initiation. A series of publicly accessible threading and sequence alignment
programs is used to generate possible alignments between multiple templates and
the model sequence: GenThreader [6], 3D-PSSM [7], BIOINBGU [8] and Smith-Waterman
running on a Compugen Bioccelerator. In the latter case, the newly developed
program SecMatch filters out those alignments in which the PSI-PRED secondary
structure prediction [9] for the model sequence disagrees too much with the observed
secondary structure in the template.
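The filtering idea behind SecMatch can be sketched as follows. This is only an illustration: it assumes three-state secondary structure strings (H, E, C) that are already aligned position by position, and the agreement cutoff of 0.6 and the function names are hypothetical, since the actual SecMatch criterion is not given in this abstract.

```python
def ss_agreement(predicted: str, observed: str) -> float:
    """Fraction of aligned positions whose secondary structure states match.

    Columns where either string has a gap ('-') are ignored.
    """
    if len(predicted) != len(observed):
        raise ValueError("aligned strings must have equal length")
    pairs = [(p, o) for p, o in zip(predicted, observed) if p != "-" and o != "-"]
    if not pairs:
        return 0.0
    return sum(p == o for p, o in pairs) / len(pairs)


def filter_alignments(alignments, min_agreement=0.6):
    """Keep the names of alignments whose agreement reaches the cutoff.

    `alignments` is a list of (name, predicted_ss, observed_ss) tuples.
    The 0.6 cutoff is an illustrative assumption, not a published value.
    """
    return [name for name, pred, obs in alignments
            if ss_agreement(pred, obs) >= min_agreement]


candidates = [
    ("template_a", "HHHHCCEEEE", "HHHHCCEEEC"),  # 9 of 10 positions agree
    ("template_b", "HHHHCCEEEE", "EEEECCHHHH"),  # 2 of 10 positions agree
]
print(filter_alignments(candidates))
```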
The program WHAT IF then builds one model for each alignment. A short energy
minimization is performed by the interactive real-time molecular dynamics program
YASARA.
The transgenic optimization cycle. The program WHAT_CHECK is used to
determine quality parameters for each model and for each residue, which are
then converted to a residue-specific fitness matrix (see fig 1) by the newly
written program WHAT_MODELBASE. A per-model score is calculated, and pairs of
models are selected for mating. The higher the per-model score, the higher the
chance that the model mates. YASARA uses the fitness matrix to judge the parent
models. In a process called impression or reverse expression, it then derives a
genome for each parent, that - when expressed - reproduces the good aspects of
the phenotype (i.e. the parent model) but omits the bad ones. At this point,
features acquired during "lifetime" (the MD simulations) can be propagated
back to the genome, in a sense reviving Lamarck's ideas. Mating is then also
done in a non-standard way, as the probabilities of crossovers at certain
genes depend on their quality (i.e. the quality of the associated residues).
Finally, YASARA converts the atomic distances stored in the child's genome
back to a three-dimensional phenotype (gene expression), performs a molecular
dynamics run of a few picoseconds, and adds the structure with the lowest
energy to the gene pool. The distance restraints stored in the genome remain
active during the entire MD simulation; residues that get a high score in the
fitness matrix are restrained more tightly. A mutation in TA terms is
implemented as the release of the restraints during the MD run. Operations
like shortening or extending secondary structure elements or shifting
β-strands along each other form special classes of mutations.
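The two selection ideas in this cycle, score-proportional choice of mating partners and quality-dependent inheritance of genes, can be sketched as below. This is a simplified illustration rather than the YASARA implementation: the genome is reduced to one value per residue instead of a full set of inter-atomic distances, and the function names and the probability rule q_a / (q_a + q_b) are assumptions.

```python
import random


def select_parents(model_scores, rng):
    """Pick two distinct models with probability proportional to their score."""
    names = list(model_scores)
    first = rng.choices(names, weights=[model_scores[n] for n in names], k=1)[0]
    rest = [n for n in names if n != first]
    second = rng.choices(rest, weights=[model_scores[n] for n in rest], k=1)[0]
    return first, second


def quality_crossover(genome_a, quality_a, genome_b, quality_b, rng):
    """Build a child genome gene by gene.

    Each gene is taken from parent A with probability
    quality_a / (quality_a + quality_b), so well-scored residues are
    inherited more often. A zero total quality falls back to 50/50.
    """
    child = []
    for gene_a, q_a, gene_b, q_b in zip(genome_a, quality_a, genome_b, quality_b):
        total = q_a + q_b
        p_a = 0.5 if total == 0 else q_a / total
        child.append(gene_a if rng.random() < p_a else gene_b)
    return child


rng = random.Random(0)
scores = {"model_1": 0.74, "model_2": 0.55, "model_3": 0.31}  # hypothetical scores
parent_a, parent_b = select_parents(scores, rng)
# Extreme qualities (9 versus 0) make the inheritance deterministic here.
child = quality_crossover([3.8, 5.1, 6.0], [9, 0, 9],
                          [4.2, 5.5, 6.4], [0, 9, 0], rng)
print(parent_a, parent_b, child)
```

With per-residue qualities taken from a fitness matrix, the child keeps mostly the well-validated parts of each parent, which is the "genetic engineering" step that distinguishes the TA from a conventional GA.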
Termination. The TA cycle is run for a few days on about 30 PCs. All software
used in this project is part of a screen-saver, so that the PCs at the desks
of our colleagues can be used in a non-obtrusive way. The TA run is terminated
when the score of the best molecule has not improved for two days, or when
the CASP deadline is near, whichever comes first.
We submitted five models to CASP4 and tried several minor variations in the
protocol. For example, in one target we kept most of the backbone constrained,
in another model the same was also true for the side chains of the conserved
residues, and in three targets no constraints were used. Besides the sequence
alignment software, the full protocol consists of seven programs. A full
description of these programs and the parameters used will be published after
the CASP4 meeting.
1. R.Unger, J.Moult (1993) J.Mol.Biol. 231(1): 75-81
2. J.H.Holland (1975) Adaptation in Natural and Artificial Systems, The
University of Michigan Press
3. G.Vriend (1990) J. Mol. Graph. 8: 52-56.
4. R.W.W.Hooft, G.Vriend, C.Sander, E.E.Abola (1996) Nature 381:272-272.
5. E.Krieger, prepared for submission
6. D.T.Jones (1999) J. Mol. Biol. 287: 797-815
7. L.A.Kelley, R.M.MacCallum and M.J.E Sternberg (2000) J. Mol. Biol. 299(2):
501-522
8. D.Fischer, Pacific Symp. Biocomputing, Hawaii, 119-130, January 2000.
9. D.T.Jones (1999) J. Mol. Biol. 292: 195-202.
// 1crn.cdb - WHAT_MODELBASE Quality Check
Sequence : TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN|
Cryst-Cont : ++++ + + +++++++ + ++ + + + +++ | 0.4783
Access : 4100347612540561477626355017624034295783476263| 0.4285
Quality : 0.7425
Bonds : 9899897999989989999998698897979999892979999999| 0.9246
Angles : 2699996999908089899798899898679996696737996929| 0.7963
Torsions : ?64644647678687546352666375655764644423755543?| 0.5475
Phi/psi : ?45545655666665434455646453453654644433754335?| 0.4937
Planarity : 9999999999999999999999999999999999999999999999| 1.0000
Chirality : 9899999999999999999999999999999999999999999979| 0.9901
Backbone : ??999999999999999919799999999996999986999999??| 0.9476
Peptide-Pl : ??9999999999999999?9?9999999999?8999??999999??| 0.9944
Rotamer : ??559959954767969699?9699899799?9849999799899?| 0.8573
Chi-1/chi-2: 4444955494574644459949693494549546599944994495| 0.6341
Bumps : 9999909099999999904990999999099999999999999999| 0.7159
Packing 1 : 6489397666677654013238656653014663314314631213| 0.4651
Packing 2 : 9466653456674655444547434744336843142536653532| 0.4934
In/out : 9999999999999999999999999999999999999999999999| 0.9998
H-Bonds : 9999979799999999999999999999999999999999999979| 0.9837
Flips : 9999999999999999999999999999999999999999999999| 1.0000
Fig. 1: Residue-specific fitness matrix for crambin (1CRN). 16 different
quality measures (from "Bonds" to "Flips") are considered by the transgenic
folding algorithm. Quality ranges from 0 (bad) through 5 (average/normal) to 9
(perfect).
A detailed description of the various quality indicators can be found at
http://www.cmbi.kun.nl/gv/checkhelp/
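A row of the matrix in Fig. 1 can be read into per-residue scores with a few lines of code. The sketch below assumes the row layout shown in the figure (label, colon, one character per residue, then '|' and a summary number); the function names are hypothetical, and the summary value is passed through as printed because the abstract does not say how WHAT_MODELBASE derives it.

```python
def parse_row(line: str):
    """Split one matrix row into (label, per-residue scores, row summary).

    Digits become integers and '?' (not evaluated) becomes None. The number
    after '|' is returned as given and is not recomputed here.
    """
    label, rest = line.split(":", 1)
    body, _, summary = rest.partition("|")
    scores = [int(ch) if ch.isdigit() else None for ch in body.strip()]
    return label.strip(), scores, float(summary) if summary.strip() else None


def weakest_residues(scores, cutoff=5):
    """Return 1-based positions whose score is below the cutoff."""
    return [i for i, s in enumerate(scores, start=1)
            if s is not None and s < cutoff]


row = "Bonds : 9899897999989989999998698897979999892979999999| 0.9246"
label, scores, summary = parse_row(row)
print(label, len(scores), summary, weakest_residues(scores))
```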
SBI-GR , 457
number of submitted models: 79
G. Raghunathan
email: raghu@strubix.com
We have used a combination of heuristic and automated procedures for
generating models submitted for CASP4. Whenever possible multiple
alignments among family members were used to arrive at conserved
functional residues. For low-homology structures, a combination of
hidden Markov model methods and PSI-BLAST was used to obtain remote
PDB homologs. Next, alignments of target sequence and
templates were fine tuned so as to maximally align critical
functional residues, such as phosphorylation sites, glycosylation
sites, active sites, substrate binding sites and metal ion binding
sites. The above considerations were mostly derived from sequence
information and sometimes from template structure, or a combination
of both. For sequences that have low homology, predicted secondary
structure was often useful for both template selection and alignment.
The preservation of secondary structures and constraints of functional
residues and disulfides were used to effectively capture local and
nonbonded interactions. Several criteria were used for evaluating
multiple models and for choosing the best scoring structures. These
included several stereochemical and structural criteria, such as
the number of outliers in the Ramachandran map, peptide bond
distortions, side chain rotamers, buried charges, cavities, the
fraction of donors and acceptors that were not hydrogen bonded,
exposed hydrophobic surface areas and consistency of secondary
structures in the model with predictions based on sequence.
Wolynes , 032
number of submitted models: 68
Michael C. Prentiss, Zaida Luthey-Schulten, and Peter G. Wolynes
email: mprentis@uiuc.edu
Our scaffolds in the comparative modeling category of CASP4
were identified for targets with over 25% sequence identity
using standard Blast and Psi-Blast searches. With a scaffold
identified, we created alignments using a threading energy
function and a physically correct energy function that prevents
gaps from being placed in the middle of a secondary structure
element [1]. When multiple scaffolds were found, we used these
proteins as memories for our molecular dynamics program to
fold these structures and average their differences [2]. To
verify the alignments, we identified and aligned any known
sequence signatures from Blocks or Prosite in the target and
the scaffold. We also used any information from the literature
that would help to verify the alignment. Our final predictions
were refined with Modeller using our alignments, and the
scaffolds. The multiple structures generated from Modeller
were further tested by our threading energy function, and
discarded if they violated any physical characteristics of
protein structure.
1. K. K. Koretke, Z. Luthey-Schulten, and P. G. Wolynes.
"Self-consistently optimized statistical mechanical
energy functions for sequence structure alignment".
Protein Science, 5: 1043-1059,1996.
2. K. K. Koretke, Z. Luthey-Schulten, and P. G. Wolynes.
"Self-consistently optimized energy functions for
protein structure prediction by molecular dynamics".
PNAS, 95:2932-2937,1998.
MSI , 447
number of submitted models: 29
by automated methods and manual inspection.
Carol M. Gorst, Lisa Yan and David Edwards
email: cgorst@msi.com
For homology modeling targets, the template structures used to create the
models were identified by a variety of means. Sequences corresponding to the
current PDB were searched using FASTA (Pearson, 1998), BLAST (Altschul et
al., 1990) or SeqFold (Fischer and Eisenberg, 1997, Olszewski et al. 1999).
After selection of an appropriate template or templates, models were built
using the program MODELER (Sali and Blundell, 1993) as implemented in the
insightII 2000 interface. Several models were generated with different
alignments, using the templates either together or separately.
Sequence alignments used in creating homology models were created using the
following alignment methods. If there were multiple template structures
available the first step was to align the template sequences based on their
structural similarity or sequence similarity. The template sequence alignment
based on structural similarity was generated by comparing the C-α distance
matrix using the structure alignment methods in insightII (Homology, 2000) or
the malign3D command in Modeler5. Multiple sequence alignment based on
sequence similarity was created using both the ClustalW (Thompson et al. 1994)
and multiple sequence alignment algorithms available in insightII
(Homology,2000). In addition a modified ClustalW alignment implemented in
insightII (Align 123, Homology, 2000) that took secondary structure into
account was used.
The model sequence was aligned to the template(s) using either a pair wise
global alignment algorithm in insightII or using a structure enhanced global
alignment algorithm (Align2D) provided in MODELER5.
In some instances, all sequences, including templates and model, were instead
aligned in one step using the multiple sequence alignment method from
ClustalW or its modified version (Align 123, Homology, 2000).
Based on the alignments generated using the above methods, multiple models
were created using MODELER. Models were selected according to model evaluation
scores calculated by Profiles-3D verify (Luethy et al. 1992) in insightII.
The models with the highest Profiles-3D verify scores were submitted to CASP4.
All tools utilized herein were accessed and utilized as implemented in
insightII-2000, Molecular Simulations Inc., www.msi.com.
References:
Pearson R. W. (1998) Proc. Natl. Acad. Sci USA 85, 2444
Altschul S. F., Gish W., Miller W., Myers E. W. and Lipman D. J., (1990) J.
Mol. Biol. 215, 403
Fischer D. and Eisenberg D. (1997) Proc. Natl. Acad. Sci. USA, 94, 11929
Olszewski, K. A., Yan L., Edwards D. J., (1999) Theor. Chem. Acc. 111, 57
Sali A., Blundell T. L., (1993) J. Mol. Biol. 234, 779
Thompson J. D., Higgins, D. G. and Gibson, T. J. (1994) Nucleic Acids Res. 22,
4676
Luethy R., Bowie J. U., Eisenberg D. (1992) Nature 356, 83
shankari , 535
number of submitted models: 4
Shankari E. Mylvaganam and Kal Ramnarayan.
email: shankari@strubix.com
Homology-based model building procedures were used to predict
the three-dimensional structure of select target sequences
listed in CASP4. Each sequence was subjected to a BLAST search
to identify sequence homologs with known three-dimensional
structures. Structures having the highest sequence homology
with the target sequence were selected as templates in model
building. The sequence with unknown structure was aligned
manually with the template using information from amino acid
sequence, secondary structure, functional residues, and disulfide
bridges. Using the final alignment, the model of the target
sequence was constructed with the program MODELER in INSIGHT II
(Molecular Simulations Inc.). Each model was subjected to refinement
procedures to improve the geometry of both side-chains and
backbone.
MSI-GA , 482
number of submitted models: 12
by purely automated methods.
Carol M. Gorst, Lisa Yan, Azat Badretdinov and David Edwards
email: cgorst@msi.com
The automated modeling package, Gene Atlas TM was utilized.
This method uses a combination of automated fold recognition and comparative
modeling techniques to identify remote functional homologies in many
instances where sequence-based methods fail. All components of the fold
recognition, template selection, alignment generation, model calculation and
assessment of model quality were done without human intervention.
Models were submitted, as generated by Gene Atlas TM, directly to CASP4
and in some instances to CAFASP. Headers were added to the automatically
generated PDB files to allow for acceptance of the data on the CASP4 server.
Additional details on this methodology are available from David Edwards,
dje@msi.com Molecular Simulations Inc., www.msi.com.
CBC-FOLD , 133
number of submitted models: 91
Ajay K. Royyuru, Barry Robson, Prasanna Athma, Faizal Reza, Tiziana Jonas, Alessandro Curioni, Wanda Andreoni, Andreas Kraemer, William Swope, Julia Rice, Isidore Rigoutsos, Tien Huynh, Daniel E. Platt, Yuan Gao, Andrea Califano
email: ajayr@us.ibm.com
We have employed a variety of methods to predict multiple models for 42 of the 43 targets.
Comparative Modeling
Target sequences were aligned to sequences from the pdb_select95 dataset using
PSI-Blast. For predictions submitted as alignments, we identified the best
alignment using the Blast reported metrics, visual inspection of the
structure, and available biological and additional information. For
predictions submitted as tertiary structure, we considered all significant
alignments (e-value < 0.1) and performed a statistical analysis to identify
conserved geometric relationships between Cα atoms of aligned regions. These
geometric relationships were employed as Cα-Cα distance restraints in an
NMR-like torsion space restrained simulated annealing refinement protocol
using CNX, to produce an all-atom tertiary structure. For a few (highly
homologous) targets, the model was further refined with Gromos in explicit
solvent for several hundred ps. Some models were refined with Charmm. A select
few models were energy minimized with CPMD.
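The step from aligned templates to distance restraints can be illustrated with a small sketch. It is not the statistical analysis used by the authors, which is not specified here: it simply assumes that a residue pair is "conserved" when the standard deviation of its Cα-Cα distance across templates stays below a hypothetical cutoff of 1.0 Angstrom.

```python
import math
from itertools import combinations
from statistics import mean, pstdev


def conserved_distance_restraints(templates, max_spread=1.0):
    """Derive distance restraints from equivalent positions in several templates.

    `templates` is a list of coordinate lists; entry k of every list holds the
    (x, y, z) position of the C-alpha atom aligned to target residue k.
    A residue pair yields a restraint when the standard deviation of its
    distance across templates is at most `max_spread` (an assumed cutoff).
    Returns {(i, j): (mean_distance, spread)}.
    """
    n_residues = len(templates[0])
    restraints = {}
    for i, j in combinations(range(n_residues), 2):
        distances = [math.dist(t[i], t[j]) for t in templates]
        spread = pstdev(distances)
        if spread <= max_spread:
            restraints[(i, j)] = (mean(distances), spread)
    return restraints


# Two toy templates: residues 0 and 1 keep their distance, residue 2 moves.
template_1 = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
template_2 = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 10.0, 0.0)]
restraints = conserved_distance_restraints([template_1, template_2])
print(restraints)
```

Only the pair (0, 1) survives in this toy case; the resulting mean distances would then serve as restraint targets in the annealing protocol.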
Fold Recognition
Predictions for the non-homologous targets were created by the following methods.
Pattern Discovery
We have explored the application of a pattern discovery algorithm to fold
recognition. Splash [1] is an algorithm for efficient, deterministic and
exhaustive discovery of patterns in biological sequences; it has been shown to
yield statistically and biologically significant patterns in PROSITE families
[2]. We employed Splash to discover patterns that occur in pdb_select95
sequences and are incident on the target sequence. For all pattern derived
local alignments, we performed a statistical analysis to identify conserved
geometric relationships between Cα atoms of aligned regions. These geometric
relationships were employed as Cα-Cα distance restraints in an NMR-like torsion
space restrained simulated annealing refinement protocol using CNX, to produce
an all-atom tertiary structure. For some predictions, the initial model was
created with defined secondary structure (based on PHD or Psi-Pred
predictions) and held rigid in torsion space, while the pattern derived
geometric restraints were employed to bring about specific tertiary contacts.
Swan: Smith-Waterman Alignment of Annotated Sequences
Each amino acid sequence in pdb_select95 was annotated with the corresponding
secondary structure sequence. The target amino acid sequence and
predicted secondary structure sequence are aligned to each annotated sequence
in pdb_select95, using an implementation of the Smith-Waterman algorithm, with
a scoring function that linearly combines the amino acid alignment score and
the secondary structure alignment score. The amino acid alignment score is
based on the BLOSUM 62 substitution matrix. The secondary structure alignment
score is based on an arbitrarily constructed substitution matrix in the
alphabet of secondary structure states, that favors the identity and penalizes
the non-identity. Candidate templates are first recognized by the net
alignment score and the best template is identified by visual inspection of
the sequence/structure alignment, to yield predictions submitted in alignment
format.
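The Swan scoring scheme can be sketched as a Smith-Waterman alignment whose substitution score is a weighted sum of two terms. The example is a simplified illustration under stated assumptions: both terms use an identity matrix (+2 match, -1 mismatch) as a stand-in, whereas the method described above uses BLOSUM 62 for the amino acid term, and the weight of 0.5 and linear gap penalty of 2.0 are hypothetical values.

```python
def combined_score(aa_a, ss_a, aa_b, ss_b, weight=0.5):
    """Linear combination of an amino acid term and a secondary structure term."""
    aa = 2 if aa_a == aa_b else -1  # stand-in for a BLOSUM 62 lookup
    ss = 2 if ss_a == ss_b else -1  # favors identity, penalizes non-identity
    return weight * aa + (1 - weight) * ss


def smith_waterman(seq_a, ss_a, seq_b, ss_b, gap=2.0, weight=0.5):
    """Return the best local alignment score and the aligned index pairs."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    score = [[0.0] * cols for _ in range(rows)]
    move = [[None] * cols for _ in range(rows)]
    best, best_cell = 0.0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            match = combined_score(seq_a[i - 1], ss_a[i - 1],
                                   seq_b[j - 1], ss_b[j - 1], weight)
            options = [
                (0.0, None),                          # start a new local alignment
                (score[i - 1][j - 1] + match, "diag"),
                (score[i - 1][j] - gap, "up"),
                (score[i][j - 1] - gap, "left"),
            ]
            score[i][j], move[i][j] = max(options, key=lambda option: option[0])
            if score[i][j] > best:
                best, best_cell = score[i][j], (i, j)
    # Trace back from the best cell until a cell with no predecessor is reached.
    pairs = []
    i, j = best_cell
    while move[i][j] is not None:
        if move[i][j] == "diag":
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif move[i][j] == "up":
            i -= 1
        else:
            j -= 1
    return best, pairs[::-1]


best, pairs = smith_waterman("ACDEFG", "EHHHHE", "WWCDEFWW", "CCHHHHCC")
print(best, pairs)
```

Raising the weight shifts the ranking of templates toward sequence similarity; lowering it lets agreement in predicted secondary structure dominate.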
Frankenstein
FRANKENSTEIN builds approximate "MONSTER" or draft structures from one or
more protein templates of known structure, tests them in the capacity of
trial three-dimensional threadings, and then progressively refines the most
likely structures found. A single template source is not required (though it
is of course desirable) and there may be sufficient departure from a
template by a sequence of transformations even if there is only one source
template.
(1) Template identification was (a) by pairwise comparison of all
overlapping 10 residue segments of the target with similar segments in all
sequences of the protein data bank, and (b) by a novel method FASTFINGER
(based on compression using prime numbers) to rapidly identify those
structures which have best match in an overall, more diffuse sense. By
considering the quality of match, initially proposed "SEED" segments of
higher match (typically, but not necessarily, the protein core) are
identified, the SEED segments not in progressive order are automatically
eliminated, and the number of discontinuities, arising from insertions and
deletions, is automatically minimized.
(2) FRANKENSTEIN then physically assembles the chain from the SEEDs,
retaining the three dimensional relationships between the SEED segment
coordinates and rapidly constructing draft models of the sections which link
the SEED segments by a method being submitted for patenting. The sidechains
are then repaired, retaining the observed chi1,chi2,.. angles in the template,
whenever there is an appropriate relationship between the chemical structure of
the sidechain in the target and the template, and otherwise referring to a
structure dictionary.
(3) Optimization of threading. Generally, the overall 3D threading is to be
quickly rejected or accepted, by consideration of compactness, hydrophobic
packing, and recognizable folding elements. In some cases, the structure was
provisionally accepted subject to ability to optimize further the required
properties, by readjustment of the threading.
(4) Refinement. Finally, the structure was extensively minimized
overall with the Robson-Platt [3] forcefield. It was then refined by equilibrating
for several hundred ps with Gromos or Charmm.
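Step (1a) and the removal of out-of-order SEED segments can be sketched as follows. This is an illustration only, not the FRANKENSTEIN code: the identity threshold is hypothetical, and the "progressive order" filter is implemented here as the longest chain of seeds whose target and template positions both increase, which is one plausible reading of the description.

```python
def find_seeds(target, template, window=10, min_identity=7):
    """List (target_start, template_start, identities) for well-matching windows."""
    seeds = []
    for i in range(len(target) - window + 1):
        for j in range(len(template) - window + 1):
            identities = sum(a == b for a, b in
                             zip(target[i:i + window], template[j:j + window]))
            if identities >= min_identity:
                seeds.append((i, j, identities))
    return seeds


def keep_progressive(seeds):
    """Keep the longest chain of seeds increasing in both target and template."""
    seeds = sorted(seeds)
    if not seeds:
        return []
    best_len = [1] * len(seeds)
    prev = [-1] * len(seeds)
    for k, (i, j, _) in enumerate(seeds):
        for m in range(k):
            if seeds[m][0] < i and seeds[m][1] < j and best_len[m] + 1 > best_len[k]:
                best_len[k] = best_len[m] + 1
                prev[k] = m
    k = max(range(len(seeds)), key=best_len.__getitem__)
    chain = []
    while k != -1:
        chain.append(seeds[k])
        k = prev[k]
    return chain[::-1]


print(find_seeds("ACDEFGHIKL", "WWACDEFGHIKLWW"))
print(keep_progressive([(0, 50, 9), (20, 5, 8), (40, 30, 10)]))
```

In the second call the seed at template position 50 is dropped because it would force the chain to run backwards along the template.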
[1] A. Califano. Bioinformatics, 16:341-357 (2000).
[2] R. K. Hart, A. K. Royyuru, G. Stolovitzky, A. Califano. J. Comp. Bio. (In press).
[3] B. Robson and E. Platt J. Mol. Biol. (1986) 188, 259-281.
http://www.research.ibm.com/compsci/compbio
MOE-CCG , 444
number of submitted models: 7
in the Molecular Operating Environment
Ken Kelly
Chemical Computing Group Inc
email: kjk@chemcomp.com
MOE-CCG's submissions to CASP4 were exercises of the protein
alignment and comparative modelling facilities of MOE
(Molecular Operating Environment), which is a product of
Chemical Computing Group Inc. All methods described here were
implemented in a high-level language called
SVL (Scientific Vector Language), and are available within
MOE in source code form for inspection and modification.
Homology searching was performed within MOE against a library of
alignments, structure-based where possible. This library, distributed
with MOE, was built by exhaustive clustering of the data in the
RCSB Protein Data Bank. The alignments are augmented by sequence data
from the NCBI protein database and the PIR resource.
The clustering procedure leveraged two aspects of MOE's alignment
module: its use of an A-star search to correctly calculate the
affine gap penalty in group-to-group alignments, as well as a
structure-based alignment capacity, which refines seed alignments
through the use of a scoring matrix derived from globally-optimal
multi-body superposition.
The homology modelling procedure is similar
to the segment-matching methodology of Levitt (JMB, 1992),
in that it is a multiple-model procedure with randomized
build-order, with Boltzmann-weighted choice of loop and sidechain
conformations based on a vdw-style energy function. This function can take
into account any organic moiety associated with the template
structure, such as a co-factor, metal, or associated protein.
Hence, at least one of our models was built from a template
with a bound ligand in place; in another both parts of a multimer
were built simultaneously.
Loop candidates come from a database of protein structures;
sidechain conformations from a large rotamer library generated
by conformation searches over penta-peptides.
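A Boltzmann-weighted choice among candidate conformations can be sketched as below. This is a generic illustration, not MOE's SVL code: the function names are hypothetical, and kT = 0.6 kcal/mol (roughly room temperature) is an assumed value.

```python
import math
import random


def boltzmann_weights(energies, kT=0.6):
    """Normalized Boltzmann probabilities for a list of energies.

    kT must be in the same units as the energies. Energies are shifted by
    their minimum first, which leaves the probabilities unchanged but avoids
    overflow in exp().
    """
    lowest = min(energies)
    factors = [math.exp(-(e - lowest) / kT) for e in energies]
    total = sum(factors)
    return [f / total for f in factors]


def choose_conformation(candidates, energies, rng, kT=0.6):
    """Draw one candidate with probability proportional to exp(-E / kT)."""
    weights = boltzmann_weights(energies, kT)
    return rng.choices(candidates, weights=weights, k=1)[0]


rng = random.Random(1)
loops = ["loop_a", "loop_b", "loop_c"]   # hypothetical candidate names
energies = [-12.0, -11.5, -4.0]          # hypothetical energies in kcal/mol
print(boltzmann_weights(energies))
print(choose_conformation(loops, energies, rng))
```

Low-energy candidates are chosen most often, but higher-energy ones keep a nonzero chance, which is what gives the multiple models their diversity.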
The final models were selected by application of a statistical
packing quality function (histogram based). As with the energy
function used in building candidate models, the packing quality
function uses statistics gathered with a flexible atom typing
scheme extended to non-protein atoms.
Models were refined in MOE using its
implementation of the Amber95 forcefield (Cornell et al, JACS, 1995).
The built-in protocol uses a series of minimizations with
variable-weight tethers, beginning with Conjugate-Gradient and
terminating with Truncated-Newton. Certain atom positions identified
as (geometrically) conserved across the alignment were held fixed
during the early stages of refinement.
Barton , 173
number of submitted models: 25
construction of sequence-structure alignments with sequence profiles
Dengler, U., Webber, C., Searle, S. M. J.,
& Barton, G. J.
email: dengler@ebi.ac.uk
Identification of template structures:
To identify possible homologues of a target sequence in the PDB, iterative
sequence searches with PSI-BLAST (Altschul et al., 1997) and SCANPS (Barton et
al., 2000) were run against a combined search set comprising the SWALL, PDB
and ENSEMBL databases.
From the results of these searches, the hit with the lowest E-value whose
structure was available from the PDB was selected. In cases where there were
several candidate structures with similar E-values, the structure with the
highest quality was chosen.
Sequence-structure alignment:
A sequence profile was either taken from JPred2 (Cuff & Barton, 2000) or
constructed from the hits of the PSI-BLAST search against the combined search
set and the hits of tblastn searches against the EST sections of the EMBL
nucleotide sequence database. Hits were defined as those sequences aligning with
an E-value less than 5x10e-4. These sequences were aligned all-against-all using
AMPS (Barton, 1990) and one sequence was removed from any pair sharing greater
than 75% identity. To construct the sequence profile, the remaining sequences
were aligned with ClustalW (Thompson et al., 1994) in multiple alignment mode.
The target sequence was aligned to this profile with ClustalW in profile
alignment mode. From the resulting multiple sequence alignment, a pairwise
alignment of the target sequence with the template sequence was extracted. In
cases where the sequence of the target and the sequence of the template shared
high sequence identity, they were aligned with ClustalW directly.
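The redundancy filter applied before profile construction can be sketched as follows. The sketch assumes the sequences are already aligned to a common length and keeps the first sequence of each redundant pair; the abstract does not state which member of a pair was removed, so that choice is an assumption, as are the function names.

```python
def percent_identity(a, b):
    """Percent identity over columns where both aligned sequences have a residue."""
    columns = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not columns:
        return 0.0
    return 100.0 * sum(x == y for x, y in columns) / len(columns)


def remove_redundant(aligned, cutoff=75.0):
    """Greedily keep sequences at most `cutoff` percent identical to all kept ones.

    `aligned` maps sequence names to aligned strings; insertion order decides
    which member of a redundant pair survives.
    """
    kept = {}
    for name, seq in aligned.items():
        if all(percent_identity(seq, other) <= cutoff for other in kept.values()):
            kept[name] = seq
    return list(kept)


hits = {
    "s1": "ACDEFGHIKL",
    "s2": "ACDEFGHIKV",  # 90% identical to s1, removed
    "s3": "ACDEWWWWWW",  # 40% identical to s1, kept
}
print(remove_redundant(hits))
```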
To attain the final alignment, the pairwise alignment was edited by hand in the
JalView (Clamp et al., 1998) alignment editor. In the editing process, the
three-dimensional structure of the template protein and the predicted secondary
structure for the target sequence obtained from JPred2 were taken into account.
Model building:
No three-dimensional models of the targets were built. However, in a pre-model
building step, between two and four aligned positions flanking gaps were
removed, depending on their structural context. Usually these residues do not
occupy the same position in space as their counterparts in the template
structure.
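The removal of aligned positions next to gaps can be sketched as below. It assumes a pairwise alignment given as two equal-length strings with '-' for gaps, and a fixed number of positions removed on each side of a gap; in the procedure described above the number varied between two and four depending on structural context, which this sketch does not model.

```python
def trim_gap_flanks(target, template, flank=2):
    """Return the indices of aligned columns to keep.

    A column is aligned when neither sequence has a gap ('-'). Aligned columns
    lying within `flank` columns of any gapped column are dropped.
    """
    if len(target) != len(template):
        raise ValueError("aligned strings must have equal length")
    gapped = [i for i, (a, b) in enumerate(zip(target, template))
              if a == "-" or b == "-"]
    keep = []
    for i, (a, b) in enumerate(zip(target, template)):
        if a == "-" or b == "-":
            continue
        if any(abs(i - g) <= flank for g in gapped):
            continue
        keep.append(i)
    return keep


print(trim_gap_flanks("ACDEFG--HIKLMN", "ACDEFGWWHIKLMN"))
```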
Literature:
Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z., Miller,
W. & Lipman, D. J. (1997). Gapped BLAST and PSI-BLAST: a new generation of
protein database search programs. Nucl Acids Res 25, 3389-3402.
Barton, G. J., Webber, C. & Searle, S. M. J. (2000). ScanPS (Scan Protein
Sequence) - High performance parallel iterated protein sequence searching with
full dynamic programming and dynamically estimated sequence statistics. ISMB
2000: 8th international conference on intelligent systems for molecular biology,
San Diego, California.
Barton, G. J. (1990). Protein multiple sequence alignment and flexible pattern
matching. Methods Enzymol 183, 403-428.
Clamp, M., Cuff, J. & Barton, G. J. (1998). Jalview - a java multiple alignment
editor. http://www2.ebi.ac.uk/~michele/jalview/contents.html.
Cuff J. A. & Barton, G. J. (2000). Application of multiple sequence alignment
profiles to improve protein secondary structure prediction. Proteins 40, 502-
511.
Thompson, J.D., Higgins, D.G. and Gibson, T.J. (1994). CLUSTAL W: improving the
sensitivity of progressive multiple sequence alignment through sequence
weighting, positions-specific gap penalties and weight matrix choice. Nucl
Acids Research 22, 4673-4680.
SDSC1 , 186
number of submitted models: 16
I.N. Shindyalov, P.E. Bourne
email: shindyal@sdsc.edu
Finding an accurate alignment between template and target is one
of the major bottlenecks in homology modeling. One way of
addressing this issue is to use intermediate sequences, i.e.
sequences similar to both target and template sequences. We
propose a procedure for improving the alignment with the use
of intermediate sequences, with the following steps:
(i) finding candidate intermediate sequences using BLAST
program (gapped version); (ii) finding relevant templates
and building initial alignment; (iii) finding intermediate
sequences for a given target and template in non-redundant
set of protein sequences from NCBI; (iv) building a set of
pair-wise alignments between target and template based on
intermediate sequences; (v) weighting alignments according
to phylogenetic tree and combining them into target/template
similarity matrix; (vi) finding consensus alignment by
applying local dynamic programming to the target/template
similarity matrix. The procedure described improves the quality
of the alignment between target and template sequences by
20-40%, depending on the accuracy range used.
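Steps (v) and (vi) can be sketched as follows. This is a simplified illustration rather than the authors' implementation: the weights are passed in directly instead of being derived from a phylogenetic tree, and the gap penalty and the baseline subtracted from each matrix cell are hypothetical parameters.

```python
def similarity_matrix(n_target, n_template, alignments):
    """Sum weighted votes for each (target, template) residue pair.

    `alignments` is a list of (weight, pairs); pairs are 0-based
    (target_index, template_index) tuples implied by one intermediate sequence.
    """
    matrix = [[0.0] * n_template for _ in range(n_target)]
    for weight, pairs in alignments:
        for i, j in pairs:
            matrix[i][j] += weight
    return matrix


def consensus_alignment(matrix, gap=0.5, baseline=0.25):
    """Local dynamic programming over the vote matrix; returns aligned pairs."""
    n, m = len(matrix), len(matrix[0])
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    move = [[None] * (m + 1) for _ in range(n + 1)]
    best, best_cell = 0.0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            options = [
                (0.0, None),
                (score[i - 1][j - 1] + matrix[i - 1][j - 1] - baseline, "diag"),
                (score[i - 1][j] - gap, "up"),
                (score[i][j - 1] - gap, "left"),
            ]
            score[i][j], move[i][j] = max(options, key=lambda option: option[0])
            if score[i][j] > best:
                best, best_cell = score[i][j], (i, j)
    pairs = []
    i, j = best_cell
    while move[i][j] is not None:
        if move[i][j] == "diag":
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif move[i][j] == "up":
            i -= 1
        else:
            j -= 1
    return pairs[::-1]


# Three hypothetical alignments obtained through different intermediates.
alignments = [
    (1.0, [(0, 0), (1, 1), (2, 2), (3, 3)]),
    (0.5, [(0, 1), (1, 2), (2, 3)]),
    (1.0, [(0, 0), (1, 1), (2, 2)]),
]
matrix = similarity_matrix(4, 4, alignments)
print(consensus_alignment(matrix))
```

The shifted alignment, supported by a single low-weight intermediate, is outvoted by the two that agree on the diagonal.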
Zemla-Joanna , 330
number of submitted models: 12
Joanna Zemla,
Adam Zemla
email: joanna_zemla@yahoo.com
Our motivation for participating in CASP4 as predictors
was to test the effectiveness of new tools (MAPA and LGA) developed
to build protein 3D models in the comparative modeling category.
Parents (sequences related to the target protein) were obtained
using either a PSI-BLAST search against the non-redundant NCBI
sequence database or a FASTA search against the PDB database.
MAPA, our new sequence comparison tool, allowed us to verify
the correctness of sequence alignments produced by PSI-BLAST
or FASTA. LGA, a structure aligner, was used to perform a pairwise
comparison of selected parents to establish structurally conserved
regions. These regions were used as a basis to build a model.
The 3D coordinates for loop regions were assigned from a local
library of fold fragments created on the fly by the LGA program.
The final main chain model was generated by the AL2TS program.
This approach will be developed into a fully automated protein
structure prediction method.
References:
Zemla A., http://PredictionCenter.llnl.gov/local/al2ts/al2ts.html
Zemla A., http://PredictionCenter.llnl.gov/local/lga/lga.html
Gerloff , 003
number of submitted models: 13
Zeti A. M. Hussein, Melanie McCarthy-Troke, Bernard J. Mitchell, Cairan M. Duffy, Siu-Wai Leung and Dietlind L. Gerloff*
*: corresponding author
email: D.Gerloff@ed.ac.uk
We have submitted tertiary structure predictions for eight CASP4
target proteins (see below) in order to investigate the potential
of knowledge and/or predictions about functional sites on these
proteins for being used in combination with established methods.
The prediction categories in which our predictions will be considered,
as well as their assigned degrees of difficulty, are likely to vary.
The applicable categories could range from ab initio modelling (tertiary
assembly of predicted secondary structure elements) over fold prediction
to threading alignments. Two targets for which the fold prediction was
very "obvious" (T0100 and T0101) might be designated comparative
modeling targets by some assessors. Accordingly, we are submitting
this method abstract to all three categories.
Besides our emphasis on formulating distance and/or geometrical
constraints for our models based on functional site knowledge, or
prediction, the only unifying link between our submissions is the
use of predicted Surface/Interior/ActiveSite/Parse positions (termed
SIAP-predictions in the following) according to the approach by
Benner and Gerloff, which is most conveniently described in:
S. A. Benner, G. M. Cannarozzi, D. Gerloff, M. Turcotte and
G. Chelvanayagam (1997). Bona fide predictions of protein secondary
structure using transparent analyses of multiple sequence alignments.
Chem. Reviews 97, 2725-2843.
Below, we provide more explicit information regarding the methods
used in each of the predictions. We wish to emphasize that this
information is by no means to be considered redundant with the
information given in the header text of individual prediction
submissions made to CASP4. Specifically in the header text, we
attempted to highlight the most relevant clues that led us to
each prediction (or, rather, those that we thought to be relevant
without knowledge regarding the experimental structures), and from
which underlying functional assumptions these clues were derived.
In this way, we hope to provide a transparent account of our prediction
strategy from which we will be able to learn which of our assumptions
were valid, and could, potentially, be used more generally in tertiary
structure prediction after further validation on known structures.
T0086 - Chorismate Lyase - UBIC
--------------------------------
This prediction rests primarily on the assembly of secondary
structures around a putative active site. FUNCTIONAL SITE PREDICTION
included speculations regarding a plausible mechanism for the catalyzed
reaction, and the roles of conserved functional residues. Incidentally,
using one monomer of the trimeric Bacillus chorismate mutase structure
as scaffold, and assuming an approximately equivalent location of
substrate binding led to a moderately satisfactory model.
FOLD PREDICTION: assembly of predicted secondary structures + pathway
neighbors were examined preferentially. FOLD AGREEMENT WITH CAFASP2: no.
INCREASED THE NUMBER OF HOMOLOGOUS SEQS: yes (genome projects via PEDANT).
SECONDARY STRUCTURE PREDICTION: mainly from CAFASP2 servers.
SIAP-PREDICTION: yes. CONFIDENCE IN TOPOLOGY PREDICTION: medium.
T0087 - PPase - PPX1
---------------------
This submission consists of manual fold recognition prediction(s)
and threading alignments for two predicted domains in target T0087.
FUNCTIONAL SITE PREDICTION was possible only for the first domain.
It centers on the predicted location of, and ligands for, a Mn2+ ion
which is relevant for catalytic function, and the compatibility
of the predicted fold and threading alignment with our expectations
with respect to composition and geometry of a polyphosphate hydrolase
site. The predictions use the fold of the Thiamine(-PP) Binding Domain
as the parent fold. Interestingly our prediction rules out a Rossmann
fold for the first domain.
FOLD PREDICTION: combinatorial assembly of predicted secondary
structures + functional hypothesis (Mn2+-site; pyrophosphate-binding).
FOLD AGREEMENT WITH CAFASP2: no. INCREASED THE NUMBER OF
HOMOLOGOUS SEQS: yes (genome projects via SAMT99 (CAFASP2 #15)).
SECONDARY STRUCTURE PREDICTION: mainly from CAFASP2 servers.
SIAP-PREDICTION: yes. CONFIDENCE IN TOPOLOGY PREDICTION: high (first
domain) + medium (second domain).
T0094 - Cyclic Phosphodiesterase - CPDase
------------------------------------------
This submission is a tentative manual threading alignment,
assuming that the weak sequence similarity to histone acetyl-
transferase (1ygh), as detected by pdb-blast (CAFASP server #1),
bears any significance. FUNCTIONAL SITE PREDICTION emphasized
grouping together most of the functional residues that
were conserved in two putative homologs, and fulfilling the spatial
requirements for two disulfide bridges. The difficulties with
fitting the predicted secondary structure, as well as the different,
inferred, active site location of target vs. template, could indicate
that the assumptions are in fact invalid, and our prediction false.
FOLD PREDICTION: BLAST against PDB (CAFASP2 server #1).
FOLD AGREEMENT WITH CAFASP2: yes. INCREASED THE NUMBER OF
HOMOLOGOUS SEQS: yes? (Geobacter genome, distant homolog?, via NCBI).
SECONDARY STRUCTURE PREDICTION: mainly from CAFASP2 servers (esp. #7).
SIAP-PREDICTION: yes. CONFIDENCE IN TOPOLOGY PREDICTION: medium-low.
T0098 - Spo0A C-terminus - SP0A [solved: 1FC3]
-----------------------------------------------
Our prediction represents one possible arrangement of alpha-helices that
would seem compatible with our secondary and SIAP-predictions, as well as
limited FUNCTIONAL INFORMATION & HYPOTHESES. The latter included knowledge
of DNA-binding specificity at a bi-partite regulatory site, albeit with
uncertainty regarding the number of DNA-contact sites per protein domain.
We [falsely!] chose to presume two sites, and proposed a fold containing
two HTH-motifs seen in an earlier CASP-target (T0079). A possible Zn2+-
binding cluster of residues seemed to support our prediction, while
protein-to-DNA sequence-sequence correlation (1) was pointing towards
the correct location for the HTH motif, albeit at sub-significant scores.
FOLD PREDICTION: functional hypothesis (DNA-binding; 2x) + assembly of
predicted secondary structures + deja vu. FOLD AGREEMENT WITH CAFASP2:
yes (HTH-structural motif) + no (2 HTH-motifs and relative orientation).
INCREASED THE NUMBER OF HOMOLOGOUS SEQS: yes? (M.leprae genome, distant
homolog?, via NCBI; variation between sequences very limited).
SECONDARY STRUCTURE PREDICTION: CAFASP2-servers + Benner&Gerloff.
SIAP-PREDICTION: yes. CONFIDENCE IN TOPOLOGY PREDICTION: medium.
T0100 - Pectin Methylesterase - PMEA [solved: 1QJV]
----------------------------------------------------
This submission is a manual threading alignment using a single-stranded,
right-handed beta helix as the parent fold, as this was suggested by the
majority of CAFASP2-submissions for this target. The most important anchors
for the threading alignment were imposed by PUTATIVE DISULFIDE BRIDGES in
other members of the T0100-family, and FUNCTIONAL SITE PREDICTIONS
(of putative active site residues, their relative locations and their
locations with regard to the putative pectin binding site). These clues
were used in combination with known as well as predicted, structural
constraints that apply specifically to repetitive folding topologies
of this kind (reviewed e.g. in (2)). [In retrospect, all of our
assumptions seem to have proven valid; however, we overpredicted an
insertion after approx. 290aa, which corresponded, in reality, to a full
turn of the beta-helix].
FOLD PREDICTION: obvious. FOLD AGREEMENT WITH CAFASP2: yes.
INCREASED THE NUMBER OF HOMOLOGOUS SEQS: no.
SECONDARY STRUCTURE PREDICTION: not used. SIAP-PREDICTION: yes.
CONFIDENCE IN TOPOLOGY PREDICTION: high.
T0101 - Pectate Lyase - PELL
-----------------------------
This submission is a manual threading alignment using a single-stranded,
right-handed beta helix as the parent fold, as this was suggested by the
majority of CAFASP2-submissions for this target. In contrast to T0100,
little help could be obtained from speculative disulfide bridges.
Instead the alignment is based on FUNCTIONAL SITE PREDICTION of candidate
residues forming a calcium-binding site which is relevant for enzymatic
activity in most members of this family of pectate lyases. Incidentally,
our hypothesis would place the calcium ion at a location similar to the
one found in co-crystal structures of other pectate lyases (e.g. 1bn8).
Further help came from fold-specific structural anchors, e.g. N-ladders (2).
FOLD PREDICTION: obvious. FOLD AGREEMENT WITH CAFASP2: yes.
INCREASED THE NUMBER OF HOMOLOGOUS SEQS: yes (genome projects via
SAMT99 (CAFASP2 #15)). SECONDARY STRUCTURE PREDICTION: not used.
SIAP-PREDICTION: yes. CONFIDENCE IN TOPOLOGY PREDICTION: high.
T0105 - Protein Sp100b - SP100
-------------------------------
Our prediction for T0105 is an ab initio assembly of predicted
secondary structure elements (see separately submitted SS prediction),
based primarily on two FUNCTIONAL SITE PREDICTIONS: (i) a putative cluster
of Cys residues (Zn-binding?) in all homologues but drawn from different
locations in the sequence; (ii) DNA-binding via a recognition helix.
Constraints were derived from these predictions and used in combination
with the predicted amphiphilic or internal characters of the beta strands.
The result is a strongly twisted, half open and concave barrel structure
which can be viewed almost as a cyclically permuted SH3-barrel. Combinatorial
analysis suggests there might be other plausible topologies. Producing
an approximate coordinate model representing this topology prediction was
extremely difficult because of beta-strand twisting and bending. We hope
that the intended topology is discernable, at least...
FOLD PREDICTION: combinatorial assembly of predicted secondary
structures (ab initio). FOLD AGREEMENT WITH CAFASP2: no.
INCREASED THE NUMBER OF HOMOLOGOUS SEQS: yes (literature! (3)).
SECONDARY STRUCTURE PREDICTION: CAFASP2 servers + Benner&Gerloff.
SIAP-PREDICTION: yes. CONFIDENCE IN TOPOLOGY PREDICTION: medium-low.
T0121 - MalK - MALK [C-terminal 135aa only]
--------------------------------------------
Our submissions include two different manual threading alignments (models 1
and 2) for the C-terminal 135 residues onto a duplicated OB-fold template.
Besides this parent structure, we would also consider it possible to use
a minimal Ig-like beta sandwich-core (strand connections A-(+3)-B-(-1)-C-
(-1)-D-(+3)-E-(+1)) but preferred to concentrate on the alignment issue
with 1b9m_A as the preferred parent structure. The function of the domain
is basically unknown. We tried to derive FUNCTIONAL SITE PREDICTIONS based
on sequence conservation and literature (limited information about muta-
genesis experiments, extracted from (4) and many other publications),
e.g. an epitope function for the strongly conserved "GIRPED" sequence.
In the alignment for model 1, we allowed these reflections to influence
the sequence-to-structure alignment; the alignment for model 2 is based
on sequence similarity and SIAP-prediction (and looks quite plausible, too).
FOLD PREDICTION: 2 plausible alternatives from CAFASP2 servers +
combinatorial assembly of predicted secondary structures.
FOLD AGREEMENT WITH CAFASP2: yes (CAFASP FFAS server #10, for the
fold alternative used in the submissions). INCREASED THE NUMBER
OF HOMOLOGOUS SEQS: yes (genome projects via SAMT99 (CAFASP2 #15)).
SECONDARY STRUCTURE PREDICTION: CAFASP2 servers + Benner&Gerloff.
SIAP-PREDICTION: yes. CONFIDENCE IN TOPOLOGY PREDICTION: medium.
References used in individual predictions:
-------------------------------------------
(1): M. Suzuki, S. E. Brenner, M. Gerstein and N. Yagi (1995).
Protein Eng. 8, 319-328.
(2): J. Jenkins, O. Mayans, R. Pickersgill (1998).
J. Struct. Biol. 122, 236-246
(3): T. J. Gibson, C. Ramu, C. Gemund and R. Aasland (1998).
Trends Biochem. Sci. 23, 242-244
(4): G. Schmees, A. Stein, S. Hunke, H. Landmesser, E. Schneider (1999).
Eur. J. Biochem. 266, 420-430
SDSC2:Reddy-Bourne , 187
number of submitted models: 19
Boojala V. B. Reddy and Philip E. Bourne
[SDSC2:Reddy-Bourne (7555-8669-7298)]
San Diego Supercomputer Center
9500 Gilman Dr., #CRB-166, MC-0537
La Jolla, San Diego CA 92093-0537
Phn: 858-822-0860; Fax: 858-822-0873
email: breddy@sdsc.edu, bourne@sdsc.edu
(I) Identification of template structure for comparative modeling.
(a) Web based programs PSI-BLAST(NCBI), FASTA(EBI) and MPSearch(EBI) were
used to compare a target against the PDB sequence database.
(i) If homologs were identified those were used iteratively to define more
distant homologs. PDB identifiers (PDBids) of the corresponding sequences were
collected.
(ii) In using FASTA and MPSearch, PDBids were collected using PAM-250,
PAM-350, PAM-500, PAM-120, BLOSUM62, BLOSUM40 scoring matrices and the GONNET
substitution table. Default GAP-penalties were used in all cases.
(iii) In cases where no hits were identified we used the non-redundant protein
sequence (NR) database to search for homologous sequences. Then based on the
name(s) of the proteins which gave good E-values we searched the PDB sequence
database and collected the PDBids for similar protein names.
(b) An all-to-all structure comparison using the Combinatorial Extension
algorithm (CE) with a z-score of 4.5 and an RMSD of 4.0 Å provided all similar
structures to the PDB proteins defined by step (a). These were referred to as
CE-PDBids.
(c) The CE-PDBids were used in GenTHREADER to identify the most probable
folds.
(d) Using CE we iteratively compared similar and representative PDBids of the
top 10 folds until the GenTHREADER scores converged to the same top ten folds.
Folds 1-5 were selected along with the sequence alignments provided by
GenTHREADER as templates for homology modeling.
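Steps (b) to (d) amount to a filter followed by a fixed-point iteration, which can be sketched in Python as below. This is an illustration under stated assumptions, not the scripts used for the submissions: the thresholds are read as "z-score at least 4.5 and RMSD at most 4.0 Å", and `expand` and `rank` are hypothetical callables standing in for a CE structure comparison and a GenTHREADER ranking. The identifiers p1 to p4 are made-up placeholders, not real PDB entries.

```python
from typing import Callable, Iterable, List, Optional, Set, Tuple

CeHit = Tuple[str, float, float]  # (identifier, CE z-score, RMSD in Angstrom)


def filter_ce_hits(
    hits: Iterable[CeHit], min_z: float = 4.5, max_rmsd: float = 4.0
) -> Set[str]:
    """Keep structures passing both cutoffs (assumed: z >= min_z, RMSD <= max_rmsd)."""
    return {pdb for pdb, z, rmsd in hits if z >= min_z and rmsd <= max_rmsd}


def iterate_until_stable(
    seeds: Iterable[str],
    expand: Callable[[str], Iterable[CeHit]],  # hypothetical CE wrapper
    rank: Callable[[Set[str]], List[str]],     # hypothetical fold ranking, best first
    top_n: int = 10,
    max_rounds: int = 20,
) -> Optional[List[str]]:
    """Grow the candidate pool from the current top folds until the top list repeats."""
    pool = set(seeds)
    previous: Optional[List[str]] = None
    for _ in range(max_rounds):
        top = rank(pool)[:top_n]
        if top == previous:
            return top
        previous = top
        for pdb in top:
            pool |= filter_ce_hits(expand(pdb))
    return previous


# Toy data: a made-up neighbour graph and made-up ranking scores.
neighbours = {"p1": [("p2", 5.0, 3.0), ("p3", 4.0, 3.0)], "p2": [("p4", 6.0, 2.0)]}
scores = {"p1": 1.0, "p2": 3.0, "p3": 9.0, "p4": 2.0}
result = iterate_until_stable(
    ["p1"],
    expand=lambda p: neighbours.get(p, []),
    rank=lambda pool: sorted(pool, key=lambda p: -scores[p]),
    top_n=2,
)
print(result)
```

The `max_rounds` guard is an addition for safety; the text above does not say how many rounds were needed in practice.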
(II) Comparative modeling of target structure.
(a) We used only the single template with the highest z-score from GenTHREADER
and modeled the target structure using the HOMOLOGY module from INSIGHT-II of
MSI Life Science software.
(b) Loop modeling was done using the loop search routine of HOMOLOGY and we
assigned one of the ten top-ranked loops suggested by this routine. The end repair
routine was used if necessary.
(c) All loops were refined using the steepest descent and conjugate gradient
minimization routines. Similarly, the SCR side chains were energy minimized.
(d) The final model was refined using the optimize routine in the BOULDER
module of InsightII before submitting to CASP4.
Moult , 363
number of submitted models: 90
Eugene Melamud, Carol DeWeese-Scott, Zhen Wang,
John Moult
email: cdscott@carb.nist.gov
CASP4 fold recognition Methods:
To identify potential fold relationships we used a combination of
sequence profile and secondary structure prediction methods.
A library of fold profiles was constructed for the members of
the 35% non-redundant PDB subset, by running PSI-Blast for 5 iterations
with an expectation-value cut-off of 0.001, against the NCBI nr database,
and storing the resulting Position Specific Scoring Matrices (PSSMs).
A PSSM profile was constructed for each target in the same manner.
These target profiles were aligned to each PDB profile using a modified
version of the Smith-Waterman algorithm.
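For readers unfamiliar with profile-profile alignment, the following Python sketch shows a plain Smith-Waterman recursion applied to two profiles. It is not the modified version referred to above, whose details are not given here. The column score (a dot product minus a fixed offset) and the linear gap penalty are assumptions chosen only to keep the example short.

```python
from typing import List, Sequence

Column = Sequence[float]  # one profile position: one value per residue type


def column_score(a: Column, b: Column, offset: float = 0.5) -> float:
    """Hypothetical column-column score: dot product shifted so mismatches are negative."""
    return sum(x * y for x, y in zip(a, b)) - offset


def smith_waterman(p: List[Column], q: List[Column], gap: float = 1.0) -> float:
    """Best local alignment score between two profiles with a linear gap penalty."""
    rows, cols = len(p) + 1, len(q) + 1
    h = [[0.0] * cols for _ in range(rows)]
    best = 0.0
    for i in range(1, rows):
        for j in range(1, cols):
            h[i][j] = max(
                0.0,                                              # start a new local alignment
                h[i - 1][j - 1] + column_score(p[i - 1], q[j - 1]),  # align two columns
                h[i - 1][j] - gap,                                # gap in q
                h[i][j - 1] - gap,                                # gap in p
            )
            best = max(best, h[i][j])
    return best


# Toy profiles over a three-letter alphabet, written as one-hot columns.
A, B, C = (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)
print(smith_waterman([A, B, C], [B, C]))
```

With one-hot columns the score reduces to ordinary sequence matching, which makes the toy case easy to verify by hand; real PSSM columns would simply supply different numbers to `column_score`.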
The significance of each alignment was evaluated by randomly shuffling
the template profile and deriving a Z-score distribution. The resulting
highest scoring alignment was then assessed for secondary structure
compatibility by comparing the secondary structure predictions of the
template and target using PSIPRED. The top 10 hits were then examined
for functional and fold relationship significance.
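The shuffling-based significance estimate described above can be sketched in Python as follows. This is a minimal illustration, assuming that the Z-score is the observed score's distance from the mean of the shuffled scores in units of their standard deviation. The function name, the number of shuffles, and the toy scoring function are hypothetical.

```python
import random
import statistics
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def shuffle_z_score(
    target: Sequence[T],
    template: Sequence[T],
    score: Callable[[Sequence[T], Sequence[T]], float],
    n_shuffles: int = 100,
    seed: int = 0,
) -> float:
    """Z-score of the real alignment score against scores for shuffled templates."""
    rng = random.Random(seed)
    observed = score(target, template)
    background: List[float] = []
    for _ in range(n_shuffles):
        shuffled = list(template)
        rng.shuffle(shuffled)  # permute template positions, keep composition
        background.append(score(target, shuffled))
    mean = statistics.mean(background)
    spread = statistics.pstdev(background)
    if spread == 0.0:
        return 0.0  # no variation in the background: significance undefined
    return (observed - mean) / spread


# Toy scoring function: number of identical positions (illustration only).
def identity_count(a: Sequence[str], b: Sequence[str]) -> float:
    return float(sum(x == y for x, y in zip(a, b)))


seq = list("ACDEFGHIKL")
print(shuffle_z_score(seq, seq, identity_count))
```

Shuffling preserves the composition of the template while destroying its order, so a high Z-score indicates that the alignment score depends on the actual arrangement of positions rather than on composition alone.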
In addition to the profile-profile method we also used an "intermediate
sequence-profile" method based on the Impala program. All sequence
relatives of the target from a PSI-Blast output were searched using
Impala against the library of folds. The Impala PDB hits were then
collected and sorted based on expectation-scores. The top scoring hit,
the intermediate sequence, and the target sequence were then combined
into a single alignment using the ClustalW program. This alignment was
then examined visually for common conserved regions.
Comparative Modeling Methods:
Generating Alignments:
Where multiple templates were available, a structural superposition was
generated using CE. The resulting structure-based "template profile" was used
as the basis for a set of sequence alignments between the target and the template.
The target sequence was also searched against the nr database using PSI-BLAST to
generate a sequence-based "target profile."
Our goal was to generate as many reasonable alignments with as many
different templates as possible, and then evaluate them to determine which
template and which alignment would produce the most accurate model. Both the
target sequence and the target profile were aligned against the template profile
using CLUSTALX with varying gap opening penalties. If FASTA or PSI-BLAST
generated an alignment for the target to a sequence, those alignments were also
considered.
Evaluating Alignments:
Alignments were evaluated using in-house software. Each pair of residues
forming an interaction in the template was compared with the corresponding pair
in the target model, and a score was assigned based on the likelihood of the
substitutions. The methodology of the system is as follows: 1.) For each
template selected, all interactions between amino acids in the protein are
identified using two different methods. In one, any atomic contact within 4.5 Å
is considered an interaction; in the other, contacts involving pre-defined
group-group interactions are used. 2.) Each interaction is assigned a score
using a database-derived scoring matrix. The score reflects the probability
within a family of the template pair of interacting residues being replaced
by the corresponding pair in the target.
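The two steps can be illustrated with a short Python sketch. It covers only the first interaction definition (any atomic contact within 4.5 Å) and is not the in-house software. The pair-score table, its keys, and the assumption of a gap-free, position-by-position alignment between template and target are hypothetical simplifications.

```python
import math
from itertools import combinations
from typing import Dict, List, Tuple

Atom = Tuple[float, float, float]
Residue = List[Atom]  # all atom coordinates of one residue


def find_contacts(residues: List[Residue], cutoff: float = 4.5) -> List[Tuple[int, int]]:
    """Residue index pairs with at least one atom pair within the cutoff (Angstrom)."""
    contacts = []
    for i, j in combinations(range(len(residues)), 2):
        if any(math.dist(a, b) <= cutoff for a in residues[i] for b in residues[j]):
            contacts.append((i, j))
    return contacts


def score_alignment(
    contacts: List[Tuple[int, int]],
    template_seq: str,
    target_seq: str,
    pair_scores: Dict[Tuple[str, str], float],  # hypothetical substitution table
    default: float = 0.0,
) -> float:
    """Sum the scores for replacing each interacting template pair by the target pair."""
    total = 0.0
    for i, j in contacts:
        template_pair = template_seq[i] + template_seq[j]
        target_pair = target_seq[i] + target_seq[j]
        total += pair_scores.get((template_pair, target_pair), default)
    return total


# Toy structure: three single-atom residues on a line (made-up coordinates).
toy = [[(0.0, 0.0, 0.0)], [(3.0, 0.0, 0.0)], [(10.0, 0.0, 0.0)]]
pairs = find_contacts(toy)
table = {("LV", "IV"): 1.5}  # made-up score for L-V replaced by I-V
print(pairs, score_alignment(pairs, "LVA", "IVA", table))
```

This sketch does not exclude sequence-adjacent residues, which are always in contact; whether the original method did so is not stated.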
Evaluations were also made based on the prevalence of insertions
and deletions in secondary structure units, volume changes, substitutions of Gly
residues with disallowed phi/psi to non-Gly residues, and cis-Pro substitutions
to non-Pro residues.
In addition, secondary structure predictions from PHD and PSIPRED were
compared against the actual secondary structure (defined by DSSP) of the template
for each alignment of the target to the template.
Loops:
If a loop region contained the same number of residues in both the target
and the primary template, it was assumed that the structure is similar to the
primary template. Loop-building methods were applied in cases of target loops
with insertions or deletions relative to the primary template.
A limited database search method was used to identify alternate family
members likely to make good templates: If a template contained the same number
of loop residues as the target, it was identified as providing a potential
conformation for that loop. Each alternate template was locally superimposed
to the primary template using the secondary structural regions on either side
of the loop. The final conformation for a loop was chosen based on the proper
number of residues, structural similarity of neighboring secondary structure
relative to the primary template, and sequence identity between the alternate
template and the target.
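The selection criteria just listed can be expressed as a small filter-and-rank routine. The Python sketch below is an illustration only: the text does not say how the three criteria were weighted, so the ordering used here (loop length must match, then lowest flank RMSD, then highest sequence identity) is an assumption, as are all names and numbers.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LoopCandidate:
    template_id: str     # placeholder identifier, not a real PDB entry
    loop_length: int     # number of residues in the candidate loop
    flank_rmsd: float    # RMSD of flanking secondary structure vs. primary template
    seq_identity: float  # fractional identity between alternate template and target


def pick_loop_template(
    candidates: List[LoopCandidate], target_loop_length: int
) -> Optional[LoopCandidate]:
    """Choose a loop donor; return None when no candidate has the right length."""
    matching = [c for c in candidates if c.loop_length == target_loop_length]
    if not matching:
        return None
    # Assumed ordering: best structural fit first, sequence identity as tie-breaker.
    return min(matching, key=lambda c: (c.flank_rmsd, -c.seq_identity))


candidates = [
    LoopCandidate("tmplA", 7, 0.8, 0.30),
    LoopCandidate("tmplB", 7, 0.8, 0.45),
    LoopCandidate("tmplC", 9, 0.2, 0.90),
]
best = pick_loop_template(candidates, 7)
print(best.template_id if best else None)
```

A `None` result corresponds to the case where no family member offers a loop of the right length and another loop-building method is needed.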
Where no template contained an appropriate loop, a graph-based method
was used to generate possible loop conformations (Samudrala & Moult, 1998).
Final conformations were selected on the basis of the relative score and visual
inspection. Detailed models were built using the in-house comparative modeling
package APSE.
MSI , 447
number of submitted models: 29
David H Kitson
email: dkitson@msi-eu.com
Homology modelling methods were used to build models of the alpha subunit of
tryptophan synthase from Pyrococcus furiosus. This protein is target T0122
for the CASP4 experiment.
The PSI-BLAST [1] server at NCBI was used to search the sequences of the
entries in the PDB database [2] for matches to the 248-residue sequence of the
target protein. Five hits were found with highly significant E-values (<=
1.0e-30) - 1CW2, 1C8V, 1A5A, 1BEU and 2TSY (in all cases, the hits were to the
A chains of these proteins).
Atomic coordinates are missing from several residues in each of these PDB
structures. To give as complete a coverage of the target protein as possible,
two template structures, both for tryptophan synthase from Salmonella
typhimurium - 1CW2 [3] and 1A5A [4] were selected. Both of these structures
have a cis peptide bond between residues A27 and A28.
The initial sequence alignments were created between 1CW2 and the target. Two
alignment methods were used - ClustalW [5] and Align2D (a part of the Modeller
package [6,7], which attempts to place gaps in structurally reasonable
locations). These alignments will be referred to as CW and A2D. A third
alignment, A2DA, was made by making a minor manual alteration to the A2D
alignment in one region of the sequence (around residues A189 - A192).
The second template, 1A5A, was then added to the alignment (the sequence
matches the sequence of 1CW2). Five models were built for each of the three
alignments (CW, A2D and A2DA) using Modeller [6,7]. For each model, the loop
refinement option of Modeller was used to generate five sets of loop
structures.
Profiles-3D [7,8] was run to analyse each of the models. Profiles-3D
generates a score that reflects the suitability of the environment of each
residue. The sequences of all models generated by a particular alignment
method were coloured to indicate the regions that Profiles-3D suggested might
be misfolded.
Based on the patterns of Profiles-3D scores, and manual examination of the
models, a hybrid final alignment (FINAL) was chosen. This consisted of the
A2DA alignment for the N-terminal region, CW or A2DA around the region of A189
- A192 (they are the same in this region) and CW at the C terminus. Five
models, with five optimised loop models for each, were built for this FINAL
alignment using Modeller.
Profiles-3D was again used to evaluate these models. Based on a combination
of the Profiles-3D score and the Modeller PDF (a measure of the extent to
which the models fit the restraints derived from the templates) models were
chosen for submission. Three of the submitted models (numbers 1, 2 and 5)
were built using the FINAL alignment. Submitted model number 3 was built
using A2DA and model 4 was built using CW.
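The way the two scores were combined is not specified above. One simple possibility, shown here purely as a hypothetical Python sketch, is to rank the models separately by each score and add the ranks. The sketch assumes that a higher Profiles-3D score and a lower Modeller PDF value are each better; the model names and numbers are made up.

```python
from typing import List, Tuple

# (model name, Profiles-3D score, Modeller PDF value)
Model = Tuple[str, float, float]


def rank_models(models: List[Model]) -> List[str]:
    """Order models by the sum of their ranks under the two criteria (best first)."""
    by_profiles3d = sorted(models, key=lambda m: -m[1])  # higher assumed better
    by_pdf = sorted(models, key=lambda m: m[2])          # lower assumed better
    rank_sum = {
        m[0]: by_profiles3d.index(m) + by_pdf.index(m) for m in models
    }
    return sorted(rank_sum, key=lambda name: rank_sum[name])


toy_models = [("m1", 100.0, 900.0), ("m2", 110.0, 800.0), ("m3", 90.0, 850.0)]
print(rank_models(toy_models))
```

A rank sum avoids having to put the two scores on a common scale, at the cost of ignoring how large the differences between models are.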
1. Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z.,
Miller, W. and Lipman, D. J. (1997) Nucleic Acids Res., 25, 3389-3402.
2. Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N.,
Weissig, H., Shindyalov, I. N. and Bourne, P. E. (2000) Nucleic Acids Res.,
28, 235-242.
3. Sachpatzidis, A., Dealwis, C., Lubetsky, J. B., Liang, P. H., Anderson, K.
S. and Lolis, E. (1999) Biochemistry, 38, 12665.
4. Rhee, S., Miles, E. W. and Davies, D. R. (1998) J. Biol. Chem., 273, 8553.
5. Thompson, J. D., Higgins, D. G. and Gibson, T. J. (1994) Nucleic Acids Res., 22,
4673-4680.
6. Šali, A. and Blundell, T. L. (1993) J. Mol. Biol., 234, 779-815.
7. Molecular Simulations, San Diego, CA (2000).
8. Lüthy, R., Bowie, J. U. and Eisenberg, D. (1992) Nature, 356, 83-85.