CASP4 Abstracts

Fold recognition category



These are the abstracts submitted by the predicting groups to the 2000 CASP4 meeting.

002 , Lomize-Andrei
003 , Gerloff
012 , Levitt
023 , Jones
027 , SHESTOPALOV
028 , Ram-Samudrala
032 , Wolynes
042 , Honig-Barry
044 , Walts-Wondrous-Wizards
045 , Del-Carpio-Yoshimori
052 , MRIT-Onizuka
065 , Torda-Andrew
073 , Holm
076 , Weng
086 , Bass-Michael
088 , ORNL-PROSPECT
094 , SAM-T2K
103 , Fugue-Cam
111 , SAM-T99
126 , Sternberg
133 , CBC-FOLD
137 , Zhou-HX
150 , Chandonia-Cohen
162 , Valencia-CNB
173 , Barton
187 , SDSC2:Reddy-Bourne
191 , Lee-Jung
197 , Godzik
204 , Finkelstein
220 , valencia-cnb-pred
223 , Braun-UTMB
229 , UCLA-DOE
248 , BMERC
255 , BinToHes
274 , Tsigelny
278 , Flake&mates
280 , Elber-Meller-2000
328 , Gibrat-Marin
329 , Tatsuya
331 , Levy
344 , PDB-ISL
357 , Fischer-Daniel
361 , GMD-SCAI
363 , Moult
375 , Ho-Kai-Ming
381 , SBfold
382 , SBauto
384 , Murzin
389 , 123D+
390 , Taylor
393 , Skolnick-Kolinski-THD
401 , Reva-Boris
414 , Friesner
473 , Mushegian
492 , Knapp
536 , Fox-Sheppard


Jones , 023

number of submitted models: 56

Fold recognition using THREADER and GenTHREADER

David T. Jones

Brunel University
email:
David.Jones@brunel.ac.uk

 
THREADER3 is the latest incarnation of our well-known threading 
program (D.T. Jones et al. Nature 358, 86-89, 1992) and although 
it now incorporates a number of new features (in particular the use of PSI-BLAST 
profiles), and a more refined set of potentials, the overall concept of the method 
remains more or less unchanged since CASP2. Firstly, a library of unique,  
continuous protein domain folds is derived from the database of protein  
structures. The fold library used throughout CASP4 was based on the  
domains found in SCOP V1.50 (A.G. Murzin et al. J. Mol. Biol. 247, 536-540, 1995). 
Each fold is considered as a  chain tracing through space with the original 
sequence either being ignored completely (for fold recognition predictions) 
or weighted into the scoring function (for comparative modelling targets). 
The test sequence is then optimally fitted to each library fold (allowing for 
relative insertions and  deletions in loop regions), using a double dynamic programming  
algorithm, with the 'energy' of each possible fit (or threading) being  
calculated by summing the proposed pairwise interactions and solvation  
parameters. 
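
The scoring of a single threading can be sketched as follows. This is a
minimal illustration only, not the THREADER3 implementation: the function
name, the contact list, the burial values and both potential tables are
hypothetical, and the double dynamic programming search that finds the
optimal fit is omitted.

```python
from typing import Dict, List, Optional, Tuple


def threading_energy(
    sequence: str,
    alignment: List[Optional[int]],
    contacts: List[Tuple[int, int]],
    burial: List[float],
    pair_potential: Dict[Tuple[str, str], float],
    solvation: Dict[str, float],
) -> float:
    """Score one sequence-to-fold fit (hypothetical tables, illustration only).

    alignment[p] is the index of the sequence residue placed at template
    position p, or None if that position is left unoccupied (a gap).
    """
    energy = 0.0
    # Pairwise term: every template contact whose two positions are occupied.
    for p, q in contacts:
        i, j = alignment[p], alignment[q]
        if i is None or j is None:
            continue
        a, b = sorted((sequence[i], sequence[j]))
        energy += pair_potential.get((a, b), 0.0)
    # Solvation term: residue preference scaled by the burial of the position.
    for p, i in enumerate(alignment):
        if i is not None:
            energy += solvation.get(sequence[i], 0.0) * burial[p]
    return energy
```

Lower values indicate a better fit under these assumed tables; a search
procedure would evaluate many candidate alignments in this way.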
 
Unlike in previous years, THREADER3 was used to make fold  
assignments without any reference to functional information. This was  
partly due to a lack of time and partly because we wished to test a new idea  
for automatic post-processing of threading predictions. For CASP4, the  
raw threading output was evaluated using a neural network (similar to that used 
in GenTHREADER) trained to  discriminate between correct and incorrect fold recognition 
matches. This method is still very experimental, but it was used for all "non-obvious"  
prediction targets. Final predictions were based on the final neural  
network output. Predictions for targets where the neural network output  
(range 0-1) of the top match was < 0.5 were not submitted (but were  
selected for ab initio prediction if the size permitted). Only a single  
prediction was submitted for each target, unless either a second fold had an  
equal score to the top hit or in a few cases where more than one alignment  
was generated with and without secondary structure prediction inputs. 
 
Remote homology targets were predicted using GenTHREADER/mGenTHREADER as submitted 
to the CAFASP2 prediction section. However, in making CASP4 submissions, where  
GenTHREADER was able to make a confident prediction (generally in  
cases where a clear evolutionary link is apparent between the target  
protein and an entry in the fold library), this fold was assumed correct  
and THREADER3 was simply used to generate the final alignment (though with  
appropriate sequence and secondary structure weighting options). 
 


SHESTOPALOV , 027

number of submitted models: 152

Doublet Code of Protein Secondary Structure
and its Application for Secondary Structure Prediction and Fold Recognition

Shestopalov Boris V

Institute of Cytology of Russian Academy of Sciences
email:
shest@mail.cytspb.rssi.ru

 
The problem of predicting protein three-dimensional structure has not yet
been resolved. We propose to resolve it using the Linderstrom-Lang
hierarchical model of protein three-dimensional structure formation [1].
The first step of this process is secondary structure formation. Then the
local folds are formed (the supersecondary structure stage). The final stage
is tertiary structure formation. We state that all the information on
these stages is coded and contained in the previous levels of structure.

The protein secondary structure code for water-soluble proteins has now been
determined (preliminary versions are described in [2] and [3]; some
modifications were made for CASP4). The code is a doublet code. Alpha-helices
are coded by amino acid residue pairs (i, i+4), beta-structures by
pairs (i, i+2), and coil regions by pairs (i, i+1). The code is
overlapping, and the overlaps are resolved by a selection rule that aims
to keep the largest number of codons after selection.

During the CASP4 experiment the protein secondary structure code was used
for secondary structure prediction and protein fold recognition. For
secondary structure prediction, homologous sequence information was used.
Homologous sequences were searched for with BLAST2 [4] (EMBL service),
PSI-BLAST and the Conserved Domain Database [5] (NCBI service), and
PRODOM [6]. The most diverse subset was used.

The predicted secondary structure was then compared with the secondary
structures from the Protein Data Bank in search of similar sequences of
secondary structure elements. The results obtained were used for fold
recognition. In the case of the helix-turn-helix motif, our own method was
used [7]. In some cases, when it was possible and useful, expert
considerations were used. After fold recognition, the secondary structure
prediction was corrected using the alignment of the predicted secondary
structure with the secondary structures of the PDB proteins recognized as
most similar to the predicted protein. The alignment was constructed
manually, using, where possible, Yale structural alignments for PARENT and
its homologues [8].

Evidently the result of the fold recognition depends on the quality of the
secondary structure prediction, and the final secondary structure prediction
depends on the quality of the fold recognition. The main limitation of the
method is that the doublet code of secondary structure is based on
medium-range interactions only and excludes long-range ones. As is well
known, neither the use of homologous sequence information nor the use of
fold recognition results guarantees a correct secondary structure
prediction. We hope to correct this situation after the completion of our
theory of protein three-dimensional structure, now in development. Then all
the formal and expert ad hoc schemes developed for CASP4 will become
unnecessary. We hope that in future it will be possible to construct the
code tables using only pure physical considerations, without the statistical
analysis of PDB data used now.

Five models for secondary structure prediction were constructed. MODEL A is
a single sequence prediction (SSP), obtained by the DOUBLET CODE METHOD as
model 3 in CASP3 [3], with slightly modified code tables. MODEL B is
obtained from MODEL A by transforming ambiguous and undetermined regions
into COIL. MODEL C is a multiple sequence prediction (MSP), obtained by
application of the DOUBLET CODE and PSI-BLAST. MODEL D is a variant of
MODEL C, obtained using Prodom. MODEL E is an MSP with more expert
intervention, including the use of the fold recognition results described
above. MODEL 1 is MODEL E; if absent, MODEL D; if absent, MODEL C; if
absent, MODEL B. For fold recognition MODEL D was used; if absent, MODEL C;
if absent, MODEL B.
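
As an illustration of the pair offsets and of the model fallback order
described above, a minimal sketch follows. It is not the DOUBLET CODE
program: the code table passed in is hypothetical (the real code tables are
not given in this abstract), the selection rule that resolves overlapping
codons is not reproduced, and the function names are our own.

```python
from typing import Dict, List, Optional, Set, Tuple

# Pair offsets stated in the abstract: helix (i, i+4), strand (i, i+2),
# coil (i, i+1). The one-letter state labels are an assumption.
OFFSETS = {"H": 4, "E": 2, "C": 1}


def candidate_codons(
    seq: str, table: Dict[str, Set[Tuple[str, str]]]
) -> List[Tuple[int, str]]:
    """Return (position, state) for every residue pair found in the table."""
    hits = []
    for state, k in OFFSETS.items():
        for i in range(len(seq) - k):
            if (seq[i], seq[i + k]) in table.get(state, set()):
                hits.append((i, state))
    return sorted(hits)


def pick_model(models: Dict[str, Optional[str]], order: str) -> Optional[str]:
    """Return the first available model, following the preference order."""
    for name in order:
        if models.get(name) is not None:
            return models[name]
    return None
```

With this sketch, `pick_model(models, "EDCB")` follows the MODEL 1 rule and
`pick_model(models, "DCB")` follows the rule used for fold recognition.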
   
References 
 
1. Linderstrom-Lang K.V. (1952) Proteins and enzymes, Stanford Univ. Press, 
   Stanford, California. 
2. Shestopalov B.V. Prediction of protein secondary structure by doublet code 
   method. Mol. Biol., Moscow, Engl. transl.,24/4,  p.900-907. 
3. Shestopalov B.V.,CASP3, submitted. 
4. Yan P. Yuan, Eulenstein, O., Vingron, M. & Bork, P. 1998. Towards detection 
   of orthologues in sequence databases.  Bioinformatics, 14, 285-289  
5. Altschul S.F., Madden T.L., Schaffer A.A., Zhang J., Zhang Z, Miller W & 
   Lipman D.J. /1997/. Nucl. Acid. Res. v. 25, pp 3389-3402. 
6. http://www.toulouse.inra.fr/prodom.html. 
7. Shestopalov B.V. Amino-acid sequence template useful for alpha-helix-turn-alpha-helix 
   prediction, FEBS Lett. 233: (1) 105-108 JUN 6 1988. 
8. Krebs W., Gerstein M. http://bioinfo.mbb.yale.edu/align/server.cgi. 
 


Holm , 073

number of submitted models: 21

Intermediate sequence search

Jong Park, Sabine Dietmann, Andreas Heger, Liisa Holm

EMBL-EBI (European Molecular Biology Laboratory's outstation the European Bioinformatics Institute)
email:
holm@ebi.ac.uk

 
We tried to predict those targets for which there were no known 
homologous structures.  The basic method systematically applied  
in each case was PDB-ISL [1], an intermediate sequence search.  
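
The idea of an intermediate sequence search can be sketched as follows: a
query that has no direct hit against a known structure may still be linked
to one through a third sequence that matches both. This is only a toy
illustration, not the PDB-ISL procedure itself; the precomputed `homologs`
map and all names are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def intermediate_search(
    query: str,
    homologs: Dict[str, Set[str]],
    has_structure: Set[str],
) -> List[Tuple[str, str]]:
    """Return (intermediate, structure) pairs linking the query to structures."""
    direct = homologs.get(query, set())
    links = []
    for mid in sorted(direct):
        for target in sorted(homologs.get(mid, set())):
            # Keep only structures that the query does not already hit directly.
            if target in has_structure and target not in direct and target != query:
                links.append((mid, target))
    return links
```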
  
In some cases, the alignment was improved by manual optimization 
of atomic solvation preference [2]. A large number of targets  
had no obvious hit by PDB-ISL, and their prediction was based  
merely on human judgement. Judgement included functional  
considerations, if there was a nonsignificant hit which might  
be a remote homologue of a known family, and solvation  
preference analysis.  We did not predict structures which  
looked like coiled coil from secondary structure composition. 
 
The most fun prediction was that for T102, where we did not  
spot the analogy to nk-lysin. PDB-ISL gave no hits, so we used 
pencil and paper. The prediction was based on correlated  
residues (implying contacts), which were identified from a  
multiple alignment, between residues 15-23, 16-52 and 18-34,  
and secondary structure elements known from NMR plus the  
information that the ends are covalently linked. Visual  
screening of helical structures rejected many as topologically  
impossible given helix handedness and the contact constraints,  
and resulted in three templates being selected, one of which  
required reversing the chain direction. The solvation  
preference profile was quite good for the t102-2tct model.   
 
Solvation preference analysis is dangerous because it only 
gives really good values for the native sequence-structure  
alignment and we had to optimize alignments manually and  
compensate for template divergence and gaps by guessing. Some  
PDB-ISL predictions were rejected due to bad solvation  
preference. Thus t109 was not predicted to be similar to the  
top PDB-ISL hit 1kfs, because we only found out too late that  
an optimized alignment gave a good solvation preference. 
 
References: 
 
[1] Teichmann SA, Chothia C, Church GM, Park J. (2000) Fast  
assignment of protein structures to sequences using the  
intermediate sequence library PDB-ISL. Bioinformatics 16:117-124. 
  
[2] Bork P, Holm L, Koonin EV, Sander C. (1995) The  
cytidylyltransferase superfamily: identification of the  
nucleotide-binding site and fold prediction. Proteins 22:259-266. 
 


Fischer-Daniel , 357

number of submitted models: 156

Fold Recognition Using the BioInBgu server

Fischer, Siew, Esterman and Mishalia

Ben Gurion University
email:
dfischer@cs.bgu.ac.il

 
We have submitted predictions for the CASP4 experiment 
based on the results from the bioinbgu server. The goal 
was to submit the exact same output as the server, 
except for the cases where the server's result appeared 
to be weak. In such cases, an educated guess based on 
the top hits was applied, in combination with alternative 
searches performed on parts of the sequence and on homologues. 
An attempt to perform basically "computable" tasks was carried 
out. When all failed, a coin was flipped. We avoided using 
any biological information. 
 
The bioinbgu server has been described in Fischer, D, 
Proceedings of the Pacific Symposium in Biocomputing, 
 Hawaii, 119-130, January 2000, and its abstract follows: 
 
Recent assessments of structure prediction 
have demonstrated that  
(i) although fold recognition methods can 
often identify remote similarities when standard sequence search methods 
fail, the score of the top-ranking fold is not always significant 
enough to allow a confident prediction; 
(ii) the use of structural information such as secondary structure increases 
recognition accuracy;  
(iii) modern sequence-based methods incorporating 
evolutionary information from neighboring sequences can often identify 
very remote similarities;  
(iv) there is no one single method 
that is superior to other methods when evaluated over a wide range of targets, 
and  
(v) extensive human-expert intervention is usually required for the most 
difficult prediction targets. 
Here, I describe a new, hybrid fold recognition method that incorporates 
structural and evolutionary information into a single fully 
automated method.  This work is a first attempt towards the automation 
of some of the processes that are often applied by human predictors. 
The method is tested 
with two fold-recognition benchmarks demonstrating a superior performance. 
The higher sensitivity and selectivity enable the applicability 
of this method at genomic scales.        
 
 


Flake&mates , 278

number of submitted models: 215

Critical Assessment of techniques for protein structure prediction
in the Cassandra package

T. Huber, M.J. Abraham, D.J. Ayers, Z. Dosztanyi, J.B. Procter, A.J. Russell, A.E. Torda, S. Flake

Supercomputer Facility, Australian National University
email:
Thomas.Huber@anu.edu.au

 
In previous CASP experiments, methods based on divine inspiration, 
biochemical knowledge and predictions from publicly available servers 
have been remarkably successful. One may even think it fool-hardy to 
submit automatic guesses from a home-built prediction package. 
Following Eddie Edwards (Edwards 1990), one may, however, also treat 
the experiment as a critical assessment of techniques for protein 
structure prediction and use the opportunity to demonstrate 
reproducible strengths and weaknesses of one's own method. In this 
spirit, the game was not played by instinct. It was regarded as an art 
learnt by obedience to instruction and a complete disregard of self. 
 
Our calculations were performed with the Cassandra package, a locally 
written protein structure prediction package. 
Alignments of target sequences to a library of protein folds were 
generated by a two step threading approach (Huber & Torda 1999) with 
scoring terms based on terms from z-score optimised force fields 
(Huber & Torda 1998), PhD secondary structure predictions (Rost & 
Sander 1993) and PAM250 sequence similarity scores. The weights of the 
different scoring contributions and penalties for introducing gaps 
into alignments have been rigorously optimised against a large, 
statistically significant set of structurally aligned proteins with 
low sequence similarity. 
The template fold library was a set of only 893 protein structures. 
The coordinates of the structures, however, were optimised so 
as to score well with sequences of structurally similar 
proteins and simultaneously penalise inappropriate sequences. 
 
After ranking the predictions, side chains were placed using self- 
consistent mean field optimisation (Huber, Torda & van Gunsteren 
1994,1996). 
 
Acknowledgement: We would like to thank the inspirational Laurie Nichols. 
 
 
References: 
 
1) Eddie Edwards (The Eagle) (1990), 
   "I think what my Olympic participation shows is that you don't have 
   to be the best in the world to be popular." 
    
2) Huber, T. and Torda, A.E., 
   Prot. Sci. 7 (1998) 1-8. 
    
3) Huber, T. and Torda, A.E., 
   J. Comp. Chem. 20 (1999) 1455-1467. 
    
4) Rost B. and Sander C., JMB 232 (1993), 584-599. 
 
5) van Gunsteren W.F., Huber T., Torda A.E., 
   Proceedings of the 1st European Conference on Computational Chemistry, 
   American Institute of Physics Conf. Proc., New York (1994). 
    
6) Huber T., Torda A.E. and van Gunsteren, W.F., 
   Biopolymers 39 (1996), 103-114. 
 


Torda-Andrew , 065

number of submitted models: 93

Fold recognition and sequence to structure
alignments without Boltzmann-based force fields

Abraham, M, Ayers, D, Dosztanyi, Z, Huber, T, Procter, JB, Russell, AJ and Torda, AE

Australian National University
email:
Andrew.Torda@anu.edu.au

 
Alignments were calculated and models ranked using the sausage 
program [1]. Sidechains were fitted using a self-consistent 
mean-field method [2]. 
 
Three force fields were used in three different steps: 
 
1. Sequence to structure alignments used a score function 
which used the identity of only one interaction partner 
[5]. This allowed us to use the Gotoh method [4] for speed, 
while avoiding the frozen approximation or double dynamic 
programming. 
 
2. Ranking of models used a z-score optimised force field [3] 
 
3. Fed by unbounded optimism or perhaps pure faith, 
side-chains were placed on the models using a more 
conventional, physically based, molecular mechanics style 
force field. 
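
The alignment in step 1 can use ordinary dynamic programming because, as we
read it, the score for placing a residue at a template site does not depend
on which residues occupy the other sites. The sketch below is one standard
three-matrix formulation of alignment with affine gap penalties in the
spirit of the Gotoh method [4]. It is not the sausage code; `site_score`
and the gap parameters are hypothetical.

```python
from typing import Callable

NEG_INF = float("-inf")


def gotoh_score(
    n_seq: int,
    n_sites: int,
    site_score: Callable[[int, int], float],
    gap_open: float,
    gap_extend: float,
) -> float:
    """Best global alignment score with affine gap penalties.

    A gap of length k costs gap_open + (k - 1) * gap_extend.
    Higher scores are better in this sketch.
    """
    rows, cols = n_seq + 1, n_sites + 1
    # M: residue i aligned to site j; X: residue i unaligned (gap in template);
    # Y: site j unoccupied (gap in sequence).
    M = [[NEG_INF] * cols for _ in range(rows)]
    X = [[NEG_INF] * cols for _ in range(rows)]
    Y = [[NEG_INF] * cols for _ in range(rows)]
    M[0][0] = 0.0
    for i in range(1, rows):
        X[i][0] = -gap_open - (i - 1) * gap_extend
    for j in range(1, cols):
        Y[0][j] = -gap_open - (j - 1) * gap_extend
    for i in range(1, rows):
        for j in range(1, cols):
            s = site_score(i - 1, j - 1)
            M[i][j] = s + max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            X[i][j] = max(
                M[i - 1][j] - gap_open,
                X[i - 1][j] - gap_extend,
                Y[i - 1][j] - gap_open,
            )
            Y[i][j] = max(
                M[i][j - 1] - gap_open,
                Y[i][j - 1] - gap_extend,
                X[i][j - 1] - gap_open,
            )
    return max(M[n_seq][n_sites], X[n_seq][n_sites], Y[n_seq][n_sites])
```

The cost is proportional to the product of sequence length and template
length, which is the source of the speed advantage over double dynamic
programming.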
 
The first two force fields may be knowledge-based, but they 
were built in complete ignorance of Boltzmann 
statistics. Instead, the parameters are optimised so as to 
distinguish native coordinates from a mass of misfolded 
structures. 
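
The quantity such an optimisation works on can be illustrated with a
z-score: the energy of the native structure measured in standard deviations
from the mean energy of the misfolded alternatives. The sketch below only
computes this number for given energies; the optimisation of force field
parameters is not shown, and the function name is our own.

```python
from statistics import mean, pstdev
from typing import Sequence


def z_score(native_energy: float, decoy_energies: Sequence[float]) -> float:
    """Standardised native energy; more negative means better separated."""
    mu = mean(decoy_energies)
    sigma = pstdev(decoy_energies)
    if sigma == 0.0:
        raise ValueError("decoy energies must not all be equal")
    return (native_energy - mu) / sigma
```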
 
A second series of optimisation calculations allowed us to 
find weights for additional terms for secondary structure 
predictions [6], sequence similarity and gap penalties. 
 
Finally, the library of templates consisted not of simple 
protein coordinates, but rather of precalculated fields due to 
averaging over similar structures. 
 
The alignment code and methodology is undisputably fast. It 
may occasionally be correct. 
 
For the last few targets, secondary structure predictions were 
made using a neural net fed on the sausage alignment 
calculations. 
 
-------------------- 
 
[1] Huber T, Russell AJ, Ayers D, Torda AE (1999) 
Bioinformatics, 15, 1064-1065. 
Sausage: protein threading with flexible force fields. 
and 
http://www.rsc.anu.edu.au/~torda/sausage.html 
 
[2] Huber T, Torda AE, van Gunsteren WF (1996), Biopolymers, 
39, 103-114. 
Optimization methods for conformational sampling using a 
Boltzmann-weighted mean field approach. 
 
[3] Huber, T and Torda, AE (1998) Protein Sci, 7, 142-149. 
Protein fold recognition without Boltzmann statistics or 
explicit physical basis. 
 
[4] Gotoh, O. (1982) J Mol Biol, 162, 705-708. 
An improved algorithm for matching biological sequences. 
 
[5] Huber T, Torda AE (1999) J Comput Chem, 20, 1455-1467. 
Protein sequence threading, the alignment problem, and a 
two-step strategy. 
 
[6] Rost B and Sander C. (1993) J Mol Biol, 232, 584-599. 
Prediction of protein secondary structure at better than 70% 
accuracy. 
 


Mushegian , 473

number of submitted models: 4

PSI-BLAST, MACAW, SWISSMOD and nothing much more

A. Mushegian

Akkadix Corporation
email:
mushegian@akkadix.com

 
Detection of remote homologs of known structure and proper alignment of the 
target and the template are the crucial, if not the only, determinants of successful fold 
recognition. I asked how far one can get in protein fold recognition by using the 
standard publicly available tools of database search and sequence alignment, PSI-BLAST 
(Altschul et al., 1997) and MACAW (Schuler et al., 1991), and the on-line modeling service, 
SWISS-MOD. I entered the competition on Aug. 15th, and ignored the targets which 
had expired or were annotated as having homologs of known structure. Additionally, 
I disregarded a few short sequences and two targets which were predicted to consist 
mostly of long coiled coils sensu Lupas. This left 16 targets. Seven of those 
turned out to have remote homologs with known structure, as judged by 
1. using PSI-BLAST with the cutoff 0.05 to convergence, or 
2. by collecting the homologs found in (1.) and using them as queries in new rounds of 
   PSI-BLAST or 
3. using the checkpoint profile built in (1.) and searching the PDB-fasta. 

Seven models were attempted based on the alignment of targets to these templates, 
and the four best ones were submitted.  


Ram-Samudrala , 028

number of submitted models: 207

Handling interconnected structural changes in comparative modelling of
proteins using a statistical scoring function, graph theory, and exhaustive enumeration techniques

Ram Samudrala and Michael Levitt

Stanford University
email:
ram@csb.stanford.edu

 
The interconnected nature of interactions in protein structures, 
thorough sampling of side chain and main chain conformations, and 
devising a discriminatory function that can distinguish between 
correct and incorrect conformations are the major hurdles preventing 
the construction of accurate homology models. We present an algorithm 
that uses graph theory to handle the problem of 
interconnectedness. Sampling of side chain and main chain 
conformations is accomplished by exhaustively enumerating all possible 
choices using a discrete state model, including fragments from a 
database of protein structures.  The optimal combination of these 
possibilities is selected using an all-atom scoring function aided by 
the graph-theoretic approach. 
 
Following is a brief description of the components and steps of this 
method, which can be divided into: discriminatory function, 
identification of template and generation of alignment, initial model 
building, construction of variable main chain and side chain regions, 
and moving models closer to the native conformation. 
 
0. DISCRIMINATORY FUNCTION: the function used throughout generally is 
an all-atom distance-dependent conditional probability discriminatory 
function based on a statistical analysis of known protein 
structure. The negative log of the conditional probability of 
observing two atoms interact given a particular distance is used as a 
``pseudo-energy'' term.  Reference: J Mol Biol 275: 893-914 (1998). 
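
A minimal sketch of how such a pseudo-energy table could be derived from
observed atom pairs is given below. It is not the published function: the
normalisation by the pair-independent distance distribution, the absence of
any small-sample correction, the binning and all names are assumptions made
for illustration.

```python
import math
from collections import Counter
from typing import Dict, Iterable, Tuple

Obs = Tuple[str, str, int]  # (atom type a, atom type b, distance bin)


def build_scores(observations: Iterable[Obs]) -> Dict[Obs, float]:
    """Return -ln[P(bin | a, b) / P(bin)] for every observed (a, b, bin)."""
    # Store each pair with its atom types in a fixed order.
    obs = [(min(a, b), max(a, b), d) for a, b, d in observations]
    pair_bin = Counter(obs)
    pair_total = Counter((a, b) for a, b, _ in obs)
    bin_total = Counter(d for _, _, d in obs)
    n_obs = len(obs)
    scores = {}
    for (a, b, d), count in pair_bin.items():
        p_cond = count / pair_total[(a, b)]  # P(bin | atom types)
        p_ref = bin_total[d] / n_obs         # P(bin) over all atom types
        scores[(a, b, d)] = -math.log(p_cond / p_ref)
    return scores
```

Under these assumptions, a conformation would be scored by summing the
table entries over its atom pairs, with lower totals preferred.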
 
1. IDENTIFICATION OF TEMPLATE AND GENERATION OF ALIGNMENT: The CAFASP 
meta-server data were used to identify the proteins that a given 
target sequence was related to (based on a consensus of all the hits 
produced by the different servers). The alignments generated by the 
different servers were then used to construct initial models. The 
initial models were then ranked by our discriminatory function and the 
models that ranked highest were used for further model-building. 
 
2. INITIAL MODEL BUILDING: Following the sequence alignment, for each 
parent structure, an initial model was generated by copying atomic 
coordinates for the main chain (excluding any insertions) and for the 
side chains of residues that are identical in the target and parent 
structures.  Residues that differ in type were constructed using a 
minimum perturbation technique.  The MP method changes a given amino 
acid to the target amino acid preserving the values of equivalent chi 
angles between the two side chains, where available. The other chi 
angles are constructed by the MP method using an internally developed 
library based on residue type. 
 
3. CONSTRUCTION OF VARIABLE MAIN CHAIN AND SIDE CHAIN REGIONS:  
 
Main chain sampling is performed using an exhaustive enumeration 
technique based on discrete states of phi/psi angles. For longer main 
chain regions, we use fragments (3-tuples) from a database of protein 
structures to generate the discrete phi/psi angles. 
 
Side chain possibilities are generated by selecting the most probable 
side chain rotamers based on the interactions of a given rotamer with 
the local main chain (evaluated using the discriminatory function 
above). Reference: Samudrala R, Moult J. Prot. Eng.  11: 991-997, 
1998. 
 
We then use a graph-theoretic approach to assemble the sampled side 
chain and main chain conformations together in a consistent manner. 
Each possible conformation of a residue is represented using the 
notion of a node in a graph.  Each node is given a weight based on the 
degree of the interaction between its side chain atoms and the local 
main chain atoms.  The weight is computed using an all-atom conditional 
probability discriminatory function. Edges are then drawn between 
pairs of residues/nodes that are consistent with each other (i.e., 
clash-free and satisfying geometrical constraints). The edges are also 
weighted according to the probability of the interaction between atoms 
in the two residues. Once the entire graph is constructed, all the 
maximal sets of completely connected nodes (cliques) are found using a 
clique-finding algorithm. The cliques with the best probabilities 
represent the optimal combinations of mixing and matching between the 
various possibilities, taking the respective environments into 
account.  Reference: J Mol Biol 279:287-302 (1998).  Clique-finding is 
accomplished using the Bron and Kerbosch algorithm.  Reference: 
Communications of the ACM, 16: 575-577 (1973). 
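
The clique-finding step can be illustrated with the basic form of the Bron
and Kerbosch algorithm, without pivoting. This is a sketch and not our
production code: the node representation, the weight dictionaries and the
additive scoring of a clique are assumptions made for the example.

```python
from typing import Dict, List, Set, Tuple

Node = Tuple[int, str]  # (residue index, conformation label)


def bron_kerbosch(graph: Dict[Node, Set[Node]]) -> List[Set[Node]]:
    """Enumerate all maximal cliques of an undirected graph."""
    cliques: List[Set[Node]] = []

    def expand(r: Set[Node], p: Set[Node], x: Set[Node]) -> None:
        if not p and not x:
            cliques.append(r)  # r cannot be extended: it is maximal
            return
        for v in sorted(p):
            expand(r | {v}, p & graph[v], x & graph[v])
            p = p - {v}  # v has been fully explored
            x = x | {v}  # exclude v from later cliques at this level

    expand(set(), set(graph), set())
    return cliques


def clique_score(
    clique: Set[Node],
    node_w: Dict[Node, float],
    edge_w: Dict[Tuple[Node, Node], float],
) -> float:
    """Sum node weights and the weights of all edges inside the clique."""
    nodes = sorted(clique)
    total = sum(node_w[n] for n in nodes)
    for i, a in enumerate(nodes):
        for b in nodes[i + 1:]:
            total += edge_w.get((a, b), 0.0)
    return total
```

Because two conformations of the same residue are never joined by an edge,
each clique holds at most one conformation per residue. With pseudo-energy
weights, the preferred combination is the clique with the lowest score, for
example `min(cliques, key=lambda c: clique_score(c, node_w, edge_w))`.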
 
All models used were refined using ENCAD. 
 
5. MOVING MODELS CLOSER TO THE NATIVE CONFORMATION: 
 
Once we had generated a final model for each parent, we used  
an off-lattice fourteen-state phi/psi model and a sequential 
build-up algorithm to generate structures around the conformational 
space of the final model. We then used our scoring function to select 
the best ranking ones. The goal here is that some of the conformations 
sampled would actually be closer to the native conformation and that 
our scoring function will be able to select it. 
 
We test how the above approach works in a comparative-modelling 
scenario and assess the predictive power of this method by applying it 
to properly controlled blind tests as part of the fourth meeting on 
the Critical Assessment of protein Structure Prediction methods 
(CASP4). Compared to CASP2 and CASP3, where a similar approach was 
used, we have improved the method used to sample main chains and have 
made minor enhancements to the other components of this approach 
including the scoring function. The biggest change is in our attempt 
to move models closer to the final answer. It remains to be seen how 
the improvements in methodology correlate with model accuracy. 
 


Skolnick-Kolinski-THD , 393

number of submitted models: 74

Generalized Comparative Modelling: A combined threading
-refinement approach to structure prediction

A. Kolinski, D. Kihara, M. Bettancourt, Piotr Rotkiewicz, M. Boniecki, and Jeffrey Skolnick

Danforth Plant Science Center
email:
skolnick@danforthcenter.org

 
	A hierarchical, generalized comparative  
modeling method has been applied to predict the  
tertiary structure. First, PROSPECTOR (1), a new  
threading algorithm, is used to select the template  
structure. Threading also provides predicted  
secondary structure and tertiary contacts that are not  
restricted to the template structure but can be  
extracted from other structures. This allows the  
possibility of fold prediction in those regions absent  
in the alignment of the probe sequence to the  
template structure. Next, the aligned parts of the  
probe sequence were fitted to the template, and  
pieces of the lattice chain were built by taking into  
consideration the excluded volume of the model  
chain and the necessity of "stretching" the chain  
between the gaps in the template. Then, starting  
from the shortest loop, the loops and nonaligned 
 chain ends were randomly inserted, again taking  
into account the excluded volume.  The proper  
geometry of the model chain (avoiding non-physical  
distances between side-groups close along the  
chain) was preserved during the chain-building  
procedure. Then, using the side chain, center of  
mass based lattice model (SICHO) of Kolinski and  
Skolnick (2), the structure is refined in the  
neighborhood of the template fold; an early variant  
of this method has been described previously (3),  
but now the template is treated in a more permissive  
manner. From a series of folding/structure 
refinement simulations that employ parallel  
tempering to explore conformational space, the  
lowest energy structures are extracted and subjected  
to a two-part structure selection protocol. First, the  
structures are clustered, and the resulting clustered  
folds are selected to provide a set of predicted  
structures. In parallel, distance geometry is used  
to generate alternative representative structures. All  
folds are locally relaxed using a more detailed off- 
lattice model comprised of the alpha carbons and a  
one or two center description of the side chains that  
depends on the side chain size. Atomic detail is then  
added and the resulting structures are reported. 
1.	J. Skolnick and D. Kihara, Defrosting the frozen approximation: PROSPECTOR: A new 
        approach to threading ,Proteins in press (2000). 
2.	A. Kolinski and J. Skolnick, Assembly of protein structure from sparse experimental 
        data: An efficient Monte Carlo Model ,Proteins 32 475-494 (1998). 
3.	A. Kolinski, P. Rotkiewicz, B. Ilkowski and J. Skolnick, A method for the 
        Improvement of threading-based protein models ,Proteins 37 592-610 (1999). 
         


BMERC , 248

number of submitted models: 42

Discrete State-Space Models Method

Jadwiga R. Bienkowska , Honxian He and Temple F. Smith

BMERC, Boston University
email:
jadwiga@darwin.bu.edu

 
The methods used in the CASP4 prediction contest combined 
two algorithms [Bienkowska et al. 2000] and [Das and 
Smith. 2000]. The first algorithm takes into account the 
high probability of sub-optimal sequence-to-structure 
alignments in the fold recognition method. The second 
algorithm is a profile based multiple sequence alignment 
method. It was applied in the search for the best 
sequence-to-structure alignment. 
 
I. Fold recognition approach 
 
Our approach is based on the DSM representation of protein 
structures (Stultz et al. 1993). Mathematically a DSM is 
represented as an HMM, with a distinction that DSMs are 
designed rather than trained HMMs. The design of a 
structural DSM relies on a prior knowledge about protein 
structure and attempts to introduce a minimal bias among 
alternative realizations of the same structural fold. 
 
Each DSM for a fold represented by a PDB structure is 
constructed hierarchically out of a set of substructure 
models.  These sub-models represent standard secondary 
structures and include, via residue position and solvent 
exposure parameterization, implied three dimensional 
packing information.  The secondary structure sub-models 
are joined by loop/turn sub-models following the observed 
arrangement of the secondary-structure elements along the 
protein sequence.  Each sub-model, or a DSM plex, is 
constructed on the basis of a secondary-structure 
assignment made for the parent structure by DSSP.  The 
secondary-structure plex is represented by a hidden Markov 
chain that allows for the anticipated variation and DSSP 
assignment uncertainty in length of plus or minus one 
residue at both ends.  Assignment of amino acid 
probabilities to each hidden state in helix or strand is 
based on position, secondary structure and solvent 
exposure. The states that correspond to loops are assigned 
independently of the solvent exposure of the residues 
observed in loops. Thus, only the structural information 
about the model protein is used in this approach. The 
anticipated variation in the protein structure is 
represented in the variation encoded in each secondary 
structure plex and also in the C-terminal and N-terminal 
loop plexes that allow an additional amphipathic helix to 
be added at both ends of a structural domain. 
 
The HMM representation of the protein 3D structures allows 
the use of the forward-backward (Rabiner 1989) or filtering 
(White 1988) algorithm for calculation of the total 
probability (Bienkowska et al. 2000) that any given model 
could have generated any given sequence. The filtering 
algorithm is more sensitive than dynamic programming (or 
Viterbi) algorithms, commonly used in sequence-sequence and 
sequence-structure comparison methods, that calculate only 
the most probable path through the model (optimal 
sequence-model alignment). The filtering algorithm 
calculates P(Seq|Model), from which one can then calculate 
P(Model_i|Seq) using the Bayesian relation.  The posterior 
probability is then relative to the entire set of competing 
models. The DSM library contained 539 competing models that 
represent 305 SCOP superfamilies. We employed a binary 
decision approach and called a prediction when the 
posterior probability of a model was greater than 0.5. 
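
The difference between the total probability and the single best path, and the Bayesian step from P(Seq|Model) to a posterior over competing models, can be illustrated with a minimal Python sketch. The two toy two-state HMMs, the two-letter alphabet, the uniform prior over models, and all function names are assumptions made for illustration only; they are not the DSM library or the filtering code used for the predictions.

```python
import numpy as np

def forward_likelihood(init, trans, emit, obs):
    """Total probability P(obs | model), summed over all state paths."""
    alpha = init * emit[:, obs[0]]
    for symbol in obs[1:]:
        # trans[i, j] = P(next state j | current state i)
        alpha = (alpha @ trans) * emit[:, symbol]
    return float(alpha.sum())

def viterbi_likelihood(init, trans, emit, obs):
    """Probability of the single most probable state path only."""
    delta = init * emit[:, obs[0]]
    for symbol in obs[1:]:
        delta = (delta[:, None] * trans).max(axis=0) * emit[:, symbol]
    return float(delta.max())

def model_posteriors(likelihoods, priors=None):
    """Bayes' rule over a set of competing models (uniform prior assumed)."""
    likelihoods = np.asarray(likelihoods, dtype=float)
    if priors is None:
        priors = np.full(len(likelihoods), 1.0 / len(likelihoods))
    joint = likelihoods * priors
    return joint / joint.sum()

# Hypothetical toy models over a two-letter alphabet (0 and 1).
init = np.array([0.5, 0.5])
trans = np.array([[0.9, 0.1], [0.1, 0.9]])
emit_a = np.array([[0.8, 0.2], [0.6, 0.4]])  # model A favours symbol 0
emit_b = np.array([[0.2, 0.8], [0.4, 0.6]])  # model B favours symbol 1
sequence = [0, 0, 0, 1, 0, 0]

total = [forward_likelihood(init, trans, e, sequence) for e in (emit_a, emit_b)]
best_path = [viterbi_likelihood(init, trans, e, sequence) for e in (emit_a, emit_b)]
posteriors = model_posteriors(total)
called = [bool(p > 0.5) for p in posteriors]  # binary decision at 0.5
```

With a 0.5 threshold at most one model can be called for a given sequence, which is what makes the decision binary.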
 
II. Alignment generation 
 
The query sequences and the sequences representing the 
functional family of the selected model structure were 
subsequently submitted to the multiple alignment software 
PIMA.  The PIMA algorithm uses the combined probability of the 
amino acids aligned among the homologous sequences and the 
prior probabilities of observing each pair of amino acids in 
an alignment. 
 
Bienkowska J.R., Yu L., Zarakhovich S., Rogers Jr R. and 
Smith T.F. Protein Fold Recognition by Total Alignment 
Probability. Proteins: Structure, Function and Genetics, 
40(3): 451-464, 2000. 
 
Das S. and Smith T. F. Identifying nature's protein lego 
set. Advances in Protein Chemistry ed. Peer Bork vol 54: 
159-183 Academic Press 2000. 
 
 
Rabiner L. R.  A tutorial on hidden Markov models and 
selected applications in speech recognition.  Proceedings 
IEEE, 77:257-286, 1989. 
 
Stultz C.M., White J.V., and Smith T.F. Structural analysis 
based on state-space modeling.  Protein Science, 2:305-314, 
1993. 
 
White J. V.  Bayesian analysis on time series and dynamic 
models, pages 255-283.  Marcel Dekker, New York, NY USA, 
1988. 
 


Sternberg , 126

number of submitted models: 45

Fold Recognition using structural profiles (3D-PSSMs) and textual information (SAWTED)

Kelley LA, MacCallum RM & Sternberg MJE

Imperial Cancer Research Fund
email:
kelley@icrf.icnet.uk

 
The primary method used was that described in (Kelley et al., 2000), 
using the program 3D-PSSM. One of the key features of this technique is the use  
of multiple structural alignments of remote homologues to create extended sequence 
profiles (3D-PSSMs). These profiles can capture the sequence characteristics of an entire  
structural superfamily, and extend the range of profiles generated from sequence  
similarity alone (e.g. PSI-Blast). 
 
The method involves a three-pass dynamic programming algorithm against a library  
of known folds taken from SCOP and the PDB. Each of the three passes of dynamic  
programming uses sequence, secondary structure and solvation terms. Secondary  
structure is matched between a known library structure and the predicted  
secondary structure for the query. Secondary structure prediction was done  
using PSI-Pred (Jones,1999). Our solvation model is knowledge-based and similar  
to (Jones et al., 1992). 
 
Each of the three passes differs in the sequence profile used. Two  
of the sequence profiles (or PSSMs) are taken from PSI-Blast: i) the PSSM 
for the query sequence and ii) the PSSM for the library structure. The third 
sequence profile is generated from multiple structural alignments and so we 
call these 3D-PSSMs. We use structural alignments of homologous proteins of  
similar three-dimensional structure in the SCOP database to obtain a structural  
equivalence of residues. These equivalences are used to extend multiply aligned  
sequences obtained by PSI-Blast. The resulting large superfamily-based  
multiple alignment is converted into a (3D)PSSM.  
 
The final alignment produced by the algorithm is the highest scoring of these  
three passes. Our web server (http://www.bmm.icnet.uk/~3dpssm) reports the top  
20 highest scoring structures in our library, as calculated by 3D-PSSM. In addition 
to the alignment score, we have incorporated a textual component to aid in functional 
assignment. The program used is SAWTED for Structure Assignment With Text Description 
(MacCallum et al., 2000; http://www.bmm.icnet.uk/~sawted). This method compares the  
comments and keywords between SWISS-PROT homologues of the query sequence and the  
library sequence. Confident SAWTED scores are combined with the 3D-PSSM alignment 
scores to reflect potential functional similarity between query and template. This 
is intended to mimic, to a small degree, the human assessor's ability to gauge the 
likelihood of a correct fold or superfamily assignment based on his or her knowledge  
of the function of the query and template. 
 
The above techniques were applied automatically in both the CASP4 and CAFASP-2 
evaluations. However, for the CASP4 evaluation, we have additionally used manual 
intervention in many cases.  
 
For orphan targets, or targets with few varied homologous sequences in the sequence  
database, we would run tblastn at the NCBI against unfinished microbial genomes 
(http://www.ncbi.nlm.nih.gov/Microb_blast/unfinishedgenome.html). This would sometimes 
supply us with a sufficiently large and diverse set of sequences to improve secondary 
structure prediction accuracy, and a more powerful sequence profile. 
 
When we suspected a secondary structure prediction may be erroneous, we would 
send the target sequence to the Jpred server (http://jura.ebi.ac.uk:8888/submit.html) 
and look for consensus and, if necessary, run the 3D-PSSM server on a separately compiled 
secondary structure prediction. 
 
In cases where no confident 3D-PSSM hits could be found for a given target, we would 
often use PFAM (http://pfam.wustl.edu/) alignments to either generate an alternative 
secondary structure prediction, or analyse alternative sequences from the same PFAM family  
as the target, using the protocol above. 
 
In many cases, the automatic alignments produced by the 3D-PSSM server were manually 
adjusted to meet a variety of criteria: 
 
	a)Maintenance of a hydrophobic core based on three-dimensional models generated 
	from the alignments. 
 
	b)Equivalencing of known core residues (as pre-calculated using a mutual 
	contact algorithm) with hydrophobic residues in the target. 
 
	c)Preservation of the continuity of secondary structure elements. 
 
	d)Maintenance of the spatial arrangements of residues suspected to form the active site. 
 
	e)Alignment of known motifs (such as the Walker A and B motifs in P-loops, or known 
	conserved residue types in OB-folds e.g.(Bycroft et al.,1997)) 
 
	f)Maintenance of the spatial distances between cysteine residues believed 
	to form disulphide bridges. 
 
 
Bycroft M., Hubbard T. J. P., Proctor M., Freund S. M. V., Murzin A. G. (1997).  
The solution structure of the S1 RNA binding domain: a member of an ancient nucleic 
acid-binding fold.  Cell , 88, 235-242  
 
Jones D. T., Taylor W. R., Thornton J. M. (1992). A new approach to fold recognition.   
Nature , 358, 86-89. 
 
Jones D. T. (1999). Protein secondary structure prediction based on  
position-specific scoring matrices.  J. Mol. Biol. 292, 195-202  
 
Kelley LA, MacCallum RM & Sternberg MJE (2000). Enhanced Genome Annotation  
using Structural Profiles in the Program 3D-PSSM. J. Mol. Biol. 299(2), 501-522.  
 
MacCallum, R.M., Kelley, L.A. and Sternberg, M.J.E. (2000) Bioinformatics 16(2), 
125-129.  
 


123D+ , 389

number of submitted models: 214

Fold recognition with 123D+ server.

Nickolai N. Alexandrov

NCI-FCRF
email:
nicka@ncifcrf.gov

 
The 123D+ server compares a target sequence with a set of protein domains from the ASTRAL 
non-redundant set (version 1.50, 50% identity list). For every residue in the 
domain, the following information is derived from the PDB files: (i) residue 
type (amino acid in SEQRES field), (ii) secondary structure, assigned by 
Stride, and (iii) the number of contacts with other residues. Domain profiles 
are created by running psi-blast against the NR database. Similarly, a psi-blast 
profile is created for the target sequence. The secondary structure of the target is 
predicted by a probabilistic approach from statistics of amino acid pairs in a 
sliding window of 17 residues. The similarity score between position i in the target 
and position j in the domain is computed as: log((Paa*Pss*Pcc)/(P'aa*P'ss*P'cc)), 
where Paa is the probability of having the same amino acid at i and j, computed 
from the psi-blast profiles; Pss is the probability of having the same secondary 
structure; and Pcc is the probability of having the same number of contacts, 
computed from the contact capacity potentials for every residue type. P'aa, 
P'ss, and P'cc are the corresponding expected probabilities. 123D+ uses dynamic 
programming to find an optimal sequence-structure alignment. In addition to 
the standard events of match, deletion, and insertion, the algorithm allows 
residues to be left unaligned, which helps to deal with different loop 
conformations. The default alignment mode used was "fit", in which the whole domain 
is required to be aligned with a part of the target sequence. 123D+ was 
benchmarked with the ASTRAL set of domains and outperformed psi-blast in fold 
recognition. 123D+ is available at 
http://www-lmmb.ncifcrf.gov/~nicka/run123D+.html. 
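
The position-pair score quoted above is a log-odds ratio, so it separates into three additive terms, one per information source. The Python sketch below only restates that formula; the function name, argument names, and the numeric probabilities are hypothetical and are not taken from 123D+.

```python
import math

def similarity_score(p_aa, p_ss, p_cc, exp_aa, exp_ss, exp_cc):
    """log((Paa*Pss*Pcc) / (P'aa*P'ss*P'cc)) for one (i, j) position pair.

    p_*   : observed probabilities (amino acid, secondary structure, contacts)
    exp_* : the corresponding expected probabilities
    """
    return math.log((p_aa * p_ss * p_cc) / (exp_aa * exp_ss * exp_cc))

# Hypothetical values: only the amino acid term is above expectation.
score = similarity_score(0.2, 0.5, 0.3, 0.1, 0.5, 0.3)

# The same score written as a sum of three log-odds terms.
as_sum = math.log(0.2 / 0.1) + math.log(0.5 / 0.5) + math.log(0.3 / 0.3)
```

A score of zero means the pair is exactly as likely as expected by chance; positive scores favour aligning position i with position j in the dynamic programming step.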


Lomize-Andrei , 002

number of submitted models: 28

Assembly of protein cores from regular secondary structures:
ab initio and fold recognition techniques.

Andrei L. Lomize, Irina D. Pogozheva, and Henry I. Mosberg

College of Pharmacy, University of Michigan
email:
almz@umich.edu

 
 
   3D models of protein cores (complexes of several interacting 
alpha-helices and beta-sheets, excluding nonregular loops) have 
been generated for 19 CASP4 targets with no detectable sequence 
homology to proteins of known structure.  The partially automatic 
procedure described below reproduces main blocks of a large software 
package that is under development in our group to test the validity 
of the entire approach and its specific parts.  The procedure 
includes the following three steps. 
 
   STEP 1. Ab initio prediction of secondary and supersecondary 
structure using two different methods: 
  (a) calculation of alpha-helices, alpha-hairpins, and beta-hairpins 
in hydrophobically collapsed protein using the program Framework [1]; 
  (b) identification of alpha-helices and beta-strands based on 
hydrophobicity patterns in multiple sequence alignments [2]; 
    Possible beta-sheet topologies and the structural class of the 
target (beta-sandwich, beta-barrel, beta-helix, beta-prism, different 
alpha+beta and alpha/beta structures, alpha-superhelix, or alpha-bundle) 
were suggested based on a qualitative analysis of results produced by 
both methods. 
 
   STEP 2. Fold recognition.  The procedure included the following 
three parts. 
  (1) Identification of related PDB structures using a library of 
"supersecondary nuclei" in proteins [3], and the following criteria: 
     (a) similar secondary structures of the target and template, 
including number, order, and lengths of alpha-helices and beta- 
strands, and identical beta-sheet topologies, 
     (b) similar biological functions; 
Twelve of the nineteen targets considered (T0088,T0094,T0098,T0100, 
T0101,T0102,T0104,T0107,T0108,T0109,T0118, and T0126)  satisfied 
these criteria, and therefore were designated for fold recognition. 
  (2) Finding optimal alignment of secondary structures in the 
target and template that maximizes formation of aliphatic, aromatic, 
and polar clusters and burial of nonpolar side-chains. 
  (3) Adjustment of side-chain conformers and the spatial positions 
of entire alpha-helices to improve close packing, burial of nonpolar 
groups, and hydrogen bonding. 
 
   STEP 3. Ab initio assembly of 3D cores from alpha-helices and 
beta-sheets - for targets that could not be assigned to any known 
protein fold in STEP 2 (T0091, T0095, T0097, T0105, T0106, T0110, 
and T0114).  The docking of regular secondary structures (using 
QUANTA and our unpublished software) sought to optimize burial 
of nonpolar side-chains, segregation of aliphatic, aromatic, and polar 
groups into separate clusters, close packing, and hydrogen bonding 
in simultaneously constructed models of several homologous proteins 
from the target family.  Two different assembly strategies were tested 
for all-alpha-helical domains: stepwise building of the core from 
gradually growing structures (T0106 and models 2 of T0095 and T0097), 
and formation of a nearly complete core (models 1 of T0095 and T0097). 
 
   [1] A.L.Lomize and H.I. Mosberg (1997) Thermodynamic model of 
secondary structure for alpha-helical peptides and proteins. 
Biopolymers, v.42, pp. 239-269 
   [2] A.L.Lomize, I.D. Pogozheva, and H.I. Mosberg (1999) Prediction 
of protein structure: the problem of fold multiplicity.  Proteins, 
Suppl.3, pp.199-203 
   [3] A.L.Lomize, I.D. Pogozheva, and H.I. Mosberg (1999)  Protein 
structure assembly pathways.  Protein Sci., v. 8, Suppl.1, p.86 
 
 
 
 


Reva-Boris , 401

number of submitted models: 192

Recognition of protein structure by threading
with averaging energies over homologs.

B.A.Reva, A.V.Finkelstein, D.S.Rykunov, M.Yu.Lobanov.

Novartis Institute for Biomedical Research
email:
boris.reva@pharma.novatis.com

 
To compute the energy of a protein chain with loops in an external 
field, we developed a model in which 
 
(i) the 3D position of any amino acid residue is given by the 
position of its Ca atom; each of the amino acids of a target 
sequence occupies a position either on a template or in a "loop", 
i.e. a non-aligned region of the sequence (loop structures are not 
defined; the energy of a loop depends on the template positions of 
its ends, on the types of residues in the loop, and on the 
number of residues in the loop); 
 
(ii) a template is a limited set of positions for Ca-atoms  
in 3D space; only backbones of proteins from the PDB are used  
as structural templates;  
 
(iii) the energy of interaction between a residue and a template  
is given by the potential of an "external field" that depends  
on the type of residue and on its position on the template; 
 
(iv) residues of a target sequence can occupy any position  
on a template.  
 
For each target sequence all available PDB structures were used  
as templates. 
 
An external field acting on a residue is produced by summing 
up all interactions of a given residue with residues of a template 
structure. Local interactions between neighboring residues are 
calculated explicitly. (Local interactions include interactions 
between neighbors separated by 1, 2, or 3 residues along the chain, 
a bending energy that depends on the type of the residue between 
two interacting residues, and the chiral energy of the backbone 
[1].) 
 
A sequence-to-structure alignment is computed by dynamic  
programming. The energy of the obtained structure is computed 
and used to rank templates. 
 
Averaging of energy over homologs [2] was applied in the energy  
calculations described above. To this end, we computed a sequence  
alignment [3] of the target sequence and the corresponding  
homologs extracted from the non-redundant sequence database.  
Only homologs with low sequence similarity were used in  
energy averaging.    
 
To check the quality of obtained sequence-to-structure alignments 
we computed Z-scores using fragmentarian gapless threading [4]. 
Typically the lowest-energy structures give low Z-scores; 
however, when the energy-based selection was ambiguous we used 
the Z-score-based selection for the final submissions. 
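
As a reminder of what a Z-score expresses here, the following minimal Python sketch standardizes a candidate energy against a background set of energies. How the background energies are generated (the abstract uses gapless threading of fragments [4]) is not modelled; the function name, the use of the population standard deviation, and the numbers are assumptions for illustration.

```python
import math
from statistics import mean, pstdev

def z_score(candidate_energy, background_energies):
    """Number of standard deviations the candidate lies from the background mean.

    A strongly negative value means the candidate energy is much lower
    than is typical for the background set.
    """
    mu = mean(background_energies)
    sigma = pstdev(background_energies)  # population standard deviation (assumed)
    return (candidate_energy - mu) / sigma

# Hypothetical energies in arbitrary units.
background = [0.0, 1.0, -1.0, 2.0, -2.0]
z = z_score(-10.0, background)
```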
 
References 
 
1.Reva,B., Finkelstein,A., Skolnick,J. Derivation and testing  
residue-residue mean-force potentials for protein structure  
recognition. In: Methods in Molecular Biology, vol.143;  
Protein Structure prediction: Methods and Protocols. 2000, pp. 155-174. 
 
2.Reva B.A., Skolnick J., Finkelstein A.V. -  
Averaging of interaction energies over homologs improves  
protein fold recognition in gapless threading. -  
Proteins, 35: 353-359, 1999. 
 
3.Altschul S.F., Madden T.L., Schäffer A.A., Zhang J.,  
Zhang Z., Miller W., Lipman D.J. - Gapped BLAST and  
PSI-BLAST: a new generation of protein  
database search programs -  
Nucleic Acids Res. 25: 3389-3402, 1997. 
 
4.Reva, B.A., Topiol, S. 
Recognition of protein structure: determining the relative  
energetic contribution of beta-strands, alpha-helices and loops. 
Biocomputing. Proceedings of the Pacific Symposium 2000; 
World Scientific Publishing Co. Pte. Ltd. pp.168-178. 
 


Tatsuya , 329

number of submitted models: 115

Protein threading based on multiple protein structure alignment

Tatsuya Akutsu, Morihiro Hayashida, Yuichiro Horai, Kenta Nakai

Human Genome Center, University of Tokyo
email:
takutsu@ims.u-tokyo.ac.jp

 
 
We used protein threading in which structure alignment results were used 
as profiles. The prediction method was not automatic. The outline of the 
prediction method is as follows: 
 
(1) Candidates of possible structures are obtained using several tools: 
    SSEARCH33 (Smith-Waterman algorithm), PSIBLAST, PHD, a PSIBLAST-based 
    search tool using intermediate sequences (developed by us), and CAFASP  
    results. 
 
(2) Structures similar to each candidate are searched for in the PDB, using 
    STRALIGN (a pairwise structure alignment program developed by us) and 
    the SCOP/ASTRAL database. 
 
(3) For each candidate, a multiple structure alignment is computed 
    from pairwise structure alignments for similar structures 
    by using a method similar to the center star method. 
 
(4) A protein threading (i.e., an alignment between a sequence and a 
    structure) is computed by using CLUSTALW and PSIBLAST. Then, candidate 
    structures are ranked based on human knowledge.  
 
In order to compute pairwise structure alignment, we used STRALIGN [1]. 
STRALIGN computes a structure alignment between two C-alpha chains 
by using dynamic programming, least-squares fitting (RMS fitting) 
and iterative improvement. 
 
In order to compute a multiple structure alignment from pairwise structure 
alignments, we used the center star method, which is well known for 
constructing a multiple sequence alignment from pairwise sequence alignments. 
To apply the center star method, the center structure must be 
determined.  Since a candidate structure was given in our case, we used 
the candidate structure as the center structure. Unlike the standard 
center star method, we did not allow insertions for the center structure. 
Details of the computation of the multiple structure alignment are described in [2]. 
 
Threadings were computed based on sequence vs. profile alignment (i.e., 
alignment between the target sequence and the multiple structure alignment). 
CLUSTALW and PSIBLAST (using -B option) were used for computing threadings. 
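
The effect of forbidding insertions in the center structure can be shown with a small Python sketch: every pairwise alignment is projected onto the center's residue positions, so the multiple alignment has exactly one column per center residue, and residues of a neighbor that align to nothing in the center are dropped. The data layout (a mapping from center position to neighbor position), the function names, and the toy sequences are assumptions for illustration; this is not STRALIGN or the procedure of [2].

```python
from collections import Counter

def center_star_no_insertions(center_seq, pairwise):
    """Build a multiple alignment with one column per center residue.

    pairwise: {name: (sequence, {center_pos: neighbor_pos})}, where the
    mapping lists the structurally equivalent residue pairs.
    """
    rows = {"center": center_seq}
    for name, (seq, mapping) in pairwise.items():
        rows[name] = "".join(
            seq[mapping[i]] if i in mapping else "-"
            for i in range(len(center_seq))
        )
    return rows

def column_profile(rows):
    """Residue counts per column, ignoring gaps."""
    length = len(next(iter(rows.values())))
    return [
        Counter(row[i] for row in rows.values() if row[i] != "-")
        for i in range(length)
    ]

# Hypothetical candidate (center) and two structural neighbors.
pairwise = {
    "n1": ("AGCNE", {0: 0, 1: 2, 2: 3, 3: 4}),  # 'G' has no center partner
    "n2": ("SDE", {2: 1, 3: 2}),                # covers only the C-terminal half
}
rows = center_star_no_insertions("ACDE", pairwise)
profile = column_profile(rows)
```

The resulting per-column counts are the kind of profile against which a target sequence could then be aligned.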
 
[1] Tatsuya Akutsu, Protein structure alignment using dynamic programming 
    and iterative improvement, IEICE Trans. on Information and Systems,  
    E79-D:1629--1636, 1996. 
 
[2] Tatsuya Akutsu and Kim Lan Sim, Protein threading based on multiple 
    protein structure alignment, in: Genome Informatics 1999 (Universal 
    Academy Press, Tokyo), 23--29, 1999. 
 
 


Finkelstein , 204

number of submitted models: 31

Recognition of protein structure by threading
with double averaging energies over target sequence homologs and structural neighbors.

D.S.Rykunov, B.A.Reva, M.Yu.Lobanov, A.V.Finkelstein.

Institute of Protein Research RAS, Institute of Theoretical & Experimental Biophysics RAS
email:
rykunov@alpha.protres.ru

 
Our group (Finkelstein) and Dr. Reva's group (Reva-Boris) use the same core 
threading program, but different template libraries and different post-threading 
prediction processing. Here we describe the core approach, our (Finkelstein) 
template definition, and our final decision step. 

To compute the energy of a protein chain threaded onto the template fold, we developed a model in which 
(i) Each residue of the target sequence either occupies some position on the 
template (then its 3D position is given by the template's Ca atom), or it is 
in the "loop", i.e. in the non-aligned region of the sequence. The "loop" 
structure is not defined; its energy depends on the template positions of its 
ends, and on the number and the types of the loop residues. 
(ii) The energy of interaction between each residue of the target and the 
template is determined by the potential of an "external field", acting in the 
given template position at the residue of a given type. Summing up all 
interactions of the target residue with surrounding residues of the template 
produces the external field potential. The local interactions between close 
target residues are calculated explicitly from their types and coordinates at 
the template (they include contact terms, the bend terms and the chiral terms) [1].  

The optimal sequence-to-structure alignment is computed by dynamic 
programming. The energy of the obtained structure is used to rank templates 
for the target sequence. Energy averaging over homologs [2] is applied to 
the energy calculations described above. To this end, we use BLAST [3] to 
obtain the target sequence homologs and to build the multiple sequence 
alignment. Only homologs with low and medium sequence similarity are 
used for the energy averaging. The representative set of the SCOP (rel. 1.50) 
[4] domains is used as the templates for each target sequence. The domains 
have been clustered using the package STAMP [5], and the energies obtained for 
the target sequence threaded onto each cluster member are averaged as 
described in our recent paper [6]. A human expert evaluation of about 20 lowest-energy 
structures is performed as the final step of the prediction process. It is 
based on the visual evaluation of the predicted structure compactness, the 
distribution of the hydrophobic/polar residues on the core/surface of the 
template, the comparison of the secondary structure predicted by the obtained 
sequence-to-structure alignment with the predictions obtained with the 
independent secondary structure prediction tools (PHD [7] and ALB [8]) and on 
any extra literature data on the target function/active site position, if 
available. 
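
The double averaging described above (over target homologs, then over the members of a structural cluster) can be sketched in a few lines of Python. Unweighted means, the nested-dictionary layout, the function names, and the energies are all assumptions for illustration; the actual weighting scheme is the one given in [6].

```python
from statistics import mean

def double_averaged_energy(member_energies):
    """Average over sequences for each cluster member, then over members.

    member_energies: {member_name: [energy of target and of each homolog
    threaded onto that member]}
    """
    per_member = [mean(energies) for energies in member_energies.values()]
    return mean(per_member)

def rank_clusters(cluster_energies):
    """Order clusters from lowest to highest double-averaged energy."""
    scored = {
        cluster: double_averaged_energy(members)
        for cluster, members in cluster_energies.items()
    }
    return sorted(scored, key=scored.get)

# Hypothetical threading energies (arbitrary units).
cluster_energies = {
    "A": {"a1": [-10, -12], "a2": [-8, -10]},
    "B": {"b1": [-5, -7]},
}
ranking = rank_clusters(cluster_energies)
```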

Acknowledgements. This work was supported by the Russian Foundation for Basic Research 
grant 98-04-49303, by the INTAS grant 99-01476 and by an International Research Scholar's 
Award to A.V.F. from the Howard Hughes Medical Institute. 
  
	REFERENCES 
 
1.Reva,B., Finkelstein,A., Skolnick,J. Derivation and testing residue-residue mean-force 
potentials for protein structure recognition. In: Methods in Molecular Biology, vol.143; 
Protein Structure prediction: Methods and Protocols. 2000, pp. 155-174. 
2.Reva B.A., Skolnick J., Finkelstein A.V. - Averaging of interaction energies over homologs 
improves protein fold recognition in gapless threading. - Proteins, 35: 353-359, 1999. 
3.Altschul S.F., Madden T.L., Schäffer A.A.,Zhang J., Zhang Z., Miller W., Lipman D.J. - 
Gapped BLAST and PSI-BLAST: a new generation of protein database search programs - Nucleic 
Acids Res. 25: 3389-3402, 1997. 
4.Murzin A. G., Brenner S. E., Hubbard T., Chothia C. - SCOP: a structural classification 
of proteins database for the investigation of sequences and structures. - J. Mol. Biol. : 
247, 536-540, 1995. 
5.Russell, R.B., Barton, G.J. STAMP: Structural Alignment of Multiple Proteins. 
Proteins 14: 309-323, 1991. 
6.Rykunov D.S.,Finkelstein A.V., Lobanov V.Yu. - Search for the most stable folds of 
protein chains. III. Improvement in fold recognition by averaging over homologous 
sequences and 3D structures. - Proteins, 40: 494-501, 2000. 
7.Rost B., Sander C. - Combining evolutionary information and neural networks to predict 
protein secondary structure. - Proteins, 19: 55-72,  1994. 
8.Ptitsyn O.B., Finkelstein A.V. - Theory of protein secondary structure and algorithm 
of its prediction. - Biopolymers, 22: 15-25, 1983. 
 


Walts-Wondrous-Wizards , 044

number of submitted models: 172

Playing protein fold charades

N. Alexandrov, V. Brover, M. Troukhan, W. Volkmuth

Ceres, Inc
email:
nicka@ceres-inc.com

 
Our prediction process consists of two steps: selecting a template structure 
and making an alignment.  

1. Template selection. 

All target sequences were compared with a set of structural domains using the 
123D+ program, which combines sequence similarity, secondary structure 
prediction and contact capacity potentials to compute a similarity score. If 
there was a hit with Z-score > 6, we made the selection based on the strongest 
hit. When the hit covered only a part of the target sequence, we cut out the 
remaining part and repeated the run. If 123D+ did not detect an obvious hit, 
we predicted the fold anyway, because sampling of a random set of recently 
predicted structures indicates that approximately 90% of them are structurally 
similar to already known folds, even if there is no strong sequence 
similarity. Without a strong 123D+ hit, we used other available associative 
information in an attempt to link the target with a protein with known 
structure. We used literature search, known metabolic pathways, gene 
expression data, position on the chromosome, operons, distribution of folds in 
the organism, secondary structure prediction, predictions of transmembrane 
helices and coiled coils. We demonstrated that there is a correlation between 
protein folds and gene expression and between protein folds and location in 
the chromosome. All this additional information gave us quite weak signals. 
However, when consistent, these signals resulted in rather confident 
predictions. This part of the prediction is analogous to playing charades, 
where one discovers an unknown word using many indirect, independent hints. 
Interestingly, we can compare the effectiveness of such an approach versus a 
purely automated method, as the 123D+ server also participated in the CAFASP section 
of CASP4.   
 
2. Alignment 
 
Alignments were computed with 123D+ program and were in some cases manually 
corrected. Manual intervention was limited to (i) placing deletions within the 
target sequence so that their edges are close in space in 3D structure and 
(ii) moving insertions in the target sequence to the surface of protein 
structure.   


Gibrat-Marin , 328

number of submitted models: 48

Use of several filters to improve the sensitivity and specificity of fold recognition methods

Antoine Marin, Joël Pothier, Karel Zimmermann, Jean-François Gibrat

Institut National de la Recherche Agronomique
email:
gibrat@versailles.inra.fr

 
The success of BLAST is due to a large extent to the associated statistical 
processing of the raw results that provides an objective way of judging the 
significance of a match. It seems to us that current threading methods do not 
pay sufficient attention to this problem of significance. We have developed  
a threading method and we have tried to address specifically the problem of 
significance of a match. 
 
Method: 
Like most threading techniques, our method consists of 4 elements : a library of folds,  
a score function, an algorithm to obtain the best sequence/structure alignment and  
a measure of the significance of the best sequence/structure alignment score.  

Database of folds: 
We consider all complete proteins of the PDB having less than 35% sequence identity. 
We do not divide proteins into structural domains. The core of the 3D structures consists 
of conserved secondary structure elements. We require that residues of the query sequence 
be aligned with residues in the core. 

Score function: 
The algorithm is a two-stage procedure. The first stage (1D stage) uses a score 
function that involves only 1 site of the template fold. Each site 
(corresponding to the Ca position of a residue in the 3D structure) is 
characterized by the residue that occupies this position in the template 3D 
structure  and by its structural state. A structural state is defined as the 
combination of a secondary structure  type (helix, strand or coil) and the 
fact of being buried or exposed. We have developed BLOSUM-like  substitution 
matrices that take into account the structural state of the residues. The 
second stage (3D stage) uses a score function that involves 2 sites of the 
template fold in contact (i.e, positions in the template 3D structure whose 
sidechains are in contact). The score function is made of 2 terms. The first 
one measures the likelihood of replacing a pair of residues in contact  in the 
template 3D structure by a pair of residues of the query sequence (it is 
similar to a sequence comparison algorithm but in 3 dimensions). The second 
term measures the likelihood of positioning a pair of residues of the query 
sequence at 2 sites in contact in the template 3D structure characterized by 
given structural states. For instance this term measures the likelihood of 
aligning a pair (Gly,Asp)  at 2 sites in contact that are, say, for the first 
a buried helix and for the second a buried strand. 

Alignment method: 
For the first stage since the score function depends only on 1 site we use a 
modified dynamic programming algorithm. For the second stage the only exact 
method to find the best alignment  is a branch and bound algorithm. We have 
implemented Lathrop's branch and bound algorithm. However, this algorithm 
is too time consuming for the biggest cores in the database, so we also 
developed a heuristic algorithm based on a stochastic method. 

Significance of the score: 
The raw score value is in general useless to judge the significance of the 
sequence/structure  alignment because the value obtained depends on the length 
of the query sequences, on the number  of sites in the template core and, 
above all, on the particularities of the 3D structure. In order to normalize 
this raw score, for each template core we align N test sequences having the 
same length as the query sequence. No pair of sequences amongst these N 
sequences has more than 25% sequence identity and none of these test sequences 
has more than 25% identity with the query sequence. These N sequences are real 
protein sequences extracted from protein sequence databases because we fear 
that shuffled sequences may lack the subtle characteristics of true proteins 
(this might be especially true for the 3D term of the score function). These N 
(N=100) sequences define a distribution of scores for aligning  real proteins 
of length L with a particular fold. Since this distribution has been obtained 
empirically  we do not know its analytical form. To avoid making unwarranted 
assumptions about the form of the distribution we normalize the query score as 
follows. First we determine the distance between the score of the 25% quantile 
and the score of the 75% quantile. Then we divide the distance between the 
score of the 25% quantile and the query score by this first calculated 
distance. We use the 25% and 75% quantiles because we consider that the scores of 
proteins before the 25% quantile or after the 75% quantile can be biased (for 
instance proteins whose score appears after the 75% quantile might be related 
to the core we are considering). Once the scores of the query sequence aligned 
with all the cores have been normalized in this way we can rank the cores by 
decreasing  normalized distance. 
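
The normalization just described can be written compactly. In the Python sketch below, the sign convention (higher raw score is better, so the distance is taken as query score minus the 25% quantile), the use of numpy's default linear interpolation for quantiles, and all names and numbers are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def normalized_distance(query_score, background_scores):
    """(query - Q25) / (Q75 - Q25) for one template core.

    background_scores: scores of the N test sequences aligned to this core.
    """
    q25, q75 = np.percentile(background_scores, [25, 75])
    return float((query_score - q25) / (q75 - q25))

def rank_cores(query_scores, background):
    """Rank cores by decreasing normalized distance."""
    normalized = {
        core: normalized_distance(query_scores[core], background[core])
        for core in query_scores
    }
    return sorted(normalized, key=normalized.get, reverse=True)

# Hypothetical data: coreB yields larger raw scores for every sequence,
# so its higher raw query score is less remarkable after normalization.
background = {
    "coreA": list(range(101)),        # Q25 = 25, Q75 = 75
    "coreB": list(range(0, 202, 2)),  # Q25 = 50, Q75 = 150
}
query_scores = {"coreA": 125, "coreB": 150}
ranking = rank_cores(query_scores, background)
```

Because only the interquartile range enters the normalization, no assumption about the analytical form of the score distribution is needed.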
 
Database of test: 
It is crucial to be able to test our method under realistic conditions. For 
this purpose we created a database that consists of pairs of proteins having 
similar 3D structures but very different sequences. The pairs of structurally 
similar proteins were obtained by running VAST on the set of 1175 PDB proteins 
with less than 35% sequence identity. We considered only protein pairs having 
similar lengths (showing at most a variation in length of 25%) and for which 
at least 65% of the residues were included in the 3D alignment. The FASTA 
program was run with the proteins of the selected pairs above and only those 
pairs for which the FASTA expected value was greater than 1 were retained. We 
thus obtained 334 pairs corresponding to 291 individual proteins. These 334 
pairs include homologous proteins but also pairs of proteins for which no 
evolutionary relationship has been demonstrated. This database allows us to 
test our method under realistic conditions since we have pairs of structurally 
similar proteins whose relationship cannot be found by usual sequence 
comparison methods. 
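
As an illustration, the three selection criteria above can be written as a 
single predicate. The names are hypothetical, and two details are assumptions 
not stated in the abstract: the length variation and the aligned fraction are 
both measured relative to the longer protein of the pair. 

```python
def keep_pair(len_a, len_b, n_aligned, fasta_evalue,
              max_length_variation=0.25, min_coverage=0.65, min_evalue=1.0):
    """Return True if a structurally similar pair passes all three filters.

    len_a, len_b   lengths of the two proteins (residues)
    n_aligned      number of residues included in the 3D (VAST) alignment
    fasta_evalue   FASTA expected value for the pair
    """
    longer, shorter = max(len_a, len_b), min(len_a, len_b)
    # Assumption: variation and coverage are relative to the longer protein.
    length_variation = (longer - shorter) / longer
    coverage = n_aligned / longer
    return (length_variation <= max_length_variation
            and coverage >= min_coverage
            and fasta_evalue > min_evalue)
```

A pair is kept only when sequence comparison fails (expected value above 1) 
while the structural alignment still covers most of both chains. 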

Results: 
We used a subset of the 291 proteins (209 proteins with a length less than 
250). Each query protein was run against the database of 1175 cores and the 
cores ranked according to the normalized distance as explained in Method. The 
1D stage is used as a filter for the 3D stage, in that only the first 10 cores 
in the ordered 1D list are subjected to the 3D filter. The rank and normalized 
score of the first true positive and first false positive found in the list 
were recorded. For the 209 proteins we obtained the results described in 
tables I and II. 

Table I: Rank of the first true positive found in the list
              1      5     10     15     20   more
Rank 1D:  61.2%  75.6%  82.8%  86.1%  89.5% 100.0%
Rank 3D:  53.6%  74.2%  81.8%  81.8%  81.8% 100.0%


Table II: Distribution of the true and false positives as a function of the normalized distance
                          1D                          3D                 
Normalized dist   True Pos.   False Pos.     True Pos.   False Pos.  
---------------   ----------------------    ------------------------
	       
 ]+inf - 6.0]     19   9.1%    0    0.0%       3   1.4%    0    0.0%
 ] 6.0 - 5.5]     10  13.9%    0    0.0%       3   2.9%    0    0.0%
 ] 5.5 - 5.0]     11  19.1%    0    0.0%       8   6.7%    0    0.0%
 ] 5.0 - 4.5]     15  26.3%    0    0.0%       8  10.5%    1    0.5%
 ] 4.5 - 4.0]     17  34.4%    2    1.0%       6  13.4%    0    0.5%
 ] 4.0 - 3.5]     17  42.6%   19   10.0%      12  19.1%    3    1.9%
 ] 3.5 - 3.0]     24  54.1%   41   29.7%      19  28.2%    9    6.2%
 ] 3.0 - 2.5]     28  67.5%   84   69.9%      33  44.0%   16   13.9%
 ] 2.5 - 2.0]     20  77.0%   53   95.2%      20  53.6%   58   41.6%
 ] 2.0 - 1.5]     25  89.0%    8   99.0%      19  62.7%   74   77.0%
 ] 1.5 - 1.0]     15  96.2%    2  100.0%      18  71.3%   37   94.7%
 ] 1.0 - 0.5]      5  98.6%    0  100.0%      16  78.9%    8   98.6%
 ] 0.5 - 0.0]      2  99.5%    0  100.0%       5  81.3%    0   98.6%
 ] 0.0 --inf[      1 100.0%    0  100.0%      39 100.0%    3  100.0%
 
 
Conclusion: 
Table I shows that if there is a true positive in the database (in our case 
this is always true), it has about a 60% (resp. 50%) chance of appearing in 
first position in the 1D stage (resp. 3D stage). However the rank does not 
constitute a good criterion for judging the significance of a 
sequence/structure alignment since when there is no similar 3D structure in 
the database there is still a core ranked first, a core ranked second, etc. 
The normalized distance provides a far better criterion. According to Table II, 
if the normalized distance is above 4 there is a 1% chance of having a false 
positive but a 35% chance of having a true positive for the 1D stage. The 3D 
stage gives slightly worse results (for a false positive rate of 1% the 
coverage is less than 20%). However we can combine the results of the 2 
filters. Instead of a single normalized distance the alignment of a core with 
the query sequence is characterized by 2 normalized distances (1D, 3D). In 2 
dimensions it is possible to define a polygon where the number of false 
positives is less than a given percentage. For instance, plotting the 1D 
normalized distances along the x axis and the 3D normalized distances along 
the y axis we can define a polygon by the line x = 0, the line y = 0, the line 
x = 4.2, the line y = 4.0 and the line y = -x + 6.5, i.e., a square (more or 
less) with an upper right corner that has been 'cut'. Outside this polygon 
there are 43% true positives and 0.5% false positives. We think that this idea 
can be generalised to more than 2 filters. The rationale behind this approach 
is that a false positive can, just by chance, get a good score for a 
particular filter but it is less likely to get a good score for several 
different filters. On the other hand a true positive can occasionally get a 
bad score for a given filter but should fare better, on average, for the 
other filters. 
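
The polygon test described above is small enough to write out. The sketch 
below uses the bounds quoted in the text as default values; the function name 
is hypothetical. One assumption is made explicit: the lines x = 0 and y = 0 
are treated only as closing the polygon, so points with low or negative 
normalized distances are not counted as "outside". 

```python
def outside_polygon(dist_1d, dist_3d, max_1d=4.2, max_3d=4.0, diagonal=6.5):
    """Return True if the point (dist_1d, dist_3d) lies beyond the right
    edge (x = 4.2), the top edge (y = 4.0) or the cut corner
    (y = -x + 6.5) of the polygon.

    Assumption: the edges x = 0 and y = 0 only close the polygon and never
    make a low-scoring alignment count as significant.
    """
    return (dist_1d > max_1d
            or dist_3d > max_3d
            or dist_1d + dist_3d > diagonal)
```

With these bounds an alignment scoring 3.5 on both filters is accepted even 
though neither score alone exceeds its own threshold, which is the point of 
combining the two filters. 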


Godzik , 197

number of submitted models: 158

FFAS fold prediction

L.Jaroszewski, A.Godzik

The Burnham Institute
email:
adam@ljcrf.edu

 
Abstract 
 
We applied an identical procedure to homology modeling targets and fold 
recognition targets. It consists of three steps: A: Selection of the 
template(s), B: Generation of suboptimal alignments, C: Model building and 
evaluation. In cases where the FFAS z-score value indicated that the 
similarity between the template and query was strong (z-score values higher 
than 15), step B was usually skipped and the model was built based on the 
alignment from FFAS. This was the case for many of the homology modeling 
targets. The prototype of this procedure, called the "Multiple Model 
Approach", was described and evaluated in (4-5). 

A. Selection of the template(s) - Fold & Function Assignment System (1,2). 
An FFAS profile-profile search was performed against the PDB database. FFAS is 
based on sequence profile-profile matching with dynamic programming. The 
multiple alignment is prepared based on the PSI-BLAST(8) output. A 
non-redundant database of protein sequences was used for profile calculation. 
FFAS uses sequences from the PSI-BLAST output with E-values below 0.01 and an 
elaborate weighting scheme for the sequences included in the profile(1). 
Weights are assigned based on the dissimilarity of each sequence with respect 
to the other sequences in the family. In addition, FFAS normalizes the matrix 
containing the comparison scores between all positions of both aligned 
profiles before the best path is searched for with the Smith-Waterman dynamic 
programming algorithm(7). 
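
The final step, a Smith-Waterman search over a matrix of position-position 
comparison scores, can be sketched as follows. This is a generic textbook 
version, not the FFAS implementation: the function name is hypothetical, the 
score matrix is assumed to be already computed and normalized, and a simple 
linear gap penalty is assumed because the abstract does not describe the gap 
model. 

```python
def smith_waterman_score(score, gap=1.0):
    """Best local alignment score for a precomputed score matrix.

    score[i][j] is the comparison score between position i of the first
    profile and position j of the second profile.
    Assumption: every gap position costs the same linear penalty `gap`.
    """
    n = len(score)
    m = len(score[0]) if n else 0
    # H[i][j]: best score of a local alignment ending at positions (i-1, j-1).
    H = [[0.0] * (m + 1) for _ in range(n + 1)]
    best = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            H[i][j] = max(
                0.0,                                    # start a new alignment
                H[i - 1][j - 1] + score[i - 1][j - 1],  # align i with j
                H[i - 1][j] - gap,                      # gap in second profile
                H[i][j - 1] - gap,                      # gap in first profile
            )
            best = max(best, H[i][j])
    return best
```

Because cells are floored at zero, the scale and offset of the score matrix 
determine how far a local alignment extends, which is one reason a 
normalization step before the search matters. 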

B. Calculation of suboptimal alignments. 
A set of suboptimal (alternative) alignments was generated for the query 
sequence and the template structure(s) selected from the PDB database in 
step A. After the calculation of the initial alignment based on the 
profile-profile FFAS method, a similarity matrix was recalculated using 
several combinations of threading terms (burial and local conformation terms 
are used). The threading energy was calculated for the sequence profile 
rather than for a single sequence, as is done in classical threading. 
Several gap penalty values were also explored. Gap penalties were set higher 
within the secondary structure elements defined with the method described in 
a separate publication(3). The resulting alignments were clustered to avoid 
redundancy. 

C. Model building and evaluation. The models based on the alignments 
calculated in step B were built and evaluated. We used the MODELLER(6) program 
developed in A. Sali's lab for model building. Model evaluation is based on 
the threading energy, using a statistical potential and evolutionary 
information encoded in sequence profiles (the threading energy was calculated 
for the sequence profile rather than for a single sequence, as is done in 
classical threading, for example in the MatchMaker program). The threading 
energy per residue was the final criterion of model quality. 

 
References 
1. Rychlewski, L., Jaroszewski, L., Li, W. & Godzik, A. (2000).  
"Comparison of sequence profiles. Strategies for structural predictions using 
sequence information". Protein Science 9, 232-241  

2. Jaroszewski, L., Rychlewski, L. & Godzik, A. (2000). 
"Improving the quality of twilight-zone alignments". Protein Science, 9, 1487-1496 
 
3. Jaroszewski, L. & Godzik, A. (2000). Search for a New Description of 
Protein Topology and Local Structure. ISMB 2000 - 8-th International 
Conference on Intelligent Systems for Molecular Biology, San Diego 2000  

4. Jaroszewski, L., Pawlowski, K. & Godzik, A. (1998).  
"Multiple model approach: an extension of comparative modelling". Journal of 
Molecular Modelling 4, 294-309  

5. Pawlowski, K., Jaroszewski, L., Bierzynski, A. & Godzik, A. (1997). 
"Multiple model approach - dealing with alignment ambiguities in comparative 
protein modeling". In Biocomputing, 97 (Altman, R. B., Dunker, A. K., Hunter, 
L. & Klein, T. E., eds.), pp. 328-339. World Scientific, Singapore. 

6. Sali, A. and Blundell, T. L. (1993). 
"Comparative protein modelling by satisfaction of spatial restraints". J. Mol. 
Biol. 234, 779-815 

7. Smith, T.F. and Waterman, M.S. (1981) "Identification of common molecular 
subsequences". J Mol Biol 147:195-7  

8. Altschul, S.F. et al. (1997) "Gapped BLAST and PSI-BLAST: a new generation 
of protein database search programs". Nucleic Acids Res 25:3389-402  


Tsigelny , 274

number of submitted models: 210

HIDDEN MARKOV MODELS BASED SYSTEM (HMMSPECTR) FOR DETECTING STRUCTURAL HOMOLOGIES ON THE BASIS OF SEQUENTIAL INFORMATION

Igor F. Tsigelny 1,2*, Yuriy Sharikov 2, Palmer Taylor 1, Lynn F. Ten Eyck 2 (1 Department of Pharmacology, 2 San Diego Supercomputer Center)

University of California, San Diego
email:
itsigeln@ucsd.edu

 
A system based on Hidden Markov Models (HMMSPECTR) for finding structural 
homologs of proteins on the basis of their sequences is described. The system 
receives a single probe sequence or a sequence alignment and uses the sequence 
or alignment to search a library of Hidden Markov Models (HMMs) built on the 
basis of structural alignments. The initial library of fold superfamilies was 
constructed using the SCOP fold classification [1]. We created structural 
alignments, using the CE algorithm [2], for each superfamily for six main 
classes of folds: all alpha proteins, all beta proteins, alpha and beta 
proteins (alpha/beta), alpha and beta proteins (alpha+beta), multi-domain 
proteins (alpha and beta), small proteins. For the following folds we created 
alignments for each family: EF hand-like (alpha), PHGase F-like (beta), 
Supersandwich (beta); NAD(P)-binding Rossmann-fold domains (alpha/beta); 
Thioredoxin fold (alpha/beta); Pyruvate-ferredoxin oxidoreductase (PFOR) 
domain III (alpha/beta), IL8-like (alpha+beta), Zincin-like (alpha+beta). For 
the fold Globin-like (alpha) we created alignments for all protein domains (2 
families: Globins and Phycocyanins). In all cases such a division was needed 
to cover all SCOP proteins of the specific subdivision with alignments. The 
overall number of structural alignments created was about 1500. The average 
number of proteins in each of these alignments is about 250. For each 
alignment we produced 4 HMMs using the HMMER package [3]: one built directly 
from the initial alignment and three differently trained HMMs. HMMSPECTR 
works in three modes using three different libraries of alignments and HMMs. 
The modes are used in the following order as the score of the probe protein 
decreases. 

(1) After finding the HMM with the best score, the system starts further 
analysis of the alignment underlying that HMM. It creates a subset of 
‘cluster’ alignments of protein sequences having close scores when aligned 
with the probe sequence. On the basis of these sets of alignments the system 
creates a new set of HMMs, which is then analyzed. Then it extracts a 
‘dominant’ protein in each HMM, i.e. a protein having the best sequence 
alignment to the consensus sequence of the corresponding HMM. 

(2) In some cases a target protein is represented in some of the HMMs but 
does not have enough relatives with solved structures. In such cases dividing 
large HMMs into smaller parts is very effective. In this mode the system 
uses the library of HMMs constructed on the basis of partial structural 
alignments of folds. 

(3) If the scores are still very low in the second mode, the system 
switches to the third mode. This mode uses the principles of self-organization 
in the process of decision making. The system uses a library of HMMs created 
with different gap/letter ratios and different numbers of participating 
protein sequences. The decision is made by optimizing the results over all 
parameters used in the calculations. 

REFERENCES 
1. Murzin A. G., Brenner S. E., Hubbard T., Chothia C. (1995). SCOP: a 
structural classification of proteins database for the investigation of 
sequences and structures. J. Mol. Biol. 247, 536-540.  

2. Shindyalov I N, Bourne P E. (1998). Protein structure alignment by 
incremental combinatorial extension (CE) of the optimal path. Protein 
Engineering 11:739-747. 

3. Eddy S. (1995). Hidden Markov Models of Proteins and DNA Sequence. 
Washington Univ. School of Medicine. 


BinToHes , 255

number of submitted models: 43

Secondary structure and function based protein fold recognition

Eckart Bindewald, Silvio Tosatto, Jochen Maydt, Achim Trabold, Juergen Hesser, Reinhard Maenner

Institute of Computer Science V, University of Mannheim
email:
bindewald@ti.uni-mannheim.de

 
If no proteins with known structure and homologous sequence were found for the 
target protein (using PSI-BLAST [1]) we used our fold recognition system 
MANIFOLD to suggest possible template proteins.  It uses a database of 
profiles of a set of 2083 representative protein structures of the FSSP 
database [2].  For each database protein the sequence, secondary structure, 
accessibility (defined with DSSP [3]), FSSP structure code and (if applicable) 
enzyme code is stored.  A similar profile is prepared for each CASP-4 target 
protein, using the predicted secondary structure  (based on the secondary 
structure prediction programs JPRED [4], PHD [5] or SSPRO [6]) and predicted 
accessibility (taken from the output of PHD [7]).  The enzyme code was taken 
from SWISS-PROT [8]. 

As the first step, structure-function rules are applied in order to exclude 
template proteins which have a structure that is assumed to be incompatible 
with the function of the target protein (this applies in our implementation 
only to proteins with an enzyme code). In order to derive such rules, we made 
a cross analysis between the FSSP structure codes and the enzyme codes of all 
proteins which are representative structures in the FSSP database. If a set 
of structures among the FSSP representatives was associated at least ten 
times, and with no exception, with the same set of functions, it was used as 
a rule that this association must also hold for proteins with unknown 
structure. Possible template proteins that passed this set of rules were 
subsequently ranked according to the following criteria. The main ordering 
criterion for the fold recognition process is the secondary structure "block" 
similarity. The length of helix, strand or loop regions is ignored; each 
helix or strand region is represented by a single letter. The number of 
mutations needed to convert the target secondary structure block string into 
that of the template yields a score which is the main ordering criterion for 
the template structures. The template proteins were further sorted according 
to a "jury" of other criteria. 
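
A minimal sketch of the "block" comparison is given below. It is an 
illustration, not the MANIFOLD code. The names are hypothetical, and three 
assumptions are made: a three-state alphabet (H helix, E strand, anything 
else loop), loops omitted from the block string, and "mutations" counted as 
the edit distance (substitutions, insertions and deletions). 

```python
from itertools import groupby


def to_blocks(secondary_structure, elements="HE"):
    """Collapse a per-residue string such as 'CCHHHHCCEEEC' into one letter
    per helix or strand region ('HE'), ignoring region lengths.

    Assumption: loop regions separate elements but are not written out.
    """
    return "".join(state for state, _run in groupby(secondary_structure)
                   if state in elements)


def edit_distance(a, b):
    """Number of substitutions, insertions and deletions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(previous[j] + 1,        # delete char_a
                               current[j - 1] + 1,     # insert char_b
                               previous[j - 1] + (char_a != char_b)))
        previous = current
    return previous[-1]


def block_score(target_ss, template_ss):
    """Lower is better: edits needed to match the two block strings."""
    return edit_distance(to_blocks(target_ss), to_blocks(template_ss))
```

Two helices separated by a loop stay distinct because runs are collapsed 
before loops are dropped. 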

The additional similarity criteria are:  
1. the relative length (number of amino acids) similarity,  
2. the score of a Needleman-Wunsch alignment of the combined secondary  
   structure/accessibility,  
3. functional similarity as reflected in the enzyme-code (if applicable).  

The functional similarity is taken to be the number of identical enzyme code 
hierarchy levels. For each of these additional criteria a ranking list is 
created. For each criterion a "jury point" is given if the template structure 
is found among the top ten considering this criterion alone, and another point 
if the template structure is among the top five structures. The top scoring 
template proteins were inspected. The template protein actually used was 
chosen according to the most convincing sequence alignment (using ClustalW 
[9]), combined with a visual inspection of a manual alignment of the secondary 
structure. The subsequent modeling of the protein structure given the template 
structure is described in the CASP-4 comparative modeling abstract "Ab initio 
loop modeling with precalculated synthetic loops and sidechain placement" 
(Tosatto, Bindewald et al.). 
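
The jury-point counting described above can be sketched as follows. The 
function name and the data layout (one best-first list of template 
identifiers per criterion) are assumptions made for the illustration. 

```python
def jury_points(rankings, top_wide=10, top_narrow=5):
    """Count jury points per template.

    rankings: one list per criterion, each holding template identifiers
              sorted from best to worst for that criterion alone.
    A template earns one point for appearing in the top ten of a list and
    one further point for appearing in the top five.
    """
    points = {}
    for ranking in rankings:
        for position, template in enumerate(ranking[:top_wide]):
            points[template] = points.get(template, 0) + 1
            if position < top_narrow:
                points[template] += 1
    return points
```

With three additional criteria a template can collect at most six points. 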


References: 
[1] S. Altschul, T. Madden, A. Schaffer, J. Zhang, Z. Zhang, W. Miller, D. Lipman:  
Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.  
Nucleic Acids Research, 25(17): 3389-3402 (1997). 
[2] L. Holm, C. Sander.  
The FSSP database of structurally aligned protein fold families.  
Nucleic Acids Research 22(17): 3600-3609. (1994). 
[3] W. Kabsch, C. Sander:  
Dictionary of protein secondary structure:  
Pattern recognition of Hydrogen-Bonded and Geometrical Features.  
Biopolymers 22, 2577-2637 (1983). 
[4] J. A. Cuff, M. E. Clamp, A. S. Siddiqui, M. Finlay, G. J. Barton:  
Jpred: a consensus secondary structure prediction server.  
Bioinformatics 14:892-893 (1998). 
[5] B. Rost, C.Sander:  
Prediction of protein secondary structure at better than 70% accuracy.  
J.Mol.Biol. 232: 584-599.(1993). 
[6] P. Baldi, S. Brunak, P. Frasconi, G. Pollastri, G. Soda:  
Exploiting the past and the future in protein secondary structure prediction.  
Bioinformatics 15:937-946 (1999). 
[7] B. Rost,C. Sander:  
Conservation and Prediction of Solvent Accessibility in Protein Families. 
Proteins 20: 216-226 (1994). 
[8] A. Bairoch, R. Apweiler:  
The SWISS-PROT protein sequence data bank and its supplement TrEMBL in 1998.  
Nucleic Acids Research 26(1) 38-42 (1998). 
[9] J.D. Thompson, D.G. Higgins, T.J. Gibson:  
CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment  
through sequence weighting, position-specific gap penalties and weight matrix choice. 
Nucleic Acids Res. 22: 4673-4680(1994). 
 


Fox-Sheppard , 536

number of submitted models: 17

A sequence-based method of homolog detection using two rounds of BLAST

Brian Fox and Paul Sheppard

ZymoGenetics, Inc.
email:
bfox@zgi.com

 
METHOD: The target sequence was blasted against all known proteins; each 
reasonable hit was then blasted against the sequences of all the proteins 
in the PDB. The top reasonable hits to a PDB entry were used as a model. 
Manual alignment between the PDB sequence and the target was used 
to try to place the gaps in the sequence near the loops of the 
3D structure. This method can be effective in finding more remote homologs 
than using only a single round of BLAST. 
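
The two-round search can be outlined as below. This is a sketch only: the 
search functions are supplied by the caller (for example wrappers around a 
BLAST installation) and are hypothetical, as is the criterion for a 
"reasonable" hit, which the abstract does not define. 

```python
def two_round_search(target_seq, search_all, search_pdb, is_reasonable):
    """Find PDB entries reachable from the target via one intermediate hit.

    search_all(seq) and search_pdb(seq) are caller-supplied functions that
    return (hit_id, hit_sequence, evalue) tuples; is_reasonable(evalue)
    decides whether a hit is followed up. All three are assumptions.
    Returns (evalue, pdb_id, intermediate_id) tuples, best E-value first.
    """
    best = {}
    for inter_id, inter_seq, inter_evalue in search_all(target_seq):
        if not is_reasonable(inter_evalue):
            continue
        # Second round: use the intermediate sequence as the new query.
        for pdb_id, _pdb_seq, pdb_evalue in search_pdb(inter_seq):
            if not is_reasonable(pdb_evalue):
                continue
            if pdb_id not in best or pdb_evalue < best[pdb_id][0]:
                best[pdb_id] = (pdb_evalue, inter_id)
    return sorted((evalue, pdb_id, via)
                  for pdb_id, (evalue, via) in best.items())
```

The returned intermediate identifier records which first-round hit linked 
the target to each PDB entry, which helps when building the manual alignment. 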


Taylor , 390

number of submitted models: 56

Prediction of Protein Structure: a Cooperative Approach

William R. Taylor, Kuang Lin, Delmiro Fernandez

National Institute for Medical Research
email:
kxlin@nimr.mrc.ac.uk

 
Three-dimensional (3D) models of CASP4 targets were generated by applying 
a cooperative approach. The probe sequences were first used in 
iterative sequence databank searches with templates and multiple 
alignment. 
Using secondary structures predicted from the multiple sequence 
alignments, the backbone structures were then built with the 
multiple sequence threading program MST. 
After manual selection, we used MaxSprout to construct full-atom 
structure models. 
A simple 1D/3D matching program, TUNE, helped in the selection of 
target structures. 
 
Reference: 
MULTAL    W.R.Taylor (1988) J. Molec. Evol., 28:161--169  
DOMS      W.R.Taylor (1999) Prot. Engng., 12:203--216  
QUEST     W.R.Taylor (1998) J. Molec. Biol., 280:375--406  
MST       W.R.Taylor (1997) J. Molec. Biol., 269:902--943 
MaxSprout L.Holm and C.Sander (1991) J. Mol. Biol. 218:183-194.  
PREDATOR  D.Frishman and P.Argos (1997) Proteins,  27, 329-335.  
PSIPRED   D.T.Jones  (1999) J. Mol. Biol. 292:195-202.    
CATH      C.A.Orengo et al.  (1997) Structure. 5. 8. 1093-1108. 
 


MRIT-Onizuka , 052

number of submitted models: 235

The threading using Multi-dimensional Singleton Mean-force Potentials and the sequence-fragment to structure-fragment alignment with continuation-bonus scoring

Kentaro ONIZUKA

Matsushita Research Institute Tokyo Inc.
email:
onizuka@mrit.mei.co.jp

 
The method used to identify the fold type of the given protein sequences 
in CASP4 has the following features. 
 
1. Threading using mean-force potential of Sippl type. 
 
2. The mean-force potentials are Multi-dimensional. 
   Actually they are 3D. Not only the residue-residue distance but the 
   direction is taken into account. To avoid the explosion of potential- 
   describing coefficients, linear compression technique is applied. 
 
3. The mean-force potentials are not pairwise but singleton. The 
   energy of the pair is defined with respect to only one of the residue 
   types of the pair, and the other residue type is set to 'any.' 
 
4. The template protein structures are compiled into sequence profiles where 
   the energy value with respect to each position and each amino residue-type 
   is assigned to each cell of the sequence profiles, summing up over the  
   energy of all possible interactions to the residue position. 
 
5. The given amino-residue sequence of the prediction target is 
   threaded to the template protein sequence profiles, by the dynamic  
   programming algorithm. 
 
6. The match score of each position is not the single residue to single  
   position match score but the windowed fragment-fragment match score. 
   The window width is 17 residues (-8 to +8 around the residue to match). 
 
7. The gap scoring scheme has two parts: 1) a continuation bonus, 
   and 2) an extension penalty. 
 
In this abstract, I briefly explain the features which have not yet been 
published. 
 
[1] The Singleton Multi-dimensional Mean-Force Potentials 
 
The mean-force potentials in general are pairwise potentials which 
depend on the residue types of both residues involved in the 
interaction. The pairwise potentials, however, raise a very difficult 
problem when applied to gapped structure-sequence threading. The 
problem is how to find the structure-sequence alignment which 
gives the best score. As a combinatorial optimization problem, this 
structure-sequence alignment problem, which optimizes the sum of 
pairwise potentials, has no fast algorithm that gives the exact 
optimal solution. There are two approximation techniques: 1) the frozen 
approximation, and 2) defining singleton potentials. The frozen 
approximation looks promising, but when the sequences are remote the score 
is quite unreliable. If the potentials are singleton, depending on only 
one of the residue types of the pair with the other residue set to 'any', 
the optimization can be done by dynamic programming, as with the 
frozen approximation. 
 
I assessed whether the pairwise potentials are superior to the 
singleton ones in terms of self-recognition ratio under the so-called 
Sippl test (gapless self-structure recognition test). In the case of 
single-dimensional potentials, the self-recognition ratios for 
pairwise potentials are slightly better than for singleton potentials, 
while in the multi-dimensional (3D) case the singleton 
potentials almost always gave slightly better recognition ratios 
than pairwise potentials. 
 
The multi-dimensional mean force potentials are defined with respect 
to, other than sequential separation and residue-type, at most, the 
six degrees of freedom of the relative configurations (relative 
positions and relative orientations) between a pair of residues; i.e., 
not only the distance between the residues, but also the relative 
directions and orientations. In CASP4, I adopted a 3D potential 
with respect to the distance and the direction. 
 
The greatest difficulty in multi-dimensional statistical analysis is 
the explosion in the number of coefficients to represent the 
potential. To overcome the explosion of coefficients, I 
applied a linear compression technique to the multi-dimensional 
distribution.  Here the multi-dimensional distribution is linearly 
expanded by a series of orthonormal bases into a set of coefficients, 
where the flexible choice of the maximal order for the expansion in 
each dimension controls the total number of coefficients representing 
the distribution. 
 
[2] The Dynamic-Programming-based Threading 
    Using Fragment-fragment Match and Continuation Bonus Scoring 
 
It is relatively easy to obtain optimal alignment when the mean-force 
potentials are singleton depending on the type of only one residue 
involved in the interaction. After statistical analysis is done by 
compiling the coefficients of mean force potentials from the learning 
data-set, each template structure can be compiled into a sequence 
profile where the potential value (summation of the 
mean-force-potentials involving the residue at the position) for each 
amino residue-type is assigned. And then each given protein sequence 
is threaded to the sequence profile by the dynamic programming 
algorithm. 
 
The gap-scoring technique adopted here is a combination of 1) giving a 
continuation bonus when gaps are not inserted, and 2) giving an 
extension penalty when gaps extend. The continuation-bonus scoring is 
beneficial because the total continuation bonus depends on the length 
of the shorter of the given sequence and the template 
structure profile.  
 
Regarding the match score, when residue 'a' of the given sequence is 
matched to position 'i' in the template structure profile, not 
only the energy value of residue type 'a' at position 'i' but also 
the scores of neighboring positions are considered. Thus the potential 
value is not the residue-position match score but the score of a 
sequence fragment against the structure fragment of the same length. In 
the CASP4 case, the fragment length was 17 (-8 to +8). This 
modification to mean-force-potential-based threading contributes 
greatly to improving the self-structure recognition ratio. 
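 
The windowed fragment-fragment match score can be illustrated as follows. 
This is a sketch under assumptions, not the method's actual code: the names 
are hypothetical, the profile is represented as one dictionary of energies 
per template position, the window contributions are simply summed without 
weights, and the window is truncated at the chain ends. 

```python
def windowed_match_score(profile, sequence, i, j, half_width=8):
    """Score of matching sequence position j to profile position i, taken
    over a window of 2 * half_width + 1 residues (17 for half_width=8).

    profile:  list with one dict per template position, mapping residue
              type to its energy at that position
    sequence: string of one-letter residue types
    Assumptions: unweighted sum; offsets falling off either end are skipped.
    """
    total = 0.0
    for offset in range(-half_width, half_width + 1):
        profile_pos = i + offset
        sequence_pos = j + offset
        if 0 <= profile_pos < len(profile) and 0 <= sequence_pos < len(sequence):
            total += profile[profile_pos][sequence[sequence_pos]]
    return total
```

A match is thus rewarded only when the surrounding sequence fragment also 
fits the surrounding structure fragment. 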
 
[Result] 
 
The threading method used in CASP4 was found to be very sensitive 
to structural diversity. Thus, the self-structure recognition 
ratio in gapped threading exceeded 70% in the Sippl test, 
while in the test to recognize a different protein belonging to 
the same SCOP fold class, the recognition ratio was very 
disappointing. In CASP4, many targets were predicted to belong to 
SCOP fold classes 2.75 and 3.9. 
 
[Reference] 
 
Sippl, 1990: Sippl M.J. (1990) Calculation of 
Conformational Ensembles from Potentials of Mean Force: An Approach to 
the Knowledge-based Prediction of Local Structure in Globular 
Proteins. J. Mol. Biol., 213,859-883. 
 
Matsuo et al., 1995:  Matsuo Y., Nakamura H., Nishikawa 
K. (1995) Detection of Side-Chain Packing and Electrostatic 
Interactions. J. Biochem. 118,137-148 
 
Alexandrov et al. 1996: Alexandrov N.N. Nussinov R. Zimmer R.M. (1996) 
Fast Protein Fold Recognition via Sequence to Structure Alignment and  
Contact Capacity Potentials. Proc. PSB 1996 53-72 
 
Hendlich et al., 1990: Hendlich M., Lackner P., Weitckus S., 
Floeckner H., Froschauer R., Gottsbacher K., Casari G., Sippl 
M.J. (1990) Identification of Native Protein Folds Amongst a Large 
Number of Incorrect Models, The calculation of Low Energy Conformation 
from Potentials of Mean Force. J. Mol. Biol.,216,167-180 
 
Wallace, 1991:  Wallace G.K. (1991) The JPEG Still 
Picture Compression Standard. CACM 34(4), 30-44 
 
 


GMD-SCAI , 361

number of submitted models: 98

Fold recognition with ToPLign/123D and ToPLign/RDP

Ralf Zimmer, Theo Mevissen, Ingolf Sommer, Thomas Lengauer

GMD-SCAI
email:
ralf.zimmer@gmd.de

 
A reduced candidate list is produced via ToPLign [1]                     
sequence/profile alignments, 123D [2] and RDP [3].                       
The top scoring candidates from this list and the                        
corresponding refined alignments have been produced                      
with RDP [3] using newly derived modifications of                        
contact capacity [2] and pair interaction potentials.                    
References:                                                              
[1] Heinz Mevissen and Ralf Thiele and Ralf Zimmer and Thomas Lengauer:  
    "Analysis of Protein Alignments -- The software environment ToPLign" 
    GMD, 1994-98 (http://cartan.gmd.de/ToPLign.html)                                                        
[2] Nick Alexandrov and Ruth Nussinov and Ralf Zimmer:                   
    "Fast Protein Fold Recognition via Sequence to Structure Alignment   
    and Contact Capacity Potentials",                                    
    Pacific Symposium on Biocomputing'96, World Scientific Publ. Co.,    
    1996, 53--72.                                                        
[3] Ralf Thiele and Ralf Zimmer and Thomas Lengauer:                     
    "Recursive Dynamic Programming for Adaptive Sequence                 
    and Structure Alignment",                                            
    ISMB'95, C. Rawlings et al. (Eds.), AAAI Press, 384--392.            
[4] Ralf Thiele and Ralf Zimmer and Thomas Lengauer:                     
    "Protein Threading by Recursive Dynamic Programming",                
    JMB,757-779,290,1999                                                 
[5] Ralf Zimmer and Marko Woehler and Ralf Thiele:                       
    "New Scoring Schemes for Protein Fold Recognition                    
    based on Voronoi Contacts",                                          
    Bioinformatics, 14, 3, 295--308, 1998.                               
[6] Alexander Zien and Ralf Zimmer and Thomas Lengauer:                  
    "A simple iterative approach to parameter optimization",             
    JCB, 2000, in press.                                                 
 


SBfold , 381

number of submitted models: 68

SBfold's procedures for fold recognition.

Kristin K Koretke, Robert Russell, Autumn L Sutherlin, Craig Volker, Michael J Bower, Ajita Bhat, Maxwell D Cummings, and Andrei N Lupas

SmithKline Beecham Pharmaceuticals
email:
Andrei_N_Lupas@sbphrd.com

 
Summary
 
All CASP4 targets were submitted to the sensitive search routine program 
SENSER, as described in detail in the abstract for the SBauto submission. 
Secondary structure predictions were gathered from the JPred server. 
Additional sequence searches were done using regular expression patterns and 
HMMs. If a protein of known structure appeared to match the properties of the 
target, alignments were generated using MACAW or HMMer.  

Details 
        
Details on the operation of SENSER are given in the SBauto abstract. 
If SENSER identified a potential template structure, its match with the target 
was evaluated using predicted secondary structure, the occurrence of sequence 
patterns, and biochemical information. The alignment was generated using MACAW 
or HMMer.          
If SENSER did not identify a potential template structure, regular expression 
patterns, predicted secondary structure, and biochemical information were used 
to search for possible templates. In addition, in cases where the target was 
only a fragment of a larger protein, the entire protein was used in sequence 
searches. If a template was judged to match the properties of the target, an 
alignment was produced using MACAW, HMMer, Clustal, or a combination of these 
methods, choosing the alignment that seemed most plausible to us based on 
conserved residues, hydrophobicity, and secondary structure.
                 


SBauto , 382

number of submitted models: 74

SBauto's procedures for fold recognition.

Kristin K Koretke, Robert Russell, Autumn L Sutherlin, Craig Volker, Michael J Bower, Ajita Bhat, Maxwell D Cummings, and Andrei N Lupas

SmithKline Beecham Pharmaceuticals
email:
Kristin_K_Koretke@sbphrd.com

 
Summary 

All CASP4 targets were submitted to the sensitive search routine program 
SENSER, which is based on PSI-Blast and HMMer. SENSER runs through three 
different search strategies, using PSI-Blast as its search engine, to 
identify a relationship with a sequence of known structure. As soon as a fold 
is identified, an alignment between the CASP target sequence and the sequence 
with a known fold is generated using HMMer. If a relationship between the 
CASP target and a sequence with a known structure was not identified, a 
prediction of "novel fold" was submitted.  	
	

Details 
In the first step SENSER performs a PSI-Blast search with the target 
sequence. Proteins identified in the search are divided into a significant 
sequence space, containing those sequences with an E value lower than 10^-3, 
and a 'trailing end' of sequences between 10^-3 and 10. Because some of the 
proteins detected may contain unrelated domains, all proteins are trimmed to 
the actual region detected in the PSI-Blast run.  	
In the second step transitive searches are used to expand the significant 
sequence space. Only proteins within the significant sequence space that have 
less than 25% identity to the target sequence are used as starting points for 
further PSI-Blast searches, in order to avoid redundant searches, i.e. those 
that produce similar profiles and sequence spaces. This value was chosen as it 
is a frequently quoted threshold for the 'twilight zone', below which 
sequences cannot be confidently said to be homologous.
In the third step trailing-end sequences are tested for their ability to 
back-validate, i.e. detect any sequence of the significant sequence space of 
the target in PSI-Blast. Because several PSI-Blast searches were performed to 
establish the significant sequence space, trailing-end sequences are pooled 
and ranked first by number of occurrences and second by E-value, before being 
tested. If a trailing-end sequence back-validates, its significant sequence 
space is added to that of the target. The process is then repeated until no 
further sequences are detected.  	
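
As an illustration only, the partition of PSI-Blast hits in the first step and 
the pooling and ranking of trailing-end sequences in the third step could be 
sketched in Python as follows. The function names, the (identifier, E-value) 
input format and the use of the best E-value as the second ranking key are 
assumptions made for this sketch; SENSER's own implementation is not described 
in this abstract, and the PSI-Blast runs and the back-validation test 
themselves are not shown.

```python
from collections import Counter

SIGNIFICANT_E = 1e-3   # threshold quoted in the text
TRAILING_E = 10.0      # upper bound of the 'trailing end'


def partition_hits(hits):
    """Split (sequence_id, e_value) pairs into significant and trailing-end hits."""
    significant = {sid: e for sid, e in hits if e < SIGNIFICANT_E}
    trailing = {sid: e for sid, e in hits
                if SIGNIFICANT_E <= e <= TRAILING_E}
    return significant, trailing


def rank_trailing(searches):
    """Pool trailing-end hits from several searches and rank them.

    Ranking is first by number of occurrences (most frequent first) and
    second by E-value (lowest first). Sequences already found in the
    significant sequence space of any search are not candidates.
    """
    counts = Counter()
    best_e = {}
    significant_ids = set()
    for hits in searches:
        significant, trailing = partition_hits(hits)
        significant_ids.update(significant)
        for sid, e in trailing.items():
            counts[sid] += 1
            best_e[sid] = min(e, best_e.get(sid, e))
    candidates = [sid for sid in counts if sid not in significant_ids]
    return sorted(candidates, key=lambda sid: (-counts[sid], best_e[sid]))
```

Each candidate returned by rank_trailing would then be tested in turn for 
back-validation, as described above.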

The steps above can connect proteins that are far apart in sequence space; 
however, beyond the first PSI-Blast search, they do not directly provide an 
alignment of the target to the sequences detected. Moreover, even for 
sequences detected in the first step, PSI-Blast generally provides only 
partial alignments. For these reasons, we introduced an alignment strategy 
based on HMMer. After the first PSI-Blast search, we build a target HMM from 
the proteins in the significant sequence space, as aligned by PSI-Blast. Any 
sequence detected at this step is aligned to the target sequence using the 
target HMM. Any sequence detected at a subsequent step is aligned in a 
five-part process: (1) a PSI-Blast search is run for the untrimmed sequence, (2) a 
multiple alignment is extracted, (3) this alignment is combined with the 
sequences of the target HMM to produce a global alignment, using the target 
HMM as a template, (4) a final HMM is built from this global alignment, and 
(5) this HMM is used to align the detected sequence to the target. 																														
		 


Fugue-Cam , 103

number of submitted models: 215

FUGUE: sequence-structure homology recognition using environment-specific substitution tables and structure-dependent gap penalties

Jiye Shi, Tom L. Blundell and Kenji Mizuguchi

Department of Biochemistry, University of Cambridge
email:
kenji@cryst.bioc.cam.ac.uk

 
We have attempted to predict the structures of all the CASP4 
targets using the fully automatic method FUGUE (Shi et al., 
submitted). FUGUE has been developed for recognizing distant 
homologues by sequence-structure comparison and producing 
reliable alignments. It has three key features: (1) Improved 
environment-specific substitution tables (Johnson et al., 1993; 
Overington et al., 1990). Substitutions of an amino acid in a 
protein structure are constrained by its local structural 
environment, which can be defined in terms of secondary 
structure, solvent accessibility, and hydrogen bonding status 
(Burke et al., 1999; Mizuguchi et al., 1998a). The 
environment-specific substitution tables have been derived from 177 
structural alignments in the HOMSTRAD database (Mizuguchi et 
al., 1998b). (2) Automatic selection of alignment algorithm 
with detailed structure-dependent gap penalties. FUGUE uses the 
global-local algorithm to align a sequence-structure pair when 
they greatly differ in length and uses the global algorithm in 
other cases. The gap penalty at each position of the structure 
is determined according to its solvent accessibility, its 
position relative to the secondary structure elements (SSEs) 
and the conservation of the SSEs. (3) Combined information from 
both multiple sequences and multiple structures. FUGUE is 
designed to align multiple sequences against multiple structures 
to enrich the conservation/variation information. For a given 
query sequence, FUGUE calls PSI-BLAST to collect sequence 
homologues from the NCBI non-redundant sequence database and 
calculates a sequence profile from the refined PSI-BLAST alignment. 
This sequence profile is then used to search against a 
structural profile library, which is derived from the HOMSTRAD 
structural alignments using environment-specific substitution 
tables and structure-dependent gap penalties. A Z-score is 
calculated to evaluate the similarity of each sequence-structure 
pair. 
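
For readers unfamiliar with the term, a Z-score expresses how far one 
alignment score lies from the mean of a background distribution, in units of 
that distribution's standard deviation. The Python sketch below shows only 
this generic definition. How FUGUE obtains its background scores is not 
stated in this abstract, so the `background` list is an assumption of the 
sketch, as is the function name.

```python
from statistics import mean, pstdev


def z_score(score, background):
    """Return (score - mean) / standard deviation of the background scores.

    `background` is a hypothetical list of alignment scores that
    represents chance matches; its origin is not specified here.
    """
    mu = mean(background)
    sigma = pstdev(background)  # population standard deviation
    if sigma == 0:
        raise ValueError("background scores have zero variance")
    return (score - mu) / sigma
```

A larger Z-score means the sequence-structure pair scores further above what 
the background would suggest.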
 
References: 
 
Burke, D. F., Deane, C. M., Nagarajaram, H. A., Campillo, N., 
Martin-Martinez, M., Mendes, J., Molina, F., Perry, J., Reddy, 
B. V., Soares, C. M., Steward, R. E., Williams, M., Carrondo, 
M. A., Blundell, T. L. & Mizuguchi, K. (1999). An iterative 
structure-assisted approach to sequence alignment and 
comparative modeling. Proteins Suppl(3), 55-60. 
 
Johnson, M. S., Overington, J. P. & Blundell, T. L. (1993). 
Alignment and searching for common protein folds using a data 
bank of structural templates. J Mol Biol 231(3), 735-52. 
 
Mizuguchi, K., Deane, C. M., Blundell, T. L., Johnson, M. S. & 
Overington, J. P. (1998a). JOY: protein sequence-structure 
representation and analysis. Bioinformatics 14(7), 617-23. 
 
Mizuguchi, K., Deane, C. M., Blundell, T. L. & Overington, 
J. P. (1998b). HOMSTRAD: a database of protein structure 
alignments for homologous families. Protein Sci 7(11), 2469-71. 
 
Overington, J., Johnson, M. S., Sali, A. & Blundell, T. L. 
(1990). Tertiary structural constraints on protein evolutionary 
diversity: templates, key residues and structure prediction. 
Proc R Soc Lond B Biol Sci 241(1301), 132-45. 


Murzin , 384

number of submitted models: 21

Distant Homology Recognition and Fold Prediction by a knowledge-based approach using SCOP and Pfam

Alexey G. Murzin and Alex Bateman

Centre for Protein Engineering, Cambridge, UK
email:
agm@mrc-lmb.cam.ac.uk

 
Since our team’s last performance in CASP2 four years ago, we have been 
working on the methods that could extend the superfamilies of known structure 
in SCOP to the sequence families of unknown structure in Pfam and other 
sequence libraries. We entered CASP4 hoping that this prediction experiment 
would provide an opportunity to test our new methods. Systematic work on the 
extension of SCOP superfamilies has already resulted in the structural 
assignment of many sequence families of unknown structure and, often, unknown 
function. Indeed, in CASP3, there were at least three targets predictable by 
this approach. Disappointingly, however, none of the CASP4 targets turned out 
to be in our list of protein families with already assigned structures.   

Therefore, in CASP4 we used essentially the same approach as developed for 
CASP2 (Murzin A.G. and Bateman A. Distant homology recognition using 
structural classification of proteins. Proteins, Suppl. 1:105-112, 1997). We 
searched for probable homologues of the target sequences and available 
biochemical information on the target protein and/or its sequence family and 
used the predicted secondary structure to shortlist the SCOP superfamilies, to 
which each attempted target may belong.  Predictions were based on the 
discovery of superfamily-specific characters. The experience and expertise 
gained from our work on the SCOP and Pfam databases were of great help in 
this knowledge-based approach.  Also, we tried our knowledge-based approach in 
the two other prediction categories. We used superfamily-specific features to 
improve the alignments in some of the comparative modelling targets. For 
several targets, predicted by our approach to be not related to any of the 
SCOP superfamilies, we attempted the fold prediction using the conservation 
patterns in the target sequence families, the available biochemical data 
and/or the empirical folding rules derived from known protein structures. 

The choice of prediction format, TS, and the target selection were influenced 
by the CASP3 Fold Recognition assessment experience (Murzin A.G. Structure 
Classification-Based Assessment of CASP3 Predictions for the Fold Recognition 
Targets. Proteins Suppl. 3:88-108, 1999). To ensure the detection of (partly) 
correct predictions by both sequence-dependent and sequence-independent 
numerical evaluation procedures, each of our predictions was composed of the 
regions of confident structure and alignment, the regions of confident 
structure but tentative alignment, and the regions of tentative structure. 
3D coordinates for most of the target atoms were the best way to represent 
this structural mosaic in a single format. As one of us was strongly opposed to 
the NONE prediction, this option was not used. Therefore, in the absence of 
a predicted homologous structure, we either built a 3D model of our prediction 
‘ab initio’, or dropped it. Only one model was submitted for each of the 
completed predictions. Apart from the two targets whose structures were known 
to us before they were submitted to CASP4, we did not attempt the large, 
presumably multi-domain targets without apparent domain boundaries. Because of 
time limitations, we also ignored late comparative modelling targets including 
all but one of the predicted members of the P-loop hydrolase superfamily. Due 
to the presence of characteristic P-loop motifs in their sequences, their 
homology recognition seemed straightforward, and the actual challenge was the 
alignment. All other targets were attempted but six or so of them were dropped 
eventually.  In total, we submitted predictions for 21 targets. These include 
four Comparative Modelling targets, T0090, T0092, T0093(!) and T0103; ten 
Distant Homology Recognition targets, T0088, T0096_1, T0098, T0100, T0101, 
T0104, T0108, T0109, T0118 and T0121_2; three targets with predicted known 
folds (there may or may not be a distant homology), T0095, T0102 and T0114; 
and four targets with predicted (probably) novel folds, T0086, T0091, T0094 
and T0110. 

Many of the Distant Homology Recognition predictions were based on the result 
of previous analysis of SCOP superfamilies, for example the pectate lyase 
beta-helix fold of T0100 and T0101 (Chothia C. and Murzin A.G. New folds for 
all-beta proteins.  Structure 1, 217-222, 1993). There were several cases of 
déjà vu. T0108 had the same characteristic feature as the CASP2 target T0038 
and was modelled on the experimental structure of the latter. In T0121_2, 
there was an OB-fold signature similar to the one we derived for the prediction 
of T0004. For the fold prediction of T0102, we used the same pseudo ‘ab 
initio’ approach as we used for the CASP2 target T0042. Incidentally, the 
predicted fold of T0102 was found to be similar to the experimental fold of 
T0042. In T0086, there was a probable tandem repeat of two 
(alpha)-alpha-beta-beta-beta motifs, detected by the analysis of its extended 
sequence family, analogous to the approach that detected the internal 
duplication in T0002_2. Similarly, a tandem repeat of two 
beta-alpha-beta-alpha-beta motifs was detected in the extended T0094 sequence 
family. Unlike T0002_2, there was no SCOP superfamily assigned for either 
T0086 or T0094. Both target structures were modelled ‘ab initio’. 

One of our CASP2 techniques, not credited properly at the time because it had 
been used only for the late target T0026, was used extensively in most of our 
CASP4 predictions. For almost every target predicted to belong to a large 
superfamily with many known structures, a composite template structure was 
assembled from different fragments of several superfamily structures 
superimposed onto their common fold. It allowed the selection of the most 
suitable parts from different structures. In particular, the predicted 
structure of the P-loop hydrolase T0104 was assembled from the fragments of 
several topologically distinct members of this very diverse superfamily to 
generate a novel topological variant.  For a number of our predictions, we 
also created hybrid templates including fragments of non-homologous structures 
to model the ‘missing’ parts in the parent structure or even to construct the 
whole fold. Then we used Modeller to generate the 3D coordinates, 
automatically sealing the gaps and fixing the stereochemistry of the joints. 


Braun-UTMB , 223

number of submitted models: 104

3-D MODELING OF PROTEIN TARGETS FOR THE CRITICAL ASSESSMENT
OF STRUCTURE PREDICTION COMPETITION (CASP4)

V.S. Mathura, K.V. Soman, C.H. Schein, Y. Xu and W. Braun

University of Texas Medical Branch at Galveston
email:
ksoman@nmr.utmb.edu

 
The Human Genome Project has revealed many proteins of unknown  
  function.  Classification of these sequences can best be done by  
  accurate prediction of their structures, and concurrent assignment to  
  families of known function. We have developed a set of tools for  
  homology modeling of proteins (1,2), based on self-correcting distance  
  geometry (DIAMOD) (3,4,5), multiple sequence alignment (MASIA (6)) and  
  energy minimization (FANTOM (7)), that can be used even when the identity  
  to the target is very low (8) (30% or less (9)). CASP4 provided us with an  
  opportunity to evaluate our methods impartially and objectively. We  
  submitted a total of 100 models for 27 of the 43 targets, with 15 based  
  on sequence homology. Models for five targets were generated ab initio.  
  The rest used a combination of fold recognition with multiple alignment  
  to improve the sequence register between the target and selected  
  template.  
 
Homology or comparative modeling (CM)  
  When a suitable template was identified in the Protein Data Bank for a  
  target, our comparative modeling procedure was to: (1) Align the target  
  sequence with one or more template sequences using the program CLUSTALW  
  or alignments suggested by the fold recognition servers (CAFASP) with  
  minimal manual adjustment; (2) extract distance and dihedral constraints  
  with our in-house program EXDIS; (3) build initial models with DIAMOD;  
  and (4) energy minimize using the FANTOM program. For T90, a consensus  
  alignment was prepared manually from the 3D-PSSM, BIOINBGU, FUGUE,  
  GENTHREADER, and Karplus HMM98 and SAM99 results.  FANTOM energy  
  contributions and exposed apolar surface areas calculated with the  
  program GETAREA were used for ranking multiple models for the same  
  target. Where information was available for important residues in the  
  template, such as those within the active site or areas of substrate  
  binding, we compared their location visually in the model structure.  
  
Fold recognition (FR)  
  When there was not high enough sequence homology with any protein of  
  known structure, threading (fold recognition) was attempted, using the  
  web servers mentioned above and others (PSI-BLAST, 123D and FFAS). Where  
  several methods suggested the same template, a consensus alignment was  
  prepared manually. Manual corrections/adjustments were also used to  
  ensure that secondary structures and active site or other critical  
  residues were aligned. For T91, an alignment from 3D-PSSM was manually  
  edited to improve the sequence alignment. We also used multiple sequence  
  alignment of protein families where a fold seemed clear cut. For  
  example, fold recognition identified T88 as a probable Greek key fold  
  and selected yeast killer toxin (1wkt) as a template.  Another template  
  structure, 1A45, that more closely resembled T88, was selected from a  
  multiple alignment with 57 beta/gamma-crystallins.  The indicated gapping  
  pattern from the multiple alignment was used to generate a model.  
  
Ab initio modeling  
  When a suitable template could not be identified based on homology or  
  threading, but there were clear indications of conserved secondary  
  structure elements based on sequence alignments with related proteins,  
  we prepared ab initio models.  The steps for generating ab initio models  
  for T88, T91, T97, T104, and T106 were: (1) Predict secondary structures  
  and exposed/buried residues of the protein from aligned sequences with  
  JPRED and MASIA; (2) convert this information into distance and dihedral  
  angle constraints using the program TRANSLATE; (3) add other constraints  
  derived from any available experimental data for the protein; (4) build  
  models from constraints with DIAMOD; (5) refine initial models by energy  
  minimization with FANTOM. We also submitted models based on fold recognition  
  methods for T88 and T91.  
  Ab initio constraints were used in several other models where  
  appropriate. Disulfide bond constraints were added during the modeling  
  of T123 and T125. In another example, for T86, a monomer, a trimeric  
  template of very low identity was identified based on functional  
  similarity and conservation of key active site residues. A multiple  
  alignment with target homologs was used to place probable gaps between  
  the template and target sequences and constraints were extracted from  
  the template according to our usual methods.  Ab initio constraints were  
  added at the C-terminus to replace inter-subunit contacts present in the  
  trimer.  
 
Multiple alignments help in FR and CM  
  We combined these techniques in preparing alignments where the identity  
  between the target and template was very low (such as T86 and T88), when  
  the target had a clear sequence relationship to several templates, or  
  when several sequences related to the target were known. For T101, which  
  had about the same degree of sequence identity/similarity to 6 known  
  protein structures (12-18%), a CLUSTALW multiple alignment of related  
  proteins of the pectate/pectin lyase family was used.  This agreed with  
  the DALI alignment of the pectate lyases but not of a structurally  
  related protein, chondroitinase (1DBG). We made models based on the B.  
  subtilis pectate lyase (1BN8) using the multiple alignment to adjust  
  gapping. Other models were based on the fold recognition results for  
  1DBG (where there was no real consensus for most of the protein).  
  In keeping with our efforts to use genomic data efficiently in modeling,  
  we used the homologous sequences available for templates or targets to improve 
  the alignment.  
   
For T118, PDB-BLAST detected similarity of the C-terminus with 1DDQ-A.  
  The 1DDQ-A sequence and related bacterial and fungal polymerase alpha-factors 
  were aligned with T118 to obtain the gapping used in the submitted alignment. 
  PDB-BLAST also recognized a weak pattern of  
  identity between T126 and 1DMS and 1EG9.  Individual multiple alignments  
  of T126 with other olfactory factors and these templates were used to  
  generate the alignments submitted.  
 
  1 Soman, K.V., Midoro-Horiuti, T., Ferreon, J.C., Goldblum, R.M.,  
  Brooks, E.G., Kurosky, A., Braun, W. and Schein, C.H. (2000) Biophysical  
  Journal 79:1601-1609  
  2 Soman, K.V., Schein, C.H., Zhu, H. and Braun, W.A. (2000) Homology  
  Modeling and Simulations of Nuclease Structures. In Methods in Molecular  
  Biology (Humana Press, Totowa, N.J.; editor C.H. Schein) 160(in press  
  for December, 2000).  
  3 Zhu, H., Schein, C.H. and Braun, W. (1999). J. Mol. Modeling, 5, 302-316.  
  4 Mumenthaler, Ch. and Braun, W. (1995) Protein Science 4, 863-871  
  5 Zhu, H. and Braun, W. Protein Sci. 1999, 8, 326-342  
  6 Zhu, H., Schein,C.H. and Braun,W. (2000) MASIA: a program to recognize  
  common patterns and properties in multiple aligned protein sequences.  
  Bioinformatics 16: in press  
  7 Fraczkiewicz, R. and Braun, W.  (1998) J. Comp. Chem. 19, 319-333.  
  8 Mumenthaler, Ch., Schneider, U., Buchholz, Ch.J., Koller, D., Braun,  
  W. and Cattaneo, R. (1997). Protein Sci 6, 588-597.  
  9 Buchholz, C.J., Koller, D., Devaux, P., Mumenthaler, Ch.,  
  Schneider-Shaulis, J., Braun, W., Gerlier, D. and Cattaneo, R. (1997).  
  J. Biol. Chem. 272, 22072-22079 


Levitt , 012

number of submitted models: 180

World-Wide Server Fold Recognition and Automatic Modeling

Michael Levitt

Stanford University
email:
michael.levitt@stanford.edu

 
The methods used for Comparative modeling and Fold-Recognition were the same 
and what follows is the same in both abstracts.  This work was greatly aided 
by the availability of the output of all the 30 or so servers participating in 
CAFASP on the CAFASP web site at http://cafasp.bioinfo.pl/target.  In general 
these results were available within hours of the target sequence announcement 
and we never felt the need to consult the original servers in any way.    

We first used the freeware program "wget" to download all the files for any 
new targets.  Then we parsed all these files using a large Perl script.  This 
script collected together the results from all the servers to give consensus 
secondary structure predictions, consensus fold-recognition results and every 
alignment produced.  The script also converted all the proteins recognized by 
the different servers into SCOP version 1.50 superfamily codes and then counted 
how often the different codes occurred.  Initially, we used the results from 
over 20 servers but then found it more accurate to concentrate on eight that 
seemed to perform most consistently.  These were: ffas, foldfit, fugue, 
genthreader, inbgu, mgenthreader, pdbblast, and target99.  As may have been 
expected, the groups behind each of these eight servers were generally the 
experts who had done well in fold-recognition at previous CASP events (Godzik, 
FFAS and PDB-Blast; Sternberg, foldfit or 3D-PSSM; Mizuguchi/Blundell, FUGUE; 
Fischer, INBGU; Jones, genTHREADER and mGenTHREADER; and Karplus, SAM-T99 or 
target99).  Unlike the CAFASP compilation released on the web by Danny Fischer 
(http://www.cs.bgu.ac.il/~dfischer/CAFASP2/summaries/), no manual intervention 
was used in parsing these raw results.  For each target we produced a summary 
file that listed:  

(1) The fold recognition hits in decreasing order of significance with the PDB 
entry name, the significance scores and the SCOP 1.50 ID.  In some cases the 
raw significance score given by the server was modified so that scores were on 
the same scale (-100 for highest significance to small positive numbers for no 
significance). For example: 

T0099_ffas_hit_1      1bu1a   -33.2   2.32.2 
T0099_ffas_hit_2      1ark    -30.7   2.32.2 
 
(2) All the alignments produced by each method together with information on 
the sequence match.  For example: 

T0099_ffas_al_2-a.mas_1ark   2.32.2  EFIAIYDYKAETEEDLTIKKGEKLEIIEK-EGDWWKAKAIGSGEIGYIPANYIAAA 
T0099_ffas_al_2-b.sla_1ark   2.32.2  IFRAMYDYMAADADEVSFKDGDAIINVQAIDEGWMYGTVQRTGRTGMLPANYVEAI 
T0099_ffas_al_2-x.par_1ark   2.32.2  nMAT=55, pID=28, nDEL=1, nINS=0, nCov=55/56, spaci=-99.000 
 
(3) A Consensus summary allowing the fold to be recognized.  For each SCOP 
superfamily we collect the number of hits, the mean significance score, the 
method and rank, the SCOP title and the PDB domain names with their SPACI 
scores (Brenner, Koehl and Levitt, 2000).  For example: 

%T0099  4.77.1      -78.4     3 genthreader_1 mgenthreader_2 pdbblast_9 
%T0099                           (Alpha and beta (a+b),SH2-like,SH2 domain) 
%T0099                           1fmk 0.578, 2src 0.540,  
%T0099  4.123.1     -59.9     6 genthreader_1 mgenthreader_2 pdbblast_6 pdbblast_7 
%T0099                           (Alpha and beta (a+b),Protein kinase-like (PK-like)) 
%T0099                           1fmk 0.578, 2src 0.540, 1qcfa 0.431, 1ad5a 0.258, 
%T0099  2.32.2      -45.9    60 ffas_1 ffas_2 ffas_3 ffas_4 ffas_5 ffas_6 ffas_7  
%T0099                           (All beta,SH3-like barrel,SH3-domain) 
%T0099                           1ckaa 0.665, 1fmk1 0.578, 2src 0.540, 
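
As a rough sketch of how such a consensus could be tallied (the original was 
a Perl script that is not reproduced here), the Python function below groups 
hit lines of the form shown under (1) by their SCOP ID and reports the number 
of hits and the mean significance score per superfamily. The function name, 
the four-column line format and the ordering by mean score are assumptions 
made for illustration.

```python
from collections import defaultdict


def consensus_by_superfamily(hit_lines):
    """Group fold-recognition hits by SCOP ID.

    Each line is assumed to hold four whitespace-separated fields:
    hit label, PDB entry name, significance score, SCOP ID.
    Returns a list of (scop_id, n_hits, mean_score, pdb_names),
    most significant (most negative mean score) first.
    """
    groups = defaultdict(list)
    for line in hit_lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # skip lines that do not match the assumed format
        _label, pdb_name, score, scop_id = fields
        groups[scop_id].append((pdb_name, float(score)))

    summary = []
    for scop_id, hits in groups.items():
        scores = [score for _, score in hits]
        names = [name for name, _ in hits]
        summary.append((scop_id, len(hits), sum(scores) / len(scores), names))
    summary.sort(key=lambda row: row[2])
    return summary
```

Applied to the two ffas hit lines listed under (1), this gives a single 
superfamily, 2.32.2, with two hits and a mean score of about -31.95.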
 
For more complete results see our "private" site at 
http://csb.stanford.edu/levitt/casp1234.   During the CASP event, information 
contained in that site was updated regularly by Levitt and shared with the 
different CASP4 groups in my lab headed by Samudrala, Xia, Fain and Koehl 
respectively.  This is the only information that was shared.  Each group then 
went on to make their own comparative models (Samudrala, Koehl and Levitt) 
and/or ab initio models (Fain, Levitt, Samudrala, and Xia).  There was no 
comparison of models, as each individual preferred to use CASP as an 
opportunity to perfect their methods rather than to "win" CASP.

Overall we felt very confident (perhaps wrongly so) about recognizing an 
appropriate template in the comparative modeling and fold recognition parts of 
CASP4.  We considered 17 targets to be Comparative Modeling targets (T0089, 
T0090, T0092, T0099, T0101, T0103, T0111, T0112, T0113, T0117, T0119, T0121, 
T0122, T0123, T0125, T0127, T0128) and did them all.  Of the remaining 26 
targets, we considered 18 to be Fold-Recognition targets and 8 to be Ab Initio 
targets.  For those targets that we considered to be fold-recognition targets, 
9 were considered easy as there was very clear sequence similarity (T0087, 
T0088, T0093, T0096, T0098, T0100, T0104, T0109, T0116), and 7 were considered 
difficult and could not have been done without the consensus use of the 
servers participating in CAFASP (T0094, T0095, T0107, T0108, T0115, T0118,   
T0126), and 2 were considered to have no recognizable fold (T0120, T0124).  
They were also too large for ab initio modeling so no results were submitted 
for these.  In all cases we submitted all-atom models.  

In the predictions done by the Levitt group, all the alignments for targets 
submitted after 15 August were re-aligned using the structure of the template 
to modify normal dynamic programming.  This was done as follows: (a) The cost 
of deleting residues from the template was proportional to the distance across 
the gap in three-dimensions (measured between the CA atoms adjacent to the 
gap).  (b) The cost of inserting residues depended on how buried the residues 
adjacent to the insertion were.  (c) Buried residues were given greater weight 
in the scoring.  Each of these measures has an associated weight; not having 
time to optimize these weights on known structural alignments, we used 25 
combinations of parameters and generated alignments for every one.
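
To make rules (a) and (b) concrete, the Python sketch below gives one 
possible form for the two structure-dependent gap costs and for enumerating 
weight combinations. The linear functional forms, the burial scale from 0 to 
1, the function names and the idea of a grid over two weights are all 
assumptions of this sketch; the abstract does not state the formulas or 
which 25 combinations were actually used.

```python
import math
from itertools import product


def deletion_cost(ca_before, ca_after, weight):
    """Cost of deleting template residues, as in rule (a).

    Proportional to the 3D distance between the CA atoms on either
    side of the gap; coordinates are (x, y, z) tuples.
    """
    return weight * math.dist(ca_before, ca_after)


def insertion_cost(burial_before, burial_after, weight):
    """Cost of inserting residues, as in rule (b).

    Assumed here to grow linearly with the mean burial of the two
    flanking template residues (0 = exposed, 1 = fully buried).
    """
    return weight * 0.5 * (burial_before + burial_after)


def parameter_grid(deletion_weights, insertion_weights):
    """Return every (deletion, insertion) weight pair to try."""
    return list(product(deletion_weights, insertion_weights))
```

With five candidate values for each of two weights, parameter_grid yields 25 
pairs, one alignment being generated per pair; this is only one way such a 
number of combinations could arise.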

All the alignments, whether taken from CAFASP before 15 August or re-aligned as 
described above, were then used with our well-established automatic modeling 
methods, SegMod and Encad, to generate stereochemically acceptable all-atom 
models for each alignment (see Levitt, M. Accurate Modelling of Protein 
Conformation by Automatic Segment Matching. J. Mol. Biol. 226, 507-533 (1992) 
and Levitt, M. Energy Refinement of Hen Egg-White Lysozyme. J. Mol. Biol. 82, 
393-420 (1974)).    
 
Finally the best models were selected as follows.  We used the rapdf probability 
score (Samudrala, R & Moult, J.  An All-atom Distance-dependent Conditional 
Probability Discriminatory Function for Protein Structure Prediction.  J. Mol. 
Biol., 275: 893-914, (1998)) to choose the best 1000 models (if there were that 
many).  These 1000 or fewer models were grouped into 10 clusters (using 
bottom-up hierarchical clustering based on inter-structure CA coordinate RMS 
deviation).  For each model we used the rapdf score, Samudrala's HCF 
hydrophobic compactness score, Keasar's surface energy, and the number of 
hydrogen bonds to rank the conformations in each cluster.  Finally we chose the 
five lowest-energy models, never including more than one model from a given 
cluster. Occasionally manual intervention was used in deciding the rank of the 
models in the official submission to CASP.  For this we viewed the models to 
judge general protein-like shape and also took the coverage into account.  For example, a 
model with a less favorable energy score may be ranked above a model with 
better score if the first model covered more of the target sequence