Peter J. Munson and Valentina DiFrancesco
Analytical Biostatistics Section, Laboratory of Structural Biology, Division of Computer Research and Technology, National Institutes of Health, Bldg 12A, Room 204, Bethesda, MD 20892-5626, {munson@helix.nih.gov}

Maximum Likelihood Periodic Quadratic-Logistic Profile Predictions

We have submitted secondary structure predictions for the proteins ipns, pbdg, ppdk, prosub, l14, staufen3, and mystery. Our method is a maximum likelihood quadratic-logistic (QL) discrimination model based on profiles [1,2]. Briefly, we calibrated a logistic model for three-state prediction using the maximum likelihood principle, assuming that the secondary structural state obeys an independent trinomial probability model. The logistic model includes linear or "main-effect" terms for every amino acid residue within a 17-residue window around the position to be predicted, together with certain quadratic or "pair-wise" effects. Specifically, we assume a 3.6-residue period for the helix component and a 2.0-residue period for the strand component, and multiply the pair-wise term for two residues k positions apart by cos(2*pi*k/3.6) or cos(2*pi*k/2.0), respectively. The 2*20*20 = 800 residue-pair preference parameters are estimated along with the main-effect terms, using a penalized maximum likelihood technique. The cross-validated prediction rate for this method is 62.5% on single sequences.
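The periodic weighting and the three-state logistic link described above can be illustrated with a minimal Python sketch. It is not the calibrated model: the function names, the choice to sum over all residue pairs in the window, and the softmax form of the three-state link are assumptions made for illustration, and the preference table passed in is a placeholder rather than fitted parameters.

```python
import math

HELIX_PERIOD = 3.6   # residues, as stated in the text
STRAND_PERIOD = 2.0  # residues, as stated in the text


def periodic_weight(k, period):
    """Cosine factor for a pair of residues k positions apart."""
    return math.cos(2.0 * math.pi * k / period)


def pair_score(window, pair_pref, period):
    """Sum of periodically weighted pair preferences over a window.

    `pair_pref` is a hypothetical dict mapping (residue, residue) to a
    preference value; pairs not in the dict contribute zero. Summing over
    all pairs i < j in the window is an assumption of this sketch.
    """
    total = 0.0
    for i in range(len(window)):
        for j in range(i + 1, len(window)):
            pref = pair_pref.get((window[i], window[j]), 0.0)
            total += pref * periodic_weight(j - i, period)
    return total


def trinomial_probs(scores):
    """Map three scores (helix, strand, coil) to probabilities.

    Uses a multinomial-logistic (softmax) link, assumed here as the
    three-state form of the logistic model.
    """
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]
```

With a 2.0-residue period the weight alternates in sign with separation (negative for odd k, positive for even k), while the 3.6-residue period returns to its maximum every 18 residues, that is, every five helical turns.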

The profile method begins with a set of aligned homologous sequences. Rather than representing each sequence element by a 20-vector of zeros and ones (dummy variables), it uses the proportions of each residue seen at an aligned position, giving a 20-vector of proportions. For the quadratic terms, we replace the 400-vector of dummy variables representing the residue pairs observed within a window with the corresponding 400-vector of proportions. Alignments were made by first choosing homologues from Swiss-Prot or PIR with greater than 20% homology over stretches longer than 80 residues, and then using either pairwise or multiple alignment programs (CLUSTAL, PILE-UP) to determine the alignments. Alignments were reviewed manually to remove obviously spurious homologues or spurious portions of the alignment. Regions with gaps in the final alignment were arbitrarily assigned a high coil probability. The expected percent correct for QL using profiles is 67% to 69%, based on two separate cross-validated tests.
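The step from dummy variables to proportions can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the residue ordering, and the choice to skip gap characters when counting are assumptions of the sketch.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # ordering is arbitrary, chosen for this sketch


def column_profile(column):
    """Return the 20-vector of residue proportions for one aligned column.

    Characters outside the 20 standard residues (gaps, unknowns) are
    skipped; this handling is an assumption of the sketch.
    """
    counts = [0] * 20
    for ch in column:
        idx = AMINO_ACIDS.find(ch.upper())
        if idx >= 0:
            counts[idx] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * 20
    return [c / total for c in counts]


def alignment_profile(aligned):
    """Return one 20-vector of proportions per aligned position."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must have equal length")
    return [column_profile([seq[i] for seq in aligned]) for i in range(length)]


def pair_profile(aligned, i, j):
    """Return the 400-vector of residue-pair proportions for positions i, j.

    Entry 20*a + b is the fraction of sequences with residue a at position i
    and residue b at position j, among sequences with residues at both.
    """
    counts = [0] * 400
    for seq in aligned:
        a = AMINO_ACIDS.find(seq[i].upper())
        b = AMINO_ACIDS.find(seq[j].upper())
        if a >= 0 and b >= 0:
            counts[20 * a + b] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * 400
    return [c / total for c in counts]
```

For a single sequence the proportions reduce to zeros and ones, so the dummy-variable representation is recovered as a special case.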

[1] Munson, P. J., V. Di Francesco and R. Porrelli. Protein Secondary Structure Prediction using Periodic-Quadratic-Logistic Models: Theoretical and Practical Issues. 27th Annual Hawaii International Conference on System Sciences. 5: 375-384, 1994.
and updated in:
[2] Di Francesco, V., P.J. Munson, J. Garnier. Use of Multiple Alignments in Protein Secondary Structure Prediction. 28th Hawaii International Conference on System Sciences. (accepted), 1995.
