Several quality assessment evaluation issues were raised at the CASP9 meeting and subsequently discussed in the CASP9 QA evaluation paper (http://onlinelibrary.wiley.com/doi/10.1 ... 23180/full). In summary, the main problems are as follows:
1) Single-model and quasi-single-model methods are disadvantaged in comparison to clustering methods on large sets of models.
2) The unfiltered CASP dataset may not be optimal for the evaluation of QA methods (the performance of clustering methods is overestimated when models are widely spread in quality).
3) Structure evaluation is domain-based while QA is target-based. How can we switch to domain-based evaluation for QA1?
To address these issues in the upcoming CASP10 evaluation, we suggest the following changes to the QA prediction procedure.
1.1. The Prediction Center (PC) ranks all server models submitted on a target with the naïve consensus method within 2 days of closing the server prediction window on the target. As the analysis of CASP9 results shows, the correlation of the naive_consensus score with GDT_TS is expected to be very high on the whole set of server models (up to 0.97 on average), and therefore we can expect the ranking to reflect the real quality of the models quite adequately. The ranking from the naïve consensus method will not be released to predictors but will be used as guidance for the PC's preparation of model test sets.
1.2. For each target, the PC sorts server models into 30 quality bins and a) releases 30 (and/or 60) models for quality assessment, one (and/or two) from each bin. This way, the released representative models will cover the whole range of model accuracies in the full model set. Alternatively, b) the PC releases 30 (and/or 60) randomly picked server models. One way or the other, we will release for quality estimation a subset of models that is small enough to eliminate the advantage of clustering methods over single-model ones (a sketch of this selection procedure is given after item 1.4). The prediction window for stage 1.2 will be open for 3 days for all groups (server- and regular-deadline).
1.3. After closing stage 1.2, we will release the best 150 models (according to the naïve method's ranking) submitted on the target. This way, really poor models will be eliminated from the dataset, and all QA methods will receive an input set that is likely more similar to the datasets encountered in real-life applications. The prediction window for stage 1.3 will be open for another 3 days for all groups.
1.4. After closing stage 1.3, we will release all server models submitted on the target. These models will not be used further for QA prediction but may be used by regular TS predictors.
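The post does not spell out how the naïve consensus score is computed or how the 30 quality bins are defined; the sketch below is one plausible reading, in which the consensus score of a model is its average pairwise GDT_TS against all other server models for the target, and the bins are equal-width intervals of that score. The function names, the precomputed pairwise_gdt matrix, and the equal-width binning are all illustrative assumptions, not the official Prediction Center pipeline.

# Sketch of stages 1.1-1.2 under the assumptions stated above.
# `pairwise_gdt` is a hypothetical N x N matrix of pairwise GDT_TS values
# (0-100) between the N server models submitted on a target.

import numpy as np

def naive_consensus_scores(pairwise_gdt):
    """Average similarity of each model to all other models (diagonal excluded)."""
    n = pairwise_gdt.shape[0]
    return (pairwise_gdt.sum(axis=1) - np.diag(pairwise_gdt)) / (n - 1)

def select_bin_representatives(scores, n_bins=30, per_bin=1, seed=0):
    """Split models into n_bins equal-width quality bins and pick
    per_bin random representatives from each non-empty bin."""
    rng = np.random.default_rng(seed)
    edges = np.linspace(scores.min(), scores.max(), n_bins + 1)
    bins = np.digitize(scores, edges[1:-1])  # bin index 0..n_bins-1 per model
    selected = []
    for b in range(n_bins):
        members = np.flatnonzero(bins == b)
        if members.size:
            picks = rng.choice(members, size=min(per_bin, members.size), replace=False)
            selected.extend(picks.tolist())
    return selected

# Usage on one target:
# scores = naive_consensus_scores(pairwise_gdt)      # stage 1.1 ranking
# stage2_set = select_bin_representatives(scores)    # stage 1.2, option (a)
# stage3_set = np.argsort(scores)[::-1][:150]        # stage 1.3, best 150 models

Whether equal-width or equal-count (quantile) bins would be used is not stated; equal-count bins would guarantee that every bin is populated.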
To address evaluation issue (3), we suggest calculating a global per-domain quality score 0 ≤ S ≤ 1 from the per-residue distance deviations d_i using the S-score technique (Levitt and Gerstein, PNAS 1998):
S = (1/|D|) Σ_{i∈D} [ 1 / (1 + (d_i/d_0)^2) ],
where d_i is the predicted distance error for residue i, d_0 is a scaling parameter, and D is the set of residues in the evaluation domain. We plan to use this score in addition to the whole-target global scores submitted by the predictors.
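As a worked illustration of the formula above, the following minimal sketch computes S for one domain from predicted per-residue distance errors. The value of d_0 is not given in this post; d0 = 5.0 Å below is only a placeholder assumption, as are the function and argument names.

def s_score(predicted_errors, domain_residues, d0=5.0):
    """Global per-domain S-score in [0, 1].

    predicted_errors: dict mapping residue index -> predicted distance error (Å)
    domain_residues:  iterable of residue indices belonging to the domain D
    d0:               distance scaling parameter (assumed value, see above)
    """
    residues = list(domain_residues)
    terms = [1.0 / (1.0 + (predicted_errors[i] / d0) ** 2) for i in residues]
    return sum(terms) / len(residues)

# A perfect prediction (all d_i = 0) gives S = 1; errors much larger than d0
# push the per-residue terms, and hence S, toward 0.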
Please let us know what you think about the suggested changes by posting your comments.
Thanks,
CASP9 QA assessors:
Andriy Kryshtafovych, Krzysztof Fidelis, Anna Tramontano