Link Search Menu Expand Document

DOME recommendations

Duration 15 minutes

Slides content

ELIXIR ML Focus Group ELIXIR ML Focus Group
More information on the web page
Currently co-chaired by Fotis Psomopoulos, Silvio Tossato, Leyla Jael Castro
Why DOME
Why DOME
Why do we need standards to publish machine learning approaches?
Machine/Deep learning has a high impact in biomedical research, it touches all areas.
We have more and more publications but there is no agreement on what key elements could be use to describe a machine learning approach
For instance, about 20% of publications including a machine learning approach do not include information about the evaluation
DOME to the rescue The DOME recommendations propose some of those key elements for machine learning approaches
They include Data, Optimization, Model and Evaluation
By now, they focus on supervised learning only
Data in DOME
  • Provenance: Protein Data Bank (PDB). X-ray structures missing residues. Npos = 339,603 residues. Nneg = 6,168,717 residues. Previously used in (Walsh et al., Bioinformatics 2015) as an independent benchmark set.
  • Dataset splits: training set: N/A. Npos,test = 339,603 residues. Nneg,test = 6,168,717 residues. No validation set. 5.22% positives on the test set.
  • Redundancy between data splits: Not applicable. Availability of data Yes, URL: http://protein.bio.unipd.it/mobidblite/. Free use license.
Optimization in DOME
  • Algorithm: Majority-based consensus classification based on 8 primary ML methods and post-processing.
  • Meta-predictions: Yes, predictor output is a binary prediction computed from the consensus of other methods; Independence of training sets of other methods with test set of meta-predictor was not tested since datasets from other methods were not available.
  • Data encoding: Label-wise average of 8 binary predictions.
  • Parameters: p = 3 (Consensus score threshold, expansion-erosion window, length threshold). No optimization.
  • Features: Not applicable.
  • Fitting: Single input ML methods are used with default parameters. Optimization is a simple majority.
  • Regularization: No.
  • Availability of configuration: Not applicable.
Model in DOME
  • Interpretability: Transparent, in so far as meta-prediction is concerned. Consensus and post processing over other methods predictions (which are mostly balck boxes). No attempt was made to make the meta-prediction a black box.
  • Output: Classification, i.e. residues thought to be disordered.
  • Execution time: ca. 1 second per representative on a desktop PC.
  • Availability of software: Yes, URL: http://protein.bio.unipd.it/mobidblite/. Bespoke license free for academic use.
Evaluation in DOME
  • Evaluation method: Independent dataset
  • Performance measures: Balanced Accuracy, Precision, Sensitivity, Specificity, F1, MCC.
  • Comparison: DisEmbl-465, DisEmbl-HL, ESpritz Disprot, ESpritz NMR, ESpritz Xray, Globplot, IUPred long, IUPred short, VSL2b. Chosen methods are the methods from which the meta prediction is obtained.
  • Confidence: Not calculated.
  • Availability of evaluation: No.
DOME publication You can find more information in the published paper and the DOME website
What comes next?
  • DOME annotation on scholarly articles → so we learn how much is commonly reported about ML approaches
  • DOME formalization as structured metadata → BioHackathon Europe 2022, project 17 “Metadata schemas for Linked Open Science”
  • (Metadata) DOME for quality assessment and comparison
  • (Metadata) Connecting metrics to the DOME recommendations → moving beyond the qualitative assessment
  • (Metadata) Extending DOME beyond supervised approaches
  • Adoption → researchers, publishers, repositories (e.g, NFDI4DataScience portal)