Unraveling compound taxonomies in untargeted metabolomics through artificial intelligence

Henrique dos Santos Silva
Dissertation for Master's Degree in Biochemistry, Specialization in Biochemistry, University of Lisbon, Portugal
AI/ML Published: (Jun/2022)
DOI: http://hdl.handle.net/10451/56544

FT-ICR-MS instruments offer ultra-high resolution and extreme mass accuracy, which allow the unambiguous assignment of chemical formulas to metabolites. This work aimed to develop a tool for FT-ICR-MS-based metabolomics data that uses the annotated chemical formulas to classify metabolites into a more descriptive taxonomy than those of existing classification systems. The ChemOnt taxonomy was used; it is hierarchical, with four main classification levels: Kingdom, Superclass, Class, and Subclass. AI approaches, namely machine learning (ML) classification algorithms, were used to build the classification model, following a local classifier per parent node hierarchical approach. Five established algorithms were used to train and tune each classifier: random forest (RF), k-nearest neighbors (KNN), logistic regression (LR), support vector machine (SVM), and naive Bayes (NB). Hyperparameters were tuned by grid search with 3-fold cross-validation, scored by the macro-averaged F1-score. Feature selection was based on the mean decrease in impurity (MDI) of RF, and one feature from each pair of correlated features was removed. MDI showed that, of a total of 133 features, only 25 had an importance of at least 0.1 in at least one of the classifiers. The “Organic compounds” classifier showed substantial overfitting. Cost-complexity pruning was applied, but performance did not improve and overfitting did not decrease. Two multiclass approaches were also tried with this classifier: “output-code” and “one vs rest” with sampling of the negatives. Neither improved the performance of the classifier. Algorithm performance, in decreasing order, was RF, KNN, LR, SVM, and NB; the last two were left out of the final classification model. Validation accuracy on the test set at each level was 99.98% (Kingdom), 88.4% (Superclass), 79.7% (Class), and 74.6% (Subclass).
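The feature selection and tuning steps described above can be sketched with scikit-learn. This is a minimal illustration on synthetic data, not the thesis code: the helper names `select_features_mdi` and `tune_random_forest`, the correlation cutoff `max_abs_corr`, the hyperparameter grid, and the order in which the MDI filter and the correlation pruning are applied are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV


def select_features_mdi(X, y, min_importance=0.1, max_abs_corr=0.9, seed=0):
    """Return column indices kept after MDI filtering and correlation pruning.

    Hypothetical helper: thresholds and ordering are illustrative assumptions.
    """
    forest = RandomForestClassifier(n_estimators=100, random_state=seed).fit(X, y)
    mdi = forest.feature_importances_  # impurity-based importances, sum to 1
    # Visit candidates from most to least important so that, within a
    # correlated pair, the more important feature is the one that survives.
    candidates = [i for i in np.argsort(-mdi) if mdi[i] >= min_importance]
    corr = np.abs(np.corrcoef(X, rowvar=False))
    kept = []
    for i in candidates:
        if all(corr[i, j] < max_abs_corr for j in kept):
            kept.append(int(i))
    return sorted(kept)


def tune_random_forest(X, y, seed=0):
    """Grid search with 3-fold cross-validation, scored by macro-averaged F1."""
    grid = {"n_estimators": [50, 100], "max_depth": [None, 5]}  # illustrative values
    search = GridSearchCV(
        RandomForestClassifier(random_state=seed),
        param_grid=grid,
        scoring="f1_macro",
        cv=3,
    )
    return search.fit(X, y)


# Synthetic stand-in data; NOT the thesis dataset.
X, y = make_classification(
    n_samples=300, n_features=12, n_informative=3, n_redundant=2,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)
columns = select_features_mdi(X, y)
search = tune_random_forest(X[:, columns], y)
print(columns, search.best_params_, round(search.best_score_, 3))
```

In a local per parent node setup, a routine like this would presumably be run once per parent node, on the subset of compounds that belong to that node.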
Experimental validation with FT-ICR-MS data (yeast and human fingerprint) showed that all compounds belonged to the “Organic compounds” kingdom, classified with 100% accuracy; accuracy at the remaining levels was >92% (Superclass), >87% (Class), and >78% (Subclass).
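To make the local per parent node idea and the per-level accuracies concrete, the sketch below walks a compound down the hierarchy by querying one classifier per parent node, then scores predictions level by level. It is an illustration under stated assumptions: `predict_path`, `accuracy_per_level`, the `"root"` key, and the threshold rules in `toy_classifiers` are hypothetical, and the way the thesis handles paths that stop early when computing accuracy is not stated in the abstract.

```python
from typing import Callable, Dict, List, Sequence

LEVELS = ["Kingdom", "Superclass", "Class", "Subclass"]

# A node classifier maps one feature vector to the label of a child node.
NodeClassifier = Callable[[Sequence[float]], str]


def predict_path(x, classifiers: Dict[str, NodeClassifier], root="root") -> List[str]:
    """Walk down the hierarchy, asking each parent's own classifier for the child."""
    path, node = [], root
    while node in classifiers and len(path) < len(LEVELS):
        node = classifiers[node](x)
        path.append(node)
    return path


def accuracy_per_level(true_paths, predicted_paths) -> Dict[str, float]:
    """Fraction of compounds labeled correctly at each level.

    Assumption: a prediction that stops above a level counts as wrong there,
    and compounds with no true label at a level are skipped for that level.
    """
    scores = {}
    for depth, level in enumerate(LEVELS):
        pairs = [
            (t[depth], p[depth] if depth < len(p) else None)
            for t, p in zip(true_paths, predicted_paths)
            if depth < len(t)
        ]
        if pairs:
            scores[level] = sum(t == p for t, p in pairs) / len(pairs)
    return scores


# Toy threshold rules standing in for trained classifiers (illustrative only).
toy_classifiers = {
    "root": lambda x: "Organic compounds",
    "Organic compounds": lambda x: (
        "Lipids and lipid-like molecules" if x[0] > 0.5
        else "Organoheterocyclic compounds"
    ),
    "Lipids and lipid-like molecules": lambda x: (
        "Fatty Acyls" if x[1] > 0.5 else "Prenol lipids"
    ),
}

samples = [[0.9, 0.9], [0.1, 0.9]]
true_paths = [
    ["Organic compounds", "Lipids and lipid-like molecules", "Fatty Acyls"],
    ["Organic compounds", "Lipids and lipid-like molecules", "Prenol lipids"],
]
predicted_paths = [predict_path(x, toy_classifiers) for x in samples]
scores = accuracy_per_level(true_paths, predicted_paths)
print(scores)
```

Because each classifier only sees the children of its own parent, an error at an upper level propagates downward, which is consistent with accuracy decreasing from Kingdom to Subclass.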