GC-EI-MS datasets of trimethylsilyl (TMS) and tert-butyl dimethyl silyl (TBDMS) derivatives for development of machine learning-based compound identification approaches

Milka Ljoncheva, Sintija Stevanoska, Tina Kosjek, Sašo Džeroski
(2022) J.Chem. 14(1):62
AI/ML Published: (Apr/2023)
DOI: https://doi.org/10.1016/j.dib.2023.109138

In environmental and health studies, recent work has focused on identifying contaminants of emerging concern (CEC). This is a complex and challenging task, because resources covering these compounds, such as compound databases (DBs) and mass spectral libraries (MSLs), are scarce. This is particularly true for semi-polar organic contaminants that must be derivatized prior to gas chromatography-mass spectrometry (GC-MS) analysis with electron impact ionization (EI), for which hardly any records can be found. In particular, there is a severe lack of publicly available GC-EI-MS spectral datasets generated for the development, validation, and performance evaluation of cheminformatics-assisted compound structure identification (CSI) approaches, including novel machine learning (ML)-based approaches.