InChI Tag: Working Group

Capturing Mixtures — Bringing Informatics to the World of Practical Chemistry

Recorded live December 19, 2019
CDD Vault Webinar
https://www.collaborativedrug.com/recorded-cdd-webinar-capturing-mixtures-bringing-informatics-world-practical-chemistry/

Chris Jakober, Leah McEwen and Alex Clark
Slides and Video Available at CDD VAULT

Webinar Summary

Our industry has been using cheminformatics to support drug discovery for decades, leveraging formats for describing organic molecules, such as Molfile, SMILES, and InChI. These are idealized concepts rather than a description of the laboratory reality: it is rare that a substance can be accurately described with a single molecule.

Almost every sample has an impurity level, or is dissolved in solvent, or exists as an adduct, or is explicitly combined with other substances. Mixtures often combine certainty and uncertainty within the same description: some components can be well-defined molecules with an accurately measured molar concentration, while others are estimated portions, or amorphous adjuncts.

The cheminformatics community has yet to select a standard format for describing mixtures in a machine-readable, standardized, and interoperable way, and most publications fall back to using a text description. Electronic lab notebooks and inventory databases are forced to choose between using text, proprietary formats, or ignoring mixture composition altogether.

We will describe our work toward two new data structures: Mixfile and MInChI, which are intended to fill roles that are analogous to Molfile and InChI, respectively. We will describe the ways in which we expect that mixtures-based informatics tools will affect all industries that intersect with chemistry.
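As a rough illustration of why a single-molecule record falls short, the sketch below models a mixture as a tree of components, each with an optional structure and an optional quantity. The class and field names are hypothetical and are not the Mixfile schema; the sample contents are made up for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Component:
    """One node in a mixture tree (hypothetical layout, not the Mixfile schema)."""
    name: str
    smiles: Optional[str] = None        # None when no defined structure is known
    quantity: Optional[float] = None    # None when the amount is unknown
    units: Optional[str] = None
    approximate: bool = False           # True for estimated portions
    contents: List["Component"] = field(default_factory=list)


def describe(component: Component, depth: int = 0) -> List[str]:
    """Flatten the tree into indented text lines, one per component."""
    if component.quantity is None:
        amount = "amount unknown"
    else:
        prefix = "~" if component.approximate else ""
        amount = f"{prefix}{component.quantity:g} {component.units}"
    structure = component.smiles or "no defined structure"
    lines = [f"{'  ' * depth}{component.name}: {structure}, {amount}"]
    for child in component.contents:
        lines.extend(describe(child, depth + 1))
    return lines


# A made-up sample: a measured solute, an unquantified solvent,
# and an estimated impurity with no known structure.
sample = Component(
    name="stock solution",
    contents=[
        Component("acetic acid", smiles="CC(=O)O", quantity=1.0, units="mol/L"),
        Component("water", smiles="O"),
        Component("unidentified impurity", quantity=0.5, units="%", approximate=True),
    ],
)

for line in describe(sample):
    print(line)
```

The point of the sketch is that certainty and uncertainty sit side by side in one record: a well-defined structure with a measured concentration, next to a component whose structure or amount is simply absent.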

Join our expert panelists as we discuss the future of mixture representation.

The RInChI Project

https://www-rinchi.ch.cam.ac.uk/
https://www-rinchi.ch.cam.ac.uk/rinchi.php

The aim of the RInChI Project is to create a unique, canonical text string to describe a reaction. Different researchers working on the same reaction should be able to generate the same RInChI without needing to confer with each other.

A possible extension to the RInChI as a means of providing machine readable process data

Philipp-Maximilian Jacob, Tian Lan, Jonathan M. Goodman & Alexei A. Lapkin
Journal of Cheminformatics volume 9, Article number: 23 (2017)

Abstract: The algorithmic, large-scale use and analysis of reaction databases such as Reaxys is currently hindered by the absence of widely adopted standards for publishing reaction data in machine readable formats. Crucial data such as yields of all products or stoichiometry are frequently not explicitly stated in the published papers and, hence, not reported in the database entry for those reactions, limiting their usefulness for algorithmic analysis. This paper presents a possible extension to the IUPAC RInChI standard via an auxiliary layer, termed ProcAuxInfo, which is a standardised, extensible form in which to report certain key reaction parameters such as declaration of all products and reactants as well as auxiliaries known in the reaction, reaction stoichiometry, amounts of substances used, conversion, yield and operating conditions. The standard is demonstrated via creation of the RInChI including the ProcAuxInfo layer based on three published reactions and demonstrates accurate data recoverability via reverse translation of the created strings. Implementation of this or another method of reporting process data by the publishing community would ensure that databases, such as Reaxys, would be able to abstract crucial data for big data analysis of their contents.

Analysing a billion reactions with the RInChI

Jonathan M. Goodman, Gerd Blanke and Hans Kraut
From the journal Pure and Applied Chemistry

Abstract: The RInChI is a canonical identifier for reactions which is widely used in reaction databases. It can be used to handle large collections of reactions and to link information from diverse data sources. How much information can it handle? Studies of the SAVI database, which contains more than a billion reactions, demonstrate that the RInChI is useful in analysing such a large collection of molecular data, and the reduced form of the Web-RInChIKey contains enough information to be an effective differentiator of reactions. Issues of NH tautomerism and stereochemistry are handled effectively. The RInChI illustrates that some of the properties of the algorithmically-generated SAVI database differ from SPRESI, which is a collection of experimental data. The RInChI has different properties to Reaction SMILES and both approaches provide useful and distinct information. We recommend that the RInChI be included in data models for reactions.

International chemical identifier for reactions (RInChI)

Guenter Grethe, Jonathan M Goodman & Chad HG Allen
Journal of Cheminformatics volume 5, Article number: 45 (2013)

Abstract: The IUPAC International Chemical Identifier (InChI) provides a method to generate a unique text descriptor of molecular structures. Building on this work, we report a process to generate a unique text descriptor for reactions, RInChI. By carefully selecting the information that is included and by ordering the data carefully, different scientists studying the same reaction should produce the same RInChI. If differences arise, these are most likely the minor layers of the InChI, and so may be readily handled. RInChI provides a concise description of the key data in a chemical reaction, and will help enable the rapid searching and analysis of reaction databases.
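The role of careful ordering can be sketched with a toy example. The function below is not the RInChI algorithm; `toy_reaction_id` and its `ToyRxn=` prefix and separators are hypothetical names chosen for illustration. It only shows that sorting the component identifiers on each side makes the resulting string independent of the order in which a scientist happened to list them.

```python
from typing import Iterable


def toy_reaction_id(reactants: Iterable[str], products: Iterable[str]) -> str:
    """Build an order-independent reaction string from component identifiers.

    This is NOT the RInChI algorithm; it only illustrates canonical ordering.
    """
    # Sort each side so that the input order cannot affect the result.
    left = sorted(reactants)
    right = sorted(products)
    return "ToyRxn=" + "|".join(left) + ">>" + "|".join(right)


# Standard InChIs for acetic acid, ethanol, ethyl acetate and water.
acetic_acid = "InChI=1S/C2H4O2/c1-2(3)4/h1H3,(H,3,4)"
ethanol = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
ethyl_acetate = "InChI=1S/C4H8O2/c1-3-6-4(2)5/h3H2,1-2H3"
water = "InChI=1S/H2O/h1H2"

# Two scientists enter the same esterification in different orders.
a = toy_reaction_id([acetic_acid, ethanol], [ethyl_acetate, water])
b = toy_reaction_id([ethanol, acetic_acid], [water, ethyl_acetate])
print(a == b)
```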

International chemical identifier for reactions (RInChI)

Guenter Grethe, Gerd Blanke, Hans Kraut & Jonathan M. Goodman
Journal of Cheminformatics volume 10, Article number: 22 (2018)
May 9, 2018

Abstract: The Reaction InChI (RInChI) extends the idea of the InChI, which provides a unique descriptor of molecular structures, towards reactions. Prototype versions of the RInChI have been available since 2011. The first official release (RInChI-V1.00), funded by the InChI Trust, is now available for download (http://www.inchi-trust.org/downloads/). This release defines the format and generates hashed representations (RInChIKeys) suitable for database and web operations. The RInChI provides a concise description of the key data in chemical processes, and facilitates the manipulation and analysis of reaction data.
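To show why a hashed representation suits database and web operations, here is a minimal sketch that reduces an arbitrarily long identifier to a short fixed-length key. The function `toy_key` is hypothetical and does not reproduce the RInChIKey format, which has its own defined layout; the sketch only demonstrates the general idea of a deterministic, fixed-width hash.

```python
import hashlib


def toy_key(identifier: str, length: int = 16) -> str:
    """Return a fixed-length uppercase hex digest prefix of an identifier string.

    Illustrative only: the real RInChIKey has its own defined format.
    """
    digest = hashlib.sha256(identifier.encode("utf-8")).hexdigest()
    return digest[:length].upper()


# A long made-up identifier collapses to a short key suitable for indexing.
long_id = "ToyRxn=" + "|".join(f"component-{i}" for i in range(50))
key = toy_key(long_id)
print(len(long_id), len(key))
```

The same input always yields the same key, so the key can serve as an exact-match index column or URL fragment, at the cost of no longer being reversible to the original string.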

NCI CADD Tautomerizer

Tautomerizer – Predict tautomers based on 80+ rules

https://cactus.nci.nih.gov/tautomerizer/

Introduction from Web Service (11/24/2022):

Experimental service that allows you to test a set of tautomeric transforms with your own molecules. The predefined set of transforms comprises both the current 24 standard rules used by the chemoinformatics toolkit CACTVS and 55+ additional rules compiled in the context of the IUPAC project of Redesign of the Handling of Tautomerism in InChI V2.

Please be aware that this is a chemoinformatics tool, i.e. the tautomer generation process is strictly pattern-based and does not take energetics into account in any way. Some of the generated tautomers may therefore be of high energy and not detectable experimentally.
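What "strictly pattern-based" means can be sketched with a single toy rule. The code below is not one of the CACTVS or IUPAC transforms and does not use SMIRKS; it is a hypothetical, simplified 1,3 hydrogen shift (H-X-Y=Z becomes X=Y-Z-H) applied to a hand-built molecular graph. Like the service described above, it ignores energetics entirely.

```python
from typing import Dict, FrozenSet, List, Tuple

Bonds = Dict[FrozenSet[int], int]               # atom index pair -> bond order
Molecule = Tuple[List[str], List[int], Bonds]   # elements, H counts, bonds

HETERO = {"N", "O", "S"}


def shift_13(mol: Molecule) -> List[Molecule]:
    """Apply one 1,3 hydrogen shift H-X-Y=Z -> X=Y-Z-H wherever the pattern matches."""
    elements, hcounts, bonds = mol
    results = []
    for pair, order in bonds.items():
        if order != 2:
            continue
        # Try both orientations of the double bond Y=Z.
        for y, z in (tuple(pair), tuple(pair)[::-1]):
            for other, other_order in bonds.items():
                if other_order != 1 or y not in other or other == pair:
                    continue
                (x,) = other - {y}
                # Require a hydrogen on X and a heteroatom at X or Z.
                if hcounts[x] < 1 or not ({elements[x], elements[z]} & HETERO):
                    continue
                new_h = list(hcounts)
                new_h[x] -= 1
                new_h[z] += 1
                new_bonds = dict(bonds)
                new_bonds[other] = 2   # X-Y becomes X=Y
                new_bonds[pair] = 1    # Y=Z becomes Y-Z
                results.append((elements, new_h, new_bonds))
    return results


# Acetone: CH3-C(=O)-CH3, atoms indexed 0..3.
acetone: Molecule = (
    ["C", "C", "O", "C"],
    [3, 0, 0, 3],
    {frozenset({0, 1}): 1, frozenset({1, 2}): 2, frozenset({1, 3}): 1},
)
enols = shift_13(acetone)
print(len(enols), enols[0][1])
```

Applying the same rule to one of the generated enol forms moves the hydrogen back and recovers the keto form, which is why such rules describe interconversion rather than a one-way reaction.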

Enumeration of Ring–Chain Tautomers Based on SMIRKS Rules

Laura Guasch, Markus Sitzmann, and Marc C. Nicklaus
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4170818/
J Chem Inf Model. 2014 Sep 22; 54(9): 2423–2432.

Abstract: A compound exhibits (prototropic) tautomerism if it can be represented by two or more structures that are related by a formal intramolecular movement of a hydrogen atom from one heavy atom position to another. When the movement of the proton is accompanied by the opening or closing of a ring it is called ring–chain tautomerism. This type of tautomerism is well observed in carbohydrates, but it also occurs in other molecules such as warfarin. In this work, we present an approach that allows for the generation of all ring–chain tautomers of a given chemical structure. Based on Baldwin’s Rules estimating the likelihood of ring closure reactions to occur, we have defined a set of transform rules covering the majority of ring–chain tautomerism cases. The rules automatically detect substructures in a given compound that can undergo a ring–chain tautomeric transformation. Each transformation is encoded in SMIRKS line notation. All work was implemented in the chemoinformatics toolkit CACTVS. We report on the application of our ring–chain tautomerism rules to a large database of commercially available screening samples in order to identify ring–chain tautomers.

Tautomerism of Warfarin: Combined Chemoinformatics, Quantum Chemical, and NMR Investigation

Laura Guasch, Megan L. Peach, and Marc C. Nicklaus
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7724503/
J Org Chem. 2015 Oct 16; 80(20): 9900–9909.

Abstract: Warfarin, an important anticoagulant drug, can exist in solution in 40 distinct tautomeric forms through both prototropic tautomerism and ring–chain tautomerism. We have investigated all warfarin tautomers with computational and NMR approaches. Relative energies calculated at the B3LYP/6-311G++(d,p) level of theory indicate that the 4-hydroxycoumarin cyclic hemiketal tautomer is the most stable tautomer in aqueous solution, followed by the 4-hydroxycoumarin open-chain tautomer. This is in agreement with our NMR experiments where the spectral assignments indicate that warfarin exists mainly as a mixture of cyclic hemiketal diastereomers, with an open-chain tautomer as a minor component. We present a diagram of the interconversion of warfarin created taking into account the calculated equilibrium constants (pKT) for all tautomeric reactions. These findings help with gaining further understanding of proton transfer and ring closure tautomerization processes. We also discuss the results in the context of chemoinformatics rules for handling tautomerism.

Experimental and Chemoinformatics Study of Tautomerism in a Database of Commercially Available Screening Samples

Laura Guasch, Waruna Yapamudiyansel, Megan L. Peach, James A. Kelley, Joseph J. Barchi, Jr., and Marc C. Nicklaus
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5129033/
J Chem Inf Model. 2016 Nov 28; 56(11): 2149–2161.

Abstract: We investigated how many cases of the same chemical sold as different products (at possibly different prices) occurred in a prototypical large aggregated database and simultaneously tested the tautomerism definitions in the chemoinformatics toolkit CACTVS. We applied the standard CACTVS tautomeric transforms plus a set of recently developed ring–chain transforms to the Aldrich Market Select (AMS) database of 6 million screening samples and building blocks. In 30 000 cases, two or more AMS products were found to be just different tautomeric forms of the same compound. We purchased and analyzed 166 such tautomer pairs and triplets by 1H and 13C NMR to determine whether the CACTVS transforms accurately predicted what is the same “stuff in the bottle”. Essentially all prototropic transforms with examples in the AMS were confirmed. Some of the ring–chain transforms were found to be too “aggressive”, i.e. to equate structures with one another that were different compounds.

Tautomerism in large databases

Markus Sitzmann, Wolf-Dietrich Ihlenfeldt, and Marc C. Nicklaus
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2886898/
Journal of Computer-Aided Molecular Design volume 24, pages 521–551 (2010)
https://link.springer.com/article/10.1007/s10822-010-9346-4#Ack1

Abstract: We have used the Chemical Structure DataBase (CSDB) of the NCI CADD Group, an aggregated collection of over 150 small-molecule databases totaling 103.5 million structure records, to conduct tautomerism analyses on one of the largest currently existing sets of real (i.e. not computer-generated) compounds. This analysis was carried out using calculable chemical structure identifiers developed by the NCI CADD Group, based on hash codes available in the chemoinformatics toolkit CACTVS and a newly developed scoring scheme to define a canonical tautomer for any encountered structure. CACTVS’s tautomerism definition, a set of 21 transform rules expressed in SMIRKS line notation, was used, which takes a comprehensive stance as to the possible types of tautomeric interconversion included. Tautomerism was found to be possible for more than 2/3 of the unique structures in the CSDB. A total of 680 million tautomers were calculated from, and including, the original structure records. Tautomerism overlap within the same individual database (i.e. at least one other entry was present that was really only a different tautomeric representation of the same compound) was found at an average rate of 0.3% of the original structure records, with values as high as nearly 2% for some of the databases in CSDB. Projected onto the set of unique structures (by FICuS identifier), this still occurred in about 1.5% of the cases. Tautomeric overlap across all constituent databases in CSDB was found for nearly 10% of the records in the collection.
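The overlap measurement described in this abstract can be sketched as a grouping problem. The code below is a simplified illustration, not the NCI CADD implementation: it assumes each record already carries two precomputed keys, one for the exact structure as drawn and one that is the same for all tautomers of a compound. The key values (`"T1"`, `"keto-form"`, and so on) are placeholders, not real CACTVS hash codes.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# Each record: (record id, exact-structure key, tautomer-invariant key).
Record = Tuple[str, str, str]


def tautomer_overlap(records: List[Record]) -> Set[str]:
    """Return ids of records that share a tautomer key with a differently drawn structure."""
    structures_by_key: Dict[str, Set[str]] = defaultdict(set)
    for _, structure_key, tautomer_key in records:
        structures_by_key[tautomer_key].add(structure_key)
    return {
        record_id
        for record_id, _, tautomer_key in records
        if len(structures_by_key[tautomer_key]) > 1
    }


records = [
    ("r1", "keto-form", "T1"),
    ("r2", "enol-form", "T1"),   # same compound as r1, different tautomer
    ("r3", "keto-form", "T1"),   # exact duplicate of r1
    ("r4", "benzene", "T2"),
]
overlapping = tautomer_overlap(records)
rate = len(overlapping) / len(records)
print(sorted(overlapping), f"{rate:.0%}")
```

Exact duplicates alone do not count as overlap in this sketch; a group is flagged only when it contains at least two distinct drawn structures.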

Tautomer Database: A Comprehensive Resource for Tautomerism Analyses

Devendra K. Dhaked, Laura Guasch, and Marc C. Nicklaus
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8456363/

Abstract: We report a database of tautomeric structures that contains 2819 tautomeric tuples extracted from 171 publications. Each tautomeric entry has been annotated with experimental conditions reported in the respective publication, plus bibliographic details, structural identifiers (e.g., NCI/CADD identifiers FICTS, FICuS, uuuuu, and Standard InChI), and chemical information (e.g., SMILES, molecular weight). The majority of tautomeric tuples found were pairs; the remaining 10% were triples, quadruples, or quintuples, amounting to a total number of structures of 5977. The types of tautomerism were mainly prototropic tautomerism (79%), followed by ring–chain (13%) and valence tautomerism (8%). The experimental conditions reported in the publications included about 50 pure solvents and 9 solvent mixtures with 26 unique spectroscopic or nonspectroscopic methods. 1H and 13C NMR were the most frequently used methods. A total of 77 different tautomeric transform rules (SMIRKS) are covered by at least one example tuple in the database. This database is freely available as a spreadsheet at https://cactus.nci.nih.gov/download/tautomer/.

Toward a Comprehensive Treatment of Tautomerism in Chemoinformatics Including in InChI V2

Devendra K. Dhaked, Wolf-Dietrich Ihlenfeldt, Hitesh Patel, Victorien Delannée, and Marc C. Nicklaus

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8459712/

Abstract: We have collected 86 different transforms of tautomeric interconversions. Out of those, 54 are for prototropic (non-ring–chain) tautomerism, 21 for ring–chain tautomerism, and 11 for valence tautomerism. The majority of these rules have been extracted from experimental literature. Twenty rules, covering the most well-known types of tautomerism such as keto–enol tautomerism, were taken from the default handling of tautomerism by the chemoinformatics toolkit CACTVS. The rules were analyzed against nine different databases totaling over 400 million (non-unique) structures as to their occurrence rates, mutual overlap in coverage, and recapitulation of the rules’ enumerated tautomer sets by InChI V.1.05, both in InChI’s Standard and a Nonstandard version with the increased tautomer-handling options 15T and KET turned on. These results and the background of this study are discussed in the context of the IUPAC InChI Project tasked with the redesign of handling of tautomerism for an InChI version 2. Applying the rules presented in this paper would approximately triple the number of compounds in typical small-molecule databases that would be affected by tautomeric interconversion by InChI V2. A web tool has been created to test these rules at https://cactus.nci.nih.gov/tautomerizer.

Crowdsourced Evaluation of InChI-based Tautomer Identification

precisionFDA Challenge

https://precision.fda.gov/challenges/29

Challenge Time Period

November 1, 2022 – March 1, 2023

This challenge focuses on the International Chemical Identifier (InChI), which was developed and is maintained under the auspices of the International Union of Pure and Applied Chemistry (IUPAC) and the InChI Trust. The InChI Trust, the IUPAC Working Group on Tautomers, and the U.S. Food and Drug Administration (FDA) call on the scientific community dealing with chemical repositories/data sets and analytics of compounds to test the recently modified InChI algorithm, which was designed for advanced recognition of tautomers. Participants will evaluate this algorithm against real chemical samples in this Crowdsourced Evaluation of InChI-based Tautomer Identification.

Note: You can download a PDF of the Fall 2022 ACS Presentation

Isotope Enumerator Web GUI (isoenum-webgui)

isoenum-webgui provides a Flask-based web user interface that uses the isoenum package to generate accurate InChI (International Chemical Identifier) strings for NMR metabolite features based on standard NMR experimental descriptions (currently 1D-1H and 1D-CHSQC), in order to improve the reusability of metabolomics data.

isoenum @ GitHub
isoenum @ PyPI
isoenum @ ReadTheDocs

Isotope Enumerator (isoenum)

Isotopic (iso) enumerator (enum) – enumerates isotopically resolved InChI (International Chemical Identifier) for metabolites.

The isoenum Python package provides a command-line interface that allows you to enumerate the possible isotopically-resolved InChI from one of the Chemical Table file (CTfile) formats (i.e. molfile, SDfile) used to describe chemical molecules and reactions, as well as from InChI itself.
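For readers unfamiliar with what an isotopically resolved InChI looks like, the sketch below reads the isotopic (`/i`) layer of an InChI string using only the Python standard library. It does not use or imitate the isoenum API; the helper names `isotopic_layer` and `mass_shifts` are hypothetical, and only simple `atom±shift` entries are parsed.

```python
import re
from typing import List, Optional, Tuple


def isotopic_layer(inchi: str) -> Optional[str]:
    """Return the content of the first /i layer of an InChI string, or None."""
    # Skip the leading "InChI=1S" segment and scan the remaining layers.
    for layer in inchi.split("/")[1:]:
        if layer.startswith("i"):
            return layer[1:]
    return None


def mass_shifts(inchi: str) -> List[Tuple[int, int]]:
    """Extract (atom number, mass shift) pairs such as 1+1 from the /i layer.

    Entries in other forms (for example deuterium counts) are skipped here.
    """
    layer = isotopic_layer(inchi)
    if layer is None:
        return []
    pairs = []
    for entry in layer.split(","):
        match = re.fullmatch(r"(\d+)([+-]\d+)", entry)
        if match:
            pairs.append((int(match.group(1)), int(match.group(2))))
    return pairs


plain_methane = "InChI=1S/CH4/h1H4"
carbon13_methane = "InChI=1S/CH4/h1H4/i1+1"   # atom 1 carries one extra mass unit

print(mass_shifts(plain_methane))
print(mass_shifts(carbon13_methane))
```

The two strings differ only in the trailing `/i` layer, which is what lets an isotopically labelled species be distinguished from its unlabelled parent.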

Documentation: https://isoenum.readthedocs.io/en/latest/
Code: https://github.com/MoseleyBioinformaticsLab/isoenum

Capturing mixture composition: an open machine-readable format for representing mixed substances

Alex M. Clark, Leah R. McEwen, Peter Gedeck & Barry A. Bunin
Journal of Cheminformatics volume 11, Article number: 33 (2019)

Abstract: We describe a file format that is designed to represent mixtures of compounds in a way that is fully machine readable. This Mixfile format is intended to fill the same role for substances that are composed of multiple components as the venerable Molfile does for specifying individual structures. This much-needed data structure is intended to replace current practices for communicating information about mixtures, which usually rely on human-readable text descriptions, drawing several species within a single molecular diagram, or mutually incompatible ad hoc solutions. We describe an open source software application for editing mixture files, which can also be used as web-ready tools for manipulating the file format. We also present a corpus of mixture examples, which we have extracted from collections of text-based descriptions. Furthermore, we present an early look at the proposed IUPAC Mixtures InChI specification, instances of which can be automatically generated using the Mixfile format as a precursor.