InChI Markush and Variability
Jonathan M. Goodman, Gerd Blanke, István Öri & Anthony Baston
PDF of Presentation given during NIH Virtual Workshop on InChI, March 22, 2021
Tautomers in InChI
NIH InChI Workshop, March 22, 2021
Presentation by Marc C. Nicklaus
Redesign of Handling of Tautomerism for InChI V2
IUPAC Project 2012-023-2-800
Tautomerizer – Predict tautomers based on 80+ rules
Introduction from Web Service (11/24/2022):
Experimental service that allows you to test a set of tautomeric transforms with your own molecules. The predefined set of transforms comprises both the current 24 standard rules used by the chemoinformatics toolkit CACTVS and 55+ additional rules compiled in the context of the IUPAC project of Redesign of the Handling of Tautomerism in InChI V2.
Please be aware that this is a chemoinformatics tool, i.e. the tautomer generation process is strictly pattern-based and does not take energetics into account in any way. Some of the generated tautomers may therefore be of high energy and not detectable experimentally.
Enumeration of Ring–Chain Tautomers Based on SMIRKS Rules
Laura Guasch, Markus Sitzmann, and Marc C. Nicklaus
J Chem Inf Model. 2014 Sep 22; 54(9): 2423–2432.
Abstract: A compound exhibits (prototropic) tautomerism if it can be represented by two or more structures that are related by a formal intramolecular movement of a hydrogen atom from one heavy atom position to another. When the movement of the proton is accompanied by the opening or closing of a ring it is called ring–chain tautomerism. This type of tautomerism is well observed in carbohydrates, but it also occurs in other molecules such as warfarin. In this work, we present an approach that allows for the generation of all ring–chain tautomers of a given chemical structure. Based on Baldwin’s Rules estimating the likelihood of ring closure reactions to occur, we have defined a set of transform rules covering the majority of ring–chain tautomerism cases. The rules automatically detect substructures in a given compound that can undergo a ring–chain tautomeric transformation. Each transformation is encoded in SMIRKS line notation. All work was implemented in the chemoinformatics toolkit CACTVS. We report on the application of our ring–chain tautomerism rules to a large database of commercially available screening samples in order to identify ring–chain tautomers.
Tautomerism of Warfarin: Combined Chemoinformatics, Quantum Chemical, and NMR Investigation
Laura Guasch, Megan L. Peach, and Marc C. Nicklaus
J Org Chem. 2015 Oct 16; 80(20): 9900–9909.
Abstract: Warfarin, an important anticoagulant drug, can exist in solution in 40 distinct tautomeric forms through both prototropic tautomerism and ring–chain tautomerism. We have investigated all warfarin tautomers with computational and NMR approaches. Relative energies calculated at the B3LYP/6-311G++(d,p) level of theory indicate that the 4-hydroxycoumarin cyclic hemiketal tautomer is the most stable tautomer in aqueous solution, followed by the 4-hydroxycoumarin open-chain tautomer. This is in agreement with our NMR experiments where the spectral assignments indicate that warfarin exists mainly as a mixture of cyclic hemiketal diastereomers, with an open-chain tautomer as a minor component. We present a diagram of the interconversion of warfarin created taking into account the calculated equilibrium constants (pKT) for all tautomeric reactions. These findings help with gaining further understanding of proton transfer and ring closure tautomerization processes. We also discuss the results in the context of chemoinformatics rules for handling tautomerism.
Experimental and Chemoinformatics Study of Tautomerism in a Database of Commercially Available Screening Samples
Laura Guasch, Waruna Yapamudiyansel, Megan L. Peach, James A. Kelley, Joseph J. Barchi, Jr., and Marc C. Nicklaus
J Chem Inf Model. 2016 Nov 28; 56(11): 2149–2161.
Abstract: We investigated how many cases of the same chemical sold as different products (at possibly different prices) occurred in a prototypical large aggregated database and simultaneously tested the tautomerism definitions in the chemoinformatics toolkit CACTVS. We applied the standard CACTVS tautomeric transforms plus a set of recently developed ring–chain transforms to the Aldrich Market Select (AMS) database of 6 million screening samples and building blocks. In 30 000 cases, two or more AMS products were found to be just different tautomeric forms of the same compound. We purchased and analyzed 166 such tautomer pairs and triplets by 1H and 13C NMR to determine whether the CACTVS transforms accurately predicted what is the same “stuff in the bottle”. Essentially all prototropic transforms with examples in the AMS were confirmed. Some of the ring–chain transforms were found to be too “aggressive”, i.e. to equate structures with one another that were different compounds.
Tautomerism in large databases
Markus Sitzmann, Wolf-Dietrich Ihlenfeldt, and Marc C. Nicklaus
Journal of Computer-Aided Molecular Design volume 24, pages 521–551 (2010)
Abstract: We have used the Chemical Structure DataBase (CSDB) of the NCI CADD Group, an aggregated collection of over 150 small-molecule databases totaling 103.5 million structure records, to conduct tautomerism analyses on one of the largest currently existing sets of real (i.e. not computer-generated) compounds. This analysis was carried out using calculable chemical structure identifiers developed by the NCI CADD Group, based on hash codes available in the chemoinformatics toolkit CACTVS and a newly developed scoring scheme to define a canonical tautomer for any encountered structure. CACTVS’s tautomerism definition, a set of 21 transform rules expressed in SMIRKS line notation, was used, which takes a comprehensive stance as to the possible types of tautomeric interconversion included. Tautomerism was found to be possible for more than 2/3 of the unique structures in the CSDB. A total of 680 million tautomers were calculated from, and including, the original structure records. Tautomerism overlap within the same individual database (i.e. at least one other entry was present that was really only a different tautomeric representation of the same compound) was found at an average rate of 0.3% of the original structure records, with values as high as nearly 2% for some of the databases in CSDB. Projected onto the set of unique structures (by FICuS identifier), this still occurred in about 1.5% of the cases. Tautomeric overlap across all constituent databases in CSDB was found for nearly 10% of the records in the collection.
Tautomer Database: A Comprehensive Resource for Tautomerism Analyses
Devendra K. Dhaked, Laura Guasch, and Marc C. Nicklaus
Abstract: We report a database of tautomeric structures that contains 2819 tautomeric tuples extracted from 171 publications. Each tautomeric entry has been annotated with experimental conditions reported in the respective publication, plus bibliographic details, structural identifiers (e.g., NCI/CADD identifiers FICTS, FICuS, uuuuu, and Standard InChI), and chemical information (e.g., SMILES, molecular weight). The majority of tautomeric tuples found were pairs; the remaining 10% were triples, quadruples, or quintuples, amounting to a total number of structures of 5977. The types of tautomerism were mainly prototropic tautomerism (79%), followed by ring–chain (13%) and valence tautomerism (8%). The experimental conditions reported in the publications included about 50 pure solvents and 9 solvent mixtures with 26 unique spectroscopic or nonspectroscopic methods. 1H and 13C NMR were the most frequently used methods. A total of 77 different tautomeric transform rules (SMIRKS) are covered by at least one example tuple in the database. This database is freely available as a spreadsheet at https://cactus.nci.nih.gov/download/tautomer/.
Toward a Comprehensive Treatment of Tautomerism in Chemoinformatics Including in InChI V2
Devendra K. Dhaked, Wolf-Dietrich Ihlenfeldt, Hitesh Patel, Victorien Delannée, and Marc C. Nicklaus
Abstract: We have collected 86 different transforms of tautomeric interconversions. Out of those, 54 are for prototropic (non-ring–chain) tautomerism, 21 for ring–chain tautomerism, and 11 for valence tautomerism. The majority of these rules have been extracted from experimental literature. Twenty rules, covering the most well-known types of tautomerism such as keto–enol tautomerism, were taken from the default handling of tautomerism by the chemoinformatics toolkit CACTVS. The rules were analyzed against nine different databases totaling over 400 million (non-unique) structures as to their occurrence rates, mutual overlap in coverage, and recapitulation of the rules’ enumerated tautomer sets by InChI V.1.05, both in InChI’s Standard and a Nonstandard version with the increased tautomer-handling options 15T and KET turned on. These results and the background of this study are discussed in the context of the IUPAC InChI Project tasked with the redesign of handling of tautomerism for an InChI version 2. Applying the rules presented in this paper would approximately triple the number of compounds in typical small-molecule databases that would be affected by tautomeric interconversion by InChI V2. A web tool has been created to test these rules at https://cactus.nci.nih.gov/tautomerizer.
Crowdsourced Evaluation of InChI-based Tautomer Identification
Challenge Time Period
November 1, 2022 – March 1, 2023
This challenge focuses on the International Chemical Identifier (InChI), which was developed and is maintained under the auspices of the International Union of Pure and Applied Chemistry (IUPAC) and the InChI Trust. The InChI Trust, the IUPAC Working Group on Tautomers, and the U.S. Food and Drug Administration (FDA) call on the scientific community dealing with chemical repositories/data sets and analytics of compounds to test the recently modified InChI algorithm, which was designed for advanced recognition of tautomers. Participants will evaluate this algorithm against real chemical samples in this Crowdsourced Evaluation of InChI-based Tautomer Identification.
Note: You can download a PDF of the Fall 2022 ACS Presentation
PubChem is a chemical information repository, consisting of three primary databases: Substance, Compound, and BioAssay. When individual data contributors submit chemical substance descriptions to Substance, the unique chemical structures are extracted and stored into Compound through an automated process called structure standardization. The present study describes the PubChem standardization approaches and analyzes them for their success rates, reasons that cause structures to be rejected, and modifications applied to structures during the standardization process. Furthermore, the PubChem standardization is compared to the structure normalization of the IUPAC International Chemical Identifier (InChI) software, as manifested by conversion of the InChI back into a chemical structure.
Over the past several centuries, chemistry has permeated virtually every facet of human lifestyle, enriching fields as diverse as medicine, agriculture, manufacturing, warfare, and electronics, among numerous others. Unfortunately, application-specific, incompatible chemical information formats and representation strategies have emerged as a result of such diverse adoption of chemistry. Although a number of efforts have been dedicated to unifying the computational representation of chemical information, disparities between the various chemical databases still persist and stand in the way of cross-domain, interdisciplinary investigations. Through a common syntax and formal semantics, Semantic Web technology offers the ability to accurately represent, integrate, reason about and query across diverse chemical information.
Here we specify and implement the Chemical Entity Semantic Specification (CHESS) for the representation of polyatomic chemical entities, their substructures, bonds, atoms, and reactions using Semantic Web technologies. CHESS provides means to capture aspects of their corresponding chemical descriptors, connectivity, functional composition, and geometric structure while specifying mechanisms for data provenance. We demonstrate that using our readily extensible specification, it is possible to efficiently integrate multiple disparate chemical data sources, while retaining appropriate correspondence of chemical descriptors, with very little additional effort. We demonstrate the impact of some of our representational decisions on the performance of chemically-aware knowledgebase searching and rudimentary reaction candidate selection. Finally, we provide access to the tools necessary to carry out chemical entity encoding in CHESS, along with a sample knowledgebase.
By harnessing the power of Semantic Web technologies with CHESS, it is possible to provide a means of facile cross-domain chemical knowledge integration with full preservation of data correspondence and provenance. Our representation builds on existing cheminformatics technologies and, by the virtue of RDF specification, remains flexible and amenable to application- and domain-specific annotations without compromising chemical data integration. We conclude that the adoption of a consistent and semantically-enabled chemical specification is imperative for surviving the coming chemical data deluge and supporting systems science research.
The Blue Obelisk movement was established in 2005 as a response to the lack of Open Data, Open Standards and Open Source (ODOSOS) in chemistry. It aims to make it easier to carry out chemistry research by promoting interoperability between chemistry software, encouraging cooperation between Open Source developers, and developing community resources and Open Standards.
This contribution looks back on the work carried out by the Blue Obelisk in the past 5 years and surveys progress and remaining challenges in the areas of Open Data, Open Standards, and Open Source in chemistry.
We show that the Blue Obelisk has been very successful in bringing together researchers and developers with common interests in ODOSOS, leading to development of many useful resources freely available to the chemistry community.
There are two line notations of chemical structures that have established themselves in the field: the SMILES string and the InChI string. The InChI aims to provide a unique, or canonical, identifier for chemical structures, while SMILES strings are widely used for storage and interchange of chemical structures, but no standard exists to generate a canonical SMILES string.
I describe how to use the InChI canonicalisation to derive a canonical SMILES string in a straightforward way, either incorporating the InChI normalisations (Inchified SMILES) or not (Universal SMILES). This is the first description of a method to generate canonical SMILES that takes stereochemistry into account. When tested on the 1.1 m compounds in the ChEMBL database, and a 1 m compound subset of the PubChem Substance database, no canonicalisation failures were found with Inchified SMILES. Using Universal SMILES, 99.79% of the ChEMBL database was canonicalised successfully and 99.77% of the PubChem subset.
The InChI canonicalisation algorithm can successfully be used as the basis for a common standard for canonical SMILES. While challenges remain – such as the development of a standard aromatic model for SMILES – the ability to create the same SMILES using different toolkits will mean that for the first time it will be possible to easily compare the chemical models used by different toolkits.
An algorithm is introduced that enables a fast generation of all possible prototropic tautomers resulting from the mobile H atoms and associated heteroatoms as defined in the InChI code. The InChI-derived set of possible tautomers comprises (1,3)-shifts for open-chain molecules and (1,n)-shifts (with n being an odd number >3) for ring systems. In addition, our algorithm also includes, as an extension to the InChI scope, those larger (1,n)-shifts that can be constructed from joining separate but conjugated InChI sequences of tautomer-active heteroatoms. The developed algorithm is described in detail, with all major steps illustrated through explicit examples. Application to ∼72 500 organic compounds taken from EINECS (European Inventory of Existing Commercial Chemical Substances) shows that around 11% of the substances occur in different heteroatom−prototropic tautomeric forms. Additional QSAR (quantitative structure−activity relationship) predictions of their soil sorption coefficient and water solubility reveal variations across tautomers up to more than two and four orders of magnitude, respectively. For a small subset of nine compounds, analysis of quantum chemically predicted tautomer energies supports the view that among all tautomers of a given compound, those restricted to H atom exchanges between heteroatoms usually include the thermodynamically most stable structures.
A modified InChI (International Chemical Identifier) string scheme, yaInChI (yet another InChI), is suggested as a method for including the structural information of a given molecule, making it straightforward and more easily readable. The yaInChI theme is applicable for checking the structural identity with higher sensitivity and generating three-dimensional (3-D) structures from the one-dimensional (1-D) string with less ambiguity than the general InChI method. The modifications to yaInChI provide non-rotatable single bonds, stereochemistry of organometallic compounds, allene and cumulene, and parity of atoms with a lone pair. Additionally, yaInChI better preserves the original information of the given input file (SDF) using the protonation information, hydrogen count +1, and original bond type, which are not considered or restrictively considered in InChI and SMILES. When yaInChI is used to perform a duplication check on a 3D chemical structure database, Ligand.Info, it shows more discriminating power than InChI. The structural information provided by yaInChI is in a compact format, making it a promising solution for handling large chemical structure databases.
UniChem is a low-maintenance, fast and freely available compound identifier mapping service, recently made available on the Internet. Until now, the criterion of molecular equivalence within UniChem has been on the basis of complete identity between Standard InChIs. However, a limitation of this approach is that stereoisomers, isotopes and salts of otherwise identical molecules are not considered as related. Here, we describe how we have exploited the layered structural representation of the Standard InChI to create new functionality within UniChem that integrates these related molecular forms. The service, called ‘Connectivity Search’, allows molecules to be first matched on the basis of complete identity between the connectivity layer of their corresponding Standard InChIs, and the remaining layers then compared to highlight stereochemical and isotopic differences. Parsing of Standard InChI sub-layers permits mixtures and salts to also be included in this integration process. Implementation of these enhancements required simple modifications to the schema, loader and web application, none of which have changed the original UniChem functionality or services. The scope of queries may be varied using a variety of easily configurable options, and the output is annotated to assist the user to filter, sort and understand the difference between query and retrieved structures. A RESTful web service output may be easily processed programmatically to allow developers to present the data in whatever form they believe their users will require, or to define their own level of molecular equivalence for their resource, albeit within the constraint of identical connectivity.
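The matching idea in the UniChem abstract above (match on the connectivity layer first, then compare the remaining layers) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not UniChem's implementation: the names `split_layers` and `compare_inchis` are hypothetical, the splitter assumes a formula layer is present and ignores the repeated prefixes that occur in isotopic or fixed-H sections, and treating "formula plus /c sublayer" as the connectivity criterion is an assumption made for this example. The two strings are the Standard InChIs commonly given for L- and D-alanine.

```python
# Illustrative Standard InChIs for L- and D-alanine (they differ only in /m).
L_ALANINE = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
D_ALANINE = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m1/s1"
ETHANOL = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"


def split_layers(inchi):
    """Split an InChI string into a dict of layers keyed by prefix.

    Simplification: assumes the second field is the formula and that
    each later layer prefix (c, h, t, m, s, ...) occurs only once.
    """
    if not inchi.startswith("InChI="):
        raise ValueError("not an InChI string")
    parts = inchi[len("InChI="):].split("/")
    layers = {"version": parts[0]}
    if len(parts) > 1:
        layers["formula"] = parts[1]
    for part in parts[2:]:
        # Each remaining layer starts with a one-letter prefix.
        layers[part[0]] = part[1:]
    return layers


def compare_inchis(a, b):
    """Return (same_connectivity, names of layers that differ)."""
    la, lb = split_layers(a), split_layers(b)
    same_connectivity = (
        la.get("formula") == lb.get("formula") and la.get("c") == lb.get("c")
    )
    differing = sorted(k for k in set(la) | set(lb) if la.get(k) != lb.get(k))
    return same_connectivity, differing


print(compare_inchis(L_ALANINE, D_ALANINE))  # (True, ['m'])
```

Here the two enantiomers share the same connectivity and differ only in the `m` stereo sublayer, which is the kind of difference the abstract says the service highlights.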
This paper documents the design, layout and algorithms of the IUPAC International Chemical Identifier, InChI.
IUPAC Standards Online is a database built from the standards and recommendations of IUPAC (the International Union of Pure and Applied Chemistry), which are extracted from the journal Pure and Applied Chemistry (PAC).
The International Union of Pure and Applied Chemistry (IUPAC) is the organization responsible for setting the standards in chemistry that are internationally binding for scientists in industry and academia, patent lawyers, toxicologists, environmental scientists, legislation, etc. “Standards” are definitions of terms, standard values, procedures, rules for naming compounds and materials, names and properties of elements in the periodic table, and many more.
The database will be the only product that provides for the quick and easy search and retrieval of IUPAC’s standards and recommendations which until now have remained unsorted within the huge Pure and Applied Chemistry archive.
Nomenclature and Terminology
Theoretical & Computational Chemistry
Learn about technology that solves the issue of interpreting 3D stereochemical information implied in 2D structure representations.
The IUPAC International Chemical Identifier (InChI) is a non-proprietary, machine-readable chemical structure representation format enabling electronic searching, and interlinking and combining, of chemical information from different sources. It was developed from 2001 onwards at the U.S. National Institute of Standards and Technology under the auspices of IUPAC’s Chemical Identifier project. Since 2009, the InChI Trust, a consortium of (mostly) publishers and software developers, has taken over responsibility for funding and oversight of InChI maintenance and development. Funding and responsibility for scientific aspects of InChI development remain with the IUPAC Division VIII (Chemical Nomenclature and Structure Representation) and InChI Subcommittee.
Project for Cheminformatics Fall 2012. Part 2/2. Presentation on encodings, SMILES and InChI.
Optimal descriptors calculated with the International Chemical Identifier (InChI) have been used to construct a one-variable model of the solubility of fullerene C60 in organic solvents. Attempts to calculate the model for three splits into training and test sets gave stable results.
This presentation is part of Google Tech Talks and was added to the GoogleTalksArchive on August 22, 2006. The original presentation took place on November 2, 2006.
ABSTRACT (Imported From YouTube Source)
The central token of information in Chemistry is a chemical substance, an entity that can often be represented as a well-defined chemical structure. With InChI we have a means of representing this entity as a unique string of characters, which is otherwise represented by various 2-D and 3-D chemical drawings, ‘connection tables’ and synonyms. InChI therefore represents a discrete physical entity, with which is associated an array of chemical properties and data. NIST has long been involved in disseminating chemical reference data associated with such discrete substances. An InChI is therefore the key index to this data. Many other types of data and information are also naturally tied to it, including biological information, commercial availability, toxicity, drug effectiveness and so forth. Because of the diversity of properties and interactions of a chemical substance, effective location of chemical information generally requires further qualifiers, which may be represented coarsely as a key word, but more precisely using a controlled vocabulary. There are no simple separations between information sought by different disciplines and for different objectives. However, reference data may be organized according to the disciplines most directly involved in making the measurements:
- isolated substance – mass, infrared, NMR, spectra; physical properties
- substance in the context of others – solubility, affinity, ..
- properties of a mixture containing the substance
The desired data can be a number, vector or image, usually associated with dimensions and links to source information. In some cases, this information is typically converted to a curve or diagram for use by an expert and may be further processed by specialized software. In other cases, a single numerical value is the target. Also, some complexities of structure that must be dealt with in practical search are represented in InChI, but must be decoded for use in searching.
This document contains a brief intro to InChI suitable for undergraduate students and two exercises, with answer keys. The first assignment asks about the information encoded in a sample InChI. The last question in this assignment asks students to use the InChIKey as a search term; this will be much easier if the information is available digitally, so that students can copy and paste the InChIKey rather than typing it by hand into Wikipedia.
The second exercise asks students to draw several simple organic compounds with an appropriate computer application and generate the InChI and InChIKey. Most commonly used structure drawing programs will readily generate the InChI and InChIKey for a structure. In addition, there are a number of online services that have a structure drawing application that will generate an InChI or InChIKey. Grading this exercise will be much easier if done digitally.
Both exercises were written with the intention that students would complete them online using a Learning Management System (LMS) such as Blackboard, Canvas, Moodle, etc., where the students would copy and paste an appropriate InChI or InChIKey into a text box, which the LMS would compare with the correct answer submitted by the instructor.
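The comparison step an LMS would perform can be sketched as follows. This is a hypothetical illustration, not a feature of any particular LMS: the function names `normalize_answer`, `is_inchikey`, and `grade` are made up for this example, and the pattern only checks the general shape of an InChIKey (14 uppercase letters, a hyphen, 10 uppercase letters, a hyphen, 1 uppercase letter) rather than validating it chemically. The sample key is the one commonly given for ethanol.

```python
import re

# General shape of an InChIKey; this does not verify that the key is real.
INCHIKEY_PATTERN = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")

ETHANOL_KEY = "LFQSCWFLJHTTHZ-UHFFFAOYSA-N"


def normalize_answer(text):
    """Remove whitespace picked up by copy and paste, and an optional prefix."""
    cleaned = "".join(text.split())
    if cleaned.startswith("InChIKey="):
        cleaned = cleaned[len("InChIKey="):]
    return cleaned


def is_inchikey(text):
    """Return True if the text has the shape of an InChIKey."""
    return INCHIKEY_PATTERN.fullmatch(normalize_answer(text)) is not None


def grade(submitted, expected):
    """Exact comparison of the student's answer with the instructor's key."""
    return normalize_answer(submitted) == normalize_answer(expected)
```

A pasted answer such as `"  LFQSCWFLJHTTHZ-UHFFFAOYSA-N\n"` would then match the stored key despite the stray whitespace, which is one reason copy and paste is easier to grade than hand-typed answers.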