PubChem is a chemical information repository, consisting of three primary databases: Substance, Compound, and BioAssay. When individual data contributors submit chemical substance descriptions to Substance, the unique chemical structures are extracted and stored into Compound through an automated process called structure standardization. The present study describes the PubChem standardization approaches and analyzes them for their success rates, reasons that cause structures to be rejected, and modifications applied to structures during the standardization process. Furthermore, the PubChem standardization is compared to the structure normalization of the IUPAC International Chemical Identifier (InChI) software, as manifested by conversion of the InChI back into a chemical structure.
Over the past several centuries, chemistry has permeated virtually every facet of human lifestyle, enriching fields as diverse as medicine, agriculture, manufacturing, warfare, and electronics, among numerous others. Unfortunately, application-specific, incompatible chemical information formats and representation strategies have emerged as a result of such diverse adoption of chemistry. Although a number of efforts have been dedicated to unifying the computational representation of chemical information, disparities between the various chemical databases still persist and stand in the way of cross-domain, interdisciplinary investigations. Through a common syntax and formal semantics, Semantic Web technology offers the ability to accurately represent, integrate, reason about and query across diverse chemical information.
Here we specify and implement the Chemical Entity Semantic Specification (CHESS) for the representation of polyatomic chemical entities, their substructures, bonds, atoms, and reactions using Semantic Web technologies. CHESS provides means to capture aspects of their corresponding chemical descriptors, connectivity, functional composition, and geometric structure while specifying mechanisms for data provenance. We demonstrate that using our readily extensible specification, it is possible to efficiently integrate multiple disparate chemical data sources, while retaining appropriate correspondence of chemical descriptors, with very little additional effort. We demonstrate the impact of some of our representational decisions on the performance of chemically-aware knowledgebase searching and rudimentary reaction candidate selection. Finally, we provide access to the tools necessary to carry out chemical entity encoding in CHESS, along with a sample knowledgebase.
By harnessing the power of Semantic Web technologies with CHESS, it is possible to provide a means of facile cross-domain chemical knowledge integration with full preservation of data correspondence and provenance. Our representation builds on existing cheminformatics technologies and, by the virtue of RDF specification, remains flexible and amenable to application- and domain-specific annotations without compromising chemical data integration. We conclude that the adoption of a consistent and semantically-enabled chemical specification is imperative for surviving the coming chemical data deluge and supporting systems science research.
The Reaction InChI (RInChI) extends the idea of the InChI, which provides a unique descriptor of molecular structures, towards reactions. Prototype versions of the RInChI have been available since 2011. The first official release (RInChI V1.00), funded by the InChI Trust, is now available for download (http://www.inchi-trust.org/downloads/). This release defines the format and generates hashed representations (RInChIKeys) suitable for database and web operations. The RInChI provides a concise description of the key data in chemical processes, and facilitates the manipulation and analysis of reaction data.
An important step in the reconstruction of a metabolic network is annotation of metabolites. Metabolites are generally annotated with various database or structure based identifiers. Metabolite annotations in metabolic reconstructions may be incorrect or incomplete and thus need to be updated prior to their use.
Genome-scale metabolic reconstructions generally include hundreds of metabolites. Manually updating annotations is therefore highly laborious. This prompted us to look for open-source software applications that could facilitate automatic updating of annotations by mapping between available metabolite identifiers. We identified three applications developed for the metabolomics and chemical informatics communities as potential solutions. The applications were MetMask, the Chemical Translation System, and UniChem. The first implements a “metabolite masking” strategy for mapping between identifiers whereas the latter two implement different versions of an InChI based strategy. Here we evaluated the suitability of these applications for the task of mapping between metabolite identifiers in genome-scale metabolic reconstructions. We applied the best suited application to updating identifiers in Recon 2, the latest reconstruction of human metabolism.
All three applications enabled partially automatic updating of metabolite identifiers, but significant manual effort was still required to fully update identifiers. We were able to reduce this manual effort by searching for new identifiers using multiple types of information about metabolites. When multiple types of information were combined, the Chemical Translation System enabled us to update over 3,500 metabolite identifiers in Recon 2. All but approximately 200 identifiers were updated automatically.
We found that an InChI based application such as the Chemical Translation System was better suited to the task of mapping between metabolite identifiers in genome-scale metabolic reconstructions. We identified several features, however, that could be added to such an application in order to tailor it to this task.
Correctness of structures and associated metadata within public and commercial chemical databases greatly impacts drug discovery research activities such as quantitative structure–property relationships modelling and compound novelty checking. MOL files, SMILES notations, IUPAC names, and InChI strings are ubiquitous file formats and systematic identifiers for chemical structures. While interchangeable for many cheminformatics purposes, there have been no studies on the inconsistency of these structure identifiers due to various approaches for data integration, including the use of different software and different rules for structure standardisation. We have investigated the consistency of systematic identifiers of small molecules within and between some of the commonly used chemical resources, with and without structure standardisation.
The consistency between systematic chemical identifiers and their corresponding MOL representation varies greatly between data sources (37.2%-98.5%). We observed the lowest overall consistency for MOL-IUPAC names. Disregarding stereochemistry increases the consistency (84.8% to 99.9%). A wide variation in consistency also exists between MOL representations of compounds linked via cross-references (25.8% to 93.7%). Removing stereochemistry improved the consistency (47.6% to 95.6%).
We have shown that considerable inconsistency exists in structural representation and systematic chemical identifiers within and between databases. This can have a great influence especially when merging data and if systematic identifiers are used as a key index for structure integration or cross-querying several databases. Regenerating systematic identifiers starting from their MOL representation and applying well-defined and documented chemistry standardisation rules to all compounds prior to creating them can dramatically increase internal consistency.
There are two line notations of chemical structures that have established themselves in the field: the SMILES string and the InChI string. The InChI aims to provide a unique, or canonical, identifier for chemical structures, while SMILES strings are widely used for storage and interchange of chemical structures, but no standard exists to generate a canonical SMILES string.
I describe how to use the InChI canonicalisation to derive a canonical SMILES string in a straightforward way, either incorporating the InChI normalisations (Inchified SMILES) or not (Universal SMILES). This is the first description of a method to generate canonical SMILES that takes stereochemistry into account. When tested on the 1.1 m compounds in the ChEMBL database, and a 1 m compound subset of the PubChem Substance database, no canonicalisation failures were found with Inchified SMILES. Using Universal SMILES, 99.79% of the ChEMBL database was canonicalised successfully and 99.77% of the PubChem subset.
The InChI canonicalisation algorithm can successfully be used as the basis for a common standard for canonical SMILES. While challenges remain – such as the development of a standard aromatic model for SMILES – the ability to create the same SMILES using different toolkits will mean that for the first time it will be possible to easily compare the chemical models used by different toolkits.
Chemical compounds like small signal molecules or other biologically active chemical substances are an important entity class in life science publications and patents. Several representations and nomenclatures for chemicals like SMILES, InChI, IUPAC or trivial names exist. Only SMILES and InChI names allow a direct structure search, but in biomedical texts trivial names and IUPAC-like names are used more frequently. While trivial names can be found with a dictionary-based approach and in such a way mapped to their corresponding structures, it is not possible to enumerate all IUPAC names. In this work, we present a new machine learning approach based on conditional random fields (CRF) to find mentions of IUPAC and IUPAC-like names in scientific text, as well as its evaluation and the conversion rate with available name-to-structure tools.
We present an IUPAC name recognizer with an F1 measure of 85.6% on a MEDLINE corpus. The evaluation of different CRF orders and offset conjunction orders demonstrates the importance of these parameters. An evaluation of hand-selected patent sections containing large enumerations and terms with mixed nomenclature shows a good performance on these cases (F1 measure 81.5%). Remaining recognition problems are to detect correct borders of the typically long terms, especially when occurring in parentheses or enumerations. We demonstrate the scalability of our implementation by providing results from a full MEDLINE run.
We plan to publish the corpora, annotation guideline as well as the conditional random field model as a UIMA component.
An algorithm is introduced that enables a fast generation of all possible prototropic tautomers resulting from the mobile H atoms and associated heteroatoms as defined in the InChI code. The InChI-derived set of possible tautomers comprises (1,3)-shifts for open-chain molecules and (1,n)-shifts (with n being an odd number >3) for ring systems. In addition, our algorithm also includes, as an extension to the InChI scope, those larger (1,n)-shifts that can be constructed from joining separate but conjugated InChI sequences of tautomer-active heteroatoms. The developed algorithm is described in detail, with all major steps illustrated through explicit examples. Application to ∼72 500 organic compounds taken from EINECS (European Inventory of Existing Commercial Chemical Substances) shows that around 11% of the substances occur in different heteroatom−prototropic tautomeric forms. Additional QSAR (quantitative structure−activity relationship) predictions of their soil sorption coefficient and water solubility reveal variations across tautomers of up to more than two and four orders of magnitude, respectively. For a small subset of nine compounds, analysis of quantum chemically predicted tautomer energies supports the view that among all tautomers of a given compound, those restricted to H atom exchanges between heteroatoms usually include the thermodynamically most stable structures.
A modified InChI (International Chemical Identifier) string scheme, yaInChI (yet another InChI), is suggested as a method for including the structural information of a given molecule, making it straightforward and more easily readable. The yaInChI theme is applicable for checking the structural identity with higher sensitivity and generating three-dimensional (3-D) structures from the one-dimensional (1-D) string with less ambiguity than the general InChI method. The modifications to yaInChI provide non-rotatable single bonds, stereochemistry of organometallic compounds, allene and cumulene, and parity of atoms with a lone pair. Additionally, yaInChI better preserves the original information of the given input file (SDF) using the protonation information, hydrogen count +1, and original bond type, which are not considered or restrictively considered in InChI and SMILES. When yaInChI is used to perform a duplication check on a 3D chemical structure database, Ligand.Info, it shows more discriminating power than InChI. The structural information provided by yaInChI is in a compact format, making it a promising solution for handling large chemical structure databases.
InChIKey is a 27-character compacted (hashed) version of InChI which is intended for Internet and database searching/indexing and is based on an SHA-256 hash of the InChI character string. The first block of InChIKey encodes molecular skeleton while the second block represents various kinds of isomerism (stereo, tautomeric, etc.). InChIKey is designed to be a nearly unique substitute for the parent InChI. However, a single InChIKey may occasionally map to two or more InChI strings (collision). The appearance of collision itself does not compromise the signature as collision-free hashing is impossible; the only viable approach is to set and keep a reasonable level of collision resistance which is sufficient for typical applications. We tested, in computational experiments, how well the real-life InChIKey collision resistance corresponds to the theoretical estimates expected by design. For this purpose, we analyzed the statistical characteristics of InChIKey for datasets of variable size in comparison to the theoretical statistical frequencies. For the relatively short second block, an exhaustive direct testing was performed. We computed and compared to theory the numbers of collisions for the stereoisomers of Spongistatin I (using the whole set of 67,108,864 isomers and its subsets). For the longer first block, we generated, using custom-made software, InChIKeys for more than 3 × 10^10 chemical structures. The statistical behavior of this block was tested by comparison of experimental and theoretical frequencies for the various four-letter sequences which may appear in the first block body. From the results of our computational experiments we conclude that the observed characteristics of InChIKey collision resistance are in good agreement with theoretical expectations.
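The block structure described above can be made concrete with a small sketch. It is an illustration only, not the authors' software: the function names are hypothetical, the 14 + 10 + 1 letter layout is taken from the published InChIKey format rather than from this abstract, and the collision estimate is the generic birthday-bound approximation for uniformly random hashes. The number of hash bits per block is not stated in the abstract, so the caller has to supply it.

```python
import re

# Assumed layout (from the published InChIKey format, not from the abstract):
# 14 letters, hyphen, 8 letters + flag letter + version letter, hyphen,
# 1 protonation letter, for 27 characters in total.
INCHIKEY_PATTERN = re.compile(r"^([A-Z]{14})-([A-Z]{8})([A-Z])([A-Z])-([A-Z])$")


def split_inchikey(key):
    """Return the named parts of an InChIKey; raise ValueError if malformed."""
    match = INCHIKEY_PATTERN.match(key)
    if match is None:
        raise ValueError(f"not a well-formed InChIKey: {key!r}")
    skeleton, isomer_hash, flag, version, protonation = match.groups()
    return {
        "skeleton": skeleton,        # first block: molecular skeleton
        "isomer_hash": isomer_hash,  # second block: stereo, isotopes, etc.
        "flag": flag,
        "version": version,
        "protonation": protonation,
    }


def expected_collisions(n_items, hash_bits):
    """Birthday-bound estimate of colliding pairs among n_items random hashes.

    hash_bits is a caller-supplied assumption about the width of the block.
    """
    pairs = n_items * (n_items - 1) / 2
    return pairs / 2 ** hash_bits


if __name__ == "__main__":
    # InChIKey of ethanol, used here only as a well-known example.
    print(split_inchikey("LFQSCWFLJHTTHZ-UHFFFAOYSA-N"))
```

Comparing only the `skeleton` field of two keys is a cheap way to ask whether two records may share a molecular skeleton while differing in isomerism.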
While chemical databases can be queried using the InChI string and InChIKey (IK) the latter was designed for open-web searching. It is becoming increasingly effective for this since more sources enhance crawling of their websites by the Googlebot and consequent IK indexing. Searchers who use Google as an adjunct to database access may be less familiar with the advantages of using the IK as explored in this review. As an example, the IK for atorvastatin retrieves ~200 low-redundancy links from a Google search in 0.3 of a second. These include most major databases and a very low false-positive rate. Results encompass less familiar but potentially useful sources and can be extended to isomer capture by using just the skeleton layer of the IK. Google Advanced Search can be used to filter large result sets. Image searching with the IK is also effective and complementary to open-web queries. Results can be particularly useful for less-common structures as exemplified by a major metabolite of atorvastatin giving only three hits. Testing also demonstrated document-to-document and document-to-database joins via structure matching. The necessary generation of an IK from chemical names can be accomplished using open tools and resources for patents, papers, abstracts or other text sources. Active global sharing of local IK-linked information can be accomplished via surfacing in open laboratory notebooks, blogs, Twitter, figshare and other routes. While information-rich chemistry (e.g. approved drugs) can exhibit swamping and redundancy effects, the much smaller IK result sets for link-poor structures become a transformative first-pass option. The IK indexing has therefore turned Google into a de-facto open global chemical information hub by merging links to most significant sources, including over 50 million PubChem and ChemSpider records. The simplicity, specificity and speed of matching make it a useful option for biologists or others less familiar with chemical searching. 
However, compared to rigorously maintained major databases, users need to be circumspect about the consistency of Google results and provenance of retrieved links. In addition, community engagement may be necessary to ameliorate possible future degradation of utility.
UniChem is a low-maintenance, fast and freely available compound identifier mapping service, recently made available on the Internet. Until now, the criterion of molecular equivalence within UniChem has been on the basis of complete identity between Standard InChIs. However, a limitation of this approach is that stereoisomers, isotopes and salts of otherwise identical molecules are not considered as related. Here, we describe how we have exploited the layered structural representation of the Standard InChI to create new functionality within UniChem that integrates these related molecular forms. The service, called ‘Connectivity Search’, allows molecules to be first matched on the basis of complete identity between the connectivity layer of their corresponding Standard InChIs, and the remaining layers then compared to highlight stereochemical and isotopic differences. Parsing of Standard InChI sub-layers permits mixtures and salts to also be included in this integration process. Implementation of these enhancements required simple modifications to the schema, loader and web application, none of which have changed the original UniChem functionality or services. The scope of queries may be varied using a variety of easily configurable options, and the output is annotated to assist the user to filter, sort and understand the difference between query and retrieved structures. A RESTful web service output may be easily processed programmatically to allow developers to present the data in whatever form they believe their users will require, or to define their own level of molecular equivalence for their resource, albeit within the constraint of identical connectivity.
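The two-step matching idea described above (match on connectivity first, then report differences in the remaining layers) can be sketched in a few lines. This is not UniChem's implementation: the function names are hypothetical, and the sketch assumes that "connectivity" means the formula plus the `/c` sub-layer. It keeps only the first occurrence of each sub-layer prefix and does not handle mixtures or salts.

```python
def split_layers(inchi):
    """Split an InChI into (formula, {prefix: sub-layer body})."""
    if not inchi.startswith("InChI="):
        raise ValueError("expected a string starting with 'InChI='")
    parts = inchi[len("InChI="):].split("/")
    # parts[0] is the version tag (e.g. '1S'); parts[1] is the formula.
    formula = parts[1] if len(parts) > 1 else ""
    layers = {}
    for part in parts[2:]:
        if part:
            # Keep the first occurrence of a prefix (a simplification).
            layers.setdefault(part[0], part[1:])
    return formula, layers


def connectivity_key(inchi):
    """Formula plus /c sub-layer: the assumed notion of connectivity."""
    formula, layers = split_layers(inchi)
    return formula, layers.get("c", "")


def compare_to_query(query, candidate):
    """Return None if connectivity differs, else the differing sub-layers."""
    if connectivity_key(query) != connectivity_key(candidate):
        return None
    _, query_layers = split_layers(query)
    _, candidate_layers = split_layers(candidate)
    differences = {}
    for prefix in sorted(set(query_layers) | set(candidate_layers)):
        if prefix == "c":
            continue
        pair = (query_layers.get(prefix), candidate_layers.get(prefix))
        if pair[0] != pair[1]:
            differences[prefix] = pair
    return differences


if __name__ == "__main__":
    l_alanine = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
    d_alanine = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m1/s1"
    print(compare_to_query(l_alanine, d_alanine))
```

For the two alanine enantiomers the connectivity keys agree and only the `/m` sub-layer differs, which is the kind of annotation a user could filter or sort on.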
Background: There are many databases of small molecules focused on different aspects of research and its applications. Some tasks may require integration of information from various databases. However, determining which entries from different databases represent the same compound is not straightforward. Integration can be based, for example, on automatically generated cross-reference links between entries. Another approach is to use the manually curated links stored directly in databases. This study employs well-established InChI identifiers to measure the consistency and completeness of the manually curated links by comparing them with the automatically generated ones.
Results: We used two different tools to generate InChI identifiers and observed some ambiguities in their outputs. In part, these ambiguities were caused by indistinctness in interpretation of the structural data used. InChI identifiers were used successfully to find duplicate entries in databases. We found that the InChI inconsistencies in the manually curated links are very high (28.85% in the worst case). Even using a weaker definition of consistency, the measured values were very high in general. The completeness of the manually curated links was also very poor (only 93.8% in the best case) compared with that of the automatically generated links.
Conclusions: We observed several problems with the InChI tools and the files used as their inputs. There are large gaps in the consistency and completeness of manually curated links if they are measured using InChI identifiers. However, inconsistency can be caused both by errors in manually curated links and the inherent limitations of the InChI method.
This paper documents the design, layout and algorithms of the IUPAC International Chemical Identifier, InChI.
Learn about technology that solves the issue of interpreting 3D stereochemical information implied in 2D structure representations.
The IUPAC International Chemical Identifier (InChI) is a non-proprietary, machine-readable chemical structure representation format enabling electronic searching, and interlinking and combining, of chemical information from different sources. It was developed from 2001 onwards at the U.S. National Institute of Standards and Technology under the auspices of IUPAC’s Chemical Identifier project. Since 2009, the InChI Trust, a consortium of (mostly) publishers and software developers, has taken over responsibility for funding and oversight of InChI maintenance and development. Funding and responsibility for scientific aspects of InChI development remain with the IUPAC Division VIII (Chemical Nomenclature and Structure Representation) and InChI Subcommittee.
The InChI algorithms are written in C++ and are not available as a Java library. Integration into software written in Java therefore requires a bridge between C and Java libraries, provided by the Java Native Interface (JNI) technology.
We here describe how the InChI library is used in the Bioclipse workbench and the Chemistry Development Kit (CDK) cheminformatics library. To make this possible, a JNI bridge to the InChI library was developed, JNI-InChI, allowing Java software to access the InChI algorithms. By using this bridge, the CDK project packages the InChI binaries in a module and offers easy access from Java using the CDK API. The Bioclipse project packages and offers InChI as a dynamic OSGi bundle that can easily be used by any OSGi-compliant software, in addition to the regular Java Archive and Maven bundles. Bioclipse itself uses the InChI as a key component and calculates it on the fly when visualizing and editing chemical structures. We demonstrate the utility of InChI with various applications in CDK and Bioclipse, such as decision support for chemical liability assessment, tautomer generation, and for knowledge aggregation using a linked data approach.
These results show that the InChI library can be used in a variety of Java library dependency solutions, making the functionality easily accessible by Java software, such as in the CDK. The applications show various ways the InChI has been used in Bioclipse, to enrich its functionality.
InChI, InChIKey, Chemical structures, JNI-InChI, The Chemistry Development Kit, OSGi, Bioclipse, Decision support, Linked data, Tautomers, Databases, Semantic web
The HIV structural database (HIVSDB) is a comprehensive collection of the structures of HIV protease, both of unliganded enzyme and of its inhibitor complexes. It contains abstracts and crystallographic data such as inhibitor and protein coordinates for 248 data sets, of which only 141 are from the Protein Data Bank (PDB). Efficient annotation, indexing, and querying of the inhibitor data is crucial for their effective use for technological and industrial applications. The application of the IUPAC International Chemical Identifier (InChI) to index, curate, and query inhibitor structures in HIVSDB is described. Proteins 2005. Published 2005 Wiley‐Liss, Inc.
Optimal descriptors calculated with the International Chemical Identifier (InChI) have been used to construct a one-variable model of the solubility of fullerene C60 in organic solvents. Attempts to calculate the model for three splits into training and test sets gave stable results.
This presentation is part of Google Tech Talks and was added to the GoogleTalksArchive on August 22, 2006. The original presentation took place on November 2, 2006.
ABSTRACT (Imported From YouTube Source)
The central token of information in Chemistry is a chemical substance, an entity that can often be represented as a well-defined chemical structure. With InChI we have a means of representing this entity as a unique string of characters, which is otherwise represented by a variety of 2-D and 3-D chemical drawings, ‘connection tables’ and synonyms. InChI therefore represents a discrete physical entity, with which is associated an array of chemical properties and data. NIST has long been involved in disseminating chemical reference data associated with such discrete substances. An InChI is therefore the key index to this data. Many other types of data and information are also naturally tied to it, including biological information, commercial availability, toxicity, drug effectiveness and so forth. Because of the diversity of properties and interactions of a chemical substance, effective location of chemical information generally requires further qualifiers, which may be represented coarsely as a key word, but more precisely using a controlled vocabulary. There are no simple separations between information sought by different disciplines and for different objectives. However, reference data may be organized according to the disciplines most directly involved in making the measurements:
- isolated substance – mass, infrared, NMR spectra; physical properties
- substance in the context of others – solubility, affinity, ..
- properties of a mixture containing the substance
The desired data can be a number, vector or image, usually associated with dimensions and links to source information. In some cases, this information is typically converted to a curve or diagram for use by an expert and may be further processed by specialized software. In other cases, a single numerical value is the target. Also, some complexities of structure that must be dealt with in practical search are represented in InChI, but must be decoded for use in searching.
This post consists of a simple spreadsheet that splits an InChI into its layers to facilitate its conceptualisation and its teaching. It considers the six layers currently detailed in the InChI Technical FAQ, https://www.inchi-trust.org/technical-faq-2/#4.3.
The spreadsheet also facilitates looking up an InChI by entering the molecule name or its SMILES representation.
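The same layer-splitting idea can be sketched outside a spreadsheet. The code below is an illustration under stated assumptions, not a transcription of the spreadsheet: the function name is hypothetical, the six layer names (main, charge, stereochemical, isotopic, fixed-H, reconnected) and the prefix-to-layer table are assumptions taken from the commonly documented InChI layout, and everything following an `/i`, `/f` or `/r` segment is simply kept with that layer, which is a simplification for nested cases.

```python
# Assumed mapping of sub-layer prefixes to the layers named in the text.
LAYER_OF_PREFIX = {
    "c": "main", "h": "main",
    "q": "charge", "p": "charge",
    "b": "stereochemical", "t": "stereochemical",
    "m": "stereochemical", "s": "stereochemical",
}
# Prefixes that open a layer whose later segments reuse the prefixes above.
SECTION_STARTERS = {"i": "isotopic", "f": "fixed-H", "r": "reconnected"}
LAYER_NAMES = ["main", "charge", "stereochemical",
               "isotopic", "fixed-H", "reconnected"]


def split_into_layers(inchi):
    """Return (version, {layer name: [segments]}) for an InChI string."""
    prefix, _, body = inchi.partition("=")
    if prefix != "InChI" or not body:
        raise ValueError("expected a string of the form 'InChI=...'")
    version, *segments = body.split("/")
    layers = {name: [] for name in LAYER_NAMES}
    section = None  # set once an isotopic, fixed-H or reconnected layer opens
    for segment in segments:
        if not segment:
            continue
        first = segment[0]
        if first in SECTION_STARTERS:
            section = SECTION_STARTERS[first]
            layers[section].append(segment)
        elif section is not None:
            layers[section].append(segment)
        else:
            # The formula (no prefix) and any unrecognised segment
            # fall back to the main layer.
            layers[LAYER_OF_PREFIX.get(first, "main")].append(segment)
    return version, layers


if __name__ == "__main__":
    version, layers = split_into_layers(
        "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
    )
    for name in LAYER_NAMES:
        print(f"{name:15s} {'/'.join(layers[name]) or '(empty)'}")
```

Printing one row per layer, including the empty ones, mirrors the teaching purpose: it shows at a glance which layers a given molecule actually needs.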
Common tools for conversions, including some spreadsheet-based options included in this site, are hard to use for the hundreds or thousands of compounds we may want to use in cheminformatics projects. This resource takes a different approach to the conversion. By using the PubChem Power User Gateway, it allows converting hundreds of chemical identifiers in a single call to a web service.
Two files are included in this OER: an Excel file that includes two user-defined functions (UDFs) for doing the conversions, documentation and examples; and a VBA module that can be imported into any Excel file to add these functions to an existing spreadsheet.
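The batch idea (many identifiers, one web-service call) can also be sketched in Python. This is not the VBA code shipped with the resource. It assumes PubChem's PUG REST interface rather than the XML-based Power User Gateway named above, and the helper names are hypothetical. The URL pattern and the `PropertyTable` response shape follow the PUG REST documentation as understood here, so they are worth checking against the current documentation before use.

```python
import json
from urllib.request import urlopen

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def build_property_url(cids, properties=("InChI", "InChIKey")):
    """Build one PUG REST URL that requests properties for many CIDs."""
    if not cids:
        raise ValueError("at least one CID is required")
    cid_list = ",".join(str(int(cid)) for cid in cids)
    return (f"{PUG_REST}/compound/cid/{cid_list}"
            f"/property/{','.join(properties)}/JSON")


def parse_property_table(payload):
    """Turn the JSON text of a property response into {CID: {name: value}}."""
    rows = json.loads(payload)["PropertyTable"]["Properties"]
    return {
        row["CID"]: {name: value for name, value in row.items()
                     if name != "CID"}
        for row in rows
    }


def fetch_properties(cids, properties=("InChI", "InChIKey")):
    """Perform the single web-service call (requires network access)."""
    url = build_property_url(cids, properties)
    with urlopen(url, timeout=30) as response:
        return parse_property_table(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Only the URL is printed here, so the example runs offline.
    print(build_property_url([2244, 2519]))
```

Very long identifier lists can exceed practical URL lengths, so a real tool would likely split the input into chunks or send the list in a POST body. That choice is left out of this sketch.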
Summary: Metabolomic publications and databases use different database identifiers or even trivial names which disable queries across databases or between studies. The best way to annotate metabolites is by chemical structures, encoded by the International Chemical Identifier code (InChI) or InChIKey. We have implemented a web-based Chemical Translation Service that performs batch conversions of the most common compound identifiers, including CAS, CHEBI, compound formulas, Human Metabolome Database HMDB, InChI, InChIKey, IUPAC name, KEGG, LipidMaps, PubChem CID+SID, SMILES and chemical synonym names. Batch conversion downloads of 1410 CIDs are performed in 2.5 min. Structures are automatically displayed.
Implementation: The software was implemented in Groovy and JAVA, the web frontend was implemented in GRAILS and the database used was PostgreSQL.
Availability: The source code and an online web interface are freely available. Chemical Translation Service (CTS): http://cts.fiehnlab.ucdavis.edu
Video introduction to InChI.
Videos produced by the InChI-Trust.
This is a 2013 Journal of Cheminformatics article
Since its public introduction in 2005, the IUPAC InChI chemical structure identifier standard has become the international, worldwide standard for defined chemical structures. This article will describe the extensive use and dissemination of the InChI and InChIKey structure representations by and for the world-wide chemistry community, the chemical information community, and major publishers and disseminators of chemical and related scientific offerings in manuscripts and databases.
This presentation was given during the 2017 IUPAC General Assembly on July 13, 2017 in São Paulo, Brazil.
This presentation was given on August 25, 2011 by Steve Heller during the 5th meeting on U.S. Government Chemical Databases and Open Chemistry, which was held in Frederick, MD.
This presentation was given in June 2014 at the 10th International Conference on Chemical Structures, which was held in Noordwijkerhout, the Netherlands. https://www.discngine.com/events/2014/5/14/10th-international-conference-on-chemical-structures-iccs
Comprehensive 2015 article published in Springer’s Journal of Computer-Aided Molecular Design. Here is the abstract:
The IUPAC International Chemical Identifier (InChI) is a non-proprietary, international standard to represent chemical structures. It was conceived 15 years ago, and has been in use for 10 years. The InChI Trust is developing and improving on the current standard, further enabling the interlinking of chemical structures on the web. This mini-review looks at the widespread adoption of InChI in software and databases.
This is an article in Catalan that provides an introduction to chemical information and describes InChI along with other chemical identifiers. Its abstract reads:
“Chemical information, once managed in books, paradigmatically in Chemical Abstracts and several handbooks, has now migrated to the Internet. Nowadays many large databases, both commercial and freely available, have much more information than we have ever had. But accessing them requires some skills that are not yet taught in the official chemistry degrees. This paper presents a brief introduction to the notations and codes that are currently used to identify the chemical species in computer environments. At the same time, some freely available chemistry databases are presented.”