PubChem is a chemical information repository, consisting of three primary databases: Substance, Compound, and BioAssay. When individual data contributors submit chemical substance descriptions to Substance, the unique chemical structures are extracted and stored into Compound through an automated process called structure standardization. The present study describes the PubChem standardization approaches and analyzes them for their success rates, reasons that cause structures to be rejected, and modifications applied to structures during the standardization process. Furthermore, the PubChem standardization is compared to the structure normalization of the IUPAC International Chemical Identifier (InChI) software, as manifested by conversion of the InChI back into a chemical structure.
Over the past several centuries, chemistry has permeated virtually every facet of human lifestyle, enriching fields as diverse as medicine, agriculture, manufacturing, warfare, and electronics, among numerous others. Unfortunately, application-specific, incompatible chemical information formats and representation strategies have emerged as a result of such diverse adoption of chemistry. Although a number of efforts have been dedicated to unifying the computational representation of chemical information, disparities between the various chemical databases still persist and stand in the way of cross-domain, interdisciplinary investigations. Through a common syntax and formal semantics, Semantic Web technology offers the ability to accurately represent, integrate, reason about and query across diverse chemical information.
Here we specify and implement the Chemical Entity Semantic Specification (CHESS) for the representation of polyatomic chemical entities, their substructures, bonds, atoms, and reactions using Semantic Web technologies. CHESS provides means to capture aspects of their corresponding chemical descriptors, connectivity, functional composition, and geometric structure while specifying mechanisms for data provenance. We demonstrate that using our readily extensible specification, it is possible to efficiently integrate multiple disparate chemical data sources, while retaining appropriate correspondence of chemical descriptors, with very little additional effort. We demonstrate the impact of some of our representational decisions on the performance of chemically-aware knowledgebase searching and rudimentary reaction candidate selection. Finally, we provide access to the tools necessary to carry out chemical entity encoding in CHESS, along with a sample knowledgebase.
By harnessing the power of Semantic Web technologies with CHESS, it is possible to provide a means of facile cross-domain chemical knowledge integration with full preservation of data correspondence and provenance. Our representation builds on existing cheminformatics technologies and, by virtue of the RDF specification, remains flexible and amenable to application- and domain-specific annotations without compromising chemical data integration. We conclude that the adoption of a consistent and semantically-enabled chemical specification is imperative for surviving the coming chemical data deluge and supporting systems science research.
The Blue Obelisk movement was established in 2005 as a response to the lack of Open Data, Open Standards and Open Source (ODOSOS) in chemistry. It aims to make it easier to carry out chemistry research by promoting interoperability between chemistry software, encouraging cooperation between Open Source developers, and developing community resources and Open Standards.
This contribution looks back on the work carried out by the Blue Obelisk in the past 5 years and surveys progress and remaining challenges in the areas of Open Data, Open Standards, and Open Source in chemistry.
We show that the Blue Obelisk has been very successful in bringing together researchers and developers with common interests in ODOSOS, leading to development of many useful resources freely available to the chemistry community.
There are two line notations of chemical structures that have established themselves in the field: the SMILES string and the InChI string. The InChI aims to provide a unique, or canonical, identifier for chemical structures, while SMILES strings are widely used for storage and interchange of chemical structures, but no standard exists to generate a canonical SMILES string.
I describe how to use the InChI canonicalisation to derive a canonical SMILES string in a straightforward way, either incorporating the InChI normalisations (Inchified SMILES) or not (Universal SMILES). This is the first description of a method to generate canonical SMILES that takes stereochemistry into account. When tested on the 1.1 million compounds in the ChEMBL database, and a 1 million compound subset of the PubChem Substance database, no canonicalisation failures were found with Inchified SMILES. Using Universal SMILES, 99.79% of the ChEMBL database was canonicalised successfully and 99.77% of the PubChem subset.
The InChI canonicalisation algorithm can successfully be used as the basis for a common standard for canonical SMILES. While challenges remain – such as the development of a standard aromatic model for SMILES – the ability to create the same SMILES using different toolkits will mean that for the first time it will be possible to easily compare the chemical models used by different toolkits.
An algorithm is introduced that enables a fast generation of all possible prototropic tautomers resulting from the mobile H atoms and associated heteroatoms as defined in the InChI code. The InChI-derived set of possible tautomers comprises (1,3)-shifts for open-chain molecules and (1,n)-shifts (with n being an odd number >3) for ring systems. In addition, our algorithm also includes, as an extension to the InChI scope, those larger (1,n)-shifts that can be constructed from joining separate but conjugated InChI sequences of tautomer-active heteroatoms. The developed algorithm is described in detail, with all major steps illustrated through explicit examples. Application to ∼72 500 organic compounds taken from EINECS (European Inventory of Existing Commercial Chemical Substances) shows that around 11% of the substances occur in different heteroatom−prototropic tautomeric forms. Additional QSAR (quantitative structure−activity relationship) predictions of their soil sorption coefficient and water solubility reveal variations across tautomers of up to more than two and four orders of magnitude, respectively. For a small subset of nine compounds, analysis of quantum chemically predicted tautomer energies supports the view that among all tautomers of a given compound, those restricted to H atom exchanges between heteroatoms usually include the thermodynamically most stable structures.
A modified InChI (International Chemical Identifier) string scheme, yaInChI (yet another InChI), is suggested as a method for including the structural information of a given molecule, making it straightforward and more easily readable. The yaInChI scheme is applicable for checking the structural identity with higher sensitivity and generating three-dimensional (3-D) structures from the one-dimensional (1-D) string with less ambiguity than the general InChI method. The modifications in yaInChI provide non-rotatable single bonds, stereochemistry of organometallic compounds, allenes and cumulenes, and parity of atoms with a lone pair. Additionally, yaInChI better preserves the original information of the given input file (SDF) using the protonation information, hydrogen count +1, and original bond type, which are not considered or only restrictively considered in InChI and SMILES. When yaInChI is used to perform a duplication check on a 3-D chemical structure database, Ligand.Info, it shows more discriminating power than InChI. The structural information provided by yaInChI is in a compact format, making it a promising solution for handling large chemical structure databases.
UniChem is a low-maintenance, fast and freely available compound identifier mapping service, recently made available on the Internet. Until now, the criterion of molecular equivalence within UniChem has been on the basis of complete identity between Standard InChIs. However, a limitation of this approach is that stereoisomers, isotopes and salts of otherwise identical molecules are not considered as related. Here, we describe how we have exploited the layered structural representation of the Standard InChI to create new functionality within UniChem that integrates these related molecular forms. The service, called ‘Connectivity Search’, allows molecules to be first matched on the basis of complete identity between the connectivity layer of their corresponding Standard InChIs, and the remaining layers then compared to highlight stereochemical and isotopic differences. Parsing of Standard InChI sub-layers permits mixtures and salts to also be included in this integration process. Implementation of these enhancements required simple modifications to the schema, loader and web application, none of which has changed the original UniChem functionality or services. The scope of queries may be varied using a variety of easily configurable options, and the output is annotated to assist the user to filter, sort and understand the difference between query and retrieved structures. A RESTful web service output may be easily processed programmatically to allow developers to present the data in whatever form they believe their users will require, or to define their own level of molecular equivalence for their resource, albeit within the constraint of identical connectivity.
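The two-step matching described above (identical connectivity first, then a comparison of the remaining layers) can be illustrated with a minimal Python sketch. It is not UniChem's implementation: the function names `parse_inchi`, `connectivity_key` and `compare` are hypothetical, the choice to match on the formula plus the `/c` sublayer is an assumption made for illustration, and the alanine and ethanol InChIs are used only as sample inputs.

```python
"""Sketch: match two Standard InChIs on connectivity, then diff other layers."""


def parse_inchi(inchi: str) -> dict[str, str]:
    """Split an InChI string into layers keyed by their one-letter prefix."""
    if not inchi.startswith("InChI="):
        raise ValueError(f"not an InChI string: {inchi!r}")
    version, *rest = inchi[len("InChI="):].split("/")
    segments = [s for s in rest if s]  # drop empty segments
    layers = {"version": version}
    # The formula layer carries no prefix letter; every other layer does.
    if segments and not segments[0][0].islower():
        layers["formula"] = segments.pop(0)
    for segment in segments:
        # Keep the first occurrence only; repeated prefixes (for example
        # stereo sublayers after an isotopic layer) are ignored here.
        layers.setdefault(segment[0], segment[1:])
    return layers


def connectivity_key(inchi: str) -> tuple[str, str]:
    """Assumed matching key: molecular formula plus the /c sublayer."""
    layers = parse_inchi(inchi)
    return layers.get("formula", ""), layers.get("c", "")


def compare(query: str, candidate: str):
    """Return None if connectivity differs, else a dict of differing layers."""
    if connectivity_key(query) != connectivity_key(candidate):
        return None
    a, b = parse_inchi(query), parse_inchi(candidate)
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }


# Sample inputs: the two enantiomers of alanine, and ethanol.
L_ALANINE = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
D_ALANINE = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m1/s1"
ETHANOL = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"

if __name__ == "__main__":
    print(compare(L_ALANINE, D_ALANINE))  # only the /m stereo sublayer differs
    print(compare(L_ALANINE, ETHANOL))    # different connectivity
```

The sketch treats each InChI as a single component; mixtures and salts, whose formula and sublayers are split into several components, would need an additional parsing step that is not shown here.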
This paper documents the design, layout and algorithms of the IUPAC International Chemical Identifier, InChI.
IUPAC Standards Online is a database built from IUPAC’s (The International Union of Pure and Applied Chemistry) standards and recommendations, which are extracted from the journal Pure and Applied Chemistry (PAC).
The International Union of Pure and Applied Chemistry (IUPAC) is the organization responsible for setting the standards in chemistry that are internationally binding for scientists in industry and academia, patent lawyers, toxicologists, environmental scientists, legislation, etc. “Standards” are definitions of terms, standard values, procedures, rules for naming compounds and materials, names and properties of elements in the periodic table, and many more.
The database will be the only product that provides for the quick and easy search and retrieval of IUPAC’s standards and recommendations which until now have remained unsorted within the huge Pure and Applied Chemistry archive.
Nomenclature and Terminology
Theoretical & Computational Chemistry
Learn about technology that solves the issue of interpreting 3D stereochemical information implied in 2D structure representations.
The IUPAC International Chemical Identifier (InChI) is a non-proprietary, machine-readable chemical structure representation format enabling electronic searching, and interlinking and combining, of chemical information from different sources. It was developed from 2001 onwards at the U.S. National Institute of Standards and Technology under the auspices of IUPAC’s Chemical Identifier project. Since 2009, the InChI Trust, a consortium of (mostly) publishers and software developers, has taken over responsibility for funding and oversight of InChI maintenance and development. Funding and responsibility for scientific aspects of InChI development remain with the IUPAC Division VIII (Chemical Nomenclature and Structure Representation) and InChI Subcommittee.
Project for Cheminformatics Fall 2012. Part 2/2. Presentation on encodings, SMILES and InChI.
Optimal descriptors calculated with the International Chemical Identifier (InChI) have been used to construct a one-variable model of the solubility of fullerene C60 in organic solvents. Attempts to calculate the model for three splits into training and test sets gave stable results.
This presentation is part of the Google Tech Talks series and was added to the GoogleTalksArchive on August 22, 2006. The original presentation took place on November 2, 2006.
ABSTRACT (Imported From YouTube Source)
The central token of information in chemistry is a chemical substance, an entity that can often be represented as a well-defined chemical structure. With InChI we have a means of representing this entity as a unique string of characters, which is otherwise represented by a variety of 2-D and 3-D chemical drawings, ‘connection tables’ and synonyms. InChI therefore represents a discrete physical entity, with which is associated an array of chemical properties and data. NIST has long been involved in disseminating chemical reference data associated with such discrete substances. An InChI is therefore the key index to this data. Many other types of data and information are also naturally tied to it, including biological information, commercial availability, toxicity, drug effectiveness and so forth. Because of the diversity of properties and interactions of a chemical substance, effective location of chemical information generally requires further qualifiers, which may be represented coarsely as a key word, but more precisely using a controlled vocabulary. There are no simple separations between information sought by different disciplines and for different objectives. However, reference data may be organized according to the disciplines most directly involved in making the measurements:
- isolated substance – mass, infrared and NMR spectra; physical properties
- substance in the context of others – solubility, affinity, ..
- properties of a mixture containing the substance
The desired data can be a number, vector or image, usually associated with dimensions and links to source information. In some cases, this information is typically converted to a curve or diagram for use by an expert and may be further processed by specialized software. In other cases, a single numerical value is the target. Also, some complexities of structure that must be dealt with in practical searching are represented in InChI, but must be decoded for use in searching.