InChI Tag: Publication

Detection of IUPAC and IUPAC-like chemical names

Abstract

Motivation:

Chemical compounds like small signal molecules or other biologically active chemical substances are an important entity class in life science publications and patents. Several representations and nomenclatures for chemicals like SMILES, InChI, IUPAC or trivial names exist. Only SMILES and InChI names allow a direct structure search, but in biomedical texts trivial names and IUPAC-like names are used more frequently. While trivial names can be found with a dictionary-based approach and in such a way mapped to their corresponding structures, it is not possible to enumerate all IUPAC names. In this work, we present a new machine learning approach based on conditional random fields (CRF) to find mentions of IUPAC and IUPAC-like names in scientific text as well as its evaluation and the conversion rate with available name-to-structure tools.

Results:

We present an IUPAC name recognizer with an F1 measure of 85.6% on a MEDLINE corpus. The evaluation of different CRF orders and offset conjunction orders demonstrates the importance of these parameters. An evaluation of hand-selected patent sections containing large enumerations and terms with mixed nomenclature shows a good performance on these cases (F1 measure 81.5%). Remaining recognition problems are to detect correct borders of the typically long terms, especially when occurring in parentheses or enumerations. We demonstrate the scalability of our implementation by providing results from a full MEDLINE run.
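As a rough illustration of how an F1 measure for name recognition can be computed, the sketch below scores predicted mentions against gold annotations using exact span matching. The function name, the (start, end) span representation and the toy offsets are assumptions made for this example only; the paper's own evaluation protocol may differ. With exact matching, a mention whose border is off by a few characters counts as both a false positive and a false negative, which is why border detection matters so much for long names.

```python
def span_f1(gold, predicted):
    """Precision, recall and F1 for exact-match entity spans.

    Spans are (start, end) character offsets; a prediction only counts
    as correct when both borders agree with a gold annotation.
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical offsets: the second prediction misses the right border.
gold_spans = [(0, 21), (40, 58), (70, 95)]
predicted_spans = [(0, 21), (40, 55), (70, 95)]

precision, recall, f1 = span_f1(gold_spans, predicted_spans)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```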

Availability:

We plan to publish the corpora, annotation guideline as well as the conditional random field model as a UIMA component.

QSPR modeling of octanol water partition coefficient of platinum complexes by InChI-based optimal descriptors

Abstract

Comparison of the quantitative structure—property relationships (QSPR) based on optimal descriptors calculated with the International Chemical Identifier (InChI) and QSPR based on optimal descriptors calculated with simplified molecular input line entry system has shown that the InChI-based optimal descriptors give more accurate prediction for the logarithm of octanol/water partition coefficient of platinum complexes.

Tautomer Identification and Tautomer Structure Generation Based on the InChI Code

Abstract

An algorithm is introduced that enables a fast generation of all possible prototropic tautomers resulting from the mobile H atoms and associated heteroatoms as defined in the InChI code. The InChI-derived set of possible tautomers comprises (1,3)-shifts for open-chain molecules and (1,n)-shifts (with n being an odd number >3) for ring systems. In addition, our algorithm includes also, as extension to the InChI scope, those larger (1,n)-shifts that can be constructed from joining separate but conjugated InChI sequences of tautomer-active heteroatoms. The developed algorithm is described in detail, with all major steps illustrated through explicit examples. Application to ∼72 500 organic compounds taken from EINECS (European Inventory of Existing Commercial Chemical Substances) shows that around 11% of the substances occur in different heteroatom−prototropic tautomeric forms. Additional QSAR (quantitative structure−activity relationship) predictions of their soil sorption coefficient and water solubility reveal variations across tautomers up to more than two and 4 orders of magnitude, respectively. For a small subset of nine compounds, analysis of quantum chemically predicted tautomer energies supports the view that among all tautomers of a given compound, those restricted to H atom exchanges between heteroatoms usually include the thermodynamically most stable structures.

yaInChI: Modified InChI string scheme for line notation of chemical structures

A modified InChI (International Chemical Identifier) string scheme, yaInChI (yet another InChI), is suggested as a method for including the structural information of a given molecule, making it straightforward and more easily readable. The yaInChI scheme is applicable for checking the structural identity with higher sensitivity and generating three-dimensional (3-D) structures from the one-dimensional (1-D) string with less ambiguity than the general InChI method. The modifications to yaInChI provide non-rotatable single bonds, stereochemistry of organometallic compounds, allene and cumulene, and parity of atoms with a lone pair. Additionally, yaInChI better preserves the original information of the given input file (SDF) using the protonation information, hydrogen count +1, and original bond type, which are not considered or restrictively considered in InChI and SMILES. When yaInChI is used to perform a duplication check on a 3D chemical structure database, Ligand.Info, it shows more discriminating power than InChI. The structural information provided by yaInChI is in a compact format, making it a promising solution for handling large chemical structure databases.

Keywords: line notation, duplication check, InChI, chemical database, SMILES

InChIKey collision resistance: an experimental testing

Abstract

InChIKey is a 27-character compacted (hashed) version of InChI which is intended for Internet and database searching/indexing and is based on an SHA-256 hash of the InChI character string. The first block of InChIKey encodes molecular skeleton while the second block represents various kinds of isomerism (stereo, tautomeric, etc.). InChIKey is designed to be a nearly unique substitute for the parent InChI. However, a single InChIKey may occasionally map to two or more InChI strings (collision). The appearance of collision itself does not compromise the signature as collision-free hashing is impossible; the only viable approach is to set and keep a reasonable level of collision resistance which is sufficient for typical applications. We tested, in computational experiments, how well the real-life InChIKey collision resistance corresponds to the theoretical estimates expected by design. For this purpose, we analyzed the statistical characteristics of InChIKey for datasets of variable size in comparison to the theoretical statistical frequencies. For the relatively short second block, an exhaustive direct testing was performed. We computed and compared to theory the numbers of collisions for the stereoisomers of Spongistatin I (using the whole set of 67,108,864 isomers and its subsets). For the longer first block, we generated, using custom-made software, InChIKeys for more than 3 × 10^10 chemical structures. The statistical behavior of this block was tested by comparison of experimental and theoretical frequencies for the various four-letter sequences which may appear in the first block body. From the results of our computational experiments we conclude that the observed characteristics of InChIKey collision resistance are in good agreement with theoretical expectations.
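The block structure and the idea of a theoretical collision estimate can be sketched in a few lines. The code below splits a key into its parts and applies the standard birthday approximation, n(n-1)/2 divided by the number of possible hash values. The function names are illustrative, and the 37-bit size of the second-block hash is an assumption taken from general InChIKey documentation rather than from the abstract above, so the resulting figure is an order-of-magnitude illustration and not the paper's result.

```python
import re

# 14 letters, hyphen, 8 letters + flag letter + version letter, hyphen,
# 1 protonation letter: 27 characters in total.
INCHIKEY_RE = re.compile(r"^([A-Z]{14})-([A-Z]{8})([A-Z])([A-Z])-([A-Z])$")


def split_inchikey(key):
    """Split an InChIKey into its blocks, raising ValueError if malformed."""
    match = INCHIKEY_RE.match(key)
    if match is None:
        raise ValueError(f"not a well-formed InChIKey: {key!r}")
    skeleton, isomer_hash, flag, version, protonation = match.groups()
    return {
        "first_block": skeleton,        # hash of the molecular skeleton
        "second_block": isomer_hash,    # hash of stereo/isotopic/etc. layers
        "flag": flag,                   # e.g. "S" for a Standard InChI
        "version": version,
        "protonation": protonation,
    }


def expected_collisions(n_items, hash_bits):
    """Birthday approximation: expected colliding pairs among n_items
    values drawn uniformly from 2**hash_bits possibilities."""
    return n_items * (n_items - 1) / 2 / 2**hash_bits


# InChIKey of aspirin, used here only as a well-formed example.
parts = split_inchikey("BSYNRYMUTXBXSQ-UHFFFAOYSA-N")
print(parts["first_block"], parts["second_block"])

# 67,108,864 = 2**26 stereoisomers; 37 bits is an ASSUMED block size.
print(round(expected_collisions(67_108_864, 37)))
```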

InChI: connecting and navigating chemistry

Abstract

The International Chemical Identifier (InChI) has had a dramatic impact on providing a means by which to deduplicate, validate and link together chemical compounds and related information across databases. Its influence has been especially valuable as the internet has exploded in terms of the amount of chemistry related information available online. This thematic issue aggregates a number of contributions demonstrating the value of InChI as an enabling technology in the world of cheminformatics and its continuing value for linking chemistry data.

InChI in the wild: an assessment of InChIKey searching in Google

Abstract

While chemical databases can be queried using the InChI string and InChIKey (IK) the latter was designed for open-web searching. It is becoming increasingly effective for this since more sources enhance crawling of their websites by the Googlebot and consequent IK indexing. Searchers who use Google as an adjunct to database access may be less familiar with the advantages of using the IK as explored in this review. As an example, the IK for atorvastatin retrieves ~200 low-redundancy links from a Google search in 0.3 of a second. These include most major databases and a very low false-positive rate. Results encompass less familiar but potentially useful sources and can be extended to isomer capture by using just the skeleton layer of the IK. Google Advanced Search can be used to filter large result sets. Image searching with the IK is also effective and complementary to open-web queries. Results can be particularly useful for less-common structures as exemplified by a major metabolite of atorvastatin giving only three hits. Testing also demonstrated document-to-document and document-to-database joins via structure matching. The necessary generation of an IK from chemical names can be accomplished using open tools and resources for patents, papers, abstracts or other text sources. Active global sharing of local IK-linked information can be accomplished via surfacing in open laboratory notebooks, blogs, Twitter, figshare and other routes. While information-rich chemistry (e.g. approved drugs) can exhibit swamping and redundancy effects, the much smaller IK result sets for link-poor structures become a transformative first-pass option. The IK indexing has therefore turned Google into a de-facto open global chemical information hub by merging links to most significant sources, including over 50 million PubChem and ChemSpider records. The simplicity, specificity and speed of matching make it a useful option for biologists or others less familiar with chemical searching. 
However, compared to rigorously maintained major databases, users need to be circumspect about the consistency of Google results and provenance of retrieved links. In addition, community engagement may be necessary to ameliorate possible future degradation of utility.
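The two search modes mentioned above, the full IK for an exact structure and the skeleton layer alone for isomer capture, can be sketched as follows. The helper names are hypothetical, quoting the full key is a choice made for this example, and the URL is simply the ordinary Google search address with a `q` parameter; nothing here is taken from the review itself.

```python
from urllib.parse import urlencode


def inchikey_queries(key):
    """Return search strings for an exact match and for isomer capture."""
    skeleton = key.split("-")[0]      # first 14-letter block
    return {
        "exact": f'"{key}"',          # quoted so the key is matched as one token
        "skeleton": skeleton,         # also matches stereoisomers, isotopologues
    }


def google_url(query):
    """Build a plain Google search URL for the given query string."""
    return "https://www.google.com/search?" + urlencode({"q": query})


# InChIKey of aspirin, used only as a well-formed example key.
queries = inchikey_queries("BSYNRYMUTXBXSQ-UHFFFAOYSA-N")
for name, query in queries.items():
    print(name, google_url(query))
```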

Current Status and Future Development in Relation to IUPAC Activities

Abstract

The IUPAC International Chemical Identifier (InChI) is a non-proprietary, machine-readable chemical structure representation format enabling electronic searching, and interlinking and combining, of chemical information from different sources. It was developed from 2001 onwards at the U.S. National Institute of Standards and Technology under the auspices of IUPAC’s Chemical Identifier project. Since 2009, the InChI Trust, a consortium of (mostly) publishers and software developers, has taken over responsibility for funding and oversight of InChI maintenance and development. Funding and responsibility for scientific aspects of InChI development remain with the IUPAC Division VIII (Chemical Nomenclature and Structure Representation) and InChI Subcommittee.

Application of InChI to curate, index, and query 3-D structures

Abstract

The HIV structural database (HIVSDB) is a comprehensive collection of the structures of HIV protease, both of unliganded enzyme and of its inhibitor complexes. It contains abstracts and crystallographic data such as inhibitor and protein coordinates for 248 data sets, of which only 141 are from the Protein Data Bank (PDB). Efficient annotation, indexing, and querying of the inhibitor data is crucial for their effective use for technological and industrial applications. The application of IUPAC International Chemical Identifier (InChI) to index, curate, and query inhibitor structures in HIVSDB is described. Proteins 2005. Published 2005 Wiley‐Liss, Inc.

Additive InChI-based optimal descriptors: QSPR modeling of fullerene C60 solubility in organic solvents

Abstract

Optimal descriptors calculated with the International Chemical Identifier (InChI) have been used to construct a one-variable model of the solubility of fullerene C60 in organic solvents. Attempts to calculate the model for three splits into training and test sets gave stable results.

Capturing mixture composition: an open machine-readable format for representing mixed substances

Alex M. Clark, Leah R. McEwen, Peter Gedeck & Barry A. Bunin
Journal of Cheminformatics volume 11, Article number: 33 (2019)

Abstract: We describe a file format that is designed to represent mixtures of compounds in a way that is fully machine readable. This Mixfile format is intended to fill the same role for substances that are composed of multiple components as the venerable Molfile does for specifying individual structures. This much needed data structure is intended to replace current practices for communicating information about mixtures, which usually rely on human-readable text descriptions, drawing several species within a single molecular diagram, or mutually incompatible ad hoc solutions. We describe an open source software application for editing mixture files, which can also be used as web-ready tools for manipulating the file format. We also present a corpus of mixture examples, which we have extracted from collections of text-based descriptions. Furthermore, we present an early look at the proposed IUPAC Mixtures InChI specification, instances of which can be automatically generated using the Mixfile format as a precursor.

QSAR-modeling of toxicity of organometallic compounds by means of the balance of correlations for InChI-based optimal descriptors

Toropov, A. A., Toropova, A. P., & Benfenati, E. (2010). QSAR-modeling of toxicity of organometallic compounds by means of the balance of correlations for InChI-based optimal descriptors. Molecular Diversity, 14(1), 183-192.

This paper presents a use of InChI-based molecular descriptors to predict toxicity. Its abstract follows.

“Quantitative structure–activity relationships (QSAR) for toxicity toward rats (pLD50) have been built by means of optimal descriptors. Comparison of the optimal descriptors calculated using the International Chemical Identifier (InChI) with the optimal descriptors calculated using the simplified molecular input line entry system (SMILES) has shown that the InChI-based models give more accurate prediction for the abovementioned toxicity of organometallic compounds. These models were obtained by means of the balance of correlation: one subset of the training set (subtraining set) plays role of the training; the second subset (calibration set) plays role of the preliminary check of the models. It has been shown that the balance of correlations is a more robust predictor for the toxicity than the classic scheme (training set—test set: without the calibration set). Three splits into the subtraining set, calibration set, and test set were examined.”

 

InChI-based optimal descriptors: QSAR analysis of fullerene[C60]-based HIV-1 PR inhibitors by correlation balance

The International Chemical Identifier (InChI) has been used to construct InChI-based optimal descriptors to model the binding affinity for fullerene[C60]-based inhibitors of human immunodeficiency virus type 1 aspartic protease (HIV-1 PR). Statistical characteristics of the one-variable model obtained by the balance of correlations are as follows: n = 8, r² = 0.9769, q²(LOO) = 0.9646, s = 0.099, F = 254 (subtraining set); n = 7, r² = 0.7616, s = 0.681, F = 16 (calibration set); n = 5, r² = 0.9724, s = 0.271, F = 106, Rm² = 0.9495 (test set). Predictability of this approach has been checked with three random splits of the data: into the subtraining set, calibration set, and test set.

Use of the international chemical identifier for constructing QSPR-model of normal boiling points of acyclic carbonyl substances

Optimal descriptors calculated with the International Chemical Identifier have been used to construct a one-variable model of the normal boiling points of acyclic carbonyl substances. Attempts to calculate the model for three splits into training and test sets gave stable results. Statistical quality of the model is n = 150, r² = 0.9825, s = 4.96 °C, F = 8,312 (training set) and n = 50, r² = 0.9791, s = 4.68 °C, F = 2,249 (test set).
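The statistics quoted for these one-variable models are linked: for an ordinary least-squares model with k descriptors fitted to n compounds, the Fisher ratio follows from the squared correlation coefficient as F = (r²/k) / ((1 - r²)/(n - k - 1)). The short sketch below applies this textbook relation, which is not stated in the abstracts themselves, to the reported n and r² values as a consistency check. The function name is illustrative.

```python
def f_statistic(n, r2, k=1):
    """Fisher F ratio of a least-squares model with k descriptors,
    computed from the sample size n and the squared correlation r2."""
    return (r2 / k) / ((1.0 - r2) / (n - k - 1))


# (n, r2, reported F) as quoted in the two abstracts above.
reported = [
    (8, 0.9769, 254),     # HIV-1 PR inhibitors, subtraining set
    (7, 0.7616, 16),      # HIV-1 PR inhibitors, calibration set
    (5, 0.9724, 106),     # HIV-1 PR inhibitors, test set
    (150, 0.9825, 8312),  # boiling points, training set
    (50, 0.9791, 2249),   # boiling points, test set
]

for n, r2, f_reported in reported:
    print(n, r2, f_reported, round(f_statistic(n, r2), 1))
```

For the n = 150 training set the recomputed value is about 8309 rather than 8,312; a gap of this size is what one would expect from r² being reported to only four decimal places.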

The Chemical Translation Service—a web-based tool to improve standardization of metabolomic reports

Summary: Metabolomic publications and databases use different database identifiers or even trivial names which disable queries across databases or between studies. The best way to annotate metabolites is by chemical structures, encoded by the International Chemical Identifier code (InChI) or InChIKey. We have implemented a web-based Chemical Translation Service that performs batch conversions of the most common compound identifiers, including CAS, CHEBI, compound formulas, Human Metabolome Database HMDB, InChI, InChIKey, IUPAC name, KEGG, LipidMaps, PubChem CID+SID, SMILES and chemical synonym names. Batch conversion downloads of 1410 CIDs are performed in 2.5 min. Structures are automatically displayed.

Implementation: The software was implemented in Groovy and JAVA, the web frontend was implemented in GRAILS and the database used was PostgreSQL.

Availability: The source code and an online web interface are freely available. Chemical Translation Service (CTS): http://cts.fiehnlab.ucdavis.edu

Failures of fractional crystallization: ordered co‐crystals of isomers and near isomers

A list of 270 structures of ordered co‐crystals of isomers, near isomers and molecules that are almost the same has been compiled. Searches for structures containing isomers could be automated by the use of IUPAC International Chemical Identifier (InChI™) strings but searches for co‐crystals of very similar molecules were more labor intensive. Compounds in which the heteromolecular AB interactions are clearly better than the average of the homomolecular AA and BB interactions were excluded. The two largest structural classes found include co‐crystals of configurational diastereomers and of quasienantiomers (or quasiracemates). These two groups overlap. There are 114 co‐crystals of diastereomers and the same number of quasiracemates, with 71 structures being counted in both groups; together the groups account for 157 structures or 58% of the total. The large number of quasiracemates is strong evidence for inversion symmetry being very favorable for crystal packing. Co‐crystallization of two diastereomers is especially likely if a 1,1 switch of a methyl group and an H atom, or of an inversion of a [2.2.1] or [2.2.2] cage, in one of the diastereomers would make the two molecules enantiomers.

Simplified molecular input-line entry system and International Chemical Identifier in the QSAR analysis of styrylquinoline derivatives as HIV-1 integrase inhibitors

The simplified molecular input-line entry system (SMILES) and IUPAC International Chemical Identifier (InChI) were examined as representations of the molecular structure for quantitative structure-activity relationships (QSAR), which can be used to predict the inhibitory activity of styrylquinoline derivatives against the human immunodeficiency virus type 1 (HIV-1). Optimal SMILES-based descriptors give a best model with n = 26, r(2) = 0.6330, q(2) = 0.5812, s = 0.502, F = 41 for the training set and n = 10, r(2) = 0.7493, r(pred)(2) = 0.6235, R(m)(2) = 0.537, s = 0.541, F = 24 for the validation set. Optimal InChI-based descriptors give a best model with n = 26, r(2) = 0.8673, q(2) = 0.8456, s = 0.302, F = 157 for the training set and n = 10, r(2) = 0.8562, r(pred)(2) = 0.7715, R(m)(2) = 0.819, s = 0.329, F = 48 for the validation set. Thus, the InChI-based model is preferable. The described SMILES-based and InChI-based approaches have been checked with five random splits into the training and test sets.

Representation of chemical structures

Abstract:
At the root of applications for substructure and similarity searching, reaction retrieval, synthesis planning, drug discovery, and physicochemical property prediction is the need for a machine‐readable representation of a structure. Systematic nomenclature is unsuitable, and notations and fragment codes have been superseded, except in certain specific applications. Connection tables are widely used, but there is no formal standard. Recently the International Union of Pure and Applied Chemistry (IUPAC) International Chemical Identifier (InChI) has started to attract interest. This review also summarizes the representation of chemical reactions and three‐dimensional structures.

InChI: a user’s perspective

Exchange of chemical structures between practicing chemists is essential to chemical communication. The International Chemical Identifier (InChI) provides a means for lossless communication of structures without resort to any proprietary software or databases nor does it require any payment or royalty fees. This perspective describes why the InChI is valuable to all chemists and how it will be an essential component of creating the chemical web.

InChI As a Research Data Management Tool

Chemistry International, Volume 38, Issue 3-4, Pages 24–26

Abstract

Progress in science has always been driven by data as a primary research output. This is especially true of the data-centric fields of molecular sciences. Scholarly journals in chemistry in the 19th century captured a (probably small) proportion of research data in printed journals, books, and compendia. The curation of this data from its origins in the 1880s and for most of the 20th century was largely driven by a few organisations as a commercial and proprietary activity. The online era, dating from around 1995, saw much experimentation centred around the presentation and delivery of journals, but less so of the data. The latter evolved, almost by accident, into what is now known as electronic supporting or supplemental information (SI), associated with journal articles. [1] That there was still a general problem in science was revealed by the “Climategate” events in 2009, where a lack of access to the data on which climate models are based induced all manner of unfortunate conspiracy theories. [2] These events catalysed a change in policy at, amongst others, UK research funders. One outcome of this change was seen in May 2015 with the introduction of new research data management (RDM) requirements for funded researchers. This centred around the precept that primary research data should be made openly available [3] and coincided with the evolution of the open science tripod of open data, open access articles, and open science notebooks. [4]

 

On InChI and evaluating the quality of cross-reference links

Galgonek and Vondrášek Journal of Cheminformatics 2014, 6:15

Abstract

Background: There are many databases of small molecules focused on different aspects of research and its applications. Some tasks may require integration of information from various databases. However, determining which entries from different databases represent the same compound is not straightforward. Integration can be based, for example, on automatically generated cross-reference links between entries. Another approach is to use the manually curated links stored directly in databases. This study employs well-established InChI identifiers to measure the consistency and completeness of the manually curated links by comparing them with the automatically generated ones.

Results: We used two different tools to generate InChI identifiers and observed some ambiguities in their outputs. In part, these ambiguities were caused by indistinctness in interpretation of the structural data used. InChI identifiers were used successfully to find duplicate entries in databases. We found that the InChI inconsistencies in the manually curated links are very high (28.85% in the worst case). Even using a weaker definition of consistency, the measured values were very high in general. The completeness of the manually curated links was also very poor (only 93.8% in the best case) compared with that of the automatically generated links.

Conclusions: We observed several problems with the InChI tools and the files used as their inputs. There are large gaps in the consistency and completeness of manually curated links if they are measured using InChI identifiers. However, inconsistency can be caused both by errors in manually curated links and the inherent limitations of the InChI method.
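To make the idea of scoring curated links against InChI-derived ones concrete, here is a minimal sketch. The definitions are assumptions for illustration: a curated link counts as inconsistent when the two linked entries have different InChI strings, and completeness is the share of InChI-identical pairs that also appear among the curated links. The paper's exact definitions, including its weaker notion of consistency, may differ, and the identifiers A1, B1 and so on are made up.

```python
def evaluate_links(curated, inchi_a, inchi_b):
    """Compare curated (id_a, id_b) links with links implied by equal InChIs.

    Returns (inconsistency, completeness) as fractions between 0 and 1.
    """
    # Index database B by InChI so matching pairs can be found directly.
    by_inchi = {}
    for id_b, inchi in inchi_b.items():
        by_inchi.setdefault(inchi, set()).add(id_b)

    # Automatically generated links: every pair sharing an identical InChI.
    automatic = {
        (id_a, id_b)
        for id_a, inchi in inchi_a.items()
        for id_b in by_inchi.get(inchi, ())
    }
    curated = set(curated)

    inconsistency = len(curated - automatic) / len(curated) if curated else 0.0
    completeness = len(curated & automatic) / len(automatic) if automatic else 1.0
    return inconsistency, completeness


ETHANOL = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
METHANOL = "InChI=1S/CH4O/c1-2/h2H,1H3"
WATER = "InChI=1S/H2O/h1H2"

# Hypothetical database contents and curated links.
db_a = {"A1": ETHANOL, "A2": METHANOL, "A3": WATER}
db_b = {"B1": ETHANOL, "B2": METHANOL, "B3": WATER}
curated_links = [("A1", "B1"), ("A2", "B3")]  # second link is wrong

print(evaluate_links(curated_links, db_a, db_b))
```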

UniChem: extension of InChI-based compound mapping to salt, connectivity and stereochemistry layers

Chambers et al. Journal of Cheminformatics 2014, 6:43

Abstract

UniChem is a low-maintenance, fast and freely available compound identifier mapping service, recently made available on the Internet. Until now, the criterion of molecular equivalence within UniChem has been on the basis of complete identity between Standard InChIs. However, a limitation of this approach is that stereoisomers, isotopes and salts of otherwise identical molecules are not considered as related. Here, we describe how we have exploited the layered structural representation of the Standard InChI to create new functionality within UniChem that integrates these related molecular forms. The service, called ‘Connectivity Search’ allows molecules to be first matched on the basis of complete identity between the connectivity layer of their corresponding Standard InChIs, and the remaining layers then compared to highlight stereochemical and isotopic differences. Parsing of Standard InChI sub-layers permits mixtures and salts to also be included in this integration process. Implementation of these enhancements required simple modifications to the schema, loader and web application, but none of which have changed the original UniChem functionality or services. The scope of queries may be varied using a variety of easily configurable options, and the output is annotated to assist the user to filter, sort and understand the difference between query and retrieved structures. A RESTful web service output may be easily processed programmatically to allow developers to present the data in whatever form they believe their users will require, or to define their own level of molecular equivalence for their resource, albeit within the constraint of identical connectivity.

Keywords: UniChem, Standard InChI, InChIKey, Chemical databases, Data integration, Connectivity search
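The layered matching behind the Connectivity Search can be illustrated with a small sketch. It splits two Standard InChI strings into layers, tests whether formula, connection (/c) and hydrogen (/h) layers agree, and reports which remaining layers differ. Treating exactly these three layers as "connectivity" is an assumption of this example, the function names are illustrative, and the per-component handling of mixtures and salts is left out. This is not UniChem's implementation.

```python
def split_layers(inchi):
    """Split an InChI string into a dict of layer prefix -> layer content."""
    head, *parts = inchi.split("/")
    if not head.startswith("InChI="):
        raise ValueError(f"not an InChI string: {inchi!r}")
    layers = {}
    # The formula layer has no prefix letter and never starts in lower case.
    if parts and not parts[0][:1].islower():
        layers["formula"] = parts.pop(0)
    isotopic = False
    for part in parts:
        if not part:
            continue
        prefix, value = part[0], part[1:]
        if prefix == "i":
            isotopic = True
            key = "i"
        else:
            # Stereo layers repeated after /i belong to the isotopic section.
            key = "i/" + prefix if isotopic else prefix
        layers[key] = value
    return layers


def connectivity_match(inchi_a, inchi_b):
    """Return (same_connectivity, sorted list of other layers that differ)."""
    a, b = split_layers(inchi_a), split_layers(inchi_b)
    core = ("formula", "c", "h")
    same = all(a.get(k) == b.get(k) for k in core)
    differing = sorted(
        k for k in set(a) | set(b) if k not in core and a.get(k) != b.get(k)
    )
    return same, differing


L_ALANINE = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
D_ALANINE = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m1/s1"

print(connectivity_match(L_ALANINE, D_ALANINE))
```

For the two alanine enantiomers the core layers agree and only the /m stereo layer differs, which is the kind of difference the service annotates in its output.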

Data Formats for Elementary Gas Phase Kinetics, Part 1: Unique Representations of Species at the Molecular Level

BURGESS, D. R., MANION, J. A. and HAYES, C. J. (2014), Data Formats for Elementary Gas Phase Kinetics, Part 1: Unique Representations of Species at the Molecular Level. Int. J. Chem. Kinet., 46: 640-650. doi:10.1002/kin.20875

Abstract

Standardized electronic formats for data are needed to efficiently and transparently communicate the results of scientific studies. A format for the unique identification of chemical species is a requirement in the field of chemistry, and the IUPAC International Chemical Identifier (InChI) has been widely adopted for this purpose. The InChI identifier has proved to be very useful. The InChI identifier, however, is currently insufficient to uniquely specify some types of molecular entities at a detailed molecular level needed to fully characterize their chemical nature, to differentiate between chemically distinct conformers, to uniquely identify structures used in quantum chemical calculations, and to completely describe elementary chemical reactions. To address this limitation, we propose an augmented form of InChI, denoted as InChI–ER, which contains additional optional layers that allow the unique and unambiguous identification of molecules at a detailed molecular level. The new layers proposed herein are optional extensions of the existing InChI formalism and, like all other InChI layers, would not interfere with InChI identifiers currently in use. The focus of the present work is the better specification of required molecular entities such as rotational conformations, ring conformations, and electronic states. In companion articles, we propose additional reaction layers using an extended InChI format that will enable the unique identification of elementary chemical reactions, including specification of associated transition states, specification of the changes in bonds that occur during reaction, and classification of reaction types.

Applications of the InChI in cheminformatics with the CDK and Bioclipse

Spjuth et al. Journal of Cheminformatics 2013, 5:14

Abstract

Background: The InChI algorithms are written in C++ and not available as a Java library. Integration into software written in Java therefore requires a bridge between C and Java libraries, provided by the Java Native Interface (JNI) technology.

Results: We here describe how the InChI library is used in the Bioclipse workbench and the Chemistry Development Kit (CDK) cheminformatics library. To make this possible, a JNI bridge to the InChI library was developed, JNI-InChI, allowing Java software to access the InChI algorithms. By using this bridge, the CDK project packages the InChI binaries in a module and offers easy access from Java using the CDK API. The Bioclipse project packages and offers InChI as a dynamic OSGi bundle that can easily be used by any OSGi-compliant software, in addition to the regular Java Archive and Maven bundles. Bioclipse itself uses the InChI as a key component and calculates it on the fly when visualizing and editing chemical structures. We demonstrate the utility of InChI with various applications in CDK and Bioclipse, such as decision support for chemical liability assessment, tautomer generation, and knowledge aggregation using a linked data approach.

Conclusions: These results show that the InChI library can be used in a variety of Java library dependency solutions, making the functionality easily accessible by Java software, such as in the CDK. The applications show various ways the InChI has been used in Bioclipse to enrich its functionality.

Keywords: InChI, InChIKey, Chemical structures, JNI-InChI, The Chemistry Development Kit, OSGi, Bioclipse, Decision support, Linked data, Tautomers, Databases, Semantic web

CVDHD: a cardiovascular disease herbal database for drug discovery and network pharmacology

Gu et al. Journal of Cheminformatics 2013, 5:51

Abstract
Background: Cardiovascular disease (CVD) is the leading cause of death and is associated with multiple risk factors. Herbal medicines have long been used to treat CVD in China, and several natural products or derivatives (e.g., aspirin and reserpine) are among the most common drugs worldwide. The objective of this work was to construct a systematic database for drug discovery based on natural products isolated from CVD-related medicinal herbs and to investigate the mechanisms of action of herbal medicines.

Description: The cardiovascular disease herbal database (CVDHD) was designed to be a comprehensive resource for virtual screening and drug discovery from natural products isolated from medicinal herbs for cardiovascular-related diseases. CVDHD comprises 35230 distinct molecules and their identification information (chemical name, CAS registry number, molecular formula, molecular weight, international chemical identifier (InChI) and SMILES), calculated molecular properties (AlogP, numbers of hydrogen bond acceptors and donors, etc.), docking results between all molecules and 2395 target proteins, cardiovascular-related diseases, pathways and clinical biomarkers. All 3D structures were optimized in the MMFF94 force field and can be freely accessed.

Conclusions: CVDHD integrates medicinal herbs, natural products, CVD-related target proteins, docking results, diseases and clinical biomarkers. By using the methods of virtual screening and network pharmacology, CVDHD will provide a platform to streamline drug/lead discovery from natural products and to explore the mechanisms of action of medicinal herbs. CVDHD is freely available at http://pkuxxj.pku.edu.cn/CVDHD.

Keywords: Cardiovascular disease, Drug discovery, Network pharmacology, Molecular docking, Virtual screening, Herbal formula, Natural products, Medicinal herbs, Traditional Chinese medicine

InChI – the worldwide chemical structure identifier standard

This is a 2013 Journal of Cheminformatics article

Abstract
Since its public introduction in 2005, the IUPAC InChI chemical structure identifier standard has become the international, worldwide standard for defined chemical structures. This article will describe the extensive use and dissemination of the InChI and InChIKey structure representations by and for the world-wide chemistry community, the chemical information community, and major publishers and disseminators of chemical and related scientific offerings in manuscripts and databases.

Breu introducció a la digitalització de la informació química

This is an article in Catalan that provides an introduction to chemical information and describes InChI along with other chemical identifiers. Its abstract reads:

“Chemical information, once managed in books, paradigmatically in Chemical Abstracts and several handbooks, has now migrated to the Internet. Nowadays many large databases, both commercial and freely available, have much more information than we have ever had. But accessing them requires some skills that are not yet taught in the official chemistry degrees. This paper presents a brief introduction to the notations and codes that are currently used to identify the chemical species in computer environments. At the same time, some freely available chemistry databases are presented.”

International Chemical Identifier for Reactions (RInChI)

This May 2018 open access Journal of Cheminformatics article by Guenter Grethe et al. describes the first official version (RInChI-V1.00), which was released in March 2017 and is available for download from the InChI Trust (https://www.inchi-trust.org/wp/downloads/).

RInChI provides a standard for the representation of chemical reactions. Because different databases represent chemical reactions in different ways, adopting RInChI, together with the ability to transform other representations into RInChI, should allow more thorough discovery of reaction-related information across databases. The article discusses the layers of the RInChI, the InChIKey and the Web-InChIKey. It also describes the generation of RInChI from RD and RXN files and the generation of RXN from RInChI. A database of over a million RInChIs at the University of Cambridge is also described (www-rinchi.ch.cam.ac.uk).