InChI Tag: Undergraduate

Chemistry Programming with Python – Web Scraping Wikipedia For Chemical Identifiers (Tutorial)

Andrew P. Cornell, Robert E. Belford

Chemistry Department, University of Arkansas at Little Rock, Little Rock, Arkansas 72204

 

Abstract

Many individual chemicals have a dedicated page on Wikipedia that gives information about the use, manufacture and properties of that chemical. The properties box displayed at the side of the page includes the relevant chemical identifiers along with alternate names and reaction information. Several identifier formats appear within the properties box, including InChI (International Chemical Identifier), SMILES (Simplified Molecular-Input Line-Entry System) and various registration numbers. This lesson explains how Python can be used to web scrape Wikipedia and retrieve the InChI after the user inputs a chemical name. Web scraping is a process for extracting the contents of a web page. It is often useful for working with online sources that do not offer an API (Application Programming Interface) for certain types of data. Wikipedia does have APIs for much of the information it publishes; however, this tutorial looks at web scraping with Python as an alternate method.

This program works by importing a few helper modules that allow Python to go onto the web, grab an HTML file and then parse the file specifically for the InChI string. To retrieve a valid result, the user must input a chemical name that has a page on Wikipedia. Many chemicals have multiple names, and Wikipedia handles this by using the most common name in the URL (Uniform Resource Locator) of the page. Other names redirect to the URL that uses the chemical's common name.

Learning Objectives

  • Import Python Library
  • Create and Define Functions
  • Parse HTML Text
  • Display Results

Recommended Reading

  • Internet of Chemistry Things Activity 1 (https://ioct.tech/edu/ioct1) Page that explains basic instructions for setting up Python on a computer. The Python activities listed in the sidebar may also help to explain some of the background information.
  • Spring ChemInformatics OLCC Course (http://olcc.ccce.divched.org/Spring2017OLCC) This site provides lots of information on working with chemical data.
  • Python Documentation (https://docs.python.org/3/) Python 3 documentation that correlates to the version used within this tutorial.
  • Beautiful Soup Module  (https://www.crummy.com/software/BeautifulSoup/bs4/doc/) Documentation on the installation and use of this module with Python.

Methods

The Python file used in this tutorial can be located within the following GitHub Page along with a DOI (Digital Object Identifier) on FigShare.1 Python runs on many different operating systems; this tutorial uses the Thonny IDE (Integrated Development Environment) to design, run and test the code.2 The following code takes a chemical name and inserts it into a preformatted URL that pulls all of the HTML from the corresponding Wikipedia page. The code then parses the HTML, separates the InChI identifier from everything else and displays the result.

Python 3 has been used for all code in this tutorial so make sure to consult the correct version documentation if additional reference is needed. Should the syntax or format change with future updates to the Python Language, it may be necessary to approach the task in a different way. The steps are broken down into sections which should be placed into the file one after the other from top to bottom.

Step 1

Starting with the libraries and modules that need to be declared, enter the code in step 1. The first line imports a function called “urlopen” from a standard library module called “urllib.request”. This function allows the program to fetch URLs. The second line imports a class called “BeautifulSoup” from the package “bs4”. It is responsible for isolating the text of the HTML page that we would like to search. Unlike the other two imports, “bs4” is not part of the standard library and must be installed separately, as must the “lxml” parser used in step 3. The last module imported is called “re”, and it provides the regular expressions used to look for the pattern, defined in the program's code, that contains the result.

The Python documentation recommended may be helpful with getting a deeper understanding of how importation of libraries into the program works. Be sure that the following code in step 1 is placed at the top of the file.

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

Step 2

After making the imports, add the following code, which defines the first variable stored by Python. The variable is called “chemical” and it stores the value typed in response to the text prompt displayed to the user. The value entered should be the name of a chemical, either its common name or its systematic name.

chemical = input("Put in the name of the chemical you want the InChI for: ")

Step 3

Two more variables are needed to build the URL of the corresponding chemical page on Wikipedia and to read it. The first line of the following code passes a full URL to the function “urlopen”. The URL combines a fixed section that does not change with the user input stored in step 2; the two pieces are joined into a single string that matches where the chemical page is located on Wikipedia. In programming, this joining of strings is called concatenation. “urlopen” requests that page and returns a response object, which is stored in the variable “html”. The second line stores the text retrieved from this webpage in a variable called “wikiExtract”. Inside the parentheses, the first argument is the response to parse and the second, “lxml”, names the parser that BeautifulSoup should use. The method “get_text” then returns all of the text on the page with the HTML tags removed, and this text is stored in the variable.

html = urlopen("https://en.wikipedia.org/wiki/" + chemical)
wikiExtract = BeautifulSoup(html, "lxml").get_text()
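
Because the URL is built by simple concatenation, a name that contains a space (for example “acetic acid”) produces an address that “urlopen” may reject. The following is a minimal sketch of one way to guard against this, assuming the standard library function urllib.parse.quote. The helper name wiki_url and the constant WIKI_BASE are hypothetical and are not part of the original program.

```python
from urllib.parse import quote

# Fixed part of the address, the same string used in step 3
WIKI_BASE = "https://en.wikipedia.org/wiki/"

def wiki_url(chemical):
    """Build a Wikipedia page URL from a chemical name (hypothetical helper)."""
    # Wikipedia page addresses use underscores where the title has spaces.
    title = chemical.strip().replace(" ", "_")
    # quote() percent-encodes any remaining characters that are unsafe in a URL.
    return WIKI_BASE + quote(title)

print(wiki_url("acetone"))
print(wiki_url("acetic acid"))
```

The returned string could be passed to “urlopen” in place of the concatenated one.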

Step 4

After the text of the webpage has been retrieved, the next few lines search for, isolate and store the InChI value independently of the other text from the webpage. The first line searches the text for the pattern “InChI=.*” and returns a list of every match. In a regular expression the dot matches any character except a line break and the star repeats it any number of times, so the pattern grabs “InChI=” and everything after it up to the end of that line. The second line takes the first match and splits it at the two characters “H\” (written 'H\\' in Python, because a backslash must be escaped inside a string), keeping only the part before them. The last line splits the remaining text on whitespace, so that the first item, “inchiFinal[0]”, is the InChI string without any trailing text. This item is sent to the user's display as the result of the search.

inchiMatch = re.findall("InChI=.*", wikiExtract)
inchiClean = inchiMatch[0].split('H\\')
inchiFinal = inchiClean[0].split()
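
The behaviour of the pattern can be tried without going online. In the sketch below, sample_text is made up for illustration and stands in for the text returned by “get_text”; the real page text will differ. The stricter pattern InChI=1S?/\S+ and the helper extract_inchi are assumptions added here, not the method used in this tutorial. They show one way to stop at the first whitespace and to avoid an IndexError when no InChI is found.

```python
import re

# Hypothetical stand-in for the page text; the layout of a real page may differ.
sample_text = (
    "Identifiers\n"
    "InChI=1S/C3H6O/c1-3(2)4/h1-2H3 Key: CSCPPACGZOOCGX-UHFFFAOYSA-N\n"
    "SMILES CC(C)=O"
)

# The tutorial's pattern: "." never matches a line break, so each match
# runs from "InChI=" to the end of that line only.
whole_line = re.findall("InChI=.*", sample_text)
print(whole_line)

def extract_inchi(text):
    """Return the first InChI-like string in text, or None if there is none."""
    # \S+ matches a run of non-whitespace characters, so the match ends at the first space.
    match = re.search(r"InChI=1S?/\S+", text)
    return match.group(0) if match else None

print(extract_inchi(sample_text))
print(extract_inchi("no identifier here"))
```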

Step 5

Before the user receives the results, the following code can be inserted to give a nice little message that is followed by the actual InChI string. This will help to keep things looking nice and clean.

print("\n" + "Wikipedia says the InChI is:" + '\n')
print(inchiFinal[0])

If you would like to copy the entire program at once, the completed code below contains everything needed to retrieve an InChI from Wikipedia.

Completed Code Example

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

chemical = input("Put in the name of the chemical you want the InChI for: ")

html = urlopen("https://en.wikipedia.org/wiki/" + chemical)
wikiExtract = BeautifulSoup(html, "lxml").get_text()

inchiMatch = re.findall("InChI=.*", wikiExtract)
inchiClean = inchiMatch[0].split('H\\')
inchiFinal = inchiClean[0].split()

print("\n" + "Wikipedia says the InChI is:" + '\n')
print(inchiFinal[0])

Program Demonstration

An interactive demo of this program is provided by Trinket in the online publication.5 The trinket can be visited at this location (https://trinket.io/python3/437c1f516a). The stored and printable copies will only contain screenshots below with descriptions as they cannot display the trinket program in a live environment.

Figure 1 The above image shows the interpreter requesting a chemical name from the user.

 

Figure 2 The above image shows the user inserting the chemical acetone into the request.

 

Figure 3 The above image shows the Python program displaying the resulting InChI after web scraping Wikipedia for the answer.

 

References

(1) Cornell, A. Cheminformatics-Python. Figshare 2018. https://doi.org/10.6084/m9.figshare.7255901.
(2) Annamaa, A. Introducing Thonny, a Python IDE for Learning Programming. In Proceedings of the 15th Koli Calling Conference on Computing Education Research – Koli Calling ’15; ACM Press: Koli, Finland, 2015; pp 117–121. https://doi.org/10.1145/2828959.2828969.

Chemistry Programming with Python – Retrieving InChI From PubChem (Tutorial)

Andrew P. Cornell, Robert E. Belford

Chemistry Department, University of Arkansas at Little Rock, Little Rock, Arkansas 72204

 

Abstract

In this tutorial, a program written in Python will take a user specified chemical name and retrieve the associated chemical identifier or basic property using an online chemical database. This program can be used as both an aid for learning to programmatically work with chemical data and as a short general lesson for using Python with an API (Application Programming Interface). The database that will be used for this lesson is known as PubChem, which is a publicly accessible platform run by NCBI (National Center for Biotechnology Information).

PubChem offers public REST (Representational State Transfer) based programmatic access to much of the data held on its servers, using requests that follow a specific URL syntax.1 Review the recommended reading to become familiar with this syntax if you need to pull data from PubChem that differs from this tutorial. This tutorial retrieves the InChI (International Chemical Identifier), InChI Key, molecular formula, Canonical SMILES (Simplified Molecular-Input Line-Entry System) and molecular weight, with the InChI Key being used in the demonstration section.2

Learning Objectives

  • Import Python Library
  • Create and Define Functions
  • Make API Request with Python
  • Parse and Display Results

Recommended Reading

  • Internet of Chemistry Things Activity 1 (https://ioct.tech/edu/ioct1) Page that explains basic instructions for setting up Python on a computer. The Python activities listed in the sidebar may also help to explain some of the background information.
  • Spring ChemInformatics OLCC Course (http://olcc.ccce.divched.org/Spring2017OLCC) This site provides lots of information on working with chemical data.
  • Python Documentation (https://docs.python.org/3/) Python 3 documentation that correlates to the version used within this tutorial.
  • PubChem REST Documentation (https://pubchemdocs.ncbi.nlm.nih.gov/pug-rest) Provides instructions on syntax structure, methods and request procedures for accessing data.

Methods

The files used in this tutorial can be located within the following GitHub Page (https://github.com/boots7458/Cheminformatics-Python) along with a DOI on FigShare (https://doi.org/10.6084/m9.figshare.7255901).3 Python will run on many different operating systems, however this tutorial will use the Thonny IDE (Integrated Development Environment) to design, run and test the code.4 The following code performs the commands that will retrieve either an InChI, InChI Key, molecular formula, SMILES or the molecular weight of a compound from PubChem. Each code section will be explained in detail with the full completed code located at the end of the tutorial. The completed code can be copied directly into a Python File and run as a fully functional program.

Python 3 has been used for all code in this tutorial so make sure to consult the correct version documentation if additional reference is needed. Should the syntax or format change with future updates to the Python Language, it may be necessary to approach the task in a different way. The steps are broken down into sections which should be placed into the file one after the other from top to bottom.

Step 1

When you run a Python file, only the core language is available; additional modules are loaded only when the program asks for them. So the first step is to load the module this program needs, urllib.request – Extensible library for opening URLs (https://docs.python.org/3/library/urllib.request.html), which defines several functions and classes that help the program open URLs. It is part of the Python standard library, so nothing extra needs to be installed. This is done through the import command as shown in the following code.

import urllib.request

Step 2

After making the library import declaration, it is time to start defining a few functions that will build the different parts of the program. The first function is called “firstChoice” and it declares the “string2” variable as global. This allows the variable to be used by any function without having to pass it explicitly within the code. The function asks the user to input a text string, which is stored in memory as a variable for later use.

def firstChoice():
    global string2
    string2 = input("Enter a chemical name: ")
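
The effect of the “global” statement can be seen in a small stand-alone sketch. The variable and function names below are hypothetical and exist only for this illustration; they are not part of the tutorial program.

```python
name = "unset"

def set_local():
    # Without "global", this assignment creates a new local variable;
    # the module-level "name" is left untouched.
    name = "acetone"

def set_global():
    # With "global", the assignment rebinds the module-level variable.
    global name
    name = "acetone"

set_local()
after_local = name      # still "unset"
set_global()
after_global = name     # now "acetone"
print(after_local, after_global)
```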

Step 3

The second function is much longer, despite performing a simple task. The function is called “choices” and it asks the user to pick a number for which value to retrieve. The numbers inside the print statements control formatting for aesthetic purposes: multiplying a string by a number repeats it to draw line dividers and indents, and the format code “{:>23}” right-aligns the text in a field 23 characters wide. The most important section is where “idChoice” is set as a global variable to store the value chosen by the user. The “if” and “elif” commands define what to do, depending on whether the user selects a valid number from the options or not. A valid option is simply passed on to be used in constructing the API URL for retrieving the user's request. An invalid number displays a message stating the problem and asks the user to try again.

def choices():
    print(40 * "_")
    print(3 * " " + "Select the value below to retrieve")
    print(40 * "_")
    print('{:>23}'.format("INCHI[0]"))
    print('{:>23}'.format("INCHIKEY[1]"))
    print('{:>23}'.format("MOLECULAR FORMULA[2]"))
    print('{:>23}'.format("SMILES[3]"))
    print('{:>23}'.format("MOLECULAR WEIGHT[4]"))
    print(40 * "_")
    global idChoice
    idChoice = int(input("Enter a number choice? "))
    if 0 <= idChoice <= 4:
        choiceID()
    elif idChoice != range(0,4):
        print(2 * '\n' + 38 * '*')
        print("* Incorrect Number Choice, Try Again *")
        print(38 * '*' + 2 * '\n')
        choices()
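
The function above converts the input with int(), which stops the program with a ValueError if the user types something that is not a whole number. The following is a minimal sketch of one way to handle that case, together with a quick look at what the format code produces. The helper parse_choice and its n_options parameter are hypothetical additions, not part of the original program.

```python
def parse_choice(text, n_options=5):
    """Return the menu number as an int, or None if text is not a valid choice.

    Hypothetical helper; n_options=5 matches the five menu entries above.
    """
    try:
        number = int(text)
    except ValueError:
        # int() raises ValueError for input such as "two" or an empty string
        return None
    return number if 0 <= number < n_options else None

# '{:>23}' right-aligns its argument in a field 23 characters wide
line = '{:>23}'.format("INCHI[0]")
print(repr(line))

print(parse_choice("3"), parse_choice("7"), parse_choice("two"))
```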

Step 4

The next function in the program is called “choiceID”. It holds a list of property names and uses the number chosen by the user as an index to select one of them. The selected name is then combined with the other pieces of the URL that pulls in the requested data value. This is the function that uses the library imported in step 1 at the top of the file. The last part of the function asks the user whether another value should be requested. The program repeats from the start if the user types “yes” and ends for any other answer.

def choiceID():
    inputList = ['INCHI', 'INCHIKEY', 'MolecularFormula','CanonicalSMILES', 'MolecularWeight']
    selChoice = inputList[idChoice]

    string1 = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/"
    string3 = "/property/" 
    string4 = "/TXT"
    html = urllib.request.urlopen(string1 + string2 + string3 + selChoice + string4).read()
    html2 = html.decode('UTF-8')
    print(2 * '\n')
    print(html2)
    print(2 * '\n')
    redoProg = input("Would you like to check another compound, yes or no? ")
    if redoProg == "yes":
        print("\n" * 2)
        main()
    else:
        print("Done!")
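
If PubChem cannot answer the request, for example because the name is not recognised, “urlopen” raises an exception and the program stops. The sketch below shows one possible way to catch that case, assuming the exception is urllib.error.HTTPError. The helper fetch_property, its opener parameter and fake_opener are hypothetical; the opener argument exists only so that the function can be tried without a network connection.

```python
import io
import urllib.error
import urllib.request

BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/"

def fetch_property(name, prop, opener=urllib.request.urlopen):
    """Return one PubChem property as text, or None if the request fails.

    Hypothetical helper; by default it uses urlopen exactly as choiceID does.
    """
    url = BASE + name + "/property/" + prop + "/TXT"
    try:
        with opener(url) as response:
            # read() returns bytes, so decode them into a string
            return response.read().decode("utf-8").strip()
    except urllib.error.HTTPError as err:
        # Raised when the server answers with an error status code
        print("Request failed with status", err.code)
        return None

# Offline stand-in for urlopen, used only to demonstrate the call
def fake_opener(url):
    return io.BytesIO(b"InChI=1S/C3H6O/c1-3(2)4/h1-2H3\n")

print(fetch_property("acetone", "INCHI", opener=fake_opener))
```

Calling fetch_property("acetone", "INCHI") without the opener argument would send the real request to PubChem.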

Step 5

This function will be called “main” and it simply defines the order in which all other functions should be run. The program will run this first even though it is located towards the bottom of the Python File.

def main():
    firstChoice()
    choices()

Step 6

Here the program is told which function to start with: the function titled “main”, which defines the order in which the other functions run. Python reads a file from top to bottom, so when it reaches this line the library has already been imported and every function has already been defined. It is very important that this line comes after the import and after all of the function definitions, because calling “main()” any earlier would fail on a function that does not exist yet. For this reason “main()” is written last in the program.

## Program Assignments ##
main()
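
The ordering rule can be checked with a small stand-alone sketch. The function names below are hypothetical and unrelated to the tutorial program. A function may refer to another function that is defined further down the file, because the name is only looked up when the call actually runs.

```python
def run_order():
    # "helper" is looked up when run_order() is called, not when it is defined,
    # so it is fine that helper appears later in the file.
    return helper()

def helper():
    return "helper ran"

# The call comes last, after both definitions have been executed.
result = run_order()
print(result)
```

Moving the call to run_order() above the definition of helper would raise a NameError, which is why the call to “main()” sits at the bottom of the file.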

Completed Code Example

import urllib.request

def firstChoice():
    global string2
    string2 = input("Enter a chemical name: ")

def choices():
    print(40 * "_")
    print(3 * " " + "Select the value below to retrieve")
    print(40 * "_")
    print('{:>23}'.format("INCHI[0]"))
    print('{:>23}'.format("INCHIKEY[1]"))
    print('{:>23}'.format("MOLECULAR FORMULA[2]"))
    print('{:>23}'.format("SMILES[3]"))
    print('{:>23}'.format("MOLECULAR WEIGHT[4]"))
    print(40 * "_")
    global idChoice
    idChoice = int(input("Enter a number choice? "))
    if 0 <= idChoice <= 4:
        choiceID()
    elif idChoice != range(0,4):
        print(2 * '\n' + 38 * '*')
        print("* Incorrect Number Choice, Try Again *")
        print(38 * '*' + 2 * '\n')
        choices()

def choiceID():
    inputList = ['INCHI', 'INCHIKEY', 'MolecularFormula','CanonicalSMILES', 'MolecularWeight']
    selChoice = inputList[idChoice]

    string1 = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/"
    string3 = "/property/" 
    string4 = "/TXT"
    html = urllib.request.urlopen(string1 + string2 + string3 + selChoice + string4).read()
    html2 = html.decode('UTF-8')
    print(2 * '\n')
    print(html2)
    print(2 * '\n')
    redoProg = input("Would you like to check another compound, yes or no? ")
    if redoProg == "yes":
        print("\n" * 2)
        main()
    else:
        print("Done!")

def main():
    firstChoice()
    choices()

## Program Assignments ##
main()

NOTES ON USING THE PROGRAM
This program accepts a single-word chemical name without any modification. For example, to look for data related to acetone, enter just “acetone”. However, a chemical name that contains more than one word needs a placeholder so that the spaces can be represented in the URL syntax. To look up acetic acid, enter “acetic%20acid”, putting %20 wherever a space would appear; otherwise the request fails and the program stops with an error.
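
The %20 placeholder can also be produced by Python itself. The following is a minimal sketch, assuming the standard library function urllib.parse.quote; it is not part of the original program, and the variable names are chosen only for this illustration.

```python
from urllib.parse import quote

name = "acetic acid"
# quote() replaces characters that are not allowed in a URL, such as spaces,
# with their percent-encoded form.
encoded = quote(name)
print(encoded)

url = ("https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/"
       + encoded + "/property/MolecularFormula/TXT")
print(url)
```

Applying quote to the text typed by the user before the URL is assembled would let the user enter the name with ordinary spaces.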

Program Demonstration


An interactive demo of this program is provided by Trinket in the online publication.5 The trinket can be visited at this location (https://trinket.io/python3/c845b6bdbd). The stored and printable copies will only contain screenshots below with descriptions as they cannot display the trinket program in a live environment.

Figure 1 Shown above, “Acetone” has been inserted into the text input.

 

Figure 2 Shown above, option 1 for INCHIKEY has been selected.

 

Figure 3 Shown above is the InChI Key Result from the PubChem API.

Suggested Questions for Classroom Use:

  1. What is a library in Python, and why would Python use libraries?
  2. What library was used in the above code, and what does it do?
  3. What command allows you to make a function in Python?
  4. What are the names of the functions the above program created?
  5. What does the command “global” do? Why is it needed?
  6. What is the URL that is generated to get the molar mass of aspirin?
  7. What does the code ‘{:>23}’ do?
  8. For the following code
    idChoice = int(input("Enter a number choice? "))
    if 0 <= idChoice <= 4:
        choiceID()
    elif idChoice != range(0,4):
        print(2 * '\n' + 38 * '*')
        print("* Incorrect Number Choice, Try Again *")
        print(38 * '*' + 2 * '\n')
        choices()
    
      • (a) What do the following statements say?
    if 0 <= idChoice <= 4:
        choiceID()
    
      • (b) What does the following statement say? Can you think of another way of doing this?
    idChoice != range(0,4):
    
  9. Go to the Compound Property Tables on this page  https://pubchemdocs.ncbi.nlm.nih.gov/pug-rest$_Toc494865567
    • Add a 6th option (option 5) that would allow you to get the IUPAC Name for a compound
    • Add a 7th option (option 6) that would give you an indication if the compound is water or fat soluble.

References

(1) Kim, S.; Thiessen, P. A.; Cheng, T.; Yu, B.; Bolton, E. E. An Update on PUG-REST: RESTful Interface for Programmatic Access to PubChem. Nucleic Acids Research 2018, 46 (W1), W563–W570.
(2) Heller, S. R.; McNaught, A.; Pletnev, I.; Stein, S.; Tchekhovskoi, D. InChI, the IUPAC International Chemical Identifier. Journal of Cheminformatics 2015, 7 (1). https://doi.org/10.1186/s13321-015-0068-4.
(3) Cornell, A. Cheminformatics-Python. Figshare 2018. https://doi.org/10.6084/m9.figshare.7255901.
(4) Annamaa, A. Introducing Thonny, a Python IDE for Learning Programming. In Proceedings of the 15th Koli Calling Conference on Computing Education Research – Koli Calling ’15; ACM Press: Koli, Finland, 2015; pp 117–121. https://doi.org/10.1145/2828959.2828969.
(5) Elliott Hauser; Brian Marks; Ben Wheeler. Trinket; 2019.

PubChem chemical structure standardization

Abstract

Background:

PubChem is a chemical information repository, consisting of three primary databases: Substance, Compound, and BioAssay. When individual data contributors submit chemical substance descriptions to Substance, the unique chemical structures are extracted and stored into Compound through an automated process called structure standardization. The present study describes the PubChem standardization approaches and analyzes them for their success rates, reasons that cause structures to be rejected, and modifications applied to structures during the standardization process. Furthermore, the PubChem standardization is compared to the structure normalization of the IUPAC International Chemical Identifier (InChI) software, as manifested by conversion of the InChI back into a chemical structure.

UniChem: a unified chemical structure cross-referencing and identifier tracking system

Abstract

UniChem is a freely available compound identifier mapping service on the internet, designed to optimize the efficiency with which structure-based hyperlinks may be built and maintained between chemistry-based resources. In the past, the creation and maintenance of such links at EMBL-EBI, where several chemistry-based resources exist, has required independent efforts by each of the separate teams. These efforts were complicated by the different data models, release schedules, and differing business rules for compound normalization and identifier nomenclature that exist across the organization. UniChem, a large-scale, non-redundant database of Standard InChIs with pointers between these structures and chemical identifiers from all the separate chemistry resources, was developed as a means of efficiently sharing the maintenance overhead of creating these links. Thus, for each source represented in UniChem, all links to and from all other sources are automatically calculated and immediately available for all to use. Updated mappings are immediately available upon loading of new data releases from the sources. Web services in UniChem provide users with a single simple automatable mechanism for maintaining all links from their resource to all other sources represented in UniChem. In addition, functionality to track changes in identifier usage allows users to monitor which identifiers are current, and which are obsolete. Lastly, UniChem has been deliberately designed to allow additional resources to be included with minimal effort. Indeed, the recent inclusion of data sources external to EMBL-EBI has provided a simple means of providing users with an even wider selection of resources with which to link to, all at no extra cost, while at the same time providing a simple mechanism for external resources to link to all EMBL-EBI chemistry resources.

International chemical identifier for reactions (RInChI)

Abstract

The Reaction InChI (RInChI) extends the idea of the InChI, which provides a unique descriptor of molecular structures, towards reactions. Prototype versions of the RInChI have been available since 2011. The first official release (RInChI V1.00), funded by the InChI Trust, is now available for download (http://www.inchi-trust.org/downloads/). This release defines the format and generates hashed representations (RInChIKeys) suitable for database and web operations. The RInChI provides a concise description of the key data in chemical processes, and facilitates the manipulation and analysis of reaction data.

Consistency of systematic chemical identifiers within and between small-molecule databases

Abstract

Background:

Correctness of structures and associated metadata within public and commercial chemical databases greatly impacts drug discovery research activities such as quantitative structure–property relationships modelling and compound novelty checking. MOL files, SMILES notations, IUPAC names, and InChI strings are ubiquitous file formats and systematic identifiers for chemical structures. While interchangeable for many cheminformatics purposes there have been no studies on the inconsistency of these structure identifiers due to various approaches for data integration, including the use of different software and different rules for structure standardisation. We have investigated the consistency of systematic identifiers of small molecules within and between some of the commonly used chemical resources, with and without structure standardisation.

Results:

The consistency between systematic chemical identifiers and their corresponding MOL representation varies greatly between data sources (37.2%-98.5%). We observed the lowest overall consistency for MOL-IUPAC names. Disregarding stereochemistry increases the consistency (84.8% to 99.9%). A wide variation in consistency also exists between MOL representations of compounds linked via cross-references (25.8% to 93.7%). Removing stereochemistry improved the consistency (47.6% to 95.6%).

Conclusions:

We have shown that considerable inconsistency exists in structural representation and systematic chemical identifiers within and between databases. This can have a great influence especially when merging data and if systematic identifiers are used as a key index for structure integration or cross-querying several databases. Regenerating systematic identifiers starting from their MOL representation and applying well-defined and documented chemistry standardisation rules to all compounds prior to creating them can dramatically increase internal consistency.

Towards a Universal SMILES representation – A standard method to generate canonical SMILES based on the InChI

Abstract

Background:

There are two line notations of chemical structures that have established themselves in the field: the SMILES string and the InChI string. The InChI aims to provide a unique, or canonical, identifier for chemical structures, while SMILES strings are widely used for storage and interchange of chemical structures, but no standard exists to generate a canonical SMILES string.

Results:

I describe how to use the InChI canonicalisation to derive a canonical SMILES string in a straightforward way, either incorporating the InChI normalisations (Inchified SMILES) or not (Universal SMILES). This is the first description of a method to generate canonical SMILES that takes stereochemistry into account. When tested on the 1.1 m compounds in the ChEMBL database, and a 1 m compound subset of the PubChem Substance database, no canonicalisation failures were found with Inchified SMILES. Using Universal SMILES, 99.79% of the ChEMBL database was canonicalised successfully and 99.77% of the PubChem subset.

Conclusions:

The InChI canonicalisation algorithm can successfully be used as the basis for a common standard for canonical SMILES. While challenges remain – such as the development of a standard aromatic model for SMILES – the ability to create the same SMILES using different toolkits will mean that for the first time it will be possible to easily compare the chemical models used by different toolkits.

Enhancement of the chemical semantic web through the use of InChI identifiers

Abstract

Molecules, as defined by connectivity specified via the International Chemical Identifier (InChI), are precisely indexed by major web search engines so that Internet tools can be transparently used for unique structure searches.

Detection of IUPAC and IUPAC-like chemical names

Abstract

Motivation:

Chemical compounds like small signal molecules or other biologically active chemical substances are an important entity class in life science publications and patents. Several representations and nomenclatures for chemicals like SMILES, InChI, IUPAC or trivial names exist. Only SMILES and InChI names allow a direct structure search, but in biomedical texts trivial names and IUPAC-like names are used more frequently. While trivial names can be found with a dictionary-based approach and in such a way mapped to their corresponding structures, it is not possible to enumerate all IUPAC names. In this work, we present a new machine learning approach based on conditional random fields (CRF) to find mentions of IUPAC and IUPAC-like names in scientific text as well as its evaluation and the conversion rate with available name-to-structure tools.

Results:

We present an IUPAC name recognizer with an F1 measure of 85.6% on a MEDLINE corpus. The evaluation of different CRF orders and offset conjunction orders demonstrates the importance of these parameters. An evaluation of hand-selected patent sections containing large enumerations and terms with mixed nomenclature shows a good performance on these cases (F1 measure 81.5%). Remaining recognition problems are to detect correct borders of the typically long terms, especially when occurring in parentheses or enumerations. We demonstrate the scalability of our implementation by providing results from a full MEDLINE run.

Availability:

We plan to publish the corpora, annotation guideline as well as the conditional random field model as a UIMA component.

Contact: roman.klinger@scai.fraunhofer.de

InChI: connecting and navigating chemistry

Abstract

The International Chemical Identifier (InChI) has had a dramatic impact on providing a means by which to deduplicate, validate and link together chemical compounds and related information across databases. Its influence has been especially valuable as the internet has exploded in terms of the amount of chemistry related information available online. This thematic issue aggregates a number of contributions demonstrating the value of InChI as an enabling technology in the world of cheminformatics and its continuing value for linking chemistry data.

InChI in the wild: an assessment of InChIKey searching in Google

Abstract

While chemical databases can be queried using the InChI string and InChIKey (IK) the latter was designed for open-web searching. It is becoming increasingly effective for this since more sources enhance crawling of their websites by the Googlebot and consequent IK indexing. Searchers who use Google as an adjunct to database access may be less familiar with the advantages of using the IK as explored in this review. As an example, the IK for atorvastatin retrieves ~200 low-redundancy links from a Google search in 0.3 of a second. These include most major databases and a very low false-positive rate. Results encompass less familiar but potentially useful sources and can be extended to isomer capture by using just the skeleton layer of the IK. Google Advanced Search can be used to filter large result sets. Image searching with the IK is also effective and complementary to open-web queries. Results can be particularly useful for less-common structures as exemplified by a major metabolite of atorvastatin giving only three hits. Testing also demonstrated document-to-document and document-to-database joins via structure matching. The necessary generation of an IK from chemical names can be accomplished using open tools and resources for patents, papers, abstracts or other text sources. Active global sharing of local IK-linked information can be accomplished via surfacing in open laboratory notebooks, blogs, Twitter, figshare and other routes. While information-rich chemistry (e.g. approved drugs) can exhibit swamping and redundancy effects, the much smaller IK result sets for link-poor structures become a transformative first-pass option. The IK indexing has therefore turned Google into a de-facto open global chemical information hub by merging links to most significant sources, including over 50 million PubChem and ChemSpider records. The simplicity, specificity and speed of matching make it a useful option for biologists or others less familiar with chemical searching. 
However, compared to rigorously maintained major databases, users need to be circumspect about the consistency of Google results and provenance of retrieved links. In addition, community engagement may be necessary to ameliorate possible future degradation of utility.
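
The skeleton-layer search described above relies on the fixed layout of an InChIKey: a 14-letter first block hashed from the molecular skeleton, a 10-letter second block covering the remaining layers plus flag characters, and a final protonation letter. The following is a minimal Python sketch of that idea, not the review's own tooling. The helper names `split_inchikey` and `google_query` are hypothetical, and the atorvastatin key is quoted from memory and should be checked against a database such as PubChem.

```python
import re
from urllib.parse import quote_plus

# Standard layout: 14 letters, hyphen, 10 letters, hyphen, 1 letter.
INCHIKEY_PATTERN = re.compile(r"^([A-Z]{14})-([A-Z]{10})-([A-Z])$")


def split_inchikey(key: str) -> tuple[str, str, str]:
    """Return (skeleton block, second block, protonation letter)."""
    match = INCHIKEY_PATTERN.match(key.strip())
    if match is None:
        raise ValueError(f"not a well-formed InChIKey: {key!r}")
    return match.groups()


def google_query(key: str, skeleton_only: bool = False) -> str:
    """Build a Google search URL for the full key or only its first block."""
    skeleton, _, _ = split_inchikey(key)
    # The full key is quoted so that it is searched as an exact phrase.
    term = skeleton if skeleton_only else f'"{key.strip()}"'
    return "https://www.google.com/search?q=" + quote_plus(term)


# Assumed to be the InChIKey of atorvastatin; verify before relying on it.
atorvastatin = "XUKUURHRXDUEBC-KAYWLYCHSA-N"
print(split_inchikey(atorvastatin))
print(google_query(atorvastatin))                      # exact structure
print(google_query(atorvastatin, skeleton_only=True))  # isomer capture
```

Searching on the first block alone trades specificity for recall: it should also match stereoisomers and isotopically labelled forms that share the same skeleton.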

UniChem: extension of InChI-based compound mapping to salt, connectivity and stereochemistry layers

Abstract

UniChem is a low-maintenance, fast and freely available compound identifier mapping service, recently made available on the Internet. Until now, the criterion of molecular equivalence within UniChem has been on the basis of complete identity between Standard InChIs. However, a limitation of this approach is that stereoisomers, isotopes and salts of otherwise identical molecules are not considered as related. Here, we describe how we have exploited the layered structural representation of the Standard InChI to create new functionality within UniChem that integrates these related molecular forms. The service, called ‘Connectivity Search’, allows molecules to be first matched on the basis of complete identity between the connectivity layer of their corresponding Standard InChIs, and the remaining layers then compared to highlight stereochemical and isotopic differences. Parsing of Standard InChI sub-layers permits mixtures and salts to also be included in this integration process. Implementation of these enhancements required simple modifications to the schema, loader and web application, none of which have changed the original UniChem functionality or services. The scope of queries may be varied using a variety of easily configurable options, and the output is annotated to assist the user to filter, sort and understand the difference between query and retrieved structures. A RESTful web service output may be easily processed programmatically to allow developers to present the data in whatever form they believe their users will require, or to define their own level of molecular equivalence for their resource, albeit within the constraint of identical connectivity.
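
The matching idea described in this abstract (agree on connectivity first, then report which other layers differ) can be illustrated locally without calling UniChem at all. The sketch below is not UniChem's implementation. The function names are hypothetical, and it assumes that "connectivity" means the formula together with the `/c` and `/h` layers. It also ignores the repeated prefixes that can follow `/i`, `/f` or `/r` in more complex InChIs.

```python
# Assumption: these three parts together are treated as the connectivity.
CONNECTIVITY_KEYS = ("formula", "c", "h")


def inchi_layers(inchi: str) -> dict[str, str]:
    """Split an InChI into a dict keyed by 'version', 'formula' and layer prefix."""
    if not inchi.startswith("InChI="):
        raise ValueError("expected a string starting with 'InChI='")
    version, formula, *rest = inchi[len("InChI="):].split("/")
    layers = {"version": version, "formula": formula}
    for segment in rest:
        if segment:
            # Keep the first occurrence of each one-letter prefix (c, h, t, ...).
            layers.setdefault(segment[0], segment[1:])
    return layers


def compare_inchis(query: str, candidate: str) -> tuple[bool, list[str]]:
    """Return (same connectivity?, prefixes of the other layers that differ)."""
    q, c = inchi_layers(query), inchi_layers(candidate)
    same_connectivity = all(q.get(k) == c.get(k) for k in CONNECTIVITY_KEYS)
    other_keys = (q.keys() | c.keys()) - {"version", *CONNECTIVITY_KEYS}
    differing = sorted(k for k in other_keys if q.get(k) != c.get(k))
    return same_connectivity, differing


L_ALANINE = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
D_ALANINE = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m1/s1"
print(compare_inchis(L_ALANINE, D_ALANINE))  # (True, ['m'])
```

The two enantiomers share a connectivity and differ only in the `/m` stereo sub-layer, which is the kind of annotation the abstract describes for query and retrieved structures.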

On InChI and Evaluating the Quality of Cross-reference Links

Abstract

Background: There are many databases of small molecules focused on different aspects of research and its applications. Some tasks may require integration of information from various databases. However, determining which entries from different databases represent the same compound is not straightforward. Integration can be based, for example, on automatically generated cross-reference links between entries. Another approach is to use the manually curated links stored directly in databases. This study employs well-established InChI identifiers to measure the consistency and completeness of the manually curated links by comparing them with the automatically generated ones.

Results: We used two different tools to generate InChI identifiers and observed some ambiguities in their outputs. In part, these ambiguities were caused by indistinctness in interpretation of the structural data used. InChI identifiers were used successfully to find duplicate entries in databases. We found that the InChI inconsistencies in the manually curated links are very high (28.85% in the worst case). Even using a weaker definition of consistency, the measured values were very high in general. The completeness of the manually curated links was also very poor (only 93.8% in the best case) compared with that of the automatically generated links.

Conclusions: We observed several problems with the InChI tools and the files used as their inputs. There are large gaps in the consistency and completeness of manually curated links if they are measured using InChI identifiers. However, inconsistency can be caused both by errors in manually curated links and the inherent limitations of the InChI method.
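
The abstract does not spell out how consistency and completeness were computed, so the Python sketch below uses assumed definitions purely for illustration. Consistency is taken as the share of curated links whose two entries have identical InChIs, and completeness as the share of InChI-derived links that also appear among the curated ones. The entry identifiers and the two toy databases are hypothetical.

```python
def inchi_links(db_a: dict[str, str], db_b: dict[str, str]) -> set[tuple[str, str]]:
    """Automatically generated links: entry pairs with identical InChI strings."""
    ids_by_inchi: dict[str, list[str]] = {}
    for id_b, inchi in db_b.items():
        ids_by_inchi.setdefault(inchi, []).append(id_b)
    return {
        (id_a, id_b)
        for id_a, inchi in db_a.items()
        for id_b in ids_by_inchi.get(inchi, [])
    }


def consistency(curated: set, generated: set) -> float:
    """Assumed definition: fraction of curated links confirmed by InChI identity."""
    return len(curated & generated) / len(curated) if curated else 1.0


def completeness(curated: set, generated: set) -> float:
    """Assumed definition: fraction of InChI-derived links present in the curated set."""
    return len(curated & generated) / len(generated) if generated else 1.0


# Hypothetical toy databases mapping entry IDs to Standard InChIs.
db_a = {
    "A1": "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3",  # ethanol
    "A2": "InChI=1S/CH4O/c1-2/h2H,1H3",         # methanol
    "A3": "InChI=1S/H2O/h1H2",                  # water
}
db_b = {"B1": db_a["A1"], "B2": db_a["A2"], "B3": db_a["A3"]}

curated = {("A1", "B1"), ("A2", "B3")}  # the second link is deliberately wrong
generated = inchi_links(db_a, db_b)

print(f"consistency:  {consistency(curated, generated):.2f}")
print(f"completeness: {completeness(curated, generated):.2f}")
```

As the conclusions note, a mismatch flagged this way may come from a curation error or from limitations of the InChI itself, so flagged links still need manual review.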

IUPAC STANDARDS ONLINE

Abstract

IUPAC Standards Online is a database built from IUPAC’s (The International Union of Pure and Applied Chemistry) standards and recommendations, which are extracted from the journal Pure and Applied Chemistry (PAC).

The International Union of Pure and Applied Chemistry (IUPAC) is the organization responsible for setting the standards in chemistry that are internationally binding for scientists in industry and academia, patent lawyers, toxicologists, environmental scientists, legislation, etc. “Standards” are definitions of terms, standard values, procedures, rules for naming compounds and materials, names and properties of elements in the periodic table, and many more.

The database will be the only product that provides for the quick and easy search and retrieval of IUPAC’s standards and recommendations which until now have remained unsorted within the huge Pure and Applied Chemistry archive.

Covered topics:

Analytical Chemistry
Biochemistry
Chemical Safety
Data Management
Education
Environmental Chemistry
Inorganic Chemistry
Materials
Medicinal Chemistry
Nomenclature and Terminology
Nuclear Chemistry
Organic Chemistry
Physical Chemistry
Theoretical & Computational Chemistry
Toxicology

Current Status and Future Development in Relation to IUPAC Activities

Abstract

The IUPAC International Chemical Identifier (InChI) is a non-proprietary, machine-readable chemical structure representation format enabling electronic searching, and interlinking and combining, of chemical information from different sources. It was developed from 2001 onwards at the U.S. National Institute of Standards and Technology under the auspices of IUPAC’s Chemical Identifier project. Since 2009, the InChI Trust, a consortium of (mostly) publishers and software developers, has taken over responsibility for funding and oversight of InChI maintenance and development. Funding and responsibility for scientific aspects of InChI development remain with the IUPAC Division VIII (Chemical Nomenclature and Structure Representation) and InChI Subcommittee.

Applications of the InChI in cheminformatics with the CDK and Bioclipse

Abstract

Background

The InChI algorithms are written in C and not available as a Java library. Integration into software written in Java therefore requires a bridge between C and Java libraries, provided by the Java Native Interface (JNI) technology.

Results

We here describe how the InChI library is used in the Bioclipse workbench and the Chemistry Development Kit (CDK) cheminformatics library. To make this possible, a JNI bridge to the InChI library was developed, JNI-InChI, allowing Java software to access the InChI algorithms. By using this bridge, the CDK project packages the InChI binaries in a module and offers easy access from Java using the CDK API. The Bioclipse project packages and offers InChI as a dynamic OSGi bundle that can easily be used by any OSGi-compliant software, in addition to the regular Java Archive and Maven bundles. Bioclipse itself uses the InChI as a key component and calculates it on the fly when visualizing and editing chemical structures. We demonstrate the utility of InChI with various applications in CDK and Bioclipse, such as decision support for chemical liability assessment, tautomer generation, and for knowledge aggregation using a linked data approach.

Conclusions

These results show that the InChI library can be used in a variety of Java library dependency solutions, making the functionality easily accessible by Java software, such as in the CDK. The applications show various ways the InChI has been used in Bioclipse, to enrich its functionality.

Keywords:

InChI, InChIKey, Chemical structures, JNI-InChI, The Chemistry Development Kit, OSGi, Bioclipse, Decision support, Linked data, Tautomers, Databases, Semantic web

IUPAC InChI (Video)

This presentation is part of Google Tech Talks and was added to the GoogleTalksArchive on August 22, 2012. The original presentation took place on November 2, 2006.

ABSTRACT (Imported From YouTube Source)

The central token of information in Chemistry is a chemical substance, an entity that can often be represented as a well-defined chemical structure. With InChI we have a means of representing this entity as a unique string of characters, which is otherwise represented by a variety of 2-D and 3-D chemical drawings, ‘connection tables’ and synonyms. InChI therefore represents a discrete physical entity, with which is associated an array of chemical properties and data. NIST has long been involved in disseminating chemical reference data associated with such discrete substances. An InChI is therefore the key index to this data. Many other types of data and information are also naturally tied to it, including biological information, commercial availability, toxicity, drug effectiveness and so forth. Because of the diversity of properties and interactions of a chemical substance, effective location of chemical information generally requires further qualifiers, which may be represented coarsely as a key word, but more precisely using a controlled vocabulary. There are no simple separations between information sought by different disciplines and for different objectives. However, reference data may be organized according to the disciplines most directly involved in making the measurements:

  • isolated substance – mass, infrared, NMR, spectra; physical properties
  • substance in the context of others – solubility, affinity, ..
  • properties of a mixture containing the substance

The desired data can be a number, vector or image, usually associated with dimensions and links to source information. In some cases, this information is typically converted to a curve or diagram for use by an expert and may be further processed by specialized software. In other cases, a single numerical value is the target. Also, some complexities of structure that must be dealt with in practical search are represented in InChI, but must be decoded for use in searching.

Capturing mixture composition: an open machine-readable format for representing mixed substances

Abstract: We describe a file format that is designed to represent mixtures of compounds in a way that is fully machine readable. This Mixfile format is intended to fill the same role for substances that are composed of multiple components as the venerable Molfile does for specifying individual structures. This much needed data structure is intended to replace current practices for communicating information about mixtures, which usually rely on human-readable text descriptions, drawing several species within a single molecular diagram, or mutually incompatible ad hoc solutions. We describe an open source software application for editing mixture files, which can also be used as web-ready tools for manipulating the file format. We also present a corpus of mixture examples, which we have extracted from collections of text-based descriptions. Furthermore, we present an early look at the proposed IUPAC Mixtures InChI specification, instances of which can be automatically generated using the Mixfile format as a precursor.

InChILayersExplorer – A Spreadsheet to teach and learn the structure of an InChI

This post consists of a simple spreadsheet that splits an InChI into its layers to make the identifier easier to understand and to teach. It considers the six layers currently detailed in the InChI Technical FAQ, https://www.inchi-trust.org/technical-faq-2/#4.3.

The spreadsheet also facilitates looking up an InChI by entering the molecule name or its SMILES representation.
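
For readers who prefer code to a spreadsheet, the layer splitting can be approximated in a few lines of Python. This is a simplified sketch and not the spreadsheet's own logic. The function name `explore_layers` is hypothetical, and the grouping rule is an assumption: once an isotopic (`/i`), fixed-H (`/f`) or reconnected (`/r`) segment appears, every later segment is assigned to that layer.

```python
# Sub-layer prefixes that belong to a named layer on their own.
LAYER_BY_PREFIX = {
    "c": "main", "h": "main",
    "q": "charge", "p": "charge",
    "b": "stereochemical", "t": "stereochemical",
    "m": "stereochemical", "s": "stereochemical",
}
# Prefixes that open a layer which then owns every later segment (assumption).
SECTION_OPENERS = {"i": "isotopic", "f": "fixed-H", "r": "reconnected"}


def explore_layers(inchi: str) -> dict[str, list[str]]:
    """Group the '/'-separated segments of an InChI under six layer names."""
    version, *segments = inchi.removeprefix("InChI=").split("/")
    layers: dict[str, list[str]] = {"version": [version]}
    section = None
    for index, segment in enumerate(segments):
        if not segment:
            continue
        prefix = segment[0]
        if index == 0 and not prefix.islower():
            name = "main"  # the chemical formula has no lower-case prefix
        else:
            section = SECTION_OPENERS.get(prefix, section)
            name = section or LAYER_BY_PREFIX.get(prefix, "unrecognised")
        layers.setdefault(name, []).append(segment)
    return layers


# L-lactic acid: a main layer plus a stereochemical layer.
for name, parts in explore_layers(
    "InChI=1S/C3H6O3/c1-2(4)3(5)6/h2,4H,1H3,(H,5,6)/t2-/m0/s1"
).items():
    print(f"{name:15} {'/'.join(parts)}")
```

Printing one layer per row mirrors the spreadsheet layout and makes it easy to see which segments a given compound actually has.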

RDKit InChI Calculation with Jupyter Notebook

This RDKit InChI Calculation with Jupyter Notebook tutorial is useful to teach the basics of how to interact with InChI using a cheminformatics toolkit in a Jupyter Notebook. The notebook has the following learning objectives:

  1. Setup RDKit with a Jupyter Notebook
  2. Construct a molecule (RDKit molecular object) from a SMILES string
  3. Display molecule images
  4. Calculate an InChI for a molecule
  5. Calculate InChIs for a list of molecules
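
Objectives 2, 4 and 5 can be sketched in a few lines, assuming RDKit is installed with InChI support (for example from conda-forge). This is an illustrative outline rather than the notebook's actual code, and the function name `inchis_from_smiles` is hypothetical. `Chem.MolFromSmiles`, `Chem.MolToInchi` and `Chem.InchiToInchiKey` are existing RDKit functions.

```python
def inchis_from_smiles(smiles_list):
    """Return {smiles: (InChI, InChIKey)}; unparsable SMILES map to None."""
    # Imported inside the function so that this sketch can be loaded
    # even on a machine where RDKit is not installed.
    from rdkit import Chem

    results = {}
    for smiles in smiles_list:
        mol = Chem.MolFromSmiles(smiles)  # returns None if parsing fails
        if mol is None:
            results[smiles] = None
            continue
        inchi = Chem.MolToInchi(mol)
        results[smiles] = (inchi, Chem.InchiToInchiKey(inchi))
    return results
```

In a notebook cell, `inchis_from_smiles(["CCO", "c1ccccc1", "CC(=O)O"])` returns one entry per molecule. For ethanol (`CCO`) the InChI should be `InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3`. For objective 3, evaluating an RDKit molecule object as the last expression of a cell normally renders its image.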

InChI Student Worksheet

This document contains a brief introduction to InChI suitable for undergraduate students and two exercises with answer keys. The first exercise asks about the information encoded in a sample InChI. Its last question asks students to use the InChIKey as a search term; this is much easier if the worksheet is available digitally, so that students can copy and paste the InChIKey into Wikipedia rather than typing it by hand.

The second exercise asks students to draw several simple organic compounds with an appropriate computer application and generate the InChI and InChIKey. Most commonly used structure drawing programs will readily generate the InChI and InChIKey for a structure. In addition, a number of online services offer a structure drawing application that will generate an InChI or InChIKey. Grading this exercise will be much easier if done digitally.

Both exercises were written with the intention that students would complete them online using a Learning Management System (LMS) such as Blackboard, Canvas, Moodle, etc., where the students would copy and paste an appropriate InChI or InChIKey into a text box, which the LMS would compare with the correct answer submitted by the instructor.
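
The comparison step is worth a moment of care, because pasted answers often carry stray spaces or line breaks. The Python sketch below shows one way such a check could work. It does not describe how any particular LMS grades answers, and the function names are hypothetical.

```python
import re

# An InChIKey is 14 letters, hyphen, 10 letters, hyphen, 1 letter.
INCHIKEY_RE = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")


def normalise(answer: str) -> str:
    """Remove all whitespace that copying and pasting may introduce."""
    return "".join(answer.split())


def is_correct(submitted: str, expected: str) -> bool:
    """Compare a student's pasted InChI or InChIKey with the answer key."""
    submitted, expected = normalise(submitted), normalise(expected)
    if INCHIKEY_RE.fullmatch(expected):
        # InChIKeys are upper case by definition, so ignore letter case.
        return submitted.upper() == expected
    # InChI strings are case-sensitive, so they must match exactly.
    return submitted == expected


ETHANOL_INCHI = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
ETHANOL_KEY = "LFQSCWFLJHTTHZ-UHFFFAOYSA-N"

print(is_correct("  InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3\n", ETHANOL_INCHI))
print(is_correct("lfqscwfljhtthz-uhfffaoysa-n", ETHANOL_KEY))
print(is_correct("InChI=1S/C2H6O/c1-3-2/h1-2H3", ETHANOL_INCHI))  # dimethyl ether
```

Exact string matching works here because a given structure always yields the same Standard InChI, so there is a single correct answer to compare against.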

InChI: a user’s perspective

Exchange of chemical structures between practicing chemists is essential to chemical communication. The International Chemical Identifier (InChI) provides a means for lossless communication of structures without resort to any proprietary software or databases, and it requires no payment or royalty fees. This perspective describes why the InChI is valuable to all chemists and how it will be an essential component of creating the chemical web.