
Scrape chemical names and data for WikiData from chemical databases
Open, Needs TriagePublic

Description

ChemSpider is a database of 63 million chemical structures and features. The URLs for these pages are systematic: www.chemspider.com/Chemical-Structure.{X}.html, where {X} is an integer. Adding this data to WikiData would more than double the number of data points within this database.

Event Timeline

S9a8m created this task.Oct 26 2018, 11:53 AM
S9a8m added a project: Wikidata.

Studying the HTML file of a typical page (http://www.chemspider.com/Chemical-Structure.175.html) shows that the tag <h1 class="h4"> is only used once, in close proximity to the chemical name. Need to learn how to open HTML and extract characters following this tag. Will approach with Python.
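A minimal sketch of that extraction, assuming the name is the text between <h1 class="h4"> and the next </h1>. The function name extract_after_tag and the sample HTML string are made up for illustration; a real ChemSpider page may be laid out differently.

```python
def extract_after_tag(html, open_tag='<h1 class="h4">', close_tag="</h1>"):
    """Return the text between open_tag and the next close_tag, or None."""
    start = html.find(open_tag)
    if start == -1:
        return None  # tag not present on this page
    start += len(open_tag)
    end = html.find(close_tag, start)
    if end == -1:
        return None  # opening tag without a closing tag
    return html[start:end].strip()


# Hypothetical snippet standing in for a downloaded page
sample_html = '<div><h1 class="h4"> Example name </h1><p>other content</p></div>'
print(extract_after_tag(sample_html))
```

Returning None when the tag is missing makes it easy to skip deprecated or empty pages instead of crashing on them.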

Loikke moved this task from incoming to in progress on the Wikidata board.Oct 26 2018, 12:20 PM
Loikke moved this task from Backlog to Doing on the Wikistorm-2018 board.
Husky added a subscriber: Husky.Oct 26 2018, 12:22 PM

@S9a8m let me know if you need some assistance with scraping, i can help.

Any help would be much appreciated, I have no experience scraping with Python so likely to be a bit slow

I'd like to help with this - I can help with the scraping (if that kind of thing is actually allowed). Where can I find you?

An alternative place to get data would be https://www.accessdata.fda.gov/scripts/cder/daf/index.cfm?event=overview.process&ApplNo={X} (again {X} being an integer) which lists FDA-approved medicinal compounds

Currently working in the main hacking room, at the small table close to the door

FYI: there is an API to query ChemSpider https://developer.rsc.org/ but it is limited to 1000 calls a month

Be careful some entries are deprecated. E.g.: http://www.chemspider.com/Chemical-Structure.176.html

Other chemical databases are available:
https://www.ebi.ac.uk/chembl/downloads in SQL
http://mychem.info/ in RDF
More coming :)

str.find(stringName, stringToFind, startPos) will return -1 if not found

PubChem is open, but I don't know what is inside
https://pubchemdocs.ncbi.nlm.nih.gov/downloads

Okay, we can now print names from Chemspider, which seems to slow down after 11 entries - will try and apply a similar approach to PubChem

S9a8m added a comment.EditedOct 26 2018, 12:52 PM

Below code will scrape chemical names from PubChem really nicely, will try to get more data out. Could someone make a csv writer module?

#Scrape chemical information from PubChem, using tag <title>
import requests

for n in range(1000):
    r = requests.get("https://pubchem.ncbi.nlm.nih.gov/compound/"+str(n))
    body=r.text
    title1 = body.split('<title>')[1]
    t = title1.split('</title>')[0]
    print(t)
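On the CSV writer question: a rough sketch using only the standard csv module. The function name write_rows, the file name and the column names are placeholders, not something agreed in this task.

```python
import csv


def write_rows(path, header, rows):
    """Write a header line followed by data rows to a CSV file at path."""
    # newline="" stops the csv module from emitting blank lines on Windows
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(rows)


# Hypothetical rows; in the scraper these would be the collected titles
write_rows("chemical_names.csv", ["cid", "title"], [[1, "first title"], [2, "second title"]])
```

The scraper loop could collect [n, t] pairs in a list and pass them to write_rows once at the end.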
S9a8m added a comment.Oct 26 2018, 1:29 PM

Scraper now works well, however getting the relative molecular mass out is difficult - easier to build a function to calculate this for ourselves, using the data found here https://www.science.co.il/elements/?s=Weight

This comment was removed by Vemonet.

Information about PubChem RDF: https://pubchemdocs.ncbi.nlm.nih.gov/rdf

Downloading it: ftp://ftp.ncbi.nlm.nih.gov/pubchem/RDF/

SELECT ?chemical_compound ?chemical_compoundLabel WHERE {

SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
?chemical_compound wdt:P31 wd:Q11173.

}
LIMIT 100

S9a8m renamed this task from Scrape chemical names and data for WikiData from ChemSpider to Scrape chemical names and data for WikiData from chemical databases.Oct 26 2018, 2:45 PM
BorDeh added a subscriber: BorDeh.EditedOct 26 2018, 2:50 PM

Here is a small script to get the molecular mass from a chemical formula (with acetone as an example). It uses the Python package periodic to get the element masses (pip install periodic to get it)

import re
from periodic import element

def calcmolmass(chemformula):
    subparts = re.findall('[A-Z][^A-Z]*', chemformula)
    firstdigits = [re.search(r"\d", sp) for sp in subparts]
    firstdigits = [fd.start() if fd is not None else None for fd in firstdigits]
    nums = [float(sp[fd:]) if fd is not None else 1  for sp, fd in zip(subparts, firstdigits)]
    els  = [sp[:fd]        if fd is not None else sp for sp, fd in zip(subparts, firstdigits)]
    mass = sum(element(el).mass*num for el,num in zip(els,nums))
    return mass

print(calcmolmass('C3H6O'))
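If installing periodic is a problem, the same idea can be sketched with a single regular expression and a hand-made mass table. This is only an illustration: formula_mass and ATOMIC_MASS are names invented here, the table covers just a few elements, and brackets or charges are deliberately rejected.

```python
import re

# Small illustrative table of atomic masses; extend as needed
ATOMIC_MASS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999, "Cl": 35.453}

# One element symbol followed by an optional count, e.g. "C3" or "O"
TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def formula_mass(formula):
    """Sum atomic masses for a flat formula such as 'C3H6O'."""
    total = 0.0
    matched = 0
    for symbol, count in TOKEN.findall(formula):
        total += ATOMIC_MASS[symbol] * (int(count) if count else 1)
        matched += len(symbol) + len(count)
    if matched != len(formula):
        # Something other than symbols and counts was present
        raise ValueError("unsupported formula: {}".format(formula))
    return total


print(round(formula_mass("C3H6O"), 3))
```

An element missing from the table raises KeyError, so gaps in the table are visible rather than silently counted as zero.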
Vemonet added a comment.EditedOct 26 2018, 3:27 PM

Apparently most of PubChem is already uploaded to WikiData
But we could still do an exhaustive comparison to check for parts that have not been converted

Another option would be to work with a DrugBank dataset to load drug - drug interactions data (using property https://www.wikidata.org/wiki/Property:P129 )

To watch out for later: DrugBank will be soon available as RDF
https://blog.drugbankplus.com/ontotexts-partnership-with-to-open-new-perspectives-in-pharma-research/

The program that uploads the file will be written in Java due to problems with Unicode

Apparently not all of PubChem has been uploaded to WikiData:

SELECT ?s WHERE {
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
  ?s wdt:P662 "122876803" .
}
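To run this check for many CIDs, something like the sketch below could help. It only builds the request URL and parses a response; the response shown is a canned string in the standard SPARQL JSON results layout, not a live result. The helper names are invented for this example, and the endpoint is the public Wikidata Query Service.

```python
import json
import urllib.parse

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def cid_query(cid):
    """Build a SPARQL query for items whose PubChem CID (P662) equals cid."""
    return 'SELECT ?s WHERE {{ ?s wdt:P662 "{}" . }}'.format(int(cid))


def query_url(sparql):
    """Return the GET URL that asks the endpoint for JSON results."""
    return WDQS_ENDPOINT + "?" + urllib.parse.urlencode({"query": sparql, "format": "json"})


def item_ids(response_text):
    """Extract Q-ids from a SPARQL JSON results document."""
    bindings = json.loads(response_text)["results"]["bindings"]
    return [b["s"]["value"].rsplit("/", 1)[-1] for b in bindings]


# Canned responses (not fetched live): one hit and no hit
hit = '{"head": {"vars": ["s"]}, "results": {"bindings": [{"s": {"type": "uri", "value": "http://www.wikidata.org/entity/Q153"}}]}}'
miss = '{"head": {"vars": ["s"]}, "results": {"bindings": []}}'
print(query_url(cid_query(702)))
print(item_ids(hit))
print(item_ids(miss))
```

An empty list means no item carries that CID, which is the case worth collecting for later creation.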

Get the files here: ftp://ftp.ncbi.nlm.nih.gov/pubchem/RDF/descriptor/compound/

Husky moved this task from Doing to Backlog on the Wikistorm-2018 board.Oct 26 2018, 8:21 PM
samuwmde moved this task from Backlog to Doing on the Wikistorm-2018 board.Oct 27 2018, 7:21 AM
Vemonet added a comment.EditedOct 27 2018, 8:37 AM

Cannot find the entry for https://pubchem.ncbi.nlm.nih.gov/compound/134797139 in Wikidata (at least not with the PubChem CID), so PubChem data for this entry (and other entries) might be missing

Cannot find this compound using the InChiKey or Chemical formula either. It could be interesting to create the compounds that are not present in Wikidata

sql
SELECT ?p ?propLabel ?o WHERE {
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". } 
  #wd:Q153 ?p ?o . # Get Ethanol
  #?s wdt:P662 "702" . # get ethanol by PubChem ID
  #?s wdt:P235 "LFQSCWFLJHTTHZ-UHFFFAOYSA-N" . # Search ethanol by InChiKey
  #?s wdt:P274 "C₂H₆O" . # Search Ethanol by Chemical Formula

  #?s wdt:P662 "134797139" . # Get one of the last PubChem ID
  #?s wdt:P235 "XZHCKCNVYAHADQ-UHFFFAOYSA-N" . # Search by InChiKey
  ?s wdt:P274 "C₂₃H₃₀ClN₅O₃S" . # Search by Chemical Formula
  
  ?s ?p ?o .
  ?prop wikibase:directClaim ?p . #resolve prop Label 
  FILTER ( ?p != schema:description )
  FILTER ( ?p != rdfs:label )
  FILTER ( ?p != skos:altLabel )  
}
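Note that the chemical formula values in the query above use Unicode subscript digits, while scraped formulas usually come with plain digits. A small conversion sketch (the function name to_wikidata_formula is made up here):

```python
# Map ASCII digits to their Unicode subscript forms (U+2080 to U+2089)
SUBSCRIPTS = str.maketrans("0123456789", "₀₁₂₃₄₅₆₇₈₉")


def to_wikidata_formula(formula):
    """Convert plain digits to subscripts, e.g. 'C2H6O' becomes 'C₂H₆O'."""
    return formula.translate(SUBSCRIPTS)


print(to_wikidata_formula("C2H6O"))
print(to_wikidata_formula("C23H30ClN5O3S"))
```

This assumes every digit in the formula is a count; charges or isotope numbers would need separate handling.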
S9a8m added a comment.Oct 27 2018, 8:39 AM

Current program running as so:

# -*- coding: utf-8 -*-
"""
Created on Fri Oct 26 14:49:17 2018

@author: Sam
"""
#Scrape chemical information from PubChem, using tag <title>
import requests
import pandas as pd
import re

element=["H","He","Li","Be","B","C","N","O","F","Ne","Na","Mg","Al","Si","P","S","Cl","K","Ar","Ca","Sc","Ti","V","Cr","Mn","Fe","Ni","Co","Cu","Zn","Ga","Ge","As","Se","Br","Kr","Rb","Sr","Y","Zr","Nb","Mo","Tc","Ru","Rh","Pd","Ag","Cd","In","Sn","Sb","I","Te","Xe","Cs","Ba","La","Ce","Pr","Nd","Pm","Sm","Eu","Gd","Tb","Dy","Ho","Er","Tm","Yb","Lu","Hf","Ta","W","Re","Os","Ir","Pt","Au","Hg","Tl","Pb","Bi","Po","At","Rn","Fr","Ra","Ac","Pa","Th","Np","U","Am","Pu","Cm","Bk","Cf","Es","Fm","Md","No","Rf","Lr","Db","Bh","Sg","Mt","Hs"]
el_mass=[1.008,4.003,6.941,9.012,10.811,12.011,14.007,15.999,18.998,20.18,22.99,24.305,26.982,28.086,30.974,32.065,35.453,39.098,39.948,40.078,44.956,47.867,50.942,51.996,54.938,55.845,58.693,58.933,63.546,65.39,69.723,72.64,74.922,78.96,79.904,83.8,85.468,87.62,88.906,91.224,92.906,95.94,98,101.07,102.906,106.42,107.868,112.411,114.818,118.71,121.76,126.905,127.6,131.293,132.906,137.327,138.906,140.116,140.908,144.24,145,150.36,151.964,157.25,158.925,162.5,164.93,167.259,168.934,173.04,174.967]
undef=[]

def calcmolmass(chemformula):
    wocharge=chemformula.split("+")[0].split("-")[0]
    subparts = re.findall('[A-Z][^A-Z]*', wocharge)
    firstdigits = [re.search(r"\d", sp) for sp in subparts]
    firstdigits = [fd.start() if fd is not None else None for fd in firstdigits]
    nums = [float(sp[fd:]) if fd is not None else 1  for sp, fd in zip(subparts, firstdigits)]
    els  = [sp[:fd]        if fd is not None else sp for sp, fd in zip(subparts, firstdigits)]
    mass = sum(el_mass[element.index(els[n])]*nums[n] for n in range(len(els)))
    return float("{:.3f}".format(mass))

chemicals=[]

for n in range(1,1000):
    url="https://pubchem.ncbi.nlm.nih.gov/compound/"+str(n)
    r = requests.get(url)
    body=r.text
    try:
        # Title looks like "name | formula - PubChem"
        title1 = body.split('<title>')[1]
        t = title1.split('</title>')[0]
        title = t.split(' | ')

        chem_n = title[0]
        chem_f = title[1].split(' - ')[0]
    except IndexError:
        # No usable <title> on this page, so there is nothing to record
        print("{:} - (no title found)".format(url))
        continue

    try:
        chem_mr=calcmolmass(chem_f)
        chemicals.append([chem_n, chem_f, chem_mr])
    except (ValueError, IndexError):
        # Formula contains an element with no mass in el_mass
        print("{:} - ({:})".format(url, chem_f))

col_name=["Name", "Formulae", "Mr"]
df = pd.DataFrame(chemicals, columns=col_name)

df.to_csv('chemical_database.csv', sep=',', index=False)
print(df)

Bit chunky but WORKING - now need to upload csv to WikiData.

Information about the contents of the different directories: ftp://ftp.ncbi.nlm.nih.gov/pubchem/RDF/README

Apparently some PubChem entries don't have a human-readable name; they are named after their InChiKey. Do we want those entries too? Example: https://pubchem.ncbi.nlm.nih.gov/compound/134797139

  • CID - InChiKey here: ftp://ftp.ncbi.nlm.nih.gov/pubchem/RDF/inchikey/
  • Descriptors: ftp://ftp.ncbi.nlm.nih.gov/pubchem/RDF/descriptor

The link between compounds and descriptors can be found here (may not be useful):
ftp://ftp.ncbi.nlm.nih.gov/pubchem/RDF/compound/general/pc_compound2descriptor_000001.ttl.gz

Attached: a JSON list of chemicals with no mass on WikiData, but which have PubChem IDs

ftp://ftp.ncbi.nlm.nih.gov/pubchem/RDF/descriptor/compound/
Getting DefinedAtomStereoCount

Scrape mol weight from pubchem JSON:

molWeight = data.split("Molecular Weight")[1].split('"NumValue": ')[1].split(",")[0]
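An alternative to string splitting is to parse the JSON and search it. The sketch below is a guess at how that could look: the sample document is invented and heavily trimmed, the real nesting of the PubChem response may differ, and find_record and find_key are helper names made up here. It only relies on the "Name" and "NumValue" keys that the split-based line already depends on.

```python
import json


def find_record(node, name):
    """Depth-first search for the first dict whose "Name" equals name."""
    if isinstance(node, dict):
        if node.get("Name") == name:
            return node
        children = node.values()
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_record(child, name)
        if found is not None:
            return found
    return None


def find_key(node, key):
    """Depth-first search for the first value stored under key."""
    if isinstance(node, dict):
        if key in node:
            return node[key]
        children = node.values()
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_key(child, key)
        if found is not None:
            return found
    return None


# Hypothetical, trimmed response; not copied from PubChem
sample = '{"Record": {"Section": [{"Name": "Molecular Weight", "Information": [{"NumValue": 46.07}]}]}}'
record = find_record(json.loads(sample), "Molecular Weight")
print(find_key(record, "NumValue"))
```

Searching by key rather than by position keeps working if the whitespace or ordering of the response changes.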

os.system(command) can be used to run terminal programs from Python.

To compile a java file in terminal

javac filename.java

To run the generated class file in terminal, use

java filename

This is for linux
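As a possible alternative to os.system, the standard subprocess module also captures the program's output. A sketch, with run_command and java_commands as invented helper names; the demo call uses the Python interpreter itself so it runs without a JDK installed.

```python
import subprocess
import sys


def run_command(args):
    """Run a program with arguments and return (exit code, stdout text)."""
    done = subprocess.run(
        args,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # decode output to str instead of bytes
    )
    return done.returncode, done.stdout


def java_commands(source_file):
    """Return the compile and run argument lists for a single-class Java file."""
    if source_file.endswith(".java"):
        class_name = source_file[:-len(".java")]
    else:
        class_name = source_file
    return ["javac", source_file], ["java", class_name]


# Demonstrated with the Python interpreter so the example runs anywhere
print(run_command([sys.executable, "-c", "print('ok')"]))
print(java_commands("filename.java"))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.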

Using code below, running in blocks of 500. Make sure to change the destination for the csv file.

# -*- coding: utf-8 -*-
"""
Created on Sat Oct 27 11:42:39 2018

@author: Sam
"""
import requests
import pandas as pd
import json

nomass_pubchems = [int(chem["pubchem"]) for chem in json.loads(open("C:/Users/Sam/Desktop/no_mass_chems (1).json").read())]

chemicals=[]
for n in range(len(nomass_pubchems)):
    try:
        file = requests.get("https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/"+str(nomass_pubchems[n])+"/JSON/?")
        data = file.text
        name = data.split('"Record Title",\n                "StringValue": ')[1].split('\n')[0]
        try:
            IUPACname = data.split('"Name": "IUPAC')[1].split('Value": ')[1].split('\n')[0]
        except:
            IUPACname = 'N/A'
        formula = data.split('"Molecular Formula",\n                "StringValue": ')[1].split('\n')[0]
        molWeight = data.split("Molecular Weight")[1].split('"NumValue": ')[1].split(",")[0]
        def_stereocount = data.split("Defined Atom Stereocenter Count")[1].split('NumValue": ')[1].split('\n')[0]
        
        chemicals.append([name, IUPACname, formula, molWeight, def_stereocount])
    except:
        print("Exception raised: PubChem compound #{:}".format(nomass_pubchems[n]))
    
col_name=["name", "IUPAC", "Formulae", "Mr", "Stereo"]
df = pd.DataFrame(chemicals, columns=col_name)

df.to_csv('C:/Users/Sam/Desktop/chemical_database.csv', sep=',', index=False)
print(df)

I'll run 0-500
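For splitting the work, a small helper like the one below could hand out the blocks of 500 and give each block its own output file. The name blocks, the stand-in ID list and the file name pattern are assumptions for the example.

```python
def blocks(items, size=500):
    """Yield (start index, slice) pairs, each slice at most size long."""
    for start in range(0, len(items), size):
        yield start, items[start:start + size]


# Hypothetical stand-in for the list of PubChem IDs
ids = list(range(1200))
for start, chunk in blocks(ids):
    # One output file per block, so a crash only costs the current block
    print("chemical_database_{}.csv".format(start), len(chunk))
```

Each person can then take one start index, and the resulting CSV files can be concatenated afterwards.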

Vemonet added a comment.EditedOct 27 2018, 10:51 AM

Get the RDF description of a compound

curl -L -H "Accept: text/rdf" -o CID2244.rdf http://rdf.ncbi.nlm.nih.gov/pubchem/compound/CID2244

Or just https://pubchem.ncbi.nlm.nih.gov/rest/rdf/compound/CID2244.rdf

Updated code without the horrible horrible bugs below:

# -*- coding: utf-8 -*-
"""
Created on Sat Oct 27 11:42:39 2018

@author: Sam
"""
import requests
import pandas as pd
import json

nomass_pubchems = [int(chem["pubchem"]) for chem in json.loads(open("C:/Users/Sam/Desktop/no_mass_chems (1).json").read())]

chemicals=[]
for n in range(100):
    print("Processing PubChem compound #{:}".format(nomass_pubchems[n]))
    try:
        file = requests.get("https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/"+str(nomass_pubchems[n])+"/JSON/?")
        data = file.text
        name = data.split('"Name": "Record Title",')[1].split('"StringValue": "')[1].split('"')[0]
        #print(name)
        try:
            IUPACname = data.split('"Name": "IUPAC')[1].split('Value": "')[1].split('"')[0]
            #print(IUPACname)
        except:
            IUPACname = 'N/A'
        formula = data.split('"Name": "Molecular Formula"')[1].split('"StringValue": "')[1].split('"')[0]
        #print(formula)
        molWeight = data.split('"Molecular Weight')[1].split('NumValue": ')[1].split(',')[0]
        #print(molWeight)
        def_stereocount = ("".join(data.split('"Defined Atom Stereocenter Count"')[1:])).split('NumValue": ')[1].split('\n')[0]
        #print(def_stereocount)
        
        chemicals.append([name, IUPACname, formula, molWeight, def_stereocount])
    except Exception as e:
        print(e)
        print("Exception raised: PubChem compound #{:}".format(nomass_pubchems[n]))
    
col_name=["name", "IUPAC", "Formulae", "Mr", "Stereo"]
df = pd.DataFrame(chemicals, columns=col_name)

df.to_csv('C:/Users/Sam/Desktop/chemical_database0.csv', sep=',', index=False)
print(df)

So the last one still had horrible bugs, this one is better(-ish) I promise:

# -*- coding: utf-8 -*-
"""
Created on Sat Oct 27 11:42:39 2018

@author: Sam
"""
import requests
import pandas as pd
import json

nomass_pubchems = [int(chem["pubchem"]) for chem in json.loads(open("C:/Users/Sam/Desktop/no_mass_chems (1).json").read())]

chemicals=[]
for n in range(201,300):
    print("Processing PubChem compound #{:}".format(nomass_pubchems[n]))
    try:
        file = requests.get("https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/"+str(nomass_pubchems[n])+"/JSON/?")
        data = file.text
        name = ("".join(data.split('"Name": "Record Title",')[1:])).split('"StringValue": "')[1].split('"')[0]
        #print(name)
        try:
            IUPACname = ("".join(data.split('"Name": "IUPAC')[1:])).split('Value": "')[1].split('"')[0]
            #print(IUPACname)
        except:
            IUPACname = 'N/A'
        formula = ("".join(data.split('"Name": "Molecular Formula"')[1:])).split('"StringValue": "')[1].split('"')[0]
        #print(formula)
        molWeight = data.split('"Molecular Weight')[1].split('NumValue": ')[1].split(',')[0]
        #print(molWeight)
        def_stereocount = ("".join(data.split('"Defined Atom Stereocenter Count"')[1:])).split('NumValue": ')[1].split('\n')[0]
        #print(def_stereocount)
        
        chemicals.append([name, IUPACname, formula, molWeight, def_stereocount])
    except Exception as e:
        print(e)
        print("Exception raised: PubChem compound #{:}".format(nomass_pubchems[n]))
    
col_name=["name", "IUPAC", "Formulae", "Mr", "Stereo"]
df = pd.DataFrame(chemicals, columns=col_name)

df.to_csv('C:/Users/Sam/Desktop/chemical_database2.csv', sep=',', index=False)
print(df)

And finally, one that actually works:

# -*- coding: utf-8 -*-
"""
Created on Sat Oct 27 11:42:39 2018

@author: Sam
"""
import requests
import pandas as pd
import json

nomass_pubchems = [int(chem["pubchem"]) for chem in json.loads(open("C:/Users/Sam/Desktop/no_mass_chems (1).json").read())]

chemicals=[]
for n in range(500):
    print("Processing PubChem compound #{:}".format(nomass_pubchems[n]))
    try:
        file = requests.get("https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/"+str(nomass_pubchems[n])+"/JSON/?")
        data = file.text
        name = ("".join(data.split('"Name": "Record Title",')[1:])).split('"StringValue": "')[1].split('"')[0]
        #print(name)
        try:
            IUPACname = ("".join(data.split('"Name": "IUPAC')[1:])).split('Value": "')[1].split('"')[0]
            #print(IUPACname)
        except:
            IUPACname = 'N/A'
        formula = ("".join(data.split('"Name": "Molecular Formula"')[1:])).split('"StringValue": "')[1].split('"')[0]
        #print(formula)
        molWeight = ("".join(data.split('"Molecular Weight')[1:])).split('NumValue": ')[1].split(',')[0]
        #print(molWeight)
        def_stereocount = ("".join(data.split('"Defined Atom Stereocenter Count"')[1:])).split('NumValue": ')[1].split('\n')[0]
        #print(def_stereocount)
        
        chemicals.append([name, IUPACname, formula, molWeight, def_stereocount])
    except Exception as e:
        print(e)
        print("Exception raised: PubChem compound #{:}".format(nomass_pubchems[n]))
    
col_name=["name", "IUPAC", "Formulae", "Mr", "Stereo"]
df = pd.DataFrame(chemicals, columns=col_name)

df.to_csv('C:/Users/Sam/Desktop/chemical_database.csv', sep=',', index=False)
print(df)

Chems 0-500 formatted:

This isn't even my final form...

The final database!!

S9a8m added a comment.Oct 27 2018, 1:28 PM

Here is the small quickstatements code

Simple (and dirty) Python scripts to get data out of PubChem Turtle files (here the InChIKey, but the same could be done for other information)

https://github.com/vemonet/wikimedia-update-pubchem

Ecritures moved this task from Doing to Backlog on the Wikistorm-2018 board.May 22 2019, 3:41 PM