Maniphest T208036

Scrape chemical names and data for WikiData from chemical database chemspider.com
Open, Low, Public

Assigned To

None

Authored By

	S9a8m
	Oct 26 2018, 11:53 AM

Description

ChemSpider is a database of 63 million chemical structures and features. The URLs for these pages are systematic: www.chemspider.com/Chemical-Structure.{X}.html, where {X} is an integer. Adding this data to WikiData would more than double the number of data points in WikiData.

Event Timeline

S9a8m created this task. Oct 26 2018, 11:53 AM

S9a8m added a project: Wikidata.

Studying the HTML file of a typical page (http://www.chemspider.com/Chemical-Structure.175.html) shows that the tag <h1 class="h4"> is only used once, in close proximity to the chemical name. Need to learn how to open HTML and extract characters following this tag. Will approach with Python.
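
A minimal sketch of that extraction, using only the standard library's html.parser. The sample HTML string is made up for illustration and the real ChemSpider markup may differ; the names FirstH1Text and extract_name are hypothetical.

```python
from html.parser import HTMLParser


class FirstH1Text(HTMLParser):
    """Collect the text inside the first <h1 class="h4"> element."""

    def __init__(self):
        super().__init__()
        self.inside = False  # True while between <h1 class="h4"> and </h1>
        self.done = False    # True once the first matching element has closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1" and not self.done and dict(attrs).get("class") == "h4":
            self.inside = True

    def handle_endtag(self, tag):
        if tag == "h1" and self.inside:
            self.inside = False
            self.done = True

    def handle_data(self, data):
        if self.inside:
            self.parts.append(data)


def extract_name(html):
    """Return the stripped text of the first <h1 class="h4">, or "" if absent."""
    parser = FirstH1Text()
    parser.feed(html)
    return "".join(parser.parts).strip()


# Made-up sample, not a copy of a real ChemSpider page
sample = '<html><body><h1 class="h4"> Acetic acid </h1><p>other</p></body></html>'
print(extract_name(sample))
```

Compared with splitting the page on the tag string, a parser keeps working if the element contains nested tags or extra whitespace.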

Laffano subscribed. Oct 26 2018, 11:58 AM

Loikke moved this task from incoming to in progress on the Wikidata board. Oct 26 2018, 12:20 PM

Loikke moved this task from Backlog to Doing on the Wikistorm-2018 board.

@S9a8m, let me know if you need some assistance with scraping; I can help.

Any help would be much appreciated. I have no experience scraping with Python, so I am likely to be a bit slow.

I'd like to help with this - I can help with the scraping (if that kind of thing is actually allowed). Where can I find you?

An alternative place to get data would be https://www.accessdata.fda.gov/scripts/cder/daf/index.cfm?event=overview.process&ApplNo={X} (again {X} being an integer) which lists FDA-approved medicinal compounds

Currently working in the main hacking room, at the small table close to the door

FYI: there is an API to query ChemSpider https://developer.rsc.org/ but it is limited to 1000 calls a month

Be careful: some entries are deprecated, e.g. http://www.chemspider.com/Chemical-Structure.176.html

Other chemical databases are available:
https://www.ebi.ac.uk/chembl/downloads in SQL
http://mychem.info/ in RDF
More coming :)

str.find(stringName, stringToFind, startPos) will return -1 if not found
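
A small sketch of how the -1 return value can be handled when pulling text out from between two markers. The helper name text_between and the sample page string are assumptions made for illustration.

```python
def text_between(text, start_marker, end_marker, start_pos=0):
    """Return the text between two markers, or None if either is missing."""
    begin = text.find(start_marker, start_pos)
    if begin == -1:          # find() signals "not found" with -1
        return None
    begin += len(start_marker)
    end = text.find(end_marker, begin)
    if end == -1:
        return None
    return text[begin:end]


# Made-up page content for illustration
page = "<html><title>Ethanol | C2H6O - example</title></html>"
print(text_between(page, "<title>", "</title>"))
print(text_between(page, "<h2>", "</h2>"))
```

Checking for -1 explicitly matters because -1 is also a valid slice index, so an unchecked result silently produces the wrong substring.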

PubChem is open, but I don't know what is inside
https://pubchemdocs.ncbi.nlm.nih.gov/downloads

Okay, we can now print names from ChemSpider, although it seems to slow down after 11 entries. Will try to apply a similar approach to PubChem.

The code below scrapes chemical names from PubChem really nicely; will try to get more data out. Could someone make a csv writer module?

#Scrape chemical information from PubChem, using tag <title>
import requests

for n in range(1000):
    r = requests.get("https://pubchem.ncbi.nlm.nih.gov/compound/"+str(n))
    body=r.text
    title1 = body.split('<title>')[1]
    t = title1.split('</title>')[0]
    print(t)
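
One possible shape for the csv writer asked for above, as a sketch using the standard csv module. The function name write_rows and the sample rows are assumptions; an in-memory buffer stands in for a real file.

```python
import csv
import io


def write_rows(handle, header, rows):
    """Write a header line followed by the data rows to an open text handle."""
    writer = csv.writer(handle)
    writer.writerow(header)
    writer.writerows(rows)


# Made-up rows; the second name contains a comma to show automatic quoting
rows = [["Ethanol", "C2H6O"], ["Benzene, methyl-", "C7H8"]]
buffer = io.StringIO()
write_rows(buffer, ["Name", "Formula"], rows)
print(buffer.getvalue())
```

For a real file, opening it with open(path, "w", newline="", encoding="utf-8") avoids blank lines between rows on Windows.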

Scraper now works well, however getting the relative molecular mass out is difficult - easier to build a function to calculate this for ourselves, using the data found here https://www.science.co.il/elements/?s=Weight

Vemonet added a comment. Oct 26 2018, 1:48 PM

This comment was removed by Vemonet.

Information about PubChem RDF: https://pubchemdocs.ncbi.nlm.nih.gov/rdf

Downloading it: ftp://ftp.ncbi.nlm.nih.gov/pubchem/RDF/

SELECT ?chemical_compound ?chemical_compoundLabel WHERE {

SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
?chemical_compound wdt:P31 wd:Q11173.

}
LIMIT 100

https://pubchem.ncbi.nlm.nih.gov/rest/rdf/descriptor/CID131475303_Exact_Mass.html

S9a8m renamed this task from Scrape chemical names and data for WikiData from ChemSpider to Scrape chemical names and data for WikiData from chemical databases. Oct 26 2018, 2:45 PM

Here is a small script to get the molecular mass from a chemical formula (with acetone as an example). It uses the Python package periodic to get the element masses (pip install periodic to get it).

import re
from periodic import element

def calcmolmass(chemformula):
    subparts = re.findall('[A-Z][^A-Z]*', chemformula)
    firstdigits = [re.search(r"\d", sp) for sp in subparts]  # raw string avoids an invalid-escape warning
    firstdigits = [fd.start() if fd is not None else None for fd in firstdigits]
    nums = [float(sp[fd:]) if fd is not None else 1  for sp, fd in zip(subparts, firstdigits)]
    els  = [sp[:fd]        if fd is not None else sp for sp, fd in zip(subparts, firstdigits)]
    mass = sum(element(el).mass*num for el,num in zip(els,nums))
    return mass

print(calcmolmass('C3H6O'))
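
For comparison, a dependency-free sketch of the same idea. The mass table is a deliberately small, illustrative subset, and the names ATOMIC_MASS and formula_mass are assumptions. Like the script above, it only handles flat formulas without brackets or charges, but it rejects anything it cannot parse instead of guessing.

```python
import re

# Small illustrative subset of atomic masses in g/mol; extend as needed
ATOMIC_MASS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999,
               "S": 32.065, "Cl": 35.453}

# One element symbol (capital plus optional lowercase letter) and an optional count
TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def formula_mass(formula):
    """Sum atomic masses for a flat formula such as 'C3H6O'."""
    total = 0.0
    pos = 0
    for match in TOKEN.finditer(formula):
        if match.start() != pos:
            raise ValueError("cannot parse formula: " + formula)
        symbol, count = match.groups()
        if symbol not in ATOMIC_MASS:
            raise ValueError("no mass known for element: " + symbol)
        total += ATOMIC_MASS[symbol] * (int(count) if count else 1)
        pos = match.end()
    if pos != len(formula):
        raise ValueError("cannot parse formula: " + formula)
    return round(total, 3)


print(formula_mass("C3H6O"))
```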

Apparently most of PubChem is already uploaded to WikiData.
But we could still do an exhaustive comparison to check for parts that have not been converted.

Another option would be to work with a DrugBank dataset to load drug - drug interactions data (using property https://www.wikidata.org/wiki/Property:P129 )

To watch out for later: DrugBank will soon be available as RDF
https://blog.drugbankplus.com/ontotexts-partnership-with-to-open-new-perspectives-in-pharma-research/

The program that uploads the file will be written in Java, due to problems with Unicode.

Drug drug interactions

http://graphdb.dumontierlab.com/resource?uri=http:%2F%2Fdata2services%2Fmodel%2Fassociation%2Fdrug-interaction%2F15b0f4389e25a981b07f41741ea4d848

Apparently PubChem has not been completely uploaded to WikiData:

SELECT ?s WHERE {
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
  ?s wdt:P662 "122876803" .
}
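
A sketch of how a lookup like this could be run from Python against the public Wikidata SPARQL endpoint, using only the standard library. The function names and the User-Agent string are assumptions, and the label service is left out because only ?s is selected. The network call sits in a function that is not invoked here.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://query.wikidata.org/sparql"


def build_cid_query(cid):
    """SPARQL query for the items whose PubChem CID (P662) equals cid."""
    return 'SELECT ?s WHERE { ?s wdt:P662 "%d" . }' % int(cid)


def items_for_cid(cid):
    """Return the item URIs that carry this PubChem CID (empty list if none)."""
    params = urllib.parse.urlencode({"query": build_cid_query(cid), "format": "json"})
    request = urllib.request.Request(
        ENDPOINT + "?" + params,
        headers={"User-Agent": "pubchem-check-sketch/0.1"},  # hypothetical agent name
    )
    with urllib.request.urlopen(request) as response:
        data = json.load(response)
    return [binding["s"]["value"] for binding in data["results"]["bindings"]]


print(build_cid_query(122876803))
```

An empty list from items_for_cid would mean that no item holds that CID at the time of the query.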

Get the files here: ftp://ftp.ncbi.nlm.nih.gov/pubchem/RDF/descriptor/compound/

Husky moved this task from Doing to Backlog on the Wikistorm-2018 board. Oct 26 2018, 8:21 PM

samuwmde moved this task from Backlog to Doing on the Wikistorm-2018 board. Oct 27 2018, 7:21 AM

Cannot find the entry for https://pubchem.ncbi.nlm.nih.gov/compound/134797139 in Wikidata (at least not with the PubChem CID), so PubChem data for this entry (and other entries) might be missing.

Cannot find this compound using the InChiKey or Chemical formula either. It could be interesting to create the compounds that are not present in Wikidata

SELECT ?p ?propLabel ?o WHERE {
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". } 
  #wd:Q153 ?p ?o . # Get Ethanol
  #?s wdt:P662 "702" . # get ethanol by PubChem ID
  #?s wdt:P235 "LFQSCWFLJHTTHZ-UHFFFAOYSA-N" . # Search ethanol by InChiKey
  #?s wdt:P274 "C₂H₆O" . # Search Ethanol by Chemical Formula

  #?s wdt:P662 "134797139" . # Get one of the last PubChem ID
  #?s wdt:P235 "XZHCKCNVYAHADQ-UHFFFAOYSA-N" . # Search by InChiKey
  ?s wdt:P274 "C₂₃H₃₀ClN₅O₃S" . # Search by Chemical Formula
  
  ?s ?p ?o .
  ?prop wikibase:directClaim ?p . #resolve prop Label 
  FILTER ( ?p != schema:description )
  FILTER ( ?p != rdfs:label )
  FILTER ( ?p != skos:altLabel )  
}
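
The formulas in the query above are written with Unicode subscript digits, while page titles and most files use plain digits. A small sketch of converting between the two forms; the helper names are assumptions.

```python
# Translation tables between ASCII digits and Unicode subscript digits
TO_SUBSCRIPT = str.maketrans("0123456789", "₀₁₂₃₄₅₆₇₈₉")
TO_PLAIN = str.maketrans("₀₁₂₃₄₅₆₇₈₉", "0123456789")


def to_subscript_formula(formula):
    """'C2H6O' -> 'C₂H₆O' (every digit becomes a subscript)."""
    return formula.translate(TO_SUBSCRIPT)


def to_plain_formula(formula):
    """'C₂H₆O' -> 'C2H6O'."""
    return formula.translate(TO_PLAIN)


print(to_subscript_formula("C23H30ClN5O3S"))
print(to_plain_formula("C₂H₆O"))
```

Every digit is converted, so a formula that carries a charge such as "2+" would need that part handled separately.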

Current program running as so:

# -*- coding: utf-8 -*-
"""
Created on Fri Oct 26 14:49:17 2018

@author: Sam
"""
#Scrape chemical information from PubChem, using tag <title>
import requests
import pandas as pd
import re

element=["H","He","Li","Be","B","C","N","O","F","Ne","Na","Mg","Al","Si","P","S","Cl","K","Ar","Ca","Sc","Ti","V","Cr","Mn","Fe","Ni","Co","Cu","Zn","Ga","Ge","As","Se","Br","Kr","Rb","Sr","Y","Zr","Nb","Mo","Tc","Ru","Rh","Pd","Ag","Cd","In","Sn","Sb","I","Te","Xe","Cs","Ba","La","Ce","Pr","Nd","Pm","Sm","Eu","Gd","Tb","Dy","Ho","Er","Tm","Yb","Lu","Hf","Ta","W","Re","Os","Ir","Pt","Au","Hg","Tl","Pb","Bi","Po","At","Rn","Fr","Ra","Ac","Pa","Th","Np","U","Am","Pu","Cm","Bk","Cf","Es","Fm","Md","No","Rf","Lr","Db","Bh","Sg","Mt","Hs"]
el_mass=[1.008,4.003,6.941,9.012,10.811,12.011,14.007,15.999,18.998,20.18,22.99,24.305,26.982,28.086,30.974,32.065,35.453,39.098,39.948,40.078,44.956,47.867,50.942,51.996,54.938,55.845,58.693,58.933,63.546,65.39,69.723,72.64,74.922,78.96,79.904,83.8,85.468,87.62,88.906,91.224,92.906,95.94,98,101.07,102.906,106.42,107.868,112.411,114.818,118.71,121.76,126.905,127.6,131.293,132.906,137.327,138.906,140.116,140.908,144.24,145,150.36,151.964,157.25,158.925,162.5,164.93,167.259,168.934,173.04,174.967]
# NOTE: both lists are ordered by increasing mass, and el_mass stops at Lu (71 values)
# while element has more entries, so formulas containing heavier elements raise
# IndexError in calcmolmass and are reported by the loop's error handling.
undef=[]

def calcmolmass(chemformula):
    wocharge=chemformula.split("+")[0].split("-")[0]
    subparts = re.findall('[A-Z][^A-Z]*', wocharge)
    firstdigits = [re.search(r"\d", sp) for sp in subparts]  # raw string avoids an invalid-escape warning
    firstdigits = [fd.start() if fd is not None else None for fd in firstdigits]
    nums = [float(sp[fd:]) if fd is not None else 1  for sp, fd in zip(subparts, firstdigits)]
    els  = [sp[:fd]        if fd is not None else sp for sp, fd in zip(subparts, firstdigits)]
    mass = sum(el_mass[element.index(els[n])]*nums[n] for n in range(len(els)))
    return float("{:.3f}".format(mass))

chemicals=[]

for n in range(1,1000):
    url="https://pubchem.ncbi.nlm.nih.gov/compound/"+str(n)
    r = requests.get(url)
    body=r.text
    try:
        # The page title is split as "<name> | <formula> - ..."
        title1 = body.split('<title>')[1]
        t = title1.split('</title>')[0]
        title = t.split(' | ')

        chem_n = title[0]
        chem_f = title[1].split(' - ')[0]
    except IndexError:
        # No <title> tag or an unexpected title layout: report the URL and move on
        print("{:} - (title not parsed)".format(url))
        continue
    try:
        chem_mr=calcmolmass(chem_f)
        chemicals.append([chem_n, chem_f, chem_mr])
    except (ValueError, IndexError):
        # Element missing from the tables above, or formula not parseable
        print("{:} - ({:})".format(url, chem_f))

col_name=["Name", "Formulae", "Mr"]
df = pd.DataFrame(chemicals, columns=col_name)

df.to_csv('chemical_database.csv', sep=',', index=False)
print(df)

Bit chunky but WORKING - now need to upload csv to WikiData.

Information about the contents of the different directories: ftp://ftp.ncbi.nlm.nih.gov/pubchem/RDF/README

Apparently some PubChem entries don't have a human-readable name; they are named after their InChIKey. Do we want those entries too? Example: https://pubchem.ncbi.nlm.nih.gov/compound/134797139

CID - InChiKey here: ftp://ftp.ncbi.nlm.nih.gov/pubchem/RDF/inchikey/

Descriptors: ftp://ftp.ncbi.nlm.nih.gov/pubchem/RDF/descriptor

The link between compounds and descriptors can be found here (may not be useful):
ftp://ftp.ncbi.nlm.nih.gov/pubchem/RDF/compound/general/pc_compound2descriptor_000001.ttl.gz

Attached: a JSON list of chemicals with no mass on WikiData, but which have PubChem IDs

no_mass_chems.json (217 KB)

Getting triples linked to a PubChem CID: https://pubchem.ncbi.nlm.nih.gov/rest/rdf/compound/CID2110.html

Using PubChem REST API: https://pubchemdocs.ncbi.nlm.nih.gov/rdf$_5-2

https://pubchem.ncbi.nlm.nih.gov/rest/rdf/query?graph=compound&predicate=rdf:type

But this returns only 10 000 results. The next 10 000 records (10 001 to 20 000) can be retrieved using the following query:
https://pubchem.ncbi.nlm.nih.gov/rest/rdf/query?graph=compound&predicate=rdf:type&offset=10000
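
A sketch of generating the paged query URLs shown above. The page size of 10 000 comes from the comment above; the function name page_url and its parameters are assumptions.

```python
from urllib.parse import urlencode

BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/rdf/query"
PAGE_SIZE = 10000  # number of results returned per request, as noted above


def page_url(page, graph="compound", predicate="rdf:type"):
    """URL for the given zero-based page of results."""
    params = {"graph": graph, "predicate": predicate}
    if page > 0:
        params["offset"] = page * PAGE_SIZE
    # safe=":" keeps "rdf:type" readable instead of percent-encoding the colon
    return BASE + "?" + urlencode(params, safe=":")


for page in range(3):
    print(page_url(page))
```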

URL to properties JSON

https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/9/JSON/?

ftp://ftp.ncbi.nlm.nih.gov/pubchem/RDF/descriptor/compound/
Getting DefinedAtomStereoCount

Scrape mol weight from pubchem JSON:

molWeight = data.split("Molecular Weight")[1].split('"NumValue": ')[1].split(",")[0]
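
An alternative sketch that parses the response as JSON instead of splitting the raw text. It assumes, as the split strings above suggest, that "Name" and "NumValue" sit in the same JSON object; the exact nesting of the real response is not assumed, so the helper searches the whole tree. The function name find_value and the sample document are made up for illustration.

```python
import json


def find_value(node, name, key):
    """Depth-first search for a dict whose "Name" equals name and that has key."""
    if isinstance(node, dict):
        if node.get("Name") == name and key in node:
            return node[key]
        children = list(node.values())
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_value(child, name, key)
        if found is not None:
            return found
    return None


# Hand-made sample, far smaller than a real response
sample = json.loads(
    '{"Record": {"Section": [{"Information": ['
    '{"Name": "Molecular Weight", "NumValue": 46.069}]}]}}'
)
print(find_value(sample, "Molecular Weight", "NumValue"))
```

With a real response, the text would first be passed through json.loads and the same call used for "StringValue" fields.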

os.system(command) can be used in Python to run terminal programs.

To compile a java file in terminal

javac filename.java

To run the generated class file in terminal, use

java filename

This is for linux
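
As a sketch, the standard subprocess module is an alternative to os.system that also captures the output and exit code. The Python interpreter stands in for javac and java here so the example runs without a Java installation; the helper name run is an assumption.

```python
import subprocess
import sys


def run(command):
    """Run a command given as a list of arguments; return (exit code, stdout)."""
    result = subprocess.run(command, capture_output=True, text=True)
    return result.returncode, result.stdout


# For the Java case the argument lists would be ["javac", "filename.java"]
# followed by ["java", "filename"]; Python is used here as a stand-in.
code, out = run([sys.executable, "-c", "print('hello')"])
print(code, out.strip())
```

Passing the command as a list avoids shell quoting problems with file names that contain spaces.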

Using code below, running in blocks of 500. Make sure to change the destination for the csv file.
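
A sketch of splitting the ID list into blocks of 500 so that each person can take one block. The generator name blocks and the stand-in ID list are assumptions; the file names follow the pattern of the CSV files attached to this task.

```python
def blocks(items, size=500):
    """Yield (start index, sublist) pairs covering items in steps of size."""
    for start in range(0, len(items), size):
        yield start, items[start:start + size]


ids = list(range(1234))  # stand-in for the list of PubChem IDs
for start, block in blocks(ids):
    filename = "chemical_database{}-{}.csv".format(start, start + len(block))
    print(filename, len(block))
```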

# -*- coding: utf-8 -*-
"""
Created on Sat Oct 27 11:42:39 2018

@author: Sam
"""
import requests
import pandas as pd
import json

nomass_pubchems = [int(chem["pubchem"]) for chem in json.loads(open("C:/Users/Sam/Desktop/no_mass_chems (1).json").read())]

chemicals=[]
for n in range(len(nomass_pubchems)):
    try:
        file = requests.get("https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/"+str(nomass_pubchems[n])+"/JSON/?")
        data = file.text
        name = data.split('"Record Title",\n                "StringValue": ')[1].split('\n')[0]
        try:
            IUPACname = data.split('"Name": "IUPAC')[1].split('Value": ')[1].split('\n')[0]
        except:
            IUPACname = 'N/A'
        formula = data.split('"Molecular Formula",\n                "StringValue": ')[1].split('\n')[0]
        molWeight = data.split("Molecular Weight")[1].split('"NumValue": ')[1].split(",")[0]
        def_stereocount = data.split("Defined Atom Stereocenter Count")[1].split('NumValue": ')[1].split('\n')[0]
        
        chemicals.append([name, IUPACname, formula, molWeight, def_stereocount])
    except:
        print("Exception raised: PubChem compound #{:}".format(nomass_pubchems[n]))
    
col_name=["name", "IUPAC", "Formulae", "Mr", "Stereo"]
df = pd.DataFrame(chemicals, columns=col_name)

df.to_csv('C:/Users/Sam/Desktop/chemical_database.csv', sep=',', index=False)
print(df)

I'll run 0-500

Get the RDF description of a compound

curl -L -H "Accept: text/rdf" -o CID2244.rdf http://rdf.ncbi.nlm.nih.gov/pubchem/compound/CID2244

Or just https://pubchem.ncbi.nlm.nih.gov/rest/rdf/compound/CID2244.rdf

Updated code without the horrible horrible bugs below:

# -*- coding: utf-8 -*-
"""
Created on Sat Oct 27 11:42:39 2018

@author: Sam
"""
import requests
import pandas as pd
import json

nomass_pubchems = [int(chem["pubchem"]) for chem in json.loads(open("C:/Users/Sam/Desktop/no_mass_chems (1).json").read())]

chemicals=[]
for n in range(100):
    print("Processing PubChem compound #{:}".format(nomass_pubchems[n]))
    try:
        file = requests.get("https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/"+str(nomass_pubchems[n])+"/JSON/?")
        data = file.text
        name = data.split('"Name": "Record Title",')[1].split('"StringValue": "')[1].split('"')[0]
        #print(name)
        try:
            IUPACname = data.split('"Name": "IUPAC')[1].split('Value": "')[1].split('"')[0]
            #print(IUPACname)
        except:
            IUPACname = 'N/A'
        formula = data.split('"Name": "Molecular Formula"')[1].split('"StringValue": "')[1].split('"')[0]
        #print(formula)
        molWeight = data.split('"Molecular Weight')[1].split('NumValue": ')[1].split(',')[0]
        #print(molWeight)
        def_stereocount = ("".join(data.split('"Defined Atom Stereocenter Count"')[1:])).split('NumValue": ')[1].split('\n')[0]
        #print(def_stereocount)
        
        chemicals.append([name, IUPACname, formula, molWeight, def_stereocount])
    except Exception as e:
        print(e)
        print("Exception raised: PubChem compound #{:}".format(nomass_pubchems[n]))
    
col_name=["name", "IUPAC", "Formulae", "Mr", "Stereo"]
df = pd.DataFrame(chemicals, columns=col_name)

df.to_csv('C:/Users/Sam/Desktop/chemical_database0.csv', sep=',', index=False)
print(df)

So the last one still had horrible bugs, this one is better(-ish) I promise:

# -*- coding: utf-8 -*-
"""
Created on Sat Oct 27 11:42:39 2018

@author: Sam
"""
import requests
import pandas as pd
import json

nomass_pubchems = [int(chem["pubchem"]) for chem in json.loads(open("C:/Users/Sam/Desktop/no_mass_chems (1).json").read())]

chemicals=[]
for n in range(201,300):
    print("Processing PubChem compound #{:}".format(nomass_pubchems[n]))
    try:
        file = requests.get("https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/"+str(nomass_pubchems[n])+"/JSON/?")
        data = file.text
        name = ("".join(data.split('"Name": "Record Title",')[1:])).split('"StringValue": "')[1].split('"')[0]
        #print(name)
        try:
            IUPACname = ("".join(data.split('"Name": "IUPAC')[1:])).split('Value": "')[1].split('"')[0]
            #print(IUPACname)
        except:
            IUPACname = 'N/A'
        formula = ("".join(data.split('"Name": "Molecular Formula"')[1:])).split('"StringValue": "')[1].split('"')[0]
        #print(formula)
        molWeight = data.split('"Molecular Weight')[1].split('NumValue": ')[1].split(',')[0]
        #print(molWeight)
        def_stereocount = ("".join(data.split('"Defined Atom Stereocenter Count"')[1:])).split('NumValue": ')[1].split('\n')[0]
        #print(def_stereocount)
        
        chemicals.append([name, IUPACname, formula, molWeight, def_stereocount])
    except Exception as e:
        print(e)
        print("Exception raised: PubChem compound #{:}".format(nomass_pubchems[n]))
    
col_name=["name", "IUPAC", "Formulae", "Mr", "Stereo"]
df = pd.DataFrame(chemicals, columns=col_name)

df.to_csv('C:/Users/Sam/Desktop/chemical_database2.csv', sep=',', index=False)
print(df)

And finally, one that actually works:

# -*- coding: utf-8 -*-
"""
Created on Sat Oct 27 11:42:39 2018

@author: Sam
"""
import requests
import pandas as pd
import json

nomass_pubchems = [int(chem["pubchem"]) for chem in json.loads(open("C:/Users/Sam/Desktop/no_mass_chems (1).json").read())]

chemicals=[]
for n in range(500):
    print("Processing PubChem compound #{:}".format(nomass_pubchems[n]))
    try:
        file = requests.get("https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/"+str(nomass_pubchems[n])+"/JSON/?")
        data = file.text
        name = ("".join(data.split('"Name": "Record Title",')[1:])).split('"StringValue": "')[1].split('"')[0]
        #print(name)
        try:
            IUPACname = ("".join(data.split('"Name": "IUPAC')[1:])).split('Value": "')[1].split('"')[0]
            #print(IUPACname)
        except:
            IUPACname = 'N/A'
        formula = ("".join(data.split('"Name": "Molecular Formula"')[1:])).split('"StringValue": "')[1].split('"')[0]
        #print(formula)
        molWeight = ("".join(data.split('"Molecular Weight')[1:])).split('NumValue": ')[1].split(',')[0]
        #print(molWeight)
        def_stereocount = ("".join(data.split('"Defined Atom Stereocenter Count"')[1:])).split('NumValue": ')[1].split('\n')[0]
        #print(def_stereocount)
        
        chemicals.append([name, IUPACname, formula, molWeight, def_stereocount])
    except Exception as e:
        print(e)
        print("Exception raised: PubChem compound #{:}".format(nomass_pubchems[n]))
    
col_name=["name", "IUPAC", "Formulae", "Mr", "Stereo"]
df = pd.DataFrame(chemicals, columns=col_name)

df.to_csv('C:/Users/Sam/Desktop/chemical_database.csv', sep=',', index=False)
print(df)

chemical_database.csv (51 KB)

chemical_database1000-1500.csv (72 KB)

chemical_database500-1000.csv (84 KB)

chemical_database1500-2000.csv (79 KB)

Chems 0-500 formatted:

chemical_database0-500.csv (58 KB)

This isn't even my final form...

chemical_database2000-2500.csv (86 KB)

The final database!!

chemical_database.csv (386 KB)

chemical_database_reduced.csv (53 KB)

quickstatements data.txt (82 KB)

Here is the small quickstatements code

Simple (and dirty) Python scripts to get data out of PubChem Turtle files (here the InChIKey, but the same could be done for other information)

https://github.com/vemonet/wikimedia-update-pubchem

Lydia_Pintscher moved this task from in progress to monitoring on the Wikidata board. Jan 4 2019, 2:55 PM

Ecritures moved this task from Doing to Backlog on the Wikistorm-2018 board. May 22 2019, 3:41 PM

Ecritures moved this task from Backlog to Sessions Wikidata (Newbee friendly) on the Wikistorm-2018 board. May 22 2019, 3:47 PM

Ecritures edited projects, added Wiki-Techstorm-2019; removed Wikistorm-2018. Jun 6 2019, 7:38 PM

Ecritures moved this task from Backlog to SPARQLstation on the Wiki-Techstorm-2019 board. Nov 4 2019, 4:24 PM

Ecritures moved this task from SPARQLstation to Backlog on the Wiki-Techstorm-2019 board. Nov 15 2019, 9:55 PM

Ecritures triaged this task as Medium priority. Nov 21 2019, 8:50 PM

Husky unsubscribed. May 25 2020, 12:08 PM

Aklapper renamed this task from Scrape chemical names and data for WikiData from chemical databases to Scrape chemical names and data for WikiData from chemical database chemspider.com. May 24 2021, 7:38 AM

Aklapper lowered the priority of this task from Medium to Low.

	F26886272: chemical_database_reduced.csv
	Oct 27 2018, 1:28 PM

	F26883951: chemical_database2000-2500.csv
	Oct 27 2018, 12:38 PM

	F26883824: chemical_database1500-2000.csv
	Oct 27 2018, 12:30 PM

	F26883775: chemical_database500-1000.csv
	Oct 27 2018, 12:22 PM

	F26883756: chemical_database1000-1500.csv
	Oct 27 2018, 12:14 PM

	F26884259: chemical_database.csv
	Oct 27 2018, 12:51 PM
