ChemSpider is a database of 63 million chemical structures and their features. The URLs for its pages are systematic: www.chemspider.com/Chemical-Structure.{X}.html, where {X} is an integer. Adding this data to Wikidata would more than double the number of data points in Wikidata.
Description
Event Timeline
Studying the HTML of a typical page (http://www.chemspider.com/Chemical-Structure.175.html) shows that the tag <h1 class="h4"> is used only once, in close proximity to the chemical name. We need to learn how to fetch the HTML and extract the characters following this tag; will approach this with Python.
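A minimal sketch of that extraction, using requests and plain string splitting; the helper name extract_name is made up and no error handling is included:

```python
# Minimal sketch: extract the chemical name that follows the
# <h1 class="h4"> tag on a ChemSpider structure page.
# extract_name is a made-up helper; no error handling.
import requests

def extract_name(html):
    """Return the text between <h1 class="h4"> and the closing </h1>."""
    after_tag = html.split('<h1 class="h4">')[1]
    return after_tag.split('</h1>')[0].strip()

# Example (network call):
#   r = requests.get("http://www.chemspider.com/Chemical-Structure.175.html")
#   print(extract_name(r.text))
```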
Any help would be much appreciated - I have no experience scraping with Python, so I'm likely to be a bit slow.
I'd like to help with this - I can help with the scraping (if that kind of thing is actually allowed). Where can I find you?
An alternative place to get data would be https://www.accessdata.fda.gov/scripts/cder/daf/index.cfm?event=overview.process&ApplNo={X} (again {X} being an integer) which lists FDA-approved medicinal compounds
FYI: there is an API to query ChemSpider https://developer.rsc.org/ but it is limited to 1000 calls a month
Be careful: some entries are deprecated, e.g. http://www.chemspider.com/Chemical-Structure.176.html
Other chemical databases are available:
https://www.ebi.ac.uk/chembl/downloads in SQL
http://mychem.info/ in RDF
More coming :)
PubChem is open, but I don't know what is inside
https://pubchemdocs.ncbi.nlm.nih.gov/downloads
Okay, we can now print names from ChemSpider, though it seems to slow down after 11 entries - will try to apply a similar approach to PubChem.
The code below will scrape chemical names from PubChem really nicely; will try to get more data out. Could someone make a CSV writer module?
#Scrape chemical information from PubChem, using tag <title>
import requests

for n in range(1000):
    r = requests.get("https://pubchem.ncbi.nlm.nih.gov/compound/"+str(n))
    body = r.text
    title1 = body.split('<title>')[1]
    t = title1.split('</title>')[0]
    print(t)
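As a first pass at the CSV writer module asked for above, a minimal helper built on the stdlib csv module; the function and column names are made up:

```python
# Minimal CSV-writing helper; the stdlib csv module handles
# quoting and escaping (e.g. names containing commas).
import csv

def write_chemicals_csv(path, rows, header=("Name", "Formula", "Mr")):
    """Write rows of [name, formula, mass] to a CSV file at path."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)

write_chemicals_csv("chemicals.csv", [["Acetone", "C3H6O", 58.08]])
```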
The scraper now works well; however, getting the relative molecular mass out is difficult - it is easier to build a function to calculate this ourselves, using the data found here: https://www.science.co.il/elements/?s=Weight
Getting information about PubChem RDF: https://pubchemdocs.ncbi.nlm.nih.gov/rdf
Downloading it: ftp://ftp.ncbi.nlm.nih.gov/pubchem/RDF/
SELECT ?chemical_compound ?chemical_compoundLabel WHERE {
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
  ?chemical_compound wdt:P31 wd:Q11173.
}
LIMIT 100
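The query above can be run from Python against the public Wikidata SPARQL endpoint; a sketch (the User-Agent string is a placeholder):

```python
# Sketch: run the compound query above against the standard
# Wikidata SPARQL endpoint and return the JSON bindings.
import requests

QUERY = """
SELECT ?chemical_compound ?chemical_compoundLabel WHERE {
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
  ?chemical_compound wdt:P31 wd:Q11173.
}
LIMIT 100
"""

def run_query(query):
    r = requests.get(
        "https://query.wikidata.org/sparql",
        params={"query": query, "format": "json"},
        headers={"User-Agent": "chem-import-sketch/0.1"},  # placeholder UA
    )
    r.raise_for_status()
    return r.json()["results"]["bindings"]

# Example (network call):
#   for row in run_query(QUERY)[:5]:
#       print(row["chemical_compoundLabel"]["value"])
```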
Here is a small script to get the molecular mass from a chemical formula (with acetone as an example). It uses the Python package periodic to get the element masses (pip install periodic to get it).
import re
from periodic import element

def calcmolmass(chemformula):
    subparts = re.findall('[A-Z][^A-Z]*', chemformula)
    firstdigits = [re.search(r"\d", sp) for sp in subparts]
    firstdigits = [fd.start() if fd is not None else None for fd in firstdigits]
    nums = [float(sp[fd:]) if fd is not None else 1 for sp, fd in zip(subparts, firstdigits)]
    els = [sp[:fd] if fd is not None else sp for sp, fd in zip(subparts, firstdigits)]
    mass = sum(element(el).mass * num for el, num in zip(els, nums))
    return mass

print(calcmolmass('C3H6O'))
Apparently most of PubChem has already been uploaded to Wikidata.
But we could still do an exhaustive comparison to check for parts that have not been converted.
Another option would be to work with a DrugBank dataset to load drug-drug interaction data (using property https://www.wikidata.org/wiki/Property:P129 )
To watch out for later: DrugBank will soon be available as RDF:
https://blog.drugbankplus.com/ontotexts-partnership-with-to-open-new-perspectives-in-pharma-research/
The upload script will be written in Java due to problems with Unicode.
Apparently not all of PubChem has been uploaded to Wikidata:
SELECT ?s WHERE {
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
  ?s wdt:P662 "122876803" .
}
Get the files here: ftp://ftp.ncbi.nlm.nih.gov/pubchem/RDF/descriptor/compound/
Cannot find the entry for https://pubchem.ncbi.nlm.nih.gov/compound/134797139 in Wikidata (at least not via the PubChem CID), so PubChem data for this entry (and other entries) might be missing.
Cannot find this compound using the InChIKey or chemical formula either. It could be interesting to create the compounds that are not present in Wikidata.
SELECT ?p ?propLabel ?o WHERE {
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
  #wd:Q153 ?p ?o .                               # Get ethanol
  #?s wdt:P662 "702" .                           # Get ethanol by PubChem CID
  #?s wdt:P235 "LFQSCWFLJHTTHZ-UHFFFAOYSA-N" .   # Search ethanol by InChIKey
  #?s wdt:P274 "C₂H₆O" .                         # Search ethanol by chemical formula
  #?s wdt:P662 "134797139" .                     # Get one of the last PubChem CIDs
  #?s wdt:P235 "XZHCKCNVYAHADQ-UHFFFAOYSA-N" .   # Search by InChIKey
  ?s wdt:P274 "C₂₃H₃₀ClN₅O₃S" .                  # Search by chemical formula
  ?s ?p ?o .
  ?prop wikibase:directClaim ?p .                # Resolve property label
  FILTER ( ?p != schema:description )
  FILTER ( ?p != rdfs:label )
  FILTER ( ?p != skos:altLabel )
}
The current program runs as follows:
# -*- coding: utf-8 -*-
"""
Created on Fri Oct 26 14:49:17 2018

@author: Sam
"""
#Scrape chemical information from PubChem, using tag <title>
import requests
import pandas as pd
import re

element = ["H","He","Li","Be","B","C","N","O","F","Ne","Na","Mg","Al","Si","P","S","Cl","K","Ar","Ca","Sc","Ti","V","Cr","Mn","Fe","Ni","Co","Cu","Zn","Ga","Ge","As","Se","Br","Kr","Rb","Sr","Y","Zr","Nb","Mo","Tc","Ru","Rh","Pd","Ag","Cd","In","Sn","Sb","I","Te","Xe","Cs","Ba","La","Ce","Pr","Nd","Pm","Sm","Eu","Gd","Tb","Dy","Ho","Er","Tm","Yb","Lu","Hf","Ta","W","Re","Os","Ir","Pt","Au","Hg","Tl","Pb","Bi","Po","At","Rn","Fr","Ra","Ac","Pa","Th","Np","U","Am","Pu","Cm","Bk","Cf","Es","Fm","Md","No","Rf","Lr","Db","Bh","Sg","Mt","Hs"]
el_mass = [1.008,4.003,6.941,9.012,10.811,12.011,14.007,15.999,18.998,20.18,22.99,24.305,26.982,28.086,30.974,32.065,35.453,39.098,39.948,40.078,44.956,47.867,50.942,51.996,54.938,55.845,58.693,58.933,63.546,65.39,69.723,72.64,74.922,78.96,79.904,83.8,85.468,87.62,88.906,91.224,92.906,95.94,98,101.07,102.906,106.42,107.868,112.411,114.818,118.71,121.76,126.905,127.6,131.293,132.906,137.327,138.906,140.116,140.908,144.24,145,150.36,151.964,157.25,158.925,162.5,164.93,167.259,168.934,173.04,174.967]
undef = []

def calcmolmass(chemformula):
    wocharge = chemformula.split("+")[0].split("-")[0]
    subparts = re.findall('[A-Z][^A-Z]*', wocharge)
    firstdigits = [re.search(r"\d", sp) for sp in subparts]
    firstdigits = [fd.start() if fd is not None else None for fd in firstdigits]
    nums = [float(sp[fd:]) if fd is not None else 1 for sp, fd in zip(subparts, firstdigits)]
    els = [sp[:fd] if fd is not None else sp for sp, fd in zip(subparts, firstdigits)]
    mass = sum(el_mass[element.index(els[n])]*nums[n] for n in range(len(els)))
    return float("{:.3f}".format(mass))

chemicals = []
for n in range(1, 1000):
    url = "https://pubchem.ncbi.nlm.nih.gov/compound/"+str(n)
    r = requests.get(url)
    body = r.text
    try:
        title1 = body.split('<title>')[1]
        t = title1.split('</title>')[0]
        title = t.split(' | ')
        chem_n = title[0]
        chem_f = title[1].split(' - ')[0]
        chem_mr = calcmolmass(chem_f)
        chemicals.append([chem_n, chem_f, chem_mr])
    except:
        title1 = body.split('<title>')[1]
        t = title1.split('</title>')[0]
        title = t.split(' | ')
        chem_n = title[0]
        chem_f = title[1].split(' - ')[0]
        print("{:} - ({:})".format(url, chem_f))

col_name = ["Name", "Formulae", "Mr"]
df = pd.DataFrame(chemicals, columns=col_name)
df.to_csv('chemical_database.csv', sep=',', index=False)
print(df)
A bit chunky, but WORKING - now need to upload the CSV to Wikidata.
Information about the contents of the different directories: ftp://ftp.ncbi.nlm.nih.gov/pubchem/RDF/README
Apparently some PubChem entries don't have a human-readable name; they are named after their InChIKey. Do we want those entries too? Example: https://pubchem.ncbi.nlm.nih.gov/compound/134797139
- CID - InChIKey mapping here: ftp://ftp.ncbi.nlm.nih.gov/pubchem/RDF/inchikey/
- Descriptors: ftp://ftp.ncbi.nlm.nih.gov/pubchem/RDF/descriptor
The link between compounds and descriptors can be found here (may not be useful):
ftp://ftp.ncbi.nlm.nih.gov/pubchem/RDF/compound/general/pc_compound2descriptor_000001.ttl.gz
Attached: a JSON list of chemicals with no mass on WikiData, but which have PubChem IDs
Getting the triples linked to a PubChem CID: https://pubchem.ncbi.nlm.nih.gov/rest/rdf/compound/CID2110.html
Using the PubChem REST API: https://pubchemdocs.ncbi.nlm.nih.gov/rdf$_5-2
https://pubchem.ncbi.nlm.nih.gov/rest/rdf/query?graph=compound&predicate=rdf:type
But this returns only 10 000 results. The next 10 000 records (10 001 to 20 000) can be retrieved using the following query:
https://pubchem.ncbi.nlm.nih.gov/rest/rdf/query?graph=compound&predicate=rdf:type&offset=10000
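A sketch of walking through those pages from Python, assuming the offset parameter pages in blocks of 10 000 as described above:

```python
# Sketch: build paged URLs for the PubChem REST RDF query
# endpoint; each page returns up to 10 000 records.
import requests

BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/rdf/query"

def page_url(offset):
    """URL for one 10 000-record page of compound rdf:type triples."""
    return "{}?graph=compound&predicate=rdf:type&offset={}".format(BASE, offset)

# Example (network calls): fetch the first three pages.
#   for offset in (0, 10000, 20000):
#       r = requests.get(page_url(offset))
#       print(offset, r.status_code)
```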
URL for the properties JSON:
https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/9/JSON/?
ftp://ftp.ncbi.nlm.nih.gov/pubchem/RDF/descriptor/compound/
Getting DefinedAtomStereoCount
Scrape the molecular weight from the PubChem JSON:
molWeight = data.split("Molecular Weight")[1].split('"NumValue": ')[1].split(",")[0]
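Since the PUG View response is JSON, parsing it and walking the tree is sturdier than string splitting. A sketch; the key names (TOCHeading, NumValue) are taken from the snippets in this thread and may not hold for every record:

```python
# Walk a parsed PUG View document instead of string-splitting.
# Key names (TOCHeading, NumValue) are assumed from the raw JSON.

def find_section(node, heading):
    """Depth-first search for a dict whose TOCHeading equals heading."""
    if isinstance(node, dict):
        if node.get("TOCHeading") == heading:
            return node
        node = list(node.values())
    if isinstance(node, list):
        for item in node:
            hit = find_section(item, heading)
            if hit is not None:
                return hit
    return None

def first_numvalue(node):
    """Return the first NumValue found anywhere under node, or None."""
    if isinstance(node, dict):
        if "NumValue" in node:
            return node["NumValue"]
        node = list(node.values())
    if isinstance(node, list):
        for item in node:
            val = first_numvalue(item)
            if val is not None:
                return val
    return None

# With a real response:
#   data = json.loads(requests.get(url).text)
#   mw = first_numvalue(find_section(data, "Molecular Weight"))
sample = {"Record": {"Section": [{"TOCHeading": "Molecular Weight",
                                  "Information": [{"NumValue": 58.08}]}]}}
print(first_numvalue(find_section(sample, "Molecular Weight")))  # 58.08
```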
os.system(command) can be used in Python to run terminal programs.
To compile a Java file in the terminal:
javac filename.java
To run the generated class file in the terminal, use:
java filename
(This is for Linux.)
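The same two terminal commands can be driven from Python; subprocess is generally preferred over os.system here. Uploader.java is a hypothetical filename:

```python
# Compile and run a Java program from Python. subprocess.run with
# check=True raises if javac or java exits non-zero.
import subprocess

def class_name(java_file):
    """Strip the .java extension to get the name passed to `java`."""
    return java_file.rsplit(".", 1)[0]

def compile_and_run(java_file):
    subprocess.run(["javac", java_file], check=True)             # compile
    subprocess.run(["java", class_name(java_file)], check=True)  # run

# Example (needs a JDK on the PATH):
#   compile_and_run("Uploader.java")  # hypothetical filename
```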
Using the code below, running in blocks of 500. Make sure to change the destination of the CSV file.
# -*- coding: utf-8 -*-
"""
Created on Sat Oct 27 11:42:39 2018

@author: Sam
"""
import requests
import pandas as pd
import json

nomass_pubchems = [int(chem["pubchem"]) for chem in json.loads(open("C:/Users/Sam/Desktop/no_mass_chems (1).json").read())]
chemicals = []
for n in range(len(nomass_pubchems)):
    try:
        file = requests.get("https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/"+str(nomass_pubchems[n])+"/JSON/?")
        data = file.text
        name = data.split('"Record Title",\n "StringValue": ')[1].split('\n')[0]
        try:
            IUPACname = data.split('"Name": "IUPAC')[1].split('Value": ')[1].split('\n')[0]
        except:
            IUPACname = 'N/A'
        formula = data.split('"Molecular Formula",\n "StringValue": ')[1].split('\n')[0]
        molWeight = data.split("Molecular Weight")[1].split('"NumValue": ')[1].split(",")[0]
        def_stereocount = data.split("Defined Atom Stereocenter Count")[1].split('NumValue": ')[1].split('\n')[0]
        chemicals.append([name, IUPACname, formula, molWeight, def_stereocount])
    except:
        print("Exception raised: PubChem compound #{:}".format(nomass_pubchems[n]))

col_name = ["name", "IUPAC", "Formulae", "Mr", "Stereo"]
df = pd.DataFrame(chemicals, columns=col_name)
df.to_csv('C:/Users/Sam/Desktop/chemical_database.csv', sep=',', index=False)
print(df)
Get the RDF description of a compound
curl -L -H "Accept: text/rdf" -o CID2244.rdf http://rdf.ncbi.nlm.nih.gov/pubchem/compound/CID2244
Or just https://pubchem.ncbi.nlm.nih.gov/rest/rdf/compound/CID2244.rdf
Updated code without the horrible horrible bugs below:
# -*- coding: utf-8 -*-
"""
Created on Sat Oct 27 11:42:39 2018

@author: Sam
"""
import requests
import pandas as pd
import json

nomass_pubchems = [int(chem["pubchem"]) for chem in json.loads(open("C:/Users/Sam/Desktop/no_mass_chems (1).json").read())]
chemicals = []
for n in range(100):
    print("Processing PubChem compound #{:}".format(nomass_pubchems[n]))
    try:
        file = requests.get("https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/"+str(nomass_pubchems[n])+"/JSON/?")
        data = file.text
        name = data.split('"Name": "Record Title",')[1].split('"StringValue": "')[1].split('"')[0]
        #print(name)
        try:
            IUPACname = data.split('"Name": "IUPAC')[1].split('Value": "')[1].split('"')[0]
            #print(IUPACname)
        except:
            IUPACname = 'N/A'
        formula = data.split('"Name": "Molecular Formula"')[1].split('"StringValue": "')[1].split('"')[0]
        #print(formula)
        molWeight = data.split('"Molecular Weight')[1].split('NumValue": ')[1].split(',')[0]
        #print(molWeight)
        def_stereocount = ("".join(data.split('"Defined Atom Stereocenter Count"')[1:])).split('NumValue": ')[1].split('\n')[0]
        #print(def_stereocount)
        chemicals.append([name, IUPACname, formula, molWeight, def_stereocount])
    except Exception as e:
        print(e)
        print("Exception raised: PubChem compound #{:}".format(nomass_pubchems[n]))

col_name = ["name", "IUPAC", "Formulae", "Mr", "Stereo"]
df = pd.DataFrame(chemicals, columns=col_name)
df.to_csv('C:/Users/Sam/Desktop/chemical_database0.csv', sep=',', index=False)
print(df)
So the last one still had horrible bugs, this one is better(-ish) I promise:
# -*- coding: utf-8 -*-
"""
Created on Sat Oct 27 11:42:39 2018

@author: Sam
"""
import requests
import pandas as pd
import json

nomass_pubchems = [int(chem["pubchem"]) for chem in json.loads(open("C:/Users/Sam/Desktop/no_mass_chems (1).json").read())]
chemicals = []
for n in range(201, 300):
    print("Processing PubChem compound #{:}".format(nomass_pubchems[n]))
    try:
        file = requests.get("https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/"+str(nomass_pubchems[n])+"/JSON/?")
        data = file.text
        name = ("".join(data.split('"Name": "Record Title",')[1:])).split('"StringValue": "')[1].split('"')[0]
        #print(name)
        try:
            IUPACname = ("".join(data.split('"Name": "IUPAC')[1:])).split('Value": "')[1].split('"')[0]
            #print(IUPACname)
        except:
            IUPACname = 'N/A'
        formula = ("".join(data.split('"Name": "Molecular Formula"')[1:])).split('"StringValue": "')[1].split('"')[0]
        #print(formula)
        molWeight = data.split('"Molecular Weight')[1].split('NumValue": ')[1].split(',')[0]
        #print(molWeight)
        def_stereocount = ("".join(data.split('"Defined Atom Stereocenter Count"')[1:])).split('NumValue": ')[1].split('\n')[0]
        #print(def_stereocount)
        chemicals.append([name, IUPACname, formula, molWeight, def_stereocount])
    except Exception as e:
        print(e)
        print("Exception raised: PubChem compound #{:}".format(nomass_pubchems[n]))

col_name = ["name", "IUPAC", "Formulae", "Mr", "Stereo"]
df = pd.DataFrame(chemicals, columns=col_name)
df.to_csv('C:/Users/Sam/Desktop/chemical_database2.csv', sep=',', index=False)
print(df)
And finally, one that actually works:
# -*- coding: utf-8 -*-
"""
Created on Sat Oct 27 11:42:39 2018

@author: Sam
"""
import requests
import pandas as pd
import json

nomass_pubchems = [int(chem["pubchem"]) for chem in json.loads(open("C:/Users/Sam/Desktop/no_mass_chems (1).json").read())]
chemicals = []
for n in range(500):
    print("Processing PubChem compound #{:}".format(nomass_pubchems[n]))
    try:
        file = requests.get("https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/"+str(nomass_pubchems[n])+"/JSON/?")
        data = file.text
        name = ("".join(data.split('"Name": "Record Title",')[1:])).split('"StringValue": "')[1].split('"')[0]
        #print(name)
        try:
            IUPACname = ("".join(data.split('"Name": "IUPAC')[1:])).split('Value": "')[1].split('"')[0]
            #print(IUPACname)
        except:
            IUPACname = 'N/A'
        formula = ("".join(data.split('"Name": "Molecular Formula"')[1:])).split('"StringValue": "')[1].split('"')[0]
        #print(formula)
        molWeight = ("".join(data.split('"Molecular Weight')[1:])).split('NumValue": ')[1].split(',')[0]
        #print(molWeight)
        def_stereocount = ("".join(data.split('"Defined Atom Stereocenter Count"')[1:])).split('NumValue": ')[1].split('\n')[0]
        #print(def_stereocount)
        chemicals.append([name, IUPACname, formula, molWeight, def_stereocount])
    except Exception as e:
        print(e)
        print("Exception raised: PubChem compound #{:}".format(nomass_pubchems[n]))

col_name = ["name", "IUPAC", "Formulae", "Mr", "Stereo"]
df = pd.DataFrame(chemicals, columns=col_name)
df.to_csv('C:/Users/Sam/Desktop/chemical_database.csv', sep=',', index=False)
print(df)
Simple (and dirty) Python scripts to get data out of the PubChem turtle files (here the InChIKey ID, but it could be done with other information).
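A sketch of what such a script might look like: pull CID/InChIKey pairs out of a .ttl file by plain string matching, with no real RDF parser. The line shape (inchikey:… sio:is-attribute-of compound:CID…) is assumed from the PubChem RDF layout and should be checked against the actual files:

```python
# Dirty turtle "parsing" by string matching, as described above.
# Assumes lines shaped like:
#   inchikey:XXXXXXXXXXXXXX-UHFFFAOYSA-N sio:is-attribute-of compound:CID2244 .
def extract_inchikeys(lines):
    """Yield (cid, inchikey) pairs from matching turtle lines."""
    for line in lines:
        if "inchikey:" in line and "compound:CID" in line:
            key = line.split("inchikey:")[1].split()[0]
            cid = line.split("compound:CID")[1].split()[0].rstrip(".;")
            yield cid, key

sample = ["inchikey:AAAAAAAAAAAAAA-UHFFFAOYSA-N sio:is-attribute-of compound:CID2244 ."]
print(list(extract_inchikeys(sample)))  # [('2244', 'AAAAAAAAAAAAAA-UHFFFAOYSA-N')]
```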