
Import arXiv ID (P818) and "full work available at" (P953) from unpaywall dataset
Open, Needs Triage · Public

Description

Example JSON (from https://api.unpaywall.org/v2/10.1103/PhysRevLett.121.043601?email=YOUR_EMAIL):

{
  "best_oa_location": {
    "evidence": "oa repository (via OAI-PMH title and first author match)",
    "host_type": "repository",
    "is_best": true,
    "license": null,
    "pmh_id": "oai:openaccess.leidenuniv.nl:1887/64966",
    "updated": "2018-09-05T20:32:18.374467",
    "url": "https://openaccess.leidenuniv.nl/bitstream/handle/1887/64966/PhysRevLett.121.pdf?sequence=1",
    "url_for_landing_page": "http://hdl.handle.net/1887/64966",
    "url_for_pdf": "https://openaccess.leidenuniv.nl/bitstream/handle/1887/64966/PhysRevLett.121.pdf?sequence=1",
    "version": "publishedVersion"
  },
  "data_standard": 2,
  "doi": "10.1103/physrevlett.121.043601",
  "doi_url": "https://doi.org/10.1103/physrevlett.121.043601",
  "genre": "journal-article",
  "is_oa": true,
  "journal_is_in_doaj": false,
  "journal_is_oa": false,
  "journal_issns": "0031-9007,1079-7114",
  "journal_name": "Physical Review Letters",
  "oa_locations": [
    {
      "evidence": "oa repository (via OAI-PMH title and first author match)",
      "host_type": "repository",
      "is_best": true,
      "license": null,
      "pmh_id": "oai:openaccess.leidenuniv.nl:1887/64966",
      "updated": "2018-09-05T20:32:18.374467",
      "url": "https://openaccess.leidenuniv.nl/bitstream/handle/1887/64966/PhysRevLett.121.pdf?sequence=1",
      "url_for_landing_page": "http://hdl.handle.net/1887/64966",
      "url_for_pdf": "https://openaccess.leidenuniv.nl/bitstream/handle/1887/64966/PhysRevLett.121.pdf?sequence=1",
      "version": "publishedVersion"
    },
    {
      "evidence": "oa repository (via OAI-PMH doi match)",
      "host_type": "repository",
      "is_best": false,
      "license": null,
      "pmh_id": "oai:arXiv.org:1803.10992",
      "updated": "2018-08-05T02:59:53.065628",
      "url": "http://arxiv.org/pdf/1803.10992",
      "url_for_landing_page": "http://arxiv.org/abs/1803.10992",
      "url_for_pdf": "http://arxiv.org/pdf/1803.10992",
      "version": "submittedVersion"
    }
  ],
  "published_date": "2018-07-23",
  "publisher": "American Physical Society (APS)",
  "title": "Observation of the Unconventional Photon Blockade",
  "updated": "2018-09-05T20:32:21.743193",
  "year": 2018,
  "z_authors": [
    {
      "family": "Snijders",
      "given": "H.\u2009J.",
      "sequence": "first"
    },
    {
      "family": "Frey",
      "given": "J.\u2009A.",
      "sequence": "additional"
    },
    {
      "family": "Norman",
      "given": "J.",
      "sequence": "additional"
    },
    {
      "family": "Flayac",
      "given": "H.",
      "sequence": "additional"
    },
    {
      "family": "Savona",
      "given": "V.",
      "sequence": "additional"
    },
    {
      "family": "Gossard",
      "given": "A.\u2009C.",
      "sequence": "additional"
    },
    {
      "family": "Bowers",
      "given": "J.\u2009E.",
      "sequence": "additional"
    },
    {
      "family": "van Exter",
      "given": "M.\u2009P.",
      "sequence": "additional"
    },
    {
      "family": "Bouwmeester",
      "given": "D.",
      "sequence": "additional"
    },
    {
      "family": "L\u00f6ffler",
      "given": "W.",
      "sequence": "additional"
    }
  ]
}
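For reference, the arXiv identifier this import needs lives in oa_locations[].pmh_id. A minimal Python sketch of the extraction (the function name is made up here), run against an inline subset of the record above; it follows the same logic as the jq filter in the pipeline: keep pmh_id values starting with "oai:arXiv" and take the part after the second colon.

```python
import json

# Inline subset of the Unpaywall record shown above.
record = json.loads("""
{
  "doi": "10.1103/physrevlett.121.043601",
  "oa_locations": [
    {"pmh_id": "oai:openaccess.leidenuniv.nl:1887/64966"},
    {"pmh_id": "oai:arXiv.org:1803.10992"}
  ]
}
""")

def extract_arxiv_id(record):
    """Return the first arXiv identifier in oa_locations, or None."""
    for loc in record.get("oa_locations", []):
        pmh_id = loc.get("pmh_id")
        if isinstance(pmh_id, str) and pmh_id.startswith("oai:arXiv"):
            # "oai:arXiv.org:1803.10992" -> "1803.10992"
            return pmh_id.split(":", 2)[2]
    return None

print(extract_arxiv_id(record))  # 1803.10992
```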

Event Timeline

Pipeline as of now:


# download sources
wget [redacted]/unpaywall_snapshot_2018-06-21T164548_with_versions.jsonl.gz
wget https://dumps.inventaire.io/wd/wd_doi_ids.ndjson.gz

# rewrite sources into <DOI, arXiv> and <DOI, Qid> tuples
pv unpaywall_snapshot_2018-06-21T164548_with_versions.jsonl.gz | gunzip | grep arXiv | ./jq-linux64 --raw-output '[.doi, [.oa_locations[].pmh_id | strings | select(startswith("oai:arXiv")) | split(":")[2]][0]] | @tsv' > unpaywall_doi_to_arxiv
pv wd_doi_ids.ndjson.gz | gunzip | ./jq-linux64 --raw-output '[.doi, .id] | @tsv' > doi_to_qid

# make a subset of the unpaywall_doi_to_arxiv
cp unpaywall_doi_to_arxiv unpaywall_initial

# sort both (LC_ALL=C -- not LOCALE=C, which has no effect -- forces plain byte collation)
LC_ALL=C sort -f -t $'\t' doi_to_qid > doi_to_qid_sorted
LC_ALL=C sort -f -t $'\t' unpaywall_initial > unpaywall_initial_sorted

# join on DOI (use the *sorted* unpaywall file; joining the unsorted one silently drops matches)
LC_ALL=C join -i --nocheck-order -t $'\t' unpaywall_initial_sorted doi_to_qid_sorted | awk -F'\t' '{print $3 "\tP818\t" $2 "\tS248\tQ38352586"}' > ~/public_html/quickstatements.tsv

The sort-and-join step is a bit painful; it might make sense to do this in Python (or MySQL) instead.
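A dict-based join in Python would avoid the external sort entirely. A hypothetical sketch (the function name is made up, file names are as in the pipeline above, and DOIs are lowercased on both sides as a stab at the case problem):

```python
def join_tsv(unpaywall_path, qid_path, out_path):
    """Join <DOI, arXiv> and <DOI, Qid> TSVs into QuickStatements rows."""
    # Load DOI -> Qid into a dict; dict lookup replaces the external sort+join.
    doi_to_qid = {}
    with open(qid_path) as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if len(parts) == 2:
                doi_to_qid[parts[0].lower()] = parts[1]
    with open(unpaywall_path) as f, open(out_path, "w") as out:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if len(parts) != 2:
                continue  # row without an arXiv identifier
            qid = doi_to_qid.get(parts[0].lower())
            if qid:
                out.write(f"{qid}\tP818\t\"{parts[1]}\"\tS248\tQ38352586\n")
```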

One problem that still needs to be solved is the upper/lower case handling of DOIs. I don't know whether we can generally assume they are case-insensitive, as that might be publisher-dependent. At the same time, many sources do use different capitalization for the same DOI (e.g. PHYSREVLETT vs. PhysRevLett).
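For what it's worth, the DOI Handbook does specify that DOI names are case-insensitive (ASCII case folding), so normalizing both sides to one case before the join should be safe. A tiny sketch (the helper name is made up):

```python
def normalize_doi(doi):
    """Fold a DOI to a canonical form for joining.

    DOI names are case-insensitive per the DOI Handbook (ASCII case
    folding), so lowercasing should not conflate distinct DOIs.
    """
    return doi.strip().lower()

assert normalize_doi("10.1103/PHYSREVLETT.121.043601") == \
       normalize_doi("10.1103/PhysRevLett.121.043601")
```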

valhallasw moved this task from Backlog to Doing on the Wikistorm board. Oct 27 2018, 1:44 PM

Split off the second part of the unpaywall data set using:

tail -n +138152 unpaywall_doi_to_arxiv > unpaywall_doi_to_arxiv_zonder_initial
tools.unpaywall-importer@tools-bastion-02:~$ cat > sort.sh
LC_ALL=C sort -f -t $'\t' unpaywall_doi_to_arxiv_zonder_initial > unpaywall_doi_to_arxiv_zonder_initial_sorted
tools.unpaywall-importer@tools-bastion-02:~$ jsub bash sort.sh

Pandas:

tools.unpaywall-importer@tools-bastion-02:~$ cat merge.py
import pandas
unpaywall = pandas.read_csv("unpaywall_doi_to_arxiv_zonder_initial_sorted", header=None, names=["doi", "arxiv"], sep='\t')
# Unpaywall stores DOIs in lowercase; Wikidata uses uppercase, so fold to upper before merging.
unpaywall['doi'] = unpaywall['doi'].str.upper()
qid_mapping = pandas.read_csv("doi_to_qid_sorted", header=None, names=["doi", "qid"], sep='\t')
# merge() joins on the shared 'doi' column (inner join by default).
merged = unpaywall.merge(qid_mapping)
merged.to_csv("public_html/merged.csv", sep='\t')
merged.apply(lambda x: "{0.qid}\tP818\t{0.arxiv}\tS248\tQ38352586".format(x), axis=1).to_csv("public_html/commands.csv")

tools.unpaywall-importer@tools-bastion-02:~$ jsub -mem 4G python merge.py

This leads to an additional 44319 arXiv identifiers to import (yay), and this way it is also easier to re-do later. What's still missing is a way to skip existing entries, but for this initial import that's not a huge issue.
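Skipping existing entries could be done with an anti-join in pandas, assuming a TSV of already-imported (Qid, arXiv) pairs is exported from Wikidata first. A hypothetical sketch (the function name and input file are made up):

```python
import pandas

def skip_existing(merged, existing_path):
    """Drop rows whose Qid already carries a P818 statement.

    existing_path is a hypothetical Qid<TAB>arxiv TSV exported from
    Wikidata beforehand; rows whose qid appears there are filtered out
    via a left merge with indicator (an anti-join).
    """
    existing = pandas.read_csv(existing_path, header=None,
                               names=["qid", "arxiv"], sep="\t")
    out = merged.merge(existing[["qid"]].drop_duplicates(),
                       on="qid", how="left", indicator=True)
    return out[out["_merge"] == "left_only"].drop(columns="_merge")
```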

Result: https://tools.wmflabs.org/unpaywall-importer/commands.csv

Note for the next time: I forgot the quotation marks around the arXiv identifier. Those should be added in the merged.apply() step.

Now hackily fixed with a regex replace: P818\t([^\t]+)\t -> P818\t"\1"\t
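That regex replace can also be done with GNU sed (where \t matches a tab). A hypothetical demo on a single sample row (the file name is made up):

```shell
# One sample QuickStatements row with an unquoted arXiv identifier.
printf 'Q1\tP818\t1803.10992\tS248\tQ38352586\n' > sample_commands.csv
# Wrap the value between P818 and S248 in double quotes.
sed -E 's/P818\t([^\t]+)\t/P818\t"\1"\t/' sample_commands.csv
```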

Processing the full 44k entries data set in one go makes QS quite unhappy, so I'll upload it in ~10k batches. First batch:

Q34309528	P818	"cond-mat/0012021"	S248	Q38352586
...
Q41868713	P818	"1605.03822"	S248	Q38352586

The arXiv identifier dataset has now been fully imported. The next question is how to keep it up to date as new entries appear in the unpaywall dataset and as new DOIs/Qids are added to Wikidata.

Unpaywall seems to be updated twice a year; DOIs are of course added more often.

https://www.wikidata.org/wiki/Wikidata:Database_reports/Constraint_violations/P818 also still needs a cleanup.

Update imports should also take care not to re-import manually removed statements such as https://www.wikidata.org/w/index.php?title=Q23042979&type=revision&diff=805387135&oldid=798100426

There are a number of 'nan' values imported. This seems to be due to the somewhat hacky filtering for unpaywall entries with arXiv identifiers. For example, an entry may link to http://cds.cern.ch/record/681502/files/arXiv:hep-ph_0411095.pdf but not to the corresponding arXiv page. In those cases, unpaywall_doi_to_arxiv_zonder_initial_sorted contains no arXiv identifier, and pandas substitutes the missing value with NaN, which then ends up in the output as the string 'nan' (an odd choice on pandas' part...)

The best way to solve this would be somewhat smarter parsing: maybe slower, but definitely more stable. For now, I will simply remove the nan entries.
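Removing them amounts to a dropna() on the arxiv column before generating the commands; a minimal sketch on toy data (column names as in merge.py above):

```python
import pandas

# Toy stand-in for the merged DataFrame; a missing arXiv id becomes NaN.
merged = pandas.DataFrame({
    "doi": ["10.1103/physrevlett.121.043601", "10.1/example"],
    "arxiv": ["1803.10992", None],
    "qid": ["Q1", "Q2"],
})
# Keep only rows that actually have an arXiv identifier.
cleaned = merged.dropna(subset=["arxiv"])
print(len(cleaned))  # 1
```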