Work around incorrect matches for PMC IDs of AMS/PNAS papers
Open, Medium, Public

Assigned To
None
Authored By
Nemo_bis
Jan 6 2024, 10:08 AM
Referenced Files
F41656924: pnas.8.10.283_crossref.json.gz
Jan 7 2024, 6:36 PM
F41656925: S0002-9947-1922-1501216-9.json.gz
Jan 7 2024, 6:36 PM
F41656926: S0002-9947-1922-1501216-9_crossref.json.gz
Jan 7 2024, 6:36 PM
F41656927: pnas.8.10.283.json.gz
Jan 7 2024, 6:36 PM

Description

Two papers by the same author, published in the same year under the same title, appeared in different journals (TAMS/BAMS and PNAS) with different page numbers. PMC has scans of both journals.
https://en.wikipedia.org/wiki/Talk:Blumberg_theorem

The edit: https://en.wikipedia.org/?diff=1193778319

The original citation:

* {{cite journal|title=New properties of all real functions|first=Henry|last=Blumberg|journal=Transactions of the American Mathematical Society|volume=24|date=September 1922|issue=2|page=113-128|doi=10.1090/S0002-9947-1922-1501216-9 |url=https://www.ams.org/journals/tran/1922-024-02/S0002-9947-1922-1501216-9|doi-access=free|jstor=1989037|jstor-access=free}}

More examples from https://en.wikipedia.org/w/index.php?title=User_talk:OAbot&oldid=1194216712#PMC_for_wrong_version_of_paper

doi:10.1073/pnas.17.2.125 doi:10.1090/S0002-9947-1932-1501641-2
doi:10.1073/pnas.38.8.716 doi:10.1090/S0002-9904-1954-09848-8
doi:10.1073/pnas.3.4.314 doi:10.1090/S0002-9947-1917-1501070-3

Event Timeline

On Unpaywall, both papers have the evidence "oa repository (via OAI-PMH title and first author match)", although the PMC side exposes a link to the correct DOI. The CrossRef API gives the page ranges as "113-128" and "283-288", so it may be possible to tell the two papers apart by comparing the number of pages.
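A minimal sketch of such a page-count check, assuming the page range arrives as a plain "first-last" string as in the two examples above. The function names `page_count` and `same_length` are hypothetical, and abbreviated ranges such as "283-8" are deliberately treated as unparseable rather than guessed at.

```python
import re
from typing import Optional


def page_count(page_range: str) -> Optional[int]:
    """Return the number of pages in a range like "113-128", or None if unparseable."""
    # Accept a hyphen or an en dash between the two page numbers.
    match = re.fullmatch(r"\s*(\d+)\s*[-\u2013]\s*(\d+)\s*", page_range)
    if match is None:
        return None
    first, last = int(match.group(1)), int(match.group(2))
    if last < first:
        # Probably an abbreviated range ("283-8"); do not guess.
        return None
    return last - first + 1


def same_length(pages_a: str, pages_b: str) -> Optional[bool]:
    """Compare two page ranges by length; None means "cannot tell"."""
    count_a, count_b = page_count(pages_a), page_count(pages_b)
    if count_a is None or count_b is None:
        return None
    return count_a == count_b


print(page_count("113-128"), page_count("283-288"))
print(same_length("113-128", "283-288"))
```

For the two papers in this task the counts are 16 and 6 pages, so the check would flag the PMC match as suspicious. Returning None rather than False for unparseable input leaves room to fall back on the existing behaviour when the metadata has no usable page range.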

The wrong one for doi:10.1090/S0002-9947-1922-1501216-9:

{
  "updated": "2023-01-20T07:05:57.468866",
  "url": "https://europepmc.org/articles/pmc1085149?pdf=render",
  "url_for_pdf": "https://europepmc.org/articles/pmc1085149?pdf=render",
  "url_for_landing_page": "https://europepmc.org/articles/pmc1085149",
  "evidence": "oa repository (via OAI-PMH title and first author match)",
  "license": null,
  "version": "publishedVersion",
  "host_type": "repository",
  "is_best": true,
  "pmh_id": "oai:europepmc.org:76743",
  "endpoint_id": "b5e840539009389b1a6",
  "repository_institution": "PubMed Central - Europe PMC",
  "oa_date": null
}

The correct one for doi:10.1073/pnas.8.10.283:

[
  {
    "updated": "2023-01-20T07:05:57.468866",
    "url": "https://europepmc.org/articles/pmc1085149?pdf=render",
    "url_for_pdf": "https://europepmc.org/articles/pmc1085149?pdf=render",
    "url_for_landing_page": "https://europepmc.org/articles/pmc1085149",
    "evidence": "oa repository (via OAI-PMH title and first author match)",
    "license": null,
    "version": "publishedVersion",
    "host_type": "repository",
    "is_best": true,
    "pmh_id": "oai:europepmc.org:76743",
    "endpoint_id": "b5e840539009389b1a6",
    "repository_institution": "PubMed Central - Europe PMC",
    "oa_date": null
  },
  {
    "updated": "2024-01-07T18:26:41.686445",
    "url": "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1085149",
    "url_for_pdf": null,
    "url_for_landing_page": "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1085149",
    "evidence": "oa repository (via pmcid lookup)",
    "license": null,
    "version": "publishedVersion",
    "host_type": "repository",
    "is_best": false,
    "pmh_id": null,
    "endpoint_id": null,
    "repository_institution": null,
    "oa_date": null
  }
]

There are about 6500k PMC matches by PMCID lookup and only about 640k matches by title and first author, of which some 62k appear without a PMCID match, so perhaps we can just ignore those europepmc matches:

$ lbzip2 -dc unpaywall_snapshot_2022-03-09_sorted.jsonl.bz2 | grep '"is_oa": true' | grep pmc | grep -c "oa repository (via pmcid lookup)"
6499014
$ lbzip2 -dc unpaywall_snapshot_2022-03-09_sorted.jsonl.bz2 | grep '"is_oa": true' | grep pmc | grep -c "oa repository (via OAI-PMH title and first author match)"
637491
$ lbzip2 -dc unpaywall_snapshot_2022-03-09_sorted.jsonl.bz2 | grep '"is_oa": true' | grep pmc | grep "oa repository (via OAI-PMH title and first author match)" | grep -vc "oa repository (via pmcid lookup)"
62310
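A rough sketch of what "ignoring those europepmc matches" could look like on a single Unpaywall record, assuming the `oa_locations` entries have the `url` and `evidence` fields shown in the JSON above. The names `is_pmc` and `trusted_locations` are hypothetical, and the rule itself (keep a PMC title-and-first-author match only when the same record also has a PMCID lookup) is one possible reading of the idea, not an existing implementation. Unlike the grep counts, which test whole record lines, this works per location.

```python
TITLE_AUTHOR = "oa repository (via OAI-PMH title and first author match)"
PMCID_LOOKUP = "oa repository (via pmcid lookup)"


def is_pmc(location: dict) -> bool:
    """True if the location URL points at PMC or Europe PMC."""
    return "pmc" in (location.get("url") or "").lower()


def trusted_locations(oa_locations: list) -> list:
    """Drop PMC title/first-author matches not backed by a PMCID lookup."""
    has_pmcid = any(
        is_pmc(loc) and loc.get("evidence") == PMCID_LOOKUP
        for loc in oa_locations
    )
    if has_pmcid:
        # The PMCID lookup confirms the PMC article belongs to this DOI.
        return list(oa_locations)
    return [
        loc
        for loc in oa_locations
        if not (is_pmc(loc) and loc.get("evidence") == TITLE_AUTHOR)
    ]


# Trimmed-down versions of the two records quoted in this task.
wrong = [
    {"url": "https://europepmc.org/articles/pmc1085149?pdf=render",
     "evidence": TITLE_AUTHOR},
]
correct = wrong + [
    {"url": "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1085149",
     "evidence": PMCID_LOOKUP},
]
print(len(trusted_locations(wrong)), len(trusted_locations(correct)))
```

With the two records from this task, the AMS DOI loses its only (wrong) PMC location while the PNAS DOI keeps both of its locations.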

Overall frequency of each evidence type among locations whose URL contains "pmc":

$ lbzip2 -dc unpaywall_snapshot_2022-03-09_sorted.jsonl.bz2 | grep pmc | jq -r '.oa_locations | .[] | select( .url | contains("pmc") ) | .evidence' | LANG=C sort | LANG=C uniq -c
   4626 oa journal (via doaj)
     30 oa journal (via observed oa rate)
      2 oa journal (via publisher name)
5023777 oa repository (via OAI-PMH doi match)
  81775 oa repository (via OAI-PMH title and first author match)
   2227 oa repository (via OAI-PMH title and last author match)
  22357 oa repository (via OAI-PMH title match)
      1 oa repository (via page says license)
6499014 oa repository (via pmcid lookup)
     38 open (via crossref license)
     33 open (via crossref license, author manuscript)
     25 open (via free article)
   4977 open (via free pdf)
   1210 open (via page says license)
      4 open (via page says Open Access)

It's not just old publication years:

$ lbzip2 -dc unpaywall_snapshot_2022-03-09_sorted.jsonl.bz2 | grep '"is_oa": true' | grep pmc | grep "oa repository (via OAI-PMH title and first author match)" | grep -v "oa repository (via pmcid lookup)" | jq -r .year | sort | uniq -c | sort -nr | head -n 20
   8461 2018
   6534 2020
   4920 2021
   4375 2017
   3449 2013
   3172 2012
   2338 2019
   1773 2014
   1559 2011
   1510 2016
   1159 2010
   1090 1998
   1081 1999
    979 2009
    978 2005
    886 2007
    880 2015
    870 2006
    839 2008
    776 2001

A random sample of 40 of these roughly 62k DOIs:

$ lbzip2 -dc unpaywall_snapshot_2022-03-09_sorted.jsonl.bz2 | grep '"is_oa": true' | grep pmc | grep "oa repository (via OAI-PMH title and first author match)" | grep -v "oa repository (via pmcid lookup)" | jq -r .doi | shuf -n 40 
10.1096/fasebj.27.1_supplement.561.2
10.1128/jvi.73.3.2136-2142.1999
10.1093/genetics/47.4.367
10.21203/rs.3.rs-44103/v3
10.1242/dev.101253
10.1891/1061-3749.21.3.502
10.5993/ajhb.34.4.1
10.1101/2021.03.27.437308
10.1093/genetics/52.6.1187
10.36834/cmej.42236
10.1101/2020.11.27.401281
10.20944/preprints202012.0170.v1
10.1128/jvi.66.1.496-504.1992
10.4103/2228-7477.114377
10.1128/jcm.38.11.4049-4057.2000
10.13107/jocr.2021.v11.i04.2184
10.1093/annonc/mdx143.003
10.32607/20758251-2018-10-2-4-15
10.2139/ssrn.3881626
10.3390/s6010030
10.21203/rs.3.rs-453031/v1
10.1101/201830
10.2139/ssrn.1719327
10.7287/peerj.preprints.26971v1
10.1002/ange.202101478
10.1123/jpah.12.2.238
10.1128/jvi.68.10.6567-6577.1994
10.1101/339622
10.1055/s-0036-1586413
10.1155/2010/347142
10.1101/2020.06.12.149112
10.1101/286070
10.1016/s0022-2836(75)80158-6
10.1097/01.ogx.0000511937.49229.57
10.1101/308262
10.1145/3388440.3414208
10.2307/3454618
10.1101/257980
10.1083/jcb1765oia10
10.1172/jci200522079
Nemo_bis updated the task description.

The non-Unpaywall side continues at T228702.

Nemo_bis triaged this task as Medium priority. Jan 7 2024, 9:25 PM
Nemo_bis updated the task description.