
Re-assess repository links Unpaywall found on CiteSeerX
Open, Medium | Public | BUG REPORT

Description

doi:10.2307/2004316 (a 1966 paper) gets matched to https://arxiv.org/pdf/hep-th/0502233v1.pdf (a 2004 paper) through http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.263.7400 whose record apparently has a matching DOI. The evidence given is "oa repository (via OAI-PMH doi match)":

{
  "updated": "2017-10-20T15:56:27.009012",
  "url": "https://arxiv.org/pdf/hep-th/0502233v1.pdf",
  "url_for_pdf": "https://arxiv.org/pdf/hep-th/0502233v1.pdf",
  "url_for_landing_page": "http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.263.7400",
  "evidence": "oa repository (via OAI-PMH doi match)",
  "license": null,
  "version": "submittedVersion",
  "host_type": "repository",
  "is_best": true,
  "pmh_id": "oai:CiteSeerX.psu:10.1.1.263.7400",
  "endpoint_id": "CiteSeerX.psu",
  "repository_institution": "CiteSeerX.psu",
  "oa_date": null
}
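
The record above shows the pattern at issue: the landing page is on CiteSeerX, but the PDF URL points to a different host. Here is a minimal Python sketch for flagging such locations, assuming only the field names visible in the record. The function name `points_elsewhere` is hypothetical and not part of Unpaywall or OAbot.

```python
from urllib.parse import urlparse

DOI_MATCH_EVIDENCE = "oa repository (via OAI-PMH doi match)"


def points_elsewhere(location: dict) -> bool:
    """Return True if a CiteSeerX DOI-match location serves its PDF from another host."""
    if location.get("endpoint_id") != "CiteSeerX.psu":
        return False
    if location.get("evidence") != DOI_MATCH_EVIDENCE:
        return False
    pdf_host = urlparse(location.get("url_for_pdf") or "").hostname or ""
    landing_host = urlparse(location.get("url_for_landing_page") or "").hostname or ""
    # Both hosts must be known and must differ,
    # e.g. arxiv.org versus citeseerx.ist.psu.edu.
    return bool(pdf_host) and bool(landing_host) and pdf_host != landing_host
```

A flagged location is not necessarily wrong. It only means the PDF cannot be checked against the CiteSeerX record by host alone.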

https://en.wikipedia.org/w/index.php?title=User_talk:OAbot&diff=prev&oldid=1193800304

Event Timeline

Nemo_bis triaged this task as Low priority.
Nemo_bis created this task.
Nemo_bis updated the task description.

Not sure how to narrow this down: we're talking about some 500k DOI matches from CiteSeerX (out of about 900k snapshot records that mention CiteSeerX):

$ lbzip2 -dc unpaywall_snapshot_2022-03-09_sorted.jsonl.bz2 | grep citeseerx | grep "oa repository (via OAI-PMH doi match)" | jq -r 'select(.oa_locations | .[] | .endpoint_id == "CiteSeerX.psu" and .evidence == "oa repository (via OAI-PMH doi match)" )|.doi' | wc -l
505747
$ lbzip2 -dc unpaywall_snapshot_2022-03-09_sorted.jsonl.bz2 | grep -c citeseerx
887759

Only 35k or so of these have the CiteSeerX match as best_oa_location with an arxiv.org URL (sometimes even when a separate match for arXiv exists, as with doi:10.1002/rsa.20071 / oai:CiteSeerX.psu:10.1.1.237.8456 / oai:arXiv.org:math/0209357):

$ lbzip2 -dc unpaywall_snapshot_2022-03-09_sorted.jsonl.bz2 | grep citeseerx | grep "oa repository (via OAI-PMH doi match)" | jq -r 'select(.best_oa_location | .endpoint_id == "CiteSeerX.psu" and .evidence == "oa repository (via OAI-PMH doi match)" ) | select(.best_oa_location.url | contains("arxiv.org")) | .doi' | wc -l
35449
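
For anyone without lbzip2 and jq at hand, the same filter can be approximated in Python. This is a sketch assuming the snapshot is one JSON record per line with the `best_oa_location` fields used in the command above. The function name `count_citeseerx_best_arxiv` is hypothetical.

```python
import json
from typing import Iterable

DOI_MATCH_EVIDENCE = "oa repository (via OAI-PMH doi match)"


def count_citeseerx_best_arxiv(lines: Iterable[str]) -> int:
    """Count records whose best OA location is a CiteSeerX DOI match with an arxiv.org URL."""
    count = 0
    for line in lines:
        # Cheap textual prefilter, playing the role of `grep citeseerx`.
        if "citeseerx" not in line:
            continue
        record = json.loads(line)
        best = record.get("best_oa_location") or {}
        if (
            best.get("endpoint_id") == "CiteSeerX.psu"
            and best.get("evidence") == DOI_MATCH_EVIDENCE
            and "arxiv.org" in (best.get("url") or "")
        ):
            count += 1
    return count
```

To run it on the compressed snapshot, pass a text-mode handle from the standard library, for example `bz2.open(path, "rt")`, as `lines`.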

I guess I should just revert the recent change and keep returning the CiteSeerX URL instead. That way only the citeseerx parameter addition will be suggested, or nothing if another arxiv/pmc/hdl URL prevails.
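
The idea of returning the CiteSeerX URL can be sketched roughly as follows. This is an illustration based only on the Unpaywall location fields shown in the description, not the actual OAbot code, and the function name `url_to_return` is hypothetical.

```python
from typing import Optional


def url_to_return(location: dict) -> Optional[str]:
    """Pick the URL to report for one Unpaywall OA location (illustrative only)."""
    if location.get("endpoint_id") == "CiteSeerX.psu":
        # Prefer the CiteSeerX landing page over whatever PDF URL the record
        # carries, since that PDF may belong to a different paper.
        return location.get("url_for_landing_page") or location.get("url")
    return location.get("url")
```

With the record from the description, this returns the citeseerx.ist.psu.edu summary URL rather than the arXiv PDF.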

Nemo_bis raised the priority of this task from Low to Medium. Jan 6 2024, 2:47 PM

A sample of the kind of URLs we're talking about:

$ lbzip2 -dc unpaywall_snapshot_2022-03-09_sorted.jsonl.bz2 | grep citeseerx | grep "oa repository (via OAI-PMH doi match)" | jq -r 'select(.oa_locations | .[] | .endpoint_id == "CiteSeerX.psu" and .evidence == "oa repository (via OAI-PMH doi match)" )| [.doi, .best_oa_location.url] | @tsv' | LANG=C sort | LANG=C shuf -n 50

10.1093/mnras/stt464    https://academic.oup.com/mnras/article-pdf/432/1/307/3900254/stt464.pdf
10.1016/s0920-5632(99)00666-0   http://arxiv.org/pdf/hep-ph/9906320
10.1117/12.572024       http://ccplot.org/pub/resources/CALIPSO/Fully automated analysis of space-based lidar data.pdf
10.4007/annals.2009.170.609     http://annals.math.princeton.edu/wp-content/uploads/annals-v170-n2-p04-p.pdf
10.1007/s11854-016-0008-x       http://arxiv.org/pdf/1306.2231.pdf
10.1016/j.cpc.2008.10.005       https://iris.unito.it/bitstream/2318/99086/1/047_Phantom_COMPHY3649.pdf
10.2140/pjm.2010.246.199        http://msp.org/pjm/2010/246-1/pjm-v246-n1-p07-s.pdf
10.2172/799926  http://arxiv.org/pdf/hep-th/0205108v2.pdf
10.1109/icosp.2012.6491686      http://www.ece.umd.edu/DSPCAD/papers/zhou2012x1.pdf
10.1103/physrevlett.78.3741     http://arxiv.org/pdf/cond-mat/9611038
10.1007/s00013-008-2760-3       http://arxiv.org/pdf/0810.2773
10.2139/ssrn.307339     http://web.mit.edu/lewellen/www/Documents/MnReversion.pdf
10.1201/9781420034462.pt1       http://eprints.ma.man.ac.uk/587/01/covered/MIMS_ep2006_421.pdf
10.1128/mcb.00665-06    https://europepmc.org/articles/pmc1899963?pdf=render
10.1111/j.1468-2354.2011.00664.x        http://fmwww.bc.edu/ec-p/wp658.pdf
10.1016/s0550-3213(97)00163-6   http://arxiv.org/pdf/hep-th/9611036
10.1090/s0894-0347-08-00622-x   https://www.ams.org/jams/2009-22-02/S0894-0347-08-00622-X/S0894-0347-08-00622-X.pdf
10.1080/02602930410001689162    http://www.cse.ohio-state.edu/~neelam/abet/CGRs/cgrPaper.pdf
10.1016/s0550-3213(03)00312-2   http://arxiv.org/pdf/hep-th/0207144
10.1117/12.787384       http://arxiv.org/pdf/0807.0497
10.1007/978-3-642-25361-4_11    http://www.math.nyu.edu/faculty/kohn/papers/abel-symposium-paper.pdf
10.1175/1520-0450(1997)036<0847:morsdo>2.0.co;2 http://www.atmos.washington.edu/MG/PDFs/JAM97_yute_measurements.pdf
10.1142/s0217751x05023347       http://arxiv.org/pdf/nucl-th/0410084
10.1086/340756  http://arxiv.org/pdf/astro-ph/0203254v1.pdf
10.3115/v1/w14-1810     http://www.aclweb.org/anthology/W/W14/W14-1810.pdf
10.1007/s10955-011-0138-6       http://arxiv.org/pdf/1101.5043
10.1142/s0218216508006129       http://arxiv.org/pdf/math/0209138
10.1112/jlms/jdp026     https://digitalcommons.lsu.edu/cgi/viewcontent.cgi?article=2060&context=mathematics_pubs
10.4171/ifb/87  http://www.ems-ph.org/journals/show_pdf.php?issn=1463-9963&vol=5&iss=4&rank=5
10.1002/(sici)1097-0258(19981215)17:23<2661::aid-sim33>3.3.co;2-2       http://www.medicine.mcgill.ca/epidemiology/moodie/AGLM-HW/Altman1998.pdf
10.1103/physrevb.73.045421      http://arxiv.org/pdf/cond-mat/0507656
10.1109/4236.895012     http://eprints.cs.vt.edu/archive/00000525/01/pipe-techreport.pdf
10.1103/physrevd.85.124024      http://arxiv.org/pdf/1203.3109
10.1007/978-3-540-72734-7_8     http://www.lsv.ens-cachan.fr/Publis/PAPERS/PDF/BDL-apal09.pdf
10.1086/304764  http://arxiv.org/pdf/astro-ph/9707171v1.pdf
10.1145/346852.346943   http://users.pandora.be/michel.tilman/Publications/OOPSLA98PR.pdf
10.1016/j.nuclphysb.2003.11.011 http://arxiv.org/pdf/hep-th/0306227
10.1109/tevc.2010.2040181       http://arxiv.org/pdf/q-bio/0512003
10.1143/ptp.113.513     https://academic.oup.com/ptp/article-pdf/113/3/513/5432351/113-3-513.pdf
10.1142/9789812701848_0015      http://arxiv.org/pdf/astro-ph/0502118
10.1109/lcn.2007.152    https://arrow.dit.ie/cgi/viewcontent.cgi?article=1013&context=commcon
10.2973/odp.proc.ir.131.107.1991        https://doi.org/10.2973/odp.proc.ir.131.107.1991
10.1080/092961070500055335      http://www.ats.uni-muenchen.de/downloads/pustet/pustet_altmann_2005.pdf
10.1016/s0304-3975(02)00421-8   https://doi.org/10.1016/s0304-3975(02)00421-8
10.1080/13504851.2014.884687    https://europepmc.org/articles/pmc4606884?pdf=render
10.1063/1.1401139       http://arxiv.org/pdf/gr-qc/0104009
10.1145/316188.316208   http://www.cs.wisc.edu/~vernon/cs838/00/papers/stoica.99sigcomm.pdf
10.1088/0004-637x/701/1/l52     http://arxiv.org/pdf/0907.5459
10.1007/978-1-84628-715-2       http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1767402/pdf/hrt08800481.pdf
10.1007/978-3-642-32512-0_46    http://arxiv.org/pdf/1201.4603

So we won't suggest edits like this one either, as we don't get non-repository URLs from other sources: https://en.wikipedia.org/w/index.php?title=Saccharomyceta&curid=68064105&diff=1194087545&oldid=1182890284