
Valid repository link incorrectly marked as url-access=subscription
Open, High | Public | BUG REPORT

Description

A citation for doi:10.1525/bio.2012.62.10.7 (oup.com), with a repository link to https://digitalcommons.law.uidaho.edu/cgi/viewcontent.cgi?article=1527&context=faculty_scholarship, was incorrectly marked as url-access=subscription in this edit: https://en.wikipedia.org/w/index.php?title=Endangered_Species_Act_of_1973&diff=prev&oldid=1291679938

We currently have no way to tell repository links from random publisher links in the url parameter, and there is some natural content drift in Unpaywall (as seen especially for bronze OA at T344114) as well as some longstanding false negatives. Perhaps we can just focus on adding url-access=subscription only when the URL is at the main publisher's domain, i.e. the doi.org link, its target domain, or one of the known domains of legacy publishers.
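The heuristic proposed above could be sketched roughly as follows. This is only an illustration under stated assumptions: the function names, the `doi_target_url` argument and the short `KNOWN_PUBLISHER_DOMAINS` list are hypothetical and are not taken from the OABot code.

```python
from urllib.parse import urlsplit

# Hypothetical, deliberately short list; a real one would need curating.
KNOWN_PUBLISHER_DOMAINS = {
    "sciencedirect.com",
    "onlinelibrary.wiley.com",
    "tandfonline.com",
    "academic.oup.com",
    "link.springer.com",
}
DOI_RESOLVERS = {"doi.org", "dx.doi.org"}


def hostname(url):
    """Return the lowercased host of a URL, without a leading 'www.'."""
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def is_publisher_url(url, doi_target_url=None):
    """Guess whether `url` points at the publisher rather than a repository.

    `doi_target_url` is assumed to be the URL the DOI resolves to, if known.
    """
    host = hostname(url)
    if not host:
        return False
    if host in DOI_RESOLVERS:
        return True
    if doi_target_url and host == hostname(doi_target_url):
        return True
    # Accept the listed domains and any of their subdomains.
    return any(host == d or host.endswith("." + d) for d in KNOWN_PUBLISHER_DOMAINS)
```

With a check like this, the digitalcommons.law.uidaho.edu link from the description would not qualify for url-access=subscription, while a doi.org link would. The target-domain comparison is the weakest part, since it trusts wherever the DOI happens to resolve.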

Event Timeline

First I need to fix a few spurious test failures:

........F.F.F.......F.F.F.....F.F...F....
======================================================================
FAIL: test_add_arxiv (tests.templateedit.TemplateEditTests.test_add_arxiv)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/federico/mw/oabot/src/tests/templateedit.py", line 22, in test_add_arxiv
    self.assertEqual("arxiv=1804.09042", edit.proposed_change)
    ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: 'arxiv=1804.09042' != 'arxiv=1804.09042|'
- arxiv=1804.09042
+ arxiv=1804.09042|
?                 +


======================================================================
FAIL: test_add_arxiv_from_citeseerx (tests.templateedit.TemplateEditTests.test_add_arxiv_from_citeseerx)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/federico/mw/oabot/src/tests/templateedit.py", line 120, in test_add_arxiv_from_citeseerx
    self.assertEqual('arxiv=0209357', edit.proposed_change)
    ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: 'arxiv=0209357' != 'arxiv=math/0209357|'
- arxiv=0209357
+ arxiv=math/0209357|
?       +++++       +


======================================================================
FAIL: test_add_hdl (tests.templateedit.TemplateEditTests.test_add_hdl)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/federico/mw/oabot/src/tests/templateedit.py", line 52, in test_add_hdl
    self.assertEqual("hdl=2027.42/134769|hdl-access=free", edit.proposed_change)
    ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: 'hdl=2027.42/134769|hdl-access=free' != 'hdl=2027.42/134769|hdl-access=free|'
- hdl=2027.42/134769|hdl-access=free
+ hdl=2027.42/134769|hdl-access=free|
?                                   +


======================================================================
FAIL: test_add_pmc (tests.templateedit.TemplateEditTests.test_add_pmc)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/federico/mw/oabot/src/tests/templateedit.py", line 30, in test_add_pmc
    self.assertEqual("pmc=3731883", edit.proposed_change)
    ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: 'pmc=3731883' != 'doi-access=free|'
- pmc=3731883
+ doi-access=free|


======================================================================
FAIL: test_add_pmc_dupe_title (tests.templateedit.TemplateEditTests.test_add_pmc_dupe_title)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/federico/mw/oabot/src/tests/templateedit.py", line 126, in test_add_pmc_dupe_title
    self.assertEqual('pmc=1085149', edit.proposed_change)
    ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: 'pmc=1085149' != ''
- pmc=1085149


======================================================================
FAIL: test_add_pmc_gold_oa (tests.templateedit.TemplateEditTests.test_add_pmc_gold_oa)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/federico/mw/oabot/src/tests/templateedit.py", line 107, in test_add_pmc_gold_oa
    self.assertEqual('pmc=3871558', edit.proposed_change)
    ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: 'pmc=3871558' != 'url=https://figshare.com/articles/_Differ[118 chars]928|'
- pmc=3871558
+ url=https://figshare.com/articles/_Differences_in_Behavior_and_Activity_Associated_with_a_Poly_A_Expansion_in_the_Dopamine_Transporter_in_Belgian_Malinois_/885928|


======================================================================
FAIL: test_existing_hdl (tests.templateedit.TemplateEditTests.test_existing_hdl)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/federico/mw/oabot/src/tests/templateedit.py", line 59, in test_existing_hdl
    self.assertEqual("hdl-access=free", edit.proposed_change)
    ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: 'hdl-access=free' != ''
- hdl-access=free


======================================================================
FAIL: test_existing_oadoi (tests.templateedit.TemplateEditTests.test_existing_oadoi)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/federico/mw/oabot/src/tests/templateedit.py", line 66, in test_existing_oadoi
    self.assertEqual("doi-access=free", edit.proposed_change)
    ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: 'doi-access=free' != 'url=https://figshare.com/articles/dataset[101 chars]455|'
- doi-access=free
+ url=https://figshare.com/articles/dataset/_Assessing_the_Effects_of_Trematode_Infection_on_Invasive_Green_Crabs_in_Eastern_North_America_/1432455|


======================================================================
FAIL: test_existing_url_closed_access (tests.templateedit.TemplateEditTests.test_existing_url_closed_access)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/federico/mw/oabot/src/tests/templateedit.py", line 100, in test_existing_url_closed_access
    self.assertEqual('', edit.proposed_change)
    ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: '' != 'url-access=subscription|'
+ url-access=subscription|


----------------------------------------------------------------------
Ran 24 tests in 429.607s

FAILED (failures=9)
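Two of the failures above (test_add_arxiv and test_add_hdl) differ from the expected value only by a trailing `|`. Whether that separator is intended is not stated here, but a comparison that ignores it could look like the sketch below. The helper name `parse_proposed_change` is hypothetical and not part of the OABot test suite.

```python
def parse_proposed_change(change):
    """Split 'a=b|c=d|' into {'a': 'b', 'c': 'd'}, skipping empty segments."""
    params = {}
    for segment in change.split("|"):
        if not segment:
            continue  # a trailing or doubled '|' yields an empty segment
        # Split on the first '=' only, since URL values may contain '='.
        name, _, value = segment.partition("=")
        params[name.strip()] = value.strip()
    return params
```

A test could then compare `parse_proposed_change(expected)` with `parse_proposed_change(edit.proposed_change)`. The remaining failures, where a different parameter is proposed altogether, would still fail under this comparison.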

Checking the domain name does not work so well with Elsevier URLs which happen to redirect to a repository, like doi:10.1016/j.crci.2007.02.011, which goes to https://comptes-rendus.academie-sciences.fr/chimie/articles/10.1016/j.crci.2007.02.011/

The most common domain names which would be marked as url-access=subscription:

$ find ~/www/python/src/bot_cache -maxdepth 1 -name "*json" -exec cat {} + | sed 's,"orig_string",\n"orig_string",g' | grep '"proposed_change": "url-access=subscription' | grep -Eo '\| *url *= *https?://[^/]+' | sed 's,[|] *url *= *,,g' | sort | uniq -c | sort -nr | head -n 40
   1737 https://doi.org
   1076 http://dx.doi.org
    926 https://www.sciencedirect.com
    646 https://onlinelibrary.wiley.com
    550 https://www.tandfonline.com
    521 https://www.jstor.org
    455 https://linkinghub.elsevier.com
    451 https://www.cambridge.org
    447 https://link.springer.com
    398 https://www.nature.com
    264 https://academic.oup.com
    258 https://www.science.org
    241 https://pubs.acs.org
    229 http://link.springer.com
    216 https://ieeexplore.ieee.org
    189 https://journals.sagepub.com
    171 https://www.journals.uchicago.edu
    161 http://www.tandfonline.com
    156 http://journals.sagepub.com
    152 https://muse.jhu.edu
    106 https://dx.doi.org
     95 http://www.sciencedirect.com
     87 https://www.oxfordreference.com
     87 https://dl.acm.org
     85 http://www.nature.com
     76 https://link.aps.org
     72 https://bioone.org
     67 https://pubs.rsc.org
     65 https://www.annualreviews.org
     64 https://comptes-rendus.academie-sciences.fr
     62 https://journals.lww.com
     60 https://brill.com
     58 https://www.degruyter.com
     56 https://iopscience.iop.org
     55 https://pubs.aip.org
     48 https://agupubs.onlinelibrary.wiley.com
     42 https://www.microbiologyresearch.org
     40 https://royalsocietypublishing.org
     40 https://pubs.geoscienceworld.org
     34 https://aip.scitation.org
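For reference, the same tally can be approximated in Python. This is a sketch that mirrors the text-based pipeline above rather than parsing the cache as JSON, since the cache schema is not shown here. The function name `tally_url_hosts` is hypothetical.

```python
import re
from collections import Counter

# Same pattern as the grep -Eo step, capturing only scheme and host.
URL_PARAM = re.compile(r"\| *url *= *(https?://[^/]+)")
MARKER = '"proposed_change": "url-access=subscription'


def tally_url_hosts(cache_text):
    """Count url= hosts on records that propose url-access=subscription."""
    counts = Counter()
    # Like the first sed step: start a new line at every "orig_string".
    lines = cache_text.replace('"orig_string"', '\n"orig_string"').splitlines()
    for line in lines:
        if MARKER in line:
            counts.update(URL_PARAM.findall(line))
    return counts
```

Here `cache_text` would be the concatenated contents of the cache's `*.json` files, and `tally_url_hosts(cache_text).most_common(40)` plays the role of `sort | uniq -c | sort -nr | head -n 40`.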

Seems about right. I forgot to unconditionally include doi.org, as it's always a publisher link. (Ah, but it's in the denylist, so it's OK.)

There's something wrong with the new logic: sometimes it sets url-access to empty for no apparent reason, e.g. https://en.wikipedia.org/w/index.php?title=Hurdia&diff=prev&oldid=1296401148

Actually, the reason is that there is a proposed edit for url= which would supersede the existing url-access, but we should apply either both changes or neither. For example: "proposed_change": "url-access=|url=https://repositorio.uchile.cl/handle/2250/177034|".
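One way to enforce the "both or neither" rule is sketched below. It assumes the proposed change is held as a dict of parameter names to new values, which is a hypothetical representation and not necessarily how OABot stores it. The function name and the `template_has_url_access` flag are likewise assumptions.

```python
def reconcile_url_change(changes, template_has_url_access):
    """Return a copy of `changes` in which url and url-access move together."""
    result = dict(changes)
    replaces_url = "url" in result
    clears_access = result.get("url-access") == ""
    if clears_access and not replaces_url:
        # Clearing url-access alone would strip the label from the old URL.
        del result["url-access"]
    elif replaces_url and template_has_url_access and "url-access" not in result:
        # The existing access label no longer describes the new URL.
        result["url-access"] = ""
    return result
```

For the example above, `{"url-access": "", "url": "https://repositorio.uchile.cl/handle/2250/177034"}` passes through unchanged, while a lone `{"url-access": ""}` is reduced to an empty change.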