Search new OA locations for SemanticScholar and other PDF links
Closed, Resolved (Public)

Description

We have quite a few broken SemanticScholar links, even after GreenC fixed most of them.

We should try and add new OA locations to these, starting with identifiers (doi-access=free, pmc, hdl), but at the moment we don't because we ignore citations which contain a link to a PDF-looking file:

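import re

# Matches URLs ending in ".pdf", optionally followed by a query string or fragment.
url_pdf_extension_re = re.compile(r'.*\.pdf([\?#].*)?$', re.IGNORECASE)


# ArgumentMapping is defined elsewhere in the codebase.
class UrlArgumentMapping(ArgumentMapping):
    def present_and_free(self, template):
        # Report the URL as present and free when it looks like a link to a PDF file.
        val = self.get(template)
        if val:
            match = url_pdf_extension_re.match(val.strip())
            if match:
                return True
        return False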
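
To make the limits of this check concrete, here is a small sketch that runs the pattern above against a few URLs. The example.org URLs and the `looks_like_pdf` helper are made up for illustration; only the regular expression comes from the snippet above.

```python
import re

# Same pattern as in the snippet above.
url_pdf_extension_re = re.compile(r'.*\.pdf([\?#].*)?$', re.IGNORECASE)


def looks_like_pdf(url):
    """Hypothetical helper: True when the URL ends in .pdf (plus optional ?query or #fragment)."""
    return bool(url_pdf_extension_re.match(url.strip()))


# Hypothetical URLs, for illustration only.
examples = [
    "https://example.org/paper.pdf",
    "https://example.org/paper.PDF?download=1",
    "https://example.org/index.php/journal/article/download/12/34",
    "https://example.org/paper.pdf/view",
]

for url in examples:
    print(looks_like_pdf(url), url)
```

The first two URLs match and the last two do not, so a download link without a .pdf suffix is not recognised, while any URL that merely ends in .pdf is treated as a free PDF whether or not the file is still reachable.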

We can and should add an exception here for SemanticScholar, but maybe we should reconsider this more broadly. There's no way to know whether a URL with a .pdf extension really leads to a PDF file. With most big publishers nowadays, even if it does, there's generally CloudFlare or something else in front which makes the PDF occasionally impossible to get. On the other hand, software like OJS serves the PDF directly but doesn't use a suffix in the URL. So I'd like to test a couple of options:

  • get rid of the exception entirely;
  • keep it, but send a request and see whether we actually get an HTTP 200 response with a PDF content-type.

Event Timeline

Nemo_bis triaged this task as High priority.
Nemo_bis created this task.
Nemo_bis renamed this task from Search new OA locations or SemanticScholar PDF links to Search new OA locations for SemanticScholar PDF links. May 4 2021, 8:32 PM

I'm testing with that .pdf regex match completely disabled and it's mostly hdl-access and doi-access edits being found. A first batch of OAbot edits is ongoing with:

9069 https://doi.org
2161 http://hdl.handle.net

Some of these are hdl-access additions for hdl identifiers which were already added previously. These allow some assorted cleanup of dead URLs as well. In the previous ordinary batch there were also quite a few hdl-access=free additions but I'm not entirely sure why, perhaps Unpaywall has refetched or otherwise reprioritised some institutional repositories.

The most common domains in the PDF URLs which normally make us skip work on a citation are listed below. DeepBlue is far more common than SemanticScholar, which however remains in the top 10 of the most common at-risk domains (publishers and domains which are not open archives).

Given these numbers, I feel it's not going to be possible to have a different logic based on hardcoding certain domains. If any URL-dependent skipping is to be implemented, it will need to involve an HTTP request or other method to verify the current health of the URL.

$ find /data/project/oabot/www/python/src/bot_cache/ -maxdepth 1 -name "*json" -exec cat {} + | grep -Eo '\| *url *= *https?://[^/]+' | sed --regexp-extended 's,\| *url *= *,,g' | sort | uniq -c | sort -nr | head -n 30
551 https://deepblue.lib.umich.edu
476 https://www.ams.org
203 http://www.scielo.br
175 https://hal.archives-ouvertes.fr
148 http://docs.lib.noaa.gov
135 http://www.aseanbiodiversity.info
115 https://www.int-res.com
112 http://www.ams.org
105 https://www.cambridge.org
86 https://zenodo.org
85 http://icb.oxfordjournals.org
81 https://link.springer.com
76 https://digital.csic.es
74 https://pdfs.semanticscholar.org
73 http://discovery.ucl.ac.uk
70 https://www.nature.com
61 https://escholarship.org
58 https://authors.library.caltech.edu
57 https://academic.oup.com
55 https://iris.unito.it
53 http://dx.doi.org
51 http://eprints.whiterose.ac.uk
50 https://dspace.mit.edu
49 http://www.app.pan.pl
48 http://www.nature.com
44 http://pubman.mpdl.mpg.de
43 https://www.pure.ed.ac.uk
43 http://spiral.imperial.ac.uk
41 http://matwbn.icm.edu.pl
41 http://doc.rero.ch
40 https://kuscholarworks.ku.edu
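
For anyone who wants to redo this count without the shell pipeline, a rough Python equivalent might look like the following. The function names are made up for this sketch; the pattern mirrors the grep expression in the command above, and the cache directory is the one used there.

```python
import re
from collections import Counter
from pathlib import Path

# Same idea as the grep above: "| url = http(s)://host", capturing scheme and host.
url_param_re = re.compile(r'\| *url *= *(https?://[^/\n]+)')


def count_url_hosts(text):
    """Count scheme+host prefixes of |url= parameters found in the text."""
    return Counter(url_param_re.findall(text))


def count_hosts_in_cache(cache_dir):
    """Sum the counts over every *json file directly inside cache_dir."""
    counts = Counter()
    for path in Path(cache_dir).glob("*json"):
        text = path.read_text(encoding="utf-8", errors="replace")
        counts.update(count_url_hosts(text))
    return counts
```

Calling `count_hosts_in_cache("/data/project/oabot/www/python/src/bot_cache/").most_common(30)` should give a list comparable to the one above, although ties may be ordered differently than with `sort -nr`.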
Nemo_bis renamed this task from Search new OA locations for SemanticScholar PDF links to Search new OA locations for SemanticScholar and other PDF links. May 6 2021, 3:08 PM

The bots are working very happily together:
https://en.wikipedia.org/w/index.php?title=Jacob_van_Ruisdael&diff=prev&oldid=1021828979
https://en.wikipedia.org/w/index.php?title=Cephalotes_persimplex&diff=next&oldid=1021923392

The first run, with some 15k edits performed, went quite well. I see no reason to do anything more complicated than this: let's keep adding OA identifiers even when there's a URL. It helps when the URLs break and in many other cases.

Nemo_bis updated the task description.

We're still adding identifiers and identifier metadata for citations with existing URLs, so I think this is sufficiently fixed. https://en.wikipedia.org/w/index.php?title=Autonomous_aircraft&diff=prev&oldid=1169944905