We have quite a few broken SemanticScholar links, even after GreenC fixed most of them:
- https://en.wikipedia.org/?oldid=1020920408#Fix_pdfs.semanticscholar.org_links
- https://en.wikipedia.org/wiki/Wikipedia:Link_rot/cases/pdfs.semanticscholar.org
We should try to add new OA locations to these, starting with identifiers (doi-access=free, pmc, hdl), but at the moment we don't, because we ignore citations that contain a link to a PDF-looking file:
```python
url_pdf_extension_re = re.compile(r'.*\.pdf([\?#].*)?$', re.IGNORECASE)

class UrlArgumentMapping(ArgumentMapping):
    def present_and_free(self, template):
        val = self.get(template)
        if val:
            match = url_pdf_extension_re.match(val.strip())
            if match:
                return True
        return False
```
We can and should add an exception here for SemanticScholar, but maybe we should reconsider this more broadly. There's no way to know from the URL alone whether a link with a .pdf extension really leads to a PDF file. With most big publishers nowadays, even if it does, there's generally Cloudflare or something similar in front, which occasionally makes the PDF impossible to get. On the other hand, software like OJS serves the PDF directly but doesn't use a suffix in the URL. So I'd like to test a couple of options:
- get rid of the exception entirely;
- keep it, but send a request and see whether we actually get an HTTP 200 response with a PDF content-type.
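
To make the second option concrete, here is a rough sketch of what the check could look like, combined with the SemanticScholar exception. It is not a patch against the real `UrlArgumentMapping`: `BROKEN_PDF_HOSTS`, `is_known_broken_host`, `is_pdf_response`, `url_serves_pdf`, the `check` parameter and the `User-Agent` string are all hypothetical names, and the standalone `present_and_free` takes the raw URL value instead of a template. It also assumes a HEAD request is enough, which may not hold for servers that reject HEAD.

```python
import re
import urllib.error
import urllib.request
from urllib.parse import urlsplit

url_pdf_extension_re = re.compile(r'.*\.pdf([\?#].*)?$', re.IGNORECASE)

# Hypothetical list of hosts whose .pdf links are known to be rotten.
BROKEN_PDF_HOSTS = {"pdfs.semanticscholar.org"}


def is_known_broken_host(url):
    """Return True if the URL points at a host we no longer trust."""
    try:
        host = urlsplit(url.strip()).hostname or ""
    except ValueError:
        return False
    return host.lower() in BROKEN_PDF_HOSTS


def is_pdf_response(status, content_type):
    """Return True for an HTTP 200 response whose media type is PDF."""
    if status != 200 or not content_type:
        return False
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type == "application/pdf"


def url_serves_pdf(url, timeout=10):
    """Send a HEAD request and report whether the server claims a PDF.

    Assumption: HEAD is supported; a fallback GET may be needed in practice.
    """
    request = urllib.request.Request(
        url, method="HEAD", headers={"User-Agent": "oabot-sketch"}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return is_pdf_response(
                response.status, response.headers.get("Content-Type")
            )
    except (urllib.error.URLError, ValueError, OSError):
        # Network errors, HTTP errors and malformed URLs all count as "no PDF".
        return False


def present_and_free(val, check=url_serves_pdf):
    """Standalone variant of the check above, operating on a raw URL value."""
    if not val:
        return False
    url = val.strip()
    if is_known_broken_host(url):
        return False
    if url_pdf_extension_re.match(url):
        return check(url)
    return False
```

Passing `check` as a parameter keeps the network call replaceable, so the first option (dropping the exception entirely) can be compared by swapping in `lambda url: False`, and the logic can be tested without hitting publishers' servers.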