
Do not suggest URLs redundant with DOI
Closed, Resolved (Public)

Description

Although we have a blacklist for the major legacy publishers and a check to ignore publisher locations from Unpaywall, OAbot suggestions still contain things like:

327 http://www.biodiversitylibrary.org
298 http://www.app.pan.pl
259 http://www.scielo.br

Adding such URLs is not a mistake, but it can be considered pointless if they go to the same place as the DOI. Citation bot users have been busy removing such redundant URLs, and we should not undo their work. Given the long queue of suggestions available, it's best to focus the tool's users' time on the suggestions with the most added value.

A simple solution might be to ignore the PDF URLs associated with a doi.org URL on Dissemin paper records. A better solution might be to check whether the URL really means "the DOI is gold/hybrid/bronze OA" and, if so, just set |doi-access=free and avoid touching |url=. This is the logical continuation of T228632 and might be considered a return to the original functionality.
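The second option could be sketched roughly as follows. This is only an illustration, not OAbot's actual code: the function name `plan_edit` and the shape of `existing_params` are hypothetical, and the `oa_status` values are assumed to follow Unpaywall's vocabulary (gold, hybrid, bronze, green, closed).

```python
# Hypothetical sketch; names and data shapes are assumptions, not OAbot's API.
PUBLISHER_OA_STATUSES = {"gold", "hybrid", "bronze"}


def plan_edit(oa_status, candidate_url, existing_params):
    """Return the citation template parameters to add for one citation.

    oa_status       -- OA status string for the DOI (assumed Unpaywall-style)
    candidate_url   -- free full-text URL found for the paper, or None
    existing_params -- dict of parameters already present in the citation
    """
    changes = {}
    if oa_status in PUBLISHER_OA_STATUSES:
        # The DOI itself leads to a free copy: flag it and leave |url= alone.
        if existing_params.get("doi-access") != "free":
            changes["doi-access"] = "free"
        return changes
    # Otherwise suggest the URL only when the citation has no |url= yet.
    if candidate_url and not existing_params.get("url"):
        changes["url"] = candidate_url
    return changes
```

With this logic, a gold OA article whose candidate URL is on the publisher's own site yields only `doi-access=free`, while a green OA article with a repository copy still gets a `url` suggestion.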

Event Timeline

Nemo_bis created this task.

This is partially fixed now, at least for the bigger runs, by rejecting the publisher URLs suggested by Unpaywall (although some still sneak in via repositories) and skipping Dissemin suggestions (for which we don't have this information).
https://github.com/dissemin/oabot/commit/5b56d0b63e2e6cbd57d0eb0d595deda1cd69a100#diff-d9cb15a512f67f322cb4cee14c682e3eR284
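For readers who do not want to open the commit, the idea of rejecting publisher locations can be illustrated like this. It is a simplified sketch rather than the code in the linked commit: the function name is hypothetical, and the input is assumed to be a list of Unpaywall-style location dicts with a `host_type` field of either "publisher" or "repository".

```python
# Hypothetical sketch; the real check lives in the commit linked above.
def non_publisher_locations(oa_locations):
    """Keep only the OA locations that are not hosted by the publisher."""
    return [
        loc for loc in oa_locations
        if loc.get("host_type") != "publisher"
    ]


# Made-up sample data in the assumed shape.
sample = [
    {"host_type": "publisher", "url": "http://www.scielo.br/article.pdf"},
    {"host_type": "repository", "url": "https://zenodo.org/record/1"},
]
kept = non_publisher_locations(sample)
```

A filter of this kind only looks at the declared host type, which is one way a publisher copy mirrored by a repository could still get through.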

I'm refreshing the oabot suggestions after T233715 and T228632#5529941; when that's done, I'll re-assess how many redundant URLs get proposed nonetheless. It may take a few weeks.

Nemo_bis claimed this task.

I don't see any publisher in the top 10 (or top 100 really) so I think this is solved enough.

907 https://www.biodiversitylibrary.org
290 http://pdfs.semanticscholar.org
137 https://zenodo.org
123 https://www.osti.gov
 99 http://citeseerx.ist.psu.edu
 96 https://academiccommons.columbia.edu
 57 https://lirias.kuleuven.be
 40 https://hcommons.org
 30 https://www.biorxiv.org
 30 https://escholarship.org
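
A tally like the one above can be reproduced from any list of suggested URLs. The sketch below is one possible way to do it, not necessarily how these numbers were produced; `count_hosts` is a hypothetical helper and the input is assumed to be a plain list of URL strings.

```python
from collections import Counter
from urllib.parse import urlsplit


def count_hosts(urls):
    """Count suggested URLs per scheme and host, most frequent first."""
    counts = Counter()
    for url in urls:
        parts = urlsplit(url)
        if parts.netloc:  # skip strings that are not absolute URLs
            counts[f"{parts.scheme}://{parts.netloc}"] += 1
    return counts.most_common()


# Made-up input for illustration.
for host, n in count_hosts([
    "https://zenodo.org/record/1",
    "https://zenodo.org/record/2",
    "https://www.osti.gov/biblio/3",
]):
    print(f"{n:4d} {host}")
```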