
Gather OAbot suggestions from all pages with citations
Open, Medium, Public

Description

For a while now I've run the server-side OAbot prefill.py only on the ~400k English Wikipedia articles which link [[digital object identifier]], as a shortcut to identify articles containing citations with a DOI, rather than on all ~4M articles which embed a CS1 template.
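For context, one way to enumerate those ~400k articles is the MediaWiki backlinks API with redirect resolution; a rough sketch follows (prefill.py may select pages differently, e.g. from the replica databases, so treat this as illustrative only):

```
# Illustrative only: list main-namespace enwiki pages linking
# [[Digital object identifier]], including links made via redirects.
import requests

API = "https://en.wikipedia.org/w/api.php"

def articles_linking_doi():
    params = {
        "action": "query",
        "format": "json",
        "list": "backlinks",
        "bltitle": "Digital object identifier",
        "blnamespace": 0,
        "blredirect": 1,      # also pages that link through a redirect
        "bllimit": "max",
    }
    session = requests.Session()
    while True:
        data = session.get(API, params=params).json()
        for link in data["query"]["backlinks"]:
            yield link["title"]
            # Pages linking via a redirect are nested under "redirlinks".
            for sub in link.get("redirlinks", []):
                yield sub["title"]
        if "continue" not in data:
            break
        params.update(data["continue"])
```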

Now that we have better multithreading, I'm running the prefill on all 4M articles again, and it seems that many more suggestions were waiting. Some are citations with a DOI (maybe the redirect from [[doi (identifier)]] wasn't previously picked up, or the pagelinks table had to be updated by a recent edit), which would be found even in an Unpaywall-only run; others are non-DOI citations which match a title and author search on Dissemin.
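For the non-DOI citations, the lookup is by title and author; something along these lines, where the endpoint and payload shape are my reading of the Dissemin API docs rather than what prefill.py literally sends:

```
# Hedged sketch of a Dissemin title/author query; the endpoint and payload
# shape are assumptions to check against the Dissemin API documentation.
import requests

def dissemin_pdf_url(title, first, last, date):
    payload = {
        "title": title,
        "date": date,                        # e.g. "2015-01-01"
        "authors": [{"first": first, "last": last}],
    }
    r = requests.post("https://dissem.in/api/query", json=payload, timeout=30)
    r.raise_for_status()
    paper = r.json().get("paper") or {}
    return paper.get("pdf_url")              # None if no free copy was found
```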

I'm not sure the Dissemin server can handle millions of requests weekly, but I think we can definitely switch on the Unpaywall-only prefill for all articles. Given that most citations don't contain a DOI and parsing the wikitext consumes most of the CPU time, this will actually slow down the rate of DOI discovery and reduce the number of queries performed per day against the Unpaywall API.
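For reference, an Unpaywall-only check is a single GET per DOI; a minimal sketch (the contact email is a placeholder and the doi-access=free decision is simplified compared to OAbot's actual logic):

```
# Minimal sketch of an Unpaywall lookup for one DOI; simplified, not
# OAbot's actual suggestion logic.
import requests

CONTACT = "example@example.org"   # placeholder; Unpaywall asks for an email

def unpaywall_suggestion(doi):
    url = "https://api.unpaywall.org/v2/" + doi
    r = requests.get(url, params={"email": CONTACT}, timeout=30)
    r.raise_for_status()
    record = r.json()
    if not record.get("is_oa"):
        return None
    best = record.get("best_oa_location") or {}
    if best.get("host_type") == "publisher":
        # Free to read at the publisher: the doi-access=free case.
        return {"doi-access": "free"}
    return {"url": best.get("url")}   # otherwise suggest a link to the OA copy
```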

Event Timeline

Nemo_bis triaged this task as Medium priority.
Nemo_bis created this task.

~8 hours into the first run, it's CPU-bound, so I'm not sure how long it's going to take. So far it has found only some 20 doi-access=free suggestions, but then the latest (half-finished) Dissemin-powered run was only a couple of days ago.

The quota/CPU increase meant the cronjob was no longer CPU-bound and the API calls became too fast. I've manually reduced it to a maximum of 2 concurrent requests.
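For the record, the general pattern is just a semaphore around the HTTP calls, something like the sketch below (not the actual cronjob code):

```
# Sketch of capping in-flight API requests at two across worker threads;
# not the actual cronjob change.
import threading
import requests

MAX_CONCURRENT_REQUESTS = 2
_api_slots = threading.BoundedSemaphore(MAX_CONCURRENT_REQUESTS)

def throttled_get(url, **kwargs):
    # Each worker thread holds one of the two slots for the duration of the call,
    # so wikitext parsing can stay parallel while the API traffic is throttled.
    with _api_slots:
        return requests.get(url, timeout=30, **kwargs)
```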