Even if we fix T192708 ("Let the prefill run until the end when going through all articles"), a complete run will always take some time (probably several days). It would be nice to be able to generate suggestions for a list of items, for instance titles or DOIs which we know have been worked on recently.
There are two steps:
- provide a list of page titles, which can simply be passed to get_proposed_edits();
- provide a search criterion or filter, from which to generate a list of titles, for instance from a list of DOIs or a category or whatever appears in the articles of interest (hopefully we can just use the search API so that one can use all the CirrusSearch capabilities; but we could also query links in the database or filter the results from the list of template usages).
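As a rough illustration of the second step, the sketch below turns a CirrusSearch query into a list of titles through the standard MediaWiki action API (`action=query&list=search`). The function names, the `API_URL` value (English Wikipedia), the User-Agent string and the default limit are assumptions made for this example; nothing like this exists in the tool yet.

```python
import json
import urllib.parse
import urllib.request

# Assumption: we query English Wikipedia; adjust for another wiki.
API_URL = "https://en.wikipedia.org/w/api.php"


def fetch_json(params):
    """Perform one GET request against the action API and decode the JSON reply."""
    url = API_URL + "?" + urllib.parse.urlencode(params)
    request = urllib.request.Request(
        url, headers={"User-Agent": "title-list-sketch/0.1"}  # hypothetical UA
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def search_titles(query, limit=500, fetch=fetch_json):
    """Yield up to `limit` main-namespace page titles matching a search query."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": query,
        "srnamespace": 0,
        "srlimit": 50,
        "format": "json",
    }
    yielded = 0
    while yielded < limit:
        data = fetch(params)
        for hit in data.get("query", {}).get("search", []):
            yield hit["title"]
            yielded += 1
            if yielded >= limit:
                return
        if "continue" not in data:
            return  # no further result pages
        # Merge the continuation parameters (e.g. sroffset) into the next request.
        params = {**params, **data["continue"]}
```

Something like `list(search_titles('insource:"10.1000/example"'))` (a made-up DOI) would then give titles that can be fed one by one to get_proposed_edits().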
For now I'm doing the first of the two: I've just created a quick and dirty bash command (which is not even able to handle quotes and parentheses in titles; I'd rather avoid escaping them with pattern matching):
cd ~/www/python/src; echo $1; ~/www/python/venv/bin/python -c 'from app import get_proposed_edits, app; get_proposed_edits("'$1'", True)'
This can be passed to jsub with as little as 150 MB of memory and a wait of at least 0.5 seconds. For 210k titles from https://zenodo.org/record/997222/files/enwiki-20170720-pages-articles-citations.tsv.xz , that will still take one week.
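The quoting problem of the one-liner could probably be sidestepped by passing the title as a command-line argument instead of pasting it into the Python source. A minimal sketch, assuming a small script (hypothetically named `suggest_one.py`) placed next to `app.py`; the `main` function and its `propose` parameter are invented for this example, and the second argument `True` is simply copied from the command above:

```python
import sys


def main(argv, propose=None):
    """Process the single page title given as the first command-line argument."""
    if len(argv) != 2:
        print("usage: suggest_one.py TITLE", file=sys.stderr)
        return 2
    if propose is None:
        # Imported lazily so the script can be loaded (and tested) without the app.
        from app import get_proposed_edits as propose
    title = argv[1]  # arrives verbatim: quotes and parentheses need no escaping
    print(title)
    propose(title, True)
    return 0
```

The script would end with `sys.exit(main(sys.argv))` and be called as `~/www/python/venv/bin/python suggest_one.py "$1"`, so only the shell's own double quotes around `$1` are needed.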
A proper solution will handle the multithreading within Python itself, e.g. https://stackoverflow.com/a/28913218/1333493
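One possible shape for this, sketched with the standard library's `concurrent.futures` (not necessarily the approach of the linked answer). `process_titles`, `worker` and `max_workers` are names made up for the example, and whether get_proposed_edits() is safe to call from several threads at once is an open assumption that would need checking first.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_titles(titles, worker, max_workers=4):
    """Call worker(title) for every title on a thread pool.

    Returns a dict mapping each failed title to the exception it raised,
    so that one bad title does not abort the whole run.
    """
    failures = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to its title so failures can be reported.
        futures = {pool.submit(worker, title): title for title in titles}
        for future in as_completed(futures):
            error = future.exception()
            if error is not None:
                failures[futures[future]] = error
    return failures
```

It could be driven with something like `process_titles(titles, lambda t: get_proposed_edits(t, True))`. Threads only help if most of the time per title is spent waiting on the network; the pool size would have to be tuned against the memory limit of the job.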