Suggestions endpoints for SDC image caption addition/translation
Adds endpoints for suggesting Commons files for SDC caption editing.
The suggestion algorithm is as follows:
- A pool of 500 Commons file candidates with the required source and target language structured caption characteristics is requested using the CirrusSearch-powered wikimediaeditortaskssuggestions Action API module;
- The candidate set is narrowed to a random sample of 50 images;
- An imageinfo query is made for unstructured captions, MIME type, and global usage data for the remaining candidates;
- Non-image files are filtered out, and structured data is requested for the remaining candidates;
- Candidates are given a final validity check, and those that remain are returned.
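
The steps above can be sketched roughly as follows. This is a minimal
illustration, not the service's actual code: the function name
suggest_candidates and the four injected callables (search,
get_image_info, get_entities, is_valid) are hypothetical stand-ins for
the Action API calls and the validity check, and the shapes of their
return values are assumptions.

```python
import random

POOL_SIZE = 500   # candidates requested from the search module
SAMPLE_SIZE = 50  # random sample kept for the heavier follow-up queries


def suggest_candidates(search, get_image_info, get_entities, is_valid, rng=random):
    """Run the narrowing pipeline over injected (hypothetical) fetchers.

    search(n)            -> iterable of up to n file titles
    get_image_info(ts)   -> dict: title -> {"mime": ..., ...}
    get_entities(ts)     -> dict: title -> structured data entity
    is_valid(entity)     -> bool, the final validity check
    """
    # 1. Request the initial pool of candidates.
    pool = list(search(POOL_SIZE))

    # 2. Narrow the pool to a random sample.
    sample = rng.sample(pool, min(SAMPLE_SIZE, len(pool)))

    # 3. Fetch image info (captions, MIME type, global usage) for the sample.
    info = get_image_info(sample)

    # 4. Drop non-image files, then fetch structured data for the rest.
    images = [
        title for title in sample
        if info.get(title, {}).get("mime", "").startswith("image/")
    ]
    entities = get_entities(images) if images else {}

    # 5. Final validity check; return whatever survives.
    return [
        {"title": title, "info": info[title], "entity": entities[title]}
        for title in images
        if title in entities and is_valid(entities[title])
    ]
```

Because every stage can only shrink the sample, the response may contain
far fewer than 50 items, and may be empty.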
For reasons yet to be determined, the initial candidate search
occasionally returns invalid candidates, necessitating a follow-up
wbgetentities query in all cases. (This may only occur when using the
module in generator mode; more testing is needed.) If this is fixed, and
structured captions info is not required in the suggestions response, then
the final validity checks could be eliminated, or we could stop
requesting and returning structured captions altogether.
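
For illustration, the final validity check against a wbgetentities
entity might look like the sketch below. The criteria are an assumption
inferred from the endpoint names (addition: no caption in the target
language; translation: additionally, a caption in the source language),
and the helper names has_caption and is_valid_candidate are
hypothetical. It relies on captions being exposed under the entity's
"labels" key, keyed by language code.

```python
def has_caption(entity, lang):
    """Return True if the entity carries a non-empty caption in lang."""
    labels = entity.get("labels") or {}
    # Defensively treat anything that is not a mapping (for example an
    # empty list) as "no captions".
    if not isinstance(labels, dict):
        return False
    label = labels.get(lang)
    return bool(label and label.get("value"))


def is_valid_candidate(entity, target_lang, source_lang=None):
    """Assumed criteria: the target caption must be absent; for
    translation, a source caption must also be present."""
    if has_caption(entity, target_lang):
        return False
    return source_lang is None or has_caption(entity, source_lang)
```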
Due to the need to make multiple heavy API calls in series, these
endpoints are quite slow. In the event that this performs inadequately
in production, we can mitigate the problem by lowering the sample size
and/or dropping the gathering of unstructured captions. Lowering the
candidate sample size extracted from the initial pool of candidates
would reduce the impact of the extmetadata query, but at the expense of
fewer results after filtering and possible zero-results responses.
Bumps the service's minor version to 0.6.0 to reflect the addition of the
new endpoints.