Suggestions endpoints for SDC image caption addition/translation

Authored by Mholloway on May 30 2019, 2:40 PM.

Not On Permanent Ref: This commit is not an ancestor of any permanent ref.
This commit has been deleted in the repository: it is no longer reachable from any branch, tag, or ref.


Adds endpoints for suggesting Commons files for SDC caption editing.

The suggestion algorithm is as follows:

  1. A pool of 500 Commons file candidates with the required source and target language structured caption characteristics is requested using the CirrusSearch-powered wikimediaeditortaskssuggestions Action API module;
  2. The candidate set is narrowed to a random sample of 50 images;
  3. An imageinfo query is made for unstructured captions, MIME type, and global usage data for the remaining candidates;
  4. Non-image files are filtered out, and structured data is requested for the remaining candidates;
  5. Candidates are given a final validity check, and those that remain are returned.
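
The steps above can be sketched roughly as follows. This is a minimal illustration in Python, not the service's actual code: the function name `suggest_candidates` and the four injected callables (`search`, `get_image_info`, `get_structured`, `is_valid`) are hypothetical stand-ins for the API calls described in the list, and the response shapes are assumed.

```python
import random

POOL_SIZE = 500    # size of the initial candidate pool (step 1)
SAMPLE_SIZE = 50   # size of the random sample (step 2)


def suggest_candidates(search, get_image_info, get_structured, is_valid,
                       rng=random):
    """Run the five-step pipeline with injected (hypothetical) API helpers.

    search(limit)          -> list of file titles
    get_image_info(titles) -> dict: title -> {"mime": ..., ...}
    get_structured(titles) -> dict: title -> entity data (or missing key)
    is_valid(entity)       -> bool
    """
    # Step 1: request the candidate pool.
    pool = search(limit=POOL_SIZE)

    # Step 2: narrow to a random sample (or the whole pool if it is small).
    sample = rng.sample(pool, min(SAMPLE_SIZE, len(pool)))

    # Step 3: fetch unstructured captions, MIME type, and usage data.
    info = get_image_info(sample)

    # Step 4: drop non-image files, then fetch structured data for the rest.
    images = [t for t in sample
              if info.get(t, {}).get("mime", "").startswith("image/")]
    structured = get_structured(images)

    # Step 5: final validity check; return whatever survives.
    return [
        {"title": t, "info": info[t], "structured": structured[t]}
        for t in images
        if is_valid(structured.get(t))
    ]
```

Because every step after the first only removes candidates, the number of results is at most 50 and can be zero.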


For reasons yet to be determined, the initial candidate search
occasionally returns invalid candidates, necessitating a follow-up
wbgetentities query in all cases. (This may only occur when using the
module in generator mode; more testing is needed.) If this is fixed, and
structured captions info is not required in the suggestions response, then
the final validity checks could be eliminated, or we could stop
requesting and returning structured captions altogether.
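
For illustration, the final validity check might look something like the sketch below. The exact criteria are not spelled out here, so this is an assumption: it treats a candidate as valid when its wbgetentities entity exists, has no caption (label) in the target language, and, for translation, has one in the source language. The names `has_caption` and `is_valid_candidate` are hypothetical.

```python
def has_caption(entity, lang):
    """Return True if the entity carries a non-empty label for lang."""
    labels = (entity or {}).get("labels")
    # Guard defensively in case an empty label map is not a dict.
    if not isinstance(labels, dict):
        return False
    label = labels.get(lang)
    return bool(label and label.get("value"))


def is_valid_candidate(entity, target_lang, source_lang=None):
    """Assumed rule: no target-language caption yet; for translation
    (source_lang given), a source-language caption must exist."""
    if entity is None or "missing" in entity:
        return False
    if has_caption(entity, target_lang):
        return False
    if source_lang is not None and not has_caption(entity, source_lang):
        return False
    return True
```

With this rule, passing only `target_lang` corresponds to caption addition, and passing both languages corresponds to caption translation.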

Due to the need to make multiple heavy API calls in series, these
endpoints are quite slow. In the event that this performs inadequately
in production, we can mitigate the problem by lowering the sample size
and/or dropping the gathering of unstructured captions. Lowering the
candidate sample size extracted from the initial pool of candidates
would reduce the cost of the extmetadata query, but at the expense of
fewer results after filtering and possible zero-result responses.

Also bumps the service's minor version to 0.6.0 to reflect the addition
of the new endpoints.

Bug: T209997
Bug: T220034
Change-Id: I862bd382e4d93921a92467bd5a66435acd3ee53a