Suggestions endpoints for SDC image caption addition/translation
d5bd2f2eb9e8 (Unpublished)

Authored by Mholloway on May 30 2019, 2:40 PM.

Not On Permanent Ref: This commit is not an ancestor of any permanent ref.
This commit has been deleted in the repository: it is no longer reachable from any branch, tag, or ref.

Description

Adds endpoints for suggesting Commons files for SDC caption editing.

The suggestion algorithm is as follows:

  1. A pool of 500 Commons file candidates with the required source and target language characteristics is requested using the CirrusSearch-powered wikimediaeditortaskssuggestions Action API module, along with MIME info from imageinfo;
  1. Candidates with non-image MIME types are filtered out;
  1. The candidate set is narrowed to a random sample of 50 images;
  1. If data on legacy unstructured ImageDescriptions is required, an additional imageinfo request is made for this data for the remaining candidates;
  1. Info on structured data is requested for the candidates;
  1. Candidates are again filtered based on the desired properties, and those that remain are returned.
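
The local steps of the algorithm above (MIME filtering, random sampling, and the final filter) can be sketched roughly as follows. This is an illustrative sketch only, not the code in this commit: the candidate dict shape (`mime`, `structured`) and all function names are hypothetical, and the final filter assumes that "desired properties" means a missing target-language caption for addition, and a source-language caption without a target-language one for translation.

```python
import random

POOL_SIZE = 500    # size of the initial candidate pool, per the description
SAMPLE_SIZE = 50   # size of the random sample, per the description


def narrow_candidates(pool, sample_size=SAMPLE_SIZE, rng=random):
    """Drop non-image candidates, then take a random sample (steps 2 and 3)."""
    images = [c for c in pool if c.get("mime", "").startswith("image/")]
    if len(images) <= sample_size:
        # Fewer images than the sample size: keep them all.
        return images
    return rng.sample(images, sample_size)


def filter_for_addition(candidates, target):
    """Assumed step 6 for addition: keep files with no structured caption in `target`."""
    return [c for c in candidates if target not in c.get("structured", {})]


def filter_for_translation(candidates, source, target):
    """Assumed step 6 for translation: captioned in `source`, not yet in `target`."""
    return [
        c for c in candidates
        if source in c.get("structured", {})
        and target not in c.get("structured", {})
    ]
```

Sampling after the MIME filter, rather than before it, keeps the sample from being diluted by candidates that would be discarded anyway.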

Notes:

For reasons yet to be determined, the initial candidate search
occasionally returns invalid candidates, necessitating a follow-up
wbgetentities query in all cases. If this is fixed, and structured
captions info is not required in the suggestions response, then this
query can be eliminated.

There is some debate over whether to involve the presence or absence of
unstructured captions in the inclusion criteria. Requesting this info
slows down the response dramatically.

The imageinfo request, where needed, is here made on its own rather than
as part of the initial candidate request. This is to increase the
randomness of the suggestions served. Results from the initial search
are not random across requests, and including an imageinfo extmetadata
query would mean having to lower the limit to approximately 50, or else
the request will time out.
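
For illustration, the separate follow-up request for legacy descriptions might be parameterized along these lines. The parameter names are standard MediaWiki Action API imageinfo parameters, but the function name and the exact set of parameters are assumptions here and may differ from what this commit actually sends.

```python
def description_params(titles):
    """Build query parameters for a follow-up imageinfo extmetadata request.

    Hypothetical helper: requests only the ImageDescription field, in all
    available languages, for the already-sampled candidate titles.
    """
    return {
        "action": "query",
        "format": "json",
        "prop": "imageinfo",
        "iiprop": "extmetadata",
        "iiextmetadatafilter": "ImageDescription",
        "iiextmetadatamultilang": "1",
        "titles": "|".join(titles),
    }
```

Because this request is made only for the sampled candidates, the expensive extmetadata lookup does not constrain the size of the initial pool.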

Lowering the candidate sample size extracted from the initial pool of
candidates would also mitigate the impact of the extmetadata query, but
at the expense of lower numbers of results after filtering and possible
zero-results responses.

To demonstrate this, multiple implementations of both endpoints are
available here for testing, with implementations involving unstructured
captions available via query parameters:

/caption/addition/{target}?includeUnstructured

Requests unstructured captions and includes them in the response, but
performs no filtering based on their presence or absence.

/caption/addition/{target}?requireUnstructured

Requests unstructured captions and filters out candidates that do
not have one in the target language.

/caption/translation/from/{source}/to/{target}?includeUnstructured

Requests unstructured captions and includes them in the response, but
performs no filtering based on their presence or absence.

/caption/translation/from/{source}/to/{target}?requireUnstructuredSource

Requests unstructured captions and filters out candidates that do not
have one in the source language.

/caption/translation/from/{source}/to/{target}?requireUnstructuredSourceNoTarget

Requests unstructured captions and filters out candidates that do not
have one in the source language, or that already have one in the target
language.

Without a query parameter, unstructured captions are not requested.
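
The filtering behavior of the query parameters listed above can be summarized as a single predicate. This is a sketch for comparison purposes only; the function name and the representation of a candidate's unstructured captions as a collection of language codes are hypothetical.

```python
def keep_candidate(unstructured_langs, mode=None, source=None, target=None):
    """Decide whether a candidate survives the unstructured-caption filter.

    `unstructured_langs` holds the language codes for which the file has a
    legacy unstructured caption; `mode` is the query parameter name, or None
    when no parameter was given.
    """
    langs = set(unstructured_langs)
    if mode is None or mode == "includeUnstructured":
        # No filtering on unstructured captions in either case.
        return True
    if mode == "requireUnstructured":
        return target in langs
    if mode == "requireUnstructuredSource":
        return source in langs
    if mode == "requireUnstructuredSourceNoTarget":
        return source in langs and target not in langs
    raise ValueError(f"unknown mode: {mode}")
```

For example, `keep_candidate(["en", "de"], "requireUnstructuredSourceNoTarget", source="en", target="de")` returns `False`, because the file already has an unstructured caption in the target language.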

Note: This code will shrink once the final implementation is agreed
upon and the alternative variants are removed.

Bug: T209997
Bug: T220034
Change-Id: I862bd382e4d93921a92467bd5a66435acd3ee53a

Details

Committed
Mholloway, May 30 2019, 6:56 PM
Parents
rMSRAe8f45645554c: Morelike: output source language
Branches
Unknown
Tags
Unknown
ChangeId
I862bd382e4d93921a92467bd5a66435acd3ee53a