
Contributor gets list of articles that need lead images
Open, Needs Triage, Public

Description

"As a Contributor, I want a list of articles that need lead images, so I can choose one and add a lead image to it."

This is the primary endpoint for the suggested edits interface.

Queue size

This endpoint returns only ''some'' of the pages on the wiki that need lead images. There's no mechanism to page through all of them with a cursor.

Suggesting lead images

Our research team is still working on the algorithm for identifying candidate lead images for an article. However, this pseudocode covers the current idea for the algorithm. The final algorithm is due 1 Oct 2020, but it should differ from this only in minor tweaks.

function getLeadImageCandidates(article) {
  get the interwiki links for the article;
  for each interwiki link,
    add the lead image for the article in the interwiki link to the list of candidates;
  remove the duplicates from the list of candidates;
  remove files that aren't on Wikimedia Commons from the list of candidates;
  return the list of candidates;
}
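
As a rough illustration, the pseudocode above could be sketched in Python as follows. The three callables (get_interwiki_links, get_lead_image, is_on_commons) are hypothetical stand-ins for whatever lookups the real implementation uses; they are not existing MediaWiki APIs, and the final algorithm from the research team may differ.

```python
from typing import Callable, Iterable, List, Optional


def get_lead_image_candidates(
    article: str,
    get_interwiki_links: Callable[[str], Iterable[str]],  # hypothetical lookup
    get_lead_image: Callable[[str], Optional[str]],       # hypothetical lookup
    is_on_commons: Callable[[str], bool],                 # hypothetical lookup
) -> List[str]:
    """Collect lead images from the article's interwiki counterparts."""
    candidates: List[str] = []
    seen = set()
    for linked_article in get_interwiki_links(article):
        image = get_lead_image(linked_article)
        # Skip linked articles without a lead image, and skip duplicates.
        if image is None or image in seen:
            continue
        seen.add(image)
        # Keep only files hosted on Wikimedia Commons.
        if is_on_commons(image):
            candidates.append(image)
    return candidates


# Made-up sample data, for illustration only.
links = {"Earth": ["fr:Terre", "de:Erde", "es:Tierra"]}
lead_images = {
    "fr:Terre": "File:Earth.jpg",
    "de:Erde": "File:Earth.jpg",
    "es:Tierra": "File:Local.png",
}
commons_files = {"File:Earth.jpg"}

result = get_lead_image_candidates(
    "Earth",
    lambda a: links.get(a, []),
    lead_images.get,
    lambda f: f in commons_files,
)
print(result)  # → ['File:Earth.jpg']
```

Passing the lookups in as parameters keeps the sketch independent of any particular storage or API, which seems appropriate while the algorithm is still being finalized.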

Pre-populating the queue

It's possible to generate the queue of articles needing lead images at API request time, but that's probably not very performant.

Another option is to pre-populate the queue with a background job that stores the results in (waves hands) some storage.

Ideally the pages in the queue are removed from the queue if a lead image is added. One way to do this is to add an article edit hook that checks if the article is in the queue, checks if it now has a lead image, and removes it from the queue if it does. This might also be a good time to kick off a job to add another article or two to the end of the queue!
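
The edit-hook idea above might look something like this minimal in-memory sketch. The class name, the has_lead_image check, and the next_candidate callback are all assumptions made for illustration; the real storage and hook mechanism are still undecided.

```python
from typing import Callable, Dict, List, Optional, Tuple


class LeadImageQueue:
    """In-memory stand-in for the (still undecided) queue storage."""

    def __init__(
        self,
        has_lead_image: Callable[[int], bool],                   # hypothetical check
        next_candidate: Callable[[], Optional[Tuple[int, str]]],  # hypothetical top-up source
    ) -> None:
        self._pages: Dict[int, str] = {}  # page id -> title, in insertion order
        self._has_lead_image = has_lead_image
        self._next_candidate = next_candidate

    def add(self, page_id: int, title: str) -> None:
        self._pages[page_id] = title

    def page_ids(self) -> List[int]:
        return list(self._pages)

    def on_article_edit(self, page_id: int) -> bool:
        """Edit hook: drop the page if it is queued and now has a lead image.

        Returns True if the page was removed from the queue.
        """
        if page_id not in self._pages:
            return False
        if not self._has_lead_image(page_id):
            return False
        del self._pages[page_id]
        # Top up the end of the queue with another article, if one is available.
        replacement = self._next_candidate()
        if replacement is not None:
            self.add(*replacement)
        return True
```

In a real deployment the top-up step would more likely be a deferred job than an inline call, so that saving an edit is not slowed down by queue maintenance.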

Queue ordering

"Queues" imply order. There's no request for a particular article order in the queue. I (@eprodromou) need to confirm this.

I think it's better to have more "important" articles earlier in the queue, and less important articles later in the queue. One proxy metric for importance would be number of page reads, but that's stored separately in the Analytics DB. Another is the number of incoming wiki links to the page.

I'll see if I can come up with an ordering that's important to the Android team, but I think using one of these metrics for ordering would be more than enough.

Only pages in content namespaces should be returned.
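
A minimal sketch of the ordering and namespace filter described above, assuming incoming link count is the importance metric and that namespace 0 is the only content namespace (both are assumptions; wikis can configure additional content namespaces). The QueuedPage type and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List

# Assumption: only the main namespace counts as content.
CONTENT_NAMESPACES: FrozenSet[int] = frozenset({0})


@dataclass(frozen=True)
class QueuedPage:
    id: int
    title: str
    namespace: int
    incoming_links: int  # proxy metric for importance


def order_queue(
    pages: Iterable[QueuedPage],
    content_namespaces: FrozenSet[int] = CONTENT_NAMESPACES,
) -> List[QueuedPage]:
    """Keep content-namespace pages, most-linked first; ties break by title."""
    eligible = [p for p in pages if p.namespace in content_namespaces]
    return sorted(eligible, key=lambda p: (-p.incoming_links, p.title))
```

Swapping in page reads would only change the sort key, though that metric would have to be fetched from the separate Analytics DB first.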

Avoiding conflicts

There's a small chance that two contributors might hit the queue at the same time and try to add a lead image to the same article. This API doesn't try to prevent that (say, by locking articles for editing). There may be some other ways to prevent conflicts, like having a big enough queue of articles that each contributor gets a unique list during a short period (say, the ~5 minutes it takes to choose a lead image and save the new article version).
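
One way to picture the "big enough queue" approach is a dispenser that hands each request the next non-overlapping slice of the queue and wraps around at the end. This is only a sketch of one possible approach, not something the task specifies: BatchDispenser and its server-side offset are hypothetical.

```python
from typing import List, Sequence


class BatchDispenser:
    """Hands out consecutive slices of the queue, wrapping around at the end."""

    def __init__(self, queue: Sequence[str], batch_size: int) -> None:
        if batch_size <= 0:
            raise ValueError("batch_size must be positive")
        self._queue = list(queue)
        self._batch_size = batch_size
        self._offset = 0  # server-side position; the endpoint takes no parameters

    def next_batch(self) -> List[str]:
        if not self._queue:
            return []
        size = len(self._queue)
        count = min(self._batch_size, size)
        batch = [self._queue[(self._offset + i) % size] for i in range(count)]
        self._offset = (self._offset + count) % size
        return batch
```

With a queue of Q articles and batches of B, roughly Q / B consecutive requests receive disjoint lists before the dispenser wraps, so Q would need to be large enough to cover the requests expected in the ~5 minute editing window.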

Endpoint

GET /page-queue

Get the next articles that need lead images and for which we have suggestions.

Parameters: none

Notable request headers: none

Notable status codes:
200 - zero or more articles are listed in the response body

Notable response headers: none

Response body: a JSON object representing the queue of articles that need lead images with these properties:

  • pages: an array of zero or more pages. Each page has the following properties (a "small" Page, according to the schema):
    • id
    • title
    • key
    • lead_image_candidates: an array of 1 or more images that are candidates to be the lead image of this page. Each image is a 'File' object, medium size, with the following properties:
      • title
      • license
      • preferred: the preferred file representation, a FileRepresentation object, medium size, with these properties:
        • mediatype
        • width
        • height
        • size
        • url

Note that there should be at least one candidate for each page; no page should have an empty list of candidates.
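
To make the response shape concrete, here is a hedged sketch of an example body plus a small checker for the properties listed above, including the rule that every page has at least one candidate. All values in EXAMPLE_RESPONSE are made-up placeholders (the value formats for license and mediatype are not specified here), and validate_queue is a hypothetical helper, not part of the API.

```python
from typing import Any, Dict, List

PAGE_KEYS = ("id", "title", "key", "lead_image_candidates")
FILE_KEYS = ("title", "license", "preferred")
REPRESENTATION_KEYS = ("mediatype", "width", "height", "size", "url")

# Placeholder values only; real value formats are not defined in this task.
EXAMPLE_RESPONSE: Dict[str, Any] = {
    "pages": [
        {
            "id": 12345,
            "title": "Example article",
            "key": "Example_article",
            "lead_image_candidates": [
                {
                    "title": "File:Example.jpg",
                    "license": "placeholder-license",
                    "preferred": {
                        "mediatype": "placeholder-mediatype",
                        "width": 640,
                        "height": 480,
                        "size": 98765,
                        "url": "https://example.org/Example.jpg",
                    },
                }
            ],
        }
    ]
}


def validate_queue(body: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in a /page-queue response body."""
    pages = body.get("pages")
    if not isinstance(pages, list):
        return ["'pages' must be an array"]
    errors: List[str] = []
    for i, page in enumerate(pages):
        errors += [f"pages[{i}] is missing '{k}'" for k in PAGE_KEYS if k not in page]
        candidates = page.get("lead_image_candidates") or []
        if not candidates:
            errors.append(f"pages[{i}] has no lead image candidates")
        for j, image in enumerate(candidates):
            where = f"pages[{i}].lead_image_candidates[{j}]"
            errors += [f"{where} is missing '{k}'" for k in FILE_KEYS if k not in image]
            preferred = image.get("preferred") or {}
            errors += [
                f"{where}.preferred is missing '{k}'"
                for k in REPRESENTATION_KEYS
                if k not in preferred
            ]
    return errors
```

An empty "pages" array passes the check, matching the 200 status description of "zero or more articles"; a page with an empty candidate list does not.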