
Exclude unillustrated articles that should not have images
Closed, Resolved | Public | 0 Estimated Story Points

Description

Context

Something has come up in the context of link recommendations, and also in the community conversations about images: there are some articles that we don’t want to make recommendations for, because either (a) it is very easy to make the wrong decision about images on those articles, or (b) those kinds of articles don’t need images.

  • Disambiguation pages: should never have images.
  • Years (e.g. “640 CE”): usually hard to pick the right image.
  • Lists (e.g. “List of people named Jazmin”): usually hard to pick the right image.
  • Redirect Pages

Currently, the output of the algorithm provides (1) unillustrated articles with image suggestions. We are confident that this set contains no articles of the types listed above. The algorithm also provides (2) a list of unillustrated articles for MediaSearch to map its image suggestions to. For this second set, we have no way to guarantee that it excludes the article types listed above.

Acceptance Criteria
  • As a contributor OR botwriter, I want to ensure I am not given the following types of articles, so that I can spend my time productively on the types of articles that do need images and where images are much easier to apply without scrutiny.
    • Disambiguation pages: should never have images.
    • Years (e.g. “640 CE”): usually hard to pick the right image.
    • Lists (e.g. “List of people named Jazmin”): usually hard to pick the right image.
    • Redirect Pages
Subtasks
  • Write tests to cover these scenarios
Open Questions
  • In order for us to better monitor data quality, should we include metadata about pages as an attribute of the unillustrated articles?

Event Timeline

We'll need some investigation on the best approach to determine page types. Some initial thoughts:

  • What's the tolerance for noise in the data? I think we could use Wikidata for the filtering, but pages without a Q item would slip through (I'd need to verify the magnitude of this).
  • I was wondering if it would be possible to rephrase our goal to only allow regular articles, rather than having a list of page types to filter out?
  • In general, some page types might be easier to filter out than others.
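To make the Wikidata idea above concrete, here is a minimal sketch of what the filter could look like. It is not the pipeline's actual code: the `Candidate` record, the `filter_candidates` helper, and the `keep_unlinked` switch are hypothetical names, and the two QIDs in `EXCLUDED_TYPES` (Wikimedia disambiguation page, Wikimedia list article) are my assumption about which "instance of" (P31) values we would match on. Redirects are handled from a page-level flag because they are not expected to be identifiable through Wikidata.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Optional


@dataclass(frozen=True)
class Candidate:
    """Hypothetical record for one unillustrated page."""
    title: str
    is_redirect: bool = False
    qid: Optional[str] = None                   # None: page has no Wikidata item
    instance_of: FrozenSet[str] = frozenset()   # P31 values of the item


# Assumed deny-list: Q4167410 (disambiguation page), Q13406463 (list article).
EXCLUDED_TYPES: FrozenSet[str] = frozenset({"Q4167410", "Q13406463"})


def filter_candidates(
    candidates: Iterable[Candidate],
    excluded: FrozenSet[str] = EXCLUDED_TYPES,
    keep_unlinked: bool = False,
) -> List[Candidate]:
    """Drop redirects and pages whose item is an instance of an excluded type.

    Pages without a Q item cannot be classified; `keep_unlinked` decides
    whether they are kept (noisier) or dropped (stricter).
    """
    kept = []
    for c in candidates:
        if c.is_redirect:
            continue
        if c.qid is None:
            if keep_unlinked:
                kept.append(c)
            continue
        if c.instance_of & excluded:
            continue
        kept.append(c)
    return kept
```

Setting `keep_unlinked=False` is one way to move towards "only allow regular articles": anything we cannot classify is dropped rather than let through. How many candidates that would cost depends on the share of pages without a Q item, which still needs to be measured.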
sdkim updated the task description.
sdkim set the point value for this task to 0.

Confirmed with @JTannerWMF and @CBogen that the following items do not need to be part of the proof of concept and have moved to a separate task: https://phabricator.wikimedia.org/T279010

  • Biographies of living people: higher level of scrutiny for changes.
  • “Good” or “Featured” class articles: if they don’t have an image, it’s a deliberate decision by an experienced editor.
Aklapper renamed this task from Exclude Unillustrated Articles That Should Not Have Images to Exclude unillustrated articles that should not have images. Apr 1 2021, 8:50 AM
sdkim claimed this task.

While testing the model output in November 2021, we noticed a type of "year" article that was slipping through the cracks, and we also realized we want to exclude articles about numbers (e.g. "14"). Therefore, along with the model refresh done in T295316 ("Add an image: pre-deployment model refresh"), @Clarakosi will also add these Wikidata QIDs to the exclusions:

Years

Numbers

For future iterations we should probably also filter out decades (Q39911). Or maybe the whole subtree under Q1790144 (unit of time)?
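If we go the subtree route, the exclusion set would have to be expanded from a root class to all of its descendants. A rough sketch of that expansion, assuming we already have the "subclass of" (P279) edges available as a parent-to-children mapping (the `subtree` helper and the shape of `children` are hypothetical, not something the current pipeline provides):

```python
from collections import deque
from typing import Dict, Iterable, Set


def subtree(root: str, children: Dict[str, Iterable[str]]) -> Set[str]:
    """Return `root` plus every class reachable from it via child edges.

    `children` maps a QID to the QIDs that are direct subclasses of it.
    The `seen` set keeps the walk finite even if the data contains cycles.
    """
    seen = {root}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in children.get(node, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen
```

The result could be used directly as the exclusion set, e.g. `subtree("Q1790144", children)`. On the Wikidata Query Service the same expansion can be written with the `wdt:P279*` property path instead of walking the graph ourselves. Either way we should review what the subtree actually contains before excluding all of it, since a broad root may pull in classes we did not intend to filter.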