
Add an image: exclude certain articles
Closed, ResolvedPublic

Description

Upstream of our work, the process to generate image suggestions already excludes certain types of articles for which we do not want to provide suggestions. That work was done in T276137: Exclude unillustrated articles that should not have images:

  • Disambiguation pages
  • Years (e.g. “640 CE”)
  • Lists (e.g. “List of people named Jane”)
  • Redirect Pages

Since then, the Growth team has decided that for Iteration 1, we also want to exclude pages that have infoboxes, because inserting an image into an infobox presents engineering challenges that are out of scope for this iteration. To make some preliminary counts, the Research team explored how to exclude infoboxes in T287317: Add an image: count image suggestions without infoboxes. There are many different templates that are infoboxes.

While it is required that we exclude articles with infoboxes, there are some other exclusions that are desired. Note that an open task also exists for these with the Platform team: T279010: [Placeholder] Filter out page types of higher scrutiny.

  • Biographies of living persons (BLPs)
  • Good and Featured articles

Event Timeline

@Tgr -- moving this to Ready for Development so that we can start looking into it.

It might also be valuable to work out which types of filters should be applied at which points of the architecture. Should some be applied by the Platform team and some by the Growth team? Or all by one team?

We can either exclude infoboxes by regenerating the dataset and reimporting it into the search index (flexible in terms of how infoboxes are filtered, but takes a fair amount of effort to repeat) or by modifying the search queries (easy to change on the fly, but limited to what the search engine can handle). Our plan was the latter, using template exclusion (T289216: Add Image: Allow excluding templates/categories per task type). That requires coming up with a list of templates.

The list in T287317: Add an image: count image suggestions without infoboxes is based on identifying templates which have "instance of: infobox" on Wikidata. There are way too many of those though, e.g. 400+ on cswiki. We'd probably hit query length limits.
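To get a feel for how quickly such a query grows, here is a minimal Python sketch. It assumes the exclusion is expressed as one negated hastemplate clause per template, joined by spaces; the function name and the template names are hypothetical, and the query GrowthExperiments actually builds may look different.

```python
def build_exclusion_query(templates):
    """Build a search string with one negated hastemplate clause per template.

    Assumption: template names contain no double quotes, so no escaping is done.
    """
    clauses = ['-hastemplate:"{}"'.format(name) for name in templates]
    return " ".join(clauses)


# Hypothetical template names, for illustration only.
templates = ["Infobox person", "Infobox settlement", "Infobox film"]
query = build_exclusion_query(templates)

print(query)
print(len(query), "characters for", len(templates), "templates")
```

Every template adds its name plus a fixed number of characters of syntax, so the query length grows roughly linearly with the number of templates.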

Most infobox templates are based on a small set of building blocks - the hastemplate search keyword is recursive (i.e. it also finds templates which are used in templates which are used on the page), so we can probably come up with a short list of building blocks + the few infobox templates which do not use them.

I wrote a script to list all infoboxes in a wiki and discard those which include another infobox template. This hasn't proven useful - on cswiki there are still about 100 templates left (about half of them don't use building block templates, just raw table markup; the other half do use building blocks but enclose the code in <includeonly> so template usage cannot be tracked). That results in a 3000 character query, which is still a bit long.
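The script itself is not included in this task; the following Python sketch only illustrates the filtering idea under stated assumptions. The input is assumed to be a list of infobox template names plus a mapping from each template to the set of templates it transcludes (for example, collected from the wiki's template links). All names below are hypothetical.

```python
def minimal_infobox_set(infoboxes, transclusions):
    """Keep only infoboxes that do not transclude any other known infobox.

    A discarded infobox is still matched by a recursive template search,
    because the infobox it transcludes stays in the list.
    """
    known = set(infoboxes)
    kept = []
    for name in infoboxes:
        used = transclusions.get(name, set())
        # Ignore self-references; look only at *other* known infoboxes.
        if not (used & known) - {name}:
            kept.append(name)
    return kept


# Hypothetical data: two infoboxes built on a shared building block,
# and one written with raw table markup (no transclusions at all).
infoboxes = ["Infobox", "Infobox person", "Infobox cave", "Infobox raw table"]
transclusions = {
    "Infobox person": {"Infobox", "Birth date"},
    "Infobox cave": {"Infobox"},
    "Infobox raw table": set(),
}

print(minimal_infobox_set(infoboxes, transclusions))
```

Templates that use raw table markup, or whose template usage cannot be tracked, never get discarded by this approach, which matches the large number of leftovers described above.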

I can see three ways forward:

  • Use the search index in some form, e.g. filter and re-import the dataset.
  • Ask the community to update their templates, e.g. create an empty {{infobox tracker}} template and add it to all infobox templates. There are hundreds so this is not a great solution.
  • Try to find a subset of templates which are used in almost all infoboxes. I tried that on arwiki and still ended up with 170 templates, but maybe I just wasn't very good at guessing which templates are the building blocks of infoboxes.

Here's the result of the script above for the pilot wikis:

  • ar: 168 templates / 7K character search string
  • bn: 132 templates / 6.5K character search string
  • cs: 80 templates / 2.5K character search string
  • vi: 182 templates / 5K character search string

That's obviously unworkable. I think our best bet is dealing with this in the search index in some form.

I'll stall this for a while for feedback. If we go in the search index direction, we'll need to involve the Search team. I guess we'd ask Research for a list of pages with infoboxes, and delete the image recommendation tag from the search index for those articles.

Filtering during index generation is less ideal than filtering during search because infoboxes can be added to articles over time and we can't really account for that. We probably can't do much about that.

kostajh triaged this task as Medium priority. Sep 24 2021, 2:47 PM

Two arwiki infobox templates that seemingly should have been picked up (they have the "instance of: Wikimedia infobox" property on Wikidata, the template page is not new and was linked to Wikidata a long time ago, and the property is not new either) but haven't: قالب:بطاقة كهف, قالب:بطاقة_جهاز_معلوماتي. Some kind of bug in the script?

Do we want to do T279010: [Placeholder] Filter out page types of higher scrutiny on our side? With category filtering, it should be straightforward.

Change 743114 had a related patch set uploaded (by Gergő Tisza; author: Gergő Tisza):

[mediawiki/extensions/GrowthExperiments@master] Add an image: Add test version of GEInfoboxTemplates

https://gerrit.wikimedia.org/r/743114

Change 743114 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@master] Add an image: Add test version of GEInfoboxTemplates

https://gerrit.wikimedia.org/r/743114

Change 743177 had a related patch set uploaded (by Gergő Tisza; author: Gergő Tisza):

[mediawiki/extensions/GrowthExperiments@wmf/1.38.0-wmf.9] Add an image: Add test version of GEInfoboxTemplates

https://gerrit.wikimedia.org/r/743177

Change 743177 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@wmf/1.38.0-wmf.9] Add an image: Add test version of GEInfoboxTemplates

https://gerrit.wikimedia.org/r/743177

Mentioned in SAL (#wikimedia-operations) [2021-12-03T01:00:57Z] <tgr@deploy1002> Synchronized php-1.38.0-wmf.9/extensions/GrowthExperiments: Backport: [[gerrit:743177|Add an image: Add test version of GEInfoboxTemplates (T291232)]] (duration: 00m 57s)

Two arwiki infobox templates that seemingly should have been picked up (they have the "instance of: Wikimedia infobox" property on Wikidata, the template page is not new and was linked to Wikidata a long time ago, and the property is not new either) but haven't: قالب:بطاقة كهف, قالب:بطاقة_جهاز_معلوماتي. Some kind of bug in the script?

Namespaces were discarded, which was OK for templates but not OK for e.g. Module:Infobox. With the latest patch for the script, both of these are filtered out currently.
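The script is not shown here, so the following Python sketch only illustrates the general class of problem, not the actual fix. It assumes page titles are given with their namespace prefix; the helper name and the example titles are hypothetical.

```python
def bare_title(full_title):
    """Drop the namespace prefix, e.g. 'Module:Infobox' -> 'Infobox'."""
    return full_title.split(":", 1)[-1]


# Hypothetical pages: a template and a Lua module sharing the same base name.
pages = ["Template:Infobox", "Module:Infobox", "Template:Infobox cave"]

# Keyed by bare title, the template and the module collapse into one entry.
by_bare = {bare_title(page) for page in pages}

# Keyed by full title, they stay distinct.
by_full = set(pages)

print(sorted(by_bare))
print(sorted(by_full))
```

Once the namespace is dropped, a module and a template with the same name can no longer be told apart, so any lookup that needs to find the module specifically has to keep the full title.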