
Image matching algorithm coverage
Closed, Resolved (Public)

Description

In the past several months, @Miriam has made many improvements to the unillustrated articles algorithm and the image matching algorithm. Now that both have changed in a few ways, we want to reassess the coverage: for how many articles on each wiki will we be able to propose matches?

For a set of wikis, we want to calculate six things:

  • Total number of articles in the wiki
  • Unillustrated articles in the wiki
  • Articles with match from any source (polished): this means the count of unillustrated articles that have a match from any of the three sources, after the "polishing" steps to remove local images, etc.
  • Wikidata match (polished), Commons category match (polished), Interwiki match (polished): these are the numbers of unillustrated articles with matches from each of these sources. Since an article can have a match from more than one source, these can sum to more than the "Articles with match from any source (polished)" value.
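
To illustrate why the per-source counts can sum to more than the "any source" count, here is a minimal sketch. The article titles, the source keys, and the `coverage_counts` helper are all hypothetical and are not taken from the actual pipeline; the point is only that "any source" is the size of a set union, not a sum.

```python
# Hypothetical per-source match sets, keyed by source name (illustrative data only).
matches_by_source = {
    "wikidata": {"A", "B"},
    "commons_category": {"B", "C", "D"},
    "interwiki": {"A", "D", "E"},
}


def coverage_counts(matches_by_source):
    """Return (per-source counts, number of articles matched by any source)."""
    per_source = {
        source: len(articles) for source, articles in matches_by_source.items()
    }
    # An article matched by several sources is counted once in the union.
    any_source = len(set().union(*matches_by_source.values()))
    return per_source, any_source


per_source, any_source = coverage_counts(matches_by_source)
print(per_source)                # per-source counts
print(sum(per_source.values()))  # sum of per-source counts
print(any_source)                # articles with a match from any source
```

In this toy data the per-source counts sum to 8 while only 5 distinct articles have a match, which mirrors the relationship between the columns in the sample row.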

Here is a table with a sample row showing the output that we want:

| wiki | Total number of articles | Unillustrated articles | Articles with match from any source (polished) | Wikidata match (polished) | Commons category match (polished) | Interwiki match (polished) |
| enwiki | 6,000,000 | 3,000,000 | 250,000 | 20,000 | 150,000 | 200,000 |

The list of wikis for which we want these numbers is:

  • enwiki
  • arwiki
  • kowiki
  • cswiki
  • viwiki
  • frwiki
  • fawiki
  • ptwiki
  • ruwiki
  • trwiki
  • plwiki
  • hewiki
  • svwiki
  • ukwiki
  • huwiki
  • hywiki
  • srwiki
  • euwiki
  • arzwiki
  • cebwiki
  • dewiki
  • bnwiki

Details

Due Date
Jan 27 2021, 8:00 AM

Event Timeline

@Miriam -- I was thinking about the polish step, and remembering some suggestions that we've gotten from community members. They recommend that we don't make recommendations on disambiguation pages, list articles, or year articles. Are any of those characteristics easy to add to the polish step, and to these counts?

@Marshall - this is not very easy to do with the current version that I have, as I would need to parse the "instance of" property. Let me first pull the numbers with the algorithm as is, and then do this on a second iteration!
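
For reference, a filter on the "instance of" property could look roughly like the sketch below. This is not the code used for the task: the `EXCLUDED_CLASSES` set, the helper names, and the sample articles are hypothetical, and the Wikidata QIDs are assumptions that should be verified before use. The sketch also only checks direct "instance of" values, so items typed with a more specific subclass (for example, a particular kind of year) would slip through.

```python
# Assumed Wikidata QIDs for the classes to exclude; verify before relying on them.
EXCLUDED_CLASSES = {
    "Q4167410",   # Wikimedia disambiguation page (assumed QID)
    "Q13406463",  # Wikimedia list article (assumed QID)
    "Q577",       # year (assumed QID)
}


def is_excluded(instance_of):
    """True if any direct 'instance of' value is in the excluded set."""
    return not EXCLUDED_CLASSES.isdisjoint(instance_of)


def filter_candidates(articles):
    """articles: mapping of article title -> set of 'instance of' QIDs."""
    return [
        title for title, classes in articles.items() if not is_excluded(classes)
    ]


# Hypothetical input, for illustration only.
sample_articles = {
    "Mercury": {"Q4167410"},
    "List of rivers": {"Q13406463"},
    "1987": {"Q577"},
    "Douglas Adams": {"Q5"},
}
kept = filter_candidates(sample_articles)
print(kept)
```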

@MMiller_WMF here you have an initial spreadsheet with coverage statistics: https://docs.google.com/spreadsheets/d/1IKi0mQ4MZRATVOPPaMr_tX6sOzhjL8INc0Pt_RwwUqs/edit?usp=sharing
I followed your schema, and just added one column which is "has at least one candidate", to give a better idea of the overall algorithm coverage.
Please let me know if this works!

@Miriam -- this looks great; thank you! I made a copy of your sheet where I added a couple of other useful percentage columns: https://docs.google.com/spreadsheets/d/1N_PAVerw0AO-GWdGwEnLd2DhLkB_kvkVMxAqbeEb56A/edit#gid=0

I'll include a summary of what we find below -- but yes, it would be great if you could do a version that removes the disambiguation pages, list articles, and year articles.

Here's my summary from the table you created:

  • Overall, I think the coverage numbers reflected in the table are sufficient for a first version of an "add an image" feature. There are enough candidate matches from strong sources in each wiki.
  • Wikis range from 20% unillustrated (Serbian) to 69% unillustrated (Vietnamese).
  • We can find between 9,000 (Bengali) and 164,000 (English) unillustrated articles with match candidates. In general, this is a sufficient volume for a first version of the task, so that users have plenty of matches to do. In some of the sparser wikis, like Bengali, it might get into small numbers once users narrow to topics of interest. That said, Bengali only has 100k total articles, so we would be proposing matches for 9% of them, which is a lot.
  • In terms of how big of an improvement in illustrations we could make to the wikis with this algorithm, the ceiling ranges from 1% (cebwiki) to 10% (trwiki). That is the overall percentage of additional articles that would wind up with illustrations if every match is good and is added to the wiki.
  • The wikis with the lowest percentage of unillustrated articles for which we can find matches are arzwiki and cebwiki, which are both heavy on bot-created articles. This makes sense because many of those articles are about specific towns or species that wouldn't have images in Commons. But because those wikis have so many articles, there are still tens of thousands for which the algorithm has matches.
  • Of the three image sources, community members consider the Commons category to be the weakest. This table shows that, of all the articles for which the algorithm has matches, at least 80% have a match from at least one of the two stronger sources (Wikidata and interwiki).
  • In the farther future, we hope that improvements to the image matching algorithm, to MediaSearch, or to workflows for uploading, captioning, and tagging images will yield more candidate matches.
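
The percentages in this summary can be derived from three of the requested columns. The sketch below is only an illustration: the function and key names are hypothetical, and the input is the sample enwiki row from the task description, not real results.

```python
def coverage_percentages(total, unillustrated, matched):
    """Derive summary percentages from raw article counts.

    total: total number of articles in the wiki
    unillustrated: number of unillustrated articles
    matched: unillustrated articles with a match from any source (polished)
    """
    return {
        # Share of the wiki that is unillustrated.
        "pct_unillustrated": 100 * unillustrated / total,
        # Share of unillustrated articles for which a match exists.
        "pct_unillustrated_with_match": 100 * matched / unillustrated,
        # Ceiling: share of all articles that would gain an image
        # if every match were good and were added.
        "pct_ceiling": 100 * matched / total,
    }


# Sample row from the task description (illustrative numbers, not results).
row = coverage_percentages(total=6_000_000, unillustrated=3_000_000, matched=250_000)
for name, value in row.items():
    print(f"{name}: {value:.1f}%")
```

The same arithmetic gives the Bengali figure quoted above: 9,000 matched articles out of roughly 100,000 total is a ceiling of about 9%.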

@MMiller_WMF I added another sheet to the coverage spreadsheet containing the coverage numbers for unillustrated articles, excluding, as requested, disambiguation pages, list articles, and year articles. Please let me know if anything else is needed. Feel free to resolve this task if not.

@Miriam -- thank you. I updated my master spreadsheet with the numbers that exclude certain kinds of articles. The conclusions above don't change. We are finished with this task.