Improve algorithm for unillustrated article selection
Closed, Resolved, Public

Description

Together with @Swagoel , we are refining the heuristics to define:

  1. The list of images considered as icons
  2. The way in which we detect already illustrated articles

Event Timeline

Work done by @Swagoel for icon detection:

  • Hand-labeled over 500 images across 4 languages (Irish, Welsh, Hindi, French) as image vs. not an image (i.e. icon) to test the feasibility of a threshold approach.
  • Tested various threshold-based and HTML-scraping-based classification systems. Performance of both the whole model and the threshold component alone was between 85% and 90% in the tested languages (Welsh, French). Some of the misclassifications came from ambiguity on my side regarding how icon/image edge cases should be labeled.
  • Showed that a threshold-only tool is sufficient and flexible, both on its own and especially when paired with HTML scraping.

The following is a threshold function we found to be reasonably effective:

import math

def get_threshold(wiki_size):
    # Change th to trade off precision vs. recall; recommended value for accuracy = 5.
    # sze: base wiki size, th: base threshold, lim: th / lim is the smallest threshold returned.
    sze, th, lim = 50000, 10, 4
    if wiki_size >= sze:
        # Wiki at or above the base size: scale th by log(wiki_size / base_size) + 1.
        return (math.log(wiki_size / sze) + 1) * th
    # Smaller wiki: scale th down by wiki_size / base_size, but never below th / lim.
    return max((wiki_size / sze) * th, th / lim)

To verify that the threshold “formula” was reasonably effective, we created pseudo ground truth approximations for the following languages: 'it', 'zh', 'cs', 'he', 'ta', 'bs', 'be', 'ast'. These were produced with a high-recall version of my classification tool, and we then evaluated the threshold's performance against that pseudo ground truth.
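
As a rough illustration of this kind of evaluation, the sketch below scores a set of predicted icons against a reference (pseudo ground truth) set. The function name `evaluate`, the file names, and the toy data are all hypothetical; the actual evaluation code and metrics used for this task are not shown in this thread.

```python
def evaluate(predicted_icons, reference_icons, all_images):
    """Score predicted icon labels against a (pseudo) ground truth set.

    Assumes both label sets are subsets of all_images.
    """
    tp = len(predicted_icons & reference_icons)   # icons correctly flagged
    fp = len(predicted_icons - reference_icons)   # images wrongly flagged as icons
    fn = len(reference_icons - predicted_icons)   # icons that were missed
    tn = len(all_images) - tp - fp - fn           # images correctly left alone
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(all_images) if all_images else 0.0
    return {"precision": precision, "recall": recall, "accuracy": accuracy}


# Hypothetical toy data, for illustration only.
all_images = {
    "Flag_icon.svg", "Stub_icon.png", "Edit_pencil.svg",
    "Eiffel_Tower.jpg", "Cat.jpg", "Map_of_Wales.png",
}
reference_icons = {"Flag_icon.svg", "Stub_icon.png", "Edit_pencil.svg"}
predicted_icons = {"Flag_icon.svg", "Stub_icon.png", "Map_of_Wales.png"}

scores = evaluate(predicted_icons, reference_icons, all_images)
print(scores)
```

Because the reference labels here come from a high-recall classifier rather than from hand labeling, scores of this kind would measure agreement with that classifier, not true accuracy.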

I then re-ran the unillustrated article detection using:

  • The new per-wiki thresholds calculated as per the previous post
  • An additional anti join with the page_props table to discard all articles having a page image
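
The anti-join step can be sketched as follows, using an in-memory SQLite database so the example is self-contained. This is only an illustration: the `candidates` table and its rows are hypothetical, the query engine used in the actual pipeline is not stated here, and the property names `page_image` and `page_image_free` are assumed to be the ones under which the page image is stored in `page_props`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical candidate set: articles that passed the icon-threshold check.
CREATE TABLE candidates (page_id INTEGER PRIMARY KEY, title TEXT);
-- Simplified stand-in for MediaWiki's page_props table.
CREATE TABLE page_props (pp_page INTEGER, pp_propname TEXT, pp_value TEXT);

INSERT INTO candidates VALUES (1, 'A'), (2, 'B'), (3, 'C');
INSERT INTO page_props VALUES
  (1, 'page_image_free', 'Some_photo.jpg'),
  (2, 'wikibase_item', 'Q42'),
  (3, 'wikibase_item', 'Q64');
""")

# Anti-join: keep only candidates with NO matching page-image row.
# The property names in the IN (...) list are an assumption.
ANTI_JOIN = """
SELECT c.page_id, c.title
FROM candidates AS c
LEFT JOIN page_props AS pp
  ON pp.pp_page = c.page_id
 AND pp.pp_propname IN ('page_image', 'page_image_free')
WHERE pp.pp_page IS NULL
ORDER BY c.page_id
"""

unillustrated = conn.execute(ANTI_JOIN).fetchall()
print(unillustrated)
```

Putting the property filter in the `ON` clause rather than in `WHERE` matters: articles that have other `page_props` rows (such as `wikibase_item`) but no page image still survive the anti-join.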

I then ran the image matching algorithm on top of the new set of unillustrated articles. I eyeballed the results and they looked more consistent than the earlier version. Below are the quantitative coverage results:

kowiki

number of unillustrated articles: 273305
number of articles items with Wikidata image: 15983
number of articles items with Wikidata Commons Category: 28324
number of articles items with Language Links: 83995

arwiki

number of unillustrated articles: 580284
number of articles items with Wikidata image: 7028
number of articles items with Wikidata Commons Category: 26526
number of articles items with Language Links: 121891

viwiki

number of unillustrated articles: 867565
number of articles items with Wikidata image: 49226
number of articles items with Wikidata Commons Category: 57548
number of articles items with Language Links: 117138

cswiki

number of unillustrated articles: 181867
number of articles items with Wikidata image: 8337
number of articles items with Wikidata Commons Category: 21120
number of articles items with Language Links: 69413

frwiki

number of unillustrated articles: 951319
number of articles items with Wikidata image: 10938
number of articles items with Wikidata Commons Category: 39457
number of articles items with Language Links: 236592

enwiki

number of unillustrated articles: 2922830
number of articles items with Wikidata image: 36412
number of articles items with Wikidata Commons Category: 92072
number of articles items with Language Links: 325534
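
To make the counts above easier to compare across wikis, the following sketch converts them into shares of each wiki's unillustrated articles. The numbers are copied from the results above; the names `counts` and `coverage` are only illustrative. The three sources may overlap for a given article, so the shares should not be summed.

```python
# (unillustrated, with Wikidata image, with Commons category, with language links)
counts = {
    "kowiki": (273305, 15983, 28324, 83995),
    "arwiki": (580284, 7028, 26526, 121891),
    "viwiki": (867565, 49226, 57548, 117138),
    "cswiki": (181867, 8337, 21120, 69413),
    "frwiki": (951319, 10938, 39457, 236592),
    "enwiki": (2922830, 36412, 92072, 325534),
}


def coverage(counts):
    """Return, per wiki, each source's share of the unillustrated articles."""
    result = {}
    for wiki, (total, image, category, langlinks) in counts.items():
        result[wiki] = {
            "wikidata_image": image / total,
            "commons_category": category / total,
            "language_links": langlinks / total,
        }
    return result


shares = coverage(counts)
for wiki, row in shares.items():
    print(
        f"{wiki}: image {row['wikidata_image']:.1%}, "
        f"category {row['commons_category']:.1%}, "
        f"language links {row['language_links']:.1%}"
    )
```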