
Restructure the code for V3 of image recommendation algorithm
Closed, Resolved (Public)


V3 was coded up quickly to produce results for T266271.
I need to make it more reusable and able to produce coverage statistics.

Event Timeline

Refactored code is now available on stat1005.

- retrieve_image_candidates.ipynb
- prioritize_clean.ipynb
  • retrieve_image_candidates.ipynb discovers unillustrated articles and finds potential image matches
  • prioritize-clean.ipynb filters out bad image candidates and generates good image suggestions for unillustrated articles, together with image captions, descriptions, categories, and structured data when available
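To illustrate the kind of filtering the second notebook performs, here is a minimal sketch. It is not the code from prioritize-clean.ipynb, which is not shown in this task: the function names, the placeholder hints, and the filename-based heuristics are all hypothetical, chosen only to mirror the invalid categories reported in the coverage statistics (flags, SVGs, image placeholders).

```python
import re

# Hypothetical substrings that might mark a placeholder image; the real
# notebook may use a different list or a different signal entirely.
PLACEHOLDER_HINTS = ("placeholder", "no_image", "replace_this_image")

# "flag" as a standalone word in the file name (so "Flagstaff" is not matched).
FLAG_PATTERN = re.compile(r"(^|[^a-z])flag([^a-z]|$)")


def is_invalid_candidate(filename: str) -> bool:
    """Return True if the file name looks like a flag, an SVG, or a placeholder."""
    name = filename.lower()
    if name.endswith(".svg"):
        return True
    if FLAG_PATTERN.search(name):
        return True
    return any(hint in name for hint in PLACEHOLDER_HINTS)


def filter_candidates(candidates: dict) -> dict:
    """Drop invalid images; drop articles left with no candidates at all."""
    kept = {}
    for article, images in candidates.items():
        good = [img for img in images if not is_invalid_candidate(img)]
        if good:
            kept[article] = good
    return kept
```

An article disappears from the result only when every one of its candidates is rejected, which is how a filter of this kind can reduce the number of articles with suggestions.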

Approximate coverage statistics (estimated from a sample of 50k articles with initial candidate suggestions extracted with V3):

  • Coverage before filtering: 500k out of 3M unillustrated articles (17%)
  • First round of filtering (removing invalid image candidates: flags, SVGs, image placeholders): discards 55% of articles with suggestions, leaving 7.5% of unillustrated articles with potential candidates
  • Second round of filtering (removing images that are on-wiki only): discards a further 12% of articles with suggestions, leaving 5.3% of unillustrated articles with potential candidates
  • Out of the remaining candidates, the metadata coverage is the following:
    • missing descriptions: 59%
    • missing captions: 96%
    • missing categories: 0%
    • missing structured data: 92%
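
The funnel arithmetic behind these figures can be sketched as follows. This is an illustration, not the code used to produce the statistics. It assumes each discard percentage is relative to the articles that had suggestions before any filtering; the function name and signature are hypothetical.

```python
def coverage_funnel(total_unillustrated, with_candidates, discard_fractions):
    """Share of unillustrated articles that still have candidates after each round.

    Assumption: every discard fraction is relative to the number of articles
    that had candidates before any filtering, not to the remainder of the
    previous round.
    """
    remaining = with_candidates
    shares = [remaining / total_unillustrated]
    for fraction in discard_fractions:
        remaining -= fraction * with_candidates
        shares.append(remaining / total_unillustrated)
    return shares


# Rounded figures from the list above: 500k of 3M, then 55% and 12% discarded.
shares = coverage_funnel(3_000_000, 500_000, [0.55, 0.12])
print([f"{share:.1%}" for share in shares])
```

With these rounded inputs the sketch gives roughly 16.7%, 7.5%, and 5.5%. The last value is close to, but not the same as, the reported 5.3%, which is expected given that the reported figures are approximate and estimated from a 50k sample.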

Closing this for now as all points were addressed.