V3 was coded up quickly to produce results for T266271.
I need to make it more reusable and able to produce coverage statistics.
Description
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | CBogen | T254768 [EPIC] Image recommendations proof-of-concept phase
Resolved | | Miriam | T256081 Image matching algorithm
Resolved | | Miriam | T268346 Restructure the code for V3 of image recommendation algorithm
Event Timeline
Refactored code is now available on stat1005.
Location: stat1005.eqiad.wmnet:/home/mirrys/ImageRecommendation/V3
- retrieve_image_candidates.ipynb discovers unillustrated articles and finds potential image matches
- prioritize_clean.ipynb filters out bad image candidates and generates good image suggestions for unillustrated articles, together with image captions, descriptions, categories, and structured data when available
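
For readers without access to stat1005, here is a minimal sketch of what a "bad candidate" filter of this kind could look like. It is not the notebook's code: the `Candidate` fields, the keyword list, and the function names are all assumptions made for illustration, loosely based on the kinds of images this task mentions discarding (flags, SVGs, placeholders, images that are on-wiki only).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    """Hypothetical record for one image suggestion (not the notebook's schema)."""
    article: str
    filename: str
    on_wiki_only: bool  # assumed flag: image is only available on the local wiki


# Assumed keyword heuristics; the real notebook may use different rules.
BAD_KEYWORDS = ("flag", "placeholder")


def is_valid_image(candidate: Candidate) -> bool:
    """First round: reject SVGs and filenames that look like flags or placeholders."""
    name = candidate.filename.lower()
    if name.endswith(".svg"):
        return False
    return not any(keyword in name for keyword in BAD_KEYWORDS)


def filter_candidates(candidates: List[Candidate]) -> List[Candidate]:
    """Apply both rounds: invalid images first, then on-wiki-only images."""
    valid = [c for c in candidates if is_valid_image(c)]
    return [c for c in valid if not c.on_wiki_only]


sample = [
    Candidate("Article A", "Flag_of_Somewhere.png", False),
    Candidate("Article B", "Diagram.svg", False),
    Candidate("Article C", "Old_town_square.jpg", True),
    Candidate("Article D", "Lighthouse_at_dusk.jpg", False),
]
kept = filter_candidates(sample)
```

In this toy sample only the candidate for "Article D" survives both rounds; keyword matching on filenames is a crude stand-in for whatever signals the actual filtering uses.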
Approximate coverage statistics (estimated from a sample of 50k articles with initial candidate suggestions extracted with V3):
- Coverage before filtering: 500k out of 3M unillustrated articles (17%)
- First round of filtering (removing invalid image candidates: flags, SVGs, image placeholders): discards 55% of articles with suggestions, leaving 7.5% of unillustrated articles with potential candidates
- Second round of filtering (removing images that are on-wiki only): discards a further 12% of articles with suggestions, leaving 5.3% of unillustrated articles with potential candidates
- Metadata coverage among the remaining candidates:
  - missing descriptions: 59%
  - missing captions: 96%
  - missing categories: 0%
  - missing structured data: 92%
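
The funnel figures above can be cross-checked with a few lines of arithmetic. This is a rough sketch using only the rounded numbers quoted in this comment; the variable names are made up, and the reading of "a further 12%" as a share of all articles with initial suggestions (rather than of those surviving the first round) is an assumption.

```python
# Rounded inputs taken from the coverage comment above.
UNILLUSTRATED = 3_000_000
WITH_CANDIDATES = 500_000
DISCARD_ROUND_1 = 0.55  # share of articles with suggestions lost in round 1
DISCARD_ROUND_2 = 0.12  # share lost in round 2 (base is an assumption, see below)


def pct(fraction: float) -> float:
    """Express a fraction as a percentage rounded to one decimal place."""
    return round(100 * fraction, 1)


initial = WITH_CANDIDATES / UNILLUSTRATED
after_round_1 = initial * (1 - DISCARD_ROUND_1)

# Assumption A: the 12% is measured against all articles with initial suggestions.
after_round_2_additive = initial * (1 - DISCARD_ROUND_1 - DISCARD_ROUND_2)

# Assumption B: the 12% is measured against articles surviving round 1.
after_round_2_relative = after_round_1 * (1 - DISCARD_ROUND_2)

print(pct(initial))                 # 16.7
print(pct(after_round_1))           # 7.5
print(pct(after_round_2_additive))  # 5.5
print(pct(after_round_2_relative))  # 6.6
```

Under assumption A the result (5.5%) is close to the reported 5.3%, and the gap is plausibly due to rounding of the inputs; under assumption B it would be 6.6%, which does not match.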