Restructure the code for V3 of image recommendation algorithm
Closed, Resolved
Public

Assigned To

Authored By

	Miriam
	Nov 20 2020, 3:21 PM

Description

V3 was coded up quickly to produce results for T266271.
I need to make it more reusable and able to produce coverage statistics.

Status	Assigned	Task
Resolved	CBogen	T254768 [EPIC] Image recommendations proof-of-concept phase
Resolved	Miriam	T256081 Image matching algorithm
Resolved	Miriam	T268346 Restructure the code for V3 of image recommendation algorithm

Refactored code is now available on stat1005.

stat1005.eqiad.wmnet:/home/mirrys/ImageRecommendation/V3:
- retrieve_image_candidates.ipynb
- prioritize_clean.ipynb

retrieve_image_candidates.ipynb discovers unillustrated articles and finds potential image matches.
prioritize_clean.ipynb filters out bad image candidates and generates good image suggestions for unillustrated articles, together with image captions, descriptions, categories, and structured data when available.
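
For illustration only, here is a minimal sketch of what a "filter out bad image candidates" step could look like. It is not the code in the notebooks: the function names, the input shape (article title mapped to a list of image filenames), and the filename heuristics for flags, SVGs, and placeholders are all assumptions.

```python
from typing import Dict, List

# Hypothetical keywords; the real notebooks may detect these cases differently.
INVALID_KEYWORDS = ("flag", "placeholder")


def is_invalid_candidate(filename: str) -> bool:
    """Crude filename check for SVGs, flags, and placeholder images."""
    name = filename.lower()
    if name.endswith(".svg"):
        return True
    return any(keyword in name for keyword in INVALID_KEYWORDS)


def filter_candidates(candidates: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Drop invalid images; drop articles left with no candidate at all."""
    filtered = {}
    for article, images in candidates.items():
        kept = [img for img in images if not is_invalid_candidate(img)]
        if kept:
            filtered[article] = kept
    return filtered


# Toy input, not real data.
toy = {
    "Article A": ["Flag_of_X.svg", "Cat.jpg"],
    "Article B": ["Placeholder.png"],
}
result = filter_candidates(toy)
```

With this toy input, "Article B" loses its only candidate and no longer counts as covered, which is how a filtering round reduces article-level coverage. A substring match on "flag" is deliberately simplistic and would also catch unrelated filenames.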

Approximate coverage statistics (estimated from a sample of 50k articles with initial candidate suggestions extracted with V3):

Coverage before filtering: 500k out of 3M unillustrated articles (17%)
First round of filtering, removing invalid image candidates (flags, SVGs, image placeholders): discards 55% of articles with suggestions, leaving 7.5% of unillustrated articles with potential candidates
Second round of filtering, removing images that are on-wiki only: discards a further 12% of articles with suggestions, leaving 5.3% of unillustrated articles with potential candidates
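
As a quick check of the arithmetic, the initial coverage and the first filtering round can be reproduced from the rounded figures reported here. The helper names are hypothetical. The second round is left out because the base to which its 12% applies is not specified.

```python
def coverage_share(articles_with_candidates: int, total_articles: int) -> float:
    """Fraction of unillustrated articles that have at least one candidate."""
    return articles_with_candidates / total_articles


def remaining_share(share_before: float, discarded_fraction: float) -> float:
    """Share left after discarding a fraction of the articles with suggestions."""
    return share_before * (1.0 - discarded_fraction)


# Rounded figures from the task: 500k of 3M articles, 55% discarded in round one.
initial = coverage_share(500_000, 3_000_000)
after_round_1 = remaining_share(initial, 0.55)

print(f"initial coverage: {initial:.1%}")
print(f"after first round: {after_round_1:.1%}")
```

The results are about 16.7% (reported as 17%) and 7.5%.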
Among the remaining candidates, metadata is missing at the following rates:
- missing descriptions: 59%
- missing captions: 96%
- missing categories: 0%
- missing structured data: 92%
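
Missing-metadata rates like these could be computed along the following lines. This is a sketch under assumptions: the record layout (one dict per image candidate) and the field names are hypothetical, and the sample records are toy data, not the actual candidates.

```python
from typing import Dict, List, Sequence


def missing_rates(records: List[dict], fields: Sequence[str]) -> Dict[str, float]:
    """Fraction of records in which each field is absent or empty."""
    total = len(records)
    if total == 0:
        return {field: 0.0 for field in fields}
    return {
        field: sum(1 for record in records if not record.get(field)) / total
        for field in fields
    }


# Toy records with hypothetical field names.
toy_records = [
    {"description": "A cat", "caption": "", "categories": ["Cats"]},
    {"description": "", "caption": "", "categories": ["Dogs"]},
    {"description": None, "caption": "A bird", "categories": ["Birds"]},
    {"description": "A fish", "caption": "", "categories": ["Fish"]},
]
rates = missing_rates(toy_records, ["description", "caption", "categories"])
for field, rate in rates.items():
    print(f"- missing {field}: {rate:.0%}")
```

Treating empty strings, empty lists, and None all as "missing" is a design choice of this sketch; the notebooks may define missing metadata differently.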

Closing this for now as all points were addressed.