[M] Ignore main pages when gathering lead image data for image-suggestions
Closed, Resolved (Public)

Description

When gathering lead image data for image suggestions (and injecting it into the search index to aid image search), we use wmf_raw.mediawiki_imagelinks in Hive as the data source.

When I run the image suggestions data pipeline, if File:Image_X is the lead image on Page_Y, which corresponds to Wikidata ID Qxxx, then Image_X gets Qxxx added to its document in the search index and is considered a good image suggestion for articles on other wikis that correspond to Qxxx.

If Image_X gets removed from Page_Y the day after the pipeline is run then, because wmf_raw snapshots are only monthly, Image_X will keep showing up in searches and suggestions for Qxxx for another month when perhaps it shouldn't.

This will manifest as a problem mostly on frequently updated pages. The most obvious frequently updated page on each wiki is the main page, which has Wikidata ID Q5296. We can easily exclude that Wikidata ID when gathering lead image data in the data pipeline.

(Should we also consider doing the same for other frequently updated pages?)
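
A minimal sketch of the proposed exclusion, in plain Python rather than the pipeline's actual code. The `LeadImage` record, the `qids_by_image` helper and the sample rows are all hypothetical and only illustrate the idea. The one value taken from this task is Q5296 for the main page. Passing the excluded IDs as a set means other frequently updated pages could be added later without changing the logic.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Iterable, NamedTuple, Set

# Wikidata ID of the main page, as given in the task description.
MAIN_PAGE_QID = "Q5296"


class LeadImage(NamedTuple):
    """Hypothetical record: one lead image found on one wiki page."""
    wiki: str
    page_title: str
    image: str
    qid: str


def qids_by_image(
    rows: Iterable[LeadImage],
    excluded_qids: FrozenSet[str] = frozenset({MAIN_PAGE_QID}),
) -> Dict[str, Set[str]]:
    """Map each image to the Wikidata IDs of the pages it leads,
    skipping pages whose Wikidata ID is in excluded_qids."""
    result: Dict[str, Set[str]] = defaultdict(set)
    for row in rows:
        if row.qid in excluded_qids:
            continue  # e.g. main pages: updated too often for monthly snapshots
        result[row.image].add(row.qid)
    return dict(result)


# Made-up sample data, for illustration only.
rows = [
    LeadImage("enwiki", "Main_Page", "Image_X.jpg", "Q5296"),
    LeadImage("enwiki", "Page_Y", "Image_X.jpg", "Q42"),
    LeadImage("frwiki", "Accueil", "Image_Z.jpg", "Q5296"),
]
tagged = qids_by_image(rows)
```

With this sample, Image_X.jpg is still tagged through Page_Y, while Image_Z.jpg, which appears only on a main page, gets no Wikidata ID at all.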

Details

Related Changes in GitLab:
Title: filter main pages by QID from lead images
Reference: repos/structured-data/image-suggestions!44
Author: mfossati
Source Branch: T325629
Dest Branch: main

Event Timeline

Cparle renamed this task from "Consider ignoring frequently-modified pages when gathering lead image data for image-suggestions" to "Ignore main pages when gathering lead image data for image-suggestions". Dec 20 2022, 1:43 PM
Cparle updated the task description.
Cparle updated the task description.
MarkTraceur renamed this task from "Ignore main pages when gathering lead image data for image-suggestions" to "[M] Ignore main pages when gathering lead image data for image-suggestions". Jul 10 2024, 8:36 PM
mfossati changed the task status from Open to In Progress. Aug 26 2024, 2:11 PM
mfossati claimed this task.