[XL] Bug in lead image data in image-suggestions data pipeline
Closed, Resolved (Public)

Description

Some images have incorrect image.linked.from.wikipedia.lead_image data in weighted_tags in Elasticsearch.

For example this image has the weighted_tags value image.linked.from.wikipedia.lead_image/Q5296|1000

Querying hive ...

select * from analytics_platform_eng.image_suggestions_lead_image_data where page_id=27288157 and item_id='Q5296' and snapshot='2022-08-22';
OK
page_id	item_id	tag	score	found_on	snapshot
27288157	Q5296	image.linked.from.wikipedia.lead_image	1000	["arwiki"]	2022-08-22

So the data in the search index corresponds to the data in hive. But when I look at the main page on arwiki (Q5296 is the wikidata id for a wikimedia main page), the image is nowhere to be seen, and looking at the history the page hasn't been updated since 2021.

Event Timeline

CBogen renamed this task from "Bug in lead image data in image-suggestions data pipeline" to "[XL] Bug in lead image data in image-suggestions data pipeline". Sep 21 2022, 4:29 PM

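Ok, so I don't think there's a bug after all.

This statement from the description:

"looking at the history the page hasn't been updated since 2021"

... ignores changes transcluded from templates and the like. Those change what the page displays without adding an entry to the page's own history.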

This is what I get when I look at the mediawiki_imagelinks table in wmf_raw in hive, for the snapshot that would have been the most recent when this ticket was raised:

hive (wmf_raw)> select * from mediawiki_imagelinks where snapshot='2022-07' and wiki_db='arwiki' and il_to='Al-Ahzab_Battle_map-2.svg';
OK
il_from	il_to	il_from_namespace	snapshot	wiki_db
46	Al-Ahzab_Battle_map-2.svg	0	2022-07	arwiki
...

46 is the page_id of the main page, so that particular file was on the main page when the snapshot was taken, and that's why it was coming up in a search for the main page wikidata id. I have checked other data points, and the data checks out each time.
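For anyone wanting to repeat this kind of check, here is a minimal sketch of comparing the il_to values from a snapshot against what a page links to right now. It assumes the live list comes from the MediaWiki Action API (action=query with prop=images). The function names, the User-Agent string and the second file name in the usage note are made up for illustration, and the sketch ignores API continuation, so it only sees the first batch of results.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def normalise_title(title: str) -> str:
    """Turn an API title such as 'File:Some name.svg' into il_to form."""
    # Drop the (possibly localised) namespace prefix, then use underscores.
    _, _, name = title.partition(":")
    return (name or title).replace(" ", "_")


def live_image_links(api_url: str, page_id: int) -> set[str]:
    """Fetch the files currently used on a page (first batch only)."""
    params = urlencode({
        "action": "query",
        "prop": "images",
        "pageids": page_id,
        "imlimit": "max",
        "format": "json",
        "formatversion": "2",
    })
    # Hypothetical User-Agent; replace with something that identifies you.
    request = Request(f"{api_url}?{params}",
                      headers={"User-Agent": "lead-image-check-sketch/0.1"})
    with urlopen(request) as response:
        data = json.load(response)
    return {
        normalise_title(image["title"])
        for page in data["query"]["pages"]
        for image in page.get("images", [])
    }


def stale_links(snapshot_links: set[str], live_links: set[str]) -> set[str]:
    """Files the snapshot says are on the page but which no longer are."""
    return set(snapshot_links) - set(live_links)
```

Usage would be something like stale_links({"Al-Ahzab_Battle_map-2.svg", "Example.png"}, live_image_links("https://ar.wikipedia.org/w/api.php", 46)). Anything returned is a file that was linked from the page in the snapshot but has since been removed.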

So this isn't a bug in the pipeline code. Rather, it's a consequence of the wmf_raw table dumps only happening once a month. That means the search index data based on wikipedia lead images (and the suggestions based on that in turn) will be out of date a lot of the time for images on wikipedia pages that are updated more frequently than that - main pages being an obvious example.
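To put a rough number on "out of date", here is a small sketch that works out how old a monthly snapshot is on a given day. It assumes a snapshot labelled YYYY-MM reflects the wiki as of the last day of that month, which is my assumption rather than something checked here. The function names are hypothetical.

```python
import calendar
from datetime import date


def snapshot_end(snapshot: str) -> date:
    """Last day of the month named by a 'YYYY-MM' snapshot label."""
    year, month = (int(part) for part in snapshot.split("-"))
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, last_day)


def snapshot_age_days(snapshot: str, on: date) -> int:
    """Days between the assumed snapshot cut-off and the given date."""
    return (on - snapshot_end(snapshot)).days


# The lead image data above has snapshot 2022-08-22 and would have been
# built from the 2022-07 imagelinks snapshot.
print(snapshot_age_days("2022-07", date(2022, 8, 22)))  # 22
```

Under that assumption the imagelinks data was already about three weeks old when the lead image data was generated, and it keeps ageing until the next monthly dump is picked up.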

Raised T325629, closing this