[S] image_placeholders is empty
Closed, Resolved, Public

Description

hdfs://analytics-hadoop/user/analytics-platform-eng/image_placeholders contains a Parquet dataset that the image suggestions pipeline uses to filter out suggestions that are in placeholder categories.

That parquet is now empty:

>>> spark.read.parquet('/user/analytics-platform-eng/image_placeholders').show(10, False)
+-------+-----+-------+----------+                                              
|cl_from|cl_to|cl_type|page_title|
+-------+-----+-------+----------+
+-------+-----+-------+----------+
>>> print(spark.read.parquet('/user/analytics-platform-eng/image_placeholders').count())
0

It used to hold data at some point, though. An old copy lives at hdfs://analytics-hadoop/user/mfossati/image_placeholders:

>>> print(spark.read.parquet('image_placeholders').count())
3025

AFAICT, https://commons.wikimedia.org/wiki/Category:Examples_representing_SVG is one of those placeholder categories, and it certainly still has images.
So the dataset being empty looks like a bug.
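
A guard along these lines might have surfaced the problem earlier. This is only a minimal sketch, not what the pipeline does: `require_rows` and `EmptyDatasetError` are hypothetical names, and the function assumes it is handed something with Spark-DataFrame-like `limit()` and `count()` methods.

```python
class EmptyDatasetError(RuntimeError):
    """Raised when a dataset that should hold rows turns out to be empty."""


def require_rows(df, name):
    """Return df unchanged, or raise EmptyDatasetError if it has no rows.

    df is assumed to expose limit() and count() like a Spark DataFrame.
    limit(1) keeps the check cheap: at most one row gets counted.
    """
    if df.limit(1).count() == 0:
        raise EmptyDatasetError(f"{name} is empty; refusing to continue")
    return df


# Hypothetical usage inside a PySpark job (needs a live `spark` session):
# placeholders = require_rows(
#     spark.read.parquet('/user/analytics-platform-eng/image_placeholders'),
#     'image_placeholders',
# )
```

Failing loudly here is a design choice: an empty placeholder dataset would otherwise filter nothing and pass silently.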

Details

Title | Reference | Author | Source Branch | Dest Branch
Script that outputs image placeholders | repos/structured-data/image-suggestions!32 | mfossati | T333946 | main

Event Timeline

MarkTraceur renamed this task from "image_placeholders is empty" to "[S] image_placeholders is empty". May 3 2023, 4:30 PM
MarkTraceur subscribed.

In estimation we decided that this should be limited, for the moment, to trying to re-run the job once. If you run into issues, report back here and we may need to re-estimate it.

mfossati changed the task status from Open to In Progress. May 10 2023, 3:24 PM
mfossati claimed this task.

Update: copied the old dataset over to the production path.
At office hours, we agreed to generate a fresh one through this petscan call.