[L] Publish full image suggestions (and intermediate) dataset
Open, Needs Triage, Public

Description

Make available raw dumps of:

  • image suggestions
    • ALIS (image_suggestions/cassandra.py)
    • SLIS (image_suggestions/section_image_suggestions.py)
  • section topics (section_topics/pipeline.py)
  • entity/image pairs (image_suggestions/entity_images.py)

At minimum, we should publish a simple manual dump once. Ideally, dumps are automatically published on a monthly basis.

Note that publishing dumps risks draining the available data (i.e. if all suggested images end up being added, none remain for notifications or the newcomer tool).
Given the sheer amount of data, this is most likely a purely theoretical issue (and even then, more images being added will likely yield new suggestions in other places), but it still merits a quick worst-case discussion with existing consumers of this data to ensure the case is adequately handled in the unlikely event that it happens.
[DONE] See T337925#8906957

AC:

Update

One-off datasets:

Event Timeline

Update on the note in the description

Note that publishing dumps risks draining the available data (i.e. if all suggested images end up being added, none remain for notifications or the newcomer tool). Given the sheer amount of data, this is most likely a purely theoretical issue (and even then, more images being added will likely yield new suggestions in other places), but it still merits a quick worst-case discussion with existing consumers of this data to ensure the case is adequately handled in the unlikely event that it happens.

@AUgolnikova-WMF and I talked to @KStoller-WMF from the Growth team and everyone has agreed that this is an acceptable risk and we can move forward with publishing the dumps.

mfossati changed the task status from Open to In Progress. Jun 13 2023, 1:29 PM
mfossati claimed this task.

I suggest getting some advice by attending a Puppet office hours meeting.

Note that dumps will be owned by Data Engineering, Data Products team.
Work on Dumps 2.0 is also underway.

Aklapper changed the task status from In Progress to Open. Apr 11 2025, 10:13 PM

Resetting task status from "In Progress" to "Open" as this task has been "in progress" for more than one and a half years (see T380300).