
[L] Adapt or transform image_suggestions_search_index_delta to allow creating one update per article
Open, Needs Triage · Public

Description

Currently, the table image_suggestions_search_index_delta has one row per (wiki_id, page_id, tag) tuple.
For articles where multiple tags are updated, we should ideally schedule a single update rather than several.
How to achieve this is unclear. It could be done upstream by changing the schema of image_suggestions_search_index_delta to carry, for instance, a map<string, array<string>> where the key is the tag and the value is the array of tag values.
It could also be an extra transformation step on the search side, but given that we would like to adapt this data pipeline to use the unified weighted_tags stream (T372912), it might be preferable to do the grouping early, on the image_suggestions pipeline side.
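
To make the proposed shape concrete, here is a minimal sketch in plain Python of collapsing per-tag rows into one map of tag to values per page. It is an illustration only, not the pipeline's actual code: the `values` field name, the `DeltaRow` type, the `group_by_page` helper, and the tag names are all hypothetical, and in the real pipeline this step would presumably be a Spark aggregation.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class DeltaRow(NamedTuple):
    """One row of the current delta: one line per (wiki_id, page_id, tag)."""
    wiki_id: str
    page_id: int
    tag: str
    values: List[str]  # assumed name for the column holding the tag values


def group_by_page(rows: List[DeltaRow]) -> Dict[Tuple[str, int], Dict[str, List[str]]]:
    """Collapse per-tag rows into one {tag: [values]} map per (wiki_id, page_id)."""
    grouped: Dict[Tuple[str, int], Dict[str, List[str]]] = defaultdict(dict)
    for row in rows:
        tags = grouped[(row.wiki_id, row.page_id)]
        # If the same tag appears more than once for a page, concatenate its values.
        tags.setdefault(row.tag, []).extend(row.values)
    return dict(grouped)


# Placeholder tag names and values, for illustration only.
rows = [
    DeltaRow("enwiki", 1, "example.tag_a", ["v1"]),
    DeltaRow("enwiki", 1, "example.tag_b", ["v2", "v3"]),
    DeltaRow("dewiki", 7, "example.tag_a", ["v4"]),
]
updates = group_by_page(rows)

# Three input rows become two updates, one per page.
for (wiki_id, page_id), tags in updates.items():
    print(wiki_id, page_id, tags)
```

Each entry of `updates` corresponds to a single update for one article, with the map<string, array<string>> shape described above.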

AC:

  • image_suggestions tag updates are grouped per page, not per (page, tag) pair

Event Timeline

Restricted Application added a subscriber: Aklapper.
MarkTraceur renamed this task from "Adapt or transform image_suggestions_search_index_delta to allow creating one update per article" to "[L] Adapt or transform image_suggestions_search_index_delta to allow creating one update per article". Apr 16 2025, 4:36 PM

@dcausse changing how the Wikipedias delta is generated is quite tricky: article-level suggestions are isolated from section-level ones, and the respective DAGs write the delta separately. Hence, we'd need to build a new Spark job + DAG that waits for those 2 deltas and outputs the final one.
The numbers suggest that the number of page IDs with more than one update is relatively small compared to the total: the last 5 snapshots have 719,128 page IDs to be updated, of which 22,746 (about 3%) will get more than one update, and 20,828 will get two updates.
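
As a quick check of those figures, the arithmetic can be sketched as follows. It assumes that "two updates" means exactly two, so the remaining multi-update pages get three or more; the variable names are illustrative.

```python
# Figures quoted above for the last 5 snapshots.
total_pages = 719_128
multi_update_pages = 22_746   # pages with more than one update
two_update_pages = 20_828     # assumed to mean exactly two updates

# Share of pages that would benefit from grouping.
share_multi = multi_update_pages / total_pages

# Pages with three or more updates, under the assumption above.
three_plus_pages = multi_update_pages - two_update_pages

# Lower bound on updates saved by grouping: one per two-update page,
# at least two per page with three or more updates.
min_updates_saved = two_update_pages * 1 + three_plus_pages * 2

print(f"share of pages with more than one update: {share_multi:.2%}")
print(f"pages with three or more updates: {three_plus_pages}")
print(f"updates saved by grouping (lower bound): {min_updates_saved}")
```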

Is this a hard requirement for you or a nice to have?

I think this task should be done as part of T372912 which will involve some refactoring of the way the tags are shipped.
I suspect that the deltas you generate could easily be grouped by page_id after they're computed.

The impact of having multiple updates per article:

  • greater number of updates (which we want to minimize for T372912)
  • conflicts: perhaps rare, but if two updates for the same page_id are sent close to each other, one might fail on a conflict when entering elastic.