Search dag image_suggestions_weekly failed waiting for analytics_platform_eng.image_suggestions_search_index_delta/snapshot=2023-11-27
Closed, Resolved · Public

Description

The dataset analytics_platform_eng.image_suggestions_search_index_delta/snapshot=2023-11-27 is not available. This caused the search DAG image_suggestions_weekly to fail with a timeout; the DAG will have to be restarted once the source data is available.

The root cause appears to be missing datasets:

  • 2023-11 snapshot for wmf_raw.mediawiki_pagelinks and wmf_raw.mediawiki_page_props

This task is to coordinate the backfill of these datasets:

  • wmf_raw.mediawiki_pagelinks and wmf_raw.mediawiki_page_props are available with snapshot 2023-11
  • analytics_platform_eng.image_suggestions_search_index_delta/snapshot=2023-11-27 is available
  • image_suggestions_weekly@2023-12-04T00:00:00+00:00 is retried and suggestions are shipped to Elasticsearch

Event Timeline

wmf_raw.mediawiki_pagelinks and wmf_raw.mediawiki_page_props are available with snapshot 2023-11

This is now done, sorry for the trouble.

@Milimetric What was the root cause of this issue (the cause of missing datasets)?

analytics_platform_eng.image_suggestions_search_index_delta/snapshot=2023-11-27

@dcausse, the image suggestions DAG is running now. Please expect the dataset to become available soon.

@dcausse heads up that the dataset is available, but has an unexpectedly large number of rows. Investigating now.

Past snapshots' row counts suggest that large deltas happen when upstream monthly snapshots change, so that might be an explanation.

@mfossati thanks for investigating this. Please let me know once you're confident that we can ship this snapshot.

@dcausse I haven't investigated further, as hopefully the large delta is due to the new monthly snapshots. I can't say I'm confident, but we've already observed a spike when switching to 2023-10 snapshots, so I guess that's fine.
See these row counts of full and delta datasets:

2023-10-23 - full: 70854623 - delta: 113410
2023-10-30 - full: 71362921 - delta: 9058111
2023-11-06 - full: 71391924 - delta: 111255
2023-11-13 - full: 71421381 - delta: 106908
2023-11-20 - full: 71410414 - delta: 124667
2023-11-27 - full: 70771414 - delta: 11854372
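
For reference, the spikes in these counts could be flagged automatically along the lines of the rough sketch below. It compares each weekly delta against the median delta of the listed snapshots. The function name find_delta_spikes and the threshold factor of 10 are assumptions made for illustration; they are not part of the image suggestions pipeline.

```python
from statistics import median

# Row counts copied from the list above: (snapshot, full rows, delta rows).
ROW_COUNTS = [
    ("2023-10-23", 70854623, 113410),
    ("2023-10-30", 71362921, 9058111),
    ("2023-11-06", 71391924, 111255),
    ("2023-11-13", 71421381, 106908),
    ("2023-11-20", 71410414, 124667),
    ("2023-11-27", 70771414, 11854372),
]

# Hypothetical threshold: a delta counts as a spike when it is more than
# this many times the median delta of the listed snapshots.
SPIKE_FACTOR = 10


def find_delta_spikes(rows, factor=SPIKE_FACTOR):
    """Return (snapshot, delta, delta/full) for every delta above factor * median."""
    baseline = median(delta for _, _, delta in rows)
    return [
        (snapshot, delta, delta / full)
        for snapshot, full, delta in rows
        if delta > factor * baseline
    ]


if __name__ == "__main__":
    for snapshot, delta, share in find_delta_spikes(ROW_COUNTS):
        print(f"{snapshot}: delta={delta} ({share:.1%} of full)")
```

With the counts above, only 2023-10-30 and 2023-11-27 are flagged. Their deltas are roughly 13% and 17% of the full dataset, which is consistent with both being the first weekly run after a new monthly snapshot.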

@mfossati OK, going to resume our dag and ship the data.

Gehel claimed this task.