
[L] Create search index deltas by comparing to `discovery.cirrus_index_without_content` in hive
Closed, Resolved (Public)

Description

The search index data is imported from `image_suggestions_search_index_delta`, which is created by diffing this week's snapshot with last week's snapshot in `image_suggestions_search_index_full`.

More precisely, the diff is between the current snapshot and the snapshot for the last successful run of the DAG. This means that if a DAG run fails *after* it has generated the search index data (which gets imported more or less straight away once the delta is ready), the next run will be diffed against the wrong data.

The search team now dumps the contents of the search indices into the `discovery` database in hive. If we diff `image_suggestions_search_index_full` against the table `cirrus_index_without_content` to create `image_suggestions_search_index_delta`, then I think our output data will be more robust, because the delta will be computed against what is actually in the search index.
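
To illustrate the failure mode and the proposed fix, here is a minimal sketch in plain Python. It is not the image-suggestions code: the `compute_delta` helper, the page names and the tag values are made up for illustration, and plain dicts stand in for the Hive tables. The `__DELETE_GROUPING__` marker is the delete instruction mentioned in the discussion on this task.

```python
# Hypothetical sketch: each dict maps a page to its weighted_tags value.
DELETE = "__DELETE_GROUPING__"  # instruction to drop the tag from the index


def compute_delta(full, baseline):
    """Return the writes and deletes that turn `baseline` into `full`."""
    delta = {}
    for page, tags in full.items():
        if baseline.get(page) != tags:
            delta[page] = tags  # new or changed suggestion: write it
    for page in baseline:
        if page not in full:
            delta[page] = DELETE  # suggestion gone: delete it from the index
    return delta


# Week 1: the DAG succeeds, so the index matches the full snapshot.
full_week1 = {"page_a": "tag_x"}
index = dict(full_week1)

# Week 2: the delta is imported, then the DAG fails afterwards.
full_week2 = {"page_a": "tag_x", "page_b": "tag_y"}
for page, tags in compute_delta(full_week2, full_week1).items():
    index[page] = tags  # this example delta only contains writes

# Week 3: page_b is no longer suggested.
full_week3 = {"page_a": "tag_x"}

# Old approach: diff against the last *successful* run (week 1).
old_delta = compute_delta(full_week3, full_week1)
# Proposed approach: diff against what is actually in the index.
new_delta = compute_delta(full_week3, index)

print(old_delta)  # {}
print(new_delta)  # {'page_b': '__DELETE_GROUPING__'}
```

With the old baseline the stale tag on `page_b` is never removed, whereas diffing against the index contents yields the missing delete.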

Details

Related Changes in GitLab:
| Title | Reference | Author | Source Branch | Dest Branch |
| --- | --- | --- | --- | --- |
| Bump image-suggestions to 0.17.0-v0.17.0. | repos/data-engineering/airflow-dags!585 | ghost | bump-image-suggestions-to-0.17.0-v0.17.0 | main |
| Compute the search index delta against the `discovery.cirrus_index_without_content` Hive table | repos/structured-data/image-suggestions!38 | mfossati | T338013 | main |

Event Timeline

Cparle renamed this task from "Create search index deltas by comparing to `discovery.cirrus_index_without_content` in hive rather than last week's data" to "Create search index deltas by comparing to `discovery.cirrus_index_without_content` in hive". Jun 2 2023, 9:41 AM
Cparle updated the task description.
Cparle updated the task description.
Cparle renamed this task from "Create search index deltas by comparing to `discovery.cirrus_index_without_content` in hive" to "[L] Create search index deltas by comparing to `discovery.cirrus_index_without_content` in hive". Jun 28 2023, 4:42 PM
mfossati changed the task status from Open to In Progress. Dec 5 2023, 10:02 AM
mfossati claimed this task.

mfossati opened https://gitlab.wikimedia.org/repos/structured-data/image-suggestions/-/merge_requests/38

Compute the search index delta against the discovery.cirrus_index_without_content Hive table

Approved the MR

Code looks good to me

Looking into the discrepancies between the old diff and the new diff ...

In total there are 1020 rows in the old diff that are not in the new:

  • 327 rows refer to deleted pages - which we can ignore because these will be dropped by the cirrus import job
  • I selected 10 rows at random from the remaining 693 rows for non-deleted pages, and for all of them the data is already in cirrus search

In total there are 195903 rows in the new diff that are not in the old:

  • 13448 rows refer to deleted pages - which we can ignore because these will be dropped by the cirrus import job
  • Excluding the rows referring to deleted pages
    • 158937 are __DELETE_GROUPING__ instructions (deleting the relevant weighted_tag from the search index), and none of the pages appear in image_suggestions_search_index_full so it makes sense to delete them (note that 146945 of these are in the commons index)
    • 23478 have values for weighted_tags, and there's a row in image_suggestions_search_index_full corresponding to every one, so it makes sense to write them

So overall it looks very likely that the discrepancies are errors that crept in due to imports that were incomplete, re-done or aborted, and that this new code corrects them (which is exactly what we were hoping for).
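
The breakdown above can be reproduced with a check along these lines. This is a rough sketch in plain Python rather than the queries actually used, which are not shown here; the row fields (`wiki`, `page_id`, `tag_value`), the `classify` helper and the sample rows are assumptions for illustration.

```python
from collections import Counter

DELETE = "__DELETE_GROUPING__"


def classify(row, deleted_pages, full_pages):
    """Bucket a row that appears in only one of the two diffs."""
    key = (row["wiki"], row["page_id"])
    if key in deleted_pages:
        return "deleted page (ignore)"
    if row["tag_value"] == DELETE:
        # A delete is expected only if the page has no row in the full table.
        return "delete, expected" if key not in full_pages else "delete, check"
    # A write is expected only if the full table has a matching row.
    return "write, expected" if key in full_pages else "write, check"


# Made-up sample data, not taken from the real tables.
deleted_pages = {("commonswiki", 1)}
full_pages = {("enwiki", 3)}
only_in_new_diff = [
    {"wiki": "commonswiki", "page_id": 1, "tag_value": DELETE},
    {"wiki": "commonswiki", "page_id": 2, "tag_value": DELETE},
    {"wiki": "enwiki", "page_id": 3, "tag_value": "some_tag"},
]

counts = Counter(classify(r, deleted_pages, full_pages) for r in only_in_new_diff)
print(counts)
```

Any row landing in a "check" bucket would be a discrepancy that the explanation above does not cover.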

Based on the above, there's a chance that we'll get alerts the first time this code is run, because there are quite a few deletions (of things that probably should have been deleted before), but they're mostly on commons, so perhaps not.

mfossati merged https://gitlab.wikimedia.org/repos/structured-data/image-suggestions/-/merge_requests/38

Compute the search index delta against the discovery.cirrus_index_without_content Hive table

Note that a DAG update would be nice, as we don't need the previous snapshot anymore.

Moving back to doing: DAG update & Airflow test run.

Airflow test run's output:

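# Compare row counts of the production delta and the test-run delta for one snapshot.
# `spark` is assumed to be an existing SparkSession, e.g. in a notebook.
snapshot = '2024-01-08'
prod = spark.read.table('analytics_platform_eng.image_suggestions_search_index_delta').where(f'snapshot="{snapshot}"')
dev = spark.read.table('is_new_delta.image_suggestions_search_index_delta').where(f'snapshot="{snapshot}"')
prod.count(), dev.count()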

(130903, 322933)

Row counts seem in line with https://gitlab.wikimedia.org/repos/structured-data/image-suggestions/-/merge_requests/38#caveat

This week's verification was impossible due to a missing upstream dependency (`wmf.wikidata_item_page_link/snapshot=2024-01-15`).
If the DAG times out, it will be a good opportunity to test the robustness that this ticket is meant to provide.

The DAG timed out as expected. I think we can just wait for the next run to be picked up.