Search dag image_suggestions_weekly failed with: Empty dataframe provided
Closed, Resolved | Public | 2 Estimated Story Points

Description

Seen in the logs while processing analytics_platform_eng.image_suggestions_search_index_delta/snapshot=2024-02-12.
Stack:

LogType:stdout
LogLastModifiedTime:Sun Feb 25 00:24:20 +0000 2024
LogLength:1359
LogContents:
INFO:root:Validating configuration for table analytics_platform_eng.image_suggestions_search_index_delta
INFO:root:Validating fields aliased to weighted_tags
Traceback (most recent call last):
  File "/var/lib/hadoop/data/k/yarn/local/usercache/analytics-search/appcache/application_1707226456123_97551/container_e117_1707226456123_97551_01_000001/convert_to_esbulk.py", line 8, in <module>
    sys.exit(run_cli())
  File "/var/lib/hadoop/data/k/yarn/local/usercache/analytics-search/appcache/application_1707226456123_97551/container_e117_1707226456123_97551_01_000001/venv/lib/python3.10/site-packages/discolytics/cli/convert_to_esbulk.py", line 731, in run_cli
    return main(**dict(vars(args)))
  File "/var/lib/hadoop/data/k/yarn/local/usercache/analytics-search/appcache/application_1707226456123_97551/container_e117_1707226456123_97551_01_000001/venv/lib/python3.10/site-packages/discolytics/cli/convert_to_esbulk.py", line 719, in main
    unique_value_per_partition(df, limit_per_file, 'wikiid')
  File "/var/lib/hadoop/data/k/yarn/local/usercache/analytics-search/appcache/application_1707226456123_97551/container_e117_1707226456123_97551_01_000001/venv/lib/python3.10/site-packages/discolytics/cli/convert_to_esbulk.py", line 631, in unique_value_per_partition
    raise Exception('Empty dataframe provided')
Exception: Empty dataframe provided

On inspection, analytics_platform_eng.image_suggestions_search_index_delta/snapshot=2024-02-12 is empty, and this appears to be expected (see T345570, https://wikimedia.slack.com/archives/C01DFVAQRGA/p1708941091842049).

AC:

  • convert_to_esbulk.py should accept empty partitions
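One way to read this acceptance criterion, as a rough sketch only: the step that plans output partitions per wikiid treats empty input as "nothing to write" rather than raising. The function below is a hypothetical, pure-Python stand-in and is not the discolytics implementation; the name partitions_per_value and the row_counts mapping are assumptions made for illustration, with only limit_per_file and wikiid borrowed from the traceback.

```python
import math
from typing import Dict


def partitions_per_value(row_counts: Dict[str, int], limit_per_file: int) -> Dict[str, int]:
    """Hypothetical stand-in for a per-wikiid partition planner.

    row_counts maps a wikiid to its number of rows. The result maps each
    wikiid to the number of output files needed so that no file holds
    more than limit_per_file rows.
    """
    if limit_per_file <= 0:
        raise ValueError("limit_per_file must be positive")
    if not row_counts:
        # Empty input partition: there is nothing to write, so return an
        # empty plan instead of raising 'Empty dataframe provided'.
        return {}
    return {wiki: math.ceil(n / limit_per_file) for wiki, n in row_counts.items()}
```

With this shape, the caller decides what an empty plan means, for example skipping the upload step entirely.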

Event Timeline

Gehel triaged this task as High priority. Feb 26 2024, 2:21 PM
Gehel moved this task from needs triage to Current work on the Discovery-Search board.
Gehel edited projects, added Discovery-Search (Current work); removed Discovery-Search.
Gehel set the point value for this task to 2.

ebernhardson merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/662

Short-circuit transfer_to_es DAGs in case convert_to_esbulk implies it
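The merge request itself is not reproduced here, so the following is only an illustration of the general short-circuit idea, not the merged change. It assumes, hypothetically, that an empty conversion leaves no non-empty data files in its output directory, and it uses a local directory for simplicity where the real pipeline would look at HDFS. A predicate like this could be passed as the python_callable of Airflow's ShortCircuitOperator, which skips downstream tasks when the callable returns a falsy value.

```python
from pathlib import Path


def has_bulk_output(output_dir: str) -> bool:
    """Hypothetical predicate: True if output_dir holds at least one data file.

    Marker files (names starting with '_' or '.', such as _SUCCESS) and
    zero-byte files are ignored. A missing directory counts as no output.
    """
    root = Path(output_dir)
    if not root.is_dir():
        return False
    return any(
        p.is_file()
        and not p.name.startswith(("_", "."))
        and p.stat().st_size > 0
        for p in root.iterdir()
    )
```

Returning False for an empty snapshot lets the DAG run finish without attempting a transfer, instead of failing.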