ores_predictions_weekly DAG is failing
Closed, Resolved | Public | Bug Report

Description

Currently, the ores_predictions_weekly DAG fails on Airflow with the following error:

Traceback (most recent call last):
  File "fetch_ores_thresholds.py", line 114, in <module>
    sys.exit(main(**dict(vars(args))))
  File "fetch_ores_thresholds.py", line 47, in main
    thresholds = get_all_thresholds(model, ores_scores_api)
  File "fetch_ores_thresholds.py", line 102, in get_all_thresholds
    optimization = get_threshold_at_precision(config, label, target)
  File "fetch_ores_thresholds.py", line 85, in get_threshold_at_precision
    statistics = doc[config.wiki]['models'][config.model]['statistics']
KeyError: 'arwiki'
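
The KeyError means the decoded response had no top-level 'arwiki' entry. Below is a minimal sketch of how the same lookup could fail with a more descriptive message. The helper name get_statistics and the sample documents are hypothetical; only the key path is taken from the traceback.

```python
def get_statistics(doc, wiki, model):
    """Return doc[wiki]['models'][model]['statistics'] or raise a descriptive error.

    Hypothetical helper: it mirrors the key path seen in the traceback but is
    not the code in fetch_ores_thresholds.py.
    """
    try:
        return doc[wiki]["models"][model]["statistics"]
    except KeyError as missing:
        # List the keys that were actually returned so an error payload
        # (instead of the expected per-wiki document) is easy to spot.
        raise ValueError(
            f"no statistics for wiki={wiki!r} model={model!r}: "
            f"missing key {missing.args[0]!r}, top-level keys are {sorted(doc)}"
        ) from missing


if __name__ == "__main__":
    sample = {"arwiki": {"models": {"articletopic": {"statistics": {}}}}}
    print(get_statistics(sample, "arwiki", "articletopic"))
```

With a check like this, a response that carries an error payload instead of per-wiki data would surface its own keys in the message rather than a bare KeyError.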

Event Timeline

This particular script calls production APIs and writes the results to a file; it has no external dependencies. We can look up what was supposed to run in the fixtures[2]. In short, the script is file:///srv/deployment/wikimedia/discovery/analytics/spark/fetch_ores_thresholds.py, and you can run it locally as python3 fetch_ores_thresholds.py --model articletopic --output-path thresholds.json.

To make this easier to find, I've added a patch that writes the skein specification in use to the runtime logs[1].

[1] https://gerrit.wikimedia.org/r/c/wikimedia/discovery/analytics/+/639279
[2] https://github.com/wikimedia/wikimedia-discovery-analytics/blob/master/airflow/tests/fixtures/skein_operator_spec/ores_predictions_weekly-fetch_prediction_thresholds.expected

While this task is not running, the weekly task that ships data to elasticsearch is blocked. Probably the easiest way forward is to copy last week's thresholds to this week's location (taken from the same skein spec) and then mark the threshold fetch as a success in the Airflow web UI. The problems with the ORES API can then be resolved without blocking the pipelines.

sudo -u analytics-search kerberos-run-command analytics-search hdfs dfs -cp hdfs://analytics-hadoop/wmf/data/discovery/ores/thresholds/articletopic/20201018.json hdfs://analytics-hadoop/wmf/data/discovery/ores/thresholds/articletopic/20201025.json
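
For future weeks, the same copy command can be derived from the run date. This is only a sketch: it assumes the thresholds files keep the YYYYMMDD.json naming seen in the two paths above and that runs are exactly one week apart. The function names are hypothetical.

```python
from datetime import date, timedelta

# Base directory taken from the command above.
BASE = "hdfs://analytics-hadoop/wmf/data/discovery/ores/thresholds"


def thresholds_path(model, run_date):
    """Build the thresholds path for a model and run date (assumed YYYYMMDD.json)."""
    return f"{BASE}/{model}/{run_date:%Y%m%d}.json"


def copy_command(model, run_date):
    """Return the argv that copies the previous week's thresholds to run_date."""
    previous = run_date - timedelta(weeks=1)
    return [
        "sudo", "-u", "analytics-search",
        "kerberos-run-command", "analytics-search",
        "hdfs", "dfs", "-cp",
        thresholds_path(model, previous),
        thresholds_path(model, run_date),
    ]


if __name__ == "__main__":
    # Print the command for review instead of executing it.
    print(" ".join(copy_command("articletopic", date(2020, 10, 25))))
```

Printing the command rather than running it keeps the copy a deliberate manual step.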

We are experiencing issues similar to those in T263910: our calls are being blocked because too many clients are connected. I'm going to address this in two ways: adding timeout/retry to the fetches, and adding logging for errors (so that the next investigation is shorter).
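
As a rough illustration of the timeout/retry and error-logging idea, here is a minimal standard-library sketch. It is not the code from the patches; the function name, parameters, and defaults are all assumptions made for the example.

```python
import json
import logging
import time
import urllib.error
import urllib.request

log = logging.getLogger("ores_fetch_sketch")


def fetch_json(url, timeout=10.0, retries=3, backoff=1.0,
               opener=urllib.request.urlopen, sleep=time.sleep):
    """Fetch and decode JSON, retrying on HTTP and network errors.

    Hypothetical helper: `opener` and `sleep` are injectable so the retry
    logic can be exercised without touching the network.
    """
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            with opener(url, timeout=timeout) as response:
                return json.loads(response.read().decode("utf-8"))
        except urllib.error.HTTPError as err:
            # urllib raises HTTPError for 400+ responses; log the status code.
            log.warning("HTTP %s from %s (attempt %d/%d): %s",
                        err.code, url, attempt, retries, err.reason)
            last_error = err
        except OSError as err:
            # URLError and socket timeouts are both OSError subclasses.
            log.warning("Request to %s failed (attempt %d/%d): %s",
                        url, attempt, retries, err)
            last_error = err
        if attempt < retries:
            # Exponential backoff: backoff, 2*backoff, 4*backoff, ...
            sleep(backoff * 2 ** (attempt - 1))
    raise RuntimeError(f"giving up on {url} after {retries} attempts") from last_error
```

The HTTPError handler comes first because HTTPError is a subclass of URLError, which is itself an OSError. Raising after the last attempt makes the task fail on the request itself rather than later on a missing key.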

Change 639737 had a related patch set uploaded (by ZPapierski; owner: ZPapierski):
[wikimedia/discovery/analytics@master] Add timeout and retry to ores fetches

https://gerrit.wikimedia.org/r/639737

Change 639781 had a related patch set uploaded (by ZPapierski; owner: ZPapierski):
[wikimedia/discovery/analytics@master] Add logging for 400+ responses

https://gerrit.wikimedia.org/r/639781

Change 639737 merged by jenkins-bot:
[wikimedia/discovery/analytics@master] Add timeout and retry to ores fetches

https://gerrit.wikimedia.org/r/639737

Change 639781 merged by jenkins-bot:
[wikimedia/discovery/analytics@master] Add logging for 400+ responses

https://gerrit.wikimedia.org/r/639781

We added safeguards to the DAG: timeouts and retries on the ORES fetches, plus logging for 400+ responses.