ores_predictions_weekly DAG is failing
Closed, Resolved · Public · Bug Report

Assigned To

Authored By

	• Zbyszko
	Nov 4 2020, 5:29 PM

Description

Currently, the ores_predictions_weekly DAG fails on Airflow with the following error:

Traceback (most recent call last):
  File "fetch_ores_thresholds.py", line 114, in <module>
    sys.exit(main(**dict(vars(args))))
  File "fetch_ores_thresholds.py", line 47, in main
    thresholds = get_all_thresholds(model, ores_scores_api)
  File "fetch_ores_thresholds.py", line 102, in get_all_thresholds
    optimization = get_threshold_at_precision(config, label, target)
  File "fetch_ores_thresholds.py", line 85, in get_threshold_at_precision
    statistics = doc[config.wiki]['models'][config.model]['statistics']
KeyError: 'arwiki'
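The KeyError above suggests the response document had no entry for the requested wiki, which hides whatever ORES actually returned. A minimal sketch of a more descriptive lookup follows; `get_statistics` is a hypothetical helper, not a function from fetch_ores_thresholds.py, and the idea that the response carries an error payload instead of per-wiki data is an assumption.

```python
def get_statistics(doc, wiki, model):
    """Return doc[wiki]['models'][model]['statistics'] or raise a descriptive error.

    Hypothetical helper for illustration; the real script indexes the
    document directly, as shown in the traceback.
    """
    if wiki not in doc:
        # Assumption: when the request fails, the document holds something
        # other than per-wiki data. Report the keys we did get instead of
        # raising a bare KeyError.
        raise RuntimeError(
            "ORES response has no entry for %r; top-level keys: %s"
            % (wiki, sorted(doc))
        )
    return doc[wiki]["models"][model]["statistics"]
```

With this shape, a failed fetch reports the wiki name and the keys that were present, which shortens the path from the Airflow log to the underlying API problem.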

Details

	Subject	Repo	Branch	Lines +/-
	Add logging for 400+ responses	wikimedia/discovery/analytics	master	+7 -1
	Add timeout and retry to ores fetches	wikimedia/discovery/analytics	master	+58 -21


Related Objects

Mentioned In: rWDAN106f04f3e4ef: Add logging for 400+ responses
rWDANef2f7d6ad973: Add timeout and retry to ores fetches
Mentioned Here: T263910: ORES redis: max number of clients reached...

Event Timeline

• Zbyszko created this task. Nov 4 2020, 5:29 PM

Restricted Application added a subscriber: Aklapper. Nov 4 2020, 5:29 PM

• Zbyszko added a project: CirrusSearch. Nov 4 2020, 5:29 PM

This particular script calls production APIs and writes the results to a file; it has no other external dependencies. We can look up what was supposed to run in the fixtures [2]. In short, the script is file:///srv/deployment/wikimedia/discovery/analytics/spark/fetch_ores_thresholds.py, and you can run it locally as python3 fetch_ores_thresholds.py --model articletopic --output-path thresholds.json.

To make this easier to find, I've added a patch that reports the skein specification in the runtime logs [1].

[1] https://gerrit.wikimedia.org/r/c/wikimedia/discovery/analytics/+/639279
[2] https://github.com/wikimedia/wikimedia-discovery-analytics/blob/master/airflow/tests/fixtures/skein_operator_spec/ores_predictions_weekly-fetch_prediction_thresholds.expected

Because this step is not running, the weekly task that ships data to Elasticsearch is blocked. Probably the easiest way forward is to copy last week's thresholds to this week's location (taken from the same skein spec) and then mark the threshold fetch as a success in the Airflow web UI. The problems with the ORES API can then be resolved without blocking the pipelines.

sudo -u analytics-search kerberos-run-command analytics-search hdfs dfs -cp hdfs://analytics-hadoop/wmf/data/discovery/ores/thresholds/articletopic/20201018.json hdfs://analytics-hadoop/wmf/data/discovery/ores/thresholds/articletopic/20201025.json
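For future weeks, the same copy command could be built from the run date. This is only a sketch: `copy_previous_week_command` is a hypothetical helper, and it assumes the threshold files keep the YYYYMMDD.json naming and the 7-day spacing seen in the command above.

```python
from datetime import date, timedelta

# Base path taken from the command above.
BASE = "hdfs://analytics-hadoop/wmf/data/discovery/ores/thresholds/articletopic"


def copy_previous_week_command(run_date, base=BASE):
    """Build the argv that copies the previous week's thresholds to run_date.

    Assumes weekly files named <base>/YYYYMMDD.json, 7 days apart.
    """
    previous = run_date - timedelta(days=7)
    src = "%s/%s.json" % (base, previous.strftime("%Y%m%d"))
    dst = "%s/%s.json" % (base, run_date.strftime("%Y%m%d"))
    return [
        "sudo", "-u", "analytics-search",
        "kerberos-run-command", "analytics-search",
        "hdfs", "dfs", "-cp", src, dst,
    ]


if __name__ == "__main__":
    # Print the command for the week discussed here; it is not executed.
    print(" ".join(copy_previous_week_command(date(2020, 10, 25))))
```

The helper only builds the argument list, so the command can be reviewed before anyone runs it by hand.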

We are experiencing issues similar to those in T263910: our calls are being blocked because too many clients are connected. I'm going to address this in two ways: adding a timeout and retry, and adding logging for errors (so that the next investigation is shorter).
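Roughly, the timeout/retry and error-logging approach could look like the sketch below. It is not the code from the merged patches: `fetch_with_retry` and the `fetch` callable (returning a status code and body) are hypothetical names, and the retry count and exponential backoff are illustrative assumptions.

```python
import logging
import time

log = logging.getLogger(__name__)


def fetch_with_retry(fetch, url, retries=3, timeout=10.0, backoff=1.0,
                     sleep=time.sleep):
    """Call fetch(url, timeout) until it returns a status below 400.

    `fetch` is a hypothetical callable returning (status_code, body).
    Responses with status 400+ and connection errors are logged and retried.
    """
    last_status = None
    for attempt in range(1, retries + 1):
        try:
            status, body = fetch(url, timeout)
        except OSError as exc:  # covers socket timeouts and connection errors
            log.warning("attempt %d/%d for %s failed: %s",
                        attempt, retries, url, exc)
        else:
            if status < 400:
                return body
            last_status = status
            # Log the start of the body so the next investigation is shorter.
            log.warning("attempt %d/%d for %s returned HTTP %d: %s",
                        attempt, retries, url, status, body[:200])
        if attempt < retries:
            # Exponential backoff: backoff, 2*backoff, 4*backoff, ...
            sleep(backoff * 2 ** (attempt - 1))
    raise RuntimeError("giving up on %s after %d attempts (last status: %s)"
                       % (url, retries, last_status))
```

Passing `fetch` and `sleep` as parameters keeps the helper independent of any particular HTTP library and lets it be exercised without network access.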

Change 639737 had a related patch set uploaded (by ZPapierski; owner: ZPapierski):
[wikimedia/discovery/analytics@master] Add timeout and retry to ores fetches

https://gerrit.wikimedia.org/r/639737

gerritbot added a project: Patch-For-Review. Nov 6 2020, 9:53 AM

Change 639781 had a related patch set uploaded (by ZPapierski; owner: ZPapierski):
[wikimedia/discovery/analytics@master] Add logging for 400+ responses

https://gerrit.wikimedia.org/r/639781

Change 639737 merged by jenkins-bot:
[wikimedia/discovery/analytics@master] Add timeout and retry to ores fetches

https://gerrit.wikimedia.org/r/639737

• Zbyszko mentioned this in rWDANef2f7d6ad973: Add timeout and retry to ores fetches. Nov 6 2020, 4:24 PM

Change 639781 merged by jenkins-bot:
[wikimedia/discovery/analytics@master] Add logging for 400+ responses

https://gerrit.wikimedia.org/r/639781

• Zbyszko mentioned this in rWDAN106f04f3e4ef: Add logging for 400+ responses. Nov 6 2020, 9:56 PM

Maintenance_bot removed a project: Patch-For-Review. Nov 6 2020, 10:10 PM