Implementation Details:
Because Skein ignores the retries=0 setting, the plan is as follows:
Short Term:
- Implement a fix in the Skein library (see code from @xcollazo: https://gitlab.wikimedia.org/xcollazo/skein/-/commit/1ffeb3d7366aee7c80f248461edecd7ca01203c2)
- Try to upstream the change to the Skein library (fingers crossed)
Long Term:
- Migrate Airflow to k8s so we can use Docker and move away from Skein
The Wikitext History job failed 4 times and succeeded on the 5th attempt, causing data duplication in the 2023-05 snapshot:
$ hdfs dfs -du -h /wmf/data/wmf/mediawiki/wikitext/history/snapshot=2023-0* | grep '=enwiki$'
11.2 T  /wmf/data/wmf/mediawiki/wikitext/history/snapshot=2023-04/wiki_db=enwiki
44.7 T  /wmf/data/wmf/mediawiki/wikitext/history/snapshot=2023-05/wiki_db=enwiki
11.3 T  /wmf/data/wmf/mediawiki/wikitext/history/snapshot=2023-06/wiki_db=enwiki
We should do one or more of the following:
- Audit Airflow tasks to determine which can be safely rerun, and set retries to 0 for everything else
- Add Data Quality checks to everything
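As a rough illustration of what a Data Quality check for this kind of duplication could look like, here is a minimal sketch that flags a snapshot whose size is far above the median of its neighbours. The function name find_size_outliers and the max_ratio threshold of 1.5 are hypothetical, chosen only for this example; the sizes are the enwiki figures from the hdfs output above, entered by hand. A real check would need to read sizes from HDFS and pick a threshold that suits each dataset.

```python
from statistics import median


def find_size_outliers(sizes_tb, max_ratio=1.5):
    """Return the snapshots whose size exceeds max_ratio times the median size.

    sizes_tb maps a snapshot label to its size in terabytes.
    max_ratio is a hypothetical threshold, not a tuned value.
    """
    baseline = median(sizes_tb.values())
    return sorted(
        snapshot
        for snapshot, size in sizes_tb.items()
        if size > max_ratio * baseline
    )


# enwiki sizes copied from the hdfs dfs -du output in this task
sizes_tb = {"2023-04": 11.2, "2023-05": 44.7, "2023-06": 11.3}
print(find_size_outliers(sizes_tb))
```

With these three values the median is 11.3 T, so only the 2023-05 snapshot is flagged. Comparing against the median rather than the mean keeps a single inflated snapshot from shifting the baseline.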