In T368754: Production PySpark job that can run consistency checks for wmf_dumps.wikitext_raw, we developed a mechanism to run consistency checks, but we have only tested it on daily runs.
In this task, we want to tune this mechanism and leverage the work done in T372677: Figure a performant way to read all data from revision table via Spark, so that we can run consistency checks against the full revision history of all wikis.
This likely requires:
- Test runs and tuning against enwiki, wikidatawiki, commonswiki.
- After those are successful, we also need to tune the work done in T368755: Python job that reads from wmf_dumps.wikitext_inconsistent_row and produced reconciliation events, so that it can emit the events successfully to EventGate.
- We are not using EventGate, so there is no need to tune this.