In T367570 (Spike: Figure feasibility to emit (wiki_db, revision_id) pairs) we built a proof of concept of a mechanism to emit (wiki_db, revision_id) pairs. In this task we should bring that code to production quality.
Output:
- A PySpark job that runs the same kind of checks as in T367570, but that is properly parameterized (see the first sketch after this list). Done via T368754.
- This job should take a sink_table parameter. Final table name: wmf_dumps.wikitext_inconsistent_rows.
- Think about and define the DDL of wmf_dumps.wikitext_inconsistent_rows so that it is also usable from the point of view of data quality metrics (a possible DDL is sketched below).
- A separate job that reads from wmf_dumps.wikitext_inconsistent_rows and calls EventGate (sketched below). Done via T368755.
- An Airflow DAG that orchestrates all of this (sketched below). The core of the work was done via T368756.
- Figure out a performant way to read all of the data in the revision table via Spark (T372677); one candidate approach is sketched below.
- Add a new hourly Spark MERGE INTO job that consumes the page_content_late_change Hive table (T368746); see the last sketch below.
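
A minimal sketch of how the parameterized entry point could look. Only sink_table and its final default value come from this task; the app name, the optional wiki_db filter, and the Iceberg append at the end are assumptions.

```python
import argparse

from pyspark.sql import SparkSession


def parse_args():
    parser = argparse.ArgumentParser(description="Find inconsistent wikitext rows.")
    # sink_table is specified by this task; the default is the final table name.
    parser.add_argument("--sink_table", default="wmf_dumps.wikitext_inconsistent_rows")
    # Hypothetical extra parameter: restrict the checks to a single wiki.
    parser.add_argument("--wiki_db", default=None)
    return parser.parse_args()


def main():
    args = parse_args()
    spark = SparkSession.builder.appName("wikitext_consistency_check").getOrCreate()
    # ... run the T367570-style consistency checks here, producing a DataFrame
    # `inconsistent_rows` of (wiki_db, revision_id, ...) ...
    # inconsistent_rows.writeTo(args.sink_table).append()  # DataFrameWriterV2 append


if __name__ == "__main__":
    main()
```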
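One possible DDL, assuming the table is Iceberg like other wmf_dumps tables; every column beyond wiki_db and revision_id is an assumption, chosen so that data-quality metrics can slice by wiki, check type, and time.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("create_inconsistent_rows_table").getOrCreate()

# Hedged column set: inconsistency_type, source_snapshot, and dt are assumptions
# aimed at making the table aggregatable for data-quality dashboards.
spark.sql("""
    CREATE TABLE IF NOT EXISTS wmf_dumps.wikitext_inconsistent_rows (
        wiki_db            STRING    COMMENT 'Wiki database name, e.g. enwiki',
        revision_id        BIGINT    COMMENT 'Revision found to be inconsistent',
        inconsistency_type STRING    COMMENT 'Which consistency check failed',
        source_snapshot    STRING    COMMENT 'Snapshot the check ran against',
        dt                 TIMESTAMP COMMENT 'When the inconsistency was detected'
    )
    USING iceberg
    PARTITIONED BY (days(dt))
""")
```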
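A rough sketch of the follow-up job (T368755), assuming EventGate's HTTP intake that accepts a JSON array of events in one POST; the endpoint URL, stream name, and schema URI here are all placeholders.

```python
import requests
from pyspark.sql import SparkSession

EVENTGATE_URL = "https://eventgate.example.org/v1/events"  # placeholder endpoint
STREAM = "mediawiki.dump.inconsistent_row"                 # hypothetical stream name

spark = SparkSession.builder.appName("emit_inconsistent_rows").getOrCreate()
# For very large result sets this collect() would need batching; kept simple here.
rows = spark.table("wmf_dumps.wikitext_inconsistent_rows").collect()

events = [
    {
        "$schema": "/mediawiki/dump/inconsistent_row/1.0.0",  # hypothetical schema
        "meta": {"stream": STREAM},
        "wiki_db": row["wiki_db"],
        "revision_id": row["revision_id"],
    }
    for row in rows
]
response = requests.post(EVENTGATE_URL, json=events)
response.raise_for_status()
```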
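A sketch of the orchestrating DAG (T368756). The operator choice, schedule, DAG id, and artifact names are assumptions; in practice the team's existing Airflow conventions would apply.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

# Sequencing assumption: first detect inconsistent rows, then emit them as events.
with DAG(
    dag_id="wikitext_consistency",  # hypothetical DAG id
    start_date=datetime(2024, 7, 1),
    schedule="@daily",              # assumed cadence
    catchup=False,
) as dag:
    find_inconsistent_rows = SparkSubmitOperator(
        task_id="find_inconsistent_rows",
        application="check_wikitext_consistency.py",  # hypothetical artifact
        application_args=["--sink_table", "wmf_dumps.wikitext_inconsistent_rows"],
    )
    emit_events = SparkSubmitOperator(
        task_id="emit_events_to_eventgate",
        application="emit_inconsistent_rows.py",      # hypothetical artifact
    )
    find_inconsistent_rows >> emit_events
```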
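For T372677, one candidate approach (an assumption, not the task's conclusion) is a partitioned JDBC read, which lets Spark scan the revision table over many parallel connections split on the primary key; the host, bounds, and partition count below are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read_revision").getOrCreate()

revision = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://db-replica.example.org/enwiki")  # placeholder host
    .option("dbtable", "revision")
    .option("partitionColumn", "rev_id")  # split the scan on the primary key
    .option("lowerBound", "1")
    .option("upperBound", "2000000000")   # rough rev_id ceiling; tune per wiki
    .option("numPartitions", "256")       # number of parallel JDBC reads
    .load()
)
```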
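A hedged sketch of the hourly upsert (T368746). The target table name, join keys, and the hourly partition predicate are assumptions, the source and target column sets are assumed to line up, and MERGE INTO with SET * / INSERT * requires an Iceberg (or similarly MERGE-capable) target.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("merge_late_changes").getOrCreate()

spark.sql("""
    MERGE INTO wmf_dumps.wikitext_raw AS target  -- assumed target table
    USING (
        SELECT *
        FROM page_content_late_change            -- Hive table from this task
        WHERE year = 2024 AND month = 7 AND day = 1 AND hour = 0  -- one hourly slice
    ) AS source
    ON  target.wiki_db = source.wiki_db
    AND target.revision_id = source.revision_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```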