In T368176 ([Dumps 2] Spike: Figure root causes of missing rows when doing reconciliation), we found that a large share of the mismatches results from not processing page delete, restore, and merge events. This task is to look into our incremental Airflow jobs (the ones that update wmf_dumps.wikitext_raw) and determine the best way to include the missed delete, restore, and merge events.
We figured that a good way to move forward while simplifying the reconcile mechanism is to:
- Continue having an hourly Spark MERGE INTO job that consumes the page_content_change Hive table.
- Have one additional hourly Spark job that scans page_content_change for deletes, accumulates the page_ids, and applies a single DELETE. A similar mechanism should be used for page moves, applying an UPDATE rather than a DELETE.
- Add a new hourly Spark MERGE INTO job that consumes the page_content__late_change Hive table. (Will be done on a separate ticket, T375077.)
- All remaining inconsistencies should be relatively small, and thus can wait for the reconcile mechanism. Since page deletes and page moves would be applied on an hourly basis, the reconcile mechanism can potentially be run less frequently.
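
To make the "accumulate the page_ids, apply a single DELETE" idea concrete, here is a rough sketch in plain Python of how the statement for one hour could be assembled. It is not the actual job: the event field names (`wiki_db`, `page_id`, `page_change_kind`), the `"delete"` value, and the helper names are assumptions for illustration, and a real job would read the hour's events from page_content_change and hand the resulting SQL to Spark.

```python
from collections import defaultdict

TARGET_TABLE = "wmf_dumps.wikitext_raw"  # table named in this task


def collect_deleted_pages(events):
    """Group the page_ids of delete events by wiki.

    Assumes each event is a dict with 'wiki_db', 'page_id' and
    'page_change_kind' keys (hypothetical field names).
    """
    deleted = defaultdict(set)
    for ev in events:
        if ev["page_change_kind"] == "delete":
            # int() keeps non-numeric values out of the generated SQL
            deleted[ev["wiki_db"]].add(int(ev["page_id"]))
    return deleted


def build_delete_statement(deleted):
    """Return one DELETE covering all accumulated pages, or None if empty.

    wiki_db is interpolated as-is here; a real job should validate it
    against the known list of wikis before building SQL.
    """
    if not deleted:
        return None
    predicates = []
    for wiki in sorted(deleted):
        ids = ", ".join(str(i) for i in sorted(deleted[wiki]))
        predicates.append(f"(wiki_db = '{wiki}' AND page_id IN ({ids}))")
    return f"DELETE FROM {TARGET_TABLE} WHERE " + " OR ".join(predicates)


if __name__ == "__main__":
    hour_events = [
        {"wiki_db": "enwiki", "page_id": 2, "page_change_kind": "delete"},
        {"wiki_db": "enwiki", "page_id": 3, "page_change_kind": "edit"},
    ]
    print(build_delete_statement(collect_deleted_pages(hour_events)))
```

Note that this sketch ignores event ordering within the hour: a page that is deleted and then restored in the same hour would still end up in the DELETE, so the real job would need to consider the latest event per page. The page move case could follow the same shape, emitting an UPDATE instead.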