On T335860, we implemented a PySpark job that runs a MERGE INTO to transform event data into a table that will eventually contain the full MediaWiki revision history.
Reviews of the MERGE INTO pointed out situations that should never happen, assuming the table is properly backfilled:
```
WHEN MATCHED AND s.changelog_kind IN ('insert', 'update') AND s.dt >= t.dt THEN
  UPDATE SET
    ...
WHEN MATCHED AND s.changelog_kind = 'delete' AND s.dt >= t.dt THEN DELETE
WHEN NOT MATCHED AND s.changelog_kind IN ('insert', 'update') THEN
  INSERT (
    ...
```
For example, in the snippet above we should never hit `WHEN MATCHED AND s.changelog_kind = 'insert'` (an insert event for a revision that already exists in the target), and we should never hit `WHEN NOT MATCHED AND s.changelog_kind = 'update'` (an update event for a revision we have never seen).
Also:
>>! In T347718#9247461, @xcollazo wrote:
> Discussed this ticket with @JAllemandou, and there are some correctness issues that this work does not address. For example, we want to force `dumps_merge_visibility_events_to_wikitext_raw` to run only after the same hour has been ingested by `dumps_merge_events_to_wikitext_raw`, because those visibility changes may be mutating new (wiki_db, revision_id) tuples from the same hour.
>
> I will tackle these correctness issues separately though, as part of T340863, since we also discussed that even with correct ordering, we want to be able to detect when something does not look right.
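For reference, that hour-alignment constraint could be expressed as an orchestration dependency. Below is a minimal sketch, assuming both jobs run as hourly Airflow DAGs; the DAG ids mirror the job names but everything else is illustrative, and the real fix belongs to T340863:
```
from datetime import datetime

from airflow import DAG
from airflow.sensors.external_task import ExternalTaskSensor

# Illustrative only: block the visibility merge for hour H until the events
# merge for the same hour H has finished, so visibility changes never race
# ahead of the (wiki_db, revision_id) tuples they mutate.
with DAG(
    dag_id="dumps_merge_visibility_events_to_wikitext_raw",  # assumed DAG id
    schedule_interval="@hourly",
    start_date=datetime(2023, 10, 1),
    catchup=True,
) as dag:
    wait_for_events_merge = ExternalTaskSensor(
        task_id="wait_for_events_merge",
        external_dag_id="dumps_merge_events_to_wikitext_raw",  # assumed DAG id
        external_task_id=None,  # wait for the whole DAG run of the same hour
        mode="reschedule",
    )
    # The actual visibility MERGE task would then depend on the sensor:
    # wait_for_events_merge >> merge_visibility_events
```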
In this task we should:
[] Wait until T340861 is complete: these situations do happen right now, but only because the table is not yet backfilled.
[] Run a couple of queries to check whether the above situations still occur after the backfill (see the query sketch after this list).
[] Build a mechanism to log these situations. It could be as simple as an extra column in the target table where we record the anomaly, or something fancier (see the MERGE sketch after this list).
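For the second checkbox, a minimal sketch of the sanity queries, assuming hypothetical source and target table names (`event_source`, `wikitext_raw`) and the same join key the MERGE uses; all three counts should be 0 once T340861 is done:
```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Count events that would take the "impossible" MERGE branches.
# `event_source` and `wikitext_raw` are placeholder names.
anomalies = spark.sql("""
    SELECT
      SUM(CASE WHEN t.revision_id IS NOT NULL
                AND s.changelog_kind = 'insert' THEN 1 ELSE 0 END) AS matched_inserts,    -- expect 0
      SUM(CASE WHEN t.revision_id IS NULL
                AND s.changelog_kind = 'update' THEN 1 ELSE 0 END) AS unmatched_updates,  -- expect 0
      SUM(CASE WHEN t.revision_id IS NULL
                AND s.changelog_kind = 'delete' THEN 1 ELSE 0 END) AS unmatched_deletes   -- expect 0
    FROM event_source s
    LEFT JOIN wikitext_raw t
      ON s.wiki_db = t.wiki_db
     AND s.revision_id = t.revision_id
""")
anomalies.show()
```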
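For the last checkbox, one possible shape of the "extra column" idea, as a minimal sketch: the anomalous branches write a marker instead of being silently unreachable. Table and column names (`wikitext_raw`, `event_source`, `merge_anomaly`) are illustrative, not the real schema, and non-key columns are elided:
```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Illustrative schema: (wiki_db, revision_id, dt, merge_anomaly, ...).
spark.sql("""
    MERGE INTO wikitext_raw t
    USING event_source s
      ON s.wiki_db = t.wiki_db AND s.revision_id = t.revision_id

    -- anomaly: an 'insert' event matched an existing row; keep the row but flag it
    WHEN MATCHED AND s.changelog_kind = 'insert' THEN
      UPDATE SET t.merge_anomaly = 'matched_insert'

    -- normal paths, as in the snippet above
    WHEN MATCHED AND s.changelog_kind = 'update' AND s.dt >= t.dt THEN
      UPDATE SET t.dt = s.dt, t.merge_anomaly = NULL
    WHEN MATCHED AND s.changelog_kind = 'delete' AND s.dt >= t.dt THEN
      DELETE

    -- anomaly: an 'update' event matched nothing; insert it, flagged
    WHEN NOT MATCHED AND s.changelog_kind = 'update' THEN
      INSERT (wiki_db, revision_id, dt, merge_anomaly)
      VALUES (s.wiki_db, s.revision_id, s.dt, 'unmatched_update')

    WHEN NOT MATCHED AND s.changelog_kind = 'insert' THEN
      INSERT (wiki_db, revision_id, dt, merge_anomaly)
      VALUES (s.wiki_db, s.revision_id, s.dt, NULL)
""")
```
A periodic `SELECT count(*) ... WHERE merge_anomaly IS NOT NULL` would then give a cheap health signal on the target table.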