In T335860#8953786 (copy pasted below), we found up to 19 years of drift between `dt` and `meta.dt` on `event. rc1_mediawiki_page_content_change`.
In this task we should:
[] investigate the nature of this drift, and whether the events are legit or just errors.
[] if they are errors, open tasks (or pass this one) for #event-platform folks to fix.
>>! In T335860#8953786, @xcollazo wrote:
>>>! In T335860#8950820, @Ottomata wrote:
>>> current data on event.rc1_mediawiki_page_content_change suggests drift could be years even.
>> Really? I have seen some silly events, but we might be able to discard them? {T282887}
> ```
> SELECT
> abs(to_unix_timestamp(to_timestamp(dt)) - to_unix_timestamp(to_timestamp(meta.dt)))/60/60/24/365 as drift_in_years,
> dt,
> meta.dt as meta_dt
> FROM rc1_mediawiki_page_content_change
> ORDER BY drift_in_years DESC LIMIT 10;
>
> drift_in_years dt meta_dt
> 19.72640277777778 2003-09-07T09:45:35Z 2023-05-25T13:02:53.801266Z
> 19.396066749112126 2004-01-14T07:09:11Z 2023-06-02T20:41:52.689117Z
> 18.998406360984273 2004-04-29T11:38:29Z 2023-04-24T21:40:52.27638Z
> 18.885413749365803 2004-08-05T13:55:41Z 2023-06-20T18:09:09.097318Z
> 18.882671867072553 2004-08-05T13:55:41Z 2023-06-19T18:08:01.391324Z
> 18.879937975646882 2004-08-05T13:55:41Z 2023-06-18T18:11:05.404038Z
> 18.87719641045155 2004-08-05T13:55:41Z 2023-06-17T18:10:07.403142Z
> 18.87445662100457 2004-08-05T13:55:41Z 2023-06-16T18:10:05.418078Z
> 18.871715372907154 2004-08-05T13:55:41Z 2023-06-15T18:09:17.409511Z
> 18.868978405631662 2004-08-05T13:55:41Z 2023-06-14T18:10:44.416111Z
> Time taken: 36.701 seconds, Fetched 10 row(s)
> ```
>
>>> When a (wiki_id, revision.rev_id) pair has an event with the same dt
>> Would be curious to know the cases for these.
> These are pervasive, and they are particularly interesting to me because an Iceberg `MERGE INTO` will fail if they are found (as is, only one update is allowed per `ON` clause match):
> (excuse the convoluted SELECT statement below, but I had to salt it, otherwise it takes forever:)
> ```
> SELECT
> count(1) as count
> FROM
> (
> SELECT
> *,
> FLOOR(RAND() * 8) AS salt
> FROM rc1_mediawiki_page_content_change
> ) AS a
> INNER JOIN
> (
> SELECT
> *,
> EXPLODE(ARRAY(0,1,2,3,4,5,6,7)) AS salt
> FROM rc1_mediawiki_page_content_change
> ) AS b
> ON
> a.wiki_id = b.wiki_id AND
> a.revision.rev_id = b.revision.rev_id AND
> a.meta.id != b.meta.id AND
> a.dt = b.dt AND
> a.salt = b.salt
>
> count
> 218672499498
> Time taken: 1236.364 seconds, Fetched 1 row(s)
> ```
>
> Some example hits:
> ```
> SELECT
> a.meta.id as a_meta_id,
> b.meta.id as b_meta_id
> FROM
> (
> SELECT
> *,
> FLOOR(RAND() * 8) AS salt
> FROM rc1_mediawiki_page_content_change
> ) AS a
> INNER JOIN
> (
> SELECT
> *,
> EXPLODE(ARRAY(0,1,2,3,4,5,6,7)) AS salt
> FROM rc1_mediawiki_page_content_change
> ) AS b
> ON
> a.wiki_id = b.wiki_id AND
> a.revision.rev_id = b.revision.rev_id AND
> a.meta.id != b.meta.id AND
> a.dt = b.dt AND
> a.salt = b.salt
> LIMIT 100;
>
> fd386604-d9b4-477e-b3aa-287b9c204da6 08bd12dc-ca62-45f6-ba6a-d8944f4ad434
> fd386604-d9b4-477e-b3aa-287b9c204da6 08bd12dc-ca62-45f6-ba6a-d8944f4ad434
> 3feec88c-1089-4a81-85c5-7b831b3334f2 f4fd4f51-198a-459c-8641-f6f067675eda
> 5784ee08-6c99-4fb3-a5c8-83caf99b0bf0 e6f2c18c-5777-493d-9331-a4aecc4e5205
> 55d55de1-bd65-4b6c-8a3d-ff1755d42132 428ba882-7843-4db0-8624-2c6752bc1bdb
> 55d55de1-bd65-4b6c-8a3d-ff1755d42132 428ba882-7843-4db0-8624-2c6752bc1bdb
> 55d55de1-bd65-4b6c-8a3d-ff1755d42132 428ba882-7843-4db0-8624-2c6752bc1bdb
> 55d55de1-bd65-4b6c-8a3d-ff1755d42132 428ba882-7843-4db0-8624-2c6752bc1bdb
> 977ee408-d428-4cf6-9107-d37ac74de9b0 2fd90807-1b53-4c8c-8680-0dae3b216557
> 795b5b28-0a44-4e30-b0be-2ece1c39b99f ea8a90fb-6ed0-4fc3-905f-453bdcd38432
> 9540f855-cbcd-4c29-9243-d6163c396f5c 4da403a4-8391-4118-8f82-00e48a419d3a
> c9a9d101-3f9e-49a3-87c1-0280998b6917 a4118e75-84b7-40c7-900b-4235e7fa1d11
> ...
> ```
>
> Zooming in on one particular example:
>
> ```
> SELECT
> changelog_kind, dt, meta.dt as meta_dt, meta.id as meta_id, meta.request_id as meta_request_id, page_change_kind, wiki_id, datacenter, year, month, day, hour
> FROM
> rc1_mediawiki_page_content_change
> WHERE
> meta.id IN ('fd386604-d9b4-477e-b3aa-287b9c204da6', '08bd12dc-ca62-45f6-ba6a-d8944f4ad434')
>
> changelog_kind dt meta_dt meta_id meta_request_id page_change_kind wiki_id datacenteryear month day hour
> update 2023-04-27T15:28:43Z 2023-04-27T15:28:50.524204Z fd386604-d9b4-477e-b3aa-287b9c204da6 d3fd1331-5ca6-4703-93cf-cf97c412b93f edit abwiki eqiad 2023 4 27 15
> update 2023-04-27T15:28:43Z 2023-05-03T22:07:15.937117Z 08bd12dc-ca62-45f6-ba6a-d8944f4ad434 b69457a0-d51e-47fa-9546-ac3464c127b4 edit abwiki eqiad 2023 5 3 22
> update 2023-04-27T15:28:43Z 2023-05-03T22:07:19.614943Z fd386604-d9b4-477e-b3aa-287b9c204da6 d3fd1331-5ca6-4703-93cf-cf97c412b93f edit abwiki eqiad 2023 5 3 22
> update 2023-04-27T15:28:43Z 2023-04-27T15:28:46.44396Z 08bd12dc-ca62-45f6-ba6a-d8944f4ad434 b69457a0-d51e-47fa-9546-ac3464c127b4 edit abwiki eqiad 2023 4 27 15
> Time taken: 94.721 seconds, Fetched 4 row(s)
> ```
>
> So this one looks like it was consumed twice by the event pipeline on `2023-04-27` and, for some reason, reconsumed twice on `2023-05-03`. Note that they are all `datancenter=eqiad` so likely unrelated to a switchover.
>
> Another observation is that `dt`'s resolution is to the second. It would be nice if the resolution was bumped to nanos, like in `meta.dt`. Less chance of clashes.
>
> (I just discussed a separate example with @milimetric in which we speculated the event was triggered multiple times by MediaWiki.)
>
>
>>> perhaps we do not need to consume mediawiki.revision_visibility_change at all since it is folded in already?
>> page_change and page_content_change only capture visibility changes to the current revision, and the current rev's content cannot be suppressed (a new rev must be created first). `mediawiki.revision_visibility_change` will have visibility changes for all revisions.
>
> Got it!