In T335860#8953786 (copy pasted below), we found up to 19 years of drift between dt and meta.dt on event. rc1_mediawiki_page_content_change.
In this task we should:
- investigate the nature of this drift, and whether the events are legit or just errors.
- if they are errors, open tasks (or pass this one) for Event-Platform folks to fix.
In T335860#8953786, @xcollazo wrote:In T335860#8950820, @Ottomata wrote:current data on event.rc1_mediawiki_page_content_change suggests drift could be years even.
Really? I have seen some silly events, but we might be able to discard them? T282887: Avoid accepting Kafka messages with whacky timestamps
SELECT abs(to_unix_timestamp(to_timestamp(dt)) - to_unix_timestamp(to_timestamp(meta.dt)))/60/60/24/365 as drift_in_years, dt, meta.dt as meta_dt FROM rc1_mediawiki_page_content_change ORDER BY drift_in_years DESC LIMIT 10; drift_in_years dt meta_dt 19.72640277777778 2003-09-07T09:45:35Z 2023-05-25T13:02:53.801266Z 19.396066749112126 2004-01-14T07:09:11Z 2023-06-02T20:41:52.689117Z 18.998406360984273 2004-04-29T11:38:29Z 2023-04-24T21:40:52.27638Z 18.885413749365803 2004-08-05T13:55:41Z 2023-06-20T18:09:09.097318Z 18.882671867072553 2004-08-05T13:55:41Z 2023-06-19T18:08:01.391324Z 18.879937975646882 2004-08-05T13:55:41Z 2023-06-18T18:11:05.404038Z 18.87719641045155 2004-08-05T13:55:41Z 2023-06-17T18:10:07.403142Z 18.87445662100457 2004-08-05T13:55:41Z 2023-06-16T18:10:05.418078Z 18.871715372907154 2004-08-05T13:55:41Z 2023-06-15T18:09:17.409511Z 18.868978405631662 2004-08-05T13:55:41Z 2023-06-14T18:10:44.416111Z Time taken: 36.701 seconds, Fetched 10 row(s)When a (wiki_id, revision.rev_id) pair has an event with the same dt
Would be curious to know the cases for these.
These are pervasive, and they are particularly interesting to me because an Iceberg MERGE INTO will fail if they are found (as is, only one update is allowed per ON clause match):
(excuse the convoluted SELECT statement below, but I had to salt it, otherwise it takes forever:)SELECT count(1) as count FROM ( SELECT *, FLOOR(RAND() * 8) AS salt FROM rc1_mediawiki_page_content_change ) AS a INNER JOIN ( SELECT *, EXPLODE(ARRAY(0,1,2,3,4,5,6,7)) AS salt FROM rc1_mediawiki_page_content_change ) AS b ON a.wiki_id = b.wiki_id AND a.revision.rev_id = b.revision.rev_id AND a.meta.id != b.meta.id AND a.dt = b.dt AND a.salt = b.salt count 218672499498 Time taken: 1236.364 seconds, Fetched 1 row(s)Some example hits:
SELECT a.meta.id as a_meta_id, b.meta.id as b_meta_id FROM ( SELECT *, FLOOR(RAND() * 8) AS salt FROM rc1_mediawiki_page_content_change ) AS a INNER JOIN ( SELECT *, EXPLODE(ARRAY(0,1,2,3,4,5,6,7)) AS salt FROM rc1_mediawiki_page_content_change ) AS b ON a.wiki_id = b.wiki_id AND a.revision.rev_id = b.revision.rev_id AND a.meta.id != b.meta.id AND a.dt = b.dt AND a.salt = b.salt LIMIT 100; fd386604-d9b4-477e-b3aa-287b9c204da6 08bd12dc-ca62-45f6-ba6a-d8944f4ad434 fd386604-d9b4-477e-b3aa-287b9c204da6 08bd12dc-ca62-45f6-ba6a-d8944f4ad434 3feec88c-1089-4a81-85c5-7b831b3334f2 f4fd4f51-198a-459c-8641-f6f067675eda 5784ee08-6c99-4fb3-a5c8-83caf99b0bf0 e6f2c18c-5777-493d-9331-a4aecc4e5205 55d55de1-bd65-4b6c-8a3d-ff1755d42132 428ba882-7843-4db0-8624-2c6752bc1bdb 55d55de1-bd65-4b6c-8a3d-ff1755d42132 428ba882-7843-4db0-8624-2c6752bc1bdb 55d55de1-bd65-4b6c-8a3d-ff1755d42132 428ba882-7843-4db0-8624-2c6752bc1bdb 55d55de1-bd65-4b6c-8a3d-ff1755d42132 428ba882-7843-4db0-8624-2c6752bc1bdb 977ee408-d428-4cf6-9107-d37ac74de9b0 2fd90807-1b53-4c8c-8680-0dae3b216557 795b5b28-0a44-4e30-b0be-2ece1c39b99f ea8a90fb-6ed0-4fc3-905f-453bdcd38432 9540f855-cbcd-4c29-9243-d6163c396f5c 4da403a4-8391-4118-8f82-00e48a419d3a c9a9d101-3f9e-49a3-87c1-0280998b6917 a4118e75-84b7-40c7-900b-4235e7fa1d11 ...Zooming in on one particular example:
SELECT changelog_kind, dt, meta.dt as meta_dt, meta.id as meta_id, meta.request_id as meta_request_id, page_change_kind, wiki_id, datacenter, year, month, day, hour FROM rc1_mediawiki_page_content_change WHERE meta.id IN ('fd386604-d9b4-477e-b3aa-287b9c204da6', '08bd12dc-ca62-45f6-ba6a-d8944f4ad434') changelog_kind dt meta_dt meta_id meta_request_id page_change_kind wiki_id datacenteryear month day hour update 2023-04-27T15:28:43Z 2023-04-27T15:28:50.524204Z fd386604-d9b4-477e-b3aa-287b9c204da6 d3fd1331-5ca6-4703-93cf-cf97c412b93f edit abwiki eqiad 2023 4 27 15 update 2023-04-27T15:28:43Z 2023-05-03T22:07:15.937117Z 08bd12dc-ca62-45f6-ba6a-d8944f4ad434 b69457a0-d51e-47fa-9546-ac3464c127b4 edit abwiki eqiad 2023 5 3 22 update 2023-04-27T15:28:43Z 2023-05-03T22:07:19.614943Z fd386604-d9b4-477e-b3aa-287b9c204da6 d3fd1331-5ca6-4703-93cf-cf97c412b93f edit abwiki eqiad 2023 5 3 22 update 2023-04-27T15:28:43Z 2023-04-27T15:28:46.44396Z 08bd12dc-ca62-45f6-ba6a-d8944f4ad434 b69457a0-d51e-47fa-9546-ac3464c127b4 edit abwiki eqiad 2023 4 27 15 Time taken: 94.721 seconds, Fetched 4 row(s)So this one looks like it was consumed twice by the event pipeline on 2023-04-27 and, for some reason, reconsumed twice on 2023-05-03. Note that they are all datancenter=eqiad so likely unrelated to a switchover.
Another observation is that dt's resolution is to the second. It would be nice if the resolution was bumped to nanos, like in meta.dt. Less chance of clashes.
(I just discussed a separate example with @Milimetric in which we speculated the event was triggered multiple times by MediaWiki.)
perhaps we do not need to consume mediawiki.revision_visibility_change at all since it is folded in already?
page_change and page_content_change only capture visibility changes to the current revision, and the current rev's content cannot be suppressed (a new rev must be created first). mediawiki.revision_visibility_change will have visibility changes for all revisions.
Got it!