Page MenuHomePhabricator

Investigate drift between `dt` and `meta.dt`
Closed, ResolvedPublic2 Estimated Story Points

Description

In T335860#8953786 (copy pasted below), we found up to 19 years of drift between dt and meta.dt on event. rc1_mediawiki_page_content_change.

In this task we should:

  • investigate the nature of this drift, and whether the events are legit or just errors.
  • if they are errors, open tasks (or pass this one) for Event-Platform folks to fix.

current data on event.rc1_mediawiki_page_content_change suggests drift could be years even.

Really? I have seen some silly events, but we might be able to discard them? T282887: Avoid accepting Kafka messages with whacky timestamps

SELECT
    abs(to_unix_timestamp(to_timestamp(dt)) - to_unix_timestamp(to_timestamp(meta.dt)))/60/60/24/365 as drift_in_years,
    dt,
    meta.dt as meta_dt
FROM rc1_mediawiki_page_content_change
ORDER BY drift_in_years DESC LIMIT 10;

drift_in_years          dt                      meta_dt
19.72640277777778	2003-09-07T09:45:35Z	2023-05-25T13:02:53.801266Z
19.396066749112126	2004-01-14T07:09:11Z	2023-06-02T20:41:52.689117Z
18.998406360984273	2004-04-29T11:38:29Z	2023-04-24T21:40:52.27638Z
18.885413749365803	2004-08-05T13:55:41Z	2023-06-20T18:09:09.097318Z
18.882671867072553	2004-08-05T13:55:41Z	2023-06-19T18:08:01.391324Z
18.879937975646882	2004-08-05T13:55:41Z	2023-06-18T18:11:05.404038Z
18.87719641045155	2004-08-05T13:55:41Z	2023-06-17T18:10:07.403142Z
18.87445662100457	2004-08-05T13:55:41Z	2023-06-16T18:10:05.418078Z
18.871715372907154	2004-08-05T13:55:41Z	2023-06-15T18:09:17.409511Z
18.868978405631662	2004-08-05T13:55:41Z	2023-06-14T18:10:44.416111Z
Time taken: 36.701 seconds, Fetched 10 row(s)

When a (wiki_id, revision.rev_id) pair has an event with the same dt

Would be curious to know the cases for these.

These are pervasive, and they are particularly interesting to me because an Iceberg MERGE INTO will fail if they are found (as is, only one update is allowed per ON clause match):
(excuse the convoluted SELECT statement below, but I had to salt it, otherwise it takes forever:)

SELECT
count(1) as count
FROM
(
  SELECT 
    *, 
    FLOOR(RAND() * 8) AS salt 
    FROM rc1_mediawiki_page_content_change
) AS a
INNER JOIN
(
  SELECT 
    *, 
    EXPLODE(ARRAY(0,1,2,3,4,5,6,7)) AS salt 
    FROM rc1_mediawiki_page_content_change
) AS b
ON
a.wiki_id = b.wiki_id AND
a.revision.rev_id = b.revision.rev_id AND
a.meta.id != b.meta.id AND
a.dt = b.dt AND
a.salt = b.salt

count
218672499498
Time taken: 1236.364 seconds, Fetched 1 row(s)

Some example hits:

SELECT
a.meta.id as a_meta_id,
b.meta.id as b_meta_id
FROM
(
  SELECT 
    *, 
    FLOOR(RAND() * 8) AS salt 
    FROM rc1_mediawiki_page_content_change
) AS a
INNER JOIN
(
  SELECT 
    *, 
    EXPLODE(ARRAY(0,1,2,3,4,5,6,7)) AS salt 
    FROM rc1_mediawiki_page_content_change
) AS b
ON
a.wiki_id = b.wiki_id AND
a.revision.rev_id = b.revision.rev_id AND
a.meta.id != b.meta.id AND
a.dt = b.dt AND
a.salt = b.salt
LIMIT 100;

fd386604-d9b4-477e-b3aa-287b9c204da6	08bd12dc-ca62-45f6-ba6a-d8944f4ad434
fd386604-d9b4-477e-b3aa-287b9c204da6	08bd12dc-ca62-45f6-ba6a-d8944f4ad434
3feec88c-1089-4a81-85c5-7b831b3334f2	f4fd4f51-198a-459c-8641-f6f067675eda
5784ee08-6c99-4fb3-a5c8-83caf99b0bf0	e6f2c18c-5777-493d-9331-a4aecc4e5205
55d55de1-bd65-4b6c-8a3d-ff1755d42132	428ba882-7843-4db0-8624-2c6752bc1bdb
55d55de1-bd65-4b6c-8a3d-ff1755d42132	428ba882-7843-4db0-8624-2c6752bc1bdb
55d55de1-bd65-4b6c-8a3d-ff1755d42132	428ba882-7843-4db0-8624-2c6752bc1bdb
55d55de1-bd65-4b6c-8a3d-ff1755d42132	428ba882-7843-4db0-8624-2c6752bc1bdb
977ee408-d428-4cf6-9107-d37ac74de9b0	2fd90807-1b53-4c8c-8680-0dae3b216557
795b5b28-0a44-4e30-b0be-2ece1c39b99f	ea8a90fb-6ed0-4fc3-905f-453bdcd38432
9540f855-cbcd-4c29-9243-d6163c396f5c	4da403a4-8391-4118-8f82-00e48a419d3a
c9a9d101-3f9e-49a3-87c1-0280998b6917	a4118e75-84b7-40c7-900b-4235e7fa1d11
...

Zooming in on one particular example:

SELECT
  changelog_kind, dt, meta.dt as meta_dt, meta.id as meta_id, meta.request_id as meta_request_id, page_change_kind, wiki_id, datacenter, year, month, day, hour
FROM
  rc1_mediawiki_page_content_change
WHERE
  meta.id IN ('fd386604-d9b4-477e-b3aa-287b9c204da6', '08bd12dc-ca62-45f6-ba6a-d8944f4ad434')
  
changelog_kind	dt	meta_dt	meta_id	meta_request_id	page_change_kind	wiki_id	datacenteryear	month	day	hour
update	2023-04-27T15:28:43Z	2023-04-27T15:28:50.524204Z	fd386604-d9b4-477e-b3aa-287b9c204da6	d3fd1331-5ca6-4703-93cf-cf97c412b93f	edit	abwiki	eqiad	2023	4	27	15
update	2023-04-27T15:28:43Z	2023-05-03T22:07:15.937117Z	08bd12dc-ca62-45f6-ba6a-d8944f4ad434	b69457a0-d51e-47fa-9546-ac3464c127b4	edit	abwiki	eqiad	2023	5	3	22
update	2023-04-27T15:28:43Z	2023-05-03T22:07:19.614943Z	fd386604-d9b4-477e-b3aa-287b9c204da6	d3fd1331-5ca6-4703-93cf-cf97c412b93f	edit	abwiki	eqiad	2023	5	3	22
update	2023-04-27T15:28:43Z	2023-04-27T15:28:46.44396Z	08bd12dc-ca62-45f6-ba6a-d8944f4ad434	b69457a0-d51e-47fa-9546-ac3464c127b4	edit	abwiki	eqiad	2023	4	27	15
Time taken: 94.721 seconds, Fetched 4 row(s)

So this one looks like it was consumed twice by the event pipeline on 2023-04-27 and, for some reason, reconsumed twice on 2023-05-03. Note that they are all datancenter=eqiad so likely unrelated to a switchover.

Another observation is that dt's resolution is to the second. It would be nice if the resolution was bumped to nanos, like in meta.dt. Less chance of clashes.

(I just discussed a separate example with @Milimetric in which we speculated the event was triggered multiple times by MediaWiki.)

perhaps we do not need to consume mediawiki.revision_visibility_change at all since it is folded in already?

page_change and page_content_change only capture visibility changes to the current revision, and the current rev's content cannot be suppressed (a new rev must be created first). mediawiki.revision_visibility_change will have visibility changes for all revisions.

Got it!

Details

TitleReferenceAuthorSource BranchDest Branch
Add quality and drift data analysisrepos/data-engineering/mediawiki-event-enrichment!67gmodenaT340831-add-data-analysismain
Customize query in GitLab

Event Timeline

We should consider the following code on EventBus:

https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/EventBus/+/refs/heads/master/includes/HookHandlers/MediaWiki/PageChangeHooks.php#482

				$event = $this->pageChangeEventSerializer->toVisibilityChangeEvent(
					$this->streamName,
					$wikiPage,
					$performer,
					$revisionRecord,
					$visibilityChanges['oldBits'],
					// NOTE: ArticleRevisionVisibilitySet hook does not give us a proper event time.
					// The best we can do is use the current timestamp :(
					wfTimestampNow()
				);

When investigating, maybe start with event.mediawiki_page_change_v1 instead of page_content_change?

What kind of events have significant drift? Seems like its only updates that are edits:

spark.sql("""
SELECT count(1) as count, changelog_kind, page_change_kind
FROM
    (SELECT
        *,
        abs(to_unix_timestamp(to_timestamp(dt)) - to_unix_timestamp(to_timestamp(meta.dt)))/60/60/24/365 as drift_in_years
    FROM event.rc1_mediawiki_page_content_change
    ORDER BY drift_in_years DESC
    )
WHERE drift_in_years > 2/52 -- two weeks
GROUP BY changelog_kind, page_change_kind
ORDER BY count DESC
"""
).show(truncate=False)


+--------+--------------+----------------+
|count   |changelog_kind|page_change_kind|
+--------+--------------+----------------+
|15964277|update        |edit            |
+--------+--------------+----------------+

rc1_mediawiki_page_content_change is an enriched version of mediawiki_page_change_v1. Do we see similar behavior over there? Yes:

spark.sql("""
SELECT count(1) as count, changelog_kind, page_change_kind
FROM
    (SELECT
        *,
        abs(to_unix_timestamp(to_timestamp(dt)) - to_unix_timestamp(to_timestamp(meta.dt)))/60/60/24/365 as drift_in_years
    FROM event.mediawiki_page_change_v1
    ORDER BY drift_in_years DESC
    )
WHERE drift_in_years > 2/52 -- two weeks
GROUP BY changelog_kind, page_change_kind
ORDER BY count DESC
"""
).show(truncate=False)


+-------+--------------+----------------+
|count  |changelog_kind|page_change_kind|
+-------+--------------+----------------+
|4786867|update        |edit            |
+-------+--------------+----------------+

Let's look at the worst offenders from description above:

spark.sql("""
SELECT
    _schema, changelog_kind, comment, created_redirect_page, dt, meta, page, page_change_kind, wiki_id, datacenter
FROM event.rc1_mediawiki_page_content_change
WHERE meta.id IN (
  '636a4445-9722-4509-8cd4-2cebaece1b5f',
  'cbfbae34-efa4-42bd-b8ac-37398a3ab1a5',
  'db32762f-cdd6-4de2-8666-48b6200e4032',
  'ac645b1e-b27b-40b8-b29c-b1b681ac2572',
  '9516ccb2-e472-48c9-9c42-3163734c9556'
)
AND 
(
      (year = 2023 AND month = 5 AND day = 25 AND hour = 13)
    OR
      (year = 2023 AND month = 6 AND day = 02 AND hour = 20)
    OR
      (year = 2023 AND month = 4 AND day = 24 AND hour = 21)
    OR
      (year = 2023 AND month = 7 AND day = 05 AND hour = 15)
    OR
      (year = 2023 AND month = 6 AND day = 28 AND hour = 18)
)
"""
).show(truncate=False)

+-------+--------------+-------+------------------------------+--------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------+----------------+-------+----------+
|_schema|changelog_kind|comment|created_redirect_page         |dt                  |meta                                                                                                                                                                                                                      |page                                                                       |page_change_kind|wiki_id|datacenter|
+-------+--------------+-------+------------------------------+--------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------+----------------+-------+----------+
|null   |update        |null   |{null, null, null, null, null}|2004-04-29T11:38:29Z|{de.wikipedia.org, 2023-04-24T21:40:52.27638Z, db32762f-cdd6-4de2-8666-48b6200e4032, 0208d90c-01cb-4d6c-839e-5b198a903db5, rc1.mediawiki.page_content_change, https://de.wikipedia.org/wiki/Augustiner_Chorherren}        |{true, 0, 197737, Augustiner_Chorherren, null, null}                       |edit            |dewiki |eqiad     |
|null   |update        |null   |{null, null, null, null, null}|2004-08-05T13:55:41Z|{pl.wikipedia.org, 2023-06-28T18:09:25.404567Z, 9516ccb2-e472-48c9-9c42-3163734c9556, 3b458032-11f5-48fa-9ed4-c0b3ba0e8567, rc1.mediawiki.page_content_change, https://pl.wikipedia.org/wiki/Wikipedysta:Hensle}          |{false, 2, 73922, Wikipedysta:Hensle, null, {null, null, null, null, null}}|edit            |plwiki |eqiad     |
|null   |update        |null   |{null, null, null, null, null}|2004-08-05T13:55:41Z|{pl.wikipedia.org, 2023-07-05T15:26:02.728882Z, ac645b1e-b27b-40b8-b29c-b1b681ac2572, cff318da-daad-4867-baa1-699203e5caad, rc1.mediawiki.page_content_change, https://pl.wikipedia.org/wiki/Wikipedysta:Hensle}          |{false, 2, 73922, Wikipedysta:Hensle, null, {null, null, null, null, null}}|edit            |plwiki |eqiad     |
|null   |update        |null   |{null, null, null, null, null}|2003-09-07T09:45:35Z|{ja.wikipedia.org, 2023-05-25T13:02:53.801266Z, 636a4445-9722-4509-8cd4-2cebaece1b5f, 8e7b7af9-c997-4e4f-a6f7-acd69a8001c0, rc1.mediawiki.page_content_change, https://ja.wikipedia.org/wiki/%E5%8E%9F%E6%B0%B4%E7%A6%81} |{true, 0, 15567, 原水禁, null, null}                                       |edit            |jawiki |eqiad     |
|null   |update        |null   |{null, null, null, null, null}|2004-01-14T07:09:11Z|{nl.wikipedia.org, 2023-06-02T20:41:52.689117Z, cbfbae34-efa4-42bd-b8ac-37398a3ab1a5, 19434548-6f05-4a7b-b167-38ada95b824f, rc1.mediawiki.page_content_change, https://nl.wikipedia.org/wiki/MediaWiki:Watchmethod-recent}|{false, 8, 31204, MediaWiki:Watchmethod-recent, null, null}                |edit            |nlwiki |eqiad     |
+-------+--------------+-------+------------------------------+--------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------+----------------+-------+----------+
spark.sql("""
SELECT
    'https://' || meta.domain || '/w/index.php?diff=prev&oldid=' || revision.rev_id AS revision_url
FROM event.rc1_mediawiki_page_content_change
WHERE meta.id IN ('636a4445-9722-4509-8cd4-2cebaece1b5f', 'cbfbae34-efa4-42bd-b8ac-37398a3ab1a5', 'db32762f-cdd6-4de2-8666-48b6200e4032', 'ac645b1e-b27b-40b8-b29c-b1b681ac2572', '9516ccb2-e472-48c9-9c42-3163734c9556')
AND 
(
      (year = 2023 AND month = 5 AND day = 25 AND hour = 13)
    OR
      (year = 2023 AND month = 6 AND day = 02 AND hour = 20)
    OR
      (year = 2023 AND month = 4 AND day = 24 AND hour = 21)
    OR
      (year = 2023 AND month = 7 AND day = 05 AND hour = 15)
    OR
      (year = 2023 AND month = 6 AND day = 28 AND hour = 18)
)
"""
).show(truncate=False)

+-------------------------------------------------------------+
|revision_url                                                 |
+-------------------------------------------------------------+
|https://de.wikipedia.org/w/index.php?diff=prev&oldid=6901200 |
|https://pl.wikipedia.org/w/index.php?diff=prev&oldid=1130401 |
|https://pl.wikipedia.org/w/index.php?diff=prev&oldid=1130401 |
|https://ja.wikipedia.org/w/index.php?diff=prev&oldid=2183715 |
|https://nl.wikipedia.org/w/index.php?diff=prev&oldid=1360489 |
+-------------------------------------------------------------+

Copying URLS here for easy clicking:
https://de.wikipedia.org/w/index.php?diff=prev&oldid=6901200
https://pl.wikipedia.org/w/index.php?diff=prev&oldid=1130401
https://pl.wikipedia.org/w/index.php?diff=prev&oldid=1130401
https://ja.wikipedia.org/w/index.php?diff=prev&oldid=2183715
https://nl.wikipedia.org/w/index.php?diff=prev&oldid=1360489

AFAICT, there is nothing fishy about these worst offenders in terms of data quality. They seem to be legit events that, for some reason, were ingested late.

We should consider the following code on EventBus:

https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/EventBus/+/refs/heads/master/includes/HookHandlers/MediaWiki/PageChangeHooks.php#482

				$event = $this->pageChangeEventSerializer->toVisibilityChangeEvent(
					$this->streamName,
					$wikiPage,
					$performer,
					$revisionRecord,
					$visibilityChanges['oldBits'],
					// NOTE: ArticleRevisionVisibilitySet hook does not give us a proper event time.
					// The best we can do is use the current timestamp :(
					wfTimestampNow()
				);

Since none of the events that have more than 2 weeks of drift are visibility changes (See T341134#8998253), the PageChangeHoons.php snippet code above is not triggering the issue.

Looked over code at PageChangeEventSerializer.php, and all looks good to me. Copying relevant code that builds an edit event for completeness:

	/**
	 * Converts from the given WikiPage and RevisionRecord to a page_change_kind: edit event.
	 *
	 * @param string $stream
	 * @param WikiPage $wikiPage
	 * @param User $performer
	 * @param RevisionRecord $currentRevision
	 * @param RedirectTarget|null $redirectTarget
	 * @param RevisionRecord|null $parentRevision
	 * @return array
	 */
	public function toEditEvent(
		string $stream,
		WikiPage $wikiPage,
		User $performer,
		RevisionRecord $currentRevision,
		?RedirectTarget $redirectTarget = null,
		?RevisionRecord $parentRevision = null
	): array {
		$eventAttrs = $this->toCommonAttrs(
			'edit',
			$this->eventSerializer->timestampToDt( $currentRevision->getTimestamp() ),
			$wikiPage,
			$performer,
			$currentRevision,
			$redirectTarget,
			null
		);

		// On edit, the prior state is all about the previous revision.
		if ( $parentRevision !== null ) {
			$priorStateAttrs = [];
			$priorStateAttrs['revision'] = $this->revisionEntitySerializer->toArray( $parentRevision );
			$eventAttrs['prior_state'] = $priorStateAttrs;
		}

		return $this->toEvent( $stream, $wikiPage, $eventAttrs );
	}

Conclusions:

AFAICT, there is nothing fishy about these worst offenders in terms of data quality. They seem to be legit events that, for some reason, were ingested late.

What is fishy though, is how these events got ingested given that we do not currently have a backfill mechanism that would go all the way back to 2003?

I have no more leads to follow right now though, so I will pause this investigating, and close it in a few days if there are no more comments, as I do not depend on dt nor meta.dt anymore to move T335860 forward.

xcollazo changed the task status from Open to In Progress.Jul 7 2023, 7:02 PM
xcollazo triaged this task as High priority.
xcollazo set the point value for this task to 2.

close it in a few days if there are no more comments,

We should keep this open, I really think this should not happen.

What is fishy though, is how these events got ingested given that we do not currently have a backfill mechanism that would go all the way back to 2003?

Since these events are in page_change, and also based on the fact that you have duplicate change events for the same edit, e.g. https://pl.wikipedia.org/w/index.php?diff=prev&oldid=1130401, I can only guess that something is causing the PageSaveComplete (or whatever) MediaWiki hook to fire from somewhere. Its like some MediaWiki code is running in production firing the hook. Maybe some crazy hook testing somewhere we don't know about?

close it in a few days if there are no more comments,

We should keep this open, I really think this should not happen.

Fair enough.

What is fishy though, is how these events got ingested given that we do not currently have a backfill mechanism that would go all the way back to 2003?

Since these events are in page_change, and also based on the fact that you have duplicate change events for the same edit, e.g. https://pl.wikipedia.org/w/index.php?diff=prev&oldid=1130401, I can only guess that something is causing the PageSaveComplete (or whatever) MediaWiki hook to fire from somewhere. Its like some MediaWiki code is running in production firing the hook. Maybe some crazy hook testing somewhere we don't know about?

I think it is way too many events that have a high drift for being just a test. Look at this histogram, especially events with 10+ years of drift:

spark.sql("""
with drift as (
SELECT
    abs(to_unix_timestamp(to_timestamp(dt)) - to_unix_timestamp(to_timestamp(meta.dt)))/60/60/24/365 as drift_in_years,
    dt,
    meta.dt as meta_dt
FROM event.rc1_mediawiki_page_content_change
ORDER BY drift_in_years DESC
),

bins as (
  select
    floor(drift_in_years) as bin_floor,
    count(1) as count
  from drift
  group by 1
  order by 1
)

select
  bin_floor,
  bin_floor || ' - ' || (bin_floor + 1) as bin_range,
  count
from bins
order by 1;
"""
).show(truncate=False)

+---------+---------+---------+
|bin_floor|bin_range|count    |
+---------+---------+---------+
|0        |0 - 1    |160301506|
|1        |1 - 2    |3246159  |
|2        |2 - 3    |1462026  |
|3        |3 - 4    |1016688  |
|4        |4 - 5    |559787   |
|5        |5 - 6    |511664   |
|6        |6 - 7    |662303   |
|7        |7 - 8    |205655   |
|8        |8 - 9    |186432   |
|9        |9 - 10   |129370   |
|10       |10 - 11  |130019   |
|11       |11 - 12  |131190   |
|12       |12 - 13  |113173   |
|13       |13 - 14  |137259   |
|14       |14 - 15  |107158   |
|15       |15 - 16  |72394    |
|16       |16 - 17  |70490    |
|17       |17 - 18  |13786    |
|18       |18 - 19  |2611     |
|19       |19 - 20  |2        |
+---------+---------+---------+

+ @gmodena, for awareness.

@xcollazo I caught up with this thread. I'd suggest to update the description of this phab to highlight that the issue is with mediawiki_page_change_v1.

Since these events are in page_change, and also based on the fact that you have duplicate change events for the same edit, e.g. https://pl.wikipedia.org/w/index.php?diff=prev&oldid=1130401, I can only guess that something is causing the PageSaveComplete (or whatever) MediaWiki hook to fire from somewhere. Its like some MediaWiki code is running in production firing the hook. Maybe some crazy hook testing somewhere we don't know about?

I think it is way too many events that have a high drift for being just a test. Look at this histogram, especially events with 10+ years of drift:

I replicated your results on mediawiki_page_change_v1. This should remove known duplicates that exist in page_content_change rc1. Agree with you that the number of events having high drift looks suspicious.

+---------+---------+--------+
|bin_floor|bin_range|   count|
+---------+---------+--------+
|        0|    0 - 1|51060659|
|        1|    1 - 2| 1033645|
|        2|    2 - 3|  418427|
|        3|    3 - 4|  344786|
|        4|    4 - 5|  177593|
|        5|    5 - 6|  144346|
|        6|    6 - 7|  223380|
|        7|    7 - 8|   70144|
|        8|    8 - 9|   61379|
|        9|   9 - 10|   39649|
|       10|  10 - 11|   38659|
|       11|  11 - 12|   40651|
|       12|  12 - 13|   33535|
|       13|  13 - 14|   42446|
|       14|  14 - 15|   34521|
|       15|  15 - 16|   22640|
|       16|  16 - 17|   22786|
|       17|  17 - 18|    4618|
|       18|  18 - 19|     858|
+---------+---------+--------+

Query, for reference:

with drift as (
SELECT
    abs(to_unix_timestamp(to_timestamp(dt)) - to_unix_timestamp(to_timestamp(meta.dt)))/60/60/24/365 as drift_in_years,
    dt,
    meta.dt as meta_dt
FROM event.mediawiki_page_change_v1
ORDER BY drift_in_years DESC
),

bins as (
  select
    floor(drift_in_years) as bin_floor,
    count(1) as count
  from drift
  group by 1
  order by 1
)

select
  bin_floor,
  bin_floor || ' - ' || (bin_floor + 1) as bin_range,
  count
from bins
order by 1;

@xcollazo I looked at the legacy mediawiki_revision_create stream (it contains insert and update changes). Comparison is tricky because dt is missing from a significant number of events.
Schema details at https://schema.wikimedia.org/repositories//primary/jsonschema/mediawiki/revision/create/latest
I computed the diff between rev_timestamp (the revision's creation time in ISO8601 format) and meta.dt. This resulted in:

+---------+---------+---------+
|bin_floor|bin_range|    count|
+---------+---------+---------+
|        0|    0 - 1|145442143|
|        1|    1 - 2|    87066|
|        2|    2 - 3|    57624|
|        3|    3 - 4|    38797|
|        4|    4 - 5|    32504|
|        5|    5 - 6|    31192|
|        6|    6 - 7|    27213|
|        7|    7 - 8|    25312|
|        8|    8 - 9|    24651|
|        9|   9 - 10|    30714|
|       10|  10 - 11|    39717|
|       11|  11 - 12|    50210|
|       12|  12 - 13|    38921|
|       13|  13 - 14|    46518|
|       14|  14 - 15|    37151|
|       15|  15 - 16|    38252|
|       16|  16 - 17|    38528|
|       17|  17 - 18|    17392|
|       18|  18 - 19|     7179|
|       19|  19 - 20|     2533|
|       20|  20 - 21|      551|
|       21|  21 - 22|       96|
|       22|  22 - 23|       43|
+---------+---------+---------+

This also seems way off.

Query for reference:

with drift as (
SELECT
    abs(to_unix_timestamp(to_timestamp(rev_timestamp)) - to_unix_timestamp(to_timestamp(meta.dt)))/60/60/24/365 as drift_in_years,
    dt,
    meta.dt as meta_dt
FROM event.mediawiki_revision_create
ORDER BY drift_in_years DESC
),

bins as (
  select
    floor(drift_in_years) as bin_floor,
    count(1) as count
  from drift
  group by 1
  order by 1
)

select
  bin_floor,
  bin_floor || ' - ' || (bin_floor + 1) as bin_range,
  count
from bins
order by 1;

Two more slices.

  1. the drift between meta.dt and revision.rev_dt in mediawiki_page_change_v1:
+---------+---------+--------+
|bin_floor|bin_range|   count|
+---------+---------+--------+
|        0|    0 - 1|51053053|
|        1|    1 - 2| 1041818|
|        2|    2 - 3|  424597|
|        3|    3 - 4|  349518|
|        4|    4 - 5|  187305|
|        5|    5 - 6|  147640|
|        6|    6 - 7|  228192|
|        7|    7 - 8|   72489|
|        8|    8 - 9|   63125|
|        9|   9 - 10|   41222|
|       10|  10 - 11|   40662|
|       11|  11 - 12|   78302|
|       12|  12 - 13|   58293|
|       13|  13 - 14|   43655|
|       14|  14 - 15|   35493|
|       15|  15 - 16|   23758|
|       16|  16 - 17|   23508|
|       17|  17 - 18|    5085|
|       18|  18 - 19|    1036|
|       19|  19 - 20|      57|
|       20|  20 - 21|      13|
|       21|  21 - 22|       8|
+---------+---------+--------+
  1. drift between dt and revision.rev_dt in mediawiki_page_change_v1:
+---------+---------+--------+
|bin_floor|bin_range|   count|
+---------+---------+--------+
|        0|    0 - 1|53811197|
|        1|    1 - 2|    6459|
|        2|    2 - 3|    5467|
|        3|    3 - 4|    4122|
|        4|    4 - 5|    9535|
|        5|    5 - 6|    3183|
|        6|    6 - 7|    4623|
|        7|    7 - 8|    2270|
|        8|    8 - 9|    1661|
|        9|   9 - 10|    1431|
|       10|  10 - 11|    1963|
|       11|  11 - 12|   37612|
|       12|  12 - 13|   24725|
|       13|  13 - 14|    1178|
|       14|  14 - 15|     922|
|       15|  15 - 16|    1082|
|       16|  16 - 17|     687|
|       17|  17 - 18|     457|
|       18|  18 - 19|     177|
|       19|  19 - 20|      57|
|       20|  20 - 21|      13|
|       21|  21 - 22|       8|
+---------+---------+--------+

What's weird about this is that these are actual old state changes. It's not just that there is drift, some events that happened years ago and being emitted into the stream now. E.g. Why is an edit event for https://pl.wikipedia.org/w/index.php?diff=prev&oldid=1130401 being emitted now?

Breakdown by page_change_kind and performer.is_bot on mediawiki_page_change_v1.
Bots might correlate with some of the drift we see.

+--------------+-----------------+------+-------------+
|drift_in_years| page_change_kind|is_bot|num_revisions|
+--------------+-----------------+------+-------------+
|            18|             edit|  true|          891|
|            17|             edit| false|            6|
|            17|             edit|  true|         4771|
|            16|             edit| false|           26|
|            16|             edit|  true|        23564|
|            15|             edit|  true|        23403|
|            15|             edit| false|           48|
|            14|             edit|  true|        35721|
|            14|             edit| false|           66|
|            13|             edit|  true|        43758|
|            13|             edit| false|          125|
|            12|             edit|  true|        34150|
|            12|             edit| false|          506|
|            11|             edit|  true|        41889|
|            11|             edit| false|          191|
|            10|             edit| false|          868|
|            10|             edit|  true|        38895|
|             9|             edit|  true|        40111|
|             9|             edit| false|          550|
|             8|             edit| false|          904|
|             8|             edit|  true|        62217|
|             7|             edit| false|          848|
|             7|             edit|  true|        71205|
|             6|             edit| false|         1606|
|             6|             edit|  true|       227941|
|             5|             edit| false|         1279|
|             5|             edit|  true|       145956|
|             4|             edit|  true|       178042|
|             4|             edit| false|         3123|
|             3|             edit| false|        12146|
|             3|             edit|  true|       347938|
|             2|             edit| false|        23344|
|             2|             edit|  true|       404040|
|             1|             edit|  true|       958087|
|             1|             edit| false|        97700|
|             0|           delete|  true|        14917|
|             0|         undelete|  true|           68|
|             0|visibility_change| false|          691|
|             0|         undelete| false|         3946|
|             0|             move| false|       144252|
|             0|           delete| false|       297054|
|             0|             edit| false|     19223449|
|             0|           create| false|      2723210|
|             0|             move|  true|        20856|
|             0|visibility_change|  true|         1376|
|             0|           create|  true|      1115020|
|             0|             edit|  true|     28421568|
+--------------+-----------------+------+-------------+

Query for reference:

SELECT
    floor(abs(to_unix_timestamp(to_timestamp(dt)) - to_unix_timestamp(to_timestamp(meta.dt)))/60/60/24/365) as drift_in_years,
    page_change_kind,
    performer.is_bot,
    count(*) as num_revisions
FROM event.mediawiki_page_change_v1
GROUP by performer.is_bot, page_change_kind, drift_in_years
ORDER BY drift_in_years limit 100;
xcollazo changed the task status from In Progress to Open.Jul 13 2023, 8:05 PM

Change 939317 had a related patch set uploaded (by Milimetric; author: Milimetric):

[mediawiki/extensions/EventBus@master] Filter out null edits

https://gerrit.wikimedia.org/r/939317

Filter out null edits

HuH! Interesting. Current me does not remember this at all.

What are null edits?

Is there any reason to capture them in page_change as a different page_change_kind: null_edit?

If it is possible for PageSaveComplete to be called for things other than regular edits and creates, then perhaps we should not always blindly set dt to be the same as rev_dt. For regular edits and creates this makes sense, but for null edits, perhaps not?

You can make a null edit if you click the Edit button, make no changes at all, and click Save. That would not generate an entry in the revision table, but would trigger the PageSaveComplete hook. This used to generate separate events before the EventBus refactor: https://gerrit.wikimedia.org/g/mediawiki/extensions/EventBus/+/refs/changes/76/821776/30/includes/EventBusHooks.php#295

I don't think anyone cares about these things, they can be triggered:

  • accidentally, in which case we really don't care
  • on purpose, to update links

In the latter case, if people want to follow link updates, they can consume from the relevant topics, so I still vote to ignore.

If it is possible for PageSaveComplete to be called for things other than regular edits and creates, then perhaps we should not always blindly set dt to be the same as rev_dt. For regular edits and creates this makes sense, but for null edits, perhaps not?

This discussion tells me that we do care about just about everything else this hook does, with null edits being the only exception. I think setting dt = rev_dt is ok for everything else, but it would be good to see the data once we filter out the null edits (which are otherwise hard to distinguish because there's no entry in the db for them). Once these are gone, we can join to the sqooped data and verify more.

You can make a null edit if you click the Edit button, make no changes at all, and click Save.

I am constantly amazed at the complexity of MW...

Change 939317 merged by Milimetric:

[mediawiki/extensions/EventBus@master] Filter out null edits

https://gerrit.wikimedia.org/r/939317

My ramblings that got me to the null edits, might be useful for someone else testing that there is no other source of unexpected drift:

/** page content change seems to be catching tag updates
 * https://en.wiktionary.org/w/index.php?title=%E1%82%8E&diff=75197912&oldid=75194753
 *   In this example, an old edit by Kwamikagami was reverted by Thadh.  However, in our events, we have an edit with rev_id = 75194753 (the old id) but a comment and dt that make it seem like it was the revert.  However, the rev_id and rev_dt match this old edit.  This points to maybe the tags changing, and we don't capture that information in the page change state, so it looks like a bad event.
 *
 * In the event db, looking for 75194753 on 2023-07-11 and 2023-07-12:
 */

 select *
   from rc1_mediawiki_page_content_change
  where year = 2023 and month = 7 and ((day = 12 and hour = 8) or (day = 11 and hour = 20))
    and changelog_kind = 'insert'
    and revision.rev_id = 75194753
  limit 100;

/**
 * This only showed one record; after a closer look, I realized this was a page restore, identified by page_change_kind = 'undelete'.  These will have mismatching performer/rev_actor and confusing dt/revision.rev_dt relationships.  I'm going to ignore them and search further
 */

 select *
   from rc1_mediawiki_page_content_change
  where year = 2023 and month = 7 and day = 12
    and changelog_kind = 'insert'
    and page_change_kind != 'undelete'
    and revision.rev_dt != dt
  limit 100;

/**
 * no rows, same when searching all of July
 *
 * Realization: this is going to look very different based on what type of change we're looking at.  Let's break it down that way and look at the drift systematically
 */

presto:event> select changelog_kind, page_change_kind, count(*) from rc1_mediawiki_page_content_change where year=2023 and month=7 and day=12 group by changelog_kind, page_change_kind;
 changelog_kind | page_change_kind  |  _col2 
----------------+-------------------+--------
 insert         | create            |  160545
 update         | visibility_change |      84
 update         | move              |    6646
 insert         | undelete          |     102
 update         | edit              | 1458105
 delete         | delete            |    8581
(6 rows)

/**
 * On update/edit, we probably want to make sure the rev_dt is not outside the hour this edit is bucketed in:
 */

 select revision.rev_dt, dt, meta.dt as meta_dt, *
   from rc1_mediawiki_page_content_change
  where year = 2023 and month = 7 and day = 12
    and wiki_id = 'enwiki'
    and changelog_kind = 'update'
    and page_change_kind = 'edit'
    and date_trunc('hour', from_iso8601_timestamp(revision.rev_dt)) < date_trunc('hour', from_iso8601_timestamp(meta.dt))
  limit 10
;

/**
 * Here we find some things that I don't understand
 *
 * 1. https://en.wikipedia.org/w/index.php?title=User%3AClueBot_III%2FMaster_Detailed_Indices%2FTalk%3ASnake_oil&diff=1152702265&oldid=1080392529 that last edit came in again, two months later, as if it was being done again.  It seems like that null edit / "no-op" thread coming up again?
 * 2. Edit rev_id=1164928396 shows up more than 10 times with update/edit on days 11 and 12.  The same exact edit, with the same performer.  Every few minutes it happens again.  So something's happening that is triggering this hook that we don't know about.

 select revision.rev_dt, dt, meta.dt as meta_dt, *
   from rc1_mediawiki_page_content_change
  where year = 2023 and month = 7 and day in (11,12)
    and wiki_id = 'enwiki'
    and changelog_kind = 'update'
    and page_change_kind = 'edit'
    and revision.rev_id = 1164928396
  limit 10
;

 * 3. It's not just bots, on this edit, a human seems to have done the no-op bump thing: https://en.wikipedia.org/w/index.php?title=Umm_Salama&diff=1164349865&oldid=1155152199
 */

As per the MW tag assigned to this task, @Milimetric's https://gerrit.wikimedia.org/r/939317 should have been deployed on 2023-08-01, meaning that hourly content from event.rc1_mediawiki_page_content_change should show way less drift on any recent hourly partition.

Confirming:

Percentage of historical events that have a drift of 2 weeks or more:

SELECT sum(CASE drift_in_years > 2/52 WHEN true THEN 1 ELSE 0 END) / count(1) * 100 AS percentage
FROM
    (SELECT
        *,
        abs(to_unix_timestamp(to_timestamp(dt)) - to_unix_timestamp(to_timestamp(meta.dt)))/60/60/24/365 as drift_in_years
    FROM event.mediawiki_page_change_v1
    )

percentage
8.683998169667415
Time taken: 18.408 seconds, Fetched 1 row(s)

Percentage of events in the last week that have a drift of 2 weeks or more:

SELECT sum(CASE drift_in_years > 2/52 WHEN true THEN 1 ELSE 0 END) / count(1) * 100 AS percentage
FROM
    (SELECT
        *,
        abs(to_unix_timestamp(to_timestamp(dt)) - to_unix_timestamp(to_timestamp(meta.dt)))/60/60/24/365 as drift_in_years
    FROM event.mediawiki_page_change_v1
    WHERE to_timestamp(meta.dt) >= '2023-08-01'
    )

percentage
2.3860728382031895
Time taken: 28.865 seconds, Fetched 1 row(s)

That is a 8.683998169667415 - 2.3860728382031895 = 6.2979253315% difference, which is great!

Let's confirm that indeed the percentage drop aligns with the change making it to production by looking at the percentage over the last ~3 weeks:

SELECT sum(CASE drift_in_years > 2/52 WHEN true THEN 1 ELSE 0 END) / count(1) * 100 AS percentage, count(1) AS event_count, DATE(to_timestamp(meta.dt)) as dd
FROM
    (SELECT
        *,
        abs(to_unix_timestamp(to_timestamp(dt)) - to_unix_timestamp(to_timestamp(meta.dt)))/60/60/24/365 as drift_in_years
    FROM event.mediawiki_page_change_v1
    WHERE to_timestamp(meta.dt) >= '2023-07-25'
    )
GROUP BY dd
ORDER BY dd ASC


percentage	event_count	dd
10.952098822821526	1841437	2023-07-25
8.900063719496444	1858144	2023-07-26
10.997349880421433	1624455	2023-07-27
10.748732272367764	1727106	2023-07-28
10.924071221913549	1736825	2023-07-29
10.181690458812518	1824972	2023-07-30
11.790032212982528	1984914	2023-07-31
11.825081318922456	2114514	2023-08-01
11.274025979378044	1941232	2023-08-02
3.900187034061273	1518440	2023-08-03
0.0	1543647	2023-08-04
0.0	1705728	2023-08-05
0.0	1779332	2023-08-06
0.0	1860351	2023-08-07
0.0	1665826	2023-08-08
0.0	1681713	2023-08-09
0.0	1557606	2023-08-10
0.0	1346980	2023-08-11
0.0	1373358	2023-08-12
0.0	1390247	2023-08-13
0.0	654466	2023-08-14

The change is working as expected! I will assume that we do rolling upgrades and that is why we see a delay of two days to get the drift to 0.

This is great news, and this will definitely make consumption of this stream easier.

( Pinging @phuedx here as I believe he may have some questions about this work? )

xcollazo changed the task status from Open to In Progress.Aug 14 2023, 3:03 PM
xcollazo updated the task description. (Show Details)

( Pinging @phuedx here as I believe he may have some questions about this work? )

Thanks for the ping @xcollazo. I was curious about this work because of T259373: Develop strategy for mitigating degenerate client timestamps in event data, which is about determining whether client-side instruments are sending skewed timestamps and, if so, how to correct for it. There are plenty of queries here that can be reused for the initial investigation.