mediawiki.page_change.v1 event stream - Investigate mismatched meta.dt and dt (and rev_dt) fields
Open, Needs Triage, Public

Description

I just came across a surprising result. In 2025-09, there are many cases where the event time dt (and revision.rev_dt) is far behind the ingestion time meta.dt.

See query and results here:
https://gitlab.wikimedia.org/-/snippets/267

It looks like all of these have a page_change_kind of either 'create' or 'edit'. All create and edit events should result in brand new revision ids. For these kinds of events, IIRC, in EventBus, we set dt (event time) = the revision's create time. This makes sense, as we'd like the event time to match what MediaWiki considers the event time: the time at which the revision was created.

So, I'd expect very little mismatch between dt and meta.dt, caused only by the small latency between revision creation and eventgate's setting of meta.dt. Not weeks.

There may be edge cases where MediaWiki is 'importing' revisions from another wiki and causing create and edit events to be fired.

But, I ran the same query for data on 2025-01, before T391254: Hypothesis 5.2.13: EventBus Adoption of Domain Events was completed.
There are very few mismatches here, and where there are mismatches they are on the date border as might be expected.

I suspect something is different with event time since we migrated to DomainEvents.

Event Timeline

Here is a suspicious event from October

select * from event.mediawiki_page_change_v1 
where year=2025 and month=10 and (day=27 or day=28)
and to_date(dt) = date '2025-10-07' 
limit 10;
Column            Value
dt                2025-10-07T16:26:20Z
meta              {"domain":"de.wikipedia.org","dt":"2025-10-28T21:29:25.733Z","id":"3dfb07c5-9cc6-4bb4-8be0-3c223b19746c","request_id":"d3d1509c-9390-462a-a710-50b7eed20a2b","stream":"mediawiki.page_change.v1","uri":"https://de.wikipedia.org/wiki/Benutzer:Shi_Annan/Abigail_Becker"}
page              {"is_redirect":false,"namespace_id":2,"page_id":13663003,"page_title":"Benutzer:Shi_Annan/Abigail_Becker","revision_count":null,"redirect_page_link":{"interwiki_prefix":null,"is_redirect":null,"namespace_id":null,"page_id":null,"page_title":null}}
page_change_kind  edit
performer         {"edit_count":52199,"groups":["autoreview","editor","sysop","*","user","autoconfirmed","oathauth-twofactorauth"],"is_bot":false,"is_system":false,"is_temp":false,"registration_dt":"2006-04-04T13:24:52Z","user_id":207244,"user_text":"Elendur","user_central_id":202514}
prior_state       {"page":{"is_redirect":null,"namespace_id":null,"page_id":null,"page_title":null,"revision_count":null},"revision":{"comment":"/* Conductor shipwreck */ whole sentence redundant","content_slots":{"main":{"content_body":null,"content_format":"text/x-wiki","content_model":"wikitext","content_sha1":"g6dywnxr3pw2pi5kkhi86kttobu83ry","content_size":19150,"origin_rev_id":261030920,"slot_role":"main"}},"editor":{"edit_count":null,"groups":["*"],"is_bot":false,"is_system":false,"is_temp":false,"registration_dt":null,"user_id":null,"user_text":"en>Very Polite Person","user_central_id":null},"is_comment_visible":true,"is_content_visible":true,"is_editor_visible":true,"is_minor_edit":false,"rev_dt":"2025-10-07T16:24:52Z","rev_id":261030920,"rev_parent_id":261030919,"rev_sha1":"g6dywnxr3pw2pi5kkhi86kttobu83ry","rev_size":19150}}
revision          {"comment":"/* Conductor shipwreck */ ce","content_slots":{"main":{"content_body":null,"content_format":"text/x-wiki","content_model":"wikitext","content_sha1":"p2uyeqzww1gan8unh5yevqb4hdru2ef","content_size":19080,"origin_rev_id":261030921,"slot_role":"main"}},"editor":{"edit_count":null,"groups":["*"],"is_bot":false,"is_system":false,"is_temp":false,"registration_dt":null,"user_id":null,"user_text":"en>Very Polite Person","user_central_id":null},"is_comment_visible":true,"is_content_visible":true,"is_editor_visible":true,"is_minor_edit":false,"rev_dt":"2025-10-07T16:26:20Z","rev_id":261030921,"rev_parent_id":261030920,"rev_sha1":"p2uyeqzww1gan8unh5yevqb4hdru2ef","rev_size":19080}
wiki_id           dewiki
datacenter        codfw
year              2025
month             10
day               28
hour              21

This looks to me like a very normal edit event: https://de.wikipedia.org/w/index.php?title=Benutzer:Shi_Annan/Abigail_Becker&oldid=261030920

meta.dt is "2025-10-28T21:29:25.733Z", while dt and revision.rev_dt are both "2025-10-07T16:26:20Z".
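To put a number on that gap, here is a minimal Python sketch that computes the lag between meta.dt and dt for the event above. The helper names `parse_dt` and `ingestion_lag` are made up for illustration; only the field names and the two timestamps come from the event.

```python
from datetime import datetime, timedelta


def parse_dt(value: str) -> datetime:
    """Parse an ISO-8601 UTC timestamp such as '2025-10-07T16:26:20Z'."""
    # Older Python versions reject a trailing 'Z' in fromisoformat(),
    # so rewrite it as an explicit UTC offset first.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def ingestion_lag(event: dict) -> timedelta:
    """Time between the event time (dt) and the ingestion time (meta.dt)."""
    return parse_dt(event["meta"]["dt"]) - parse_dt(event["dt"])


# Only the two timestamp fields of the suspicious dewiki event are kept here.
event = {
    "dt": "2025-10-07T16:26:20Z",
    "meta": {"dt": "2025-10-28T21:29:25.733Z"},
}

lag = ingestion_lag(event)
print(lag.days)  # 21
```

So this event reached eventgate roughly three weeks after the revision timestamp it carries.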

Hm, is it possible FlaggedRevs is the culprit here? I think dewiki uses flagged revs.

When I count mismatches in September by wiki_id, enwiki is way lower than dewiki, even though its edit rate is higher

wiki_id mismatch_date_cnt
dewiki  27497
bewwiktionary   23019
igwiki  17932
tcywiki 17437
mediawikiwiki   13561
zghwiktionary   6431
commonswiki     6303
minwikibooks    5398
mswikiquote     2969
zhwikiversity   2938
madwikisource   2808
viwikivoyage    1119
plwikisource    532
enwikibooks     511
suwiki  482
newiki  348
tawiki  255
tumwiki 249
enwiki  212
nowiki  205
...
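For anyone who wants to reproduce this kind of per-wiki count outside of SQL, here is a rough Python sketch. It assumes a "mismatch" means that dt and meta.dt fall on different UTC calendar dates, which is my reading of the mismatch_date_cnt column and not necessarily what the linked snippet does. The function names and the second sample event are made up for illustration.

```python
from collections import Counter
from datetime import date, datetime


def utc_date(value: str) -> date:
    """Return the UTC calendar date of an ISO-8601 timestamp ending in 'Z'."""
    return datetime.fromisoformat(value.replace("Z", "+00:00")).date()


def mismatch_counts(events) -> Counter:
    """Count, per wiki_id, events whose dt and meta.dt dates differ."""
    counts = Counter()
    for event in events:
        if utc_date(event["dt"]) != utc_date(event["meta"]["dt"]):
            counts[event["wiki_id"]] += 1
    return counts


sample = [
    # Timestamps taken from the suspicious dewiki event.
    {"wiki_id": "dewiki", "dt": "2025-10-07T16:26:20Z",
     "meta": {"dt": "2025-10-28T21:29:25.733Z"}},
    # Invented example of a normal event: ingested right after the edit.
    {"wiki_id": "enwiki", "dt": "2025-10-28T21:29:25Z",
     "meta": {"dt": "2025-10-28T21:29:25.900Z"}},
]

for wiki_id, cnt in mismatch_counts(sample).most_common():
    print(wiki_id, cnt)
```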

It looks like that edit was imported, per https://de.wikipedia.org/w/index.php?title=Benutzer:Shi_Annan/Abigail_Becker&action=history

dewiki seems to import pages a lot more than enwiki, comparing https://de.wikipedia.org/w/index.php?title=Spezial:Logbuch&excludetempacct=1&page=&tagfilter=&type=import&user=&wpFormIdentifier=logeventslist&wpdate=&wpfilters%5B0%5D=newusers&offset=&limit=500 (~25/day) with https://en.wikipedia.org/w/index.php?title=Special:Log&page=&tagfilter=&type=import&user=&wpFormIdentifier=logeventslist&wpdate=&wpfilters%5B0%5D=newusers&offset=&limit=500 (~3/day).

That's like a 13x dewiki:enwiki import ratio. From https://stats.wikimedia.org/#/en.wikipedia.org, I see 5M enwiki edits per month vs 0.743M dewiki edits per month (6.73 enwiki:dewiki ratio). I'd expect a ~100x mismatch ratio if it's just from imports alone (e.g. ~274 mismatches on enwiki). That doesn't seem far off.

If these are all caused by imports (are they? we should check for sure), then we should probably model a page_change_kind: import in the mediawiki.page_change.v1 event.

I think the event will be mostly the same (unless the MediaWiki Domain Event gives us more info about the import action we want to capture?). The only difference here will be:

In the case of an import:
revision.rev_dt should be the actual revision timestamp.
dt should be the event time of the import itself.

I think this might be a bit more complex than it seems. I've been looking at the import events in MediaWiki, and it seems that there are 2 different fields:
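A minimal sketch of that rule in Python, just to pin down the intent. The function name `event_times`, its parameters, and the `is_import` flag are hypothetical; this is not how EventBus is implemented today, and the import time in the example is an assumed value.

```python
def event_times(rev_dt: str, emitted_dt: str, is_import: bool) -> dict:
    """Choose dt according to the proposal above.

    rev_dt     -- timestamp stored on the revision itself
    emitted_dt -- time at which the change (e.g. the import) actually happened
    is_import  -- hypothetical flag saying the revision arrived via an import
    """
    # For imports, the event happened when the import ran, not when the
    # revision was originally written on the source wiki.
    dt = emitted_dt if is_import else rev_dt
    return {"dt": dt, "revision": {"rev_dt": rev_dt}}


# Revision timestamp from the dewiki example; the import time is assumed.
imported = event_times("2025-10-07T16:26:20Z", "2025-10-28T21:29:25Z", is_import=True)
print(imported["dt"], imported["revision"]["rev_dt"])
```

With this, dt and meta.dt would stay close together for imports too, while revision.rev_dt would still preserve the original revision time.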

  • EventType: Which can be PageCreated, PageDeleted, PageRevisionUpdated, etc.
  • Cause: Which can be edit, move, delete, import, rollback, etc.

If I'm understanding this properly, the import is the cause that triggers events like PageCreated and PageRevisionUpdated, which we are translating to page_change_kind: create and page_change_kind: edit.

If we create a page_change_kind: import it could receive both "Create" and "Edit" events. I'm assuming we'll keep the changelog_kind as insert or update.

But, I'm wondering if it would make more sense to keep page_change_kind as it is, and propagate the cause field to our schema? Something like:

When a page is imported:

page_change_kind: create
changelog_kind: insert
cause: import

When a revision is imported:

page_change_kind: edit
changelog_kind: update
cause: import

It looks to me that import is just another way of triggering the "Create" and "Edit" events we listen to, but it isn't a different event on its own.

I don't know how these events are used downstream, so it would be nice if someone can add some opinions.
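To illustrate the idea, here is a small Python sketch of the mapping I have in mind. The event type names are the MediaWiki ones mentioned above; the `KIND_BY_EVENT_TYPE` table and the `classify` function are hypothetical and only cover the two event types discussed here.

```python
# Hypothetical mapping: domain event type -> (page_change_kind, changelog_kind).
KIND_BY_EVENT_TYPE = {
    "PageCreated": ("create", "insert"),
    "PageRevisionUpdated": ("edit", "update"),
}


def classify(event_type: str, cause: str) -> dict:
    """Keep page_change_kind as it is today and pass the cause through unchanged."""
    page_change_kind, changelog_kind = KIND_BY_EVENT_TYPE[event_type]
    return {
        "page_change_kind": page_change_kind,
        "changelog_kind": changelog_kind,
        "cause": cause,
    }


print(classify("PageCreated", "import"))
print(classify("PageRevisionUpdated", "import"))
print(classify("PageRevisionUpdated", "edit"))
```

The point of the design is that page_change_kind keeps describing what changed, while cause describes why, so a consumer could filter out imports without the existing kinds changing meaning.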
cc: @tchin, @xcollazo

I think our model definitely has a gap if we are to account for imports.

Using the event example from T409105#11337675, it states that page_change_kind = edit, but also prior_state.page.page_id = null ! How can we have an edit if there was no prior page? This is confusing.
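To see how common this pattern is, something like the following Python check could be run over a sample of events. It is only a sketch: the function name is made up, and it assumes events are available as plain dicts shaped like the example.

```python
def is_edit_without_prior_page(event: dict) -> bool:
    """True for an 'edit' event whose prior_state carries no page_id."""
    if event.get("page_change_kind") != "edit":
        return False
    # prior_state or prior_state.page may be missing or null entirely.
    prior_page = (event.get("prior_state") or {}).get("page") or {}
    return prior_page.get("page_id") is None


# Trimmed-down version of the example event.
example = {
    "page_change_kind": "edit",
    "page": {"page_id": 13663003},
    "prior_state": {"page": {"page_id": None}},
}
print(is_edit_without_prior_page(example))  # True
```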

As alluded to before, it does seem that some of these are not full page imports, which would create a new page id, but just cherry-picks that go into an existing page? This seems like a vote for the cause: import idea, so that we can discriminate accordingly.

I think we need someone with deep knowledge of all the possible import states to walk us thru.

Nice @JMonton-WMF!

> If we create a page_change_kind: import it could receive both "Create" and "Edit" events.

Good point. Agreed, reusing page_change_kind seems a little strange. Perhaps a new cause-like field is a good idea. We should think a lot about this because we might be introducing a new convention for all state-change-like events here, but something like this sounds right.

> it does seem that some of these are not full page imports, which would create a new page id, but just cherry-picks that go into an existing page?

Yeah, perhaps these are import and history merge? Is that a thing? When importing a page, can its revisions be merged into an existing page_id, like can be done when a page is undeleted? Even if that is possible, it would be surprising if this was a frequent source of this bug, as mediawiki.page_change only contains events that affect the latest revision of a page. I suppose an 'import and merge into page_id' could happen where the import is bringing in new latest revisions? But does that happen often?

All of that makes me wonder: What might cause be when page_change_kind = undelete? On undelete, a page and its revision history are brought back into reality. When the revision of the undeleted page was originally created, the corresponding mediawiki.page_change event had page_change_kind = edit (or maybe create), but now it has page_change_kind = undelete. This feels more like a cause: undelete case?

Writing this out for my own understanding. Assuming the latest revision of a page was due to an 'edit' (not a page create), if we had a cause field, would it be like this?

# import event
changelog_kind: update
page_change_kind: edit
cause: import

# undelete event
changelog_kind: insert (?)
page_change_kind: edit
cause: undelete

# and just for comparison, I suppose a regular edit event would look like:
changelog_kind: update
page_change_kind: edit
cause: edit

> If we create a page_change_kind: import it could receive both "Create" and "Edit" events

Yeah, this is true now for undeletes eh? Something is fishy! Let's think about what all 3 fields should be set to for every possible kind of change.

For reference:

https://github.com/wikimedia/mediawiki/blob/master/includes/Storage/PageUpdateCauses.php

I'm having flashbacks to Domain Event modeling from last year where we had a lot of back and forth about causes vs event types. Need to remember it all!