Update event-producing tools to overwrite `meta.dt`
Closed, Resolved | Public

Description

Currently, EventGate and the eventutilities libraries set the event meta.dt timestamp field only if the event producer has not already set it. As a result, some events appear late with respect to their meta.dt timestamp even though they actually arrived 'in time' in terms of ingestion timestamp.
We wish to change the behavior of EventGate and the event-utilities libraries to always set meta.dt, overwriting any value provided by the event producer.

We must consider the special case of canary events. Canary events are used by the platform to ensure that streams are working as expected, and they are used by Hive ingestion to close hourly partition windows for streams that have no events in an hour.

The logic needed by event producing libraries is:

if meta.domain != 'canary':
    meta.dt = current_date_time
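
A minimal Python sketch of the rule above, for illustration only. The function name `set_meta_dt`, the optional `now` parameter, and the millisecond-precision UTC timestamp format are assumptions, not taken from EventGate or the event-utilities libraries.

```python
from datetime import datetime, timezone
from typing import Optional


def set_meta_dt(event: dict, now: Optional[datetime] = None) -> dict:
    """Overwrite meta.dt with the current time unless the event is a canary event."""
    meta = event.setdefault("meta", {})
    if meta.get("domain") != "canary":
        now = now or datetime.now(timezone.utc)
        # Assumed format: ISO-8601 UTC with millisecond precision and a trailing "Z".
        meta["dt"] = now.isoformat(timespec="milliseconds").replace("+00:00", "Z")
    return event
```

Canary events keep whatever meta.dt they were created with, so they can still be used to close the hourly partition they were produced for.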

Plan:

Details

Related Changes in Gerrit:
Related Changes in GitLab:
Title | Reference | Author | Source Branch | Dest Branch
Don't set meta.dt - allow event intake service to set it | repos/search-platform/discolytics!59 | otto | T376026_meta_dt | main
makeMapToErrorEvent - update to /error/2.1.0 and don't set meta.dt | repos/data-engineering/eventgate-wikimedia!18 | otto | T376026_error_event | master

Event Timeline

There are a very large number of changes, so older changes are hidden.

Change #1171696 had a related patch set uploaded (by Ottomata; author: Ottomata):

[mediawiki/extensions/EventBus@master] createRecentChangeEvent - allow the intake service to set meta.dt

https://gerrit.wikimedia.org/r/1171696

Change #1171697 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/deployment-charts@master] eventgate - bump to version 1.15.0 for -external facing deployments

https://gerrit.wikimedia.org/r/1171697

Deployed eventgate in beta, looks good.

Change #1171697 merged by Ottomata:

[operations/deployment-charts@master] eventgate - bump to version 1.15.0 for -external facing deployments

https://gerrit.wikimedia.org/r/1171697

Mentioned in SAL (#wikimedia-operations) [2025-07-22T19:41:11Z] <ottomata> deploying eventgate-logging-external and eventgate-analytics-external to pick up meta.dt change - T376026

Mentioned in SAL (#wikimedia-analytics) [2025-07-22T19:41:15Z] <ottomata> deploying eventgate-logging-external and eventgate-analytics-external to pick up meta.dt change - T376026

I deployed eventgate-logging-external and eventgate-analytics-external in codfw. Along the way, I noticed that my info log to alert us of bad clients was being logged...by eventgate itself!

Overriding meta.dt in event e8102585-e503-42e2-8d89-0c1875e96293 of schema at /error/1.0.0 destined to stream eventgate-analytics-external.error.validation from 2025-07-22T19:44:44.011Z to 2025-07-22T19:44:44.011Z.

This is because the error events that get created by eventgate set meta.dt.
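
One way to picture why EventGate reported its own events: an error event built with meta.dt already set looks like any other producer-set value when it reaches the override step. The Python sketch below is hypothetical. The name `override_meta_dt` and the field lookups (`meta.id`, `$schema`, `meta.stream`) are assumptions for illustration; only the message wording is copied from the log line above.

```python
import logging
from typing import Optional

logger = logging.getLogger("meta-dt-sketch")


def override_meta_dt(event: dict, new_dt: str) -> Optional[str]:
    """Set meta.dt to new_dt; return the info message if an existing value was replaced."""
    meta = event.setdefault("meta", {})
    if meta.get("domain") == "canary":
        return None  # canary events keep their own meta.dt
    old_dt = meta.get("dt")
    meta["dt"] = new_dt
    if old_dt is None:
        return None  # nothing was overridden, so nothing to report
    message = (
        f"Overriding meta.dt in event {meta.get('id')} of schema at {event.get('$schema')} "
        f"destined to stream {meta.get('stream')} from {old_dt} to {new_dt}."
    )
    logger.info(message)
    return message
```

Under this sketch, an internally generated event only stays out of the log if it arrives without meta.dt, which is the approach the fix takes for error events.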

Submitted an MR to fix this. Along the way, it also updates the error event version, uses dt for event time, and puts the name of the error class in error_type.

Change #1172079 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/deployment-charts@master] eventgate-*-external - bump to 1.16.0

https://gerrit.wikimedia.org/r/1172079

Change #1172079 merged by jenkins-bot:

[operations/deployment-charts@master] eventgate-*-external - bump to 1.17.0

https://gerrit.wikimedia.org/r/1172079

Mentioned in SAL (#wikimedia-operations) [2025-07-23T19:00:25Z] <ottomata> deploying eventgate-analytics-external and eventgate-logging-external to get meta.dt logic change - T376026

Change #1172094 had a related patch set uploaded (by Ottomata; author: Ottomata):

[wikimedia-event-utilities@master] Always override set meta.dt unless canary event

https://gerrit.wikimedia.org/r/1172094

Change #1171696 merged by jenkins-bot:

[mediawiki/extensions/EventBus@master] EventFactory - allow the intake service to set meta.dt

https://gerrit.wikimedia.org/r/1171696

Deployed meta.dt change to eventgate-analytics-external and eventgate-logging-external. I want to wait for EventFactory - allow the intake service to set meta.dt (1171696) to land with the MW train this week before I deploy to the others.

Change #1175582 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/deployment-charts@master] eventgate main and analytics - bump to v1.19.0

https://gerrit.wikimedia.org/r/1175582

Change #1175582 merged by jenkins-bot:

[operations/deployment-charts@master] eventgate main and analytics - bump to v1.19.0

https://gerrit.wikimedia.org/r/1175582

Mentioned in SAL (#wikimedia-operations) [2025-08-04T19:35:10Z] <ottomata> deploying eventgate-analytics and eventgate-main to pick up meta.dt field logic change - T376026

Deployed eventgate-analytics and eventgate-main.

Change #1175595 had a related patch set uploaded (by Ottomata; author: Ottomata):

[mediawiki/core@master] ApiMain - don't set meta.dt on mediawiki.api-request stream

https://gerrit.wikimedia.org/r/1175595

Change #1175596 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/deployment-charts@master] eventgate-analytics - revert to v1.11.0 to avoid log spam

https://gerrit.wikimedia.org/r/1175596

Change #1175596 merged by jenkins-bot:

[operations/deployment-charts@master] eventgate-analytics - revert to v1.11.0 to avoid log spam

https://gerrit.wikimedia.org/r/1175596

I had to roll back eventgate-analytics in eqiad because of log spam. I missed a spot.

I will have to get the fix for it (1175595) merged and then wait until the next train goes out.

Mentioned in SAL (#wikimedia-operations) [2025-08-04T20:45:39Z] <ottomata> eventgate-analytics in eqiad cannot be deployed due to stuck helm STATUS: pending-upgrade. This needs to be deployed to rollback to a version that doesn't cause logspam. cc cwhite, rzl - T376026

ApiMain - don't set meta.dt on mediawiki.api-request stream (1175595) is merged. I will wait until after next week's train to deploy eventgate-analytics again. But, I will be OOO the week of Aug 18, so it'll have to wait until the week of Aug 25.

Change #1175595 merged by jenkins-bot:

[mediawiki/core@master] ApiMain - don't set meta.dt on mediawiki.api-request stream

https://gerrit.wikimedia.org/r/1175595

Change #1172094 merged by jenkins-bot:

[wikimedia-event-utilities@master] Always set meta.dt unless canary event

https://gerrit.wikimedia.org/r/1172094

Change #1182572 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/deployment-charts@master] eventgate-analytics, eventgate-logging-ext: upgrade to 1.19.0

https://gerrit.wikimedia.org/r/1182572

Change #1182572 merged by jenkins-bot:

[operations/deployment-charts@master] eventgate-analytics, eventgate-logging-ext: upgrade to 1.19.0

https://gerrit.wikimedia.org/r/1182572

Mentioned in SAL (#wikimedia-operations) [2025-08-27T13:44:49Z] <ottomata> deploying eventgate-analytics and eventgate-logging-external to pick up meta.dt change - T376026

Deployed eventgate-logging-external and eventgate-analytics. We should be done!

17:15:15 <cwhite>: ottomata: ~3k/sec of Overriding meta.dt in event logs have reappeared.
20:54:33 <ottomata>: cwhite: can you post a link to logstash search and error in https://phabricator.wikimedia.org/T376026 ?

The messages are getting caught in the spam filter:

Filter

Graph

@colewhite how can I find an example of this log line? I need to figure out what is causing it. Thank you!

Change #1184550 had a related patch set uploaded (by Ottomata; author: Ottomata):

[mediawiki/extensions/CirrusSearch@master] Allow event system to set meta.dt

https://gerrit.wikimedia.org/r/1184550

I think there are change-prop culprits too. Harder to understand them but am working on it.

Change #1184560 had a related patch set uploaded (by Ottomata; author: Ottomata):

[mediawiki/services/mobileapps@master] Allow event system to set meta.dt

https://gerrit.wikimedia.org/r/1184560

Change #1184550 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Allow event system to set meta.dt

https://gerrit.wikimedia.org/r/1184550

Change #1184560 merged by jenkins-bot:

[mediawiki/services/mobileapps@master] Allow event system to set meta.dt

https://gerrit.wikimedia.org/r/1184560

I still see 2 offenders:

{"name":"eventgate-wikimedia","hostname":"eventgate-production-859cb57858-v55bz","pid":1,"level":"INFO","msg":"Overriding meta.dt in event 01f2df08-f760-48c5-966f-bd98010ff71d of schema at /mediawiki/page/change/1.3.0 destined to stream mediawiki.page_change.v1 from 2025-09-22T14:33:57Z to 2025-09-22T14:33:57.834Z.","time":"2025-09-22T14:33:57.834Z","v":0}
{"name":"eventgate-wikimedia","hostname":"eventgate-production-859cb57858-5hh8g","pid":1,"level":"INFO","msg":"Overriding meta.dt in event 4b26705a-0c32-46ff-a3ee-a332d2c5ac81 of schema at /mediawiki/cirrussearch/page_weighted_tags_change/1.0.0 destined to stream mediawiki.cirrussearch.page_weighted_tags_change.v1 from 2025-09-22T14:26:35Z to 2025-09-22T14:26:35.663Z.","time":"2025-09-22T14:26:35.663Z","v":0}

page_change: Why? I will look into it! That should be fixed.

page_weighted_tags_change: Will ask search team.
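
To find the remaining offenders in bulk, the JSON log lines above could be tallied by destination stream. This is a rough Python sketch, assuming one JSON object per line with the text under a `msg` or `message` key; the helper names `parse_override` and `count_offending_streams` are made up for illustration.

```python
import json
import re
from collections import Counter

# Pattern derived from the wording of the "Overriding meta.dt" log lines.
OVERRIDE_RE = re.compile(
    r"Overriding meta\.dt in event (?P<id>\S+) of schema at (?P<schema>\S+) "
    r"destined to stream (?P<stream>\S+) from (?P<old>\S+) to (?P<new>\S+)\.$"
)


def parse_override(line: str):
    """Return the fields of an override log line, or None if the line is something else."""
    record = json.loads(line)
    text = record.get("msg") or record.get("message") or ""
    match = OVERRIDE_RE.search(text)
    return match.groupdict() if match else None


def count_offending_streams(lines):
    """Count override log lines per destination stream."""
    counts = Counter()
    for line in lines:
        parsed = parse_override(line)
        if parsed:
            counts[parsed["stream"]] += 1
    return counts
```

Calling `count_offending_streams(lines).most_common()` on a log export would list the streams whose producers still set meta.dt, busiest first.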

Change #1190301 had a related patch set uploaded (by Ottomata; author: Ottomata):

[mediawiki/extensions/EventBus@master] EventSerializer - fix logic for setting of meta.dt

https://gerrit.wikimedia.org/r/1190301

Wow, I think I found the page_change offender. It is in EventBus itself. I think we thought we did the right thing for this stream when it was created years ago, but it looks like there was a bug! This logging allowed us to spot the bug and fix it. So...that is good? :)

Change #1190308 had a related patch set uploaded (by Ottomata; author: Ottomata):

[mediawiki/extensions/CirrusSearch@master] EventBusWeightedTagSerializerTest - Remove assertion that meta.dt is set

https://gerrit.wikimedia.org/r/1190308

Change #1190308 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] EventBusWeightedTagSerializerTest - Remove assertion that meta.dt is set

https://gerrit.wikimedia.org/r/1190308

Change #1190301 merged by jenkins-bot:

[mediawiki/extensions/EventBus@master] EventSerializer - fix logic for setting of meta.dt

https://gerrit.wikimedia.org/r/1190301

In the logs I spotted another offender:

{"@timestamp":"2025-09-26T16:47:14.571Z","ecs.version":"8.10.0","log.level":"info","message":"Overriding meta.dt in event b63f71b4-d6ff-4a1f-8544-e01d11df60c3 of schema at /sparql/query/1.3.0 destined to stream wdqs-external.sparql-query from 2025-09-26T16:47:14.499Z to 2025-09-26T16:47:14.571Z.","service":{"name":"eventgate-analytics"}}

I think this one (sparql query logs) needs a dedicated task. I believe we may have used meta.dt to capture the query start time. Since the event system in wdqs is buffered, we might get less precise timing information now that EventGate overrides this value. Ideally we should add a top-level dt field for this and drop meta.dt, but I think this will require some schema changes (cc @gmodena).
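
To get a feel for how far the overridden value can drift from the producer's value, the two timestamps in the log line above can be compared. A small Python sketch follows; the helper names are hypothetical, and a single sample says nothing about typical buffering delay.

```python
from datetime import datetime, timedelta


def parse_ts(ts: str) -> datetime:
    # datetime.fromisoformat() only accepts a trailing "Z" from Python 3.11 on,
    # so normalize it to an explicit UTC offset first.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def override_delta_ms(old: str, new: str) -> int:
    """Whole milliseconds between the producer-set and the overriding meta.dt."""
    return (parse_ts(new) - parse_ts(old)) // timedelta(milliseconds=1)


# Values taken from the wdqs-external.sparql-query log line above.
print(override_delta_ms("2025-09-26T16:47:14.499Z", "2025-09-26T16:47:14.571Z"))  # 72
```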

Filed: T405949