Page MenuHomePhabricator

Implement tracing across changeprop-jobqueue, kafka, eventgate
Open, Needs TriagePublic5 Estimated Story Points

Description

Summary

Currently, the tracing context is not propagated across the production job queue (MediaWiki -> eventgate -> Kafka -> changeprop -> MediaWiki).
We should add proper tracing instrumentation to eventgate and changeprop.

Background

  • Although MediaWiki now has basic tracing instrumentation, we lose the tracing context for jobs at the eventgate -> kafka boundary.
  • This causes the actual job executions to be assigned a new trace (if they get sampled at all), instead of being ordered under the original trace that triggered the operation.
  • Instrumenting eventgate and changeprop and preserving the tracing context across the entire production job queue would significantly improve visibility into this system.

Technical notes

  • Both changeprop and eventgate can be trivially wired up to the official OTEL Node SDK. Prior art also exists in wikifunctions.
  • We can inject the trace context into Kafka message headers in eventgate and read them in changeprop. Since most events outside of jobs won't need this, we can make it contingent on a stream configuration flag. For this to work, we first need to roll out a new node-rdkafka-factory version that supports message headers.
  • changeprop currently extensively uses legacy pre-ES6 bluebird promises, which are incompatible with the mechanism the OTEL SDK uses to juggle contexts across async operations. Since the service is now using a modern (v20) node version, we can take this opportunity to migrate relevant code paths to async functions and native Promises instead.

Details

Related Changes in Gerrit:
SubjectRepoBranchLines +/-
mediawiki/extensions/EventBusREL1_44+3 -1
mediawiki/extensions/EventBusREL1_43+3 -1
mediawiki/services/change-propagationmaster+926 -43
mediawiki/services/change-propagationmaster+7 -5
mediawiki/services/change-propagationmaster+1 -2
mediawiki/services/change-propagationmaster+8 -6
mediawiki/services/change-propagationmaster+24 -23
mediawiki/services/change-propagationmaster+15 -23
mediawiki/services/change-propagationmaster+18 -20
mediawiki/services/change-propagationmaster+8 -8
mediawiki/services/change-propagationmaster+11 -13
mediawiki/services/change-propagationmaster+14 -10
mediawiki/services/change-propagationmaster+30 -23
mediawiki/services/change-propagationmaster+6 -5
mediawiki/services/change-propagationmaster+11 -9
mediawiki/services/change-propagationmaster+4 -4
mediawiki/services/change-propagationmaster+61 -58
mediawiki/extensions/EventBusmaster+3 -1
Show related patches Customize query in gerrit
Related Changes in GitLab:
TitleReferenceAuthorSource BranchDest Branch
gitlab_ci_templates - bump version of debian bullseye imagerepos/data-engineering/workflow_utils!54ottoT383557_debian_ci_imagemain
Instrument eventgate-wikimedia with OTELrepos/data-engineering/eventgate-wikimedia!16mszaboT395038master
Support producing messages with headersrepos/data-engineering/node-rdkafka-factory!5mszaboT395038master
Customize query in GitLab

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
dr0ptp4kt added subscribers: gmodena, Ahoelzl.
dr0ptp4kt subscribed.

Added a tag and a couple subscribers here for visibility.

Change #1149517 had a related patch set uploaded (by Máté Szabó; author: Máté Szabó):

[mediawiki/services/change-propagation@master] Convert BaseExecutor._exec into an async function

https://gerrit.wikimedia.org/r/1149517

Change #1149518 had a related patch set uploaded (by Máté Szabó; author: Máté Szabó):

[mediawiki/services/change-propagation@master] Convert RetryExecutor._delay into an async function

https://gerrit.wikimedia.org/r/1149518

Change #1149519 had a related patch set uploaded (by Máté Szabó; author: Máté Szabó):

[mediawiki/services/change-propagation@master] Convert RetryExecutor.processMessage into an async function

https://gerrit.wikimedia.org/r/1149519

Change #1149520 had a related patch set uploaded (by Máté Szabó; author: Máté Szabó):

[mediawiki/services/change-propagation@master] Convert RuleExecutor.processMessage into an async function

https://gerrit.wikimedia.org/r/1149520

Change #1149521 had a related patch set uploaded (by Máté Szabó; author: Máté Szabó):

[mediawiki/services/change-propagation@master] Convert BaseExecutor._rateLimitMessage into an async function

https://gerrit.wikimedia.org/r/1149521

Change #1149522 had a related patch set uploaded (by Máté Szabó; author: Máté Szabó):

[mediawiki/services/change-propagation@master] Convert BaseExecutor._updateLimiters into an async function

https://gerrit.wikimedia.org/r/1149522

Change #1149523 had a related patch set uploaded (by Máté Szabó; author: Máté Szabó):

[mediawiki/services/change-propagation@master] Convert BaseExecutor._consume into an async function

https://gerrit.wikimedia.org/r/1149523

Change #1149524 had a related patch set uploaded (by Máté Szabó; author: Máté Szabó):

[mediawiki/services/change-propagation@master] Convert BaseExecutor.subscribe() into an async function

https://gerrit.wikimedia.org/r/1149524

Change #1149525 had a related patch set uploaded (by Máté Szabó; author: Máté Szabó):

[mediawiki/services/change-propagation@master] Convert subscription classes to use async functions

https://gerrit.wikimedia.org/r/1149525

Change #1149526 had a related patch set uploaded (by Máté Szabó; author: Máté Szabó):

[mediawiki/services/change-propagation@master] Convert Subscriber class to use async functions

https://gerrit.wikimedia.org/r/1149526

Change #1149527 had a related patch set uploaded (by Máté Szabó; author: Máté Szabó):

[mediawiki/services/change-propagation@master] Convert Kafka class to use async functions

https://gerrit.wikimedia.org/r/1149527

Change #1149528 had a related patch set uploaded (by Máté Szabó; author: Máté Szabó):

[mediawiki/services/change-propagation@master] Remove last bluebird usage from BaseExecutor

https://gerrit.wikimedia.org/r/1149528

Change #1149529 had a related patch set uploaded (by Máté Szabó; author: Máté Szabó):

[mediawiki/services/change-propagation@master] Use native Promise in ratelimiter

https://gerrit.wikimedia.org/r/1149529

Change #1149530 had a related patch set uploaded (by Máté Szabó; author: Máté Szabó):

[mediawiki/services/change-propagation@master] Convert OresProcessor to use async functions

https://gerrit.wikimedia.org/r/1149530

Change #1149531 had a related patch set uploaded (by Máté Szabó; author: Máté Szabó):

[mediawiki/services/change-propagation@master] Instrument changeprop with OpenTelemetry

https://gerrit.wikimedia.org/r/1149531

@gmodena can you help assess the effort and impact?

@mszabo subscribing to stay in the loop as well, this work is very relevant to our attempts at SLOs for the new experiment platform (which also goes through EventGate and needs observability similar to what you're doing here)

@gmodena can you help assess the effort and impact?

Thanks for this @mszabo ! Disclaimner: I need to properly review the patch. But re-EventGate:

@tchin implemented trace propagation in service-utils, the service-runner replacement.
@mszabo if the service-utils implementation fits your needs, how about we coalesce the changes in service-utils package and migrate eventgate-wikimedia to it?

We can inject the trace context into Kafka message headers in eventgate and read them in changeprop. Since most events outside of jobs won't need this, we can make it contingent on a stream configuration flag. For this to work, we first need to roll out a new node-rdkafka-factory version that supports message headers.

+1 for injecting into kafka headers and making this stream-contingent. We should also inform non-EventGate producers, though AFAIK I'm not aware of use-cases for trace propagation there (cc @dcausse @pfischer re wikimedia-event-utilities).
Just to be sure: the config flag should be declared in EventStreamConfig (and not hard-coded in EventGate).

changeprop currently extensively uses legacy pre-ES6 bluebird promises, which are incompatible with the mechanism the OTEL SDK uses to juggle contexts across async operations. Since the service is now using a modern (v20) node version, we can take this opportunity to migrate relevant code paths to async functions and native Promises instead.

Sounds good to drop bluebird. I would not invest too much effort in refactoing changeprop at this point, but if the legacy promises break OTEL let's drop them.

@gmodena Thanks for your comments!

@tchin implemented trace propagation in service-utils, the service-runner replacement.
@mszabo if the service-utils implementation fits your needs, how about we coalesce the changes in service-utils package and migrate eventgate-wikimedia to it?

service-utils only provides a custom propagator implementation for X-Request-Id, but leaves it to the consuming application to actually setup an OTEL SDK. Thus we wouldn't really benefit from using it, since we'd need to manage the SDK integration ourselves anyways, plus X-Request-Id is already propagated across the jobqueue via a meta field.

+1 for injecting into kafka headers and making this stream-contingent. We should also inform non-EventGate producers, though AFAIK I'm not aware of use-cases for trace propagation there (cc @dcausse @pfischer re wikimedia-event-utilities).
Just to be sure: the config flag should be declared in EventStreamConfig (and not hard-coded in EventGate).

Yeah, it would be set in $wgEventStreams, like this:

$wgEventStreams = [
	'/^mediawiki\\.job\\..+/' => [
		'schema_title' => 'mediawiki/job',
		'destination_event_service' => 'eventgate-main',
		'canary_events_enabled' => false,
		'append_tracecontext_headers' => true,
	],
];

Sounds good to drop bluebird. I would not invest too much effort in refactoing changeprop at this point, but if the legacy promises break OTEL let's drop them.

Yeah, we need to migrate at least the code paths involved in job execution, as without it the activated context got lost in my local testing. Thankfully this is a very mechanical change. I've split that work into smaller patches to facilitate reviews.

Change #1151660 had a related patch set uploaded (by Máté Szabó; author: Máté Szabó):

[mediawiki/extensions/EventBus@master] tracing: Pass a Tracer to EventBusFactory

https://gerrit.wikimedia.org/r/1151660

Change #1151660 merged by jenkins-bot:

[mediawiki/extensions/EventBus@master] tracing: Pass a Tracer to EventBusFactory

https://gerrit.wikimedia.org/r/1151660

HCoplin-WMF set the point value for this task to 5.Jun 3 2025, 3:27 PM

Change #1149517 merged by jenkins-bot:

[mediawiki/services/change-propagation@master] Convert BaseExecutor._exec into an async function

https://gerrit.wikimedia.org/r/1149517

Change #1149518 merged by jenkins-bot:

[mediawiki/services/change-propagation@master] Convert RetryExecutor._delay into an async function

https://gerrit.wikimedia.org/r/1149518

Change #1149519 merged by jenkins-bot:

[mediawiki/services/change-propagation@master] Convert RetryExecutor.processMessage into an async function

https://gerrit.wikimedia.org/r/1149519

Change #1149520 merged by jenkins-bot:

[mediawiki/services/change-propagation@master] Convert RuleExecutor.processMessage into an async function

https://gerrit.wikimedia.org/r/1149520

Change #1149521 merged by jenkins-bot:

[mediawiki/services/change-propagation@master] Convert BaseExecutor._rateLimitMessage into an async function

https://gerrit.wikimedia.org/r/1149521

Change #1149522 merged by jenkins-bot:

[mediawiki/services/change-propagation@master] Convert BaseExecutor._updateLimiters into an async function

https://gerrit.wikimedia.org/r/1149522

Change #1149523 merged by jenkins-bot:

[mediawiki/services/change-propagation@master] Convert BaseExecutor._consume into an async function

https://gerrit.wikimedia.org/r/1149523

Change #1149524 merged by jenkins-bot:

[mediawiki/services/change-propagation@master] Convert BaseExecutor.subscribe() into an async function

https://gerrit.wikimedia.org/r/1149524

Change #1149525 merged by jenkins-bot:

[mediawiki/services/change-propagation@master] Convert subscription classes to use async functions

https://gerrit.wikimedia.org/r/1149525

Change #1149526 merged by jenkins-bot:

[mediawiki/services/change-propagation@master] Convert Subscriber class to use async functions

https://gerrit.wikimedia.org/r/1149526

Change #1149527 merged by jenkins-bot:

[mediawiki/services/change-propagation@master] Convert Kafka class to use async functions

https://gerrit.wikimedia.org/r/1149527

Change #1149528 merged by jenkins-bot:

[mediawiki/services/change-propagation@master] Remove last bluebird usage from BaseExecutor

https://gerrit.wikimedia.org/r/1149528

Change #1149529 merged by jenkins-bot:

[mediawiki/services/change-propagation@master] Use native Promise in ratelimiter

https://gerrit.wikimedia.org/r/1149529

Change #1149530 merged by jenkins-bot:

[mediawiki/services/change-propagation@master] Convert OresProcessor to use async functions

https://gerrit.wikimedia.org/r/1149530

Re stream config, if this is a setting for eventgate, it would be better to put it in the producers.eventgate stream config section.

producers.eventgate.append_tracecontext_headers ?

Change #1172079 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/deployment-charts@master] eventgate-*-external - bump to 1.16.0

https://gerrit.wikimedia.org/r/1172079

Hm, in beta eventgate-main I now get:

MaxListenersExceededWarning: Possible EventEmitter memory leak detected. 11 pipe listeners added to [Request]. Use emitter.setMaxListeners() to increase limit

...

Ottomata renamed this task from Implement tracing across changeprop-jobqueue to Implement tracing across changeprop-jobqueue, kafka, eventgate.Jul 23 2025, 6:06 PM

@Mate, I reverted so that I can push out an unrelated change that was also in master.

Let's bring it back as a new MR and work on it.

Change #1226881 had a related patch set uploaded (by Paladox; author: Máté Szabó):

[mediawiki/extensions/EventBus@REL1_44] tracing: Pass a Tracer to EventBusFactory

https://gerrit.wikimedia.org/r/1226881

Change #1226882 had a related patch set uploaded (by Paladox; author: Máté Szabó):

[mediawiki/extensions/EventBus@REL1_43] tracing: Pass a Tracer to EventBusFactory

https://gerrit.wikimedia.org/r/1226882

Change #1226882 abandoned by Paladox:

[mediawiki/extensions/EventBus@REL1_43] tracing: Pass a Tracer to EventBusFactory

https://gerrit.wikimedia.org/r/1226882

Change #1226881 merged by jenkins-bot:

[mediawiki/extensions/EventBus@REL1_44] tracing: Pass a Tracer to EventBusFactory

https://gerrit.wikimedia.org/r/1226881

Andreas reviewing in weekly sync