Summary
Currently, the tracing context is not propagated across the production job queue (MediaWiki -> eventgate -> Kafka -> changeprop -> MediaWiki).
We should add proper tracing instrumentation to eventgate and changeprop.
Background
- Although MediaWiki now has basic tracing instrumentation, we lose the tracing context for jobs at the eventgate -> Kafka boundary.
- As a result, each job execution is assigned a new trace (if it gets sampled at all), instead of being nested under the original trace that triggered the operation.
- Instrumenting eventgate and changeprop and preserving the tracing context across the entire production job queue would significantly improve visibility into this system.
Technical notes
- Both changeprop and eventgate can be trivially wired up to the official OTEL Node SDK. Prior art also exists in wikifunctions.
- We can inject the trace context into Kafka message headers in eventgate and read them in changeprop. Since most events outside of jobs won't need this, we can make it contingent on a stream configuration flag. For this to work, we first need to roll out a new node-rdkafka-factory version that supports message headers.
- changeprop currently makes extensive use of legacy pre-ES6 bluebird promises, which are incompatible with the mechanism the OTEL SDK uses to track the active context across async operations. Since the service now runs on a modern Node.js version (v20), we can take this opportunity to migrate the relevant code paths to async functions and native Promises.
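
The header-based propagation described in the notes above could look roughly like the following sketch. It assumes the context is carried in a W3C-style `traceparent` header; in practice the OTEL SDK's own propagator would most likely do the encoding and decoding, and it is hand-rolled here only to keep the example self-contained. The `KafkaHeader` shape, the function names and the `propagate` flag (standing in for the stream configuration flag) are all hypothetical and not taken from eventgate, changeprop or node-rdkafka-factory.

```typescript
// Hypothetical, simplified header shape. The real shape depends on the
// Kafka client and on what node-rdkafka-factory ends up exposing.
type KafkaHeader = { key: string; value: string };

interface TraceContext {
  traceId: string; // 32 lowercase hex characters
  spanId: string; // 16 lowercase hex characters
  sampled: boolean;
}

const TRACEPARENT = "traceparent";
// This sketch only accepts version "00" of the traceparent format.
const TRACEPARENT_RE = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Producer side (eventgate): attach the trace context, but only for
// streams that opted in via the (hypothetical) `propagate` flag.
function injectTraceContext(
  headers: KafkaHeader[],
  ctx: TraceContext,
  propagate: boolean,
): KafkaHeader[] {
  if (!propagate) {
    return headers;
  }
  const flags = ctx.sampled ? "01" : "00";
  const value = `00-${ctx.traceId}-${ctx.spanId}-${flags}`;
  // Drop any stale traceparent header before adding the current one.
  const others = headers.filter((h) => h.key !== TRACEPARENT);
  return [...others, { key: TRACEPARENT, value }];
}

// Consumer side (changeprop): recover the parent context, or return
// undefined if the header is missing or malformed.
function extractTraceContext(headers: KafkaHeader[]): TraceContext | undefined {
  const header = headers.find((h) => h.key === TRACEPARENT);
  if (!header) {
    return undefined;
  }
  const match = TRACEPARENT_RE.exec(header.value);
  if (!match) {
    return undefined;
  }
  const [, traceId = "", spanId = "", flags = ""] = match;
  // All-zero trace or span IDs are invalid.
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) {
    return undefined;
  }
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}
```

With something along these lines, changeprop would start the job execution span as a child of the extracted context when one is present, and fall back to starting a new trace otherwise, so events from streams without the flag behave exactly as they do today.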