
Release OpenTelemetry integration for service-utils
Closed, Resolved · Public

Description

Current state

Service-utils contains experimental support for x-request-id context propagation that integrates with the OpenTelemetry Node SDK, which is already used by some adopter teams - e.g., Abstract Wikipedia.

Based on some initial investigation and functional testing, as well as discussion with that team, we believe this functionality already works as expected.

However, there are three areas where we could provide improved documentation, or batteries-included helpers:

  1. Production-appropriate NodeSDK initialization - This was a pain point during integration by Abstract Wikipedia: it was challenging to adapt their container-image entrypoint to use the typical --require technique to inject instrumentation setup, so they instead settled on an approach that modifies their main application script (i.e., setting up instrumentation and then loading the application code via a dynamic import).
  2. Production configuration - While it's relatively easy to initialize the NodeSDK in a way that provides trace propagation, additional configuration is necessary to actually collect spans from auto-instrumentation or application-level spans (e.g., setting OTEL_EXPORTER_OTLP_ENDPOINT).
  3. Mental model - A clearer mental model of how this all fits together and what to expect. This would make it a lot easier to understand subtle differences between development and production, and more generally debug issues when traces are found to be incomplete.
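For illustration, the dynamic-import approach from item 1 looks roughly like this (file names hypothetical; exporter and sampler configuration are left to the environment, per item 2):

```typescript
// instrumentation.ts (hypothetical entrypoint): the SDK must be set up
// before any instrumented module is imported, so the application itself
// is loaded last, via a dynamic import.
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';

const sdk = new NodeSDK({
  instrumentations: [getNodeAutoInstrumentations()],
});
sdk.start();

// Only now load the application code, so that its (transitive) imports
// of winston, express, etc. are seen and patched by the instrumentation.
await import('./app.js');
```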

Definition of done

  • [done] Fix lingering issues around instrumentation initialization identified during testing and update existing adopters. - T416756#11657003
  • [done] Documentation improvements that cover all three of these. - T416756#11630598
  • [done] Moving the XRequestPropagator out of @experimental stability (after moving it to a subpath export - T416756#11609350).
  • [declined] Some sort of batteries-included NodeSDK initialization helper, analogous to that used in, e.g., function-orchestrator. - T416756#11683524

Details

Related Changes in Gerrit:
Related Changes in GitLab:
  • Remove @experimental stability from XRequestPropagator (repos/data-engineering/service-utils!28, swfrench, graduate-propagator -> main)
  • instrumentTelemetry: Update XRequestPropagator import and fix ecs appropriately (repos/abstract-wiki/wikifunctions/function-evaluator!477, jforrester, T416756 -> main)
  • Add Development > Release to README.md (repos/data-engineering/service-utils!27, swfrench, add-release-docs -> main)
  • Update XRequestPropagator import in instrumentTelemetry.js (repos/abstract-wiki/wikifunctions/function-orchestrator!572, swfrench, instrumentation-imports -> main)
  • Move XRequestPropagator to subpath export (repos/data-engineering/service-utils!25, swfrench, fix/instrumentation-import-order -> main)
  • Expand trace context propagation docs (repos/data-engineering/service-utils!24, swfrench, extend-context-propagation-docs -> main)

Event Timeline

Scott_French changed the task status from Open to In Progress.Feb 7 2026, 12:47 AM
Scott_French triaged this task as Medium priority.
Scott_French moved this task from Inbox to In Progress on the ServiceOps new board.
  1. Production-appropriate NodeSDK initialization - This was a pain-point during integration by Abstract Wikipedia, [...]

I think a good way to approach this is to combine it with #3 - i.e., present a worked example of a minimal initialization script suitable for WMF production, together with an explanation of (a) what it's supposed to achieve (i.e., what to expect as a result of properly integrating it) and (b) how it goes about doing so (e.g., dispelling some of the magic around why it works).

The example in open-telemetry-service-utils-example gets most of the way there - there's just some more explanation needed, together with some avoidance of pitfalls around the commented demo-only functionality (e.g., around explicit configuration of NodeSDKConfiguration.spanProcessors).

  • [maybe] Some sort of batteries-included NodeSDK initialization helper, analogous to that used in, e.g., function-orchestrator.

Expanding on this a bit: if we were to go down this path, we'd probably want to isolate the helper to a subpath export. This would be the clearest way to avoid potentially error-prone ordering of Node SDK init vs. module imports. Not a deal-breaker, but something to consider in terms of whether it's worth it.

Edit: Ah, it looks like @opentelemetry/auto-instrumentations-node provides an example of something close to what I had in mind (although not quite the same interface).

Scott_French added a subscriber: tchin.

Expanding a bit on my comment in T416756#11599776 about error-prone SDK init, I noticed something curious today:

One of the core features of @opentelemetry/instrumentation-winston, which is among the instrumentations returned by @opentelemetry/auto-instrumentations-node's getNodeAutoInstrumentations, is the addition of span information to the log record for log-correlation purposes - i.e., trace_id, span_id, trace_flags.

Now, two things jump out at me:

  1. These do not comply with the ecs schema, so they're of limited utility for our use case (and, in practice, duplicate what we already achieve with fairly ubiquitous x-request-id logging).
  2. I've actually never seen these emitted by Abstract Wikipedia services, despite ostensibly having this instrumentation enabled.

So, what's going on?

From a bit of local testing, I'm now mildly confident that this is an artifact of how we're (re)exporting XRequestPropagator from the top-level index.ts, which dependent services then import during instrumentation init. In effect, that means an import of XRequestPropagator triggers transitive imports of Winston (and numerous other modules) before instrumentation happens, and runs afoul of:

Before any other module in your application is loaded, you must initialize the SDK.

Indeed, if I introduce a subpath export only for open_telemetry.ts, and use that in my initialization instead of the top-level import to access helpers, I get the expected behavior.
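A subpath export of this kind can be declared in package.json roughly as follows (the dist paths are hypothetical; the "./otel" subpath matches what eventually shipped):

```json
{
  "name": "@wikimedia/service-utils",
  "exports": {
    ".": "./dist/index.js",
    "./otel": "./dist/open_telemetry.js"
  }
}
```

Importing @wikimedia/service-utils/otel then pulls in only open_telemetry.js and its dependencies, rather than everything reachable from the top-level index.ts.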

So, I think the lesson is that we probably want to isolate exports used in SDK init (or doing SDK init, if we go that route) to trivially importable contexts - i.e., low-dependency subpath exports.

Edit: One additional observation is that, even though we're in a similar situation for Express - i.e., there exist imports that precede auto-instrumentation - the latter at least seems to work normally. I've not had a chance to look closely at why that might be the case (but it implies some fundamental differences in how interception / wrapping plays out).

Edit: In hindsight, the compiled code contains no Express imports, since of course we're only ever importing types in service-utils, which likely explains the above.

From some additional digging into auto-instrumentation today, there's at least one additional point we should highlight in our documentation:

The patching technique in the core @opentelemetry/instrumentation package (and thus all dependent instrumentations) assumes the CommonJS module system is used. Broadly, that means additional setup is needed to ensure (ESM) module loading is properly intercepted / patched (i.e., via --loader, in this case pointed at a hooks file making use of import-in-the-middle). More details in:

As with all things, and alluded to in the second link, the situation is actually a bit murkier:

  1. While --loader is now deprecated, and the long-term direction is to adopt Node's module.register(), this has not happened yet. See issue 4933.
  2. In that same issue, the discussion touches on something I've seen to be true in practice, but isn't clear from the docs: even if your code uses ESM, as long as the transitively imported instrumented dependency is loaded as CJS under the hood (e.g., it distributes only CJS, like winston or express), the instrumentation works as expected. Meaning, some things might work while others do not.

In any case, this is probably something we should highlight in the docs explaining instrumentation setup.
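For reference, the two interception setups look roughly like the following (file paths illustrative; the hooks file is the one mentioned above that wires up import-in-the-middle):

```
# CJS entrypoint: preload instrumentation setup before the app
node --require ./instrumentation.js app.js

# ESM entrypoint: register loader hooks so module loading can be
# intercepted / patched (--loader is deprecated, per point 1 above)
node --loader=./hooks.mjs app.mjs
```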

Some additional points of note around trace sampling, and how that should be reflected in the (WIP) documentation:

Sampling decisions

In production at WMF, our Envoy service mesh plays a key role in mediating trace sampling.

The core requirement (MUST) for correct trace context propagation is to propagate x-request-id, which drives trace correlation and consistent sampling decisions (for a given probability) across Envoys, together with the recommendation (SHOULD) to also propagate the W3C standard headers (preserves parent / child relationships, as well as sampling decisions where different probabilities are used).

The linked guidance there speaks of propagation in a way that mostly treats a service as a pass-through entity - i.e., one that does not originate spans, and propagates the header(s) verbatim.

That's of course not true when the service is producing intermediate spans, and the outbound traceparent header values reflect that.
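Concretely, the W3C traceparent header carries the parent span identity and the sampling flag (example values taken from the W3C Trace Context specification):

```
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
#            |  trace-id (propagated verbatim)   parent span-id   flags
#            version                                              (01 = sampled)
```

A pass-through service forwards this verbatim; a service producing intermediate spans rewrites the span-id field to identify its own current span before making outbound requests.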

In those cases, how should the service make sampling decisions?

In production MediaWiki, which is our main source of "precedent" for distributed tracing at WMF, we follow a pattern where Envoy is the head-end sampler, and we always follow Envoy's downstream decision reflected in the traceparent flags field - even new root spans (those without a parent) are never sampled in our current configuration.

Notably, that sampling policy - i.e., respect parent, no sampling if root - differs from the Node SDK default when no explicit sampler is configured, which defaults to respect parent, always sample if root.

That can lead to noise in cases where requests don't traverse Envoy, such as k8s readiness probes or Prometheus metrics scraping. If (auto-)instrumented, these will be root spans, and thus sampled and reported to otelcol. You can see this, for example, looking at wikifunctions-orchestrator traces, where the vast majority are exactly that (aside: there's something odd going on specifically for the /_info readiness probes, where the root span generated by the Node http instrumentation is missing, leaving only the child spans from express instrumentation).

In any case, I think our default guidance should be to set OTEL_TRACES_SAMPLER to parentbased_always_off, which is arguably the least surprising behavior, in that it remains consistent with our general principle (i.e., Envoy is authoritative).

That would mean that services that specifically need the ability to create sampled root spans (e.g., detached background work) would need some additional setup, but that's probably okay as long as we clearly document how to go about that.
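Under that guidance, the baseline environment configuration might look like the following (the endpoint value is illustrative; both variable names and the sampler value are standard OpenTelemetry SDK configuration):

```
# Where to send spans (value illustrative - whatever otelcol endpoint applies)
OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318
# Respect the parent's sampling decision; never sample new root spans
OTEL_TRACES_SAMPLER=parentbased_always_off
```

A service that does need sampled root spans could instead configure an explicit sampler (e.g., a ParentBasedSampler with a non-trivial root sampler from @opentelemetry/sdk-trace-base), which is the kind of additional setup that would need documenting.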

I've made a number of updates to:

over the last two days, which cover many of the points discussed in this task in at least some detail.

Some additional tweaks will be needed if / when Move XRequestPropagator to subpath export is merged, and surely there are clarity / concision edits to be made, but I think this provides a good foundation for improving the tracing adoption documentation story.

Many thanks to @CDanis for pointing me to the explanation for:

[...] (aside: there's something odd going on specifically for the /_info readiness probes, where the root span generated by the Node http instrumentation is missing, leaving only the child spans from express instrumentation).

This is the filter/healthchecks processor we've configured on the OTel collector. That would have the effect of dropping the (root) span emitted by the http instrumentation, while the child spans emitted by the express instrumentation would survive (it also explains why I could see the root span when manually hitting /_info/ instead).

In other news, Move XRequestPropagator to subpath export is merged, so we should be able to release that soon (once we sort out some details of other changes landing in the same release).

I'll write up some notes on how to upgrade from the (deprecated) XRequestPropagator reexport from helpers to the subpath export. That's a fairly mechanical change on its own, but might benefit from some additional tweaks to configure an @opentelemetry/instrumentation-winston logHook that rewrites the trace fields to ecs equivalents (e.g., trace_id -> trace.id).
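The logHook option is part of the instrumentation's documented configuration; the exact rewrite policy below is my assumption about what the ecs mapping should look like. A minimal sketch:

```typescript
// Hypothetical logHook for @opentelemetry/instrumentation-winston that
// rewrites the flat trace fields injected for log correlation into
// their ecs dotted equivalents (input field names per the
// instrumentation's defaults).
type LogRecord = Record<string, unknown>;

function ecsLogHook(_span: unknown, record: LogRecord): void {
  if ('trace_id' in record) {
    record['trace.id'] = record['trace_id'];
    delete record['trace_id'];
  }
  if ('span_id' in record) {
    record['span.id'] = record['span_id'];
    delete record['span_id'];
  }
  // trace_flags has no obvious ecs equivalent; drop it to keep the
  // emitted log records schema-clean.
  delete record['trace_flags'];
}
```

This would be passed as the logHook option when constructing the winston instrumentation, so the rewrite happens before the record is serialized.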

service-utils v2.0.0 was released earlier today, so I've updated the example instrumentation script on Wikitech to reflect the new @wikimedia/service-utils/otel subpath export.

Change #1248124 had a related patch set uploaded (by Jforrester; author: Jforrester):

[operations/deployment-charts@master] wikifunctions: Upgrade orchestrator from 2026-03-04-123739 to 2026-03-04-220825

https://gerrit.wikimedia.org/r/1248124

Change #1248124 merged by jenkins-bot:

[operations/deployment-charts@master] wikifunctions: Upgrade orchestrator from 2026-03-04-123739 to 2026-03-04-220825

https://gerrit.wikimedia.org/r/1248124

Confirmed that tracing continues to work as expected (example) and raw ecs logs now contain the expected trace.id and span.id rewrites, after deployment of r/1248124 picked up the changes from function-orchestrator!572.

Definition of done (DRAFT)

  • [maybe] Some sort of batteries-included NodeSDK initialization helper, analogous to that used in, e.g., function-orchestrator.

Now that I've had a chance to think about this and experiment a bit with what it could look like, I think I've concluded we shouldn't pursue this (at least for now).

First, if we were to provide this in service-utils directly (i.e., as an additional subpath export), it would massively expand the dependency surface of the package vs. what we have now (where we're able to get away with just a dependency on @opentelemetry/api). I don't think that's warranted, given that OpenTelemetry integration is not something all service-utils adopters (at least as of now) care about.

Second, if we were to do the obvious thing and create a companion package for the initialization helper (i.e., so the dependencies become opt-in), that would work, but is probably going to create compatibility headaches unless we're careful to release service-utils and the companion package together (and are clear about cross-package compatibility rules for adopters).

While there are some patterns we could adopt for making our side of that easier to maintain - see, e.g., how https://github.com/open-telemetry/opentelemetry-js is laid out and subpackages are released together with specific compatibility and versioning guidelines - that does not seem worth the effort / complexity.

Ultimately, until we see wider adoption of OpenTelemetry integration - particularly so we can see how adopter teams customize their instrumentations (like Abstract Wikipedia does) - I think the best option is to provide an opinionated minimal example in documentation, as we now do on Wikitech.

Ah, thanks @Ottomata! I was unaware of the history in T395038 (and the existence of eventgate-wikimedia!16).

This work is focused on testing and improving onboarding support (e.g., documentation) for adopting service-utils' OpenTelemetry integration, rather than migrating potential adopters. Which is to say, I think the dependency order is the other way around - i.e., work under T395038 (if using service-utils) depends on this. Let me know if that doesn't make sense, or if that's actually what you were trying to represent.

Change #1250594 had a related patch set uploaded (by Jforrester; author: Jforrester):

[operations/deployment-charts@master] wikifunctions: Upgrade evaluators from 2026-02-28-010106 to 2026-03-10-214300

https://gerrit.wikimedia.org/r/1250594

Change #1250594 merged by jenkins-bot:

[operations/deployment-charts@master] wikifunctions: Upgrade evaluators from 2026-02-28-010106 to 2026-03-10-214300

https://gerrit.wikimedia.org/r/1250594

@Ottomata - I've inverted the parent <> child task relationship to reflect T416756#11693070. Please undo if you think it doesn't make sense to couple them in this way. FWIW, this task is likely to be closed this week.

Thanks, that is fine! I mostly use subtasks to keep tasks linked, and don't have an opinion on which direction the relationship should go ;) (someone tell me what is best!)

Since everything tracked here under WE6.2.9 has been completed and the hypothesis closed, I believe it's time to resolve this.

Many thanks to @tchin and @Jdforrester-WMF for the discussion and reviews!