Extension:CirrusSearch not propagating tracing headers
Closed, Resolved · Public

Description

In WMF production it's expected that services propagate headers used for tracing microservice sub-requests back to the root user request.

Historically, this meant passing only x-request-id through the RPC call chain.

However, we've also recently added the W3C-standard headers traceparent and tracestate, which are used by our distributed tracing implementation.
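For readers unfamiliar with the W3C header, here is a minimal sketch of what a version-00 traceparent value looks like and how it could be validated. The names `TraceParent` and `parse_traceparent` are hypothetical and exist only for this illustration; nothing here is taken from MediaWiki or CirrusSearch code.

```python
import re
from typing import NamedTuple, Optional

# Version-00 layout from the W3C Trace Context spec:
#   version "-" trace-id "-" parent-id "-" trace-flags
# with all fields in lowercase hex. Future versions may append fields,
# which this sketch deliberately does not handle.
_TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class TraceParent(NamedTuple):  # hypothetical helper type
    version: str
    trace_id: str
    parent_id: str
    flags: str


def parse_traceparent(value: str) -> Optional[TraceParent]:
    """Return the parsed header, or None if it is malformed."""
    match = _TRACEPARENT_RE.match(value.strip())
    if match is None:
        return None
    parsed = TraceParent(*match.groups())
    # The spec forbids version "ff" and all-zero trace or parent IDs.
    if (
        parsed.version == "ff"
        or parsed.trace_id == "0" * 32
        or parsed.parent_id == "0" * 16
    ):
        return None
    return parsed
```

A service that only forwards the header does not need to parse it at all; validation like this mainly matters for components that create spans themselves.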

From searching the codebase, CirrusSearch doesn't seem to propagate any of these headers.

In MediaWiki core we now have Wikimedia/Http/TelemetryHeadersInterface, which allows for pluggable implementations that extract the necessary headers. It currently has only one implementation: Mediawiki/Http/Telemetry https://gerrit.wikimedia.org/g/mediawiki/core/+/master/includes/http/Telemetry.php

I think it should be enough to call Telemetry::getInstance()->getRequestHeaders() and add the returned array to all HTTP requests sent while CirrusSearch handles an incoming request.
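The idea can be sketched in a language-neutral way as follows. This is not the MediaWiki API: `TRACING_HEADERS`, `tracing_headers_to_forward` and `build_outgoing_headers` are hypothetical names, and the sketch assumes the service only forwards the headers unchanged rather than creating spans of its own.

```python
from typing import Dict, Mapping, Optional

# Headers named in this task; the tuple itself is an illustrative assumption.
TRACING_HEADERS = ("x-request-id", "traceparent", "tracestate")


def tracing_headers_to_forward(incoming: Mapping[str, str]) -> Dict[str, str]:
    """Pick the tracing headers out of an incoming request's headers.

    HTTP header names are case-insensitive, so compare in lowercase.
    """
    lowered = {name.lower(): value for name, value in incoming.items()}
    return {name: lowered[name] for name in TRACING_HEADERS if name in lowered}


def build_outgoing_headers(
    incoming: Mapping[str, str],
    base: Optional[Mapping[str, str]] = None,
) -> Dict[str, str]:
    """Merge the tracing headers into the headers of a sub-request."""
    headers = dict(base or {})
    headers.update(tracing_headers_to_forward(incoming))
    return headers
```

Every outgoing sub-request made while handling the incoming request would then use something like `build_outgoing_headers(incoming_headers, {"Content-Type": "application/json"})` for its header set, so the backend sees the same request and trace identifiers as the root request.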

Event Timeline

Restricted Application added a subscriber: Aklapper.

Change #1058210 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/extensions/Elastica@master] Propagate http tracing headers

https://gerrit.wikimedia.org/r/1058210

Change #1058210 merged by jenkins-bot:

[mediawiki/extensions/Elastica@master] Propagate http tracing headers

https://gerrit.wikimedia.org/r/1058210

@CDanis quasi-relatedly, is there a distributed tracing plan for event-based functionality? Currently edits turn into events, which get ingested by an application we run and trigger MW API calls, but nothing today ties those things together for tracing. This is perhaps additionally complicated because multiple events can be collapsed into a single API call.

There's not a finished plan, no, but I'd be very happy to try some experiments with you next Q perhaps?

The otel data spec itself supports structures like what you describe -- see for instance here -- but I'm not sure offhand how much of that Jaeger will render. It would also be the first case where we'd have to have an application generating otel itself, instead of just relying on Envoy.

So, possible but non-trivial :)

Change #1059122 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/extensions/WikimediaEvents@master] search: Repair checkin events fired after session close

https://gerrit.wikimedia.org/r/1059122

Very interesting. I'll have to play around with Jaeger a bit more, but I suspect this could be useful in debugging problems with prod. I'll talk with the rest of the team and see if this is something we want to spend some time on.

BTW I opened a new task T371842: Come up with a roadmap for supporting tracing for batch- or message-driven systems with a bunch of pointers for what I think is the current state of the OTel/Jaeger world on this.

Thanks again for fixing the immediate issue so immediately!