Page MenuHomePhabricator

Trace header propagation for MediaWiki
Closed, ResolvedPublic

Description

Currently, when MediaWiki processes a request, it takes care to propagate the x-request-id header from the incoming request to any requests to other services it makes "on behalf" of that request as part of handling it (for instance if MediaWiki fetches from Restbase, or Shellbox).

In order to implement distributed tracing, we'll also need to similarly propagate the tracestate and traceparent headers.

In the future we'll likely want to integrate MediaWiki with the OpenTelemetry client library, so we can add other context to traces within MediaWiki. But for now we can simply propagating these two headers without modifying them or even parsing them (as is allowed by the W3C TraceContext Recommendation)

Event Timeline

Krinkle renamed this task from Trace header propagation for Mediawiki to Trace header propagation for MediaWiki.Jul 18 2023, 2:10 PM
Krinkle updated the task description. (Show Details)

Change 940149 had a related patch set uploaded (by Pmiazga; author: Pmiazga):

[mediawiki/core@master] WIP: Propagate tracestate and traceheaders in MWHttpClient library

https://gerrit.wikimedia.org/r/940149

@Krinkle had a pretty good question regarding the wgAllowExternalReqID guard.

Should these be under the same $wgAllowExternalReqID guard? It seems risky to take in wild input from Internet clients making the first top-level request to MW and controlling unsanitized tracestate data of arbitrary length into internal services.

In production, we allow $wgAllowExternalReqID because our traffic layer (Varnish) strips these and then sets its own initial values.

And the question here - who will provide the tracestate and traceparent headers? Are we going to strip those on varnish and then re-inject before calling MW (envoy ?).
Is it possible for a client from outside to pass us those headers?

Are we stripping those at the edge currently?

Change 946617 had a related patch set uploaded (by CDanis; author: CDanis):

[operations/puppet@production] Filter tracing headers from the outside

https://gerrit.wikimedia.org/r/946617

Change 946617 merged by CDanis:

[operations/puppet@production] Filter tracing headers from the outside

https://gerrit.wikimedia.org/r/946617

@Krinkle had a pretty good question regarding the wgAllowExternalReqID guard.

Should these be under the same $wgAllowExternalReqID guard? It seems risky to take in wild input from Internet clients making the first top-level request to MW and controlling unsanitized tracestate data of arbitrary length into internal services.

In production, we allow $wgAllowExternalReqID because our traffic layer (Varnish) strips these and then sets its own initial values.

And the question here - who will provide the tracestate and traceparent headers? Are we going to strip those on varnish and then re-inject before calling MW (envoy ?).

Yes, exactly.

Is it possible for a client from outside to pass us those headers?

Are we stripping those at the edge currently?

They're now being stripped at the edge, as of 2055 UTC today :)

Change 940149 merged by jenkins-bot:

[mediawiki/core@master] http: Propagate `tracestate` and `traceparent` headers

https://gerrit.wikimedia.org/r/940149

Change 947422 had a related patch set uploaded (by Pmiazga; author: Pmiazga):

[mediawiki/extensions/EventBus@master] Use the new Telemetry state for handling requestId

https://gerrit.wikimedia.org/r/947422

Change 947422 merged by jenkins-bot:

[mediawiki/extensions/EventBus@master] Use the new Telemetry state for handling requestId

https://gerrit.wikimedia.org/r/947422

pmiazga changed the task status from Open to In Progress.Aug 21 2023, 12:13 PM
pmiazga triaged this task as High priority.

@CDanis Can you confirm that you see traces on prod ?

@CDanis Can you confirm that you see traces on prod ?

Jaeger being able to receive traces is WIP (T343302 T344253) though I believe we can validate that otel-collector is indeed receiving traces? (as a question as I'm not sure, what do you think @CDanis?)

@CDanis Can you confirm that you see traces on prod ?

Jaeger being able to receive traces is WIP (T343302 T344253) though I believe we can validate that otel-collector is indeed receiving traces? (as a question as I'm not sure, what do you think @CDanis?)

I've been looking at this the past two days, with the intention to just verify that MediaWiki is indeed propagating headers if it receives them. It's been trickier than I expected to verify for a few different reasons.

One thing I did find is that MediaWiki isn't propagating anything (including x-request-id) when sending requests to sessionstore. I'll be filing another bug about that shortly.

Verified working for at least some code paths! With the known exception of T344926.

After having the epiphany that my earlier reproduction efforts weren't working because termbox SSR output gets cached in the application itself (either in the local appserver or in memcached, not sure which), I've managed to demonstrate tracestate/traceparent header propagation in production, after setting the headers identically in my incoming request to MediaWiki:

GET /termbox?entity=Q45429613&revision=1863914492&language=en&editLink=%2Fwiki%2FSpecial%3ASetLabelDescriptionAliases%2FQ45429613&preferredLanguages=en HTTP/1.1
Host: localhost:6008
X-Request-Id: e9120785-77cf-46d8-9064-c19dd73f3b1b
tracestate: xxxCDANISxxx_tracestate
traceparent: xxxCDANISxxx_traceparent
User-Agent: MediaWiki/1.41.0-wmf.23

Hey @CDanis I'm closing this one in favor of T344926. Please let us know if you detect any other code paths that need work.

T344926 is a standalone ticket as it's an existing issue. From a quick check, sessions can use RESTBagOStuff to store/retrieve session data. When setting up a RESTBagOStuff we can specify a custom HTTP client, or it will use MultiHttpClient by default.

It looks like it's a third HTTP client. And from quick skim, it doesn't look like it supports any kind of telemetry atm. I would say that first we should make RESTBagOStuff use a generic PSR HTTP client. I also noticed that there is a plenty of services that use this MultiHttpClient that most likely are affected too:

  • EventBus
  • Etcd Configs
  • SwiftFileBackend
  • VirtualRestServiceClient

@larissagaulia I was going to say to close this ticket and let the T344926 become a standalone ticket but I see you already did it. thank you.