Page MenuHomePhabricator

Harmonise the identification of requests across our stack
Open, NormalPublic

Description

Problem

Our stack comprises many components/servers that all interact with each other in other to fulfil clients' requests (or prepare data to be served to clients at a later point). For any given request, requests may be spawned to other components and their responses assembled before being returned to the client. This creates the need for having a sort of a distributed stack trace that allows us to pin-point problematic links in the request chain.

A certain degree of request identification does currently exist in our infrastructure, alas only on sub-system levels:

There are probably more such examples.

In order to be able to trace the requests provoked by an (initial/external) request, all of the systems in our infrastructure should identify requests in the same way, use this identifier for logging and propagate it to other links in the request chain.

Proposed Solution

Use a UUID v1/v4 x-request-id header/entity. Varnish f-e (soon ATS) is the main point of entry of external requests. Therefore, it can generate the request IDs and attach them to requests in the form of the x-request-id header, which can then be used and propagated by all entities behind it. Furthermore, entities responding to requests must log the received/generated request ID.

See Also

Event Timeline

mobrovac created this task.Aug 7 2018, 12:41 PM
mobrovac triaged this task as Normal priority.
Restricted Application added a project: Operations. · View Herald TranscriptAug 7 2018, 12:41 PM
Restricted Application added a subscriber: Aklapper. · View Herald Transcript

Possibly related -- this should likely be implemented separately, but there's a slight chance that there's overlap.

I've been investigating the use of an OpenTracing implementation, such as Jaeger or Zipkin, to provide distributed tracing. The OpenTracing standard provides for the use of headers to pass a couple of different IDs around. The goal of these IDs is to be able to build a flame graph of a request, as it transits different systems -- something like https://performance.wikimedia.org/xenon/svgs/daily/2018-08-06.index.svgz but with the trace beginning when the request was first received by Varnish, and ending when the last byte is sent (see http://opentracing.io/documentation/pages/instrumentation/instrumenting-large-systems.html for a quick overview and example of how this is useful.)

The overlap that I see is that the request ID generated by Varnish could be compatible with the span IDs that Opentracing implementations use, and in that case, we'd be able to use the request ID as the "parent span". In practice, this would mean that a log line, database query comment, or other item that contained the request ID could be used to identify a request trace for that entire request.

Not worth holding up the overall project, but seems like there could be value down the line if the implementations happened to be compatible.

Change 451240 had a related patch set uploaded (by Mobrovac; owner: Mobrovac):
[operations/puppet@production] Varnish: Unset X-Request-Id for external requests

https://gerrit.wikimedia.org/r/451240

mobrovac updated the task description. (Show Details)Aug 8 2018, 11:53 AM
Catrope added a subscriber: Catrope.Aug 8 2018, 8:36 PM

In addition to that, do we have a task for "every service should include the request ID in its logstash entries"? It seems to me like we'd need that for this to be useful.

mobrovac updated the task description. (Show Details)Aug 9 2018, 9:42 AM

In addition to that, do we have a task for "every service should include the request ID in its logstash entries"? It seems to me like we'd need that for this to be useful.

In my mind that was implied in the task description. I spelled it out for clarity.

I've been investigating the use of an OpenTracing implementation, such as Jaeger or Zipkin, to provide distributed tracing. The OpenTracing standard provides for the use of headers to pass a couple of different IDs around. The goal of these IDs is to be able to build a flame graph of a request, as it transits different systems -- something like https://performance.wikimedia.org/xenon/svgs/daily/2018-08-06.index.svgz but with the trace beginning when the request was first received by Varnish, and ending when the last byte is sent (see http://opentracing.io/documentation/pages/instrumentation/instrumenting-large-systems.html for a quick overview and example of how this is useful.)

I definitely agree that we should strive to eventually reach the point of being able to have flame graphs and all the niceties, but I see that as v2 (or even v3) of the effort. This task is deliberately limited to v1 (aka MVP) and the goal here is "if I have a reqId, I should be able to see all the requests that happened for/because of it in the logs", which is a much narrower scope. Nevertheless, I do think that we should go step by step here. As the task description tries to illustrate many variations of this problem have been attacked in different subsystems in different ways; consolidating all these mechanism into one is a first step towards integrating more advanced logging and tracing mechanisms.

The overlap that I see is that the request ID generated by Varnish could be compatible with the span IDs that Opentracing implementations use, and in that case, we'd be able to use the request ID as the "parent span". In practice, this would mean that a log line, database query comment, or other item that contained the request ID could be used to identify a request trace for that entire request.

Yup. I read the specification, and there is nothing that prevents us expanding the request ids proposed in this task to full-blown spans.

Not worth holding up the overall project, but seems like there could be value down the line if the implementations happened to be compatible.

I would go even further and say that this task is a prerequisite for implementing tracing in our infra given the systematisation that has to happen for both. Likewise, having the OT spec as a rudimentary guideline should be a prerequisite for the completion of this task so that we can build future features on top of this.

<snip>

Perfect -- to be clear, I wasn't making any objection, just trying to frame an additional use case. Thanks, Marko!

Imarlier moved this task from Inbox to Radar on the Performance-Team board.Aug 11 2018, 6:20 PM
Imarlier edited projects, added Performance-Team (Radar); removed Performance-Team.

Change 451240 merged by Ema:
[operations/puppet@production] Varnish: Unset X-Request-Id for external requests

https://gerrit.wikimedia.org/r/451240

Joe added a comment.Aug 14 2018, 4:43 AM

We also need internal requests to be traced, so I would assume we need all services to generate a request Id whenever they receive a request that has none.

ema moved this task from Triage to Caching on the Traffic board.Aug 15 2018, 6:35 AM

We also need internal requests to be traced, so I would assume we need all services to generate a request Id

Correct. Any source of requests should attach one.

whenever they receive a request that has none.

If a service receives a request without a req id it means we have a hole somewhere. I wonder how to trace/identify such cases...

If a service receives a request without a req id it means we have a hole somewhere. I wonder how to trace/identify such cases...

Because we currently tend to have services very isolated, it should be relatively easy just by knowing the source IP of the request, I would think.

With that said, a thing that we did when we undertook a similar project at a past company was that we gave every application that made HTTP requests a unique user agent (generally the name of the calling job or application). Might be worth exploring doing that as well. (@BPirkle - maybe something to think about in the context of your MediaWiki HTTP client work as well.)

We also need internal requests to be traced, so I would assume we need all services to generate a request Id whenever they receive a request that has none.

EventBus extension generates an x-request-id if the request doesn't already have one. Probably any web service should do this logic, in case it isn't being accessed via varnish. For Mediawiki, perhaps we should set x-request-id somewhere else earlier on? Currently if it isn't set, the EventBus extension will end up generating different x-request-ids even for events that are triggered during the same PHP request handling.

unique user agent (generally the name of the calling job or application). Might be worth exploring doing that as well.

For node services, we are already doing that. At least in theory, maybe again some 'holes' exist.

daniel moved this task from Inbox to Under discussion on the TechCom-RFC board.Aug 29 2018, 8:33 PM

@mobrovac I think as a first step we should:

  • Standardise the name of the header (for services that can/do set a header of this kind),
  • Require that services also read the header from internal requests (or generate their own) and pass it on to any internal requests, and to explicitly make sure we don't do that for external requests.

But I'm unsure whether we should also standardise the value format of the header (as UUID). For services where it's easy to use UUID as something else, that seems fine given the choice to pick the same format. But, if a service has a built-in method that only supports a different format, perhaps that's okay to keep as-is? I expect for the majority of cases we'll only be dealing with whatever format is used by Varnish. It's not clear to me what the added benefit would be of requests originating elsewhere internally passing around a slightly differently formatted value (assuming we don't parse it in any way).

Krenair added a subscriber: Krenair.Sep 3 2018, 9:07 PM

But I'm unsure whether we should also standardise the value format of the header (as UUID).

I'd be fine with not requiring UUID, but we should standardize the field type. Since UUID is string, we should stick with string everywhere, even if the actual value is integerish.

@mobrovac I think as a first step we should:

  • Standardise the name of the header (for services that can/do set a header of this kind),
  • Require that services also read the header from internal requests (or generate their own) and pass it on to any internal requests, and to explicitly make sure we don't do that for external requests.

Yup, that sounds about right. In the task I am proposing x-request-id but any x- value should work well enough. The second part about setting/consulting it everywhere is where I think the bulk of the work will be, but without it we can't really have a complete picture of a request's path and impact.

But I'm unsure whether we should also standardise the value format of the header (as UUID). For services where it's easy to use UUID as something else, that seems fine given the choice to pick the same format. But, if a service has a built-in method that only supports a different format, perhaps that's okay to keep as-is?

This smells like you already have concrete cases where either using a UUID or changing the way it's currently done might prove challenging? As for my suggestion of using UUIDs, I proposed them because of the low possibility of collision, but any function with such properties ought to work for us here.

I expect for the majority of cases we'll only be dealing with whatever format is used by Varnish. It's not clear to me what the added benefit would be of requests originating elsewhere internally passing around a slightly differently formatted value (assuming we don't parse it in any way).

I must admit I don't understand this part. If I understood you correctly, even if certain parts of the system do emit a different type of ID, I don't see a problem with it as long as (a) they honour the header on the way in (i.e. do not generate a new one in spite of it already being set); and (b) the generating function provides low-collision guarantees (as stated above). There would be one direct benefit if we use something like a time UUID, though: if every entity in the request chain produces them, then we could also get a time span out of them (eventually, and, under the assumption that most entities can generate time UUIDs with sufficient precision). However, this would be a benefit only at later stages where we implement something like sub-request IDs as a way of denoting lightweight spans for future instrumentation capabilities.

Since UUID is string, we should stick with string everywhere, even if the actual value is integerish.

I agree, we should make sure the feature is analytics-friendly and setting the header's type to string makes the most sense to me as it provides us with flexibility.

kchapman added a subscriber: kchapman.

TechCom is putting this on last call ending on 26 September 2pm PST(21:00 UTC, 23:00 CET)

Tgr awarded a token.Sep 24 2018, 12:09 AM
Tgr added a subscriber: Tgr.

One issue brought up in T113817: Connect Hadoop records of the same request coming via different channels was that we might want to avoid connecting too many things for privacy reasons (specifically, request IDs are useful for both debug logs and analytics data, but we don't want to connect debug logs to analytics data as they contain a bunch of information we are careful to keep out of out analytics logging). That can be done by storing a hash of the request ID and using different hashing methods or different salts for debug logs and for analytics. That's easy to do safely with a random (UUIDv4-style) request id but much harder with a deterministic (UUIDv1-style) ID.

Joe added a comment.Sep 24 2018, 6:52 AM

As long as a request ID doesn't get associated with PII-worthy data, I don't see it being a privacy issue. Since it's a per-request debugging tool, I don't see a value in adding it to the analytics data at all.

Coming to the format of the tracer, I think it should have the following feature:

  • Include information about where the request originated (at the edge, from an internal service)
  • Be reasonably unique, so that two different requests cannot be confused.

I don't think we need more characteristics than this, from an operations point of view. So I would see a N-letter prefix + a UUIDv4-like random string as a good format.

TechCom has approved this RFC

kchapman edited projects, added TechCom-RFC (TechCom-Approved); removed TechCom-RFC.
ema added a subscriber: ema.Oct 5 2018, 8:59 AM
CDanis added a subscriber: CDanis.Nov 8 2018, 2:25 PM
CDanis moved this task from Backlog to Radar on the User-CDanis board.

On a very random note, I wanted to say that I enjoyed this:

Guess the subscriber list triggered something in GMail's language recognition?

Change 487544 had a related patch set uploaded (by Mobrovac; owner: Mobrovac):
[mediawiki/core@master] MWHttpRequest: Include the request ID in outgoing HTTP requests

https://gerrit.wikimedia.org/r/487544

Anomie added a subscriber: Anomie.Feb 8 2019, 7:30 PM

A discussion about later plans for this task was accidentally started in code review. Copying it here for continued discussion.

In Gerrit, @Tgr wrote:

I imagine the idea with X-Request-Id is that it is generated at the point of entry (which might not be MediaWiki) and proxied through internal service calls. There was some discussion on whether that is a better strategy vs. each application adding its own request ID header (which to some extent also happens already, e.g. Thumbor adds a Thumbor-Request-Id) but I can't recall where or with what results. I think Krinkle was involved, maybe he remembers.

In Gerrit, @mobrovac wrote:

Correct. The idea is to have the same request ID propagated through the system. This patch is just the first step, in which we ensure MW propagates the id it uses for logging. Next, we'd need to check if the request id has already been set on the incoming request and use that one instead. Then, we need to implement generating the id in varnish, but currently we can get away without because in MW we use Apache's mod_unique which sets $_SERVER['unique_id'] for each request.

In Gerrit, @Anomie wrote:

Next, we'd need to check if the request id has already been set on the incoming request and use that one instead.

But let's make sure that some attacker can't screw things up by passing X-Request-ID in their own requests.

And let's make sure that whatever is done to prevent that doesn't only work on Wikimedia's infrastructure, i.e. don't rely on filtering the header out at the edge before the request gets to MediaWiki.

In Gerrit, @mobrovac wrote:

But let's make sure that some attacker can't screw things up by passing X-Request-ID in their own requests.

Varnish is already removing this header from external incoming requests, so we are good there :)

And let's make sure that whatever is done to prevent that doesn't only work on Wikimedia's infrastructure, i.e. don't rely on filtering the header out at the edge before the request gets to MediaWiki.

I'm not sure I follow. In our env, MW has no choice but to trust Varnish for such chores. If MW receives a request with an already-set req id, how can it tell whether it's to be considered safe or not?

In Gerrit, @Anomie wrote:

My point is that third-party wikis probably don't have our Varnish configuration to protect them, but an attacker still shouldn't be able to unexpectedly inject a request ID into their logging.

In Gerrit, @mobrovac wrote:

My point is that third-party wikis probably don't have our Varnish configuration to protect them, but an attacker still shouldn't be able to unexpectedly inject a request ID into their logging.

I realise that. A simple thing that pops to mind now is to simply have a config var that tells MW whether to remove the header on incoming reqs or not, but we can bike-shed on this elsewhere :)

Anomie added a comment.Feb 8 2019, 7:32 PM
In Gerrit, @mobrovac wrote:

I realise that. A simple thing that pops to mind now is to simply have a config var that tells MW whether to remove the header on incoming reqs or not

Defaulted to "never accept the incoming header" and reconfigured in Wikimedia's configuration, I agree that should work.

Or we could somehow let code from LocalSettings.php override the default header, and in Wikimedia's CommonSettings.php we could do that (like we do for some other hooks already).

Change 487544 merged by jenkins-bot:
[mediawiki/core@master] MWHttpRequest: Include the request ID in outgoing HTTP requests

https://gerrit.wikimedia.org/r/487544

Defaulted to "never accept the incoming header" and reconfigured in Wikimedia's configuration, I agree that should work.

That's what I was thinking, yes. Instead of making this only about the request ID, we could generalise the config setting and have it cover unsafe headers in general, so that we don't end up with a gazillion variables, one for each header.

Or we could somehow let code from LocalSettings.php override the default header, and in Wikimedia's CommonSettings.php we could do that (like we do for some other hooks already).

What would be the benefit here as opposed to the config setting? Since defaulting the config to false would make MW's default installation secure (and, on the other hand, simple for us to override it for our needs), I have a hard time understanding why would we complicate things.

Change 505668 had a related patch set uploaded (by Mobrovac; owner: Mobrovac):
[mediawiki/core@master] Allow the request ID to be passed in via the X-Request-Id header

https://gerrit.wikimedia.org/r/505668

Change 505668 merged by jenkins-bot:
[mediawiki/core@master] Allow the request ID to be passed in via the X-Request-Id header

https://gerrit.wikimedia.org/r/505668

Change 510796 had a related patch set uploaded (by Mobrovac; owner: Mobrovac):
[operations/mediawiki-config@master] Allow MW to honour the X-Request-Id header if set

https://gerrit.wikimedia.org/r/510796

Change 510796 merged by jenkins-bot:
[operations/mediawiki-config@master] Allow MW to honour the X-Request-Id header if set

https://gerrit.wikimedia.org/r/510796

Mentioned in SAL (#wikimedia-operations) [2019-05-28T09:32:21Z] <mobrovac@deploy1001> Synchronized wmf-config/InitialiseSettings.php: Allow MW to honour the X-Request-Id header if set - T201409 (duration: 01m 12s)