Page MenuHomePhabricator

Harmonise the identification of requests across our stack
Open, MediumPublic

Description

Problem

Our stack comprises many components/servers that all interact with each other in other to fulfil clients' requests (or prepare data to be served to clients at a later point). For any given request, requests may be spawned to other components and their responses assembled before being returned to the client. This creates the need for having a sort of a distributed stack trace that allows us to pin-point problematic links in the request chain.

A certain degree of request identification does currently exist in our infrastructure, alas only on sub-system levels:

There are probably more such examples.

In order to be able to trace the requests provoked by an (initial/external) request, all of the systems in our infrastructure should identify requests in the same way, use this identifier for logging and propagate it to other links in the request chain.

Proposed Solution

Use a UUID v1/v4 x-request-id header/entity. Varnish f-e (soon ATS) is the main point of entry of external requests. Therefore, it can generate the request IDs and attach them to requests in the form of the x-request-id header, which can then be used and propagated by all entities behind it. Furthermore, entities responding to requests must log the received/generated request ID.

See Also

Related Objects

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Change 451240 merged by Ema:
[operations/puppet@production] Varnish: Unset X-Request-Id for external requests

https://gerrit.wikimedia.org/r/451240

We also need internal requests to be traced, so I would assume we need all services to generate a request Id whenever they receive a request that has none.

We also need internal requests to be traced, so I would assume we need all services to generate a request Id

Correct. Any source of requests should attach one.

whenever they receive a request that has none.

If a service receives a request without a req id it means we have a hole somewhere. I wonder how to trace/identify such cases...

If a service receives a request without a req id it means we have a hole somewhere. I wonder how to trace/identify such cases...

Because we currently tend to have services very isolated, it should be relatively easy just by knowing the source IP of the request, I would think.

With that said, a thing that we did when we undertook a similar project at a past company was that we gave every application that made HTTP requests a unique user agent (generally the name of the calling job or application). Might be worth exploring doing that as well. (@BPirkle - maybe something to think about in the context of your MediaWiki HTTP client work as well.)

We also need internal requests to be traced, so I would assume we need all services to generate a request Id whenever they receive a request that has none.

EventBus extension generates an x-request-id if the request doesn't already have one. Probably any web service should do this logic, in case it isn't being accessed via varnish. For Mediawiki, perhaps we should set x-request-id somewhere else earlier on? Currently if it isn't set, the EventBus extension will end up generating different x-request-ids even for events that are triggered during the same PHP request handling.

unique user agent (generally the name of the calling job or application). Might be worth exploring doing that as well.

For node services, we are already doing that. At least in theory, maybe again some 'holes' exist.

@mobrovac I think as a first step we should:

  • Standardise the name of the header (for services that can/do set a header of this kind),
  • Require that services also read the header from internal requests (or generate their own) and pass it on to any internal requests, and to explicitly make sure we don't do that for external requests.

But I'm unsure whether we should also standardise the value format of the header (as UUID). For services where it's easy to use UUID as something else, that seems fine given the choice to pick the same format. But, if a service has a built-in method that only supports a different format, perhaps that's okay to keep as-is? I expect for the majority of cases we'll only be dealing with whatever format is used by Varnish. It's not clear to me what the added benefit would be of requests originating elsewhere internally passing around a slightly differently formatted value (assuming we don't parse it in any way).

But I'm unsure whether we should also standardise the value format of the header (as UUID).

I'd be fine with not requiring UUID, but we should standardize the field type. Since UUID is string, we should stick with string everywhere, even if the actual value is integerish.

@mobrovac I think as a first step we should:

  • Standardise the name of the header (for services that can/do set a header of this kind),
  • Require that services also read the header from internal requests (or generate their own) and pass it on to any internal requests, and to explicitly make sure we don't do that for external requests.

Yup, that sounds about right. In the task I am proposing x-request-id but any x- value should work well enough. The second part about setting/consulting it everywhere is where I think the bulk of the work will be, but without it we can't really have a complete picture of a request's path and impact.

But I'm unsure whether we should also standardise the value format of the header (as UUID). For services where it's easy to use UUID as something else, that seems fine given the choice to pick the same format. But, if a service has a built-in method that only supports a different format, perhaps that's okay to keep as-is?

This smells like you already have concrete cases where either using a UUID or changing the way it's currently done might prove challenging? As for my suggestion of using UUIDs, I proposed them because of the low possibility of collision, but any function with such properties ought to work for us here.

I expect for the majority of cases we'll only be dealing with whatever format is used by Varnish. It's not clear to me what the added benefit would be of requests originating elsewhere internally passing around a slightly differently formatted value (assuming we don't parse it in any way).

I must admit I don't understand this part. If I understood you correctly, even if certain parts of the system do emit a different type of ID, I don't see a problem with it as long as (a) they honour the header on the way in (i.e. do not generate a new one in spite of it already being set); and (b) the generating function provides low-collision guarantees (as stated above). There would be one direct benefit if we use something like a time UUID, though: if every entity in the request chain produces them, then we could also get a time span out of them (eventually, and, under the assumption that most entities can generate time UUIDs with sufficient precision). However, this would be a benefit only at later stages where we implement something like sub-request IDs as a way of denoting lightweight spans for future instrumentation capabilities.

Since UUID is string, we should stick with string everywhere, even if the actual value is integerish.

I agree, we should make sure the feature is analytics-friendly and setting the header's type to string makes the most sense to me as it provides us with flexibility.

kchapman added a subscriber: kchapman.

TechCom is putting this on last call ending on 26 September 2pm PST(21:00 UTC, 23:00 CET)

Tgr added a subscriber: Tgr.

One issue brought up in T113817: Add request_id to webrequest logs as well as other event records ingested into Hadoop was that we might want to avoid connecting too many things for privacy reasons (specifically, request IDs are useful for both debug logs and analytics data, but we don't want to connect debug logs to analytics data as they contain a bunch of information we are careful to keep out of out analytics logging). That can be done by storing a hash of the request ID and using different hashing methods or different salts for debug logs and for analytics. That's easy to do safely with a random (UUIDv4-style) request id but much harder with a deterministic (UUIDv1-style) ID.

As long as a request ID doesn't get associated with PII-worthy data, I don't see it being a privacy issue. Since it's a per-request debugging tool, I don't see a value in adding it to the analytics data at all.

Coming to the format of the tracer, I think it should have the following feature:

  • Include information about where the request originated (at the edge, from an internal service)
  • Be reasonably unique, so that two different requests cannot be confused.

I don't think we need more characteristics than this, from an operations point of view. So I would see a N-letter prefix + a UUIDv4-like random string as a good format.

On a very random note, I wanted to say that I enjoyed this:

image.png (342×982 px, 44 KB)

Guess the subscriber list triggered something in GMail's language recognition?

Change 487544 had a related patch set uploaded (by Mobrovac; owner: Mobrovac):
[mediawiki/core@master] MWHttpRequest: Include the request ID in outgoing HTTP requests

https://gerrit.wikimedia.org/r/487544

A discussion about later plans for this task was accidentally started in code review. Copying it here for continued discussion.

In Gerrit, @Tgr wrote:

I imagine the idea with X-Request-Id is that it is generated at the point of entry (which might not be MediaWiki) and proxied through internal service calls. There was some discussion on whether that is a better strategy vs. each application adding its own request ID header (which to some extent also happens already, e.g. Thumbor adds a Thumbor-Request-Id) but I can't recall where or with what results. I think Krinkle was involved, maybe he remembers.

In Gerrit, @mobrovac wrote:

Correct. The idea is to have the same request ID propagated through the system. This patch is just the first step, in which we ensure MW propagates the id it uses for logging. Next, we'd need to check if the request id has already been set on the incoming request and use that one instead. Then, we need to implement generating the id in varnish, but currently we can get away without because in MW we use Apache's mod_unique which sets $_SERVER['unique_id'] for each request.

In Gerrit, @Anomie wrote:

Next, we'd need to check if the request id has already been set on the incoming request and use that one instead.

But let's make sure that some attacker can't screw things up by passing X-Request-ID in their own requests.

And let's make sure that whatever is done to prevent that doesn't only work on Wikimedia's infrastructure, i.e. don't rely on filtering the header out at the edge before the request gets to MediaWiki.

In Gerrit, @mobrovac wrote:

But let's make sure that some attacker can't screw things up by passing X-Request-ID in their own requests.

Varnish is already removing this header from external incoming requests, so we are good there :)

And let's make sure that whatever is done to prevent that doesn't only work on Wikimedia's infrastructure, i.e. don't rely on filtering the header out at the edge before the request gets to MediaWiki.

I'm not sure I follow. In our env, MW has no choice but to trust Varnish for such chores. If MW receives a request with an already-set req id, how can it tell whether it's to be considered safe or not?

In Gerrit, @Anomie wrote:

My point is that third-party wikis probably don't have our Varnish configuration to protect them, but an attacker still shouldn't be able to unexpectedly inject a request ID into their logging.

In Gerrit, @mobrovac wrote:

My point is that third-party wikis probably don't have our Varnish configuration to protect them, but an attacker still shouldn't be able to unexpectedly inject a request ID into their logging.

I realise that. A simple thing that pops to mind now is to simply have a config var that tells MW whether to remove the header on incoming reqs or not, but we can bike-shed on this elsewhere :)

In Gerrit, @mobrovac wrote:

I realise that. A simple thing that pops to mind now is to simply have a config var that tells MW whether to remove the header on incoming reqs or not

Defaulted to "never accept the incoming header" and reconfigured in Wikimedia's configuration, I agree that should work.

Or we could somehow let code from LocalSettings.php override the default header, and in Wikimedia's CommonSettings.php we could do that (like we do for some other hooks already).

Change 487544 merged by jenkins-bot:
[mediawiki/core@master] MWHttpRequest: Include the request ID in outgoing HTTP requests

https://gerrit.wikimedia.org/r/487544

Defaulted to "never accept the incoming header" and reconfigured in Wikimedia's configuration, I agree that should work.

That's what I was thinking, yes. Instead of making this only about the request ID, we could generalise the config setting and have it cover unsafe headers in general, so that we don't end up with a gazillion variables, one for each header.

Or we could somehow let code from LocalSettings.php override the default header, and in Wikimedia's CommonSettings.php we could do that (like we do for some other hooks already).

What would be the benefit here as opposed to the config setting? Since defaulting the config to false would make MW's default installation secure (and, on the other hand, simple for us to override it for our needs), I have a hard time understanding why would we complicate things.

Change 505668 had a related patch set uploaded (by Mobrovac; owner: Mobrovac):
[mediawiki/core@master] Allow the request ID to be passed in via the X-Request-Id header

https://gerrit.wikimedia.org/r/505668

Change 505668 merged by jenkins-bot:
[mediawiki/core@master] Allow the request ID to be passed in via the X-Request-Id header

https://gerrit.wikimedia.org/r/505668

Change 510796 had a related patch set uploaded (by Mobrovac; owner: Mobrovac):
[operations/mediawiki-config@master] Allow MW to honour the X-Request-Id header if set

https://gerrit.wikimedia.org/r/510796

Change 510796 merged by jenkins-bot:
[operations/mediawiki-config@master] Allow MW to honour the X-Request-Id header if set

https://gerrit.wikimedia.org/r/510796

Mentioned in SAL (#wikimedia-operations) [2019-05-28T09:32:21Z] <mobrovac@deploy1001> Synchronized wmf-config/InitialiseSettings.php: Allow MW to honour the X-Request-Id header if set - T201409 (duration: 01m 12s)

Change 582948 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[operations/puppet@production] mediawiki: Fix reqId in php7-fatal-error.php to consider X-Request-Id

https://gerrit.wikimedia.org/r/582948

Change 582948 merged by Alexandros Kosiaris:
[operations/puppet@production] mediawiki: Fix reqId in php7-fatal-error.php to consider X-Request-Id

https://gerrit.wikimedia.org/r/582948

Change 608643 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[mediawiki/core@master] Set X-Request-Id on responses

https://gerrit.wikimedia.org/r/c/mediawiki/core/ /608643

Change 608643 merged by jenkins-bot:
[mediawiki/core@master] Set X-Request-Id on all web responses

https://gerrit.wikimedia.org/r/608643

BBlack added a subscriber: BBlack.

The swap of Traffic for Traffic-Icebox in this ticket's set of tags was based on a bulk action for all tickets that aren't are neither part of our current planned work nor clearly a recent, higher-priority emergent issue. This is simply one step in a larger task cleanup effort. Further triage of these tickets (and especially, organizing future potential project ideas from them into a new medium) will occur afterwards! For more detail, have a look at the extended explanation on the main page of Traffic-Icebox . Thank you!