Maniphest T201409

Harmonise the identification of requests across our stack
Open, Medium, Public
Assigned To

None

Authored By

	• mobrovac
	Aug 7 2018, 12:41 PM

Description

Problem

Our stack comprises many components/servers that all interact with each other in order to fulfil clients' requests (or prepare data to be served to clients at a later point). For any given request, further requests may be spawned to other components and their responses assembled before being returned to the client. This creates the need for a sort of distributed stack trace that allows us to pinpoint problematic links in the request chain.

A certain degree of request identification does currently exist in our infrastructure, alas only on sub-system levels:

MediaWiki's WebRequest relies on the UNIQUE_ID env variable provided by Apache's mod_unique_id
RESTBase and the services behind it use and propagate the X-Request-Id header
EventBus relies on the same x-request-id header when creating events for both asynchronous updates as well as JobQueue messages
Thumbor uses a custom Thumbor-Request-Id header

There are probably more such examples.

In order to be able to trace the requests provoked by an (initial/external) request, all of the systems in our infrastructure should identify requests in the same way, use this identifier for logging and propagate it to other links in the request chain.

Proposed Solution

Use a UUID v1/v4 x-request-id header/entity. Varnish f-e (soon ATS) is the main point of entry of external requests. Therefore, it can generate the request IDs and attach them to requests in the form of the x-request-id header, which can then be used and propagated by all entities behind it. Furthermore, entities responding to requests must log the received/generated request ID.
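
As an illustration of the proposal above, here is a minimal sketch in Python of what "use and propagate" could look like inside a service. The function names are hypothetical and not taken from any of the components listed; the sketch assumes headers are available as a plain dictionary and falls back to a UUID v4 when no ID was supplied.

```python
import logging
import uuid
from typing import Dict, Optional

HEADER = "x-request-id"
log = logging.getLogger("service")


def ensure_request_id(headers: Dict[str, str]) -> str:
    """Return the incoming request ID, or generate a UUID v4 if there is none."""
    # HTTP header names are case-insensitive, so normalise before the lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    return lowered.get(HEADER) or str(uuid.uuid4())


def outgoing_headers(request_id: str,
                     extra: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Build headers for a sub-request that carry the same request ID."""
    headers = dict(extra or {})
    headers[HEADER] = request_id
    return headers


def handle(headers: Dict[str, str]) -> Dict[str, str]:
    """Log the ID for this request and return headers for any sub-request."""
    request_id = ensure_request_id(headers)
    log.info("handling request", extra={"request_id": request_id})
    return outgoing_headers(request_id)
```

The point of the sketch is only that the same value is logged and forwarded; where the ID is first generated (Varnish in the proposal) is outside its scope.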

Details

Subject	Repo	Branch	Lines +/-
Set X-Request-Id on all web responses	mediawiki/core	master	+6 -0
mediawiki: Fix reqId in php7-fatal-error.php to consider X-Request-Id	operations/puppet	production	+6 -3
Allow MW to honour the X-Request-Id header if set	operations/mediawiki-config	master	+7 -0
Allow the request ID to be passed in via the `X-Request-Id` header	mediawiki/core	master	+25 -3
MWHttpRequest: Include the request ID in outgoing HTTP requests	mediawiki/core	master	+2 -0
Varnish: Unset X-Request-Id for external requests	operations/puppet	production	+5 -0

Related Objects

Status	Assigned	Task
Open	None	T201409 Harmonise the identification of requests across our stack
Open	None	T221976 Have Varnish set the `X-Request-Id` header for incoming external requests
Resolved	• ema	T221977 Package libvmod-uuid for Debian
Resolved	• mobrovac	T225711 Have service-template-node forward the request ID
Declined	Krinkle	T193050 Include request id (if present) in a comment in DB queries
Duplicate	None	T291419 Update tendril or its replacement to support having request IDs in DB query comments
Invalid	None	T291420 Update performance_schema to support having request IDs in DB query comments
Resolved	Krinkle	T291192 Update php-wmerrors page to include request ID
Resolved	Krinkle	T279211 WikimediaDebug: Read reqId from X-Request-Id header instead of from HTML (wgRequestId)

Event Timeline

• Imarlier moved this task from Inbox, needs triage to Radar on the Performance-Team board.Aug 11 2018, 6:20 PM

• Imarlier edited projects, added Performance-Team (Radar); removed Performance-Team.

Change 451240 merged by Ema:
[operations/puppet@production] Varnish: Unset X-Request-Id for external requests

https://gerrit.wikimedia.org/r/451240

CCicalese_WMF removed a project: Core-Platform-Team-Old.Aug 14 2018, 2:02 AM

Krinkle moved this task from Limbo to Watching on the Performance-Team (Radar) board.Aug 14 2018, 2:14 AM

We also need internal requests to be traced, so I would assume we need all services to generate a request Id whenever they receive a request that has none.

• ema moved this task from Backlog to Caching on the Traffic board.Aug 15 2018, 6:35 AM

In T201409#4500541, @Joe wrote:

We also need internal requests to be traced, so I would assume we need all services to generate a request Id

Correct. Any source of requests should attach one.

whenever they receive a request that has none.

If a service receives a request without a req id it means we have a hole somewhere. I wonder how to trace/identify such cases...

In T201409#4513970, @mobrovac wrote:

If a service receives a request without a req id it means we have a hole somewhere. I wonder how to trace/identify such cases...

Because we currently tend to have services very isolated, it should be relatively easy just by knowing the source IP of the request, I would think.

With that said, a thing that we did when we undertook a similar project at a past company was that we gave every application that made HTTP requests a unique user agent (generally the name of the calling job or application). Might be worth exploring doing that as well. (@BPirkle - maybe something to think about in the context of your MediaWiki HTTP client work as well.)
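
A rough sketch of how such holes could be surfaced, assuming a service can see the source IP and the User-Agent of each incoming request. The class and its fields are hypothetical; it simply counts requests that arrive without an ID, keyed by caller.

```python
from collections import Counter
from typing import Dict


class MissingIdTracker:
    """Counts requests that arrive without a request ID, per caller."""

    def __init__(self) -> None:
        self.missing: Counter = Counter()

    def observe(self, headers: Dict[str, str], source_ip: str) -> bool:
        """Record the request if it has no ID; return True when it was missing."""
        lowered = {name.lower(): value for name, value in headers.items()}
        if lowered.get("x-request-id"):
            return False
        # Key on source IP and User-Agent so the offending caller can be found.
        caller = (source_ip, lowered.get("user-agent", "unknown"))
        self.missing[caller] += 1
        return True
```

With a unique user agent per calling application, the most frequent keys in `missing` would point directly at the callers that do not yet attach an ID.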

Ottomata subscribed.Aug 20 2018, 3:03 PM

We also need internal requests to be traced, so I would assume we need all services to generate a request Id whenever they receive a request that has none.

EventBus extension generates an x-request-id if the request doesn't already have one. Probably any web service should do this logic, in case it isn't being accessed via varnish. For Mediawiki, perhaps we should set x-request-id somewhere else earlier on? Currently if it isn't set, the EventBus extension will end up generating different x-request-ids even for events that are triggered during the same PHP request handling.
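
To make the last point concrete, here is a small Python sketch (not the EventBus code, and with hypothetical names) of resolving the ID once per request so that every event created while handling that request shares it.

```python
import uuid
from typing import Dict


class RequestContext:
    """Holds the ID for one incoming request (hypothetical helper)."""

    def __init__(self, headers: Dict[str, str]) -> None:
        lowered = {name.lower(): value for name, value in headers.items()}
        # Resolve the ID once, up front, instead of at each event creation.
        self.request_id = lowered.get("x-request-id") or str(uuid.uuid4())

    def make_event(self, topic: str) -> Dict[str, str]:
        """Create an event that carries this request's ID."""
        return {"topic": topic, "request_id": self.request_id}
```

Generating the fallback ID at event-creation time instead would produce a different value per event, which is the behaviour described above.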

unique user agent (generally the name of the calling job or application). Might be worth exploring doing that as well.

For node services, we are already doing that. At least in theory, maybe again some 'holes' exist.

• Imarlier mentioned this in T110022: Move HTTP-related code from MW to its own library.Aug 20 2018, 3:23 PM

daniel moved this task from P1: Define to Under discussion on the TechCom-RFC board.Aug 29 2018, 8:33 PM

@mobrovac I think as a first step we should:

Standardise the name of the header (for services that can/do set a header of this kind),
Require that services also read the header from internal requests (or generate their own) and pass it on to any internal requests, and to explicitly make sure we don't do that for external requests.

But I'm unsure whether we should also standardise the value format of the header (as UUID). For services where it's easy to use UUID as something else, that seems fine given the choice to pick the same format. But, if a service has a built-in method that only supports a different format, perhaps that's okay to keep as-is? I expect for the majority of cases we'll only be dealing with whatever format is used by Varnish. It's not clear to me what the added benefit would be of requests originating elsewhere internally passing around a slightly differently formatted value (assuming we don't parse it in any way).

Krenair subscribed.Sep 3 2018, 9:07 PM

But I'm unsure whether we should also standardise the value format of the header (as UUID).

I'd be fine with not requiring UUID, but we should standardize the field type. Since UUID is string, we should stick with string everywhere, even if the actual value is integerish.

In T201409#4554407, @Krinkle wrote:

@mobrovac I think as a first step we should:

Standardise the name of the header (for services that can/do set a header of this kind),

Require that services also read the header from internal requests (or generate their own) and pass it on to any internal requests, and to explicitly make sure we don't do that for external requests.

Yup, that sounds about right. In the task I am proposing x-request-id but any x- value should work well enough. The second part about setting/consulting it everywhere is where I think the bulk of the work will be, but without it we can't really have a complete picture of a request's path and impact.

But I'm unsure whether we should also standardise the value format of the header (as UUID). For services where it's easy to use UUID as something else, that seems fine given the choice to pick the same format. But, if a service has a built-in method that only supports a different format, perhaps that's okay to keep as-is?

This smells like you already have concrete cases where either using a UUID or changing the way it's currently done might prove challenging? As for my suggestion of using UUIDs, I proposed them because of the low possibility of collision, but any function with such properties ought to work for us here.

I expect for the majority of cases we'll only be dealing with whatever format is used by Varnish. It's not clear to me what the added benefit would be of requests originating elsewhere internally passing around a slightly differently formatted value (assuming we don't parse it in any way).

I must admit I don't understand this part. If I understood you correctly, even if certain parts of the system do emit a different type of ID, I don't see a problem with it as long as (a) they honour the header on the way in (i.e. do not generate a new one in spite of it already being set); and (b) the generating function provides low-collision guarantees (as stated above). There would be one direct benefit if we use something like a time UUID, though: if every entity in the request chain produces them, then we could also get a time span out of them (eventually, and, under the assumption that most entities can generate time UUIDs with sufficient precision). However, this would be a benefit only at later stages where we implement something like sub-request IDs as a way of denoting lightweight spans for future instrumentation capabilities.
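
For reference, a short Python sketch of how a time span could be derived from two time-based (v1) UUIDs. It relies only on the standard `uuid` module, whose `time` attribute exposes the 60-bit timestamp in 100-nanosecond units since 1582-10-15; the function names are illustrative.

```python
import uuid
from datetime import datetime, timedelta, timezone

# UUID v1 timestamps count 100 ns intervals since the Gregorian epoch.
GREGORIAN_EPOCH = datetime(1582, 10, 15, tzinfo=timezone.utc)


def uuid1_time(value: str) -> datetime:
    """Return the creation time embedded in a version 1 UUID."""
    parsed = uuid.UUID(value)
    if parsed.version != 1:
        raise ValueError("not a time-based (v1) UUID")
    # Drop the last decimal digit to convert 100 ns units to microseconds.
    return GREGORIAN_EPOCH + timedelta(microseconds=parsed.time // 10)


def span(first: str, last: str) -> timedelta:
    """Time elapsed between the creation of two version 1 UUIDs."""
    return uuid1_time(last) - uuid1_time(first)
```

This only works if every entity in the chain emits v1 IDs and their clocks are reasonably in sync, which matches the caveat about precision above.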

In T201409#4556081, @Ottomata wrote:

Since UUID is string, we should stick with string everywhere, even if the actual value is integerish.

I agree, we should make sure the feature is analytics-friendly and setting the header's type to string makes the most sense to me as it provides us with flexibility.

• Mholloway subscribed.Sep 14 2018, 3:34 PM

TechCom is putting this on last call ending on 26 September 2pm PST (21:00 UTC, 23:00 CET)

• bearND mentioned this in T203854: Expand usage of x-triggered-by.Sep 21 2018, 8:21 PM

• bearND subscribed.

One issue brought up in T113817: Add request_id to webrequest logs as well as other event records ingested into Hadoop was that we might want to avoid connecting too many things for privacy reasons (specifically, request IDs are useful for both debug logs and analytics data, but we don't want to connect debug logs to analytics data as they contain a bunch of information we are careful to keep out of our analytics logging). That can be done by storing a hash of the request ID and using different hashing methods or different salts for debug logs and for analytics. That's easy to do safely with a random (UUIDv4-style) request id but much harder with a deterministic (UUIDv1-style) ID.
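
A minimal Python sketch of the idea, using HMAC-SHA256 with a separate secret key per destination as a stand-in for "different salts". The key values and names are hypothetical; in practice each pipeline would hold only its own secret.

```python
import hashlib
import hmac


def pseudonymise(request_id: str, key: bytes) -> str:
    """Keyed hash of a request ID; outputs under different keys do not match."""
    return hmac.new(key, request_id.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical secrets, one per destination.
DEBUG_LOG_KEY = b"debug-log-secret"
ANALYTICS_KEY = b"analytics-secret"


def ids_for_storage(request_id: str) -> dict:
    """Return the value each pipeline would store for the same request."""
    return {
        "debug": pseudonymise(request_id, DEBUG_LOG_KEY),
        "analytics": pseudonymise(request_id, ANALYTICS_KEY),
    }
```

Records within one pipeline can still be joined on the stored value, while joining debug logs to analytics data would require both secrets.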

As long as a request ID doesn't get associated with PII-worthy data, I don't see it being a privacy issue. Since it's a per-request debugging tool, I don't see a value in adding it to the analytics data at all.

Coming to the format of the tracer, I think it should have the following features:

Include information about where the request originated (at the edge, from an internal service)
Be reasonably unique, so that two different requests cannot be confused.

I don't think we need more characteristics than this, from an operations point of view. So I would see an N-letter prefix + a UUIDv4-like random string as a good format.
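
One possible shape for such a tracer, sketched in Python. The three-letter prefixes, the hyphen separator and the function names are assumptions made for illustration; the comment above does not fix any of them.

```python
import re
import uuid
from typing import Tuple

# Hypothetical origin prefixes (N = 3 here).
ORIGINS = {"edge": "EDG", "internal": "INT"}
_TRACER = re.compile(r"^([A-Z]{3})-([0-9a-f]{32})$")


def new_tracer(origin: str) -> str:
    """Build '<prefix>-<32 random hex chars>' for a known origin."""
    return f"{ORIGINS[origin]}-{uuid.uuid4().hex}"


def parse_tracer(tracer: str) -> Tuple[str, str]:
    """Split a tracer into (prefix, random part), rejecting other formats."""
    match = _TRACER.match(tracer)
    if match is None:
        raise ValueError(f"unrecognised tracer format: {tracer!r}")
    return match.group(1), match.group(2)
```

The value stays a plain string, so consumers that do not care about the origin can treat it as opaque.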

TechCom has approved this RFC

• kchapman moved this task from P5: Last Call to TechCom-RFC-Closed on the TechCom-RFC board.Sep 27 2018, 7:47 PM

• kchapman edited projects, added TechCom-RFC (TechCom-RFC-Closed); removed TechCom-RFC.

• ema subscribed.Oct 5 2018, 8:59 AM

CDanis subscribed.Nov 8 2018, 2:25 PM

CDanis added a project: User-CDanis.Nov 9 2018, 7:31 PM

CDanis moved this task from Backlog to Radar on the User-CDanis board.

dr0ptp4kt mentioned this in T205569: Define cross-schema event stitching approach.Nov 15 2018, 2:37 AM

• jlinehan subscribed.Nov 15 2018, 3:53 PM

kzimmerman subscribed.Dec 13 2018, 6:10 PM

Krinkle removed a project: Patch-For-Review.Dec 13 2018, 8:37 PM

Krinkle moved this task from Untriaged to Approved on the TechCom-RFC (TechCom-RFC-Closed) board.

On a very random note, I wanted to say that I enjoyed this:

Guess the subscriber list triggered something in GMail's language recognition?

@Imarlier https://translate.google.com/#view=home&op=translate&sl=et&tl=en&text=krinkle

• mobrovac added a project: Platform Team Legacy (Designing).Dec 20 2018, 12:55 PM

• Phabricator_maintenance moved this task from Backlog to Acknowledged on the SRE board.Jan 26 2019, 10:15 PM

Change 487544 had a related patch set uploaded (by Mobrovac; owner: Mobrovac):
[mediawiki/core@master] MWHttpRequest: Include the request ID in outgoing HTTP requests

https://gerrit.wikimedia.org/r/487544

gerritbot added a project: Patch-For-Review.Feb 1 2019, 11:51 PM

A discussion about later plans for this task was accidentally started in code review. Copying it here for continued discussion.

In Gerrit, @Tgr wrote:

I imagine the idea with X-Request-Id is that it is generated at the point of entry (which might not be MediaWiki) and proxied through internal service calls. There was some discussion on whether that is a better strategy vs. each application adding its own request ID header (which to some extent also happens already, e.g. Thumbor adds a Thumbor-Request-Id) but I can't recall where or with what results. I think Krinkle was involved, maybe he remembers.

In Gerrit, @mobrovac wrote:

Correct. The idea is to have the same request ID propagated through the system. This patch is just the first step, in which we ensure MW propagates the ID it uses for logging. Next, we'd need to check whether the request ID has already been set on the incoming request and use that one instead. Then, we need to implement generating the ID in Varnish, but currently we can get away without it because in MW we use Apache's mod_unique_id, which sets $_SERVER['UNIQUE_ID'] for each request.

In Gerrit, @Anomie wrote:

Next, we'd need to check if the request id has already been set on the incoming request and use that one instead.

But let's make sure that some attacker can't screw things up by passing X-Request-ID in their own requests.

And let's make sure that whatever is done to prevent that doesn't only work on Wikimedia's infrastructure, i.e. don't rely on filtering the header out at the edge before the request gets to MediaWiki.

In Gerrit, @mobrovac wrote:

But let's make sure that some attacker can't screw things up by passing X-Request-ID in their own requests.

Varnish is already removing this header from external incoming requests, so we are good there :)

And let's make sure that whatever is done to prevent that doesn't only work on Wikimedia's infrastructure, i.e. don't rely on filtering the header out at the edge before the request gets to MediaWiki.

I'm not sure I follow. In our env, MW has no choice but to trust Varnish for such chores. If MW receives a request with an already-set req id, how can it tell whether it's to be considered safe or not?

In Gerrit, @Anomie wrote:

My point is that third-party wikis probably don't have our Varnish configuration to protect them, but an attacker still shouldn't be able to unexpectedly inject a request ID into their logging.

In Gerrit, @mobrovac wrote:

My point is that third-party wikis probably don't have our Varnish configuration to protect them, but an attacker still shouldn't be able to unexpectedly inject a request ID into their logging.

I realise that. A simple thing that pops to mind now is to simply have a config var that tells MW whether to remove the header on incoming reqs or not, but we can bike-shed on this elsewhere :)

In Gerrit, @mobrovac wrote:

I realise that. A simple thing that pops to mind now is to simply have a config var that tells MW whether to remove the header on incoming reqs or not

Defaulted to "never accept the incoming header" and reconfigured in Wikimedia's configuration, I agree that should work.

Or we could somehow let code from LocalSettings.php override the default header, and in Wikimedia's CommonSettings.php we could do that (like we do for some other hooks already).

Change 487544 merged by jenkins-bot:
[mediawiki/core@master] MWHttpRequest: Include the request ID in outgoing HTTP requests

https://gerrit.wikimedia.org/r/487544

ReleaseTaggerBot added a project: MW-1.33-notes (1.33.0-wmf.17; 2019-02-12).Feb 8 2019, 8:00 PM

In T201409#4939220, @Anomie wrote:

Defaulted to "never accept the incoming header" and reconfigured in Wikimedia's configuration, I agree that should work.

That's what I was thinking, yes. Instead of making this only about the request ID, we could generalise the config setting and have it cover unsafe headers in general, so that we don't end up with a gazillion variables, one for each header.

Or we could somehow let code from LocalSettings.php override the default header, and in Wikimedia's CommonSettings.php we could do that (like we do for some other hooks already).

What would be the benefit here as opposed to the config setting? Since defaulting the config to false would make MW's default installation secure (and, on the other hand, simple for us to override it for our needs), I have a hard time understanding why would we complicate things.
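
A sketch of the generalised setting discussed here, in Python rather than MediaWiki's PHP and with hypothetical names: a set of headers the application is allowed to trust, empty by default, so that a default installation never accepts an incoming request ID.

```python
import uuid
from typing import Dict, FrozenSet


def resolve_request_id(headers: Dict[str, str],
                       trusted_headers: FrozenSet[str] = frozenset()) -> str:
    """Use the incoming ID only if the header is explicitly trusted."""
    lowered = {name.lower(): value for name, value in headers.items()}
    if "x-request-id" in trusted_headers:
        incoming = lowered.get("x-request-id")
        if incoming:
            return incoming
    # Secure default: ignore whatever the client sent and generate a fresh ID.
    return str(uuid.uuid4())
```

A deployment that strips the header at the edge would pass `frozenset({"x-request-id"})`; everyone else keeps the default and cannot have IDs injected into their logs.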

Eevans awarded a token.Mar 13 2019, 1:26 PM

Ottomata mentioned this in T220661: EventGate service runner worker occasionally killed, usually during higher load.Apr 16 2019, 9:14 PM

Change 505668 had a related patch set uploaded (by Mobrovac; owner: Mobrovac):
[mediawiki/core@master] Allow the request ID to be passed in via the X-Request-Id header

https://gerrit.wikimedia.org/r/505668

• mobrovac mentioned this in T221976: Have Varnish set the `X-Request-Id` header for incoming external requests.Apr 26 2019, 5:28 PM

• mobrovac added a subtask: T221976: Have Varnish set the `X-Request-Id` header for incoming external requests.

Change 505668 merged by jenkins-bot:
[mediawiki/core@master] Allow the request ID to be passed in via the X-Request-Id header

https://gerrit.wikimedia.org/r/505668

ReleaseTaggerBot added a project: MW-1.34-notes (1.34.0-wmf.6; 2019-05-21).May 15 2019, 9:00 AM

Change 510796 had a related patch set uploaded (by Mobrovac; owner: Mobrovac):
[operations/mediawiki-config@master] Allow MW to honour the X-Request-Id header if set

https://gerrit.wikimedia.org/r/510796

Change 510796 merged by jenkins-bot:
[operations/mediawiki-config@master] Allow MW to honour the X-Request-Id header if set

https://gerrit.wikimedia.org/r/510796

Mentioned in SAL (#wikimedia-operations) [2019-05-28T09:32:21Z] <mobrovac@deploy1001> Synchronized wmf-config/InitialiseSettings.php: Allow MW to honour the X-Request-Id header if set - T201409 (duration: 01m 12s)

• mobrovac added a subtask: T225711: Have service-template-node forward the request ID.Jun 13 2019, 10:45 AM

• mobrovac closed subtask T225711: Have service-template-node forward the request ID as Resolved.Jun 25 2019, 3:18 PM

• Nikerabbit moved this task from Approved to In progress on the TechCom-RFC (TechCom-RFC-Closed) board.Jan 15 2020, 4:14 PM

Krinkle moved this task from In progress to Approved on the TechCom-RFC (TechCom-RFC-Closed) board.Feb 19 2020, 11:06 PM

Change 582948 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[operations/puppet@production] mediawiki: Fix reqId in php7-fatal-error.php to consider X-Request-Id

https://gerrit.wikimedia.org/r/582948

Change 582948 merged by Alexandros Kosiaris:
[operations/puppet@production] mediawiki: Fix reqId in php7-fatal-error.php to consider X-Request-Id

https://gerrit.wikimedia.org/r/582948

Krinkle mentioned this in T253674: WikimediaDebug "Find in XHGui" can't find profiles.May 26 2020, 6:57 PM

Change 608643 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[mediawiki/core@master] Set X-Request-Id on responses

https://gerrit.wikimedia.org/r/c/mediawiki/core/+/608643

Change 608643 merged by jenkins-bot:
[mediawiki/core@master] Set X-Request-Id on all web responses

https://gerrit.wikimedia.org/r/608643

ReleaseTaggerBot added a project: MW-1.35-notes (1.35.0-wmf.40; 2020-07-07).Jul 3 2020, 4:00 AM

Reedy mentioned this in T261260: Strange secondary error "Class 'WebRequest' not found" in logs after errors like "extension.json is not a valid JSON file".Aug 26 2020, 1:18 AM

CDanis mentioned this in T264667: Define distributed RPC/Request TRACING strategy and tooling recommendation.Oct 5 2020, 10:44 PM

Aklapper removed subscribers: Anomie, • Imarlier.Oct 16 2020, 5:02 PM

Krinkle mentioned this in T279211: WikimediaDebug: Read reqId from X-Request-Id header instead of from HTML (wgRequestId).Apr 2 2021, 10:29 PM

Krinkle mentioned this in T283291: Strip new X-Request-Id header from non-debug responses.May 20 2021, 9:27 PM

Krinkle added a subtask: T193050: Include request id (if present) in a comment in DB queries.Jun 30 2021, 6:25 PM

Krinkle mentioned this in T291192: Update php-wmerrors page to include request ID.Sep 16 2021, 2:28 PM

Krinkle added a subtask: T291192: Update php-wmerrors page to include request ID.Sep 18 2021, 2:07 AM

Krinkle added a subtask: T279211: WikimediaDebug: Read reqId from X-Request-Id header instead of from HTML (wgRequestId).

Legoktm changed the status of subtask T193050: Include request id (if present) in a comment in DB queries from Open to Stalled.Sep 20 2021, 5:54 PM

• Mholloway unsubscribed.Sep 20 2021, 6:59 PM

Krinkle closed subtask T291192: Update php-wmerrors page to include request ID as Resolved.Sep 22 2021, 5:14 PM

BBlack moved this task from Caching to Icebox-Temp on the Traffic board.Oct 8 2021, 5:26 PM

The swap of Traffic for Traffic-Icebox in this ticket's set of tags was based on a bulk action for all tickets that are neither part of our current planned work nor clearly a recent, higher-priority emergent issue. This is simply one step in a larger task cleanup effort. Further triage of these tickets (and especially, organizing future potential project ideas from them into a new medium) will occur afterwards! For more detail, have a look at the extended explanation on the main page of Traffic-Icebox. Thank you!

AntiCompositeNumber subscribed.Oct 22 2021, 12:42 AM

Krinkle closed subtask T279211: WikimediaDebug: Read reqId from X-Request-Id header instead of from HTML (wgRequestId) as Resolved.Oct 26 2021, 1:43 AM

Ottomata mentioned this in T292586: Sticky Header: Create schema to track returning to the top of the page.Dec 16 2021, 1:59 PM