
Add request_id to webrequest logs as well as other event records ingested into Hadoop
Open, Medium, Public

Description

Varnish logs every request to Hadoop via varnishkafka. MediaWiki can log structured data to Hadoop via EventBus; in that case we have multiple logs for the same request in Hadoop. It would be nice if these logs could be connected, so that data available via Varnish (e.g. user agents) does not need to be duplicated; also because Varnish has better data for a few things (e.g. latency) and because cached requests are only logged from there.

Event Timeline

Tgr raised the priority of this task from to Needs Triage.
Tgr updated the task description. (Show Details)
Tgr added projects: Analytics, Varnish.
Tgr subscribed.

Approach 1: add Varnish request id (set by Varnish in the X-Varnish header) to MediaWiki

  • can these be made unique? Right now they have 9 digits; given that Wikimedia traffic is in the range of 1E10 requests per day, that range seems too small to be unique (see the rough arithmetic after this list).
  • a request might pass through multiple varnishes (edge cluster front cache, edge cluster back cache, main datacenter back cache). Which one of these goes into Kafka? (The edge-side front I presume?)
  • does this get logged currently? Analytics/Data/Webrequest mentions a sequence field but doesn't say what that is. (Hint, hint... it would be awesome if that table were better documented :)
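
To put rough numbers on the uniqueness concern in the first bullet, a minimal sketch (assuming the X-Varnish id is a 9-digit decimal value; its exact semantics may differ):

```python
# Rough arithmetic behind the uniqueness concern (assumption: the
# X-Varnish id is a 9-digit decimal value, i.e. at most 1e9 distinct ids).
id_space = 10 ** 9          # distinct 9-digit ids
requests_per_day = 1e10     # approximate Wikimedia traffic

# The id space is an order of magnitude smaller than one day's traffic,
# so each value is necessarily reused many times per day:
print(f"each id reused ~{requests_per_day / id_space:.0f}x per day")  # ~10x
```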

Approach 2: generate a unique id in MediaWiki, set it in the X-Analytics header

  • does the analytics webrequest architecture record all X-Analytics fields in a meaningful way?

Also, this will only make sense if joining data from different tables is easy / efficient enough. I'm not familiar with Hive, so I'm not sure whether that's the case (a sketch of such a join follows).
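
For illustration, the join might look as follows in PySpark over Hive tables; the mediawiki_api_request table, the request_id column, and the other column names are hypothetical, since that is exactly what this task proposes to add:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("request-id-join")
         .enableHiveSupport()   # query Hive-managed tables
         .getOrCreate())

# Hypothetical schema: a request_id column on both the Varnish-side
# webrequest table and a MediaWiki-side API log table.
joined = spark.sql("""
    SELECT wr.user_agent, wr.cache_status, mw.api_module
    FROM wmf.webrequest wr
    JOIN mediawiki_api_request mw
      ON wr.request_id = mw.request_id
""")
joined.show()
```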

Tgr set Security to None.

Analytics/Data/Webrequest mentions a sequence field but doesn't say what that is.

This is a per-host sequence number assigned by varnishkafka (I have just updated the docs accordingly), so it is not related to this use case.

a request might pass through multiple varnishes (edge cluster front cache, edge cluster back cache, main datacenter back cache). Which one of these goes into Kafka? (The edge-side front I presume?)

Yup 'frontend'.

can these be made unique? Right now they have 9 digits; given that Wikimedia traffic is in the range of 1E10 requests per day, that range seems too small to be unique.

This is a per-host sequence number assigned by varnishkafka (I have just updated the docs accordingly), so it is not related to this use case.

Hm, if you had access to both the frontend cache hostname (e.g. cp1054.eqiad.wmnet) and the sequence number in MediaWiki, then, yeah, you could correlate these. Interesting idea!
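
A minimal sketch of that correlation idea, assuming both sides can see the frontend hostname and the varnishkafka sequence number (names and values are illustrative):

```python
# Treat (frontend hostname, per-host sequence number) as a composite key
# usable as a join key on both the webrequest and MediaWiki sides.
def correlation_key(hostname: str, sequence: int) -> str:
    return f"{hostname}:{sequence}"

print(correlation_key("cp1054.eqiad.wmnet", 123456789))
# -> "cp1054.eqiad.wmnet:123456789"
```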

does the analytics webrequest architecture record all X-Analytics fields in a meaningful way?

Several services including RESTBase and Parsoid already set and forward a v1 uuid in the X-Request-Id header. This header is also set in sub-requests to the action API.

So I guess what would be needed is the following (roughly sketched in code after the list):

  • all services (including MediaWiki) should generate a random request ID at the beginning of the request, unless the request has an X-Request-Id header, in which case they should use that one
  • they should set an X-Request-Id header with the request ID on the response and all subrequests they make
  • they should include the request ID (or maybe a subset or hash of it, to limit length) in all logs (including when logging to Hadoop)
  • Varnish should cache the X-Request-Id header when it caches a response (this probably happens automatically)
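
A rough, framework-agnostic sketch of those rules (the helper names are placeholders, not an existing MediaWiki API):

```python
import uuid

def log(**fields):
    """Stand-in for the real structured logger."""
    print(fields)

def handle_request(incoming_headers: dict) -> dict:
    # Reuse the caller's id if present, otherwise mint a fresh one.
    request_id = incoming_headers.get("X-Request-Id") or str(uuid.uuid4())

    # Every log record carries the id so logs can be joined later.
    log(event="request.start", request_id=request_id)

    # The same id goes on all subrequests this service makes ...
    subrequest_headers = {"X-Request-Id": request_id}

    # ... and on the response returned to the caller.
    response_headers = {"X-Request-Id": request_id}
    return {"response": response_headers, "subrequest": subrequest_headers}

# A request with no incoming id gets a new one; a forwarded id is kept:
handle_request({})
handle_request({"X-Request-Id": "00000000-0000-4000-8000-000000000000"})
```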

Varnish should cache the X-Request-Id header when it caches a response (this probably happens automatically)

This is actually a good point to consider. Right now, we only forward X-Request-Id along the request graph, but don't return it along with the response. While it might not add much internally, returning the header does sound potentially interesting for client-side debugging.

The main disadvantage I see in returning this header would be a small increase in the response size. While this might not matter much if only done for a main 'page' request, doing it for all resources might end up adding too much bloat.

They could also be removed by Varnish - that way the webrequest log still includes them but the user does not receive them. Presumably a ~50-byte-per-request overhead is not a big deal between data centers.

Also, we have ~1E10 requests per day, so something like 1E20 should be a decent range for unique request IDs. That's 11 chars in a base64-ish encoding, so the header could be shortened to something like XRI: Npb24gZnJvb, which is just 17 extra bytes.
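
A quick check of that arithmetic, plus a hypothetical generator for such an id:

```python
import secrets
import string

# 11 base64-ish characters give 64**11 ~= 7.4e19 distinct values,
# i.e. roughly the 1E20 range mentioned above.
print(f"{64 ** 11:.1e}")  # -> 7.4e+19

# Drawing an 11-char id from a 64-character URL-safe alphabet:
alphabet = string.ascii_letters + string.digits + "-_"
request_id = "".join(secrets.choice(alphabet) for _ in range(11))
print(request_id)  # e.g. an 11-char value like "Npb24gZnJvb" above
```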

csteipp subscribed.

I think this is a good idea for operational logs, but we should not make correlating the webrequest/pageviews dataset with mediawiki logs easier, since mediawiki logs often identify the user.

but we should not make correlating the webrequest/pageviews dataset with mediawiki logs easier,

mmm... this is precisely what this ticket is about, right? "Connect Hadoop records of the same request coming via different channels" refers to the webrequest dataset (not pageviews). To be clear, the "webrequest" dataset is the one that ends up in Hadoop via Varnish.

@Nuria, I think @csteipp means the logs MediaWiki stores in its log tables, and not "operational" logs. There aren't any plans to store request ids in anything but "operational" logs, afaik.

No, I meant operational logs. I would like to discourage us being able to correlate mediawiki operational logs and the webrequest dataset, since the webrequest dataset is where most of our user's reading history is being tracked, and I'm trying to make that harder to correlate with real ids instead of easier.

I'm ok with having the request id pushed into non-webrequest datasets in Hadoop -- e.g., when we start pushing API request logs into Hadoop. But I'm not entirely clear if that's what this ticket is about, or if it's explicitly about webrequests.

In talking with @bd808 on irc yesterday, it sounds like the reason for having the unique request id tracked between webrequest and operational logs is to count how many api requests are served from varnish and don't hit the backend to fulfill T102079. It seems like we should be able to get an approximation of that number without adding the request id to webrequests. But are there other uses?

I would like to discourage us being able to correlate mediawiki operational logs and the webrequest dataset, since the webrequest dataset is where most of our user's reading history is being tracked, and I'm trying to make that harder to correlate with real ids instead of easier.

I'm also making the assumption that we still log things like username, email, and other private data in the mediawiki operational logs. If we're not doing that any more, then happy to reevaluate that assumption.

There are two use cases discussed here:

  • Connect data logged to Hadoop from the API via the PSR-3 logging system (MediaWiki\Logger\LoggerFactory) with data logged to Hadoop via Varnish. The first has various details that would be hard or impossible to extract from the web request/response (which API modules were called, what app made the call, whether it was successful); the second can account for cached requests. If the request ID serves as an "external key" by which a webrequest log record of the Varnish response (possibly served from cache) references the MediaWiki log record of the original API response, we can join the two and get accurate data for client identity, module usage etc. that takes cached requests into account. I wouldn't call either of these operational logs: although we use the same log framework (with a different endpoint) for collecting the MW API data as for operational logs, the data is collected for analytics purposes.
  • Connect various operational logs via the request ID. This could be helpful in a complex debugging task (although I have no idea how often it would be needed in practice; usually you have a good idea which component is wrong and don't need to jump between services).

The first one is the original scope of this task; the second one came up in discussion as a nice bonus that we could get with little extra effort.

A side effect of doing both is that the Hadoop records and the operational logs would be connected with each other since they use the same request ID. As far as I can see that does not offer any benefit and could be avoided if problematic (by using different IDs or hashing the ID in different ways).
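
A minimal sketch of the "hashing the ID in different ways" option, using a keyed hash so that ids from different datasets cannot be matched without both secret keys (the keys and the 16-hex-char truncation are placeholders):

```python
import hashlib
import hmac

def dataset_id(request_id: str, dataset_key: bytes) -> str:
    """Derive a dataset-specific id; joins still work within one dataset."""
    return hmac.new(dataset_key, request_id.encode(),
                    hashlib.sha256).hexdigest()[:16]

rid = "Npb24gZnJvb"
print(dataset_id(rid, b"hadoop-secret"))       # id stored in Hadoop records
print(dataset_id(rid, b"operational-secret"))  # id stored in operational logs
```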

But are there other uses?

A typical use case for adding such request ids is tracking the processing of a single request across different services. The alternative is to forward all kinds of request details (like the client IP) to all services, so that they can figure out where DOS attacks or otherwise problematic requests come from. By only forwarding the request id, most services don't need access to such sensitive details, but debugging and tracking down the source of attacks is still possible for a developer with access to the relevant logging data.

As an example, here is a summary of Facebook's fairly fancy Mystery Machine log analysis system: http://blog.acolyer.org/2015/10/07/the-mystery-machine-end-to-end-performance-analysis-of-large-scale-internet-services/

We'll remain a bit more pedestrian for a while, but having an ID to search for in logstash is already better than having only timestamps.

But are there other uses?

A typical use case for adding such request ids is tracking the processing of a single request across different services. The alternative is to forward all kinds of request details (like the client IP) to all services, so that they can figure out where DOS attacks or otherwise problematic requests come from. By only forwarding the request id, most services don't need access to such sensitive details, but debugging and tracking down the source of attacks is still possible for a developer with access to the relevant logging data.

Right, for most logging I agree. My question was specifically: other than the deduping of API requests for counting usage, is there a specific use case for the request id to be logged in webrequests?

@csteipp, yes. If request_id is on webrequests as well as on app-server-generated events, then researchers like @leila can associate certain pageviews with actions taken by MediaWiki more easily, without having to examine IP addresses.

@BBlack, would it be possible to add a uuid X-Request-Id header at the varnish level?

chasemp triaged this task as Medium priority.May 5 2016, 8:47 PM

IMHO, we should define better what we need. There's a lot of grey area and disagreement in the discussion so far. Generating a UUID-like header value at the varnish frontends to trace through user-generated requests isn't fundamentally hard (but probably not a real UUID, something much shorter and simpler and just targeted at reasonable non-collision chances globally over a certain time window, like say 14d tops).
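
To put rough numbers on "much shorter and simpler" than a real UUID, a birthday-bound estimate under the assumptions above (~1E10 requests/day, uniformly random ids, a 14-day window):

```python
# Birthday approximation: P(collision) ~= n^2 / (2 * 2^bits)
n = 14 * 1e10  # ids generated in a 14-day window
for bits in (64, 80, 96, 128):
    p = n * n / (2 * 2.0 ** bits)
    print(f"{bits} bits -> collision probability ~{min(p, 1.0):.1e}")
```

By this estimate, roughly 80-96 random bits already leave a comfortable margin, well under a full 128-bit UUID.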

Before adding any more data pieces (and agreed with @BBlack that this ticket needs more clarification), I would like to make sure we are making good use of the data we have. In this regard, we have just started collecting mw-avro data; let's go over our findings and how that data is affecting product/roadmap planning, and once we have a use case for data we do not have (other than "more data is better") we can look into this ticket.

Ottomata claimed this task.

Going to resolve this instead of declining. For EventBus generated events, including mediawiki.api-request, x_request_id is set in events as the meta.request_id field.

The task was about connecting webrequest data and MediaWiki API logs (or more generally, MediaWiki logs), though, and webrequest data does not have the request ID as far as I know.

Hm, you are right, but that is not clear from the task description. I'll edit it and leave open. I think this would be really useful. I still don't understand (or I have forgotten) all the complications though.

Ottomata renamed this task from Connect Hadoop records of the same request coming via different channels to Add request_id to webrequest logs as well as other event records ingested into Hadoop.Jul 7 2020, 8:32 PM
Ottomata updated the task description. (Show Details)

Hm, you are right, but that is not clear from the task description.

It indeed wasn't. Sorry!
T113817#1711466 states the goal more clearly (and T113817#1711474 / T113817#1803654 a more controversial possible second goal), with the caveat that PSR-3 logging of API requests has been replaced by EventBus since. But I don't think that changes much.

I still don't understand (or I have forgotten) all the complications though.

Privacy (especially wrt. logstash logs being correlatable with webrequest logs) per T113817#1711231, and (depending on the implementation) a slight increase in traffic per T113817#1704071.

The swap of Traffic for Traffic-Icebox in this ticket's set of tags was based on a bulk action for all such tickets that haven't been updated in 6 months or more. This does not imply any human judgement about the validity or importance of the task, and is simply the first step in a larger task cleanup effort. Further manual triage and/or requests for updates will happen this month for all such tickets. For more detail, have a look at the extended explanation on the main page of Traffic-Icebox. Thank you!