Connect Hadoop records of the same request coming via different channels
Open, Normal, Public

Description

Varnish logs every request to Hadoop via varnishkafka. MediaWiki can log structured data to Hadoop via AvroFormatter; in that case we have multiple logs for the same request in Hadoop. It would be nice if these logs could be connected, so that data available via Varnish (e.g. user agents) does not need to be duplicated; also because Varnish has better data for a few things (e.g. latency) and because cached requests are only logged from there.

Tgr created this task. Sep 25 2015, 10:38 PM
Tgr updated the task description. (Show Details)
Tgr raised the priority of this task to Needs Triage.
Tgr added projects: Analytics, Varnish.
Tgr added a subscriber: Tgr.
Restricted Application added a subscriber: Aklapper. Sep 25 2015, 10:38 PM
Tgr added a comment. Sep 25 2015, 10:39 PM

Approach 1: add Varnish request id (set by Varnish in the X-Varnish header) to MediaWiki

  • can these be made unique? right now they have 9 digits (1E9 possible values); given that Wikimedia traffic is in the range of 1E10 requests per day, that does not seem sufficient for uniqueness.
  • a request might pass through multiple varnishes (edge cluster front cache, edge cluster back cache, main datacenter back cache). Which one of these goes into Kafka? (The edge-side front I presume?)
  • does this get logged currently? Analytics/Data/Webrequest mentions a sequence field but doesn't say what that is. (Hint, hint... it would be awesome if that table were better documented :)
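The uniqueness concern in the first bullet can be checked with a quick birthday-bound estimate (a sketch in Python; the traffic figure is the rough 1E10 requests/day mentioned above):

```python
import math

ID_SPACE = 10 ** 9           # 9-digit X-Varnish ids
REQUESTS_PER_DAY = 10 ** 10  # rough Wikimedia traffic figure from this task

# With more requests per day than possible id values, repeats are certain
# (pigeonhole); by the birthday bound, the first duplicate is already
# expected after roughly sqrt(2 * ID_SPACE) requests.
expected_first_collision = math.sqrt(2 * ID_SPACE)
print(round(expected_first_collision))   # ~45,000 requests
print(REQUESTS_PER_DAY > ID_SPACE)       # True
```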

Approach 2: generate a unique id in MediaWiki, set it in the X-Analytics header

  • does the analytics webrequest architecture record all X-Analytics fields in a meaningful way?
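Approach 2 could be sketched like this, assuming X-Analytics is a semicolon-separated key=value header; the `req_id` field name and id length are illustrative, not an existing convention:

```python
import secrets

def add_request_id(x_analytics: str) -> str:
    """Append a randomly generated request id to an X-Analytics-style
    header value (semicolon-separated key=value pairs)."""
    request_id = secrets.token_hex(8)  # 64 random bits, hex-encoded
    pair = f"req_id={request_id}"
    return f"{x_analytics};{pair}" if x_analytics else pair

print(add_request_id("ns=0;page_id=123"))
# e.g. ns=0;page_id=123;req_id=3f9c0a1b2c3d4e5f
```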
Tgr added a comment. Sep 25 2015, 11:33 PM

Also, this will only make sense if joining data from different tables is easy / efficient enough. I'm not familiar with Hive, so I'm not sure whether that's the case.

Tgr updated the task description. (Show Details) Sep 25 2015, 11:33 PM
Tgr set Security to None.
Nuria added a subscriber: Nuria. Sep 28 2015, 2:31 PM

Analytics/Data/Webrequest mentions a sequence field but doesn't say what that is.

This is the per-host sequence number assigned by varnishkafka (I have just updated the doc), so it is not related to this use case.

a request might pass through multiple varnishes (edge cluster front cache, edge cluster back cache, main datacenter back cache). Which one of these goes into Kafka? (The edge-side front I presume?)

Yup 'frontend'.

can these be made unique? right now they have 9 digits; given that Wikimedia traffic is in the range of 1E10 requests per day, that does not seem useful.
This is the per-host sequence number assigned by varnishkafka (I have just updated the doc), so it is not related to this use case.

Hm, if you had access to both the frontend cache hostname (e.g. cp1054.eqiad.wmnet) and the sequence number in Mediawiki, then, yeah, you could correlate these. Interesting idea!
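The correlation idea above amounts to building a composite join key from the frontend hostname and the per-host sequence number; a minimal sketch (the separator and function name are illustrative):

```python
def correlation_key(frontend_host: str, sequence: int) -> str:
    """Build a join key from the varnishkafka (host, sequence) pair.

    The pair is unique because varnishkafka assigns the sequence
    number per host, so host + sequence identifies one request.
    """
    return f"{frontend_host}#{sequence}"

print(correlation_key("cp1054.eqiad.wmnet", 123456789))
# cp1054.eqiad.wmnet#123456789
```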

does the analytics webrequest architecture record all X-Analytics fields in a meaningful way?

Tgr added a subscriber: GWicke. Oct 5 2015, 11:14 PM

Several services including RESTBase and Parsoid already set and forward a v1 uuid in the X-Request-Id header. This header is also set in sub-requests to the action API.

So I guess what would be needed is

  • all services (including MediaWiki) should generate a random request ID at the beginning of the request, unless the request already has an X-Request-Id header, in which case they should use that one
  • they should set an X-Request-Id header with the request ID on the response and all subrequests they make
  • they should include the request ID (or maybe a subset or hash of it, to limit length) in all logs (including when logging to Hadoop)
  • Varnish should cache the X-Request-Id header when it caches a response (this probably happens automatically)
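The first three steps above can be sketched as WSGI-style middleware (an illustrative sketch, not any service's actual implementation; the X-Request-Id header name is from this discussion, everything else is assumed):

```python
import uuid

class RequestIdMiddleware:
    """Reuse an incoming X-Request-Id or mint a new v1 UUID, expose it
    to the app for logging and sub-requests, and echo it on the response."""

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        # Reuse the caller's id if present, otherwise generate one.
        req_id = environ.get("HTTP_X_REQUEST_ID") or str(uuid.uuid1())
        environ["request.id"] = req_id  # available for logs and subrequests

        def start_with_id(status, headers, exc_info=None):
            # Echo the id on the response so callers (and caches) see it.
            headers = list(headers) + [("X-Request-Id", req_id)]
            return start_response(status, headers, exc_info)

        return self.app(environ, start_with_id)
```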

Varnish should cache the X-Request-Id header when it caches a response (this probably happens automatically)

This is actually a good point to consider. Right now, we only forward X-Request-Id along the request graph, but don't return it along with the response. While it might not add much internally, returning the header does sound potentially interesting for client-side debugging.

The main disadvantage I see in returning this header would be a small increase in the response size. While this might not matter much if only done for a main 'page' request, doing it for all resources might end up adding too much bloat.

Tgr added a comment. Oct 5 2015, 11:36 PM

They could also be removed by Varnish - that way the webrequest log still includes them but the user does not receive them. Presumably a 50 B/request overhead is not a big deal between data centers.

Also, we have ~1E10 requests per day so something like 1E20 should be a decent range for unique request IDs. That's 11 chars in a base64-ish encoding so the header could be shortened to something like XRI: Npb24gZnJvb which is just 17 extra bytes.
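A quick check of the figures in the comment above (base64 carries 6 bits per character, so 11 characters cover 2^66 values, which is in the ballpark of the 1E20 target):

```python
DAILY_REQUESTS = 10 ** 10  # rough traffic figure from this task

id_space = 64 ** 11        # 11 base64-ish characters = 66 bits
print(f"{id_space:.2e}")                 # ~7.38e+19, roughly the 1E20 target
print(id_space // DAILY_REQUESTS)        # ~7.4e9 days of traffic to exhaust the space
```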

bd808 added a subscriber: bd808. Oct 6 2015, 3:13 AM

Some discussion related to this topic starts at T108618#1698925 and continues in T108618#1702008, T108618#1702049, T108618#1702327 and T108618#1703767.

csteipp added a subscriber: csteipp.

I think this is a good idea for operational logs, but we should not make correlating the webrequest/pageviews dataset with mediawiki logs easier, since mediawiki logs often identify the user.

Nuria added a comment. Oct 7 2015, 3:03 PM

but we should not make correlating the webrequest/pageviews dataset with mediawiki logs easier,

mmm... this is precisely what this ticket is about, right? "Connect Hadoop records of the same request coming via different channels" refers to the webrequest dataset (not pageviews). To be clear, the "webrequest" dataset is the one that ends up in Hadoop via Varnish.

@Nuria, I think @csteipp means the logs MediaWiki stores in its log tables, and not "operational" logs. There aren't any plans to store request ids in anything but "operational" logs, afaik.

No, I meant operational logs. I would like to discourage us being able to correlate mediawiki operational logs and the webrequest dataset, since the webrequest dataset is where most of our user's reading history is being tracked, and I'm trying to make that harder to correlate with real ids instead of easier.

I'm ok with having the request id pushed into non-webrequest datasets in hadoop-- e.g., when we start pushing api request logs into hadoop. But I'm not entirely clear if that's what this ticket is about, or if it's explicitly about webrequests.

In talking with @bd808 on irc yesterday, it sounds like the reason for having the unique request id tracked between webrequest and operational logs is to count how many api requests are served from varnish and don't hit the backend to fulfill T102079. It seems like we should be able to get an approximation of that number without adding the request id to webrequests. But are there other uses?

I would like to discourage us being able to correlate mediawiki operational logs and the webrequest dataset, since the webrequest dataset is where most of our user's reading history is being tracked, and I'm trying to make that harder to correlate with real ids instead of easier.

I'm also making the assumption that we still log things like username, email, and other private data in the mediawiki operational logs. If we're not doing that any more, then happy to reevaluate that assumption.

Tgr added a comment. Edited Oct 8 2015, 1:29 AM

There are two use cases discussed here:

  • Connect data logged to Hadoop from the API via the PSR-3 logging system (MediaWiki\Logger\LoggerFactory) with data logged to Hadoop via Varnish. The first has various details that would be hard or impossible to extract from the web request/response (which API modules were called, what app made the call, was it successful); the second can account for cached requests. If the request ID serves as an "external key" by which a webrequest log record of the Varnish response (possibly served from cache) references the MediaWiki log record of the original API response, we can join the two and get accurate data for client identity, module usage etc. which takes cached requests into account. I wouldn't call either of these operational logs, although we use the same log framework (but a different endpoint) for collecting the MW API data as for operational logs; but they are collected for analytics purposes.
  • Connect various operational logs via the request ID. This could be helpful in a complex debugging task (although I have no idea how often it would be needed in practice; usually you have a good idea which component is wrong and don't need to jump between services).

The first one is the original scope of this task; the second one came up in discussion as a nice bonus that we could get with little extra effort.
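The join in the first use case could look like this in miniature (illustrative Python standing in for a Hive join; all field names are hypothetical):

```python
# webrequest rows logged via Varnish; mw_api rows logged via AvroFormatter.
webrequest = [
    {"request_id": "r1", "user_agent": "FooApp/1.0", "cache_status": "hit"},
    {"request_id": "r2", "user_agent": "BarBot/2.0", "cache_status": "miss"},
]
mw_api = [
    {"request_id": "r2", "module": "query", "success": True},
]

mw_by_id = {row["request_id"]: row for row in mw_api}

# Left join on request_id: every Varnish-side record, enriched with
# MediaWiki details whenever the request actually reached the backend
# (cache hits have no MediaWiki-side record at all).
joined = [{**w, **mw_by_id.get(w["request_id"], {})} for w in webrequest]
print(joined[1]["module"])        # query
print("module" in joined[0])      # False: served from cache
```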

A side effect of doing both is that the Hadoop records and the operational logs would be connected with each other since they use the same request ID. As far as I can see that does not offer any benefit and could be avoided if problematic (by using different IDs or hashing the ID in different ways).
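The decoupling idea at the end (hashing the ID in different ways per dataset) could be done with keyed hashes, so records still join within a dataset but cannot be cross-joined without the keys. A sketch; the dataset names and keys are purely illustrative:

```python
import hashlib
import hmac

# One secret key per dataset. Records within a dataset join on the
# derived id; without both keys, datasets cannot be correlated.
KEYS = {
    "webrequest": b"hypothetical-webrequest-key",
    "operational": b"hypothetical-operational-key",
}

def dataset_id(request_id: str, dataset: str) -> str:
    digest = hmac.new(KEYS[dataset], request_id.encode(), hashlib.sha256)
    return digest.hexdigest()[:16]

rid = "f81d4fae-7dec-11e5-b5d2-0800200c9a66"
print(dataset_id(rid, "webrequest") == dataset_id(rid, "webrequest"))   # True
print(dataset_id(rid, "webrequest") == dataset_id(rid, "operational"))  # False
```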

GWicke added a comment. Edited Oct 8 2015, 1:31 AM

But are there other uses?

A typical use case for adding such request ids is tracking the processing of a single request across different services. The alternative is to forward all kinds of request details (like the client IP) to all services, so that they can figure out where DOS attacks or otherwise problematic requests come from. By only forwarding the request id, most services don't need to have access to such sensitive details, but debugging and tracing down the source of attacks is still possible for a developer with access to the relevant logging data.

As an example, here is a summary of Facebook's fairly fancy Mystery Machine log analysis system: http://blog.acolyer.org/2015/10/07/the-mystery-machine-end-to-end-performance-analysis-of-large-scale-internet-services/

We'll remain a bit more pedestrian for a while, but having an ID to search for in logstash is already better than having only timestamps.

But are there other uses?

A typical use case for adding such request ids is tracking the processing of a single request across different services. The alternative is to forward all kinds of request details (like the client IP) to all services, so that they can figure out where DOS attacks or otherwise problematic requests come from. By only forwarding the request id, most services don't need to have access to such sensitive details, but debugging and tracing down the source of attacks is still possible for a developer with access to the relevant logging data.

Right, for most logging I agree. My question was specifically, other than the deduping of API requests for counting usage, is there a specific use case for request id to be logged in webrequests.

Ottomata added a subscriber: leila. Nov 13 2015, 2:49 PM

@csteipp, yes. If request_id is on webrequests as well as on app server generated events, then researchers like @leila can associate certain pageviews with actions taken by mediawiki more easily, without having to examine IP addresses.

Restricted Application added a subscriber: StudiesWorld. Nov 13 2015, 2:49 PM

@BBlack, would it be possible to add a uuid X-Request-Id header at the varnish level?

Tbayer added a subscriber: Tbayer. Nov 17 2015, 11:46 PM
Milimetric moved this task from Incoming to Radar on the Analytics board. Jan 12 2016, 7:32 PM
Restricted Application added a project: Operations. May 4 2016, 9:13 AM
chasemp triaged this task as Normal priority. May 5 2016, 8:47 PM
BBlack added a comment. May 5 2016, 9:52 PM

IMHO, we should define better what we need. There's a lot of grey area and disagreement in the discussion so far. Generating a UUID-like header value at the varnish frontends to trace through user-generated requests isn't fundamentally hard (but probably not a real UUID, something much shorter and simpler and just targeted at reasonable non-collision chances globally over a certain time window, like say 14d tops).

Nuria added a comment. May 6 2016, 2:56 PM

Before adding any more data pieces (and I agree with @BBlack that this ticket needs more clarification), I would like to make sure we are making good use of the data we already have. In that regard, we have just started collecting mw-avro data; let's go over our findings and how that data is affecting product/roadmap planning, and once we have a use case for data we don't have (other than "more data is better") we can look into this ticket.

BBlack moved this task from Triage to Watching on the Traffic board. Oct 4 2016, 1:18 PM