
Add Accept header to webrequest logs
Closed, ResolvedPublic3 Estimated Story Points

Description

For transitions between content versions, RESTBase wants to utilize content negotiation via the Accept header (T128040). Right now we want to use that to transition to the new version of the summary endpoint. However, figuring out the exact migration strategy depends on how many clients set the Accept header and which values they set there. Without this data we can only guess at the best strategy.

Could the Accept header be added to the webrequest table so that we could run queries to identify the distribution of Accept headers and identify the clients which don't send the header, or send it incorrectly?
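
To make the ask concrete, here is a minimal sketch of the kind of query this would enable, assuming the field ends up as a plain accept string column on the refined wmf.webrequest table with its usual webrequest_source/year/month/day partitions (those names are assumptions on my part, not something decided by this task):

```
-- Sketch only: distribution of Accept values for the summary endpoint,
-- assuming an `accept` string column exists on wmf.webrequest.
SELECT
  accept,
  COUNT(*) AS requests
FROM wmf.webrequest
WHERE webrequest_source = 'text'
  AND year = 2017 AND month = 7 AND day = 13
  AND uri_path LIKE '/api/rest_v1/page/summary/%'
GROUP BY accept
ORDER BY requests DESC
LIMIT 50;
```

Rows with a NULL or empty accept value would correspond to clients that don't send the header at all.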

Event Timeline

Pchelolo renamed this task from Add Accept header to web request Eventlogging to Add Accept header to webrequest table.Jul 13 2017, 5:29 PM
Ottomata renamed this task from Add Accept header to webrequest table to Add Accept header to webrequest logs.Jul 13 2017, 5:29 PM

This shouldn't be difficult. We'd need to configure varnishkafka to emit this header as part of the JSON webrequest log to Kafka, and alter the webrequest Hive table(s) to add this field.
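
For illustration, the Hive side could look roughly like the sketch below (assuming the new column is simply called accept; the real DDL is whatever ends up in the refinery patch). The varnishkafka side would presumably be one more request-header entry in the JSON format string, along the lines of %{Accept@accept}i.

```
-- Sketch, not the actual refinery DDL: add the new field to the refined
-- webrequest table. CASCADE propagates the column to existing partitions;
-- the raw (JSON-backed) webrequest table would need a matching change.
ALTER TABLE wmf.webrequest
  ADD COLUMNS (accept STRING COMMENT 'Accept header of the request')
  CASCADE;
```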

Nuria triaged this task as Medium priority.Mar 29 2018, 11:02 PM
Nuria moved this task from Wikistats to Smart Tools for Better Data on the Analytics board.

We're finally implementing the feature it's needed for, so could you please prioritize this?

@Pchelolo Since the Accept header is of little interest (I think, please correct me if I am wrong) for anything other than this question, can't we extract this data temporarily from a varnish dump and look at it? We have done similar things to quickly inspect the DNT header, for example. I might be wrong, but I cannot think of many uses for the Accept header, so it seems a bit overkill to modify the webrequest table just for this purpose.

@Nuria I wasn't aware of that option. Basically, we will need to do this several times over the course of the next few months. If it's easy enough to do, I'm all for it.

We would need quite a large sample though - external requests for Parsoid HTML do not come at a very high rate and we don't want to miss important clients.

Pinging @Ottomata and @elukey to make sure they are OK with the idea.

Added the Traffic team so they can follow up with their thoughts. It is relatively easy to create a varnishlog instance on a caching host and tee/dump the traffic (properly filtered to avoid filling up the host's disk), but it would be needed on multiple hosts to be representative, and it might become an availability risk (for the reason explained above: IIUC it would need to run for a long time) and/or a source of confusion for people debugging other issues.

There are two other possible options if we don't want to add the Accept header to webrequest:

  • create a new (temporary) varnishkafka instance that pushes data to Kafka (only what is needed).
  • use logstash as @Vgutierrez did when analyzing TLS data recently, but I am not completely familiar with the details of his work.

I've used logstash in the past to track TLS handshake parameters for 0.10% of our traffic, and that was already a pretty big amount of data for what logstash expects. If it's a pretty small amount of requests, logstash could be the easiest way, but it's not prepared to handle large volumes of traffic the way our Kafka infrastructure is.

Hm, either of these solutions is fine, but even if nobody else has asked for Accept, it might be something fairly interesting to just include in the full webrequest logs, no? It would be easy to add and might be nice to have for other analysis purposes!

@Pchelolo we discussed this in standup today.

If the data you need is small enough (can we filter on a URI?) and you only need a sample (say from a single cache host), AND if traffic folks don't mind, then we can probably set up a one-off varnishlog process to just scrape what you need for a few days at a time.

Else adding the Accept header to webrequest logs should be easy enough, so we should just go ahead and do that.

> If the data you need is small enough (can we filter on a URI?) and you only need a sample (say from a single cache host), AND if traffic folks don't mind, then we can probably set up a one-off varnishlog process to just scrape what you need for a few days at a time.

We can regex on /api/rest_v1/page/html/ - that's the only thing we're interested in. The req rate for this endpoint is not super high, and 1 day of data should be enough. So I think this will work.

While the proposed solution will work for us in this case, I second @Ottomata's thought that having this header (or the lack thereof) included in the logs seems like a good idea generally (especially if it's easy to fill in).

We did enable the feature after all by looking at requests reaching RESTBase, but that's not very convenient.

Technically this is no longer required. However, moving forward, as Parsoid adds more and more versions, we will need to support more and more transformations, and that might get out of control pretty soon, especially if we have to support the full transformation matrix where anything can be transformed into anything, so we will need a reliable way to deprecate older versions. Being able to run a simple query to analyze what percentage of traffic is using which version, and to make an informed decision about when we can deprecate/delete older versions, would be really valuable.
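
As a concrete (hypothetical) illustration of the kind of query meant here, assuming the field lands in wmf.webrequest as accept and that clients send the versioned profile parameter RESTBase uses (e.g. profile="https://www.mediawiki.org/wiki/Specs/HTML/2.0.0"):

```
-- Sketch: external requests for Parsoid HTML broken down by the HTML spec
-- version requested in the Accept header (a blank/NULL version means no
-- versioned profile was sent).
SELECT
  regexp_extract(accept, 'Specs/HTML/([0-9.]+)', 1) AS html_version,
  COUNT(*) AS requests
FROM wmf.webrequest
WHERE webrequest_source = 'text'
  AND year = 2018 AND month = 10 AND day = 11
  AND uri_path LIKE '/api/rest_v1/page/html/%'
GROUP BY regexp_extract(accept, 'Specs/HTML/([0-9.]+)', 1)
ORDER BY requests DESC;
```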

If it does not present any technical difficulties on the Hadoop/Varnish side, it would still be greatly appreciated. If you still think it isn't worth it, please close this as declined, but (quote) "I'll be back".

Ok, @Pchelolo gets the persistence award! Let me understand: are there other headers we would need besides the accept one? (so we can do all these at once)

> Ok, @Pchelolo gets the persistence award!

Yay! I've got the award!

> Let me understand: are there other headers we would need besides the accept one? (so we can do all these at once)

I do not foresee anything else. All the negotiation for all the endpoints will work via the Accept header. Language variant negotiation uses Accept-Language, but there's nothing to analyze about it in the long run.

☝️ Best I could do at short notice…

Change 464563 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Add Accept header to varnishkafka webrequest logs

https://gerrit.wikimedia.org/r/464563

Change 464574 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[analytics/refinery@master] Add accept header to webrequest raw and refined tables

https://gerrit.wikimedia.org/r/464574

Change 464563 merged by Ottomata:
[operations/puppet@production] Add Accept header to varnishkafka webrequest logs

https://gerrit.wikimedia.org/r/464563

Just added the accept field to the varnishkafka-generated webrequest logs.

@JAllemandou I haven't done this in a while, so I'll ping you in my morning tomorrow to help me do the webrequest alter and make sure I don't forget something crucial.

@Ottomata - After standup please, as today is kids-day for me :)

Change 464574 merged by Ottomata:
[analytics/refinery@master] Add accept header to webrequest raw and refined tables

https://gerrit.wikimedia.org/r/464574

Mentioned in SAL (#wikimedia-operations) [2018-10-10T18:05:22Z] <otto@deploy1001> Started deploy [analytics/refinery@4e2d956]: Add accept header to webrequest logs - T170606

Mentioned in SAL (#wikimedia-operations) [2018-10-10T18:09:58Z] <otto@deploy1001> Finished deploy [analytics/refinery@4e2d956]: Add accept header to webrequest logs - T170606 (duration: 04m 35s)

Mentioned in SAL (#wikimedia-operations) [2018-10-10T18:10:03Z] <otto@deploy1001> Started deploy [analytics/refinery@28bbee8]: Add accept header to webrequest logs - T170606

Mentioned in SAL (#wikimedia-operations) [2018-10-10T18:20:37Z] <otto@deploy1001> Finished deploy [analytics/refinery@28bbee8]: Add accept header to webrequest logs - T170606 (duration: 10m 34s)

Ottomata set the point value for this task to 3.