
include the 'Server:' response header in varnishkafka
Closed, ResolvedPublic5 Estimated Story Points

Description

It would be useful, for tracking down certain kinds of errors at the edge<->appserver interface, to include the Server: header from the HTTP response.

I mostly imagine using this via varnish50x logstash, although having it in webrequest tables proper doesn't seem like a bad idea either.

I've prepared https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/511690/ for this. Will emitting an extra field in varnishkafka cause any problems for Analytics systems? Or will the unknown field simply be ignored?

Event Timeline

CDanis created this task.May 23 2019, 4:31 PM
Restricted Application added a subscriber: Aklapper.May 23 2019, 4:31 PM

Change 511690 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] varnishkafka webrequest: log Server: in response as 'backend'

https://gerrit.wikimedia.org/r/511690

Volans triaged this task as Medium priority.May 23 2019, 9:58 PM
CDanis assigned this task to Ottomata.May 29 2019, 8:12 PM

Andrew, can you (or someone else) advise on rolling out this change for Analytics?

I think the minimum viable thing is having this field appear in Logstash, and for that I think it would be enough just to add it to the varnishkafka output, as written.
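For reference, varnishkafka builds its JSON output from a printf-style format string, so (sketching from memory, not quoting the actual patch, which lives in the Gerrit change above) the change amounts to adding a response-header tag along these lines:

```ini
# Hypothetical fragment of varnishkafka's webrequest format string.
# %{Server@backend}o captures the Server: response header and emits
# it under the JSON key 'backend'.
format = ... %{Server@backend}o ...
```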

elukey added a subscriber: elukey.May 29 2019, 8:15 PM

@CDanis we are currently in offsite so this needs to wait until next week :)

I'll bring this up tomorrow to my team!

@CDanis I had a chat with my team as promised, and we are +1 on adding the new field to varnishkafka but -1 on carrying it all the way to the webrequest table, which in our opinion isn't a good target for this.

This is the overall picture:

  • varnishkafka reads from Varnish's shared memory log and builds a JSON record (following the format that we are changing), then sends the JSON string to Kafka (the webrequest_etc.. topics).
  • Our ETL starts with Camus, which reads the webrequest topics and produces sequence files in Hadoop (raw data).
  • On top of these files, we apply a Hive database with a certain schema (that includes all the fields that we need).
  • We then refine this "raw" data into the wmf Hive database (adding geoip, deduping rows, etc..)
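This also answers the question from the task description: because the Hive schema is applied on read, JSON keys it doesn't know about are simply dropped rather than breaking the ETL. A loose illustration (the field names and schema here are made up, not the real webrequest schema):

```python
import json

# Hypothetical, simplified stand-in for the webrequest table schema.
WEBREQUEST_SCHEMA = {"hostname", "dt", "http_status", "uri_host", "uri_path"}

def project_to_schema(raw_json_line, schema=WEBREQUEST_SCHEMA):
    """Keep only the fields the table schema knows about;
    unknown keys (like the new 'backend' field) are ignored."""
    record = json.loads(raw_json_line)
    return {k: v for k, v in record.items() if k in schema}

line = '{"hostname": "cp3050", "http_status": "503", "backend": "Varnish"}'
row = project_to_schema(line)
# 'backend' is not in the schema, so it never reaches the table.
```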

On weblog1001 IIRC we have kafkatee that pulls from the webrequest kafka topics, filters 50x requests and sends them to logstash. The idea would be to focus, for the moment, only on the varnishkafka change and the weblog1001's pipeline, in order to get the server data in the 50x logstash dashboard but not all the way down to the Analytics pipeline.
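In spirit, the weblog1001 pipeline selects 5xx records from the JSON stream before they reach logstash. A minimal sketch of that filtering step (the record fields are illustrative, not the exact kafkatee configuration):

```python
import json

def is_5xx(record):
    """True when the webrequest record is a server error (HTTP 5xx)."""
    return str(record.get("http_status", "")).startswith("5")

# Hypothetical sample records in the webrequest JSON shape.
lines = [
    '{"http_status": "200", "backend": "mw1234.eqiad.wmnet"}',
    '{"http_status": "503", "backend": "Varnish"}',
]
errors = [r for r in map(json.loads, lines) if is_5xx(r)]
# Only the 503 record would be forwarded to logstash.
```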

Would that be ok? If so I'll merge the patch on Monday (need to roll it out gently + test on one host first with puppet disabled).

elukey claimed this task.May 31 2019, 9:47 AM
elukey added a project: User-Elukey.
elukey added a subscriber: Ottomata.
elukey moved this task from Backlog to In Progress on the User-Elukey board.May 31 2019, 10:18 AM

Change 511690 merged by Elukey:
[operations/puppet@production] varnishkafka webrequest: log Server: in response as 'backend'

https://gerrit.wikimedia.org/r/511690

Mentioned in SAL (#wikimedia-operations) [2019-06-03T06:50:44Z] <elukey> roll restart varnishkafka (via puppet) for a config change - T224236

Mentioned in SAL (#wikimedia-analytics) [2019-06-03T06:51:43Z] <elukey> add the server field to the webrequest event format in varnishkafka + roll restart of all the varnishkafkas (via puppet) - T224236

Does logstash need to be changed to read the new field? Cc: @fgiunchedi

elukey added a comment.Jun 3 2019, 7:49 AM

Nevermind, self-answered. The field is now showing up in the 50x logstash dashboard, but it seems to require a refresh of the index list (I hovered over the warning icon beside the field in a 50x event). I didn't start the field refresh since I don't know if it could cause issues for logstash.

The new field is a mixture of:

  • "Varnish"
  • "Apache"
  • "nginx/etc.."
  • "mwXXX.eqiad.wmnet"
  • other backends (restbase, ores, etc..)
elukey added a comment.Jun 3 2019, 8:18 AM

Also refreshed the index list on Kibana; no more warnings for the backend field.

ema moved this task from Triage to Caching on the Traffic board.Jun 3 2019, 3:10 PM
elukey set the point value for this task to 5.
elukey moved this task from Next Up to Done on the Analytics-Kanban board.
fdans closed this task as Resolved.Jun 3 2019, 4:03 PM

Thanks Luca! Any reason we shouldn't add the field to webrequest Hive table too?

elukey added a comment.Jun 3 2019, 5:35 PM

@Ottomata the quorum of the A-team in Mallorca said that we shouldn't add it, since it seems more like debugging info than something that should be in the webrequest raw/refined data, so I followed this decision. We can revise it anytime if you feel otherwise :)

K! Let's leave it for now, I don't mind either way.

ema added a subscriber: ema.Jun 4 2019, 1:59 PM

@CDanis thank you so much for this! Very useful. Note that the Server response header will be set to Varnish for all synthetic responses internally generated by Varnish. You can identify those by checking if the X-Cache-Status header (or X-Cache) contains "int". An example of response generated by Varnish (and not by the origin servers) is when you're asking a cache_upload server for the wrong domain:

$ curl -v -H "Host: banana.wikipedia.org" -H "X-Forwarded-Proto: https" http://cp2002.codfw.wmnet/x 2>&1 | egrep -i "X-Ca|Server:|1 404"
< HTTP/1.1 404 Requested domainname does not exist on this server
< Server: Varnish
< X-Cache: cp2002 int
< X-Cache-Status: int-front

Whenever in VCL we do return (synth($status_code, $reason)), that's a synthetic (int) response.

Those aren't the only synth responses! 503 errors generated by Varnish are also in that category and will have a Server response header value of "Varnish". They can be due to fetch errors of various kinds, including timeouts fetching from the origin, Varnish bugs, and whatnot.
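Putting ema's rule into code: a response is synthetic when X-Cache-Status (or X-Cache) contains "int". A naive substring check, just to illustrate (the header values below come from the curl example above plus a made-up origin response):

```python
def is_synthetic(headers):
    """True when the response was generated internally by Varnish,
    i.e. X-Cache-Status (or X-Cache) contains "int"."""
    return ("int" in headers.get("X-Cache-Status", "")
            or "int" in headers.get("X-Cache", ""))

# Synthetic 404 from the cache_upload example above.
synthetic = {
    "Server": "Varnish",
    "X-Cache": "cp2002 int",
    "X-Cache-Status": "int-front",
}
# Hypothetical response actually served by an origin appserver.
origin = {
    "Server": "mw1234.eqiad.wmnet",
    "X-Cache-Status": "hit-front",
}
```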

CDanis added a comment.Jun 4 2019, 2:14 PM

Indeed, thanks @ema ! I talked with @fgiunchedi some about this earlier and we tweaked the wording on the Logstash dashboard to remind users that "Varnish" appearing highly in the "top n Backends" panel is not necessarily reflective of a Varnish issue.