
Get 5xx logs into kibana/logstash
Closed, ResolvedPublic

Description

Given kibana/logstash analysis capabilities I think it would be good to have 5xx logs there too. It would help with breaking down errors by different dimensions to debug what's going on, which can be especially useful during outages.
Implementation-wise we have started with kafkatee on oxygen sending logs to logstash. Other options include a separate Kafka topic just for errors, to be consumed by logstash, for example.

Initial implementation using kafkatee on oxygen:

Details

Related Gerrit Patches:

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript · Oct 28 2016, 9:13 PM

It's actually possible to make logstash read things directly from a kafka topic (https://phabricator.wikimedia.org/diffusion/OPUP/browse/production/modules/role/manifests/logstash.pp;7a1700deba305d800cc5f78c4062cf62b5f8ac44$320-324). What kind of log event volume would we be adding here? Something like doubling the current hhvm.log volume?

@bd808 interesting! Not sure what the current hhvm.log volume is, though we're talking about ~20 logs/s at peak (see https://grafana.wikimedia.org/dashboard/db/varnish-http-errors?from=now-7d&to=now for per-minute stats). Depending on logstash capacity we might want to rate-limit it anyway, to avoid bringing logstash down exactly when a 500-generating outage is in progress.

~20/s is nothing to worry about, I don't think. Our MediaWiki log event volume is ~100/s and we handle spikes to 10x that fairly often.

MoritzMuehlenhoff triaged this task as Medium priority. · Nov 2 2016, 5:51 PM

> What kind of log event volume would we be adding here? Something like doubling the current hhvm.log volume?

I took another look at this and we would have to consume the full webrequest firehose and then filter for 5xx at the logstash level. IOW, before the logstash filter we're talking about ~100k messages/s and ~30MB/s of inbound network traffic, most of which the logstash filter would end up discarding.

With volume that high we might need to set up a specific logstash instance or other process just to handle the filtering. It looks like kafkatee has some filtering support; maybe it could be used to build an error-only topic by filtering the firehose?

Indeed, feeding the firehose into logstash directly isn't practical. I checked kafkatee and the filtering happens on output via grep; we could replicate the setup we have on oxygen and write to a file, then have logstash tail that file. An error-only topic is a good idea, but I don't think we have that ATM via varnishkafka (cc @elukey @Ottomata).

We could set up a special varnishkafka instance for this, if that makes sense. But, hm, I think using kafkatee would be better! kafkatee supports piped output, so we don't have to do any special log file tailing. We just add a configuration to pipe the grepped logs to whatever process we want (netcat, perhaps?). Is there a logstash stdin producer? This would look something like:

output pipe 1 /bin/grep --line-buffered '"http_status":"5' | logstash-stdin-producer
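To make the filtering step concrete, here is a quick way to exercise the grep expression above against a couple of sample JSON records (the sample records are made up; the field name "http_status" matches the JSON output discussed in this task):

```shell
# Feed two fake webrequest records through the same grep filter kafkatee
# would use; only records whose http_status starts with 5 pass through.
printf '%s\n' \
  '{"uri_path":"/wiki/Foo","http_status":"200"}' \
  '{"uri_path":"/wiki/Bar","http_status":"503"}' \
  | grep --line-buffered '"http_status":"5'
# only the 503 record is printed
```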

Hm, actually, it might even be nicer to feed 5xx logs back into a dedicated topic in kafka. If we did that, then we could use logstash's Kafka importer to consume the full 5xx topic.

> Hm, actually, it might even be nicer to feed 5xx logs back into a dedicated topic in kafka. If we did that, then we could use logstash's Kafka importer to consume the full 5xx topic.

That'd be nice to have! And yes I think with an errors-only topic we can consume directly from logstash
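For reference, consuming such an errors-only topic from logstash might look roughly like this (the topic name, broker, and group id are hypothetical, and parameter names vary between logstash-input-kafka plugin versions):

```
input {
  kafka {
    bootstrap_servers => "kafka1001.eqiad.wmnet:9092"  # hypothetical broker
    topics            => ["webrequest_5xx"]            # hypothetical errors-only topic
    group_id          => "logstash"
    codec             => json
  }
}
```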

elukey added a comment (edited). · Dec 13 2016, 8:48 AM

Adding some info about the varnishkafka instance: we could filter 5xx response statuses from all the varnish frontends and send them directly to Kafka, and then consume them as we wish. This would duplicate data already present in the webrequest topics, but it would be easier to handle. The best thing ever would be to consume the webrequest topic, filter 5xx and push them back to Kafka in another topic!

> The best thing ever would be to consume the webrequest topic, filter 5xx and push them back to kafka in another topic, but we are probably not ready for this :)

I think kafkatee + kafkacat for this would work pretty well.

> The best thing ever would be to consume the webrequest topic, filter 5xx and push them back to kafka in another topic, but we are probably not ready for this :)

> I think kafkatee + kafkacat for this would work pretty well.

What about Spark streaming? Too much? It would be a good occasion to use it...

> What about Spark streaming? Too much? It would be a good occasion to use it...

Yeah it would! But I wouldn't feel good about maintaining this in the Analytics Cluster. We'd need another place (Kubernetes?! :o) to run it.

> We could set up a special varnishkafka instance for this, if that makes sense. But, hm, I think using kafkatee would be better! kafkatee supports piped output, so we don't have to do any special log file tailing. We just add a configuration to pipe the grepped logs to whatever process we want (netcat, perhaps?). Is there a logstash stdin producer? This would look something like:
>
> output pipe 1 /bin/grep --line-buffered '"http_status":"5' | logstash-stdin-producer

Re-reading the task, this solution seems the simplest to implement now. Does kafkacat have the ability to run from multiple machines with each consuming part of the topic (via a consumer group, I think)? This way we could run kafkacat on all logstash frontends and it'd be easy e.g. to roll-restart without losing ingestion.

It should! But I haven’t tried it.

General options:

  -C | -P | -L       Mode: Consume, Produce or metadata List
  -G <group-id>      Mode: High-level KafkaConsumer (Kafka 0.9 balanced
                           consumer groups). Expects a list of topics to
                           subscribe to.
elukey added a comment. · Apr 6 2017, 1:25 PM

Catching up with this task... So kafkatee seems not to support group-ids, while kafkacat does:

https://github.com/edenhill/kafkacat

High-level balanced KafkaConsumer: subscribe to topic1 and topic2 (requires broker >=0.9.0 and librdkafka version >=0.9.1)

$ kafkacat -b mybroker -G mygroup topic1 topic2

Kafkatee is puppetized and can run as a daemon; kafkacat is not puppetized and seems to be only a simple CLI tool.

@Ottomata am I missing something? Can we use kafkacat instead of kafkatee for this use case?

For simply writing to a log file, ya I think kafkacat would be just fine.

But, even though kafkatee doesn't have consumer offset commits (via group ids), it does enable local filesystem based offset storage. We aren't trying to balance multiple consumers here, so there's no reason to distribute. Local filesystem offset commits will work just fine (this is similar to how both Camus and Kafka Connect HDFS Connector store offsets in HDFS instead of in Kafka).
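To illustrate, kafkatee's stored-offset consumption is configured roughly along these lines (property names follow kafkatee.conf conventions, but the topic name, partition range, and paths here are assumptions, not the actual production values):

```
# Pass librdkafka topic properties through kafkatee to enable
# local, file-based offset commits.
kafka.topic.auto.commit.enable = true
kafka.topic.auto.commit.interval.ms = 5000
kafka.topic.offset.store.path = /var/cache/kafkatee/offsets

# Resume each partition from its locally stored offset.
input [encoding=json] kafka topic webrequest_text partition 0-23 from stored
```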

Summary after a chat with Filippo and Andrew.

There are two viable approaches:

  1. Implement a Kafka 50X topic and use the Logstash input for Kafka (logstash::input::kafka in puppet).
  2. Create a new kafkatee instance on oxygen that consumes webrequests (filtering only 50X) and pushes to a specific Logstash port configured for the job.

Both solutions require a bit of work, but 2) seems the quickest way to have a prototype ready in days rather than months. The major downside at the moment is not having LVS/Pybal in front of the logstash hosts (so we need to target a single instance from oxygen), but it could be a good driver to push its priority up :)
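For approach 2), the logstash side would essentially be a plain socket input receiving the JSON records kafkatee pipes over; a minimal sketch (the port number is hypothetical):

```
input {
  udp {
    port  => 11514   # hypothetical port for the kafkatee 5xx feed
    codec => json    # records arrive as one JSON object per line
  }
}
```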

After a quick chat with my team it seems that there are no data privacy/retention concerns for a small subset of webrequests on logstash/el.

Change 350817 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] [WIP] Send 5xx from kafkatee to logstash

https://gerrit.wikimedia.org/r/350817

elukey moved this task from Backlog to In Progress on the User-Elukey board. · May 2 2017, 4:27 PM
fgiunchedi moved this task from Backlog to Doing on the User-fgiunchedi board. · May 5 2017, 1:28 PM

Change 350817 merged by Filippo Giunchedi:
[operations/puppet@production] Send 5xx from kafkatee to logstash

https://gerrit.wikimedia.org/r/350817

fgiunchedi added a comment (edited). · May 8 2017, 10:59 AM

Basic kafkatee -> logstash functionality has been added; left to do:

  • kafkatee blocks signals for its children, preventing, among other things, SIGPIPE from cleaning up the pipelines; for a proposed fix see https://gerrit.wikimedia.org/r/#/c/352591/
  • Kibana dashboards
  • Move the hostname field into host so logs appear to originate from the varnish machines and not from oxygen
  • Reconstruct the URL into the url field so e.g. the "Top URLs" visualization does the right thing
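The last two items could be handled with a logstash mutate filter along these lines (the field names uri_host and uri_path come from the webrequest schema; the type condition and URL scheme are assumptions):

```
filter {
  if [type] == "webrequest" {
    mutate {
      # make the event appear to come from the varnish host, not oxygen
      replace      => { "host" => "%{hostname}" }
      remove_field => [ "hostname" ]
      # rebuild a full URL so the "Top URLs" visualization works
      add_field    => { "url" => "http://%{uri_host}%{uri_path}" }
    }
  }
}
```

Note that within a single mutate block, add_field and remove_field run after replace, so hostname is still available when host is rewritten.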

Change 353282 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] logstash: build http_request from webrequest fields

https://gerrit.wikimedia.org/r/353282

Change 353853 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] logstash: move 'hostname' to 'host' for webrequest

https://gerrit.wikimedia.org/r/353853

fgiunchedi updated the task description. (Show Details) · May 15 2017, 10:51 AM

Change 353282 merged by Filippo Giunchedi:
[operations/puppet@production] logstash: build http_request from webrequest fields

https://gerrit.wikimedia.org/r/353282

Change 353853 merged by Filippo Giunchedi:
[operations/puppet@production] logstash: move 'hostname' to 'host' for webrequest

https://gerrit.wikimedia.org/r/353853

Mentioned in SAL (#wikimedia-operations) [2017-05-23T10:27:18Z] <godog> upload kafkatee 0.1.5 to jessie-wikimedia, remove unused kafkatee 0.1.4 from trusty-wikimedia - T149451

Change 355220 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: move webrequest 5xx to logstash.svc

https://gerrit.wikimedia.org/r/355220

Change 355220 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: move webrequest 5xx to logstash.svc

https://gerrit.wikimedia.org/r/355220

fgiunchedi closed this task as Resolved. · May 23 2017, 3:25 PM
fgiunchedi claimed this task.
fgiunchedi updated the task description. (Show Details)

All done! The Kibana dashboard https://logstash.wikimedia.org/app/kibana#/dashboard/Varnish-Webrequest-50X could use some improvement but is Good Enough for now.