
Resurrect eventlogging_EventError logging in logstash
Closed, Resolved · Public

Description

We used to send eventlogging_EventError events to logstash so that developers had more visibility into their event schema validation errors. We should do this again, especially as EL usage is ramping up.

Event Timeline

fdans moved this task from Incoming to Operational Excellence on the Analytics board.

I'd actually be interested in picking this up as a 10% project to better understand our logging infrastructure and deployment of the various EL services. If that'd slow things down a little too much, then I understand – I'd rather see this resolved 🙂

Sam, that'd be great! Find me and Marcel (mforns) on IRC in #wikimedia-analytics and let's discuss. Actually... @fgiunchedi or @herron might have something helpful to say, as they are currently implementing some new Kafka -> Logstash integration.

We have a JSON Kafka topic in the jumbo-eqiad Kafka cluster that we'd want to expose in logstash. What's the best way of doing this today?

Likely adding a new kafka input for logstash, though how much data are we talking about?

Likely adding a new kafka input for logstash, though how much data are we talking about?

Not much! https://grafana.wikimedia.org/dashboard/db/eventlogging?orgId=1&panelId=13&fullscreen&from=now-7d&to=now-5m

It may change though if somebody deploys a new eventlogging event that causes validation errors, but usually the rate is pretty low :)

Seems small enough – I see peaks of 200 events/s a few weeks back; I guess those were the validation errors you mentioned, @elukey! Anyway, yes, I think we can do that. The configuration is already in place in role::logstash::eventlogging, but I don't know what other pieces are missing.

It looks like [[ https://github.com/wikimedia/puppet/blob/8b4f503c5186f2d4b26538e237f828d2e1b11a13/manifests/site.pp#L1334-L1338 | logstash1007.eqiad.wmnet is configured ]] to collect messages on the eventlogging_EventError topic and [[ https://github.com/wikimedia/puppet/blob/b347052863d4d2e87b37d6c2d9f44f833cfd9dc2/modules/role/manifests/logstash/eventlogging.pp#L14 | tag 'em with kafka and eventlogging_EventError ]] but I see no log lines tagged as the latter.

My bad. We actually remove the eventlogging_EventError tag before logging the log line: https://github.com/wikimedia/puppet/blob/134aff8e082c26e1c28ff2f0f76759f0e099075f/modules/role/files/logstash/filter-eventlogging.conf#L16

Edit:

Nevertheless, it seems that log lines still aren't being logged, e.g. try searching for {_,}type:eventlogging tags:es tags:kafka.

Change 475322 had a related patch set uploaded (by Phuedx; owner: Phuedx):
[operations/puppet@production] eventlogging: Tag error log lines with the schema

https://gerrit.wikimedia.org/r/475322

@herron: It looks like the only other instance of logstash::input::kafka is set up [[ https://github.com/wikimedia/puppet/blob/6d385899fb826b122893633e8a0089017b5403b6/modules/profile/manifests/logstash/collector.pp#L165-L182 | in profile::logstash::collector ]]. I note that the instance set up in role::logstash::eventlogging isn't using the JSON codec and isn't connecting to the Kafka broker securely (i.e. the $security_protocol, $ssl_truststore_location, and $ssl_truststore_password parameters are missing).

Could this explain why these messages aren't being logged?
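For reference, here's roughly what that kafka input does once a json codec is configured, sketched with kafka-python rather than Logstash configuration. This is an illustrative sketch only: the broker address and consumer group below are placeholders, not the production values.

```
# Illustrative sketch: approximately what the Logstash kafka input with a json
# codec does with each record on the eventlogging_EventError topic.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    'eventlogging_EventError',
    bootstrap_servers=['kafka-jumbo1001.eqiad.wmnet:9092'],  # placeholder broker
    group_id='logstash-eventlogging-debug',                  # hypothetical consumer group
    value_deserializer=lambda raw: json.loads(raw.decode('utf-8')),
    auto_offset_reset='latest',
)

for record in consumer:
    error_event = record.value
    # With a json codec each record becomes a structured document; without one,
    # Logstash would store the whole payload as a single opaque string.
    print(error_event.get('schema'), error_event.get('event', {}).get('message'))
```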

The Kafka security stuff shouldn't matter (yet) in this case. We'll likely have to set those one day soon.

It doesn't seem like the codec choice should affect this either.

I don't have SSH access to logstash1007 in order to check which processes are running/erroring. Should I submit an access request or can someone on this ticket have a look?

@phuedx come find me in #wikimedia-analytics in about 30 mins and we'll look together

Actually, now is fine – come find me! :)

Change 476043 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Use proper Kafka cluster for sending EventLogging errors to logstash

https://gerrit.wikimedia.org/r/476043

Change 476043 merged by Ottomata:
[operations/puppet@production] Use proper Kafka cluster for sending EventLogging errors to logstash

https://gerrit.wikimedia.org/r/476043

☝️ @Ottomata quickly spotted that the issue was that we were pointing to the wrong Kafka cluster.

The Kafka brokers were wrong! The logstash input was using the old analytics Kafka cluster, instead of jumbo. Fixed! https://logstash.wikimedia.org/app/kibana#/dashboard/default?_g=h@7bf0c26&_a=h@ecf0ee1
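For anyone debugging something similar later: a quick way to catch a wrong-cluster mistake is to list which topics each broker actually serves. A minimal sketch, assuming kafka-python is available on the host; hostnames are placeholders:

```
# Sanity check: which cluster actually serves the eventlogging_EventError topic?
from kafka import KafkaConsumer

for cluster, broker in [
    ('jumbo-eqiad', 'kafka-jumbo1001.eqiad.wmnet:9092'),   # placeholder hostnames
    ('analytics-eqiad', 'kafka1012.eqiad.wmnet:9092'),
]:
    topics = KafkaConsumer(bootstrap_servers=[broker]).topics()
    print(cluster, 'eventlogging_EventError' in topics)
```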

Change 476050 had a related patch set uploaded (by Phuedx; owner: Phuedx):
[operations/puppet@production] eventlogging/logstash: Events are encoded as JSON

https://gerrit.wikimedia.org/r/476050

Change 476050 merged by Ottomata:
[operations/puppet@production] eventlogging/logstash: Events are encoded as JSON

https://gerrit.wikimedia.org/r/476050

Change 475322 abandoned by Phuedx:
eventlogging/logstash: Make schema a field

Reason:
Per my comment above.

https://gerrit.wikimedia.org/r/475322

Change 476057 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Only rename event.message and recvFrom for eventlogging_EventError logstash

https://gerrit.wikimedia.org/r/476057

Change 476057 merged by Ottomata:
[operations/puppet@production] Only rename event.message and recvFrom for eventlogging_EventError logstash

https://gerrit.wikimedia.org/r/476057

phuedx claimed this task.

Great to see this working! Thanks for all of your help @Ottomata and @fgiunchedi.

Awesome, thank you for working on this!

Is there some kind of rate limiting? I don't think Logstash could deal with one or more errors per pageview if the error is in some schema that's used very often (or getting used very often is part of the error).

Would this transport method be suitable for JS error logging as well?

Is there some kind of rate limiting?

No, AFAIK, though the Logstash Kafka input will only consume messages as fast as Logstash can ingest them.

I don't think Logstash could deal with one or more errors per pageview if the error is in some schema that's used very often (or getting used very often is part of the error).

Neither of Kafka's quota mechanisms fits the bill here – we're looking for N messages/window rather than N bytes/window – but they could be a good start. Since the EventLogging validator is horizontally scalable, we could give those processes a well-known client ID and impose a quota on it.

An alternative would be to create a generic process that consumes messages from one topic and re-produces them on a different topic at a pre-defined rate (using the leaky bucket algorithm, for example) but this would increase resource consumption on the cluster overall.
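A minimal sketch of that consume-and-re-produce idea, assuming kafka-python; the throttled topic name, broker, and rate are made up for illustration and nothing like this exists today:

```
# Leaky-bucket style re-producer: consume from one topic, re-produce to another
# at a fixed maximum rate.
import json
import time
from kafka import KafkaConsumer, KafkaProducer

BROKERS = ['kafka-jumbo1001.eqiad.wmnet:9092']  # placeholder
MAX_EVENTS_PER_SECOND = 100                     # illustrative limit

consumer = KafkaConsumer(
    'eventlogging_EventError',
    bootstrap_servers=BROKERS,
    value_deserializer=lambda raw: json.loads(raw.decode('utf-8')),
)
producer = KafkaProducer(
    bootstrap_servers=BROKERS,
    value_serializer=lambda obj: json.dumps(obj).encode('utf-8'),
)

interval = 1.0 / MAX_EVENTS_PER_SECOND
last_sent = 0.0

for record in consumer:
    # Drain at a constant rate: anything arriving faster than the limit waits
    # here, which shows up as consumer lag rather than lost messages.
    wait = interval - (time.monotonic() - last_sent)
    if wait > 0:
        time.sleep(wait)
    producer.send('eventlogging_EventError_throttled', record.value)  # hypothetical topic
    last_sent = time.monotonic()
```

In practice you'd probably want to drop or sample events above the limit instead of sleeping, so a sustained error storm doesn't build up unbounded lag.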

Would this transport method be suitable for JS error logging as well?

Yes. If you can get JS errors into a Kafka topic (presumably by replicating [[ https://github.com/wikimedia/puppet/blob/577b081e00384a366bb04e763674c2604a8381cf/modules/profile/manifests/cache/kafka/eventlogging.pp | profile::cache::kafka::eventlogging ]]), then those messages can be ingested by Logstash with only a little configuration.
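Purely as a sketch of that idea (none of this exists): instead of varnishkafka, the server-side piece that takes a JS error report and produces it to a Kafka topic could be as small as the following, where the topic name and broker are assumptions:

```
# Hypothetical producer for client-side JS error reports.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=['kafka-jumbo1001.eqiad.wmnet:9092'],  # placeholder
    value_serializer=lambda obj: json.dumps(obj).encode('utf-8'),
)

def report_client_error(error):
    """Forward one client-side JS error report to a (hypothetical) topic."""
    producer.send('client_side_js_error', {
        'message': error.get('message'),
        'url': error.get('url'),
        'stack': error.get('stack'),
    })
```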

Would this transport method be suitable for JS error logging as well?

Yes. If you can get JS errors into a Kafka topic (presumably by replicating profile::cache::kafka::eventlogging), then those messages can be ingested by Logstash with only a little configuration.

The way Sam puts it here would be suitable, but we don't think that using EventLogging itself for client side JS error logging would be good.

Neither of Kafka's quota mechanisms fits the bill here – we're looking for N messages/window rather than N bytes/window – but they could be a good start. Since the EventLogging validator is horizontally scalable, we could give those processes a well-known client ID and impose a quota on it.

Sorry. This is the wrong way around. We should impose a quota on the Logstash input (consumer) and not the EventLogging validator (producer). Imposing a quota on the producer would impact other consumers.

The way Sam puts it here would be suitable, but we don't think that using EventLogging itself for client side JS error logging would be good.

FWIW I don't think it's particularly helpful for EL errors to be EL events themselves, either.

I tried to make a dashboard for the errors, but the schema field is always EventError and Kibana does not recognize the event field (which contains the real schema) as an object, maybe because it is a very common name and there are probably records in which it is a string. So it is impossible to show the schema name in the list view or in charts.
I guess this could be solved by a few more renames like the one in filter-eventlogging.conf.

Also looks like the wiki field gets lost somewhere.
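Conceptually, the renames being suggested amount to lifting the interesting sub-fields of event to the top level so Kibana can facet on them even when event is sometimes a string. In Python terms (field names taken from this thread; the helper itself is hypothetical, and the real change would live in filter-eventlogging.conf):

```
def flatten_event_error(doc):
    """Lift useful sub-fields of a (dict-valued) event to the top level."""
    event = doc.get('event')
    if isinstance(event, dict):
        doc['event_schema'] = event.get('schema')    # the schema that failed validation
        doc['event_message'] = event.get('message')  # the validation error message
        doc['event_wiki'] = event.get('wiki')        # so the wiki field isn't lost
    return doc
```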

Change 477419 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[operations/puppet@production] EventLogging Logstash filter: move useful fields out of event

https://gerrit.wikimedia.org/r/477419

Reopening: this change was reverted due to overly verbose logging going into logstash: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/476982/

Nuria reassigned this task from phuedx to Ottomata.

Messages like these were being logged as many as 500K every 15 mins:

`[2018-07-11 19:44:04,942] INFO [ReplicaFetcher replicaId=1001, leaderId=1003, fetcherId=2] Retrying leaderEpoch request for partition eqiad.change-prop.retry.cpjobqueue.retry.mediawiki.job.LocalPageMoveJob-0 as the leader reported an error: UNKNOWN_SERVER_ERROR (kafka.server.ReplicaFetcherThread)`

and that caused packet drops.

Given that no alarms were going off in kafka at the time, it is likely that this is just verbose logging from kafka and logstash simply cannot handle the volume.

Nuria, I think you might have the wrong ticket. That change has nothing to do with EventLogging.

Totally right. I thought it might be related to us sending more logs to kafka due to this EL change, but I see it is not related, as it happened after. My apologies.

Change 477419 merged by Ottomata:
[operations/puppet@production] EventLogging Logstash filter: move useful fields out of event

https://gerrit.wikimedia.org/r/477419

Change 478101 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[eventlogging@master] Send JSON string of event for validation errors in EventError

https://gerrit.wikimedia.org/r/478101

Change 478101 merged by Ottomata:
[eventlogging@master] Send JSON string of event for validation errors in EventError

https://gerrit.wikimedia.org/r/478101

mforns subscribed.

This task is done. Resolving.

Re-opening; we need to check if it is deployed and restarted.

@mforns: I'm pretty sure this is deployed and (re)started as there are log lines appearing in Kibana for the following query: https://logstash.wikimedia.org/goto/36329f5fd2348e181177ff879800d9c0