
Logstash down for MediaWiki
Closed, Resolved · Public

Description

Around 12:30 UTC today, Logstash stopped accepting messages from MediaWiki.

From https://logstash.wikimedia.org/app/kibana#/dashboard/mediawiki-errors


Impact
  • Blocks MediaWiki deployments, as Scap checks Logstash during the canary phase.
  • The workflow for verifying changes on the mwdebug1002 servers falsely suggests there are no problems. Unlike the main dashboard, the absence of errors there does not look suspicious: most requests produce no errors and these servers receive no other traffic, so zero errors is the default.
  • Engineering teams working on MediaWiki have no visibility into operational problems from the PHP core.

Event Timeline

Krinkle created this task. Jul 15 2019, 6:33 PM
Restricted Application added a subscriber: Aklapper. Jul 15 2019, 6:33 PM
Krinkle triaged this task as Unbreak Now! priority. Jul 15 2019, 6:33 PM
Restricted Application added a subscriber: Liuxinyu970226. Jul 15 2019, 6:33 PM
CDanis added a subscriber: CDanis. Jul 15 2019, 6:33 PM
Krinkle updated the task description. Jul 15 2019, 6:36 PM

@jcrespo said on #wikimedia-operations "<jynus> robh: see eqsin mails" when talking about this issue, see P8748.

I never talked about this issue, and had no idea why @Urbanecm thought I was talking about it; I was having a private conversation with another person.

jcrespo removed a subscriber: jcrespo. Jul 15 2019, 6:48 PM

I'm sorry, I thought so because said conversation directly followed the report from @Daimona.

Huh, when I first reported this I thought someone already knew about it. Anyway, looking at this I see there was a spike around 12:20-30, right before data stopped flowing. It's also interesting to note that, apparently, there hasn't been any drop in the total volume of data; the normal messages were in fact replaced by these errors. So something is wrong with Elastic (?).

Krinkle added a subscriber: jcrespo. (Edited) Jul 15 2019, 7:17 PM

@jcrespo said on #wikimedia-operations "<jynus> robh: see eqsin mails" when talking about this issue, see P8748.

I never talked about this issue, and had no idea why @Urbanecm thought I was talking about it; I was having a private conversation with another person.

I'm sorry, I thought so because said conversation directly followed the report from @Daimona.

For the record, @jcrespo was replying to RobH about an unrelated topic in response to messages from the wikibugs and icinga-wm bots.

Krinkle added a comment. (Edited) Jul 15 2019, 7:24 PM

Meanwhile, back on topic. Here are some graphs that have been mentioned in the IRC conversation about this.

Dashboard: kafka-consumer-lag

Messages appear to be arriving fine at the Kafka layer, but are not being consumed properly (by Logstash/Elasticsearch?).
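To illustrate what the kafka-consumer-lag dashboard measures: lag is the gap between the latest offset produced to each partition and the offset the consumer group has committed. A minimal sketch (the function name and offset dicts are illustrative, not a real Kafka API):

```python
# Hypothetical sketch: consumer lag per partition is the difference between
# the latest offset in the log and the last offset the consumer committed.
def consumer_lag(end_offsets, committed_offsets):
    """Return total lag across partitions.

    end_offsets: {partition: latest offset produced to the log}
    committed_offsets: {partition: last offset committed by the consumer}
    """
    return sum(
        max(0, end - committed_offsets.get(partition, 0))
        for partition, end in end_offsets.items()
    )

# A healthy pipeline keeps lag near zero; a stalled consumer (here, Logstash
# failing to index into Elasticsearch) makes it grow steadily.
print(consumer_lag({0: 1500, 1: 2000}, {0: 1500, 1: 1200}))  # → 800
```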


Dashboard: logstash


There seems to be a strong correlation with two Logstash message sources gaining and losing traffic, respectively: input/kafka/rsyslog-shipper-eqiad grew by about 10X, and input/kafka/rsyslog-udp-localhost-eqiad shrank by about 3X.

We decided to drop logs from cpjobqueue and changeprop at the logstash layer with the following config:

89-filter_drop_cpjobque_changeprop.conf:

filter {
  if [type] == "cpjobqueue" {
    drop {}
  }
  if [type] == "changeprop" {
    drop {}
  }
  if [_type] == "cpjobqueue" {
    drop {}
  }
  if [_type] == "changeprop" {
    drop {}
  }
}
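For clarity, the effect of this filter can be sketched in Python (the event dicts and helper name are illustrative; this is not Logstash's actual pipeline code):

```python
# Sketch of the drop semantics above: events whose type (or legacy _type)
# field is cpjobqueue or changeprop are discarded before indexing.
DROPPED = {"cpjobqueue", "changeprop"}

def keep_event(event):
    """Return False for events the Logstash filter would drop."""
    return (event.get("type") not in DROPPED
            and event.get("_type") not in DROPPED)

events = [
    {"type": "cpjobqueue", "message": "job enqueued"},
    {"type": "mediawiki", "message": "fatal error"},
    {"_type": "changeprop", "message": "rule fired"},
]
print([e["message"] for e in events if keep_event(e)])  # → ['fatal error']
```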

Related to T150106: Type collisions in log events causing indexing failures in ELK Elasticsearch if the root problem is type collisions in the Elasticsearch index

Once the backlog is processed (https://grafana.wikimedia.org/d/000000102/production-logging?refresh=5m&panelId=8&fullscreen&orgId=1), this task's priority can be lowered to High. But something should be put in place to prevent another logs outage, even if it is rough, such as an alert to identify the situation and a runbook for dropping a source of logs as above.
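One rough safeguard of the kind suggested would be an alert on sustained, growing consumer lag. A minimal sketch of the detection logic (the threshold, window, and function name are made-up values for illustration, not an existing alerting rule):

```python
def lag_alert(lag_samples, threshold=100_000, window=3):
    """Fire if the last `window` lag samples all exceed `threshold`
    and lag is still growing, i.e. the consumer is not catching up."""
    recent = lag_samples[-window:]
    if len(recent) < window:
        return False
    growing = all(b > a for a, b in zip(recent, recent[1:]))
    return growing and all(s > threshold for s in recent)

print(lag_alert([10, 50, 200_000, 300_000, 450_000]))  # stalled consumer
print(lag_alert([500_000, 400_000, 300_000]))          # backlog draining
```

Checking that lag is growing (not just high) avoids paging while a recovered consumer is still draining a large backlog, as happened at the end of this incident.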

CDanis closed this task as Resolved. Jul 15 2019, 11:49 PM
CDanis claimed this task.

The backlog in Kafka should clear in just a few more minutes. Closing this; separate issues to be opened later for follow-up work.

Mentioned in SAL (#wikimedia-operations) [2019-07-16T00:03:42Z] <shdubsh> restart logstash to revert mitigations - T228089