
Logstash down for MediaWiki
Closed, Resolved · Public

Description

Around 12:30 UTC today, Logstash stopped accepting messages from MediaWiki.

From https://logstash.wikimedia.org/app/kibana#/dashboard/mediawiki-errors

Screenshot 2019-07-15 at 19.30.04.png

Screenshot 2019-07-15 at 19.30.51.png

Impact
  • Blocks MediaWiki deployments, as Scap checks Logstash for new errors during the canary phase (a rough sketch of such a check follows this list).
  • The workflow for verifying changes on the mwdebug1002 servers falsely suggests there are no problems. Unlike the main dashboard, this does not look suspicious, because most requests on those servers produce no errors and they receive no other traffic, so "no errors" is the default state.
  • Engineering teams working on MediaWiki have no visibility into operational problems from the PHP core.
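
For illustration, below is a minimal sketch (in Python) of the kind of canary-phase check that depends on Logstash ingestion. The Elasticsearch endpoint, index pattern, and field names are assumptions chosen for the example; this is not Scap's actual implementation.

import requests

# Count recent MediaWiki errors reported by a canary host. The endpoint, index
# pattern, and field names are assumptions for illustration, not Scap's code.
ES_URL = "http://localhost:9200/logstash-*/_count"
query = {
    "query": {
        "bool": {
            "filter": [
                {"term": {"type": "mediawiki"}},
                {"term": {"channel": "error"}},
                {"term": {"host": "mwdebug1002"}},
                {"range": {"@timestamp": {"gte": "now-5m"}}},
            ]
        }
    }
}

count = requests.get(ES_URL, json=query, timeout=10).json()["count"]
# When Logstash stops ingesting entirely, this count stays at zero even during
# an outage, which is exactly the false positive described above.
print("errors in the last 5 minutes:", count)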

Event Timeline

Krinkle triaged this task as Unbreak Now! priority. Jul 15 2019, 6:33 PM

@jcrespo said on #wikimedia-operations "<jynus> robh: see eqsin mails" when talking about this issue, see P8748.

I never talked about this issue, and had no idea why @Urbanecm thought I was talking about this while I was having a private conversation with another person.

I'm sorry, I thought so because said conversation directly followed the report from @Daimona.

Huh, when I first reported this I thought someone already knew about it. Anyway. Looking at this I see there was a spike around 12:20-12:30, which is right before data stopped flowing. It's also interesting to note that, apparently, there hasn't been any drop in the total amount of data; it was in fact replaced by these errors. So something is wrong with Elastic (?).

For the record, @jcrespo was replying to RobH about an unrelated topic in response to messages from the wikibugs and icinga-wm bots.

Meanwhile, back on topic. Some graphs that have been mentioned in the IRC conversation about this.

Dashboard: kafka-consumer-lag

capture.png

Messages appear to be arriving fine into the Kafka layer, but are not being consumed properly (by Logstash/Elasticsearch?).
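
To make the "arriving but not consumed" observation concrete, here is a minimal sketch (using the kafka-python client) of how consumer lag can be computed directly from Kafka. The broker address, topic, and consumer-group names are placeholders, not the production values.

from kafka import KafkaConsumer, TopicPartition

# Compare the latest broker offsets with the offsets committed by the Logstash
# consumer group; the difference per partition is the backlog ("lag").
# Broker, topic, and group names are illustrative placeholders.
consumer = KafkaConsumer(
    bootstrap_servers="kafka-logging.example.org:9092",
    group_id="logstash",
    enable_auto_commit=False,
)
topic = "rsyslog-shipper"
partitions = [TopicPartition(topic, p) for p in consumer.partitions_for_topic(topic)]
end_offsets = consumer.end_offsets(partitions)

for tp in partitions:
    committed = consumer.committed(tp) or 0
    print(f"partition {tp.partition}: lag {end_offsets[tp] - committed}")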


Dashboard: logstash

capture.png

There seems to be a strong correlation with two Logstash message sources gaining and losing traffic respectively: source input/kafka/rsyslog-shipper-eqiad grew by about 10X, while input/kafka/rsyslog-udp-localhost-eqiad shrank by about 3X.
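
One way to attribute such a shift to a specific producer is a terms aggregation over recent documents. A minimal sketch follows; the Elasticsearch endpoint, index pattern, and the choice of the type field are assumptions for illustration.

import requests

# Rank log producers by document count over the last 15 minutes. The endpoint,
# index pattern, and aggregation field are assumptions for illustration.
ES_URL = "http://localhost:9200/logstash-*/_search"
query = {
    "size": 0,
    "query": {"range": {"@timestamp": {"gte": "now-15m"}}},
    "aggs": {"by_type": {"terms": {"field": "type", "size": 10}}},
}

response = requests.get(ES_URL, json=query, timeout=10).json()
for bucket in response["aggregations"]["by_type"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])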

We decided to drop logs from cpjobqueue and changeprop at the logstash layer with the following config:

89-filter_drop_cpjobque_changeprop.conf:

filter {
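  # Drop cpjobqueue and changeprop events outright to relieve pressure on the
  # pipeline. The producer name may appear in either the "type" or the "_type"
  # field, so both are checked.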
  if [type] == "cpjobqueue" {
    drop {}
  }
  if [type] == "changeprop" {
    drop {}
  }
  if [_type] == "cpjobqueue" {
    drop {}
  }
  if [_type] == "changeprop" {
    drop {}
  }
}

Once the backlog is processed (see https://grafana.wikimedia.org/d/000000102/production-logging?refresh=5m&panelId=8&fullscreen&orgId=1), the priority can be lowered to High. However, something should be put in place to prevent another logging outage, even if it is rough: for example, an alert to identify the condition and a runbook for dropping a source of logs as above.
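
As a starting point for such an alert, here is a minimal sketch of a check script that queries the Prometheus API for a consumer-lag metric like the one on the kafka-consumer-lag dashboard. The Prometheus URL, metric name, and threshold are assumptions, not existing production configuration.

import sys
import requests

# Query Prometheus for the Logstash consumer group's Kafka lag and exit
# non-zero above a threshold, so the script can back an Icinga-style alert.
# The Prometheus URL, metric name, and threshold are assumptions.
PROM_URL = "http://prometheus.example.org/api/v1/query"
QUERY = 'sum(kafka_consumergroup_lag{group="logstash"})'
THRESHOLD = 1_000_000  # messages of backlog treated as an outage

samples = requests.get(PROM_URL, params={"query": QUERY}, timeout=10).json()["data"]["result"]
lag = float(samples[0]["value"][1]) if samples else 0.0

if lag > THRESHOLD:
    print(f"CRITICAL: Logstash Kafka consumer lag is {lag:.0f}")
    sys.exit(2)
print(f"OK: Logstash Kafka consumer lag is {lag:.0f}")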

CDanis claimed this task.

The backlog in Kafka should clear in just a few more minutes. Closing this; separate tasks will be opened later for follow-up work.

Mentioned in SAL (#wikimedia-operations) [2019-07-16T00:03:42Z] <shdubsh> restart logstash to revert mitigations - T228089