
Logstash missing most messages from mediawiki (Aug 2019)
Closed, ResolvedPublic

Description

01:24 <+icinga-wm> PROBLEM - Too many messages in kafka logging-eqiad on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group={logstash,logstash-codfw} instance=kafkamon1001:9501 job=burrow partition={0,1,2} site=eqiad topic=udp_localhost-info https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/logging-eqiad&var-topic=All&var-consumer_group=All

See T228089 for a previous occurrence of this.

Event Timeline

Restricted Application added a subscriber: Aklapper. Aug 20 2019, 11:29 PM
Urbanecm triaged this task as Unbreak Now! priority. Aug 20 2019, 11:29 PM
Restricted Application added a subscriber: Liuxinyu970226. Aug 20 2019, 11:29 PM
Urbanecm updated the task description. Aug 20 2019, 11:35 PM

Mentioned in SAL (#wikimedia-operations) [2019-08-20T23:42:54Z] <Urbanecm> Evening SWAT aborted due to no logs logged for some period of time (T230847), no patches were reverted

fgiunchedi lowered the priority of this task from Unbreak Now! to Medium. Aug 21 2019, 8:40 AM
fgiunchedi added a subscriber: fgiunchedi.

Downgrading to normal since the situation stabilized and returned to normal as of 23:40 UTC; the investigation into what caused it is still pending.

It looks like a Logstash consumer failed, according to the Kafka logs on logstash1010:

[2019-08-20 23:09:36,042] INFO [GroupCoordinator 1004]: Member logstash-0-b1f8c6cc-31aa-44bb-98df-349ff911179e in group logstash has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator)
[2019-08-20 23:09:36,042] INFO [GroupCoordinator 1004]: Preparing to rebalance group logstash with old generation 4127 (__consumer_offsets-49) (kafka.coordinator.group.GroupCoordinator)
[2019-08-20 23:09:39,151] INFO [GroupMetadataManager brokerId=1004] Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)
[2019-08-20 23:09:46,496] INFO [GroupCoordinator 1004]: Member logstash-0-acddeab4-e90a-40f7-a4ff-0ae27c863d89 in group logstash has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator)
[2019-08-20 23:09:47,215] INFO [GroupCoordinator 1004]: Stabilized group logstash generation 4128 (__consumer_offsets-49) (kafka.coordinator.group.GroupCoordinator)
[2019-08-20 23:09:47,218] INFO [GroupCoordinator 1004]: Assignment received from leader for group logstash for generation 4128 (kafka.coordinator.group.GroupCoordinator)
[2019-08-20 23:09:58,916] INFO [GroupCoordinator 1004]: Preparing to rebalance group logstash with old generation 4128 (__consumer_offsets-49) (kafka.coordinator.group.GroupCoordinator)
[2019-08-20 23:09:59,674] INFO [GroupCoordinator 1004]: Member logstash-0-9f2023d4-d5fb-49d1-b9e2-d5425fdf2d90 in group logstash-codfw has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator)
[2019-08-20 23:09:59,674] INFO [GroupCoordinator 1004]: Preparing to rebalance group logstash-codfw with old generation 2450 (__consumer_offsets-37) (kafka.coordinator.group.GroupCoordinator)
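
To confirm from the client side that the logstash consumer group dropped members and is catching back up, per-partition lag (latest offset minus committed offset) can be checked directly against Kafka. This is a minimal sketch, assuming the kafka-python library; the broker address is a placeholder, and this is not the actual production tooling.

# Minimal sketch: inspect the "logstash" consumer group's lag per partition.
# Assumes the kafka-python client; the broker address below is a placeholder.
from kafka import KafkaAdminClient, KafkaConsumer

BOOTSTRAP = "logstash1010.eqiad.wmnet:9092"  # placeholder Kafka broker address
GROUP = "logstash"                           # consumer group seen in the logs above

admin = KafkaAdminClient(bootstrap_servers=BOOTSTRAP)
consumer = KafkaConsumer(bootstrap_servers=BOOTSTRAP)

# Offsets the group has committed, per topic-partition.
committed = admin.list_consumer_group_offsets(GROUP)

# Latest (end) offsets for the same partitions, to compute the lag.
end_offsets = consumer.end_offsets(list(committed.keys()))

for tp, meta in sorted(committed.items()):
    lag = end_offsets[tp] - meta.offset
    print(f"{tp.topic}[{tp.partition}] committed={meta.offset} end={end_offsets[tp]} lag={lag}")

Once restarted consumers rejoin the group, the per-partition lag should trend back towards zero, which matches the recovery observed around 23:40 UTC.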
Krinkle renamed this task from "Logstash gets significantly lower number of messages from mediawiki" to "Logstash missing most messages from mediawiki (Aug 2019)". Dec 17 2019, 10:18 PM

I've created an initial draft at https://wikitech.wikimedia.org/wiki/Incident_documentation/20190820-logstash.

Some open questions remain:

  • Did we lose the logs from these 30 minutes, or were they recoverable (and recovered) from Kafka?
  • Could this "Kafka failure" happen again? Did we detect it early enough?
  • Was the mitigation trivial / automatable?

Thanks @Krinkle, to answer your questions:

  • Yes, the logs were in Kafka and were fully recovered once the consumers caught up.
  • I think we did detect it early enough; note that it was a Logstash failure, not a Kafka failure.
  • The mitigation was to restart Logstash, IIRC. It could be automated (a rough sketch of what that might look like follows below), but we haven't seen another occurrence and we're moving to Logstash 7 this quarter, so IMHO this is good to close.
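
For the automation question, a hypothetical check-and-restart sketch is below. The lag threshold, broker address, and the assumption that Logstash runs as a systemd unit named logstash are illustrative only, not the actual production setup.

# Hypothetical sketch: restart Logstash if its consumer group falls too far behind.
# Threshold, broker address, and systemd unit name are assumptions for illustration.
import subprocess

from kafka import KafkaAdminClient, KafkaConsumer

BOOTSTRAP = "logstash1010.eqiad.wmnet:9092"  # placeholder Kafka broker address
GROUP = "logstash"
MAX_LAG = 1_000_000  # messages; arbitrary example threshold


def total_lag(bootstrap, group):
    """Sum committed-vs-end offset lag across all partitions of the group."""
    admin = KafkaAdminClient(bootstrap_servers=bootstrap)
    consumer = KafkaConsumer(bootstrap_servers=bootstrap)
    committed = admin.list_consumer_group_offsets(group)
    ends = consumer.end_offsets(list(committed.keys()))
    return sum(ends[tp] - meta.offset for tp, meta in committed.items())


if total_lag(BOOTSTRAP, GROUP) > MAX_LAG:
    # Restart the local Logstash instance so its consumer rejoins the group.
    subprocess.run(["systemctl", "restart", "logstash"], check=True)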

I'm not even convinced an incident report is in order in this case, but I'm willing to be convinced otherwise.

Krinkle closed this task as Resolved. Feb 18 2020, 3:53 PM
Krinkle added a comment (edited). Feb 20 2020, 12:00 AM

Thanks, I've updated the incident page. I'll close this now, then.