Page MenuHomePhabricator

Logstash gets significantly lower number of messages from mediawiki
Open, NormalPublic

Description

01:24 <+icinga-wm> PROBLEM - Too many messages in kafka logging-eqiad on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group={logstash,logstash-codfw} instance=kafkamon1001:9501 job=burrow partition={0,1,2} site=eqiad topic=udp_localhost-info https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/o
01:24 <+icinga-wm> gging-eqiad&var-topic=All&var-consumer_group=All

See T228089 for previous occurance of this.

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptTue, Aug 20, 11:29 PM
Urbanecm triaged this task as Unbreak Now! priority.Tue, Aug 20, 11:29 PM
Restricted Application added a subscriber: Liuxinyu970226. · View Herald TranscriptTue, Aug 20, 11:29 PM
Urbanecm updated the task description. (Show Details)Tue, Aug 20, 11:35 PM

Mentioned in SAL (#wikimedia-operations) [2019-08-20T23:42:54Z] <Urbanecm> Evening SWAT aborted due to no logs logged for some period of time (T230847), no patches were reverted

fgiunchedi lowered the priority of this task from Unbreak Now! to Normal.Wed, Aug 21, 8:40 AM
fgiunchedi added a subscriber: fgiunchedi.

Downgrading to normal since the situation has stabilized and returned to normal as 23:40 UTC, still pending investigation on what caused it

Looks like a logstash consumer has failed, according to kafka logs on logstash1010

[2019-08-20 23:09:36,042] INFO [GroupCoordinator 1004]: Member logstash-0-b1f8c6cc-31aa-44bb-98df-349ff911179e in group logstash has failed, removing it from t
he group (kafka.coordinator.group.GroupCoordinator)                                                                                                            [2019-08-20 23:09:36,042] INFO [GroupCoordinator 1004]: Preparing to rebalance group logstash with old generation 4127 (__consumer_offsets-49) (kafka.coordinat
or.group.GroupCoordinator)                                                                                                                                     [2019-08-20 23:09:39,151] INFO [GroupMetadataManager brokerId=1004] Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)
[2019-08-20 23:09:46,496] INFO [GroupCoordinator 1004]: Member logstash-0-acddeab4-e90a-40f7-a4ff-0ae27c863d89 in group logstash has failed, removing it from t
he group (kafka.coordinator.group.GroupCoordinator)                                                                                                            [2019-08-20 23:09:47,215] INFO [GroupCoordinator 1004]: Stabilized group logstash generation 4128 (__consumer_offsets-49) (kafka.coordinator.group.GroupCoordin
ator)                                                                                                                                                          [2019-08-20 23:09:47,218] INFO [GroupCoordinator 1004]: Assignment received from leader for group logstash for generation 4128 (kafka.coordinator.group.GroupCo
ordinator)                                                                                                                                                     [2019-08-20 23:09:58,916] INFO [GroupCoordinator 1004]: Preparing to rebalance group logstash with old generation 4128 (__consumer_offsets-49) (kafka.coordinat
or.group.GroupCoordinator)                                                                                                                                     [2019-08-20 23:09:59,674] INFO [GroupCoordinator 1004]: Member logstash-0-9f2023d4-d5fb-49d1-b9e2-d5425fdf2d90 in group logstash-codfw has failed, removing it 
from the group (kafka.coordinator.group.GroupCoordinator)                                                                                                      [2019-08-20 23:09:59,674] INFO [GroupCoordinator 1004]: Preparing to rebalance group logstash-codfw with old generation 2450 (__consumer_offsets-37) (kafka.coo
rdinator.group.GroupCoordinator)