analytics1021-AllTopicsMessagesInPerSec-OneMinuteRate
From http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-analytics/20140807.txt
[13:45:20] <mutante> analytics1021:
[13:45:22] <mutante> 3/3 kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 7.42708492353e-59
[13:54:36] <tnegrin> gage?
[14:01:03] <tnegrin> mutante: andrew is out today -- is that alert repeating?
[14:01:50] <mutante> tnegrin: yes, it started a little over 1 day ago
[14:02:05] <tnegrin> hmm -- the graphs I look at all look normal
[14:02:06] <mutante> at wikimania but not sure how criticial it is
[14:03:06] <tnegrin> SF comes online in a few hours -- can you sleep it for 2 hours?
[14:03:12] <tnegrin> I will have gage look at it
[14:03:24] <tnegrin> (I don't think it's critical)
[14:04:14] <mutante> yes, i can
[14:04:18] <mutante> ok, thanks
[14:04:35] <tnegrin> thank
[14:04:37] <tnegrin> thanks
Ganglia shows the message rate on analytics1021 dropping, with the other
brokers taking over.
(See attachments
analytics1021-AllTopicsMessagesInPerSec-OneMinuteRate.png Cluster-MessagesInPerSec-OneMinuteRate.png Cluster-RequestsPerSec-OneMinuteRate.png
)
It seems to have happened around 2014-08-06 01:44.
At that time, according to /var/log/kafka/kafka.log on analytics1021, the
ZooKeeper session expired [1]:
[...] [2014-08-06 01:44:36,974] 101327050 [main-EventThread] INFO org.I0Itec.zkclient.ZkClient - zookeeper state changed (Expired) [...]
and the client could not reconnect to ZooKeeper:
[...] [2014-08-06 01:44:37,061] 101327137 [main-SendThread(analytics1024.eqiad.wmnet:2181)] INFO org.apache.zookeeper.ClientCnxn - Unable to reconnect to ZooKeeper service, session 0x146fd72a83d0dbe has expired, closing socket connection [...]
Then, after reconnecting, re-election took place:
[2014-08-06 01:44:37,215] 101327291 [ZkClient-EventThread-14-analytics1023.eqiad.wmnet,analytics1024.eqiad.wmnet,analytics1025.eqiad.wmnet/kafka/eqiad] INFO kafka.controller.KafkaController$SessionExpirationListener - [SessionExpirationListener on 21], ZK expired; shut down all controller components and try to re-elect
[2014-08-06 01:44:37,272] 101327348 [ZkClient-EventThread-14-analytics1023.eqiad.wmnet,analytics1024.eqiad.wmnet,analytics1025.eqiad.wmnet/kafka/eqiad] INFO kafka.utils.ZkUtils$ - conflict in /controller data: {"version":1,"brokerid":21,"timestamp":"1407289477248"} stored data: {"version":1,"brokerid":22,"timestamp":"1407187809296"}
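The two "timestamp" values in the conflicting /controller data can be decoded to see when each broker claimed the controller role. The following is a minimal sketch, assuming the field holds milliseconds since the Unix epoch; the function name is made up for illustration.

```python
import json
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def controller_claim_time(znode_json):
    """Return (brokerid, UTC datetime) for a /controller JSON payload.

    Assumption: "timestamp" is milliseconds since the Unix epoch.
    """
    data = json.loads(znode_json)
    when = EPOCH + timedelta(milliseconds=int(data["timestamp"]))
    return data["brokerid"], when

# Both payloads are copied from the "conflict in /controller data" log line.
attempted = '{"version":1,"brokerid":21,"timestamp":"1407289477248"}'
stored = '{"version":1,"brokerid":22,"timestamp":"1407187809296"}'

for label, payload in (("attempted", attempted), ("stored", stored)):
    broker, when = controller_claim_time(payload)
    print(label, "broker", broker, "at", when.isoformat())
```

Under that assumption, broker 21's attempt decodes to 2014-08-06 01:44:37 UTC, which matches the log lines above, and the stored entry shows broker 22 holding the controller role since 2014-08-04 21:30:09 UTC.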
[1] Typically the state changes between Disconnected and SyncConnected, with only a few hundred ms spent in the Disconnected state
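To check how long the broker usually stays in the Disconnected state, the "zookeeper state changed" lines in kafka.log could be scanned with something like the sketch below. It assumes every such line has the same layout as the Expired line quoted above; the function names are made up for illustration, and the Disconnected sample line is invented (only the Expired line is from the real log).

```python
import re
from datetime import datetime

# Matches the leading "[YYYY-MM-DD HH:MM:SS,mmm]" stamp and the new state.
STATE_RE = re.compile(
    r"\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\].*?"
    r"zookeeper state changed \((\w+)\)"
)

def state_changes(lines):
    """Yield (timestamp, state) for each zkclient state-change line."""
    for line in lines:
        m = STATE_RE.search(line)
        if m:
            ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S,%f")
            yield ts, m.group(2)

def disconnected_spans(changes):
    """Yield (start, seconds, next_state) for each stay in Disconnected."""
    start = None
    for ts, state in changes:
        if state == "Disconnected":
            start = ts
        elif start is not None:
            yield start, (ts - start).total_seconds(), state
            start = None

sample = [
    # Invented line, for illustration only:
    "[2014-08-06 01:44:30,000] 101320076 [main-EventThread] INFO "
    "org.I0Itec.zkclient.ZkClient - zookeeper state changed (Disconnected)",
    # Real line from kafka.log on analytics1021:
    "[2014-08-06 01:44:36,974] 101327050 [main-EventThread] INFO "
    "org.I0Itec.zkclient.ZkClient - zookeeper state changed (Expired)",
]

spans = list(disconnected_spans(state_changes(sample)))
for start, seconds, next_state in spans:
    print(start, "Disconnected for", seconds, "s, then", next_state)
```

On the real log, passing the open file instead of `sample` would list every Disconnected stretch, so the long one that ended in Expired should stand out from the usual few hundred ms.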
Version: unspecified
Severity: normal