Packetloss was critical on 2014-07-29 ~2:00 for oxygen, analytics1003, erbium
Closed, Declined (Public)


On 2014-07-29 ~02:00, there were packet loss alarms for oxygen, analytics1003, erbium in the #wikimedia-operations IRC channel:

[01:52:47] <icinga-wm> PROBLEM - Packetloss_Average on erbium is CRITICAL: packet_loss_average CRITICAL: 37.5854172414
[01:56:47] <icinga-wm> RECOVERY - Packetloss_Average on erbium is OK: packet_loss_average OKAY: -0.0539559302326  
[01:57:17] <icinga-wm> PROBLEM - Packetloss_Average on analytics1003 is CRITICAL: packet_loss_average CRITICAL: 14.0737649167  
[02:01:17] <icinga-wm> RECOVERY - Packetloss_Average on analytics1003 is OK: packet_loss_average OKAY: 1.17930608333  
[02:02:57] <icinga-wm> PROBLEM - Packetloss_Average on oxygen is CRITICAL: packet_loss_average CRITICAL: 9.18785825  
[02:06:57] <icinga-wm> RECOVERY - Packetloss_Average on oxygen is OK: packet_loss_average OKAY: 1.15079566667

Each packet loss period was short (about four minutes between PROBLEM and
RECOVERY), and the IRC channel carried a lot of other monitoring noise around
that time, so these alarms might have been flukes.
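
To put a number on "short", here is a minimal sketch that pairs each PROBLEM line with the following RECOVERY line for the same host and reports the gap in seconds. It assumes the icinga-wm lines are pasted in exactly the format quoted above. The function name `alarm_durations` and the regular expression are illustrative only and not part of any existing tooling.

```python
import re
from datetime import datetime

# Matches the icinga-wm lines quoted in this task (format assumed from the paste).
LINE_RE = re.compile(
    r"\[(?P<time>\d{2}:\d{2}:\d{2})\] <icinga-wm> "
    r"(?P<state>PROBLEM|RECOVERY) - Packetloss_Average on (?P<host>\S+) is "
)


def alarm_durations(lines):
    """Return {host: seconds from PROBLEM to the next RECOVERY for that host}."""
    started = {}
    durations = {}
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue  # skip unrelated channel noise
        when = datetime.strptime(match["time"], "%H:%M:%S")
        host = match["host"]
        if match["state"] == "PROBLEM":
            started[host] = when
        elif host in started:
            durations[host] = int((when - started.pop(host)).total_seconds())
    return durations


LOG = """\
[01:52:47] <icinga-wm> PROBLEM - Packetloss_Average on erbium is CRITICAL: packet_loss_average CRITICAL: 37.5854172414
[01:56:47] <icinga-wm> RECOVERY - Packetloss_Average on erbium is OK: packet_loss_average OKAY: -0.0539559302326
[01:57:17] <icinga-wm> PROBLEM - Packetloss_Average on analytics1003 is CRITICAL: packet_loss_average CRITICAL: 14.0737649167
[02:01:17] <icinga-wm> RECOVERY - Packetloss_Average on analytics1003 is OK: packet_loss_average OKAY: 1.17930608333
[02:02:57] <icinga-wm> PROBLEM - Packetloss_Average on oxygen is CRITICAL: packet_loss_average CRITICAL: 9.18785825
[02:06:57] <icinga-wm> RECOVERY - Packetloss_Average on oxygen is OK: packet_loss_average OKAY: 1.15079566667
"""

if __name__ == "__main__":
    for host, seconds in alarm_durations(LOG.splitlines()).items():
        print(f"{host}: {seconds}s")
```

On the six lines quoted above this gives 240 seconds for each of the three hosts. The sketch ignores date rollover at midnight, which does not occur in this paste.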

Version: unspecified
Severity: normal
Whiteboard: u=Community c=General/Unknown p=0 s=2014-07-24



Event Timeline

bzimport raised the priority of this task to Needs Triage. (Nov 22 2014, 3:31 AM)
bzimport set Reference to bz68796.

The issue was a flapping esams link [1], which (depending on the stream)
dropped between half and all of the esams traffic to the udp2log instances
between 2014-07-29T01:35:45 and 2014-07-29T01:42:00. Traffic from eqiad and
ulsfo was unaffected.

This issue affects all of our logging infrastructure, from TSVs to
webstatscollector to pagecounts.

[1] See
between [01:36:07] and [02:02:19]