Investigate PyBal dead on lvs1007-12
Closed, Resolved, Public

Description

So this happened:

[02:35:30] <icinga-wm>	 PROBLEM - HHVM rendering on mw1053 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 9.423 second response time
[02:36:00] <icinga-wm>	 PROBLEM - Apache HTTP on mw1053 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:37:10] <icinga-wm>	 RECOVERY - HHVM rendering on mw1053 is OK: HTTP OK: HTTP/1.1 200 OK - 64346 bytes in 0.221 second response time
[02:37:10] <icinga-wm>	 PROBLEM - pybal on lvs1012 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal
[02:37:30] <icinga-wm>	 PROBLEM - pybal on lvs1010 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal
[02:37:40] <icinga-wm>	 RECOVERY - Apache HTTP on mw1053 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.174 second response time
[02:37:49] <icinga-wm>	 PROBLEM - configured eth on lvs1008 is CRITICAL: eth3 reporting no carrier.
[02:37:49] <icinga-wm>	 PROBLEM - pybal on lvs1009 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal
[02:37:49] <icinga-wm>	 PROBLEM - configured eth on lvs1011 is CRITICAL: eth3 reporting no carrier.
[02:38:01] <icinga-wm>	 PROBLEM - configured eth on lvs1010 is CRITICAL: eth3 reporting no carrier.
[02:38:30] <icinga-wm>	 PROBLEM - pybal on lvs1007 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal
[02:38:30] <icinga-wm>	 PROBLEM - pybal on lvs1011 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal
[02:38:51] <icinga-wm>	 PROBLEM - pybal on lvs1008 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal

On investigating, I found that pybal is dead on these hosts. eth3 on all of them is set to row D, and they're all configured as one block in puppet's manifests/site.pp.
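To see at a glance which checks are still failing on which hosts, the alert lines above can be tallied. This is only a minimal sketch: the regex is inferred from the log lines pasted in this task, and `ALERT_RE` and `open_problems` are hypothetical names, not part of icinga or any existing tooling.

```python
import re
from collections import defaultdict

# Assumed line format, inferred only from the icinga-wm lines pasted above.
ALERT_RE = re.compile(
    r"^\[(?P<time>\d{2}:\d{2}:\d{2})\]\s+<icinga-wm>\s+"
    r"(?P<state>PROBLEM|RECOVERY) - (?P<check>.+?) on (?P<host>\S+) is "
)


def open_problems(lines):
    """Return {check: sorted hosts} for PROBLEMs with no later RECOVERY."""
    last_state = {}
    for line in lines:
        m = ALERT_RE.match(line.strip())
        if not m:
            continue  # not an alert line in the assumed format
        # Later lines overwrite earlier ones, so a RECOVERY clears a PROBLEM.
        last_state[(m["check"], m["host"])] = m["state"]

    result = defaultdict(list)
    for (check, host), state in last_state.items():
        if state == "PROBLEM":
            result[check].append(host)
    return {check: sorted(hosts) for check, hosts in result.items()}
```

Feeding it the log above (for example via `open_problems(open("alerts.log"))`, where `alerts.log` is a hypothetical file holding those lines) drops the mw1053 alerts that recovered and leaves only the pybal and eth3 checks on the lvs hosts.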

However, there's no spike in 5xx and only a tiny spike in the eqiad application server CPU graph (http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=cpu_report&s=by+name&c=Application%2520servers%2520eqiad&tab=m&vn=&hide-hf=false) that seems to correspond, so I'm not calling everyone in.

Event Timeline

yuvipanda raised the priority of this task to Unbreak Now!.
yuvipanda updated the task description.
yuvipanda added projects: SRE, Traffic.
yuvipanda subscribed.

I started pybal on lvs1007 and it seems OK...

Aaah, and puppet is disabled on all of those hosts with:

Notice: Skipping run of Puppet configuration client; administratively disabled (Reason: 'under provisioning, ask faidon/bblack');

/cc @BBlack @faidon

but I guess these machines aren't fully set up, so this is probably OK? I'm not entirely sure, since tcpdump does show traffic going through.

BBlack claimed this task.

Sorry, that's my bad. These hosts were all downtimed in icinga, but the downtimes expired, which let the alerts fire. I've re-downtimed them (for a month this time, although hopefully it won't take that long). They're not in production service; they're still being prepped to eventually replace lvs1001-6 in T104458.