So this happened:
[02:35:30] <icinga-wm> PROBLEM - HHVM rendering on mw1053 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 9.423 second response time
[02:36:00] <icinga-wm> PROBLEM - Apache HTTP on mw1053 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:37:10] <icinga-wm> RECOVERY - HHVM rendering on mw1053 is OK: HTTP OK: HTTP/1.1 200 OK - 64346 bytes in 0.221 second response time
[02:37:10] <icinga-wm> PROBLEM - pybal on lvs1012 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal
[02:37:30] <icinga-wm> PROBLEM - pybal on lvs1010 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal
[02:37:40] <icinga-wm> RECOVERY - Apache HTTP on mw1053 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.174 second response time
[02:37:49] <icinga-wm> PROBLEM - configured eth on lvs1008 is CRITICAL: eth3 reporting no carrier.
[02:37:49] <icinga-wm> PROBLEM - pybal on lvs1009 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal
[02:37:49] <icinga-wm> PROBLEM - configured eth on lvs1011 is CRITICAL: eth3 reporting no carrier.
[02:38:01] <icinga-wm> PROBLEM - configured eth on lvs1010 is CRITICAL: eth3 reporting no carrier.
[02:38:30] <icinga-wm> PROBLEM - pybal on lvs1007 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal
[02:38:30] <icinga-wm> PROBLEM - pybal on lvs1011 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal
[02:38:51] <icinga-wm> PROBLEM - pybal on lvs1008 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal

On investigating, I found that pybal is dead, that eth3 on all these hosts is set to row D, and that they're all configured as one block in puppet manifests/site.pp.

However, there's no spike in 5xx, and only a tiny spike in http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=cpu_report&s=by+name&c=Application%2520servers%2520eqiad&tab=m&vn=&hide-hf=false that seems to correspond, so I'm not calling in everyone.