During or after the rcs1* reboots, Icinga raised PyBal alerts saying it could not depool servers:
17:56 <mutante> !log rcs1001 - depool from rcstream service
17:56 <mutante> arg, the other one.. 1002
17:56 -icinga-wm:#wikimedia-operations- RECOVERY - puppet last run on db2056 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
17:57 <mutante> !log rcs1002 - the last message was about 1002
17:59 <mutante> !log rcs1002 - traffic graph flat in ganglia, reboot
18:01 -icinga-wm:#wikimedia-operations- PROBLEM - Host rcs1002 is DOWN: PING CRITICAL - Packet loss = 100%
18:04 -icinga-wm:#wikimedia-operations- RECOVERY - Host rcs1002 is UP: PING OK - Packet loss = 0%, RTA = 1.81 ms
18:05 <mutante> !log repooling rcs1002
18:10 <mutante> !log depool rcs1001
18:14 <mutante> !log rebooting rcs1001
18:16 -icinga-wm:#wikimedia-operations- PROBLEM - Host rcs1001 is DOWN: PING CRITICAL - Packet loss = 100%
18:19 -icinga-wm:#wikimedia-operations- PROBLEM - PyBal backends health check on lvs1002 is CRITICAL: PYBAL CRITICAL - streamlb_443 - Could not depool server rcs1002.eqiad.wmnet because of too many down!: streamlb6_443 - Could not depool server rcs1002.eqiad.wmnet because of too many down!
18:19 -icinga-wm:#wikimedia-operations- RECOVERY - Host rcs1001 is UP: PING OK - Packet loss = 0%, RTA = 0.90 ms
18:20 <mutante> !log repooling rcs1001
18:24 -icinga-wm:#wikimedia-operations- PROBLEM - PyBal backends health check on lvs1005 is CRITICAL: PYBAL CRITICAL - streamlb_443 - Could not depool server rcs1002.eqiad.wmnet because of too many down!
18:25 -icinga-wm:#wikimedia-operations- PROBLEM - PyBal backends health check on lvs1011 is CRITICAL: PYBAL CRITICAL - streamlb_80 - Could not depool server rcs1001.eqiad.wmnet because of too many down!: streamlb6_80 - Could not depool server rcs1002.eqiad.wmnet because of too many down!
18:29 -icinga-wm:#wikimedia-operations- RECOVERY - Disk space on tin is OK: DISK OK
18:32 <mutante> the PyBal checks there should recover soon
18:32 <mutante> both rcs backends are repooled
18:33 <mutante> and the service is up
Note that the service is up. Monitoring the service from labs also shows that it flapped twice (once per backend reboot).
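The "Could not depool server ... because of too many down!" alerts come from PyBal's depool threshold: it refuses to depool a backend if doing so would leave too little of the pool serving traffic, which is exactly what happens in a two-host pool like rcs100[12] when one host is already down. A minimal sketch of that check, assuming a fractional threshold; the function name and signature are illustrative, not PyBal's actual API:

```python
def can_depool(pooled: int, total: int, depool_threshold: float = 0.5) -> bool:
    """Return True if one more server can be depooled while keeping
    at least depool_threshold * total servers in the pool.
    Hypothetical model of PyBal's depool-threshold behaviour."""
    return (pooled - 1) >= total * depool_threshold

# In a 2-backend pool with a 0.5 threshold: depooling the first host is
# allowed, but depooling the second is refused, raising the
# "too many down!" alert until a backend recovers and is repooled.
assert can_depool(pooled=2, total=2)       # first depool allowed
assert not can_depool(pooled=1, total=2)   # second depool refused
```

This is why the alerts cleared on their own once both rcs backends were repooled: with the full pool healthy again, PyBal was no longer pinned against the threshold.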