Today, when Moritz depooled and rebooted acamar (a DNS recursor in codfw), all the kafka200[123] hosts failed to answer Pybal's ProxyFetch health checks, which caused a brief outage.
Moritz rebooted acamar at ~11:38 UTC, and this is the first error in pybal logs on lvs2003:
[eventbus_8085 ProxyFetch] WARN: kafka2001.codfw.wmnet (enabled/up/pooled): Fetch failed, 30.000 s
The same happened to kafka2002, while kafka2003 was already depooled because of the network maintenance carried out earlier on the row-d switches.
@ema tried to depool acamar again from dns-rec-ln.codfw.wmnet (listed in the /etc/resolv.conf of the hosts) at ~12:09 UTC, and these entries appeared in the pybal logs:
Jul 19 12:09:37 lvs2003 pybal[45967]: [eventbus_8085 ProxyFetch] WARN: kafka2001.codfw.wmnet (enabled/up/pooled): Fetch failed, 6.060 s
Jul 19 12:09:37 lvs2003 pybal[45967]: [eventbus_8085] ERROR: Monitoring instance ProxyFetch reports server kafka2001.codfw.wmnet (enabled/up/pooled) down: Getting http://localhost/v1/topics took longer than 5 seconds.
Jul 19 12:09:37 lvs2003 pybal[45967]: [eventbus_8085 ProxyFetch] WARN: kafka2002.codfw.wmnet (enabled/up/pooled): Fetch failed, 5.001 s
Jul 19 12:09:37 lvs2003 pybal[45967]: [eventbus_8085] ERROR: Monitoring instance ProxyFetch reports server kafka2002.codfw.wmnet (enabled/up/pooled) down: Getting http://localhost/v1/topics took longer than 5 seconds.
Jul 19 12:09:37 lvs2003 pybal[45967]: [eventbus_8085] ERROR: Could not depool server kafka2002.codfw.wmnet because of too many down!
Jul 19 12:09:40 lvs2003 pybal[45967]: [eventbus_8085 ProxyFetch] WARN: kafka2003.codfw.wmnet (enabled/up/pooled): Fetch failed, 8.024 s
Jul 19 12:09:40 lvs2003 pybal[45967]: [eventbus_8085] ERROR: Monitoring instance ProxyFetch reports server kafka2003.codfw.wmnet (enabled/up/pooled) down: Getting http://localhost/v1/topics took longer than 5 seconds.
Jul 19 12:09:40 lvs2003 pybal[45967]: [eventbus_8085] ERROR: Could not depool server kafka2003.codfw.wmnet because of too many down!
Jul 19 12:09:47 lvs2003 pybal[45967]: [eventbus_8085] INFO: Server kafka2001.codfw.wmnet (enabled/partially up/not pooled) is up
Jul 19 12:09:55 lvs2003 pybal[45967]: [eventbus_8085 ProxyFetch] WARN: kafka2003.codfw.wmnet (enabled/partially up/pooled): Fetch failed, 5.001 s
Jul 19 12:09:55 lvs2003 pybal[45967]: [eventbus_8085 ProxyFetch] WARN: kafka2002.codfw.wmnet (enabled/partially up/pooled): Fetch failed, 8.040 s
Jul 19 12:10:10 lvs2003 pybal[45967]: [eventbus_8085 ProxyFetch] WARN: kafka2002.codfw.wmnet (enabled/partially up/pooled): Fetch failed, 5.001 s
Jul 19 12:10:12 lvs2003 pybal[45967]: [eventbus_8085 ProxyFetch] WARN: kafka2001.codfw.wmnet (enabled/up/pooled): Fetch failed, 5.001 s
Jul 19 12:10:12 lvs2003 pybal[45967]: [eventbus_8085] ERROR: Monitoring instance ProxyFetch reports server kafka2001.codfw.wmnet (enabled/up/pooled) down: Getting http://localhost/v1/topics took longer than 5 seconds.
Jul 19 12:10:12 lvs2003 pybal[45967]: [eventbus_8085] ERROR: Could not depool server kafka2001.codfw.wmnet because of too many down!
Jul 19 12:10:13 lvs2003 pybal[45967]: [eventbus_8085 ProxyFetch] WARN: kafka2003.codfw.wmnet (enabled/partially up/pooled): Fetch failed, 8.023 s
Jul 19 12:10:32 lvs2003 pybal[45967]: [eventbus_8085] INFO: Server kafka2001.codfw.wmnet (enabled/partially up/pooled) is up
Jul 19 12:10:32 lvs2003 pybal[45967]: [eventbus_8085] INFO: Leaving previously pooled but down server kafka2001.codfw.wmnet pooled
Jul 19 12:10:32 lvs2003 pybal[45967]: [eventbus_8085] INFO: Server kafka2002.codfw.wmnet (enabled/partially up/pooled) is up
Jul 19 12:10:32 lvs2003 pybal[45967]: [eventbus_8085] INFO: Leaving previously pooled but down server kafka2002.codfw.wmnet pooled
Jul 19 12:10:34 lvs2003 pybal[45967]: [eventbus_8085] INFO: Server kafka2003.codfw.wmnet (enabled/partially up/not pooled) is up
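To make the pattern in these logs easier to see, the "Fetch failed" warnings can be grouped per host. The following is only a rough sketch: the regular expression is inferred from the excerpt above rather than from the pybal source, and the function name `summarize_fetch_failures` is made up for illustration.

```python
import re
from collections import defaultdict

# Pattern inferred from the log excerpt above (an assumption, not taken
# from pybal itself): service name, host, state triple, fetch duration.
FETCH_FAILED = re.compile(
    r"\[(?P<service>\S+) ProxyFetch\] WARN: (?P<host>\S+) "
    r"\((?P<state>[^)]*)\): Fetch failed, (?P<seconds>\d+\.\d+) s"
)


def summarize_fetch_failures(lines):
    """Return {host: [durations in seconds]} for every 'Fetch failed' entry.

    finditer is used so that several entries pasted on one line are
    still all counted.
    """
    failures = defaultdict(list)
    for line in lines:
        for match in FETCH_FAILED.finditer(line):
            failures[match.group("host")].append(float(match.group("seconds")))
    return dict(failures)


# Three entries copied from the excerpt above.
sample = [
    "Jul 19 12:09:37 lvs2003 pybal[45967]: [eventbus_8085 ProxyFetch] WARN: "
    "kafka2001.codfw.wmnet (enabled/up/pooled): Fetch failed, 6.060 s",
    "Jul 19 12:09:37 lvs2003 pybal[45967]: [eventbus_8085 ProxyFetch] WARN: "
    "kafka2002.codfw.wmnet (enabled/up/pooled): Fetch failed, 5.001 s",
    "Jul 19 12:09:37 lvs2003 pybal[45967]: [eventbus_8085] ERROR: "
    "Could not depool server kafka2002.codfw.wmnet because of too many down!",
]

result = summarize_fetch_failures(sample)
for host, durations in sorted(result.items()):
    print(host, durations)
```

Run against the full excerpt, every failure lands between 5 and roughly 8 seconds, right at or slightly above the 5 second limit mentioned in the ERROR lines.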
This suggests a possible issue in how the eventbus code handles changes in the DNS recursors.
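One way to check whether name resolution is really what slows things down would be to time lookups on a kafka host while a recursor is depooled. The snippet below is a minimal sketch using only the Python standard library; `time_resolution` is a hypothetical helper, and which hostname eventbus actually resolves while serving a request is not known from the logs, so the name to test is an assumption to be filled in.

```python
import socket
import time


def time_resolution(hostname, port=80):
    """Resolve hostname via the system resolver and return (seconds, ok).

    ok is False when the lookup fails (socket.gaierror); the elapsed
    time is still reported, since a slow failure is informative too.
    """
    start = time.monotonic()
    try:
        socket.getaddrinfo(hostname, port)
        ok = True
    except socket.gaierror:
        ok = False
    return time.monotonic() - start, ok


if __name__ == "__main__":
    # "localhost" is only a placeholder; substitute a name that the
    # service is expected to look up.
    elapsed, ok = time_resolution("localhost")
    print(f"resolved={ok} in {elapsed:.3f} s")
```

If lookups take several seconds while one recursor is down, that would be consistent with the resolver waiting for a timeout on the first nameserver in /etc/resolv.conf before trying the next one.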