
Restarting pybal caused icinga error
Closed, Resolved (Public)

Description

While following the instructions at https://wikitech.wikimedia.org/wiki/LVS#Deploy_a_change_to_an_existing_service for https://phabricator.wikimedia.org/T298940, I restarted pybal and noticed the following error in #wikimedia-operations:

PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal

That message was posted at 18:07 UTC, and here are the logs from pybal at that time, showing the restart:

May 11 18:06:57 lvs1020 pybal[3858]: [pybal] INFO: Exiting...
May 11 18:06:57 lvs1020 systemd[1]: pybal.service: Succeeded.
May 11 18:06:57 lvs1020 systemd[1]: Stopped PyBal LVS monitor.
May 11 18:07:55 lvs1020 systemd[1]: Started PyBal LVS monitor.
May 11 18:07:55 lvs1020 pybal[17199]: [pybal] INFO: Created LVS service 'apertium_4737'
May 11 18:07:55 lvs1020 pybal[17199]: [pybal] INFO: Created LVS service 'api-gateway_8087'

Pybal started fine and the logs show normal activity again, but the alert has not recovered: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=cr2-eqiad&service=BGP+status

I stopped the update before restarting pybal on lvs1019, the active LVS server, so things are in an in-between state, but I wanted to understand this alert before proceeding.

Event Timeline

So, the icinga check in question was already in a bad state before the pybal restart. This check fires on any BGP session issue on the whole router (cr2-eqiad), and there were already other ongoing session issues (as is somewhat normal!). It just went from WARNING to CRITICAL and then back to WARNING as pybal restarted. Looking at the alert history for this check at https://icinga.wikimedia.org/cgi-bin/icinga/history.cgi?host=cr2-eqiad&service=BGP+status , you can see this sequence at the top (in reverse chronological order):

[2022-05-11 18:24:32] SERVICE ALERT: cr2-eqiad;BGP status;WARNING;HARD;3;BGP WARNING - AS16509/IPv6: Connect (for 189d9h), AS13150/IPv4: Active (for 54d6h), AS16509/IPv6: Connect (for 189d9h)
[2022-05-11 18:22:16] SERVICE ALERT: cr2-eqiad;BGP status;CRITICAL;HARD;3;BGP CRITICAL - AS64605/IPv6: Active - Anycast
[2022-05-11 18:09:10] SERVICE ALERT: cr2-eqiad;BGP status;WARNING;HARD;3;BGP WARNING - AS13150/IPv4: Connect (for 54d6h), AS16509/IPv6: Connect (for 189d9h), AS16509/IPv6: Connect (for 189d9h)
[2022-05-11 18:07:28] SERVICE ALERT: cr2-eqiad;BGP status;CRITICAL;HARD;3;BGP CRITICAL - AS64600/IPv4: Active - PyBal

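"Active" is, confusingly, a bad state: in BGP it means the router is still trying to bring the session up, and only "Established" is healthy. The PyBal session went "Active" during the restart, then the check returned to its "normal" state of warnings on some other sessions, and afterwards a similar but unrelated event happened with the Anycast session.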
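To make the sequence easier to read in chronological order, here is a rough Python sketch that parses the four history lines quoted above and prints the state transitions oldest first. The regular expression and the `parse_alerts` helper are my own illustration, written against the line format as it appears in this paste; they are not part of icinga or any WMF tooling.

```python
import re
from datetime import datetime

# The four alert lines quoted above, copied verbatim (newest first).
HISTORY = """\
[2022-05-11 18:24:32] SERVICE ALERT: cr2-eqiad;BGP status;WARNING;HARD;3;BGP WARNING - AS16509/IPv6: Connect (for 189d9h), AS13150/IPv4: Active (for 54d6h), AS16509/IPv6: Connect (for 189d9h)
[2022-05-11 18:22:16] SERVICE ALERT: cr2-eqiad;BGP status;CRITICAL;HARD;3;BGP CRITICAL - AS64605/IPv6: Active - Anycast
[2022-05-11 18:09:10] SERVICE ALERT: cr2-eqiad;BGP status;WARNING;HARD;3;BGP WARNING - AS13150/IPv4: Connect (for 54d6h), AS16509/IPv6: Connect (for 189d9h), AS16509/IPv6: Connect (for 189d9h)
[2022-05-11 18:07:28] SERVICE ALERT: cr2-eqiad;BGP status;CRITICAL;HARD;3;BGP CRITICAL - AS64600/IPv4: Active - PyBal
"""

# Assumed layout, inferred from the lines above:
# [timestamp] SERVICE ALERT: host;service;state;state type;attempt;plugin output
ALERT_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] SERVICE ALERT: "
    r"(?P<host>[^;]+);(?P<service>[^;]+);(?P<state>[^;]+);"
    r"(?P<kind>[^;]+);(?P<attempt>\d+);(?P<output>.*)$"
)


def parse_alerts(text):
    """Return the alerts found in text as dicts, sorted oldest first."""
    alerts = []
    for line in text.splitlines():
        match = ALERT_RE.match(line.strip())
        if match is None:
            continue  # skip anything that is not a SERVICE ALERT line
        alerts.append({
            "time": datetime.strptime(match["ts"], "%Y-%m-%d %H:%M:%S"),
            "state": match["state"],
            "output": match["output"],
        })
    return sorted(alerts, key=lambda alert: alert["time"])


alerts = parse_alerts(HISTORY)
for alert in alerts:
    print(alert["time"].strftime("%H:%M:%S"), alert["state"], "-", alert["output"])
```

Read oldest first, the states alternate CRITICAL, WARNING, CRITICAL, WARNING, with the PyBal session behind the first CRITICAL and the Anycast session behind the second.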

razzi claimed this task.

Thanks for the explanation @BBlack, nothing to do here so I'll close this.

I saw your comment about no pybal restart being necessary, and that makes sense; I could even see both dbproxy hosts via confctl and I was wondering why.

razzi@cumin1001:~$ sudo confctl select service=wikireplicas-a get
{"dbproxy1018.eqiad.wmnet": {"weight": 0, "pooled": "yes"}, "tags": "dc=eqiad,cluster=wikireplicas-a,service=wikireplicas-a"}
{"dbproxy1019.eqiad.wmnet": {"weight": 0, "pooled": "inactive"}, "tags": "dc=eqiad,cluster=wikireplicas-a,service=wikireplicas-a"}
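For anyone who wants to check the pooled state from a script, here is a small Python sketch that reads output shaped like the two lines above (one JSON object per line, with the hostname as one key and "tags" as the other). The `pooled_states` helper is hypothetical and only assumes the shape seen in this paste; it does not call confctl itself.

```python
import json

# The confctl output pasted above, copied verbatim.
CONFCTL_OUTPUT = """\
{"dbproxy1018.eqiad.wmnet": {"weight": 0, "pooled": "yes"}, "tags": "dc=eqiad,cluster=wikireplicas-a,service=wikireplicas-a"}
{"dbproxy1019.eqiad.wmnet": {"weight": 0, "pooled": "inactive"}, "tags": "dc=eqiad,cluster=wikireplicas-a,service=wikireplicas-a"}
"""


def pooled_states(text):
    """Map each hostname to its 'pooled' value, one JSON object per line."""
    states = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        record = json.loads(line)
        for key, value in record.items():
            if key == "tags":
                continue  # the only non-host key in the output above
            states[key] = value["pooled"]
    return states


for host, pooled in pooled_states(CONFCTL_OUTPUT).items():
    print(host, pooled)
```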

This can be closed, thanks again!