
Toolforge hosted IRC bots occasionally disconnecting
Closed, ResolvedPublic

Event Timeline

[18:31]  <   danilo> someone has an idea why bots are restarting so frequently? I tried to find some reason using wmopbot logs but I just found the problem is related to the connection, the process continues to work but it does not receive the signal that the connection was closed
[18:33]  <   wm-bot> !log tools.bridgebot <lucaswerkmeister> Double IRC messages to other bridges
[18:33]  < stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.bridgebot/SAL
[18:34]  <    wm-bb> <lucaswerkmeister> (speak of the devil)
[21:37]  <    bd808> danilo: I think it is networking related, but I don't know if it is inside WMCS or between WMCS and libera.chat. I think I saw that taavi made a phab task to track the issue, but I can't find it right now. Something has definitely been different since early October.
[21:46]  <   danilo> bd808: it is https://phabricator.wikimedia.org/T320975 (without description yet), I also noted that wm-bot does not have those issues, so I suspect that is related to something wm-bot does not use
[21:50]  <    bd808> danilo: *nod* I think that taavi is agreeing by naming it a Toolforge problem too. wm-bot lives on a Cloud VPS instance in its own project. That means it shares the Cloud VPS network with things living in the Toolforge project, but Toolforge's Kubernetes cluster adds yet another software defined network layer for packets to transit.
[21:51]  <    bd808> danilo: if you have concrete timing related data that you can add to the ticket, especially anything that might help narrow in on when this started, that could be helpful.
[21:55]  <   danilo> ok, I will search in the logs for when exactly that started and add it to the task

Based on the frequency of manual restarts recorded in https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.bridgebot/SAL I would guess this started being a pronounced issue on 2022-10-10.

https://sal.toolforge.org/tools?d=2022-10-10 does not show anything obviously Kubernetes networking related.

The new cloud wide network nodes from T316284: Replace cloudnet100[34] with cloudnet100[56] were put into active service on 2022-10-06 by @aborrero. If wm-bot really has not been affected, I'm not sure how these new hosts could be affecting only traffic to/from the Toolforge Kubernetes cluster.

I also found that it started on 2022-10-10, at approximately 13:50 UTC.

I noted that wmopbot had more restarts than other bots, but wmopbot2, which runs on a second connection in the same process, has far fewer ping timeouts than wmopbot and the other bots. wmopbot is in 120 channels, including channels with more users and messages, and wmopbot2 is in 23 channels with fewer users and messages, so I suspect the amount of traffic each bot receives has some influence on the restarts.

I asked a Libera staffer (moonmoon) and they said the traffic between Libera servers and Toolforge had a spike on October 10th that was not seen on other connections; after that the traffic did not increase significantly.

I didn't realize this before, but wm-bot has a floating IP, and Toolforge k8s nodes do not. So one possibility is that the new hardware has some issue that's affecting NAT'd connections only.

NAT on the new cloudnet hosts seems worth exploring. It could also explain the possible correlation that @Danilo mentioned in T320975#8347831 of connections with more traffic being more commonly affected. If the NAT connection tracking is having problems, it is likely that those problems would be more pronounced for clients expecting more packets.
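
If connection tracking on that path were struggling, a quick first check on the relevant cloudnet/cloudgw host would be the standard Linux conntrack counters. This is only a sketch and nothing here is specific to our setup:

# per-CPU conntrack statistics; non-zero drop / early_drop / insert_failed would be suspicious
sudo conntrack -S
# current table usage versus the configured maximum
sudo sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max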

Looks like keepalived is misbehaving on cloudgw1001:

Oct 28 10:07:17 cloudgw1001 Keepalived_vrrp[950]: (VRRP1) Entering BACKUP STATE
Oct 28 10:16:06 cloudgw1001 Keepalived_vrrp[950]: (VRRP1) Entering MASTER STATE
Oct 28 10:16:06 cloudgw1001 Keepalived_vrrp[950]: (VRRP1) Master received advert from 185.15.56.246 with higher priority 74, ours 57
Oct 28 10:16:06 cloudgw1001 Keepalived_vrrp[950]: (VRRP1) Entering BACKUP STATE
Oct 28 11:22:34 cloudgw1001 Keepalived_vrrp[950]: (VRRP1) Entering MASTER STATE
Oct 28 11:22:34 cloudgw1001 Keepalived_vrrp[950]: (VRRP1) Master received advert from 185.15.56.246 with higher priority 74, ours 57
Oct 28 11:22:34 cloudgw1001 Keepalived_vrrp[950]: (VRRP1) Entering BACKUP STATE
Oct 28 12:16:59 cloudgw1001 Keepalived_vrrp[950]: (VRRP1) Entering MASTER STATE
Oct 28 12:16:59 cloudgw1001 Keepalived_vrrp[950]: (VRRP1) Master received advert from 185.15.56.246 with higher priority 74, ours 57
Oct 28 12:16:59 cloudgw1001 Keepalived_vrrp[950]: (VRRP1) Entering BACKUP STATE
Oct 28 07:43:24 cloudgw1002 Keepalived_vrrp[949]: (VRRP1) Received advert from 185.15.56.245 with lower priority 57, ours 74, forcing new election
Oct 28 08:01:07 cloudgw1002 Keepalived_vrrp[949]: (VRRP1) Received advert from 185.15.56.245 with lower priority 57, ours 74, forcing new election
Oct 28 08:15:20 cloudgw1002 Keepalived_vrrp[949]: (VRRP1) Received advert from 185.15.56.245 with lower priority 57, ours 74, forcing new election
Oct 28 08:17:20 cloudgw1002 Keepalived_vrrp[949]: (VRRP1) Received advert from 185.15.56.245 with lower priority 57, ours 74, forcing new election
Oct 28 09:20:33 cloudgw1002 Keepalived_vrrp[949]: (VRRP1) Received advert from 185.15.56.245 with lower priority 57, ours 74, forcing new election
Oct 28 10:07:17 cloudgw1002 Keepalived_vrrp[949]: (VRRP1) Received advert from 185.15.56.245 with lower priority 57, ours 74, forcing new election
Oct 28 10:16:06 cloudgw1002 Keepalived_vrrp[949]: (VRRP1) Received advert from 185.15.56.245 with lower priority 57, ours 74, forcing new election
Oct 28 11:22:34 cloudgw1002 Keepalived_vrrp[949]: (VRRP1) Received advert from 185.15.56.245 with lower priority 57, ours 74, forcing new election
Oct 28 12:16:59 cloudgw1002 Keepalived_vrrp[949]: (VRRP1) Received advert from 185.15.56.245 with lower priority 57, ours 74, forcing new election

Thanks @taavi. This flapping is likely the problem.

Why it is happening, I'm not able to explain just yet.

> Looks like keepalived is misbehaving on cloudgw1001:
>
> Oct 28 10:07:17 cloudgw1001 Keepalived_vrrp[950]: (VRRP1) Entering BACKUP STATE
> Oct 28 10:16:06 cloudgw1001 Keepalived_vrrp[950]: (VRRP1) Entering MASTER STATE
> Oct 28 10:16:06 cloudgw1001 Keepalived_vrrp[950]: (VRRP1) Master received advert from 185.15.56.246 with higher priority 74, ours 57

This is an odd pattern, and it happens only occasionally. The logical assumption of what is happening here is as follows:

  1. Backup cloudgw goes 3 seconds without seeing a VRRP advertisement from the current master (adv_interval is set to 1sec on both; see the timer note after this list)
  2. Due to this the backup promotes itself to MASTER, adding the IP to its interface and (presumably) sending gratuitous ARP broadcasts on the Vlan.
  3. Every single time this happens (as above) there is a log message within the same second showing it getting an advert from the master.
  4. On receipt of that advert (with higher priority) the backup resets itself to normal backup state.

The obvious question is why it misses 3 adverts in a row, even if only occasionally. Traffic between these devices goes device->cloudsw->cloudsw->device. I was expecting this path to show no discards, given the available 80Gb bandwidth and normal peaks of less than 10Gb. But we do show drops on it from time to time (see 'discards out' on the errors graph):

https://librenms.wikimedia.org/device/device=185/tab=port/port=20103/

Doing some tests, there doesn't appear to be any sign of constant packet loss; far from it, in fact. Below, 100,000 pings sent and responded to in ~7 seconds with no drops:

cmooney@cloudgw1001:~$ sudo ip vrf exec vrf-cloudgw ping -f -c 100000 185.15.56.246
PING 185.15.56.246 (185.15.56.246) 56(84) bytes of data.
 
--- 185.15.56.246 ping statistics ---
100000 packets transmitted, 100000 received, 0% packet loss, time 6836ms
rtt min/avg/max/mdev = 0.030/0.044/0.778/0.011 ms, ipg/ewma 0.068/0.040 ms

Clearly we have instantaneous peaks of bandwidth on these links at much higher rates than the 5-min averages LibreNMS gives us (hence the discards). But in terms of addressing that, there isn't much we can do with the existing hardware:

  1. We already have the class-of-service config to allocate all the buffer space available
  2. There are no more free QSFP+ ports to add to the existing 2x40G link

In the absence of a fix for that, what we could do is add preempt_delay X (I suggest we start with 2) to the vrrp block in /etc/keepalived/keepalived.conf. That would tell the backup to wait longer, in case it has missed 3 adverts from the master, before trying to take over. See the keepalived man page for more info.
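
For illustration, roughly where that option would sit. The instance name and backup priority below are taken from the logs above; the interface, virtual_router_id and VIP are placeholders, not the actual cloudgw configuration:

vrrp_instance VRRP1 {
    state BACKUP
    interface eth0              # placeholder, not the real cloudgw interface
    virtual_router_id 51        # placeholder VRID
    priority 57                 # backup priority seen in the logs above
    advert_int 1                # adverts every 1 second
    preempt_delay 2             # proposed: wait 2 seconds before trying to take over
    virtual_ipaddress {
        185.15.56.244/30        # placeholder VIP
    }
}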

While such techniques can possibly help prevent these blips, they also would delay the time to failover in the case of a genuine failure on the master. So we need to weigh up the tolerances, likelihood of events, and impact the current flaps are having.

I've opened an upstream bug, because I suspect keepalived is misbehaving: https://github.com/acassen/keepalived/issues/2220

note to self: I've left a tcpdump running in a screen session on cloudgw2002-dev/2003-dev over the weekend to see what packet flows we have when that event happens.
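
For example, something along these lines if the interesting traffic is just the VRRP adverts (a sketch; the interface name is a placeholder and would need to match the VRRP interface on those hosts):

# capture only VRRP (IP protocol 112) so the pcap stays small over a weekend
sudo tcpdump -i <iface> -w vrrp-flap.pcap ip proto 112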

> I've opened an upstream bug, because I suspect keepalived is misbehaving: https://github.com/acassen/keepalived/issues/2220

I’m not sure we have evidence of that.

We know the switch is discarding packets, and the behaviour matches that scenario.

Not saying there is not something host-wise affecting it, but I’m doubtful.

> note to self: I've left a tcpdump running in a screen session on cloudgw2002-dev/2003-dev over the weekend to see what packet flows we have when that event happens.

I have now captured at least 2 or 3 such flapping events with tcpdump. There is, however, nothing special going on when the failover happens.

Mentioned in SAL (#wikimedia-cloud) [2022-10-31T13:09:23Z] <arturo> restart keepalived on all 4 cloudgw servers to run them with -D in /etc/default/keepalived to further debug T320975

Another idea: what if conntrack is to blame here? There is a stateful firewall in cloudgw. I can see the conntrack entries being present for the VRRP connection:

aborrero@cloudgw2002-dev:~$ sudo conntrack -L -p 112
unknown  112 599 src=208.80.153.189 dst=208.80.153.188 src=208.80.153.188 dst=208.80.153.189 mark=0 use=1
aborrero@cloudgw2003-dev:~$ sudo conntrack -L -p 112
unknown  112 599 src=208.80.153.188 dst=208.80.153.189 src=208.80.153.189 dst=208.80.153.188 mark=0 use=1

What if the local firewall is not letting the VRRP packet come in until the node itself sends one, so a conntrack entry is created and the connection is therefore ESTABLISHED? I just checked and we don't have a rule open for VRRP traffic in the cloudgw local firewall. So I will investigate this theory next.
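
If that theory holds, the fix would be an explicit accept for VRRP from the peer, something along these lines (the peer address is a placeholder; the table and chain names assume the existing basefirewall ruleset):

# explicitly accept VRRP (IP protocol 112) from the keepalived peer
nft add rule inet basefirewall input ip saddr <peer-address> ip protocol vrrp accept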

Change 851087 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloudgw: explicitly accept VRRP packets from the peer

https://gerrit.wikimedia.org/r/851087

Change 851087 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] cloudgw: explicitly accept VRRP packets from the peer

https://gerrit.wikimedia.org/r/851087

> I have now captured at least 2 or 3 such flapping events with tcpdump. There is, however, nothing special going on when the failover happens.

Hmm ok. So you see a packet from the master every 1 second at those times? This was something I was going to try and do myself. If you don't see a gap in packets arriving at the same time the flap happens then I think we can rule out it being packet drops at the switch level.
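
One way to check that from the existing captures (a sketch; the capture file name is a placeholder, and -ttt prints the gap since the previous packet, so a ~3 second gap right before a flap would point at missing adverts rather than a keepalived bug):

tcpdump -ttt -nr <capture.pcap> ip proto 112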

> What if the local firewall is not letting the VRRP packet come in until the node itself sends one, so a conntrack entry is created and the connection is therefore ESTABLISHED? I just checked and we don't have a rule open for VRRP traffic in the cloudgw local firewall. So I will investigate this theory next.

This is a very interesting idea. When the flap happens the former backup device starts sending adverts as it's now master, and thus could populate such an entry.

The problem, I think, is that this should produce a regular pattern time-wise. This entry appears to have a 10-minute conntrack timeout, so if this were the cause, would it not happen regularly every 10 minutes?

cmooney@cloudgw1001:~$ sudo sysctl net.netfilter.nf_conntrack_generic_timeout
net.netfilter.nf_conntrack_generic_timeout = 600

Then again, the incoming packets should probably refresh the timer, but that would mean the entry never times out, so that's not the cause either.

Looking at the rules, it seems there is one for VRRP. I'm not 100% sure it's correct within the VRF, but I think this is valid for the input chain:

cmooney@cloudgw1001:/etc/nftables$ grep vrrp *
0_keepalived_vrrp_puppet.nft:add rule inet basefirewall input ip saddr 185.15.56.246 ip protocol vrrp accept
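
A quick way to confirm the rule is also active in the running ruleset, and not just present in the file on disk (a sketch):

sudo nft list ruleset | grep -i vrrp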

wmopbot hasn't disconnected for more than 24h; the problem seems to have been resolved. Thank you to all involved in the solution.

Looks like we're not getting off quite that easy, as stashbot's been timing out all day and StewardBot timed out an hour ago.

Every time wmopbot restarts, it restarts the k8s container. Maybe other bots reconnect without restarting the process and container, and that is preventing the changes from being applied to them.

aborrero claimed this task.
aborrero triaged this task as Low priority.
aborrero moved this task from Inbox to Doing on the cloud-services-team (Kanban) board.

A week has passed and the IRC bots seem pretty stable. Closing the task now; please reopen if required.