Today, 2021-11-02, there was a severe network disruption in Cloud VPS & Toolforge.
The timeline is as follows (UTC times):
* 10:30 - @aborrero starts evaluating a kernel upgrade for several Cloud VPS network components (cloudnet, cloudgw servers) see T291813
* 10:43 - reboot of cloudgw2001-dev/2002-dev servers (codfw1dev)
* 11:35 - reboot cloudgw1001 & cloudgw1002 (**outage starts**)
* 11:40 - @Majavah reports on IRC that there are problems with Toolforge NFS on Kubernetes. @dcaro responds to the incident and joins the debugging.
* 11:45 - we start debugging the usual suspects: NFS mounts, LDAP connection issues, etc., mostly focusing on the dumps NFS mounts
* 12:30 - we try several things: flushing caches, restarting NFS services, rebooting VMs, flushing conntrack entries for NFS connections in the edge routers, etc.
* 12:50 - @Majavah points out that the problem is more widespread than just the dumps NFS mounts, and we start looking more closely at the network stack
* 13:00 - by this time it is clear the problem is somewhere at the edge of the network: TCP/IP packets flow in the egress direction, but there is no ingress traffic for VMs without floating IPs.
* 13:20 - @aborrero rolls back the kernel upgrade on cloudgw servers and the network goes back to normal (**outage ends**)
The outage lasted roughly 1h45m (11:35 to 13:20).
As of this writing, the main theory is that the kernel update introduced a regression in the cloudgw setup.
* Good kernel: 5.10.40 (Debian 5.10.0-0.bpo.7-amd64) <-- the one running as I write this, causing no problems.
* Bad kernel: 5.10.70 (Debian 5.10.0-0.bpo.9-amd64) <-- the one that caused the problem.
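A quick way to confirm which kernel a host is actually running after an upgrade or rollback (these are generic Debian commands, not cloudgw-specific tooling):

```shell
# Print the running kernel release, e.g. 5.10.0-0.bpo.7-amd64
uname -r

# List installed Debian kernel image packages and their versions
dpkg -l 'linux-image-*' | awk '/^ii/ {print $2, $3}'
```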
After a quick scan, this patch was identified as potentially being the root cause: https://x-lore.kernel.org/stable/20210824165908.709932-58-sashal@kernel.org/
We contacted upstream to validate the theory (see https://marc.info/?l=netfilter-devel&m=163586356405323&w=2), and the patch was indeed reverted: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=v5.15&id=55161e67d44fdd23900be166a81e996abd6e3be9
===== Action items =====
There are several things that could have been done better:
* the kernel upgrade was done on codfw1dev first, but not enough tests were conducted there. Specifically, @aborrero only verified that the cloudgw servers were alive after the upgrade and didn't run any actual network functionality tests.
** this defeats the purpose of the codfw1dev deployment, which is to test changes before introducing them in eqiad1.
* there are several checks that currently must be run by hand to verify the network is working properly. These could be automated. @aborrero & @dcaro have some ideas for this (possibly based on P15659)
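A minimal sketch of what such an automated verification could look like, in Python. The hostnames and ports below are hypothetical placeholders, and the actual tooling (e.g. whatever grows out of P15659) may take a different approach:

```python
import socket

def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Hypothetical targets a post-upgrade check could probe (placeholder names):
CHECKS = [
    ("nfs-server.example.org", 2049),  # NFS
    ("ldap.example.org", 389),         # LDAP
]

def run_checks(checks=CHECKS):
    """Run all reachability checks; returns {(host, port): bool}."""
    return {(host, port): tcp_reachable(host, port) for host, port in checks}
```

Running something like `run_checks()` from inside a test VM right after a cloudgw/cloudnet change would have caught this regression in codfw1dev before it reached eqiad1.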