Page MenuHomePhabricator

2021-11-02 Cloud VPS network outage
Closed, ResolvedPublic

Description

Incident doc: https://wikitech.wikimedia.org/wiki/Incident_documentation/2021-11-02_Cloud_VPS_networking

Today 2021-11-02 there was a severe network disruption in Cloud VPS & Toolforge.

The timeline is as follows (UTC times):

  • 10:30 - @aborrero starts evaluating a kernel upgrade for several Cloud VPS network components (cloudnet, cloudgw servers) see T291813
  • 10:43 - reboot of cloudgw2001-dev/2002-dev servers (codfw1dev)
  • 11:35 - reboot cloudgw1001 & cloudgw1002 (outage starts)
  • 11:40 - @Majavah reports on IRC problems in Toolforge NFS on Kubernetes. @dcaro responds to the incident and joins us.
  • 11:45 - we start debugging common problems: NFS mounts, LDAP connection issues, etc, mostly focusing on dumps NFS mounts
  • 12:30 - we tested several things: flushing caches, restarting NFS services, rebooting VMs, flushing conntrack entries in edge routers for NFS connections, etc.
  • 12:50 - @Majavah points out loud that that the problem is more widespread than just the dumps NFS mounts, and we start looking more into the network stack
  • 13:00 - at this time is clear the problem is somewhere in the edge of the network. TCP/IP packets flow in egress direction but no ingress, for VMs without floating IPs.
  • 13:20 - @aborrero rolls back the kernel upgrade on cloudgw servers and the network goes back to normal (outage ends)

The outage lasted 1h40m.

As of this writing, the main theory is that the kernel update introduces a regression in the cloudgw setup.

  • Good kernel: 5.10.40 (debian 5.10.0-0.bpo.7-amd64) <-- the one running as I write this, causing no problem.
  • Bad kernel: 5.10.70 (debian 5.10.0-0.bpo.9-amd64) <-- the one that caused the problem.

After a quick scan, this patch was identified as potentially being the root cause: https://x-lore.kernel.org/stable/20210824165908.709932-58-sashal@kernel.org/
We communicated with upstream to validate the theory, see https://marc.info/?l=netfilter-devel&m=163586356405323&w=2 and the patch was indeed reverted https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=v5.15&id=55161e67d44fdd23900be166a81e996abd6e3be9

Action items

There are several things that could have been done better:

  • the kernel upgrade was done on codfw1dev first. But not enough tests were conducted. Specifically, @aborrero only tested that cloudgw servers were alive after the upgrade and didn't do any actual network functionality tests.
    • this defeats the purpose of the codfw1dev deployment, which is to tests changes before introducing them in eqiad1.
  • there are several operations that should be run by hand to verify the network is properly working. This could be automated. @aborrero & @dcaro have some ideas for this (possibly based on P15659)

Event Timeline

aborrero triaged this task as Medium priority.Nov 2 2021, 5:03 PM
aborrero moved this task from Inbox to Soon! on the cloud-services-team (Kanban) board.

Mentioned in SAL (#wikimedia-cloud) [2021-11-03T11:45:35Z] <arturo> [codfw1dev] downgrade kernel on cloudgw2001-dev/2002-dev (T294853, T291813)