
Since around 14:20-14:30 UTC, all of my k8s jobs have started to fail because they cannot connect to en.wikipedia.org.

```
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='en.wikipedia.org', port=443): Max retries exceeded with url: /w/api.php (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f7eaf221d30>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution'))
```

This coincided suspiciously with {T330165}, which took cloudvirt1019/1020 off the network during the operation. They are hypervisors hosting VMs with local storage, which include the Toolforge Kubernetes etcd servers.

We discovered that having 2/3 of etcd down somehow confused calico-node; see {P45970}, and in particular:

```
2023-03-28 14:20:43.928 [INFO][67] felix/conntrack.go 90: Removing conntrack flows ip=192.168.222.147
2023-03-28 14:20:43.928 [INFO][67] felix/route_table.go 896: Remove old route dest=192.168.222.151/32 ifaceName="calib8b5df99f66" ifaceRegex="^cali.*" ipVersion=0x4 routeProblems=[]string{"unexpected route"} tableIndex=0
2023-03-28 14:20:43.928 [INFO][67] felix/conntrack.go 90: Removing conntrack flows ip=192.168.222.151
```

If calico flushed the conntrack NAT information on the local worker nodes, then it is expected that network flows would be affected, which would explain the outage.
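
For context on why losing two etcd servers matters: etcd needs a majority of its members reachable to keep quorum. The sketch below only illustrates that arithmetic. It assumes "2/3" means two members of a three-member cluster, and the function names are hypothetical helpers, not part of etcd or calico.

```python
def quorum_size(members: int) -> int:
    """Smallest majority of a cluster with `members` voting members."""
    if members < 1:
        raise ValueError("cluster must have at least one member")
    return members // 2 + 1


def has_quorum(members: int, down: int) -> bool:
    """True if the members that are still up form a majority."""
    if not 0 <= down <= members:
        raise ValueError("down must be between 0 and members")
    return members - down >= quorum_size(members)


if __name__ == "__main__":
    # Assumed scenario: three etcd servers, two of them unreachable.
    print(has_quorum(members=3, down=2))  # False: quorum is lost
    # Losing a single member of three would still leave a majority.
    print(has_quorum(members=3, down=1))  # True
```

Under that assumption, taking both hypervisors off the network at once leaves etcd without quorum, whereas losing only one of the three servers would not.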