Toolforge k8s: network connectivity problems
Closed, Resolved, Public

Description

Since around 14:20-14:30 UTC, all of my k8s jobs have been failing because they cannot connect to en.wikipedia.org.

requests.exceptions.ConnectionError: HTTPSConnectionPool(host='en.wikipedia.org', port=443): Max retries exceeded with url: /w/api.php (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f7eaf221d30>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution'))
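
The `[Errno -3] Temporary failure in name resolution` part of the traceback suggests that the DNS lookup failed before any TCP connection was attempted. A minimal sketch for checking this from inside an affected pod, using only the Python standard library, could look like the following. The helper name `can_resolve` and its `resolver` parameter are hypothetical and only exist to keep the example easy to test.

```python
import socket


def can_resolve(hostname, port=443, resolver=socket.getaddrinfo):
    """Return True if the hostname resolves to at least one address."""
    try:
        # getaddrinfo is the standard library lookup used before opening a socket.
        return len(resolver(hostname, port)) > 0
    except socket.gaierror as exc:
        # On Linux, errno -3 (EAI_AGAIN) is "Temporary failure in name resolution".
        print(f"resolution failed for {hostname}: {exc}")
        return False
```

Calling `can_resolve("en.wikipedia.org")` from a pod would be expected to return `False` while the problem persists and `True` once name resolution works again.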

This coincided suspiciously with T330165: eqiad row B switches upgrade, which took cloudvirt1019/1020 off the network during the operation.
They are hypervisors hosting VMs with local storage, which include Toolforge Kubernetes etcd servers.
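
For context, etcd can only commit writes while a majority of its members can reach each other. A minimal sketch of that arithmetic, assuming a three-member cluster (the function names are illustrative and not part of any etcd tooling):

```python
def quorum_size(members):
    """Smallest majority of a cluster with `members` voting members."""
    return members // 2 + 1


def has_quorum(members, reachable):
    """True if enough members are reachable to form a majority."""
    return reachable >= quorum_size(members)


# Assumed three-member cluster: one reachable member is not a majority.
print(has_quorum(members=3, reachable=1))  # False
print(has_quorum(members=3, reachable=2))  # True
```

Under that assumption, a cluster that loses two of its three members cannot form a majority until at least one of them comes back.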

We discovered that with 2/3 of etcd down, calico-node somehow got confused, see:

2023-03-28 14:12:39.715 [INFO][67] felix/summary.go 100: Summarising 9 dataplane reconciliation loops over 1m3.3s: avg=39ms longest=172ms (resync-nat-v4)
2023-03-28 14:13:03.835 [INFO][63] monitor-addresses/startup.go 713: Using autodetected IPv4 address on interface eth0: 172.16.2.47/21
2023-03-28 14:13:43.066 [INFO][67] felix/summary.go 100: Summarising 10 dataplane reconciliation loops over 1m3.4s: avg=9ms longest=33ms (resync-routes-v4,resync-routes-v4,resync-rules-v4,resync-wg)
2023-03-28 14:14:03.838 [INFO][63] monitor-addresses/startup.go 713: Using autodetected IPv4 address on interface eth0: 172.16.2.47/21
2023-03-28 14:14:46.909 [INFO][67] felix/summary.go 100: Summarising 11 dataplane reconciliation loops over 1m3.8s: avg=33ms longest=160ms (resync-nat-v4)
2023-03-28 14:15:03.842 [INFO][63] monitor-addresses/startup.go 713: Using autodetected IPv4 address on interface eth0: 172.16.2.47/21
2023-03-28 14:15:49.139 [INFO][67] felix/summary.go 100: Summarising 11 dataplane reconciliation loops over 1m2.2s: avg=40ms longest=206ms (resync-nat-v4)
2023-03-28 14:16:03.846 [INFO][63] monitor-addresses/startup.go 713: Using autodetected IPv4 address on interface eth0: 172.16.2.47/21
2023-03-28 14:16:52.194 [INFO][67] felix/summary.go 100: Summarising 9 dataplane reconciliation loops over 1m3.1s: avg=13ms longest=52ms (resync-routes-v4,resync-routes-v4,resync-rules-v4,resync-wg)
2023-03-28 14:17:03.848 [INFO][63] monitor-addresses/startup.go 713: Using autodetected IPv4 address on interface eth0: 172.16.2.47/21
2023-03-28 14:17:55.498 [INFO][67] felix/summary.go 100: Summarising 10 dataplane reconciliation loops over 1m3.3s: avg=37ms longest=166ms (resync-nat-v4)
2023-03-28 14:18:03.851 [INFO][63] monitor-addresses/startup.go 713: Using autodetected IPv4 address on interface eth0: 172.16.2.47/21
2023-03-28 14:18:58.709 [INFO][67] felix/summary.go 100: Summarising 11 dataplane reconciliation loops over 1m3.2s: avg=34ms longest=170ms (resync-nat-v4)
2023-03-28 14:19:03.856 [INFO][63] monitor-addresses/startup.go 713: Using autodetected IPv4 address on interface eth0: 172.16.2.47/21
2023-03-28 14:19:24.273 [INFO][67] felix/int_dataplane.go 1036: Linux interface state changed. ifIndex=85125 ifaceName="cali356f440101e" state="down"
2023-03-28 14:19:24.273 [INFO][67] felix/int_dataplane.go 1521: Received interface update msg=&intdataplane.ifaceUpdate{Name:"cali356f440101e", State:"down", Index:85125}
2023-03-28 14:19:24.274 [INFO][67] felix/endpoint_mgr.go 418: Workload interface state changed; marking for status update. ifaceName="cali356f440101e"
2023-03-28 14:19:24.274 [INFO][67] felix/endpoint_mgr.go 477: Re-evaluated workload endpoint status adminUp=true failed=false known=true operUp=false status="down" workloadEndpointID=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"tool-glamtools/inventory-scan-27993790-fthqk", EndpointId:"eth0"}
2023-03-28 14:19:24.274 [INFO][67] felix/status_combiner.go 58: Storing endpoint status update ipVersion=0x4 status="down" workload=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"tool-glamtools/inventory-scan-27993790-fthqk", EndpointId:"eth0"}
2023-03-28 14:19:24.274 [INFO][67] felix/status_combiner.go 78: Endpoint down for at least one IP version id=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"tool-glamtools/inventory-scan-27993790-fthqk", EndpointId:"eth0"} ipVersion=0x4 status="down"
2023-03-28 14:19:24.275 [INFO][67] felix/status_combiner.go 98: Reporting combined status. id=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"tool-glamtools/inventory-scan-27993790-fthqk", EndpointId:"eth0"} status="down"
2023-03-28 14:19:24.276 [INFO][67] felix/int_dataplane.go 1071: Linux interface addrs changed. addrs=set.mapSet{} ifaceName="cali356f440101e"
2023-03-28 14:19:24.276 [INFO][67] felix/int_dataplane.go 1539: Received interface addresses update msg=&intdataplane.ifaceAddrsUpdate{Name:"cali356f440101e", Addrs:set.mapSet{}}
2023-03-28 14:19:24.276 [INFO][67] felix/hostip_mgr.go 85: Interface addrs changed. update=&intdataplane.ifaceAddrsUpdate{Name:"cali356f440101e", Addrs:set.mapSet{}}
2023-03-28 14:19:24.276 [INFO][67] felix/iface_monitor.go 201: Netlink address update. addr="fe80::ecee:eeff:feee:eeee" exists=false ifIndex=85125
2023-03-28 14:19:24.293 [INFO][67] felix/int_dataplane.go 1071: Linux interface addrs changed. addrs=<nil> ifaceName="cali356f440101e"
2023-03-28 14:19:24.293 [INFO][67] felix/int_dataplane.go 1539: Received interface addresses update msg=&intdataplane.ifaceAddrsUpdate{Name:"cali356f440101e", Addrs:set.Set(nil)}
2023-03-28 14:19:24.293 [INFO][67] felix/hostip_mgr.go 85: Interface addrs changed. update=&intdataplane.ifaceAddrsUpdate{Name:"cali356f440101e", Addrs:set.Set(nil)}
2023-03-28 14:20:01.854 [INFO][67] felix/summary.go 100: Summarising 11 dataplane reconciliation loops over 1m3.1s: avg=6ms longest=24ms (resync-filter-v4)
2023-03-28 14:20:03.859 [INFO][63] monitor-addresses/startup.go 713: Using autodetected IPv4 address on interface eth0: 172.16.2.47/21
2023-03-28 14:20:26.368 [INFO][67] felix/int_dataplane.go 1071: Linux interface addrs changed. addrs=set.mapSet{} ifaceName="calib8b5df99f66"
2023-03-28 14:20:26.368 [INFO][67] felix/int_dataplane.go 1036: Linux interface state changed. ifIndex=101030 ifaceName="calib8b5df99f66" state="up"
2023-03-28 14:20:26.368 [INFO][67] felix/int_dataplane.go 1539: Received interface addresses update msg=&intdataplane.ifaceAddrsUpdate{Name:"calib8b5df99f66", Addrs:set.mapSet{}}
2023-03-28 14:20:26.368 [INFO][67] felix/hostip_mgr.go 85: Interface addrs changed. update=&intdataplane.ifaceAddrsUpdate{Name:"calib8b5df99f66", Addrs:set.mapSet{}}
2023-03-28 14:20:26.369 [INFO][67] felix/int_dataplane.go 1521: Received interface update msg=&intdataplane.ifaceUpdate{Name:"calib8b5df99f66", State:"up", Index:101030}
2023-03-28 14:20:26.369 [INFO][67] felix/endpoint_mgr.go 366: Workload interface came up, marking for reconfiguration. ifaceName="calib8b5df99f66"
2023-03-28 14:20:26.370 [INFO][67] felix/endpoint_mgr.go 1179: Applying /proc/sys configuration to interface. ifaceName="calib8b5df99f66"
2023-03-28 14:20:26.372 [INFO][67] felix/route_table.go 892: Syncing routes: found unexpected route; ignoring due to grace period. dest=192.168.222.151/32 ifaceName="calib8b5df99f66" ifaceRegex="^cali.*" ipVersion=0x4 tableIndex=0
2023-03-28 14:20:26.372 [INFO][67] felix/route_table.go 892: Syncing routes: found unexpected route; ignoring due to grace period. dest=192.168.222.151/32 ifaceName="calib8b5df99f66" ifaceRegex="^cali.*" ipVersion=0x4 tableIndex=0
2023-03-28 14:20:26.372 [INFO][67] felix/route_table.go 567: Interface in cleanup grace period, will retry after. ifaceName="calib8b5df99f66" ifaceRegex="^cali.*" ipVersion=0x4 tableIndex=0
2023-03-28 14:20:28.113 [INFO][67] felix/iface_monitor.go 201: Netlink address update. addr="fe80::ecee:eeff:feee:eeee" exists=true ifIndex=101030
2023-03-28 14:20:28.114 [INFO][67] felix/int_dataplane.go 1071: Linux interface addrs changed. addrs=set.mapSet{"fe80::ecee:eeff:feee:eeee":set.empty{}} ifaceName="calib8b5df99f66"
2023-03-28 14:20:28.114 [INFO][67] felix/int_dataplane.go 1539: Received interface addresses update msg=&intdataplane.ifaceAddrsUpdate{Name:"calib8b5df99f66", Addrs:set.mapSet{"fe80::ecee:eeff:feee:eeee":set.empty{}}}
2023-03-28 14:20:28.114 [INFO][67] felix/hostip_mgr.go 85: Interface addrs changed. update=&intdataplane.ifaceAddrsUpdate{Name:"calib8b5df99f66", Addrs:set.mapSet{"fe80::ecee:eeff:feee:eeee":set.empty{}}}
2023-03-28 14:20:28.115 [INFO][67] felix/route_table.go 892: Syncing routes: found unexpected route; ignoring due to grace period. dest=192.168.222.151/32 ifaceName="calib8b5df99f66" ifaceRegex="^cali.*" ipVersion=0x4 tableIndex=0
2023-03-28 14:20:28.116 [INFO][67] felix/route_table.go 892: Syncing routes: found unexpected route; ignoring due to grace period. dest=192.168.222.151/32 ifaceName="calib8b5df99f66" ifaceRegex="^cali.*" ipVersion=0x4 tableIndex=0
2023-03-28 14:20:28.116 [INFO][67] felix/route_table.go 567: Interface in cleanup grace period, will retry after. ifaceName="calib8b5df99f66" ifaceRegex="^cali.*" ipVersion=0x4 tableIndex=0
2023-03-28 14:20:30.471 [INFO][67] felix/int_dataplane.go 1071: Linux interface addrs changed. addrs=set.mapSet{} ifaceName="calif1b08e6e3d7"
2023-03-28 14:20:30.472 [INFO][67] felix/int_dataplane.go 1036: Linux interface state changed. ifIndex=101031 ifaceName="calif1b08e6e3d7" state="up"
2023-03-28 14:20:30.490 [INFO][67] felix/int_dataplane.go 1539: Received interface addresses update msg=&intdataplane.ifaceAddrsUpdate{Name:"calif1b08e6e3d7", Addrs:set.mapSet{}}
2023-03-28 14:20:30.490 [INFO][67] felix/hostip_mgr.go 85: Interface addrs changed. update=&intdataplane.ifaceAddrsUpdate{Name:"calif1b08e6e3d7", Addrs:set.mapSet{}}
2023-03-28 14:20:30.494 [INFO][67] felix/route_table.go 892: Syncing routes: found unexpected route; ignoring due to grace period. dest=192.168.222.151/32 ifaceName="calib8b5df99f66" ifaceRegex="^cali.*" ipVersion=0x4 tableIndex=0
2023-03-28 14:20:30.496 [INFO][67] felix/route_table.go 892: Syncing routes: found unexpected route; ignoring due to grace period. dest=192.168.222.151/32 ifaceName="calib8b5df99f66" ifaceRegex="^cali.*" ipVersion=0x4 tableIndex=0
2023-03-28 14:20:30.496 [INFO][67] felix/route_table.go 567: Interface in cleanup grace period, will retry after. ifaceName="calib8b5df99f66" ifaceRegex="^cali.*" ipVersion=0x4 tableIndex=0
2023-03-28 14:20:30.496 [INFO][67] felix/int_dataplane.go 1521: Received interface update msg=&intdataplane.ifaceUpdate{Name:"calif1b08e6e3d7", State:"up", Index:101031}
2023-03-28 14:20:30.497 [INFO][67] felix/endpoint_mgr.go 366: Workload interface came up, marking for reconfiguration. ifaceName="calif1b08e6e3d7"
2023-03-28 14:20:30.497 [INFO][67] felix/endpoint_mgr.go 1179: Applying /proc/sys configuration to interface. ifaceName="calif1b08e6e3d7"
2023-03-28 14:20:30.500 [INFO][67] felix/route_table.go 892: Syncing routes: found unexpected route; ignoring due to grace period. dest=192.168.222.151/32 ifaceName="calib8b5df99f66" ifaceRegex="^cali.*" ipVersion=0x4 tableIndex=0
2023-03-28 14:20:30.503 [INFO][67] felix/route_table.go 892: Syncing routes: found unexpected route; ignoring due to grace period. dest=192.168.222.147/32 ifaceName="calif1b08e6e3d7" ifaceRegex="^cali.*" ipVersion=0x4 tableIndex=0
2023-03-28 14:20:30.511 [INFO][67] felix/route_table.go 892: Syncing routes: found unexpected route; ignoring due to grace period. dest=192.168.222.151/32 ifaceName="calib8b5df99f66" ifaceRegex="^cali.*" ipVersion=0x4 tableIndex=0
2023-03-28 14:20:30.512 [INFO][67] felix/route_table.go 567: Interface in cleanup grace period, will retry after. ifaceName="calib8b5df99f66" ifaceRegex="^cali.*" ipVersion=0x4 tableIndex=0
2023-03-28 14:20:30.512 [INFO][67] felix/route_table.go 892: Syncing routes: found unexpected route; ignoring due to grace period. dest=192.168.222.147/32 ifaceName="calif1b08e6e3d7" ifaceRegex="^cali.*" ipVersion=0x4 tableIndex=0
2023-03-28 14:20:30.512 [INFO][67] felix/route_table.go 567: Interface in cleanup grace period, will retry after. ifaceName="calif1b08e6e3d7" ifaceRegex="^cali.*" ipVersion=0x4 tableIndex=0
2023-03-28 14:20:31.566 [INFO][67] felix/iface_monitor.go 201: Netlink address update. addr="fe80::ecee:eeff:feee:eeee" exists=true ifIndex=101031
2023-03-28 14:20:31.566 [INFO][67] felix/int_dataplane.go 1071: Linux interface addrs changed. addrs=set.mapSet{"fe80::ecee:eeff:feee:eeee":set.empty{}} ifaceName="calif1b08e6e3d7"
2023-03-28 14:20:31.566 [INFO][67] felix/int_dataplane.go 1539: Received interface addresses update msg=&intdataplane.ifaceAddrsUpdate{Name:"calif1b08e6e3d7", Addrs:set.mapSet{"fe80::ecee:eeff:feee:eeee":set.empty{}}}
2023-03-28 14:20:31.566 [INFO][67] felix/hostip_mgr.go 85: Interface addrs changed. update=&intdataplane.ifaceAddrsUpdate{Name:"calif1b08e6e3d7", Addrs:set.mapSet{"fe80::ecee:eeff:feee:eeee":set.empty{}}}
2023-03-28 14:20:31.568 [INFO][67] felix/route_table.go 892: Syncing routes: found unexpected route; ignoring due to grace period. dest=192.168.222.151/32 ifaceName="calib8b5df99f66" ifaceRegex="^cali.*" ipVersion=0x4 tableIndex=0
2023-03-28 14:20:31.568 [INFO][67] felix/route_table.go 892: Syncing routes: found unexpected route; ignoring due to grace period. dest=192.168.222.147/32 ifaceName="calif1b08e6e3d7" ifaceRegex="^cali.*" ipVersion=0x4 tableIndex=0
2023-03-28 14:20:31.569 [INFO][67] felix/route_table.go 892: Syncing routes: found unexpected route; ignoring due to grace period. dest=192.168.222.151/32 ifaceName="calib8b5df99f66" ifaceRegex="^cali.*" ipVersion=0x4 tableIndex=0
2023-03-28 14:20:31.569 [INFO][67] felix/route_table.go 567: Interface in cleanup grace period, will retry after. ifaceName="calib8b5df99f66" ifaceRegex="^cali.*" ipVersion=0x4 tableIndex=0
2023-03-28 14:20:31.570 [INFO][67] felix/route_table.go 892: Syncing routes: found unexpected route; ignoring due to grace period. dest=192.168.222.147/32 ifaceName="calif1b08e6e3d7" ifaceRegex="^cali.*" ipVersion=0x4 tableIndex=0
2023-03-28 14:20:31.570 [INFO][67] felix/route_table.go 567: Interface in cleanup grace period, will retry after. ifaceName="calif1b08e6e3d7" ifaceRegex="^cali.*" ipVersion=0x4 tableIndex=0
2023-03-28 14:20:33.792 [INFO][67] felix/route_table.go 892: Syncing routes: found unexpected route; ignoring due to grace period. dest=192.168.222.151/32 ifaceName="calib8b5df99f66" ifaceRegex="^cali.*" ipVersion=0x4 tableIndex=0
2023-03-28 14:20:33.793 [INFO][67] felix/route_table.go 892: Syncing routes: found unexpected route; ignoring due to grace period. dest=192.168.222.147/32 ifaceName="calif1b08e6e3d7" ifaceRegex="^cali.*" ipVersion=0x4 tableIndex=0
2023-03-28 14:20:33.794 [INFO][67] felix/route_table.go 892: Syncing routes: found unexpected route; ignoring due to grace period. dest=192.168.222.151/32 ifaceName="calib8b5df99f66" ifaceRegex="^cali.*" ipVersion=0x4 tableIndex=0
2023-03-28 14:20:33.794 [INFO][67] felix/route_table.go 567: Interface in cleanup grace period, will retry after. ifaceName="calib8b5df99f66" ifaceRegex="^cali.*" ipVersion=0x4 tableIndex=0
2023-03-28 14:20:33.795 [INFO][67] felix/route_table.go 892: Syncing routes: found unexpected route; ignoring due to grace period. dest=192.168.222.147/32 ifaceName="calif1b08e6e3d7" ifaceRegex="^cali.*" ipVersion=0x4 tableIndex=0
2023-03-28 14:20:33.796 [INFO][67] felix/route_table.go 567: Interface in cleanup grace period, will retry after. ifaceName="calif1b08e6e3d7" ifaceRegex="^cali.*" ipVersion=0x4 tableIndex=0
2023-03-28 14:20:43.927 [INFO][67] felix/route_table.go 896: Remove old route dest=192.168.222.147/32 ifaceName="calif1b08e6e3d7" ifaceRegex="^cali.*" ipVersion=0x4 routeProblems=[]string{"unexpected route"} tableIndex=0
2023-03-28 14:20:43.928 [INFO][67] felix/conntrack.go 90: Removing conntrack flows ip=192.168.222.147
2023-03-28 14:20:43.928 [INFO][67] felix/route_table.go 896: Remove old route dest=192.168.222.151/32 ifaceName="calib8b5df99f66" ifaceRegex="^cali.*" ipVersion=0x4 routeProblems=[]string{"unexpected route"} tableIndex=0
2023-03-28 14:20:43.928 [INFO][67] felix/conntrack.go 90: Removing conntrack flows ip=192.168.222.151
2023-03-28 14:21:03.863 [INFO][63] monitor-addresses/startup.go 713: Using autodetected IPv4 address on interface eth0: 172.16.2.47/21
2023-03-28 14:21:04.412 [INFO][67] felix/summary.go 100: Summarising 17 dataplane reconciliation loops over 1m2.6s: avg=24ms longest=181ms (resync-nat-v4)
2023-03-28 14:21:13.096 [INFO][67] felix/iface_monitor.go 201: Netlink address update. addr="fe80::ecee:eeff:feee:eeee" exists=false ifIndex=101031
2023-03-28 14:21:13.096 [INFO][67] felix/int_dataplane.go 1071: Linux interface addrs changed. addrs=set.mapSet{} ifaceName="calif1b08e6e3d7"
2023-03-28 14:21:13.096 [INFO][67] felix/int_dataplane.go 1036: Linux interface state changed. ifIndex=101031 ifaceName="calif1b08e6e3d7" state="down"
2023-03-28 14:21:13.097 [INFO][67] felix/int_dataplane.go 1539: Received interface addresses update msg=&intdataplane.ifaceAddrsUpdate{Name:"calif1b08e6e3d7", Addrs:set.mapSet{}}
2023-03-28 14:21:13.097 [INFO][67] felix/hostip_mgr.go 85: Interface addrs changed. update=&intdataplane.ifaceAddrsUpdate{Name:"calif1b08e6e3d7", Addrs:set.mapSet{}}
2023-03-28 14:21:13.097 [INFO][67] felix/int_dataplane.go 1521: Received interface update msg=&intdataplane.ifaceUpdate{Name:"calif1b08e6e3d7", State:"down", Index:101031}
2023-03-28 14:21:13.132 [INFO][67] felix/int_dataplane.go 1071: Linux interface addrs changed. addrs=<nil> ifaceName="calif1b08e6e3d7"
2023-03-28 14:21:13.133 [INFO][67] felix/int_dataplane.go 1539: Received interface addresses update msg=&intdataplane.ifaceAddrsUpdate{Name:"calif1b08e6e3d7", Addrs:set.Set(nil)}
2023-03-28 14:21:13.133 [INFO][67] felix/hostip_mgr.go 85: Interface addrs changed. update=&intdataplane.ifaceAddrsUpdate{Name:"calif1b08e6e3d7", Addrs:set.Set(nil)}
2023-03-28 14:22:03.865 [INFO][63] monitor-addresses/startup.go 713: Using autodetected IPv4 address on interface eth0: 172.16.2.47/21
2023-03-28 14:22:08.200 [INFO][67] felix/summary.go 100: Summarising 14 dataplane reconciliation loops over 1m3.8s: avg=24ms longest=171ms (resync-nat-v4)
2023-03-28 14:23:03.868 [INFO][63] monitor-addresses/startup.go 713: Using autodetected IPv4 address on interface eth0: 172.16.2.47/21
2023-03-28 14:23:09.769 [INFO][67] felix/summary.go 100: Summarising 8 dataplane reconciliation loops over 1m1.6s: avg=30ms longest=176ms (resync-nat-v4)
2023-03-28 14:24:03.870 [INFO][63] monitor-addresses/startup.go 713: Using autodetected IPv4 address on interface eth0: 172.16.2.47/21
2023-03-28 14:24:14.402 [INFO][67] felix/summary.go 100: Summarising 11 dataplane reconciliation loops over 1m4.6s: avg=20ms longest=130ms ()
2023-03-28 14:25:03.872 [INFO][63] monitor-addresses/startup.go 713: Using autodetected IPv4 address on interface eth0: 172.16.2.47/21
2023-03-28 14:25:16.025 [INFO][67] felix/summary.go 100: Summarising 11 dataplane reconciliation loops over 1m1.6s: avg=39ms longest=215ms (resync-nat-v4)
2023-03-28 14:26:03.875 [INFO][63] monitor-addresses/startup.go 713: Using autodetected IPv4 address on interface eth0: 172.16.2.47/21
2023-03-28 14:26:21.225 [INFO][67] felix/summary.go 100: Summarising 10 dataplane reconciliation loops over 1m5.2s: avg=22ms longest=141ms (resync-nat-v4)
2023-03-28 14:27:03.877 [INFO][63] monitor-addresses/startup.go 713: Using autodetected IPv4 address on interface eth0: 172.16.2.47/21
2023-03-28 14:27:23.505 [INFO][67] felix/summary.go 100: Summarising 9 dataplane reconciliation loops over 1m2.3s: avg=22ms longest=120ms ()
2023-03-28 14:28:03.880 [INFO][63] monitor-addresses/startup.go 713: Using autodetected IPv4 address on interface eth0: 172.16.2.47/21
2023-03-28 14:28:05.946 [INFO][67] felix/int_dataplane.go 1071: Linux interface addrs changed. addrs=set.mapSet{} ifaceName="calia22ceccf57c"
2023-03-28 14:28:05.947 [INFO][67] felix/int_dataplane.go 1036: Linux interface state changed. ifIndex=101032 ifaceName="calia22ceccf57c" state="up"
2023-03-28 14:28:05.947 [INFO][67] felix/int_dataplane.go 1539: Received interface addresses update msg=&intdataplane.ifaceAddrsUpdate{Name:"calia22ceccf57c", Addrs:set.mapSet{}}
2023-03-28 14:28:05.947 [INFO][67] felix/hostip_mgr.go 85: Interface addrs changed. update=&intdataplane.ifaceAddrsUpdate{Name:"calia22ceccf57c", Addrs:set.mapSet{}}
2023-03-28 14:28:05.947 [INFO][67] felix/int_dataplane.go 1521: Received interface update msg=&intdataplane.ifaceUpdate{Name:"calia22ceccf57c", State:"up", Index:101032}
2023-03-28 14:28:05.948 [INFO][67] felix/endpoint_mgr.go 366: Workload interface came up, marking for reconfiguration. ifaceName="calia22ceccf57c"
2023-03-28 14:28:05.948 [INFO][67] felix/endpoint_mgr.go 1179: Applying /proc/sys configuration to interface. ifaceName="calia22ceccf57c"
2023-03-28 14:28:05.949 [INFO][67] felix/route_table.go 892: Syncing routes: found unexpected route; ignoring due to grace period. dest=192.168.222.181/32 ifaceName="calia22ceccf57c" ifaceRegex="^cali.*" ipVersion=0x4 tableIndex=0
2023-03-28 14:28:05.950 [INFO][67] felix/route_table.go 892: Syncing routes: found unexpected route; ignoring due to grace period. dest=192.168.222.181/32 ifaceName="calia22ceccf57c" ifaceRegex="^cali.*" ipVersion=0x4 tableIndex=0
2023-03-28 14:28:05.950 [INFO][67] felix/route_table.go 567: Interface in cleanup grace period, will retry after. ifaceName="calia22ceccf57c" ifaceRegex="^cali.*" ipVersion=0x4 tableIndex=0
and in particular:

	2023-03-28 14:20:43.928 [INFO][67] felix/conntrack.go 90: Removing conntrack flows ip=192.168.222.147
	2023-03-28 14:20:43.928 [INFO][67] felix/route_table.go 896: Remove old route dest=192.168.222.151/32 ifaceName="calib8b5df99f66" ifaceRegex="^cali.*" ipVersion=0x4 routeProblems=[]string{"unexpected route"} tableIndex=0
	2023-03-28 14:20:43.928 [INFO][67] felix/conntrack.go 90: Removing conntrack flows ip=192.168.222.151

If the local worker node's conntrack NAT information was flushed by calico, then it is expected that network flows would be affected, which would explain the outage.
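
To see how widespread these flushes were, the `Removing conntrack flows` lines can be pulled out of a calico-node log. The following is a minimal sketch based only on the line format shown in the excerpt; the function name `conntrack_removals` is made up for illustration, and other felix versions may log a different format.

```python
import re

# Pattern derived from the excerpt, e.g.:
# 2023-03-28 14:20:43.928 [INFO][67] felix/conntrack.go 90: Removing conntrack flows ip=192.168.222.147
CONNTRACK_RE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) .*"
    r"felix/conntrack\.go \d+: Removing conntrack flows ip=(?P<ip>\S+)"
)


def conntrack_removals(lines):
    """Yield (timestamp, ip) for each conntrack flush logged by felix."""
    for line in lines:
        match = CONNTRACK_RE.search(line)
        if match:
            yield match.group("ts"), match.group("ip")


sample = [
    "2023-03-28 14:20:43.928 [INFO][67] felix/conntrack.go 90: Removing conntrack flows ip=192.168.222.147",
    "2023-03-28 14:21:03.863 [INFO][63] monitor-addresses/startup.go 713: Using autodetected IPv4 address on interface eth0: 172.16.2.47/21",
    "2023-03-28 14:20:43.928 [INFO][67] felix/conntrack.go 90: Removing conntrack flows ip=192.168.222.151",
]

for timestamp, ip in conntrack_removals(sample):
    print(timestamp, ip)
```

In practice the input could be the lines of a saved calico-node log file instead of the inline `sample` list.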

Event Timeline

JJMC89 renamed this task from "Toolforge k8s not connecting to en.wikipedia.org" to "Toolforge k8s temporary failure in name resolution". Mar 28 2023, 3:37 PM
JJMC89 updated the task description.
taavi triaged this task as Unbreak Now! priority. Mar 28 2023, 3:38 PM
aborrero renamed this task from "Toolforge k8s temporary failure in name resolution" to "Toolforge k8s: network connectivity problems". Mar 28 2023, 4:49 PM
aborrero lowered the priority of this task from Unbreak Now! to High.
aborrero updated the task description.
aborrero added subscribers: taavi, aborrero.

The solution proposed by @taavi was to reboot the Toolforge k8s worker node fleet. So I did, and Toolforge recovered its healthy status.

We didn't have any automation for that, so it is being tracked at T333379: toolforge kubernetes: create roll-reboot cookbook
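
As a rough illustration of what such automation has to sequence, the sketch below builds a drain, reboot, uncordon plan for one node at a time. It is not the cookbook tracked in T333379: the function name, the node names, and the use of `ssh` to trigger the reboot are all assumptions made for the example. Only `kubectl drain` and `kubectl uncordon` are standard commands.

```python
def roll_reboot_plan(nodes):
    """Return the ordered commands for rebooting nodes one at a time."""
    plan = []
    for node in nodes:
        # Evict workloads first so that pods are rescheduled elsewhere.
        plan.append(["kubectl", "drain", node,
                     "--ignore-daemonsets", "--delete-emptydir-data"])
        # Assumption: the reboot is triggered over ssh; real tooling may differ.
        plan.append(["ssh", node, "sudo", "reboot"])
        # Allow scheduling on the node again once it is back.
        plan.append(["kubectl", "uncordon", node])
    return plan


# Hypothetical node names, for illustration only.
for command in roll_reboot_plan(["worker-1", "worker-2"]):
    print(" ".join(command))
```

The plan is only printed here, not executed. A real implementation would also need to wait for each node to report Ready after the reboot before uncordoning it and moving on to the next one; that wait is omitted from this sketch.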