
cloudvirts: the rocky/buster combo has iptables/ebtables issues, producing errors when launching VMs (and probably other stuff)
Closed, Resolved (Public)

Description

After a while the buster hypervisors (e.g. cloudvirt1031-1039) stop being able to launch new VMs.

neutron-linuxbridge-agent.log shows some distress:

2020-09-15 20:31:02.851 57275 ERROR neutron.plugins.ml2.drivers.agent._common_agent -D neutron-linuxbri-sg-chain 27
2020-09-15 20:31:02.851 57275 ERROR neutron.plugins.ml2.drivers.agent._common_agent -D neutron-linuxbri-sg-chain 27
2020-09-15 20:31:02.851 57275 ERROR neutron.plugins.ml2.drivers.agent._common_agent -X neutron-linuxbri-i12cfa48e-f
2020-09-15 20:31:02.851 57275 ERROR neutron.plugins.ml2.drivers.agent._common_agent -X neutron-linuxbri-i76611dde-d
2020-09-15 20:31:02.851 57275 ERROR neutron.plugins.ml2.drivers.agent._common_agent -X neutron-linuxbri-ia7cb22f9-1
2020-09-15 20:31:02.851 57275 ERROR neutron.plugins.ml2.drivers.agent._common_agent -X neutron-linuxbri-ic2d341a6-0
2020-09-15 20:31:02.851 57275 ERROR neutron.plugins.ml2.drivers.agent._common_agent -X neutron-linuxbri-id5d42f4d-8
2020-09-15 20:31:02.851 57275 ERROR neutron.plugins.ml2.drivers.agent._common_agent -X neutron-linuxbri-iebbe5257-8
2020-09-15 20:31:02.851 57275 ERROR neutron.plugins.ml2.drivers.agent._common_agent -X neutron-linuxbri-o12cfa48e-f
2020-09-15 20:31:02.851 57275 ERROR neutron.plugins.ml2.drivers.agent._common_agent -X neutron-linuxbri-o76611dde-d
2020-09-15 20:31:02.851 57275 ERROR neutron.plugins.ml2.drivers.agent._common_agent -X neutron-linuxbri-oa7cb22f9-1
2020-09-15 20:31:02.851 57275 ERROR neutron.plugins.ml2.drivers.agent._common_agent -X neutron-linuxbri-oc2d341a6-0
2020-09-15 20:31:02.851 57275 ERROR neutron.plugins.ml2.drivers.agent._common_agent -X neutron-linuxbri-od5d42f4d-8
2020-09-15 20:31:02.851 57275 ERROR neutron.plugins.ml2.drivers.agent._common_agent -X neutron-linuxbri-oebbe5257-8
2020-09-15 20:31:02.851 57275 ERROR neutron.plugins.ml2.drivers.agent._common_agent -X neutron-linuxbri-s12cfa48e-f
2020-09-15 20:31:02.851 57275 ERROR neutron.plugins.ml2.drivers.agent._common_agent -X neutron-linuxbri-s76611dde-d
2020-09-15 20:31:02.851 57275 ERROR neutron.plugins.ml2.drivers.agent._common_agent -X neutron-linuxbri-sa7cb22f9-1
2020-09-15 20:31:02.851 57275 ERROR neutron.plugins.ml2.drivers.agent._common_agent -X neutron-linuxbri-sc2d341a6-0
2020-09-15 20:31:02.851 57275 ERROR neutron.plugins.ml2.drivers.agent._common_agent -X neutron-linuxbri-sd5d42f4d-8
2020-09-15 20:31:02.851 57275 ERROR neutron.plugins.ml2.drivers.agent._common_agent -X neutron-linuxbri-sebbe5257-8
2020-09-15 20:31:02.851 57275 ERROR neutron.plugins.ml2.drivers.agent._common_agent COMMIT
2020-09-15 20:31:02.851 57275 ERROR neutron.plugins.ml2.drivers.agent._common_agent # Completed by iptables_manager
2020-09-15 20:31:02.851 57275 ERROR neutron.plugins.ml2.drivers.agent._common_agent # Generated by iptables_manager
2020-09-15 20:31:02.851 57275 ERROR neutron.plugins.ml2.drivers.agent._common_agent *raw
2020-09-15 20:31:02.851 57275 ERROR neutron.plugins.ml2.drivers.agent._common_agent -I neutron-linuxbri-PREROUTING 1 -m physdev --physdev-in brq7425e328-56 -m comment --comment "Set zone for a945d81-14" -j CT --zone 4097
2020-09-15 20:31:02.851 57275 ERROR neutron.plugins.ml2.drivers.agent._common_agent -D neutron-linuxbri-PREROUTING 2
2020-09-15 20:31:02.851 57275 ERROR neutron.plugins.ml2.drivers.agent._common_agent -I neutron-linuxbri-PREROUTING 2 -i brq7425e328-56 -m comment --comment "Set zone for a945d81-14" -j CT --zone 4097
2020-09-15 20:31:02.851 57275 ERROR neutron.plugins.ml2.drivers.agent._common_agent -I neutron-linuxbri-PREROUTING 3 -m physdev --physdev-in tap0a945d81-14 -m comment --comment "Set zone for a945d81-14" -j CT --zone 4097
2020-09-15 20:31:02.851 57275 ERROR neutron.plugins.ml2.drivers.agent._common_agent -D neutron-linuxbri-PREROUTING 4
2020-09-15 20:31:02.851 57275 ERROR neutron.plugins.ml2.drivers.agent._common_agent -I neutron-linuxbri-PREROUTING 4 -m physdev --physdev-in brq7425e328-56 -m comment --comment "Set zone for d5bf8d1-fa" -j CT --zone 4097
2020-09-15 20:31:02.851 57275 ERROR neutron.plugins.ml2.drivers.agent._common_agent -D neutron-linuxbri-PREROUTING 5
2020-09-15 20:31:02.851 57275 ERROR neutron.plugins.ml2.drivers.agent._common_agent -D neutron-linuxbri-PREROUTING 5
2020-09-15 20:31:02.851 57275 ERROR neutron.plugins.ml2.drivers.agent._common_agent -I neutron-linuxbri-PREROUTING 5 -i brq7425e328-56 -m comment --comment "Set zone for d5bf8d1-fa" -j CT --zone 4097
2020-09-15 20:31:02.851 57275 ERROR neutron.plugins.ml2.drivers.agent._common_agent -I neutron-linuxbri-PREROUTING 6 -m physdev --physdev-in tap1d5bf8d1-fa -m comment --comment "Set zone for d5bf8d1-fa" -j CT --zone 4097
2020-09-15 20:31:02.851 57275 ERROR neutron.plugins.ml2.drivers.agent._common_agent -I neutron-linuxbri-PREROUTING 7 -m physdev --physdev-in brq7425e328-56 -m comment --comment "Set zone for 3c8f574-b1" -j CT --zone 4097
2020-09-15 20:31:02.851 57275 ERROR neutron.plugins.ml2.drivers.agent._common_agent -I neutron-linuxbri-PREROUTING 8 -i brq7425e328-56 -m comment --comment "Set zone for 3c8f574-b1" -j CT --zone 4097
2020-09-15 20:31:02.851 57275 ERROR neutron.plugins.ml2.drivers.agent._common_agent -I neutron-linuxbri-PREROUTING 9 -m physdev --physdev-in tap23c8f574-b1 -m comment --comment "Set zone for 3c8f574-b1" -j CT --zone 4097
2020-09-15 20:31:02.851 57275 ERROR neutron.plugins.ml2.drivers.agent._common_agent -I neutron-linuxbri-PREROUTING 10 -m physdev --physdev-in brq7425e328-56 -m comment --comment "Set zone for 9fb781f-47" -j CT --zone 4097
2020-09-15 20:31:02.851 57275 ERROR neutron.plugins.ml2.drivers.agent._common_agent -I neutron-linuxbri-PREROUTING 11 -i brq7425e328-56 -m comment --comment "Set zone for 9fb781f-47" -j CT --zone 4097
2020-09-15 20:31:02.851 57275 ERROR neutron.plugins.ml2.drivers.agent._common_agent -I neutron-linuxbri-PREROUTING 13 -m physdev --physdev-in brq7425e328-56 -m comment --comment "Set zone for 319106d-f6" -j CT --zone 4097
2020-09-15 20:31:02.851 57275 ERROR neutron.plugins.ml2.drivers.agent._common_agent -I neutron-linuxbri-PREROUTING 14 -i brq7425e328-56 -m comment --comment "Set zone for 319106d-f6" -j CT --zone 4097
2020-09-15 20:31:02.851 57275 ERROR neutron.plugins.ml2.drivers.agent._common_agent -I neutron-linuxbri-PREROUTING 15 -m physdev --physdev-in tap5319106d-f6 -m comment --comment "Set zone for 319106d-f6" -j CT --zone 4097
2020-09-15 20:31:02.851 57275 ERROR neutron.plugins.ml2.drivers.agent._common_agent -I neutron-linuxbri-PREROUTING 16 -m physdev --physdev-in brq7425e328-56 -m comment --comment "Set zone for 98a9535-d8" -j CT --zone 4097
2020-09-15 20:31:02.851 57275 ERROR neutron.plugins.ml2.drivers.agent._common_agent -D neutron-linuxbri-PREROUTING 17
2020-09-15 20:31:02.851 57275 ERROR neutron.plugins.ml2.drivers.agent._common_agent -I neutron-linuxbri-PREROUTING 17 -i brq7425e328-56 -m comment --comment "Set zone for 98a9535-d8" -j CT --zone 4097
2020-09-15 20:31:02.851 57275 ERROR neutron.plugins.ml2.drivers.agent._common_agent -I neutron-linuxbri-PREROUTING 18 -m physdev --physdev-in tap698a9535-d8 -m comment --comment "Set zone for 98a9535-d8" -j CT --zone 4097
2020-09-15 20:31:02.851 57275 ERROR neutron.plugins.ml2.drivers.agent._common_agent -I neutron-linuxbri-PREROUTING 19 -m physdev --physdev-in brq7425e328-56 -m comment --comment "Set zone for 3326578-a3" -j CT --zone 4097
2020-09-15 20:31:02.851 57275 ERROR neutron.plugins.ml2.drivers.agent._common_agent -I neutron-linuxbri-PREROUTING 20 -i brq7425e328-56 -m comment --comment "Set zone for 3326578-a3" -j CT --zone 4097
2020-09-15 20:31:02.851 57275 ERROR neutron.plugins.ml2.drivers.agent._common_agent -I neutron-linuxbri-PREROUTING 21 -m physdev --physdev-in tap83326578-a3 -m comment --comment "Set zone for 3326578-a3" -j CT --zone 4097
2020-09-15 20:31:02.851 57275 ERROR neutron.plugins.ml2.drivers.agent._common_agent -I neutron-linuxbri-PREROUTING 22 -m physdev --physdev-in brq7425e328-56 -m comment --comment "Set zone for c63d731-ca" -j CT --zone 4097
2020-09-15 20:31:02.851 57275 ERROR neutron.plugins.ml2.drivers.agent._common_agent -I neutron-linuxbri-PREROUTING 23 -i brq7425e328-56 -m comment --comment "Set zone for c63d731-ca" -j CT --zone 4097
2020-09-15 20:31:02.851 57275 ERROR neutron.plugins.ml2.drivers.agent._common_agent -I neutron-linuxbri-PREROUTING 24 -m physdev --physdev-in tap9c63d731-ca -m comment --comment "Set zone for c63d731-ca" -j CT --zone 4097
2020-09-15 20:31:02.851 57275 ERROR neutron.plugins.ml2.drivers.agent._common_agent -D neutron-linuxbri-PREROUTING 25
2020-09-15 20:31:02.851 57275 ERROR neutron.plugins.ml2.drivers.agent._common_agent -I neutron-linuxbri-PREROUTING 25 -m physdev --physdev-in brq7425e328-56 -m comment --comment "Set zone for 7a79673-91" -j CT --zone 4097
2020-09-15 20:31:02.851 57275 ERROR neutron.plugins.ml2.drivers.agent._common_agent -I neutron-linuxbri-PREROUTING 26 -i brq7425e328-56 -m comment --comment "Set zone for 7a79673-91" -j CT --zone 4097
2020-09-15 20:31:02.851 57275 ERROR neutron.plugins.ml2.drivers.agent._common_agent -I neutron-linuxbri-PREROUTING 27 -m physdev --physdev-in tapa7a79673-91 -m comment --comment "Set zone for 7a79673-91" -j CT --zone 4097
2020-09-15 20:31:02.851 57275 ERROR neutron.plugins.ml2.drivers.agent._common_agent -I neutron-linuxbri-PREROUTING 28 -m physdev --physdev-in brq7425e328-56 -m comment --comment "Set zone for bb00f7c-5f" -j CT --zone 4097
2020-09-15 20:31:02.851 57275 ERROR neutron.plugins.ml2.drivers.agent._common_agent -I neutron-linuxbri-PREROUTING 29 -i brq7425e328-56 -m comment --comment "Set zone for bb00f7c-5f" -j CT --zone 4097
2020-09-15 20:31:02.851 57275 ERROR neutron.plugins.ml2.drivers.agent._common_agent -I neutron-linuxbri-PREROUTING 30 -m physdev --physdev-in tapcbb00f7c-5f -m comment --comment "Set zone for bb00f7c-5f" -j CT --zone 4097
2020-09-15 20:31:02.851 57275 ERROR neutron.plugins.ml2.drivers.agent._common_agent -I neutron-linuxbri-PREROUTING 31 -m physdev --physdev-in brq7425e328-56 -m comment --comment "Set zone for 62efe63-a7" -j CT --zone 4097
2020-09-15 20:31:02.851 57275 ERROR neutron.plugins.ml2.drivers.agent._common_agent -I neutron-linuxbri-PREROUTING 32 -i brq7425e328-56 -m comment --comment "Set zone for 62efe63-a7" -j CT --zone 4097
2020-09-15 20:31:02.851 57275 ERROR neutron.plugins.ml2.drivers.agent._common_agent -I neutron-linuxbri-PREROUTING 33 -m physdev --physdev-in tape62efe63-a7 -m comment --comment "Set zone for 62efe63-a7" -j CT --zone 4097
2020-09-15 20:31:02.851 57275 ERROR neutron.plugins.ml2.drivers.agent._common_agent -I neutron-linuxbri-PREROUTING 34 -m physdev --physdev-in brq7425e328-56 -m comment --comment "Set zone for 6ab0568-59" -j CT --zone 4097
2020-09-15 20:31:02.851 57275 ERROR neutron.plugins.ml2.drivers.agent._common_agent -I neutron-linuxbri-PREROUTING 35 -i brq7425e328-56 -m comment --comment "Set zone for 6ab0568-59" -j CT --zone 4097
2020-09-15 20:31:02.851 57275 ERROR neutron.plugins.ml2.drivers.agent._common_agent -I neutron-linuxbri-PREROUTING 36 -m physdev --physdev-in tape6ab0568-59 -m comment --comment "Set zone for 6ab0568-59" -j CT --zone 4097
2020-09-15 20:31:02.851 57275 ERROR neutron.plugins.ml2.drivers.agent._common_agent -I neutron-linuxbri-PREROUTING 37 -m physdev --physdev-in brq7425e328-56 -m comment --comment "Set zone for a1a0fe7-f0" -j CT --zone 4097
2020-09-15 20:31:02.851 57275 ERROR neutron.plugins.ml2.drivers.agent._common_agent -I neutron-linuxbri-PREROUTING 38 -i brq7425e328-56 -m comment --comment "Set zone for a1a0fe7-f0" -j CT --zone 4097
2020-09-15 20:31:02.851 57275 ERROR neutron.plugins.ml2.drivers.agent._common_agent -I neutron-linuxbri-PREROUTING 39 -m physdev --physdev-in tapea1a0fe7-f0 -m comment --comment "Set zone for a1a0fe7-f0" -j CT --zone 4097
2020-09-15 20:31:02.851 57275 ERROR neutron.plugins.ml2.drivers.agent._common_agent COMMIT
2020-09-15 20:31:02.851 57275 ERROR neutron.plugins.ml2.drivers.agent._common_agent # Completed by iptables_manager
2020-09-15 20:31:02.851 57275 ERROR neutron.plugins.ml2.drivers.agent._common_agent ; Stdout: ; Stderr: iptables-restore v1.8.2 (nf_tables): 
2020-09-15 20:31:02.851 57275 ERROR neutron.plugins.ml2.drivers.agent._common_agent line 44: RULE_DELETE failed (No such file or directory): rule in chain neutron-linuxbri-FORWARD
2020-09-15 20:31:02.851 57275 ERROR neutron.plugins.ml2.drivers.agent._common_agent line 45: RULE_DELETE failed (No such file or directory): rule in chain neutron-linuxbri-FORWARD
2020-09-15 20:31:02.851 57275 ERROR neutron.plugins.ml2.drivers.agent._common_agent line 46: RULE_DELETE failed (No such file or directory): rule in chain neutron-linuxbri-FORWARD
2020-09-15 20:31:02.851 57275 ERROR neutron.plugins.ml2.drivers.agent._common_agent line 58: RULE_DELETE failed (No such file or directory): rule in chain neutron-linuxbri-FORWARD
2020-09-15 20:31:02.851 57275 ERROR neutron.plugins.ml2.drivers.agent._common_agent line 59: RULE_DELETE failed (No such file or directory): rule in chain neutron-linuxbri-FORWARD
2020-09-15 20:31:02.851 57275 ERROR neutron.plugins.ml2.drivers.agent._common_agent line 66: RULE_DELETE failed (No such file or directory): rule in chain neutron-linuxbri-INPUT
2020-09-15 20:31:02.851 57275 ERROR neutron.plugins.ml2.drivers.agent._common_agent line 74: RULE_DELETE failed (No such file or directory): rule in chain neutron-linuxbri-INPUT
2020-09-15 20:31:02.851 57275 ERROR neutron.plugins.ml2.drivers.agent._common_agent line 76: RULE_DELETE failed (No such file or directory): rule in chain neutron-linuxbri-INPUT
2020-09-15 20:31:02.851 57275 ERROR neutron.plugins.ml2.drivers.agent._common_agent line 444: RULE_DELETE failed (No such file or directory): rule in chain neutron-linuxbri-sg-chain
2020-09-15 20:31:02.851 57275 ERROR neutron.plugins.ml2.drivers.agent._common_agent line 445: RULE_DELETE failed (No such file or directory): rule in chain neutron-linuxbri-sg-chain
2020-09-15 20:31:02.851 57275 ERROR neutron.plugins.ml2.drivers.agent._common_agent line 459: RULE_DELETE failed (No such file or directory): rule i
2020-09-15 20:31:02.851 57275 ERROR neutron.plugins.ml2.drivers.agent._common_agent 
2020-09-15 20:31:02.851 57275 ERROR neutron.plugins.ml2.drivers.agent._common_agent 
2020-09-15 20:31:03.956 57275 INFO neutron.plugins.ml2.drivers.agent._common_agent [req-73395e27-3743-4723-8923-cac25608efa3 - - - - -] Linux bridge agent Agent out of sync with plugin!
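
To see which chains are affected without reading the whole dump, the `RULE_DELETE failed` lines can be tallied per chain. This is a minimal sketch, assuming the stderr lines always follow the format quoted above; the function name `count_rule_delete_failures` and the regular expression are illustrative, not part of neutron or iptables.

```python
import re
from collections import Counter

# Matches the iptables-restore error lines quoted in the log above, e.g.
# "line 44: RULE_DELETE failed (No such file or directory): rule in chain neutron-linuxbri-FORWARD".
# The format is an assumption based on that excerpt only.
RULE_DELETE_RE = re.compile(
    r"line (?P<line>\d+): RULE_DELETE failed \((?P<reason>[^)]*)\): "
    r"rule in chain (?P<chain>\S+)"
)


def count_rule_delete_failures(log_text):
    """Return a Counter mapping chain name to the number of RULE_DELETE failures."""
    counts = Counter()
    for match in RULE_DELETE_RE.finditer(log_text):
        counts[match.group("chain")] += 1
    return counts
```

Lines that the logger cut off before the chain name (such as the last error line above) do not match and are not counted. Reading `/var/log/neutron/neutron-linuxbridge-agent.log` and printing `count_rule_delete_failures(text).most_common()` would give a quick per-chain summary.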

Event Timeline

Mentioned in SAL (#wikimedia-cloud) [2020-09-15T20:32:34Z] <andrewbogott> rebooting cloudvirt1038 to see if it resolves T262979

Rebooting a host resolves the issue. Restarting nova-compute and neutron-linuxbridge-agent does not.

@arturo, this issue is currently present on cloudvirt1033.

So I drained cloudvirt1033 to see if stopping all kvm processes would resolve the problem. It didn't. BUT, the process of draining 1033 caused the problem to appear on a second cloudvirt, 1034. So maybe there's one unlucky VM that triggers the issue, or maybe being a recipient of live migrations triggers it. Going to see if I can make it happen again...

Andrew added a subscriber: aborrero.

Welp, draining cloudvirt1034 did NOT cause the issue to pop up on a different host. So now no servers are exhibiting the issue, which is both good and bad. I've no doubt it will reappear after tomorrow's migrations in any case.

Change 627773 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] openstack: rocky/buster: use more modern netfilter components

https://gerrit.wikimedia.org/r/627773

I guess the reboot resulted in the server running a newer kernel.

For the record:

aborrero@cumin1001:~ 15s $ sudo cumin 'cloudvirt1*' 'grep VERSION_CODENAME /etc/os-release ; uname -r'
37 hosts will be targeted:
cloudvirt[1001-1009,1012-1039].eqiad.wmnet
Confirm to continue [y/n]? y
===== NODE GROUP =====                                                                                   
(12) cloudvirt[1004,1006,1024,1031-1039].eqiad.wmnet                                                     
----- OUTPUT of 'grep VERSION_COD...lease ; uname -r' -----                                              
VERSION_CODENAME=buster                                                                                  
4.19.0-10-amd64                                                                                          
===== NODE GROUP =====                                                                                   
(1) cloudvirt1020.eqiad.wmnet                                                                            
----- OUTPUT of 'grep VERSION_COD...lease ; uname -r' -----                                              
VERSION_CODENAME=stretch                                                                                 
4.9.0-8-amd64                                                                                            
===== NODE GROUP =====                                                                                   
(1) cloudvirt1015.eqiad.wmnet                                                                            
----- OUTPUT of 'grep VERSION_COD...lease ; uname -r' -----                                              
VERSION_CODENAME=stretch                                                                                 
4.9.0-13-amd64                                                                                           
===== NODE GROUP =====                                                                                   
(7) cloudvirt[1001-1003,1005,1007,1017-1018].eqiad.wmnet                                                 
----- OUTPUT of 'grep VERSION_COD...lease ; uname -r' -----                                              
VERSION_CODENAME=stretch                                                                                 
4.9.0-9-amd64                                                                                            
===== NODE GROUP =====                                                                                   
(2) cloudvirt[1014,1022].eqiad.wmnet                                                                     
----- OUTPUT of 'grep VERSION_COD...lease ; uname -r' -----                                              
VERSION_CODENAME=stretch                                                                                 
4.9.0-12-amd64                                                                                           
===== NODE GROUP =====                                                                                   
(14) cloudvirt[1008-1009,1012-1013,1016,1019,1021,1023,1025-1030].eqiad.wmnet                            
----- OUTPUT of 'grep VERSION_COD...lease ; uname -r' -----                                              
VERSION_CODENAME=stretch                                                                                 
4.9.0-11-amd64                                                                                           
================    

Anyway, I wrote https://gerrit.wikimedia.org/r/627773 which should help. But if we are unable to reproduce the issue, then I would simply blame the kernel and not merge the patch.

aborrero triaged this task as Medium priority. Sep 16 2020, 10:06 AM
aborrero moved this task from Inbox to Doing on the cloud-services-team (Kanban) board.
aborrero updated the task description.

I'm checking periodically for this issue with:

sudo cumin --force --timeout 500 "cloudvirt10*" "grep 'RULE_DELETE failed' /var/log/neutron/neutron-linuxbridge-agent.log"

Mentioned in SAL (#wikimedia-cloud) [2020-09-18T08:50:26Z] <arturo> installing iptables from buster-bpo in cloudvirt1036 (T263205 and T262979)

I think T263205: Strange NFS client outage on VMs running on cloudvirt1036 is related to this issue. I discovered the same error on cloudvirt1036, which vanished as soon as I installed the newer iptables package.

I'm merging the patch and upgrading iptables on all buster cloudvirts. I'm not a fan of doing this kind of stuff on a Friday, but this may make our weekend easier.

Mentioned in SAL (#wikimedia-cloud) [2020-09-18T08:59:10Z] <arturo> disable puppet in all buster cloudvirts (cloudvirt[1024,1031-1039].eqiad.wmnet) to merge a patch for T263205 and T262979

Change 627773 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] openstack: rocky/buster: use more modern netfilter components

https://gerrit.wikimedia.org/r/627773

It seems we don't have any buster cloudvirts in codfw1dev. It would have been awesome to be able to test this there first.

Change 628302 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] openstack: rocky/buster: also pin other related packages required by modern iptables

https://gerrit.wikimedia.org/r/628302

Change 628302 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] openstack: rocky/buster: fixes for iptables updates

https://gerrit.wikimedia.org/r/628302

Mentioned in SAL (#wikimedia-cloud) [2020-09-18T09:50:18Z] <arturo> enabling puppet in cloudvirts and effectively merging patches from T262979

I suspect what's happening here is:

  • there is some compatibility issue with older iptables/ebtables. The version from buster-bpo should be much more reliable.
  • some VMs have a security group configuration attached to them that, when the VM is relocated from one cloudvirt to another, triggers the compatibility bug in iptables/ebtables, thus making the virtual network config unreliable (see also T263205)
  • simply running a fixed/more modern iptables/ebtables suite prevents the issue from happening.

Other alternatives we could explore in case we see additional issues:

  • move away from iptables-nft and use iptables-legacy instead. This is relatively easy (already done in k8s), but not future-proof either.
  • move away from neutron-linuxbridge-agent and use neutron-openvswitch-agent instead. This seems to be the preferred way upstream and therefore more future-proof, but it involves a lot of work because it basically means reworking all our config for the neutron switching driver. Side benefit: vxlan (multi-row) support.
aborrero claimed this task.
aborrero raised the priority of this task from Medium to High.

With all that being said, I'm closing the task. Please reopen if required!

aborrero renamed this task from Buster cloudvirts unable to launch new VMs to cloudvirts: the rocky/buster combo has iptables/ebtables issues, producing errors when launching VMs (and probably other stuff). Sep 18 2020, 10:13 AM

We are experiencing more issues, reopening.

Sent a report upstream: https://marc.info/?l=netfilter-devel&m=160145994003146&w=2

I'll be switching the servers to iptables-legacy meanwhile.
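
One way to confirm which backend a host ends up on after the switch is to look at the version banner, which in iptables 1.8.x names the backend in parentheses (the failing run above reported `iptables-restore v1.8.2 (nf_tables)`). The following is a rough sketch, assuming that banner format; `parse_backend` and the regular expression are hypothetical helpers, not part of iptables or the puppet patch.

```python
import re

# Assumed banner shape, taken from the log excerpt in this task:
#   "iptables-restore v1.8.2 (nf_tables)"
# Older releases may print no backend at all, so that group is optional.
BANNER_RE = re.compile(
    r"^(?P<tool>\S+) v(?P<version>[\d.]+)(?: \((?P<backend>[^)]+)\))?"
)


def parse_backend(banner):
    """Return (version, backend) from a version banner; backend is None if absent."""
    match = BANNER_RE.match(banner.strip())
    if match is None:
        raise ValueError(f"unrecognised banner: {banner!r}")
    return match.group("version"), match.group("backend")
```

Feeding it the output of `iptables --version` on each cloudvirt should then report `legacy` rather than `nf_tables` once the change has been applied.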

Change 631167 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] openstack: neutron: use iptables-legacy

https://gerrit.wikimedia.org/r/631167

Mentioned in SAL (#wikimedia-cloud) [2020-09-30T11:33:12Z] <arturo> disabling puppet and downtiming every virt/net server in the fleet in preparation for merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/631167 (T262979)

Change 631167 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] openstack: neutron: use iptables-legacy

https://gerrit.wikimedia.org/r/631167

Mentioned in SAL (#wikimedia-cloud) [2020-09-30T16:47:32Z] <andrewbogott> rebooting cloudvir1032, 1033, 1034 for T262979

I've now forced a puppet run and rebooted all the Buster cloudvirts.