
cloudvps: neutron issue with split brain
Closed, Resolved · Public

Description

The issue is that, under some circumstances, Neutron can have two active routing nodes (l3-agents on the cloudnet boxes), while our current configuration only supports the active-passive model.
This results in duplicated IP addresses in the network (neutron gw), asymmetric routing, duplicated packets, and other issues.

The trigger is usually a simple puppet change to some neutron component, because the puppet agent currently restarts the l3-agent on such changes. The restarted l3-agent tries to go directly to the active role within neutron, even if another l3-agent is already in the active role.

At first we thought this was just some nova-network<->neutron compatibility issue, but eqiad1-r hosts are behaving strangely over the internet.
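A minimal sketch of how to confirm the split brain, using the same commands that appear in the IRC excerpt later in this task (<router-uuid> is a placeholder for the HA router directory under /var/lib/neutron/ha_confs/):

# from a cloudcontrol host: the router should not show as active on both agents
cloudcontrol1003:~# neutron l3-agent-list-hosting-router cloudinstances2b-gw

# on each cloudnet host: only one of the two should report "master"
cloudnet1003:~# cat /var/lib/neutron/ha_confs/<router-uuid>/state
cloudnet1004:~# cat /var/lib/neutron/ha_confs/<router-uuid>/state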

Event Timeline

Restricted Application added a subscriber: Aklapper.

SSH sometimes completely breaks:

alex@alex-laptop:~$ ssh eqiad1.bastion.wmflabs.org
Connection reset by 185.15.56.13 port 22
alex@alex-laptop:~$ ssh eqiad1.bastion.wmflabs.org
Linux bastion-eqiad1-01 4.9.0-7-amd64 #1 SMP Debian 4.9.110-1 (2018-07-05) x86_64
Debian GNU/Linux 9.5 (stretch)
bastion-eqiad1-01 is a Cloud VPS bastion host (with mosh enabled) (labs::bastion)
The last Puppet run was at Wed Nov  7 00:10:55 UTC 2018 (10 minutes ago). 
Last login: Tue Nov  6 23:28:48 2018 from 31.48.107.117
krenair@bastion-eqiad1-01:~$ exit
logout
Connection to eqiad1.bastion.wmflabs.org closed.
alex@alex-laptop:~$ ssh eqiad1.bastion.wmflabs.org
Connection reset by 185.15.56.13 port 22

Sometimes you get other errors, like packet_write_wait: Connection to 185.15.56.13 port 22: Broken pipe or ssh_exchange_identification: read: Connection reset by peer.

Pinging shows duplicate packets, both from within labs:

krenair@bastion-01:~$ ping eqiad1.bastion.wmflabs.org
PING eqiad1.bastion.wmflabs.org (172.16.1.136) 56(84) bytes of data.
64 bytes from bastion-eqiad1-01.bastion.eqiad.wmflabs (172.16.1.136): icmp_seq=1 ttl=63 time=0.545 ms
64 bytes from bastion-eqiad1-01.bastion.eqiad.wmflabs (172.16.1.136): icmp_seq=1 ttl=63 time=0.592 ms (DUP!)
64 bytes from bastion-eqiad1-01.bastion.eqiad.wmflabs (172.16.1.136): icmp_seq=2 ttl=63 time=0.490 ms
64 bytes from bastion-eqiad1-01.bastion.eqiad.wmflabs (172.16.1.136): icmp_seq=3 ttl=63 time=1.09 ms
64 bytes from bastion-eqiad1-01.bastion.eqiad.wmflabs (172.16.1.136): icmp_seq=4 ttl=63 time=0.616 ms
64 bytes from bastion-eqiad1-01.bastion.eqiad.wmflabs (172.16.1.136): icmp_seq=5 ttl=63 time=0.784 ms
64 bytes from bastion-eqiad1-01.bastion.eqiad.wmflabs (172.16.1.136): icmp_seq=6 ttl=63 time=8.02 ms
64 bytes from bastion-eqiad1-01.bastion.eqiad.wmflabs (172.16.1.136): icmp_seq=6 ttl=63 time=8.04 ms (DUP!)

and from my own device:

alex@alex-laptop:~$ ping eqiad1.bastion.wmflabs.org
PING eqiad1.bastion.wmflabs.org (185.15.56.13) 56(84) bytes of data.
64 bytes from eqiad1.bastion.wmflabs.org (185.15.56.13): icmp_seq=1 ttl=48 time=90.5 ms
64 bytes from eqiad1.bastion.wmflabs.org (185.15.56.13): icmp_seq=4 ttl=48 time=92.2 ms
64 bytes from eqiad1.bastion.wmflabs.org (185.15.56.13): icmp_seq=5 ttl=48 time=89.3 ms
64 bytes from eqiad1.bastion.wmflabs.org (185.15.56.13): icmp_seq=6 ttl=48 time=90.7 ms
64 bytes from eqiad1.bastion.wmflabs.org (185.15.56.13): icmp_seq=6 ttl=48 time=90.7 ms (DUP!)
64 bytes from eqiad1.bastion.wmflabs.org (185.15.56.13): icmp_seq=7 ttl=48 time=89.8 ms
Krenair renamed this task from Network failure around eqiad1-r to Network failure around eqiad1-r (cloud-vps neutron). Nov 7 2018, 12:26 AM
Krenair updated the task description.

and when you do get onto an eqiad1-r host, you also see DUPs while pinging out to other hosts:

krenair@gerrit-mysql:~$ ping bastion.wmflabs.org
PING bastion.wmflabs.org (10.68.17.232) 56(84) bytes of data.
64 bytes from bastion-01.bastion.eqiad.wmflabs (10.68.17.232): icmp_seq=1 ttl=63 time=4.03 ms
64 bytes from bastion-01.bastion.eqiad.wmflabs (10.68.17.232): icmp_seq=2 ttl=63 time=2.08 ms
64 bytes from bastion-01.bastion.eqiad.wmflabs (10.68.17.232): icmp_seq=3 ttl=63 time=4.52 ms
64 bytes from bastion-01.bastion.eqiad.wmflabs (10.68.17.232): icmp_seq=5 ttl=63 time=6.64 ms
64 bytes from bastion-01.bastion.eqiad.wmflabs (10.68.17.232): icmp_seq=6 ttl=63 time=0.632 ms
64 bytes from bastion-01.bastion.eqiad.wmflabs (10.68.17.232): icmp_seq=4 ttl=63 time=2041 ms
64 bytes from bastion-01.bastion.eqiad.wmflabs (10.68.17.232): icmp_seq=5 ttl=63 time=1027 ms (DUP!)
64 bytes from bastion-01.bastion.eqiad.wmflabs (10.68.17.232): icmp_seq=7 ttl=63 time=9.56 ms
^C
--- bastion.wmflabs.org ping statistics ---
7 packets transmitted, 7 received, +1 duplicates, 0% packet loss, time 6021ms
rtt min/avg/max/mdev = 0.632/387.139/2041.842/709.455 ms, pipe 3
krenair@gerrit-mysql:~$ ping 8.8.8.8
PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
64 bytes from 8.8.8.8: icmp_seq=1 ttl=124 time=0.759 ms
64 bytes from 8.8.8.8: icmp_seq=2 ttl=124 time=2.59 ms
64 bytes from 8.8.8.8: icmp_seq=3 ttl=124 time=1.03 ms
64 bytes from 8.8.8.8: icmp_seq=3 ttl=124 time=1.05 ms (DUP!)
Paladox triaged this task as Unbreak Now! priority. Nov 7 2018, 12:32 AM

Mentioned in SAL (#countervandalism) [2018-11-07T00:44:58Z] <Krinkle> Problems with bots are not due to unexpected reboot (there was no such reboot). Instead, the cause is https://phabricator.wikimedia.org/T208916 (wmflabs internal network problems).

Paladox lowered the priority of this task from Unbreak Now! to High. Nov 7 2018, 12:53 AM

Lowering priority now that the issue has been fixed (so no user impact).

Bstorm subscribed.

After a puppet change earlier today, the OpenStack neutron servers were both brought up running as masters at the same time, i.e. a split brain. We resolved this by simply rebooting one of them so that it took on the standby role, but removing the puppet service restart is in order until we can make it handle the failover setup better.

I'll reduce the priority due to the immediate situation's resolution, but we need to take steps to prevent a simple config change from bringing this about again.
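For reference, a rough sketch of the recovery and verification described above (hostnames from this incident; <router-uuid> is a placeholder, and the exact non-master string in the keepalived state file may vary):

# take the node that should give up the active role out of service (a reboot was used here)
cloudnet1004:~# reboot

# once it is back, confirm that only one node reports "master" for the HA router
cloudnet1003:~# cat /var/lib/neutron/ha_confs/<router-uuid>/state
cloudnet1004:~# cat /var/lib/neutron/ha_confs/<router-uuid>/state
cloudcontrol1003:~# neutron l3-agent-list-hosting-router cloudinstances2b-gw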

Some bits from #wikimedia-cloud-admin during the incident:

[00:22]  <  paladox>	Krenair bd808 from what i can tell the problem started at "[23:10:38] (uk time)" (maybe a little earlier but i started getting notified at that time)
...
[00:40]  <  chasemp>	this looks off
[00:40]  <  chasemp>	cloudcontrol1003:~# neutron l3-agent-list-hosting-router cloudinstances2b-gw
[00:40]  <  chasemp>	| 8af5d8a1-2e29-40e6-baf0-3cd79a7ac77b | cloudnet1003 | True           | :-)   | active   |
[00:40]  <  chasemp>	| 970df1d1-505d-47a4-8d35-1b13c0dfe098 | cloudnet1004 | True           | :-)   | active   |
...
[00:44]  <  chasemp>	cloudnet1004:~# cat /var/lib/neutron/ha_confs/d93771ba-2711-4f88-804a-8df6fd03978a/state
[00:44]  <  chasemp>	master
...
[00:45]  <  chasemp>	cloudnet1003:~# cat /var/lib/neutron/ha_confs/d93771ba-2711-4f88-804a-8df6fd03978a/state
[00:45]  <  chasemp>	master
...
[00:49]  <  chasemp>	# Do not reschedule the dataplane components
[00:49]  <  chasemp>	# just because a control plane l3-agent is maint
[00:49]  <  chasemp>	# or restarted. This also means it will not get changes
[00:49]  <  chasemp>	# so we need to monitor l3-agent state administratively.
[00:49]  <  chasemp>	allow_automatic_l3agent_failover = false
[00:49]  <  chasemp> modules/openstack/templates/mitaka/neutron/neutron.conf.erb
[00:50]  <  chasemp>	I thought that means "don't fool with active/passive failover just bc the service restarts"
[00:50]  <  chasemp>	but apparently that is broke down
[00:50]  <  chasemp>	someone will have to investigate that business, seems fishy
[00:51]  <  chasemp>	I would consider removing the auto restart notify in puppet for l3-agent for now :)

@aborrero, one actionable from this: we need a better runbook for debugging neutron problems. There are crumbs at https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Neutron but they are a bit confusing to try and follow.

aborrero renamed this task from Network failure around eqiad1-r (cloud-vps neutron) to cloudvps: neutron issue with split brain. Nov 7 2018, 9:24 AM
aborrero claimed this task.
aborrero updated the task description.
aborrero moved this task from Inbox to Doing on the cloud-services-team (Kanban) board.

Several things here:

regarding the split brain issue itself:

  • ideally, we could set up some puppet mechanism to ensure that only a given l3-agent is in 'active' mode. But we don't have much integration between neutron state/live configuration (what lives in the DB) and puppet, which restricts our ability to programmatically ensure neutron configuration/state sanity (a rough monitoring sketch follows this list).
  • we could, in the short term, disable the l3-agent service refresh when puppet changes happen (the subscribe mechanism in openstack::neutron::l3_agent).
  • @chasemp suggested we should investigate the allow_automatic_l3agent_failover = false setting in neutron.conf.
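A rough sketch (not an existing script) of what an administrative check for the first point could look like, assuming the neutron CLI is available on a cloudcontrol host and that the agent table prints the word "active" for agents in the active state, as in the IRC excerpt above:

#!/bin/bash
# hypothetical check: alert when more than one l3-agent is active for the main router
ROUTER="cloudinstances2b-gw"
ACTIVE=$(neutron l3-agent-list-hosting-router "$ROUTER" | grep -c ' active ')
if [ "$ACTIVE" -gt 1 ]; then
    echo "CRITICAL: $ACTIVE active l3-agents for $ROUTER (possible split brain)"
    exit 2
fi
echo "OK: $ACTIVE active l3-agent(s) for $ROUTER"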

Since this is not the first time we have suffered this issue, I will work on it right now.

Change 472118 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] cloudvps: neutron: don't restart l3-agent on config changes

https://gerrit.wikimedia.org/r/472118

Change 472118 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] cloudvps: neutron: don't restart l3-agent on config changes

https://gerrit.wikimedia.org/r/472118
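A quick way to check the effect of the merged change (a sketch, assuming the l3-agent runs as the neutron-l3-agent systemd unit on the cloudnet hosts): run puppet and confirm the service was not restarted.

cloudnet1003:~# systemctl show -p ActiveEnterTimestamp neutron-l3-agent
cloudnet1003:~# puppet agent --test
cloudnet1003:~# systemctl show -p ActiveEnterTimestamp neutron-l3-agent   # should be unchanged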

I guess we can close this now and reopen if necessary.