
Default source group (security group) allowances do not update properly
Closed, Resolved (Public)

Description

Steps to reproduce

  1. Create a new instance in tools, intended to be pooled as a k8s worker node (https://wikitech.wikimedia.org/wiki/Tools_Kubernetes#Worker_nodes)
  2. SSH in
  3. Start preparing to switch puppetmasters: 'sudo rm -rf /var/lib/puppet/ssl'
  4. Try to run puppet
  5. Watch as it times out
  6. Try to curl any port on the puppetmaster: 'curl tools-puppetmaster-01:3422', watch as it times out (rather than produce connection failure immediately)
  7. Wait for a while (10-20mins)
  8. Everything is OK now.
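
Step 6 hinges on the difference between a timeout and an immediate connection failure. A minimal sketch of a probe that tells the two apart, assuming Python 3 is available on the instance; the function name `probe` and the 5-second timeout are illustrative choices, not part of the original report:

```python
import socket


def probe(host, port, timeout=5.0):
    """Classify a TCP connection attempt as 'open', 'refused', or 'timeout'.

    'refused' usually means the SYN reached the host and a RST came back;
    'timeout' means nothing came back, which is what silently dropped
    packets look like from the client side.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"
    except socket.timeout:
        return "timeout"
    except OSError as exc:
        # Name resolution failures, unreachable networks, and so on.
        return "error: %s" % exc
```

Calling `probe("tools-puppetmaster-01", 8140)` from the new instance would be expected to return "timeout" while the problem is present, if the diagnosis in this task is right.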

Event Timeline

This behavior (wait forever, then time out) is consistent with what happens when there are security group issues, except there are no security group restrictions between these nodes; they are also all in the same project. The problem also resolves with time, with no other action taken.

root@tools-puppetmaster-01:/home/yuvipanda# sudo tcpdump src 10.68.21.13
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
03:54:59.517707 ARP, Request who-has tools-puppetmaster-01.tools.eqiad.wmflabs tell tools-worker-1008.tools.eqiad.wmflabs, length 42
yuvipanda@tools-worker-1008:~$ sudo tcpdump dst 10.68.22.61
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
03:57:26.012958 IP tools-worker-1008.tools.eqiad.wmflabs.54108 > tools-puppetmaster-01.tools.eqiad.wmflabs.8140: Flags [S], seq 970365477, win 29200, options [mss 1460,sackOK,TS val 1093845 ecr 0,nop,wscale 9], length 0
03:57:27.010859 IP tools-worker-1008.tools.eqiad.wmflabs.54108 > tools-puppetmaster-01.tools.eqiad.wmflabs.8140: Flags [S], seq 970365477, win 29200, options [mss 1460,sackOK,TS val 1094095 ecr 0,nop,wscale 9], length 0
03:57:29.014845 IP tools-worker-1008.tools.eqiad.wmflabs.54108 > tools-puppetmaster-01.tools.eqiad.wmflabs.8140: Flags [S], seq 970365477, win 29200, options [mss 1460,sackOK,TS val 1094596 ecr 0,nop,wscale 9], length 0
03:57:33.022835 IP tools-worker-1008.tools.eqiad.wmflabs.54108 > tools-puppetmaster-01.tools.eqiad.wmflabs.8140: Flags [S], seq 970365477, win 29200, options [mss 1460,sackOK,TS val 1095598 ecr 0,nop,wscale 9], length 0

The captures above were taken while running:

yuvipanda@tools-worker-1008:~$ curl tools-puppetmaster-01:8140

So packets leave tools-worker-1008 but never reach tools-puppetmaster-01.

This still hasn't resolved, an hour after the initial node creation.

I restarted nova-network, no change in behavior.

Mentioned in SAL [2016-08-05T04:07:17Z] <yuvipanda> restarted nova-network on labnet1002 for T142165

I encountered similar behavior yesterday in deployment-prep. I created deployment-kafka05. It needed to talk to deployment-zookeeper01. For the first (at least) 15 minutes of its life, it could talk to several other nodes in deployment-prep, but not deployment-zookeeper01. Nor could deployment-zookeeper01 initiate a connection to deployment-kafka05. It was as if there was a network partition between deployment-kafka05 and deployment-zookeeper01, but each node could still talk to other nodes in the project.

I only tested a couple of other nodes in the project, so it is possible it affected more than just this one connection.

I can still reach another node that has different security groups applied but is on the same labvirt host (tools-worker-1008 and tools-docker-builder-01), but not nodes on different labvirt hosts (tools-exec-1201).

chasemp renamed this task from Super weird networking behavior to Default source group allowances do not work post Liberty upgrade. Aug 5 2016, 7:36 PM
chasemp triaged this task as High priority.

I'm helping on this as well, but a lot of it falls into the realm of OpenStack internals that @Andrew is best suited to sort out.

I've disabled:

  1. All access to Special:NovaSecurityGroup
  2. Add instance button on Special:NovaInstance

on wikitech with a MediaWiki:Common.js hack. People can still technically work around it, but it seems good enough for this use case.

To revert, go to https://wikitech.wikimedia.org/wiki/MediaWiki:Common.js and remove code starting at line 114 (and marked with title of this bug).

I've run quite a lot of tests in the project 'testlabs,' generally using the host labvirt1014 which is a spare host and doesn't interfere with labs users. The tests generally consist of running "python -m SimpleHTTPServer" on an instance and then seeing if other hosts can telnet to port 8000.

I've reimaged 1014, both with a fresh liberty install and with a fresh kilo install. In both cases (and before the reimaging), VMs on labvirt1014 simply ignore all security rules and permit all traffic. No matter what I do (including many permutations of security rules), access to VMs on 1014 is fully unrestricted.

The same test on labvirt1002 works correctly. Access is restricted if the security rules say it should be, and access is open if the security rules open port 8000. This includes meddling with the 'source group' setting -- when the 'source group' says to only allow access to instances in the testlabs project, access is appropriately granted.

From this I conclude that at least /this/ issue is somehow not an OpenStack issue at all, but something to do with the network setup of the labvirt1014 system itself. First guess: nova-network traffic is on a different NIC from the one managed by iptables/security rules. I don't, of course, know if this is actually possible :/

I just ran this test on every single labvirt host. labvirt1001-1011 behave correctly. Only labvirt1012 through 1014 exhibit the bad 'no firewalling' behavior.

The 'no firewalling' issue is now resolved, thanks to kernel downgrades. The actual issue in question is probably this:

$ git describe --contains 34666d467cbf1e2e3c7bb15a63eccfb582cdd71f
v3.18-rc1~115^2~111^2~2
netfilter: bridge: move br_netfilter out of the core
Note that this is breaking compatibility for users that expect that
bridge netfilter is going to be available after explicitly 'modprobe
bridge' or via automatic load through brctl.

However, the damage can be easily undone by modprobing br_netfilter.
The bridge core also spots a message to provide a clue to people that
didn't notice that this has been deprecated.
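
One way to check whether a host is affected by this kernel change is to look for br_netfilter among the loaded modules and to read the bridge sysctl. The sketch below assumes a Linux host with /proc mounted; the helper names are hypothetical, and it has not been run on the labvirt hosts. On kernels older than 3.18 the hooks live inside the bridge module itself, so the sysctl file is the more telling of the two signals.

```python
from pathlib import Path


def module_loaded(name, proc_modules_text):
    """Return True if `name` is the first field of any /proc/modules line."""
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields and fields[0] == name:
            return True
    return False


def bridge_netfilter_status(proc_root="/proc"):
    """Summarise whether bridged traffic can be seen by iptables.

    Reads <proc_root>/modules and
    <proc_root>/sys/net/bridge/bridge-nf-call-iptables. A value of None
    for the sysctl means the file is missing, i.e. the bridge netfilter
    hooks are not available at all.
    """
    root = Path(proc_root)
    modules_file = root / "modules"
    sysctl_file = root / "sys" / "net" / "bridge" / "bridge-nf-call-iptables"

    modules_text = modules_file.read_text() if modules_file.exists() else ""
    sysctl_value = sysctl_file.read_text().strip() if sysctl_file.exists() else None

    return {
        "br_netfilter_loaded": module_loaded("br_netfilter", modules_text),
        "bridge_nf_call_iptables": sysctl_value,
    }
```

The `proc_root` parameter exists only so the function can be pointed at a copy of the files for testing; on a real host the default is what you want.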

Today we tried to replace the normal source group rules in the 'default' security group for tools. Weirdly, the iptables rules were not applied on the labvirts as we expected.

A reboot of nova-compute did cause the rules to be refreshed, but that's not a good long-term solution.

I've called out for help on the openstack mailing list, here: http://lists.openstack.org/pipermail/openstack/2016-August/017258.html

Andrew renamed this task from Default source group allowances do not work post Liberty upgrade to Default source group (security group) allowances do not work post Liberty upgrade. Aug 10 2016, 4:39 PM

The specific failure causing this problem appears to be

https://phabricator.wikimedia.org/P3805

I don't see any such timeout when /removing/ rules, only when adding them. The timeout does not happen when adding a single (non source-group) rule, and also doesn't happen in a project with fewer VMs.

A source group rule sets up n^2 different rules. I suspect that for a sufficiently large n, something on the server side (e.g. conductor) is overflowing a buffer or something and throwing away the request.
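
To get a feel for that growth, here is a back-of-the-envelope sketch. It assumes each of the n instances in the group carries one entry per group member for every source-group rule, which is a simplification of how nova-network actually builds its iptables chains; the function name and the instance counts are made up for illustration.

```python
def source_group_rule_count(n_instances, n_source_group_rules=1):
    """Estimate total iptables entries generated by source-group rules.

    Simplified model: every instance carries one entry per group member
    for each source-group rule, so the total grows as n^2.
    """
    return n_instances * n_instances * n_source_group_rules


for n in (10, 100, 500, 1000):
    print(n, source_group_rule_count(n))
```

Under this model a project with 10 instances produces 100 entries, while one with 1000 instances produces 1,000,000, which fits the observation that smaller projects do not hit the timeout.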

Increasing rpc_response_timeout in nova.conf (section DEFAULT) from 60 to 300 resolves the problem. That's a pretty stupid fix, but may be fine...
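
A quick way to confirm which value a given nova.conf carries is to parse it. This is only a sketch: it uses Python's standard configparser rather than oslo.config (which is what nova itself uses), the helper name is hypothetical, and the fallback of 60 is the default quoted in this task.

```python
import configparser


def rpc_response_timeout(conf_text, fallback=60):
    """Return rpc_response_timeout from the [DEFAULT] section of nova.conf text."""
    # interpolation=None leaves '%' characters elsewhere in the file alone;
    # strict=False tolerates options that appear more than once.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(conf_text)
    return parser.getint("DEFAULT", "rpc_response_timeout", fallback=fallback)


sample = """
[DEFAULT]
rpc_response_timeout = 300
"""
print(rpc_response_timeout(sample))
```

On a real host the text would come from the config file itself, typically /etc/nova/nova.conf, and nova services generally need a restart before a changed value takes effect.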

A timeout of 120 seems to work OK, so I'll get that change in place shortly. Meanwhile, some nova devs (mriedem and dansmith) seem to care about the issue now.

There's a candidate patch for this here which seems correct: https://review.openstack.org/#/c/288548/3

Change 304047 had a related patch set uploaded (by Andrew Bogott):
nova: Increase rpc_response_timeout to 180

https://gerrit.wikimedia.org/r/304047

Change 304047 merged by Andrew Bogott:
nova: Increase rpc_response_timeout to 180

https://gerrit.wikimedia.org/r/304047

Andrew renamed this task from Default source group (security group) allowances do not work post Liberty upgrade to Default source group (security group) allowances do not update properly. Aug 10 2016, 9:46 PM
Andrew lowered the priority of this task from High to Low. Sep 30 2016, 12:47 PM

This is sort of resolved by the timeout fix, but I'm still hoping that upstream will merge the proper fix into Liberty.

This is fixed in Mitaka (M), and it looks like a fix for Liberty (L) isn't going to happen.