Default source group (security group) allowances do not update properly
Closed, Resolved · Public

Assigned To

Authored By

	yuvipanda
	Aug 5 2016, 3:06 AM

Description

Steps to reproduce

Create a new instance in tools to be pooled as a k8s worker node (https://wikitech.wikimedia.org/wiki/Tools_Kubernetes#Worker_nodes)
Ssh in
Start preparing to switch puppetmasters: 'sudo rm -rf /var/lib/puppet/ssl'
Try to run puppet
Watch as it times out
Try to curl any port on the puppetmaster: 'curl tools-puppetmaster-01:3422', watch as it times out (rather than produce connection failure immediately)
Wait for a while (10-20mins)
Everything is OK now.
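
The difference noted in the curl step (a hang versus an immediate failure) can be checked with a small probe. This is a minimal sketch, not part of the original report: the function name `probe` and the 3-second default timeout are assumptions for illustration. Silently dropped packets show up as a timeout, while a reachable host with nothing listening on the port refuses the connection right away.

```python
import socket


def probe(host, port, timeout=3.0):
    """Classify a TCP connection attempt as 'open', 'refused', 'timeout' or 'error'."""
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.timeout:
        # No answer at all: packets are probably being dropped on the way.
        return "timeout"
    except ConnectionRefusedError:
        # The host answered with a reset: reachable, but nothing is listening.
        return "refused"
    except OSError:
        # Anything else, e.g. name resolution failure or no route to host.
        return "error"
```

For example, probe("tools-puppetmaster-01", 8140) would be expected to return "timeout" while the problem is present, and "open" once it has cleared.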

Details

	Subject	Repo	Branch	Lines +/-
	nova: Increase rpc_response_timeout to 180	operations/puppet	production	+12 -0


Related Objects

Mentioned In: T170492: figure out if nodepool is overwhelming rabbitmq and/or nova
T156604: Enable Special:NovaSecurityGroup again in MediaWiki:Common.js
T141803: fix puppet issues when applying role::gerrit::server in labs
rOPUPe5af1edb4143: nova: Increase rpc_response_timeout to 180
rOPUP42bd82a1ce5b: nova: Increase rpc_response_timeout to 180
T142277: Create new Phlogiston instance for production

Event Timeline

yuvipanda created this task. Aug 5 2016, 3:06 AM

Restricted Application added a subscriber: Aklapper. Aug 5 2016, 3:06 AM

This behavior (wait forever, then time out) is consistent with what happens when there are security group issues, except that there are no security group restrictions between these nodes, and they are all in the same project. The problems also resolve with time, with no other action taken.

root@tools-puppetmaster-01:/home/yuvipanda# sudo tcpdump src 10.68.21.13
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
03:54:59.517707 ARP, Request who-has tools-puppetmaster-01.tools.eqiad.wmflabs tell tools-worker-1008.tools.eqiad.wmflabs, length 42

yuvipanda@tools-worker-1008:~$ sudo tcpdump dst 10.68.22.61
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
03:57:26.012958 IP tools-worker-1008.tools.eqiad.wmflabs.54108 > tools-puppetmaster-01.tools.eqiad.wmflabs.8140: Flags [S], seq 970365477, win 29200, options [mss 1460,sackOK,TS val 1093845 ecr 0,nop,wscale 9], length 0
03:57:27.010859 IP tools-worker-1008.tools.eqiad.wmflabs.54108 > tools-puppetmaster-01.tools.eqiad.wmflabs.8140: Flags [S], seq 970365477, win 29200, options [mss 1460,sackOK,TS val 1094095 ecr 0,nop,wscale 9], length 0
03:57:29.014845 IP tools-worker-1008.tools.eqiad.wmflabs.54108 > tools-puppetmaster-01.tools.eqiad.wmflabs.8140: Flags [S], seq 970365477, win 29200, options [mss 1460,sackOK,TS val 1094596 ecr 0,nop,wscale 9], length 0
03:57:33.022835 IP tools-worker-1008.tools.eqiad.wmflabs.54108 > tools-puppetmaster-01.tools.eqiad.wmflabs.8140: Flags [S], seq 970365477, win 29200, options [mss 1460,sackOK,TS val 1095598 ecr 0,nop,wscale 9], length 0

The captures above are what I see when I run:

yuvipanda@tools-worker-1008:~$ curl tools-puppetmaster-01:8140

So packets leave tools-worker-1008 but never reach tools-puppetmaster-01.

This hasn't resolved after an hour of the initial node creation now.

I restarted nova-network, no change in behavior.

Mentioned in SAL [2016-08-05T04:07:17Z] <yuvipanda> restarted nova-network on labnet1002 for T142165

tom29739 subscribed. Aug 5 2016, 8:15 AM

I encountered similar behavior yesterday in deployment-prep. I created deployment-kafka05. It needed to talk to deployment-zookeeper01. For at least the first 15 minutes of its life, it could talk to several other nodes in deployment-prep, but not to deployment-zookeeper01. Nor could deployment-zookeeper01 initiate a connection to deployment-kafka05. It was as if there was a network partition between deployment-kafka05 and deployment-zookeeper01, but each node could still talk to other nodes in the project.

I only tested a couple of other nodes in the project, so it is possible it affected more than just this one connection.

I can continue to hit another node that has different security groups applied but is on the same labvirt host (tools-worker-1008 and tools-docker-builder-01), but not nodes on different labvirt hosts (tools-exec-1201).

• chasemp renamed this task from Super weird networking behavior to Default source group allowances do not work post Liberty upgrade. Aug 5 2016, 7:36 PM

Mbch331 subscribed. Aug 5 2016, 8:26 PM

Paladox subscribed. Aug 5 2016, 8:27 PM

I'm helping on this as well but a lot of it falls into the realm of openstack internals @Andrew is best suited to sort out.

I've disabled:

All access to Special:NovaSecurityGroup
Add instance button on Special:NovaInstance

on wikitech with a Mediawiki:Common.js hack. So people can still technically work around it, but it seems good enough in this use case.

To revert, go to https://wikitech.wikimedia.org/wiki/MediaWiki:Common.js and remove code starting at line 114 (and marked with title of this bug).

I've run quite a lot of tests in the project 'testlabs,' generally using the host labvirt1014 which is a spare host and doesn't interfere with labs users. The tests generally consist of running "python -m SimpleHTTPServer" on an instance and then seeing if other hosts can telnet to port 8000.

I've reimaged 1014, both with a fresh liberty install and with a fresh kilo install. In both cases (and before the reimaging), VMs on labvirt1014 simply ignore all security rules and permit all traffic. No matter what I do (including many permutations of security rules), access to VMs on 1014 is fully unrestricted.

The same test on labvirt1002 works correctly. Access is restricted if the security rules say it should be, and access is open if the security rules open port 8000. This includes meddling with the 'source group' setting -- when the 'source group' says to only allow access to instances in the testlabs project, access is appropriately granted.

From this I conclude that at least /this/ issue is somehow not an OpenStack issue at all, but something to do with the network setup of the labvirt1014 system itself. First guess: nova-network traffic is on a different NIC from the one managed by iptables/security rules. I don't, of course, know if this is actually possible :/

I just ran this test on every single labvirt host. labvirt1001-1011 behave correctly. Only labvirt1012 through 1014 exhibit the bad 'no firewalling' behavior.

• JAufrecht mentioned this in T142277: Create new Phlogiston instance for production. Aug 8 2016, 7:22 PM

The 'no firewalling' issue is now resolved, thanks to kernel downgrades. The actual issue in question is probably this:

$ git describe --contains 34666d467cbf1e2e3c7bb15a63eccfb582cdd71f
v3.18-rc1~115^2~111^2~2
netfilter: bridge: move br_netfilter out of the core
Note that this is breaking compatibility for users that expect that
bridge netfilter is going to be available after explicitly 'modprobe
bridge' or via automatic load through brctl.

However, the damage can be easily undone by modprobing br_netfilter.
The bridge core also spots a message to provide a clue to people that
didn't notice that this has been deprecated.
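
One way to check whether a host might be affected is to look for br_netfilter in the loaded-module list. The following is a rough sketch only: the helper names `module_loaded` and `br_netfilter_loaded` are made up for illustration, and it assumes the usual /proc/modules layout where the module name is the first whitespace-separated field on each line. A module compiled directly into the kernel would not be listed there, so a False result is a hint rather than proof.

```python
def module_loaded(proc_modules_text, name="br_netfilter"):
    """Return True if `name` is the first field of any line in /proc/modules text."""
    for line in proc_modules_text.splitlines():
        fields = line.split()
        # Only the first field is the module name; later fields list dependents.
        if fields and fields[0] == name:
            return True
    return False


def br_netfilter_loaded(path="/proc/modules"):
    """Read the live module list from `path` (Linux only) and check for br_netfilter."""
    with open(path) as f:
        return module_loaded(f.read())
```

On a host where br_netfilter_loaded() returns False, 'modprobe br_netfilter' is the remedy the commit message above suggests.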

Today we tried to replace the normal source group rules in the 'default' security group for tools. Weirdly, the iptables rules were not applied on the labvirts as we expected.

A reboot of nova-compute did cause the rules to be refreshed, but that's not a good long-term solution.

I've called out for help on the openstack mailing list, here: http://lists.openstack.org/pipermail/openstack/2016-August/017258.html

Andrew renamed this task from Default source group allowances do not work post Liberty upgrade to Default source group (security group) allowances do not work post Liberty upgrade. Aug 10 2016, 4:39 PM

The specific failure causing this problem appears to be

https://phabricator.wikimedia.org/P3805

I don't see any such timeout when /removing/ rules, only when adding them. The timeout does not happen when adding a single (non source-group) rule, and also doesn't happen in a project with fewer VMs.

A source group rule sets up n^2 different rules. I suspect that for a sufficiently large n, something on the server side (e.g. conductor) is overflowing a buffer or something and throwing away the request.
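
To give a sense of scale for the n^2 growth, here is some back-of-the-envelope arithmetic. It assumes, as a simplification, that each of the n instances in the group gets one rule per group member; the instance counts are illustrative and not measured from any project.

```python
def source_group_rule_count(n):
    """Rules generated if each of n instances allows each of the n group members."""
    return n * n


# Illustrative project sizes only; not taken from real projects.
for n in (10, 100, 500):
    print(n, source_group_rule_count(n))
```

Under that assumption, a tenfold increase in project size means a hundredfold increase in rules, which would fit the timeout only appearing in larger projects.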

Increasing rpc_response_timeout in nova.conf (section DEFAULT) from 60 to 300 resolves the problem. That's a pretty stupid fix, but may be fine...
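
For reference, a quick way to confirm what a given nova.conf actually sets. This is only a sketch using Python's standard configparser: the helper name is hypothetical, the fallback of 60 is taken from the "from 60" above, and real nova.conf files may contain constructs that configparser does not accept.

```python
import configparser


def rpc_response_timeout(conf_text, fallback=60):
    """Return rpc_response_timeout from the [DEFAULT] section of nova.conf-style text."""
    # Interpolation is disabled so literal '%' characters in values are not an error.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(conf_text)
    return parser.getint("DEFAULT", "rpc_response_timeout", fallback=fallback)


# Hypothetical file contents for illustration.
sample = """
[DEFAULT]
rpc_response_timeout = 180
"""
print(rpc_response_timeout(sample))
```

When the option is absent, the function returns the fallback rather than raising.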

Upstream bug: https://bugs.launchpad.net/nova/+bug/1611871

A timeout of 120 seems to work OK, so I'll get that change in place shortly. Meanwhile, some nova devs (mriedem and dansmith) seem to care about the issue now.

Smalyshev subscribed. Aug 10 2016, 5:45 PM

There's a candidate patch for this here which seems correct: https://review.openstack.org/#/c/288548/3

Change 304047 had a related patch set uploaded (by Andrew Bogott):
nova: Increase rpc_response_timeout to 180

https://gerrit.wikimedia.org/r/304047

Andrew mentioned this in rOPUP42bd82a1ce5b: nova: Increase rpc_response_timeout to 180. Aug 10 2016, 6:37 PM

Change 304047 merged by Andrew Bogott:
nova: Increase rpc_response_timeout to 180

https://gerrit.wikimedia.org/r/304047

Andrew mentioned this in rOPUPe5af1edb4143: nova: Increase rpc_response_timeout to 180. Aug 10 2016, 7:10 PM

Andrew renamed this task from Default source group (security group) allowances do not work post Liberty upgrade to Default source group (security group) allowances do not update properly. Aug 10 2016, 9:46 PM

Paladox added a subtask: T141803: fix puppet issues when applying role::gerrit::server in labs. Aug 31 2016, 11:09 PM

Dzahn mentioned this in T141803: fix puppet issues when applying role::gerrit::server in labs. Aug 31 2016, 11:18 PM

Dzahn changed the status of subtask T141803: fix puppet issues when applying role::gerrit::server in labs from Open to Stalled.

Paladox removed a subtask: T141803: fix puppet issues when applying role::gerrit::server in labs. Sep 1 2016, 11:56 AM

This is sort of resolved by the timeout fix, but I'm still hoping that upstream will merge the proper fix into Liberty.

This is fixed in Mitaka, and it looks like a fix for Liberty isn't going to happen.

scfc mentioned this in T156604: Enable Special:NovaSecurityGroup again in MediaWiki:Common.js. Jan 30 2017, 3:02 AM

• Phabricator_maintenance removed a subscriber: yuvipanda. Jun 7 2017, 6:41 PM

hashar mentioned this in T170492: figure out if nodepool is overwhelming rabbitmq and/or nova. Aug 31 2017, 7:25 PM
