Ferm changes on the host node break networking for Kubernetes pods
Closed, Resolved · Public

Description

Seen following merge of:

Symptoms:

  • Networking failures in Toolforge Kubernetes cluster due to SRC NAT failures
  • Networking failures in PAWS Kubernetes cluster due to IP forwarding failures

Mitigation:

  • Toolforge: clush -w @k8s-worker 'sudo service docker restart; sudo service flannel restart; sudo service kubelet restart; sudo service kube-proxy restart'
  • PAWS: clush -w @paws-worker 'sudo iptables -P FORWARD ACCEPT'

Original report:

While trying to access https://tools.wmflabs.org/guc/?user=193.180.154.229 , I got the following nonstandard messages on the page:

Warning: dns_get_record(): A temporary server error occurred. in /data/project/guc/labs-tools-guc/src/IPInfo.php on line 87

Warning: PDO::__construct(): php_network_getaddresses: getaddrinfo failed: Name or service not known in /data/project/guc/labs-tools-guc/src/App.php on line 32

Error: Database error: Unable to connect to s1.web.db.svc.eqiad.wmflabs

TODO (Lessons learned in debugging):

  • build an image with reasonable diag tools (dig, ping, traceroute, mtr, ...)
  • Run a serviceset that places a diagnostic pod on all worker nodes
  • Have an easy command to list all pods on a node (kubectl get pods --all-namespaces -o wide|grep tools-worker-1002)
  • runbook page for flannel debugging
  • Have an easy command to start a new pod on a given node (https://kubernetes.io/docs/concepts/configuration/assign-pod-node/)
  • Monitoring and alert on pod dns failures - right now we know because irc bots go away when this happens
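The "list all pods on a node" item could be covered by a small filter over the kubectl output. This is a minimal sketch, not an existing tool: the function name and the sample text are hypothetical, and it assumes the `-o wide` listing has NAMESPACE, NAME and NODE header columns and that no column value contains whitespace.

```python
def pods_on_node(kubectl_output, node):
    """Return (namespace, pod name) pairs scheduled on `node`.

    `kubectl_output` is the text printed by
    `kubectl get pods --all-namespaces -o wide`.
    """
    lines = [line for line in kubectl_output.splitlines() if line.strip()]
    if not lines:
        return []
    header = lines[0].split()
    # Locate columns by header name instead of hard-coding positions.
    ns_col = header.index("NAMESPACE")
    name_col = header.index("NAME")
    node_col = header.index("NODE")
    pods = []
    for line in lines[1:]:
        fields = line.split()
        if len(fields) > node_col and fields[node_col] == node:
            pods.append((fields[ns_col], fields[name_col]))
    return pods


# Hypothetical sample output, for illustration only.
SAMPLE = """\
NAMESPACE   NAME                     READY   STATUS    RESTARTS   AGE   IP              NODE
admin       admin-3517833506-69bqz   1/1     Running   0          26d   192.168.178.7   tools-worker-1016
guc         guc-1234567890-abcde     1/1     Running   0          3d    192.168.146.2   tools-worker-1002
"""

if __name__ == "__main__":
    for namespace, name in pods_on_node(SAMPLE, "tools-worker-1002"):
        print(namespace, name)
```

Matching on the NODE column rather than grepping the whole line avoids false hits when a pod name happens to contain the node name.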

Event Timeline

Jeff_G created this task. Dec 12 2017, 10:16 PM
Restricted Application added a project: Operations. Dec 12 2017, 10:16 PM
Restricted Application added a subscriber: Aklapper.
Magnus triaged this task as Unbreak Now! priority. Dec 12 2017, 10:46 PM
Magnus added a subscriber: Magnus.

Several of my tools appear to be affected as well, example:

ERROR:php_network_getaddresses: getaddrinfo failed: Name or service not known [2002]
SERVER:tools.labsdb
SCHEMA:s51434__mixnmatch_p
Restricted Application added subscribers: Liuxinyu970226, Jay8g, TerraCodes. Dec 12 2017, 10:46 PM
bd808 renamed this task from Database error: Unable to connect to s1.web.db.svc.eqiad.wmflabs to DNS resolution failing from webservices running on Kubernetes. Dec 12 2017, 10:54 PM
Envlh added a subscriber: Envlh. Dec 12 2017, 11:18 PM
Wargo added a subscriber: Wargo. Dec 12 2017, 11:19 PM

We are investigating this issue. Currently we have a number of Kubernetes worker hosts where pods are not able to network with other hosts (udp and tcp failure).

chasemp added a subscriber: chasemp. Edited · Dec 13 2017, 2:23 PM

(This is an outline of the team response, thank you to @bd808, @madhuvishy, and @Andrew)

(Special thanks to @yuvipanda for being you and @aborrero for viewing notifications far too late in the evening and noting there is backup)

Timeline:

  • https://gerrit.wikimedia.org/r/#/c/397879/ is applied
  • tools.wmflabs.org seems to be going down intermittently but we don't catch it in the act
  • several Tools report connectivity issues
  • investigation turns up that some nodes are in a bad state completely
  • diagnose src nat is faulty
  • diagnose that a reboot or ordered service restart fixes
  • reboot fleet gracefully

Symptoms: apparent lack of POD network connectivity. I suspect it usually surfaces as DNS resolution issues, as these are the precursor to most operations. Inspecting a POD at runtime by attaching a sidecar shell via kubectl exec -it <container> -- /bin/bash demonstrated a total lack of external connectivity on nodes that were affected, in this case tools-worker-1016. You can find PODs assigned to a node with kubectl get pods --all-namespaces -o wide|grep tools-worker-1016

Error State indicators:

  • I suspected with the newly applied ferm rules we were maybe hitting conntrack limits but that seems not true:

broken at the time of this

root@tools-worker-1019:~# cat /proc/sys/net/ipv4/netfilter/ip_conntrack_count
72

root@tools-worker-1016:~# sysctl net.ipv4.netfilter.ip_conntrack_max
net.ipv4.netfilter.ip_conntrack_max = 262144
root@tools-worker-1016:~# cat /proc/sys/net/ipv4/netfilter/ip_conntrack_count
217

Note: kube-proxy seems to manage conntrack_max dynamically. This can get really weird and tricky I suspect.

  • Traffic was flowing from PODs and could be seen on the docker0 interface
root@tools-worker-1016:/etc/ferm/conf.d# tcpdump -i docker0 host 192.168.178.7
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on docker0, link-type EN10MB (Ethernet), capture size 262144 bytes
00:27:46.466688 IP 192.168.146.0.60628 > 192.168.178.7.8000: Flags [S], seq 18597794, win 29200, options [mss 1460,sackOK,TS val 3766230557 ecr 0,nop,wscale 9], length 0
00:27:46.466748 IP 192.168.178.7.8000 > 192.168.146.0.60628: Flags [S.], seq 1073509090, ack 18597795, win 27960, options [mss 1410,sackOK,TS val 3589700014 ecr 3766230557,nop,wscale 9], length 0
00:27:46.467190 IP 192.168.146.0.60628 > 192.168.178.7.8000: Flags [.], ack 1, win 58, options [nop,nop,TS val 3766230557 ecr 3589700014], length 0
00:27:46.467243 IP 192.168.146.0.60628 > 192.168.178.7.8000: Flags [P.], seq 1:487, ack 1, win 58, options [nop,nop,TS val 3766230557 ecr 3589700014], length 486
00:27:46.467267 IP 192.168.178.7.8000 > 192.168.146.0.60628: Flags [.], ack 487, win 57, options [nop,nop,TS val 3589700015 ecr 3766230557], length 0
00:27:46.531519 IP 192.168.178.7.38081 > labs-recursor1.wikimedia.org.domain: 7872+ A? tools-redis. (29)
00:27:46.531571 IP 192.168.178.7.38081 > labs-recursor1.wikimedia.org.domain: 22573+ AAAA? tools-redis. (29)
00:27:46.969441 IP 192.168.178.7.54769 > labs-recursor0.wikimedia.org.domain: 9029+ A? tools-redis. (29)
  • Traffic for the private flannel overlay network was 'leaking' out of eth0 w/o src nat being applied

[broken]

root@tools-worker-1016:/etc/ferm/conf.d# tcpdump -i eth0 host labs-recursor0.wikimedia.org
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
00:38:06.625170 IP 192.168.178.7.50555 > labs-recursor0.wikimedia.org.domain: 30421+ A? tools-redis. (29)
00:38:06.625235 IP 192.168.178.7.50555 > labs-recursor0.wikimedia.org.domain: 34923+ AAAA? tools-redis. (29)
00:38:06.626804 IP 192.168.178.7.50633 > labs-recursor0.wikimedia.org.domain: 20692+ A? tools-redis. (29)
00:38:06.626832 IP 192.168.178.7.50633 > labs-recursor0.wikimedia.org.domain: 50164+ AAAA? tools-redis. (29)
00:38:06.627249 IP tools-worker-1016.tools.eqiad.wmflabs.54219 > labs-recursor0.wikimedia.org.domain: 27226+ PTR? 7.178.168.192.in-addr.arpa. (44)

These addresses should never be seen leaving eth0, to the best of my understanding.

  • Nat table entries did exist for the POD at that time
iptables -t nat -L | grep 192.168.178.7
KUBE-MARK-MASQ  all  --  192.168.178.7        anywhere             /* admin/admin:http */
DNAT       tcp  --  anywhere             anywhere             /* admin/admin:http */ tcp to:192.168.178.7:8000

But this is not sufficient

  • Kube-proxy was throwing errors about communication with the master that a restart of kube-proxy seemed to resolve (possibly in conjunction with a flannel restart only)
Dec 12 13:36:59 tools-worker-1016 kube-proxy[31371]: E1212 13:36:59.275535   31371 reflector.go:203] pkg/proxy/config/api.go:30: Failed to list *api.Service: Get https://k8s-master.tools.wmflabs.org:6443/api/v1/services?resourceVersion=0: dial tcp 10.68.17.142:6443: getsockopt: connection refused
  • Side-car POD shell has very limited debugging tools so I was using the python interpreter for most of my troubleshooting.
python
...
>>>import socket
>>> socket.gethostbyname('yahoo.com')
'98.139.180.180'

Curl was also available:

curl -v www.wikimedia.org

  • Kube-proxy has some red herring errors from our use of hostname enforcement that are confusing.
Dec 13 00:41:31 tools-worker-1016 kube-proxy[29485]: W1213 00:41:31.436889   29485 server.go:436] Failed to retrieve node info: nodes "tools-worker-1016" not found
Dec 13 00:41:31 tools-worker-1016 kube-proxy[29485]: W1213 00:41:31.441327   29485 proxier.go:226] invalid nodeIP, initialize kube-proxy with 127.0.0.1 as nodeIP

I believe this is not actually errant state in our case with 1.4 atm

Full seemingly healthy kube-proxy restart

root@tools-worker-1016:/etc/ferm/conf.d# service kube-proxy restart
root@tools-worker-1016:/etc/ferm/conf.d# journalctl -u kube-proxy -f
-- Logs begin at Mon 2017-12-11 16:35:41 UTC. --
Dec 13 00:41:31 tools-worker-1016 systemd[1]: Started Kubernetes Kube-Proxy Server.
Dec 13 00:41:31 tools-worker-1016 kube-proxy[29485]: W1213 00:41:31.337215   29485 server.go:378] Flag proxy-mode="'iptables'" unknown, assuming iptables proxy
Dec 13 00:41:31 tools-worker-1016 kube-proxy[29485]: I1213 00:41:31.339398   29485 server.go:203] Using iptables Proxier.
Dec 13 00:41:31 tools-worker-1016 kube-proxy[29485]: W1213 00:41:31.436889   29485 server.go:436] Failed to retrieve node info: nodes "tools-worker-1016" not found
Dec 13 00:41:31 tools-worker-1016 kube-proxy[29485]: W1213 00:41:31.441327   29485 proxier.go:226] invalid nodeIP, initialize kube-proxy with 127.0.0.1 as nodeIP
Dec 13 00:41:31 tools-worker-1016 kube-proxy[29485]: I1213 00:41:31.441605   29485 server.go:215] Tearing down userspace rules.
Dec 13 00:41:31 tools-worker-1016 kube-proxy[29485]: I1213 00:41:31.488831   29485 conntrack.go:81] Set sysctl 'net/netfilter/nf_conntrack_max' to 131072
Dec 13 00:41:31 tools-worker-1016 kube-proxy[29485]: I1213 00:41:31.491342   29485 conntrack.go:66] Setting conntrack hashsize to 32768
Dec 13 00:41:31 tools-worker-1016 kube-proxy[29485]: I1213 00:41:31.492033   29485 conntrack.go:81] Set sysctl 'net/netfilter/nf_conntrack_tcp_timeout_established' to 86400
Dec 13 00:41:31 tools-worker-1016 kube-proxy[29485]: I1213 00:41:31.492084   29485 conntrack.go:81] Set sysctl 'net/netfilter/nf_conntrack_tcp_timeout_close_wait' to 3600

https://github.com/att-comdev/halcyon-kubernetes/issues/35
https://github.com/att-comdev/halcyon-kubernetes/pull/45

  • For kicks while broken on 1016 I changed the default INPUT policy to ACCEPT but it had no effect
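The eth0 symptom in the list above (overlay addresses leaving the host without source NAT) can be checked mechanically from a saved capture. A minimal sketch, assuming the flannel overlay is 192.168.0.0/16 (inferred from the pod addresses in the captures, not confirmed anywhere in this task) and that the input is plain `tcpdump -i eth0` text output; the function name is hypothetical.

```python
import ipaddress
import re

# Assumption: overlay range guessed from the pod IPs seen in the captures.
OVERLAY = ipaddress.ip_network("192.168.0.0/16")

# tcpdump prints "<time> IP <src>.<port> > <dst>.<port>: ..."
LINE_RE = re.compile(r"^\S+ IP (\S+)\.(\w+) > ")


def leaked_sources(tcpdump_lines, overlay=OVERLAY):
    """Return sorted overlay source addresses seen in an eth0 capture."""
    leaked = set()
    for line in tcpdump_lines:
        match = LINE_RE.match(line)
        if not match:
            continue
        try:
            src = ipaddress.ip_address(match.group(1))
        except ValueError:
            continue  # tcpdump resolved the source to a hostname
        if src in overlay:
            leaked.add(str(src))
    return sorted(leaked)


# Two lines taken from the broken eth0 capture on tools-worker-1016.
CAPTURE = [
    "00:38:06.625170 IP 192.168.178.7.50555 > labs-recursor0.wikimedia.org.domain: 30421+ A? tools-redis. (29)",
    "00:38:06.627249 IP tools-worker-1016.tools.eqiad.wmflabs.54219 > labs-recursor0.wikimedia.org.domain: 27226+ PTR? 7.178.168.192.in-addr.arpa. (44)",
]

if __name__ == "__main__":
    print(leaked_sources(CAPTURE))
```

Any non-empty result on eth0 means packets are leaving with a pod address as source, which is the broken state described here. Running tcpdump with -n would avoid hostnames in the source field entirely.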

Actions:

  • Debugging live via sidecar
  • Reboot to see if a node came up healthy and it did
  • Diagnosed restart of several related services also resolves the issue in-place but only for new PODs on the node and not existing broken ones
  • Orchestrated drain, cordon, and reboot of all nodes to reset state and confirm health

Current thoughts:

  • service docker restart; service flannel restart; service kubelet restart; service kube-proxy restart seems to address the issue without restarting the node.
  • Flannel is responsible for src nat with the directive --ip-masq=true on workers but seemed to have stopped managing it successfully. I suspect anytime a POD was rescheduled it was coming up without appropriate nat at the very least. This seems to have started with the application of ferm managed rules on the nodes
  • We can hope this was transient with the application of ferm and bad complex interactions with flannel. I worry about ongoing ferm management but I suspect this is a bug rather than us doing a wrong thing at the moment. Any management of ferm in k8s land is complicated potentially and requires attention from here on out.
  • We wait to see if it reoccurs without active changes

Perused links during:

https://serverfault.com/questions/189729/what-the-meaning-of-policy-accept-and-policy-drop-in-iptables
https://fedoraproject.org/wiki/How_to_edit_iptables_rules
https://docs.docker.com/engine/userguide/networking/
https://superuser.com/questions/1130898/no-internet-connection-inside-docker-containers
https://github.com/att-comdev/halcyon-kubernetes/issues/35
https://github.com/kubernetes/kubernetes/issues/37414
https://gitlab.cncf.ci/kubernetes/kubernetes/commit/439ab5a4871325aa438cf2b5e3e0cb87ad75041e
https://github.com/kubernetes/kubernetes/issues/36835
https://github.com/kubernetes/kubernetes/issues/40761
https://github.com/kubernetes/kubernetes/issues/20391
https://github.com/kubernetes/kubernetes/issues/22335

chasemp lowered the priority of this task from Unbreak Now! to Normal. Dec 13 2017, 2:30 PM
chasemp updated the task description.

Note ferm had been broken in Toolforge for a good long while looking for IPv6 addresses that cannot be fulfilled in that context but can in prod. https://phabricator.wikimedia.org/T179955#3831513 is relevant as well

Something looking just like this outage hit us again around 2018-01-05T23:05Z. Stashbot was noticed to be unresponsive in #wikimedia-cloud. Logs showed it was having problems communicating with the tools elasticsearch cluster. The pod was killed and on restart the bot hung at the step where it should connect to freenode. The sal tool (also elasticsearch driven) and the versions tool (only talks to noc) were also showing DNS/networking failures when checked.

I hopped on tools-worker-1017 where stashbot's pod was running and started working through the same steps @chasemp had used in T182722#3834172. I poked a bit and then decided to try Chase's service docker restart; service flannel restart; service kubelet restart; service kube-proxy restart soft restart. Once I saw kube-proxy logging normal activity I killed the running pod for stashbot. The new pod was also scheduled to tools-worker-1017 (luckily) and immediately started working.

At 23:49, @madhuvishy ran clush -w @k8s-worker -x tools-worker-1001.tools.eqiad.wmflabs 'sudo service docker restart; sudo service flannel restart; sudo service kubelet restart; sudo service kube-proxy restart'

Was rOPUP7785d31e2a95: network::constants: add fake CACHE_MISC for labs related? Did it trigger ferm to reapply on the next puppet run? That seemed to be our guess at the cause for the 2017-12-12 outage.

bd808 raised the priority of this task from Normal to High. Jan 6 2018, 12:07 AM

Was rOPUP7785d31e2a95: network::constants: add fake CACHE_MISC for labs related? Did it trigger ferm to reapply on the next puppet run? That seemed to be our guess at the cause for the 2017-12-12 outage.

Info: Applying configuration version '1515192085'
Notice: /Stage[main]/Base::Firewall/Ferm::Conf[defs]/File[/etc/ferm/conf.d/00_defs]/content:
--- /etc/ferm/conf.d/00_defs    2017-12-12 19:29:10.123512052 +0000
+++ /tmp/puppet-file20180105-13818-1qjsnug      2018-01-05 22:41:38.351441808 +0000
@@ -21,6 +21,7 @@


 @def $BASTION_HOSTS = (10.68.17.232 10.68.18.65 10.68.18.66 10.68.18.68 10.68.21.162 10.68.17.221 10.68.22.61 );
+@def $CACHE_MISC = (10.68.21.68 );
 @def $CUMIN_MASTERS = (10.68.18.66 10.68.18.68 );
 @def $CUMIN_REAL_MASTERS = (208.80.154.158 2620:0:861:2:208:80:154:158 208.80.155.120 2620:0:861:4:208:80:155:120 );
 @def $DEPLOYMENT_HOSTS = (10.68.21.205 10.68.20.135 );

Info: Computing checksum on file /etc/ferm/conf.d/00_defs
Info: /Stage[main]/Base::Firewall/Ferm::Conf[defs]/File[/etc/ferm/conf.d/00_defs]: Filebucketed /etc/ferm/conf.d/00_defs to puppet with sum 7b08d955b1087e5afc646cc33b80061e
Notice: /Stage[main]/Base::Firewall/Ferm::Conf[defs]/File[/etc/ferm/conf.d/00_defs]/content: content changed '{md5}7b08d955b1087e5afc646cc33b80061e' to '{md5}c20dd0d071d6e8fba1886948d3bef4b5'
Info: /Stage[main]/Base::Firewall/Ferm::Conf[defs]/File[/etc/ferm/conf.d/00_defs]: Scheduling refresh of Service[ferm]
Notice: /Stage[main]/Ferm/Service[ferm]: Triggered 'refresh' from 1 events
Notice: Applied catalog in 7.59 seconds

So, yes. Ferm is refreshed because of dependent config changes, and when that happened it looks like k8s/Docker lost its ability to use the overlay network.
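Checking a puppet run for this trigger can be scripted. A minimal sketch, assuming the agent log keeps the "Scheduling refresh of Service[ferm]" wording shown in this task; the function name is hypothetical, and ANSI colour sequences are stripped if the log still contains them.

```python
import re

# Matches real terminal colour sequences such as "\x1b[0;32m".
ANSI_RE = re.compile(r"\x1b\[[0-9;]*m")
REFRESH_RE = re.compile(r"Info: (.+): Scheduling refresh of Service\[ferm\]")


def ferm_refresh_causes(puppet_log):
    """Return the resources that scheduled a refresh of Service[ferm]."""
    causes = []
    for raw in puppet_log.splitlines():
        line = ANSI_RE.sub("", raw)
        match = REFRESH_RE.search(line)
        if match:
            causes.append(match.group(1))
    return causes


# Lines taken from the puppet run in this task, one with colour codes re-added.
LOG = (
    "\x1b[0;32mInfo: /Stage[main]/Base::Firewall/Ferm::Conf[defs]/"
    "File[/etc/ferm/conf.d/00_defs]: Scheduling refresh of Service[ferm]\x1b[0m\n"
    "Notice: /Stage[main]/Ferm/Service[ferm]: Triggered 'refresh' from 1 events\n"
)

if __name__ == "__main__":
    print(ferm_refresh_causes(LOG))
```

An empty list means the run did not touch ferm, so a networking failure right after that run would need another explanation.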

bd808 renamed this task from DNS resolution failing from webservices running on Kubernetes to Ferm changes on the host node break networking for Kubernetes pods. Jan 6 2018, 12:13 AM

PAWS cluster DNS broke too, and all the workers had switched the default policy for Chain FORWARD to DROP again. I fixed by running sudo iptables -P FORWARD ACCEPT across the paws-workers. So these two things seem related.

chasemp added a comment. Edited · Jan 8 2018, 2:41 PM

I was hoping that the ferm @preserve option might make sense for us here, but there are a few problems with the idea:

  • seems only available in 2.4 and jessie has 2.2 available and I don't see anything in backports

tools-worker-1018:~$ dpkg -s ferm | grep Version
Version: 2.2-3

  • preserve does not ignore a table/chain combo here -- it dumps and restores it. This is very racy for our setup as anything scheduled in that window would in theory be orphaned. We want an option that doesn't exist as "ignore table/chain". Ferm does not play nice with others.

If @preserve was available and we could depool a k8s node for every firewall update then it would potentially make sense (assuming the update was atomic). (Depooling from the workers themselves is another thing to investigate.)

Short term solutions I can think of at the moment:

  • Tell puppet in Toolforge to restart other relevant services with ferm updates. This means that restarts of ferm outside of Puppet are potentially non-deterministic.
  • Ferm is using an init script so we could modify it to include whatever the bare minimum portion of sudo service docker restart; sudo service flannel restart; sudo service kubelet restart; sudo service kube-proxy restart is needed to reset state post ferm changes. This is ugly too since it makes ferm updates invasive. The upshot is they are very rare and the invasive timeline very small. It also has the upside (versus trying to tell ferm to dump and restore state that is managed outside of it) of letting flannel and kube-proxy handle their own affairs.

Long term solutions:

  • It seems we are basically using ferm to only manage INPUT on FILTER and we could do that in a few other ways, but it gets messy since all existing thought processes assume ferm will be in control and ferm does not play nice.
  • Put a patch upstream that allows ferm to outright ignore and not manage certain tables/chains. Not sure how complicated, based on what I've seen -- there seems to be global state flushing and rebuilding so it's probably not made to think this way at all.
  • Hack ferm updates to first depool a node (with a mythical @preserve), but it's unclear how to confirm this (though surely it is possible) so that it's procedural rather than wait a bit and best guess. This is possibly the sanest workflow here, being safe for all and allowing ferm consistency across landscapes.

PAWS cluster DNS broke too, and all the workers had switched the default policy for Chain FORWARD to DROP again. I fixed by running sudo iptables -P FORWARD ACCEPT across the paws-workers. So these two things seem related.

Not sure what to make of this yet, I don't see ferm on the paws nodes:

root@tools-paws-master-01:~# dpkg -s ferm | grep Version
dpkg-query: package 'ferm' is not installed and no information is available

and base::firewall appears to be selectively applied within toolforge:

modules/role/manifests/toollabs/elasticsearch.pp:    include ::base::firewall
modules/role/manifests/toollabs/k8s/worker.pp:    include profile::base::firewall
modules/role/manifests/toollabs/k8s/master.pp:    include ::base::firewall
modules/role/manifests/toollabs/etcd/flannel.pp:    include ::base::firewall
modules/role/manifests/toollabs/etcd/k8s.pp:    include ::base::firewall
modules/role/manifests/toollabs/logging/centralserver.pp:    include ::base::firewall
modules/toollabs/manifests/proxy.pp:    include ::base::firewall

Do any of these hit PAWS nodes? I thought maybe toollabs::logging::centralserver but it appears no. The only connection I can grok atm is uncollected resource things that shouldn't make any difference?

{"type":"Ferm::Service","title":"role::toollabs::clush::target"

Based on sudo iptables -P FORWARD ACCEPT fixing things, it seems that the default policy for the FORWARD chain in the filter table is changing out from underneath:

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination
DOCKER-ISOLATION  all  --  anywhere             anywhere
DOCKER     all  --  anywhere             anywhere
ACCEPT     all  --  anywhere             anywhere             ctstate RELATED,ESTABLISHED
ACCEPT     all  --  anywhere             anywhere
ACCEPT     all  --  anywhere             anywhere

Demo:

root@tools-paws-worker-1001:~# iptables -t filter -L  | grep -A 7 FORWARD
Chain FORWARD (policy ACCEPT)
target     prot opt source               destination
DOCKER-ISOLATION  all  --  anywhere             anywhere
DOCKER     all  --  anywhere             anywhere
ACCEPT     all  --  anywhere             anywhere             ctstate RELATED,ESTABLISHED
ACCEPT     all  --  anywhere             anywhere
ACCEPT     all  --  anywhere             anywhere

root@tools-paws-worker-1001:~# iptables -P FORWARD DROP
root@tools-paws-worker-1001:~# iptables -t filter -L  | grep -A 7 FORWARD
Chain FORWARD (policy DROP)
target     prot opt source               destination
DOCKER-ISOLATION  all  --  anywhere             anywhere
DOCKER     all  --  anywhere             anywhere
ACCEPT     all  --  anywhere             anywhere             ctstate RELATED,ESTABLISHED
ACCEPT     all  --  anywhere             anywhere
ACCEPT     all  --  anywhere             anywhere

root@tools-paws-worker-1001:~# iptables -P FORWARD ACCEPT
root@tools-paws-worker-1001:~# iptables -t filter -L  | grep -A 7 FORWARD
Chain FORWARD (policy ACCEPT)
target     prot opt source               destination
DOCKER-ISOLATION  all  --  anywhere             anywhere
DOCKER     all  --  anywhere             anywhere
ACCEPT     all  --  anywhere             anywhere             ctstate RELATED,ESTABLISHED
ACCEPT     all  --  anywhere             anywhere
ACCEPT     all  --  anywhere             anywhere

I'm not sure how this is connected to the general ferm issue, if it is, but the timing of the related issues last week is sure confusing.

Small note from https://wiki.debian.org/DebianFirewall

As you can see, the default policy in a default installation is to ACCEPT all traffic. There are no rules on any chain.

So if changing the default policy to ACCEPT fixes things for filter/forward then something is changing it to DROP. But even that is confusing, as there seems to be an explicit allow-any late in the chain. The only thing I can think of at the moment is that updating this chain somehow resets some state from something.
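To catch the policy flipping between runs, the "Chain ... (policy ...)" header can be parsed out of an iptables listing and compared over time. A minimal sketch with a hypothetical function name; it assumes the listing comes from iptables -t filter -L (with or without -v).

```python
import re

# Built-in chains print e.g. "Chain FORWARD (policy DROP)"; with -v the
# parenthesis also carries packet counters. User chains print
# "(N references)" instead and are skipped.
POLICY_RE = re.compile(r"^Chain (\S+) \(policy (\w+)")


def chain_policies(iptables_listing):
    """Map each built-in chain name to its default policy."""
    policies = {}
    for line in iptables_listing.splitlines():
        match = POLICY_RE.match(line)
        if match:
            policies[match.group(1)] = match.group(2)
    return policies


# Listing shape taken from the tools-paws-worker-1001 demo above.
LISTING = """\
Chain FORWARD (policy DROP)
target     prot opt source               destination
DOCKER-ISOLATION  all  --  anywhere             anywhere
DOCKER     all  --  anywhere             anywhere
"""

if __name__ == "__main__":
    print(chain_policies(LISTING))
```

Logging this once per puppet run on a worker would show which run, if any, coincides with FORWARD going from ACCEPT to DROP.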

Envlh removed a subscriber: Envlh. Jan 8 2018, 5:08 PM
aborrero added a comment. Edited · Jan 8 2018, 5:28 PM

I was hoping that the ferm @preserve option might make sense for us here, but there are a few problems with the idea:

  • seems only available in 2.4 and jessie has 2.2 available and I don't see anything in backports

tools-worker-1018:~$ dpkg -s ferm | grep Version
Version: 2.2-3

  • preserve does not ignore a table/chain combo here -- it dumps and restores it. This is very racy for our setup as anything scheduled in that window would in theory be orphaned. We want an option that doesn't exist as "ignore table/chain". Ferm does not play nice with others.

Different process/services/users managing rulesets is a common issue in iptables. This doesn't have a clear solution.
This has been partially solved with nftables (see below).

If @preserve was available and we could depool a k8s node for every firewall update then it would potentially make sense (assuming the update was atomic). (Depooling from the workers themselves is another thing to investigate.)
Short term solutions I can think of at the moment:

  • Tell puppet in Toolforge to restart other relevant services with ferm updates. This means that restarts of ferm outside of Puppet are potentially non-deterministic.
  • Ferm is using an init script so we could modify it to include whatever the bare minimum portion of sudo service docker restart; sudo service flannel restart; sudo service kubelet restart; sudo service kube-proxy restart is needed to reset state post ferm changes. This is ugly too since it makes ferm updates invasive. The upshot is they are very rare and the invasive timeline very small. It also has the upside (versus trying to tell ferm to dump and restore state that is managed outside of it) of letting flannel and kube-proxy handle their own affairs.

I think both are valid. I would explore the second one first. Could we use ferm hooks? i.e. restart these services pre/post loading the puppet ruleset.
http://ferm.foo-projects.org/download/2.1/ferm.html#hooks
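A hook handler of this kind could look roughly like the following. This is a sketch only and not the handler that was later merged: the function names are hypothetical, the service order is copied from the manual mitigation in this task, and it deliberately keeps going after a failure, as the ";"-separated shell one-liner does.

```python
import subprocess

# Order taken from the manual mitigation used in this task.
SERVICES = ["docker", "flannel", "kubelet", "kube-proxy"]


def restart_plan(services=SERVICES):
    """Build one `service <name> restart` command per service, in order."""
    return [["service", name, "restart"] for name in services]


def run_plan(plan, runner=subprocess.call):
    """Run every command in order and return the ones that exited non-zero.

    `runner` takes an argv list and returns an exit status; it is
    injectable so the plan can be exercised without touching real services.
    """
    failed = []
    for argv in plan:
        if runner(argv) != 0:
            failed.append(argv)
    return failed


# Intended use from a post hook (needs root on a worker):
#     failed = run_plan(restart_plan())
```

Whether all four restarts are needed after every ferm reload is still an open question here; trimming SERVICES to the bare minimum is the obvious knob.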

Long term solutions:

  • It seems we are basically using ferm to only manage INPUT on FILTER and we could do that in a few other ways, but it gets messy since all existing thought processes assume ferm will be in control and ferm does not play nice.
  • Put a patch upstream that allows ferm to outright ignore and not manage certain tables/chains. Not sure how complicated, based on what I've seen -- there seems to be global state flushing and rebuilding so it's probably not made to think this way at all.
  • Hack ferm updates to first depool a node (with a mythical @preserve), but it's unclear how to confirm this (though surely it is possible) so that it's procedural rather than wait a bit and best guess. This is possibly the sanest workflow here, being safe for all and allowing ferm consistency across landscapes.

Another long term solution:

One common problem with iptables is that we have a static (almost immutable) firewall/ruleset architecture. You have a preset of tables/chains (filter INPUT, filter FORWARD) that works well for many but is a pain for lots of people in some environments.
Then we have nftables. By default, nftables is fully configurable: you can have any combination of tables/chains (more differences here: https://wiki.nftables.org/wiki-nftables/index.php/Main_differences_with_iptables). This means you could have a 'filter' table for our puppet rules and then a 'cloud' table with all rules from k8s/docker (or the reverse). This way, they won't overlap during management and no inconsistencies in the filtering policy would be produced from the control plane point of view. We could achieve something like isolation by purpose.

I understand this is a challenge for us, since we would need:

Small note from https://wiki.debian.org/DebianFirewall
[...]

BTW that wiki page from Debian is about to change, people are now encouraged to use nftables rather than iptables starting with stretch: http://ral-arturo.org/2017/05/05/debian-stretch-stable-nftables.html

chasemp added a comment. Edited · Jan 8 2018, 7:27 PM

Taking a look around to see what the status is, I wondered how kube-proxy on the tools-proxy hosts was faring. Seems not well, same errors:

journalctl -f -n 10 -u kube-proxy

Jan 08 15:03:35 tools-proxy-02 kube-proxy[15729]: E0108 15:03:35.835825 15729 reflector.go:203] pkg/proxy/config/api.go:30: Failed to list *api.Service: Get https://k8s-master.tools.wmflabs.org:6443/api/v1/services?resourceVersion=0: dial tcp 10.68.17.142:6443: getsockopt: connection refused

1748  2018-01-08 19:23:53 service ferm status
1749  2018-01-08 19:23:56 sudo service docker restart
1750  2018-01-08 19:24:00 sudo service flannel restart
1751  2018-01-08 19:24:05 sudo service kube-proxy restart

Change 403072 had a related patch set uploaded (by Rush; owner: cpettet):
[operations/puppet@production] WIP: toolforge: ferm hook to restart components post updates

https://gerrit.wikimedia.org/r/403072

Small note on debugging that I created a file under the admin tool to use like this:

tools.admin@tools-bastion-03:~$ kubectl get pods
NAME READY STATUS RESTARTS AGE
admin-3517833506-69bqz 1/1 Running 0 26d
tools.admin@tools-bastion-03:~$ kubectl exec -it admin-3517833506-69bqz -- /bin/bash

tools.admin@admin-3517833506-69bqz:/data/project/admin$ python -i debug/debug.py

>>> rurl(count=3)
Fetch http://google.com 3 times at 1 interval
1 <!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="en"><head><meta content
2 <!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="en"><head><meta content
3 <!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="en"><head><meta content
>>> extip()
Fetch http://icanhazip.com/ 1 times at 1 interval
1 208.80.155.255
>>> resolve_test()
yahoo.com 206.190.39.42
google.com 172.217.15.78
mediawiki.org 208.80.154.224
weather.com 23.196.85.15
npr.com 216.35.221.76
wmflabs.org 10.68.21.68
>>> resolve('deviantart.com')
'54.192.163.145'
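The contents of debug/debug.py are not shown in this task. For anyone wanting something similar in another tool, a resolver check along these lines would do; it is a hypothetical reimplementation, not the actual file, and the host list is only an example.

```python
import socket

# Example hosts only; pick names that matter for the tool being debugged.
HOSTS = ["yahoo.com", "google.com", "mediawiki.org"]


def resolve(host, resolver=socket.gethostbyname):
    """Return the IPv4 address for `host`, or None if resolution fails."""
    try:
        return resolver(host)
    except OSError:  # socket.gaierror is a subclass of OSError
        return None


def resolve_test(hosts=HOSTS, resolver=socket.gethostbyname):
    """Resolve every host and return a {host: address or None} mapping."""
    return {host: resolve(host, resolver) for host in hosts}


def failed_hosts(results):
    """List the hosts that did not resolve, sorted for stable output."""
    return sorted(host for host, addr in results.items() if addr is None)


# Interactive use inside a pod (python -i):
#     >>> failed_hosts(resolve_test())
```

Returning None instead of raising keeps one bad name from hiding the rest, and the failure list is a natural input for the "alert on pod dns failures" item in the task description.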

Change 403072 merged by Rush:
[operations/puppet@production] toolforge: ferm hook to restart components post updates

https://gerrit.wikimedia.org/r/403072

Change 403308 had a related patch set uploaded (by Rush; owner: cpettet):
[operations/puppet@production] tools: ferm pre hook to stop kube-proxy

https://gerrit.wikimedia.org/r/403308

Note yesterday @yuvipanda executed an upgrade to 1.9 for the PAWS deployment. Things seem to be working as of right now and spot checking a worker I see:

Chain FORWARD (policy DROP)
target     prot opt source               destination
KUBE-FORWARD  all  --  0.0.0.0/0            0.0.0.0/0            /* kubernetes forward rules */
DOCKER-ISOLATION  all  --  0.0.0.0/0            0.0.0.0/0
DOCKER     all  --  0.0.0.0/0            0.0.0.0/0
ACCEPT     all  --  0.0.0.0/0            0.0.0.0/0            ctstate RELATED,ESTABLISHED
ACCEPT     all  --  0.0.0.0/0            0.0.0.0/0
ACCEPT     all  --  0.0.0.0/0            0.0.0.0/0
ACCEPT     all  --  10.244.0.0/16        0.0.0.0/0
ACCEPT     all  --  0.0.0.0/0            10.244.0.0/16

note default drop policy for filter/forward

Hard to diagnose across version boundaries here. Let's see how it fares.

Change 403308 merged by Rush:
[operations/puppet@production] tools: ferm pre hook to stop kube-proxy

https://gerrit.wikimedia.org/r/403308

Well I ended up spreading the ferm hacks across an impressive number of changesets due to lessons learned and typos :)

The current situation is that things are settled for a moment to see if the changes are effective.

Showing changes on tools-proxy-02:

Info: Using configured environment 'future'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
Info: Caching catalog for tools-proxy-02.tools.eqiad.wmflabs
Notice: /Stage[main]/Base::Environment/Tidy[/var/tmp/core]: Tidying 0 files
Info: Applying configuration version '1515599072'
Info: Computing checksum on file /etc/ferm/conf.d/0_ferm_restart_handler
Info: /Stage[main]/Ferm/File[/etc/ferm/conf.d/0_ferm_restart_handler]: Filebucketed /etc/ferm/conf.d/0_ferm_restart_handler to puppet with sum 3b2dd14d52cec733b23dcf0342a0755f
Notice: /Stage[main]/Ferm/File[/etc/ferm/conf.d/0_ferm_restart_handler]/ensure: removed
Info: /etc/ferm/conf.d: Scheduling refresh of Service[ferm]
Info: Computing checksum on file /usr/local/sbin/ferm_restart_handler
Info: /Stage[main]/Toollabs::Ferm_handlers/File[/usr/local/sbin/ferm_restart_handler]: Filebucketed /usr/local/sbin/ferm_restart_handler to puppet with sum 50d13f58b072104ccae4ed3f4e19da55
Notice: /Stage[main]/Toollabs::Ferm_handlers/File[/usr/local/sbin/ferm_restart_handler]/ensure: removed
Notice: /Stage[main]/Toollabs::Ferm_handlers/File[/usr/local/sbin/ferm_pre_handler]/ensure: defined content as '{md5}4b81869660467fe0dcfb84d34c5ea1f2'
Info: /Stage[main]/Toollabs::Ferm_handlers/File[/usr/local/sbin/ferm_pre_handler]: Scheduling refresh of Ferm::Conf[ferm_pre_handler]
Notice: /Stage[main]/Toollabs::Ferm_handlers/File[/usr/local/sbin/ferm_post_handler]/ensure: defined content as '{md5}e1b3c6e3784fd6ea9d5e87fceb222b0b'
Info: /Stage[main]/Toollabs::Ferm_handlers/File[/usr/local/sbin/ferm_post_handler]: Scheduling refresh of Ferm::Conf[ferm_post_handler]
Notice: /Stage[main]/Toollabs::Ferm_handlers/Ferm::Conf[ferm_pre_handler]/File[/etc/ferm/conf.d/00_ferm_pre_handler]/ensure: defined content as '{md5}87ec1ec01dbe3fbaf8b243802a1e9802'
Info: /Stage[main]/Toollabs::Ferm_handlers/Ferm::Conf[ferm_pre_handler]/File[/etc/ferm/conf.d/00_ferm_pre_handler]: Scheduling refresh of Service[ferm]
Notice: /Stage[main]/Toollabs::Ferm_handlers/Ferm::Conf[ferm_post_handler]/File[/etc/ferm/conf.d/00_ferm_post_handler]/ensure: defined content as '{md5}b037220752b7bd71ced4ff83472e7f34'
Info: /Stage[main]/Toollabs::Ferm_handlers/Ferm::Conf[ferm_post_handler]/File[/etc/ferm/conf.d/00_ferm_post_handler]: Scheduling refresh of Service[ferm]
Notice: /Stage[main]/Ferm/Service[ferm]: Triggered 'refresh' from 3 events
Notice: Applied catalog in 12.53 seconds
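
For readers who have not opened the changesets, here is a rough sketch of what a pre/post handler pair like the one installed above might look like. It is not the content of the real /usr/local/sbin/ferm_pre_handler or ferm_post_handler, which are not shown in this task. The function names, the DRY_RUN switch, the POST_SERVICES variable and the logger tags are all assumptions made for illustration. Only the two log messages and the fact that kube-proxy is stopped before and running again afterwards come from the logs in this task.

```shell
#!/bin/sh
# Hypothetical sketch only: the real handler scripts are not shown in this task.

# DRY_RUN=1 (the default in this sketch) prints commands instead of running them.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Runs before ferm replaces the rule set: stop kube-proxy so it is not
# rewriting iptables rules while ferm is doing the same.
ferm_pre_handler() {
    run logger -t ferm_pre_handler "stop kube-proxy"
    run service kube-proxy stop
}

# Runs after ferm has applied its rules: restart the components that manage
# their own iptables rules. POST_SERVICES is an assumed knob; kube-proxy is
# the only service the logs in this task confirm as restarted.
ferm_post_handler() {
    run logger -t ferm_post_handler "restart firewall components post ferm management"
    for svc in ${POST_SERVICES:-kube-proxy}; do
        run service "$svc" restart
    done
}
```

Sourcing the file and calling ferm_pre_handler or ferm_post_handler with the default DRY_RUN=1 only prints the commands, which makes it safe to try on a host outside the cluster.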

logger 'ferm test 1'
service ferm restart

Jan 10 15:46:23 tools-proxy-02 rush: ferm test 1
Jan 10 15:46:26 tools-proxy-02 systemd[1]: Stopping LSB: ferm firewall configuration...
Jan 10 15:46:27 tools-proxy-02 ferm[21828]: Stopping Firewall: ferm.
Jan 10 15:46:27 tools-proxy-02 systemd[1]: Starting LSB: ferm firewall configuration...
Jan 10 15:46:27 tools-proxy-02 /usr/local/sbin/ferm_pre_handler[21864]: stop kube-proxy
Jan 10 15:46:27 tools-proxy-02 /usr/local/sbin/ferm_post_handler[21891]: restart firewall components post ferm management
Jan 10 15:46:27 tools-proxy-02 ferm[21847]: Starting Firewall: ferm.
Jan 10 15:46:27 tools-proxy-02 systemd[1]: Started LSB: ferm firewall configuration.
root@tools-proxy-02:~# service kube-proxy status
● kube-proxy.service - Kubernetes Kube-Proxy Server
   Loaded: loaded (/lib/systemd/system/kube-proxy.service; enabled)
   Active: active (running) since Wed 2018-01-10 15:46:27 UTC; 17s ago
     Docs: https://github.com/kubernetes/kubernetes
           man:kube-proxy
 Main PID: 21932 (kube-proxy)
   CGroup: /system.slice/kube-proxy.service
           ├─21932 /usr/bin/kube-proxy --kubeconfig=/etc/kubernetes/kubeconfig --proxy-mode='iptables...
           └─22232 iptables-restore --noflush --counters

Jan 10 15:46:27 tools-proxy-02 systemd[1]: Started Kubernetes Kube-Proxy Server.
Jan 10 15:46:27 tools-proxy-02 kube-proxy[21932]: W0110 15:46:27.868874   21932 server.go:378] Fla...oxy
Jan 10 15:46:27 tools-proxy-02 kube-proxy[21932]: I0110 15:46:27.878410   21932 server.go:203] Usi...er.
Jan 10 15:46:27 tools-proxy-02 kube-proxy[21932]: W0110 15:46:27.991129   21932 server.go:436] Fai...und
Jan 10 15:46:27 tools-proxy-02 kube-proxy[21932]: W0110 15:46:27.991305   21932 proxier.go:226] in...eIP
Jan 10 15:46:27 tools-proxy-02 kube-proxy[21932]: I0110 15:46:27.991347   21932 server.go:215] Tea...es.
Jan 10 15:46:28 tools-proxy-02 kube-proxy[21932]: I0110 15:46:28.016673   21932 conntrack.go:81] S...072
Jan 10 15:46:28 tools-proxy-02 kube-proxy[21932]: I0110 15:46:28.023278   21932 conntrack.go:66] S...768
Jan 10 15:46:28 tools-proxy-02 kube-proxy[21932]: I0110 15:46:28.031481   21932 conntrack.go:81] S...400
Jan 10 15:46:28 tools-proxy-02 kube-proxy[21932]: I0110 15:46:28.031898   21932 conntrack.go:81] S...600
Jan 10 15:46:27 tools-proxy-02 systemd[1]: Started Kubernetes Kube-Proxy Server.
Jan 10 15:46:27 tools-proxy-02 kube-proxy[21932]: W0110 15:46:27.868874   21932 server.go:378] Flag proxy-mode="'iptables'" unknown, assuming iptables proxy
Jan 10 15:46:27 tools-proxy-02 kube-proxy[21932]: I0110 15:46:27.878410   21932 server.go:203] Using iptables Proxier.
Jan 10 15:46:27 tools-proxy-02 kube-proxy[21932]: W0110 15:46:27.991129   21932 server.go:436] Failed to retrieve node info: nodes "tools-proxy-02" not found
Jan 10 15:46:27 tools-proxy-02 kube-proxy[21932]: W0110 15:46:27.991305   21932 proxier.go:226] invalid nodeIP, initialize kube-proxy with 127.0.0.1 as nodeIP
Jan 10 15:46:27 tools-proxy-02 kube-proxy[21932]: I0110 15:46:27.991347   21932 server.go:215] Tearing down userspace rules.
Jan 10 15:46:28 tools-proxy-02 kube-proxy[21932]: I0110 15:46:28.016673   21932 conntrack.go:81] Set sysctl 'net/netfilter/nf_conntrack_max' to 131072
Jan 10 15:46:28 tools-proxy-02 kube-proxy[21932]: I0110 15:46:28.023278   21932 conntrack.go:66] Setting conntrack hashsize to 32768
Jan 10 15:46:28 tools-proxy-02 kube-proxy[21932]: I0110 15:46:28.031481   21932 conntrack.go:81] Set sysctl 'net/netfilter/nf_conntrack_tcp_timeout_established' to 86400
Jan 10 15:46:28 tools-proxy-02 kube-proxy[21932]: I0110 15:46:28.031898   21932 conntrack.go:81] Set sysctl 'net/netfilter/nf_conntrack_tcp_timeout_close_wait' to 3600

From a worker, after service ferm restart:

Jan 10 15:05:48 tools-worker-1001 rush: ferm test 2
Jan 10 15:06:01 tools-worker-1001 systemd[1]: Stopping LSB: ferm firewall configuration...
Jan 10 15:06:02 tools-worker-1001 ferm[1373]: Stopping Firewall: ferm.
Jan 10 15:06:02 tools-worker-1001 systemd[1]: Starting LSB: ferm firewall configuration...
Jan 10 15:06:02 tools-worker-1001 /usr/local/sbin/ferm_pre_handler[1402]: stop kube-proxy
Jan 10 15:06:02 tools-worker-1001 /usr/local/sbin/ferm_post_handler[1425]: restart firewall components post ferm management
Jan 10 15:06:06 tools-worker-1001 ferm[1393]: Starting Firewall: ferm.
Jan 10 15:06:06 tools-worker-1001 systemd[1]: Started LSB: ferm firewall configuration.
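
The ordering visible in the log excerpts above (pre handler message first, post handler message second) can also be checked mechanically. The helper below is a minimal sketch for that; check_handler_order is a hypothetical name, and the patterns simply match the two handler messages as they appear in the syslog lines quoted in this task.

```shell
#!/bin/sh
# Hypothetical helper: reads syslog lines on stdin and exits 0 only if a
# ferm_pre_handler "stop kube-proxy" line appears before a ferm_post_handler
# "restart firewall components" line.
check_handler_order() {
    awk '
        /ferm_pre_handler.*stop kube-proxy/ { if (!pre) pre = NR }
        /ferm_post_handler.*restart firewall components/ { if (!post) post = NR }
        END { exit !(pre && post && pre < post) }
    '
}
```

It could be fed the relevant syslog lines on stdin, for example the output of a grep for ferm on the host's syslog file; the exit status is non-zero if either message is missing or the order is reversed.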

I rescheduled pods onto workers before and after ferm was restarted, both ad hoc and by puppet. I haven't been able to reproduce an outage with the handlers in place. I then put all workers back in service and saw clean puppet runs across all Toolforge instances. This is all quite the mess, but in theory it is at least deterministic.

chasemp closed this task as Resolved.Jan 16 2018, 4:09 PM
chasemp claimed this task.

I'm tentatively closing this for now, until we see something we think is this again.