Paste P8133

tools-worker-1027 migration to eqiad1-r

Authored by GTirloni on Feb 26 2019, 8:20 PM.
Referenced Files
F28290035: raw.txt
Feb 26 2019, 8:41 PM
F28289999: raw.txt
Feb 26 2019, 8:20 PM
root@tools-k8s-master-01:~# kubectl get nodes
NAME STATUS AGE
tools-worker-1001.tools.eqiad.wmflabs Ready 2y
tools-worker-1002.tools.eqiad.wmflabs Ready,SchedulingDisabled 2y
tools-worker-1003.tools.eqiad.wmflabs Ready 2y
tools-worker-1004.tools.eqiad.wmflabs Ready 2y
tools-worker-1005.tools.eqiad.wmflabs Ready,SchedulingDisabled 2y
tools-worker-1006.tools.eqiad.wmflabs Ready 2y
tools-worker-1007.tools.eqiad.wmflabs Ready 2y
tools-worker-1008.tools.eqiad.wmflabs Ready 2y
tools-worker-1009.tools.eqiad.wmflabs Ready 2y
tools-worker-1010.tools.eqiad.wmflabs Ready 2y
tools-worker-1011.tools.eqiad.wmflabs Ready 2y
tools-worker-1012.tools.eqiad.wmflabs Ready 2y
tools-worker-1013.tools.eqiad.wmflabs Ready 2y
tools-worker-1014.tools.eqiad.wmflabs Ready 2y
tools-worker-1015.tools.eqiad.wmflabs Ready 2y
tools-worker-1016.tools.eqiad.wmflabs Ready 2y
tools-worker-1017.tools.eqiad.wmflabs Ready 2y
tools-worker-1018.tools.eqiad.wmflabs Ready 2y
tools-worker-1019.tools.eqiad.wmflabs Ready 2y
tools-worker-1020.tools.eqiad.wmflabs Ready 2y
tools-worker-1021.tools.eqiad.wmflabs Ready 2y
tools-worker-1022.tools.eqiad.wmflabs Ready 2y
tools-worker-1023.tools.eqiad.wmflabs Ready 2y
tools-worker-1025.tools.eqiad.wmflabs Ready 2y
tools-worker-1026.tools.eqiad.wmflabs Ready 2y
tools-worker-1027.tools.eqiad.wmflabs NotReady,SchedulingDisabled 2y
tools-worker-1028.tools.eqiad.wmflabs Ready,SchedulingDisabled 1y
==>> Node address is wrong (Addresses still shows the old 10.68.17.215)
root@tools-k8s-master-01:~# kubectl describe node tools-worker-1027.tools.eqiad.wmflabs
Name: tools-worker-1027.tools.eqiad.wmflabs
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
kubernetes.io/hostname=tools-worker-1027.tools.eqiad.wmflabs
Taints: <none>
CreationTimestamp: Sat, 04 Feb 2017 02:30:28 +0000
Phase:
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
OutOfDisk Unknown Tue, 26 Feb 2019 17:31:08 +0000 Tue, 26 Feb 2019 17:31:49 +0000 NodeStatusUnknown Kubelet stopped posting node status.
MemoryPressure False Tue, 26 Feb 2019 17:31:08 +0000 Sat, 04 Feb 2017 02:30:28 +0000 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Tue, 26 Feb 2019 17:31:08 +0000 Sat, 04 Feb 2017 02:30:28 +0000 KubeletHasNoDiskPressure kubelet has no disk pressure
Ready Unknown Tue, 26 Feb 2019 17:31:08 +0000 Tue, 26 Feb 2019 17:31:49 +0000 NodeStatusUnknown Kubelet stopped posting node status.
Addresses: 10.68.17.215,10.68.17.215 <<<<<<<<<<<<<<<<<<<<<<
Capacity:
alpha.kubernetes.io/nvidia-gpu: 0
cpu: 4
memory: 8179168Ki
pods: 110
Allocatable:
alpha.kubernetes.io/nvidia-gpu: 0
cpu: 4
memory: 8179168Ki
pods: 110
System Info:
Machine ID: d59e74c8689f444f8bf8656805f9ae89
System UUID: 750E2BA9-4D07-4D4A-8FDD-1A659BAAC06A
Boot ID: f092bafe-ffe6-44b9-b22d-308b10fa9618
Kernel Version: 4.9.0-0.bpo.6-amd64
OS Image: Debian GNU/Linux 8 (jessie)
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://1.12.6
Kubelet Version: v1.4.6+e569a27
Kube-Proxy Version: v1.4.6+e569a27
ExternalID: tools-worker-1027.tools.eqiad.wmflabs
Non-terminated Pods: (0 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits
--------- ---- ------------ ---------- --------------- -------------
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.
CPU Requests CPU Limits Memory Requests Memory Limits
------------ ---------- --------------- -------------
0 (0%) 0 (0%) 0 (0%) 0 (0%)
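The Addresses field above still carries 10.68.17.215, while DNS (queried further down in this paste) returns 172.16.1.118 for the same host. A minimal sketch of a check that compares the two follows. The function names are hypothetical and were not part of the original session; it assumes `kubectl` with jsonpath output and `dig` are available on the master.

```shell
#!/bin/sh
# Hypothetical helper names; not part of the original session.

# Succeeds if the DNS answer ($2) appears in the space-separated list of
# addresses recorded on the node object ($1).
node_address_matches() {
    recorded="$1"
    resolved="$2"
    for addr in $recorded; do
        [ "$addr" = "$resolved" ] && return 0
    done
    return 1
}

# Prints a one-line verdict for a node; needs kubectl and dig on the PATH.
check_node_address() {
    node="$1"
    recorded=$(kubectl get node "$node" -o jsonpath='{.status.addresses[*].address}')
    resolved=$(dig +short "$node" | head -n 1)
    if node_address_matches "$recorded" "$resolved"; then
        echo "OK $node $resolved"
    else
        echo "MISMATCH $node recorded='$recorded' dns='$resolved'"
    fi
}
```

Usage would be `check_node_address tools-worker-1027.tools.eqiad.wmflabs`; an empty DNS answer is also reported as a mismatch.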
==>> Fix node address
root@tools-k8s-master-01:~# kubectl edit node tools-worker-1027.tools.eqiad.wmflabs
node "tools-worker-1027.tools.eqiad.wmflabs" edited
==>> Still not working
root@tools-k8s-master-01:~# kubectl get node tools-worker-1027.tools.eqiad.wmflabs
NAME STATUS AGE
tools-worker-1027.tools.eqiad.wmflabs NotReady,SchedulingDisabled 2y
==>> Docker can't start on the worker
root@tools-worker-1027:~# systemctl status docker --full
● docker.service - Docker Application Container Engine
Loaded: loaded (/lib/systemd/system/docker.service; enabled)
Drop-In: /etc/systemd/system/docker.service.d
└─puppet-override.conf
Active: failed (Result: start-limit)
Docs: https://docs.docker.com
Feb 26 20:02:06 tools-worker-1027 systemd[1]: Failed to load environment files: No such file or directory
Feb 26 20:02:06 tools-worker-1027 systemd[1]: docker.service failed to run 'start' task: No such file or directory
Feb 26 20:02:06 tools-worker-1027 systemd[1]: Failed to start Docker Application Container Engine.
==>> Docker can start manually; the systemd definition is what fails
root@tools-worker-1027:~# dockerd
INFO[0000] libcontainerd: new containerd process, pid: 12984
WARN[0001] devmapper: Base device already exists and has filesystem xfs on it. User specified filesystem will be ignored.
INFO[0002] Graph migration to content-addressability took 0.00 seconds
INFO[0002] Loading containers: start.
.INFO[0002] Default bridge (docker0) is assigned with an IP address 192.168.212.0/24. Daemon option --bip can be used to set a preferred IP address
INFO[0002] Loading containers: done.
INFO[0002] Daemon has completed initialization
INFO[0002] Docker daemon commit=78d1802 graphdriver=devicemapper version=1.12.6
INFO[0002] API listen on /var/run/docker.sock
root@tools-worker-1027:~# cat /etc/systemd/system/docker.service.d/puppet-override.conf
# Docker override systemd for v1.11.2-0~jessie
[Unit]
After=network.target docker.socket flannel.service
Requires=docker.socket flannel.service
[Service]
EnvironmentFile=/run/flannel/subnet.env
# We need to clear ExecStart first before setting it again
ExecStart=
ExecStart=/usr/bin/docker daemon -H fd:// \
--config-file=/etc/docker/daemon.json \
--bip=${FLANNEL_SUBNET} \
--mtu=${FLANNEL_MTU}
==>> flannel runtime data is missing
root@tools-worker-1027:~# ls /run/flannel/subnet.env
ls: cannot access /run/flannel/subnet.env: No such file or directory
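The "Failed to load environment files" error from systemd lines up with the `EnvironmentFile=/run/flannel/subnet.env` line in the override: the file does not exist, so the unit cannot start. A small sketch of a pre-flight check for this situation is below. The function name is hypothetical, and it assumes one `EnvironmentFile=` entry per line with no leading whitespace.

```shell
#!/bin/sh
# Hypothetical helper; not part of the original session.
# Prints every EnvironmentFile= path in a unit or drop-in file that does not
# exist. Paths with a leading "-" are optional to systemd and are skipped.
missing_env_files() {
    unit_file="$1"
    sed -n 's/^EnvironmentFile=//p' "$unit_file" | while IFS= read -r path; do
        case "$path" in
            -*) continue ;;
        esac
        [ -e "$path" ] || echo "$path"
    done
}
```

Run as `missing_env_files /etc/systemd/system/docker.service.d/puppet-override.conf` on the worker at this point in the session, it would be expected to print `/run/flannel/subnet.env`.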
==>> flannel is broken: it can't talk to the flannel-etcd nodes in eqiad
root@tools-worker-1027:~# systemctl status flannel
● flannel.service - flannel overlay network
Loaded: loaded (/lib/systemd/system/flannel.service; enabled)
Active: active (running) since Tue 2019-02-26 19:48:11 UTC; 17min ago
Main PID: 654 (flanneld)
CGroup: /system.slice/flannel.service
└─654 /usr/bin/flanneld --etcd-endpoints=https://tools-flannel-etcd-01.tools.eqiad.wm..
Feb 26 19:51:12 tools-worker-1027 flanneld[654]: E0226 19:51:12.575418 00654 network.go:53] Failed to retrieve network config: client: etcd cluster is unavailable or misconfigured
Feb 26 19:52:43 tools-worker-1027 flanneld[654]: E0226 19:52:43.577219 00654 network.go:53] Failed to retrieve network config: client: etcd cluster is unavailable or misconfigured
==>> security group 'etcd' is allowing 172.16.0.0/21 to connect to ports 2379/2380
==>> packets are arriving
root@tools-worker-1027:~# curl http://tools-flannel-etcd-01.tools.eqiad.wmflabs:2379
^C
gtirloni@tools-flannel-etcd-01:~$ sudo tcpdump -n -i eth0 net 172.16.0.0/21 and port 2379
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
20:09:17.580444 IP 172.16.1.118.48786 > 10.68.17.15.2379: Flags [S], seq 3417719856, win 29200, options [mss 1460,sackOK,TS val 247182 ecr 0,nop,wscale 9], length 0
20:09:18.610029 IP 172.16.1.118.48786 > 10.68.17.15.2379: Flags [S], seq 3417719856, win 29200, options [mss 1460,sackOK,TS val 247440 ecr 0,nop,wscale 9], length 0
20:09:20.625803 IP 172.16.1.118.48786 > 10.68.17.15.2379: Flags [S], seq 3417719856, win 29200, options [mss 1460,sackOK,TS val 247944 ecr 0,nop,wscale 9], length 0
root@tools-flannel-etcd-01:~# iptables -A INPUT -p tcp -s 172.16.0.0/21 --dport 2379 -j ACCEPT
root@tools-worker-1027:~# telnet tools-flannel-etcd-01.tools.eqiad.wmflabs 2379
Trying 10.68.17.15...
Connected to tools-flannel-etcd-01.tools.eqiad.wmflabs.
Escape character is '^]'.
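The capture above shows the same SYN (seq 3417719856) retransmitted three times with no SYN-ACK going back, which points at the etcd host itself dropping the packets rather than the security group. A sketch that summarizes this from tcpdump text output is below. The function name is hypothetical, and it assumes the default tcpdump line format shown in this paste.

```shell
#!/bin/sh
# Hypothetical helper; not part of the original session.
# Reads tcpdump text output on stdin and counts SYNs ([S]) and SYN-ACKs ([S.]).
# SYNs with no SYN-ACK suggest the listening host is dropping or ignoring them.
syn_summary() {
    awk '
        index($0, "Flags [S],")  { syn++ }
        index($0, "Flags [S.],") { synack++ }
        END { printf "syn=%d synack=%d\n", syn, synack }
    '
}
```

It could be fed from a saved capture, for example `tcpdump -n -r capture.pcap | syn_summary`, where `capture.pcap` is a placeholder file name.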
==>> flannel issue is with the iptables rules on tools-flannel-etcd-01
==>> https://wikitech.wikimedia.org/wiki/Hiera:Tools has tools-worker-1027.tools.eqiad.wmflabs
root@tools-flannel-etcd-01:~# dig +short tools-worker-1027.tools.eqiad.wmflabs
172.16.1.118
root@tools-puppetmaster-01:~# dig +short tools-worker-1027.tools.eqiad.wmflabs
172.16.1.118
root@tools-flannel-etcd-01:~# iptables -L -n | grep 172.16.0.0 | grep 2379
root@tools-flannel-etcd-01:~#
==>> DNS entry for worker-1027 is okay. Has Puppet not run since the migration?
root@tools-flannel-etcd-01:~# puppet agent -t
Info: Using configured environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
Info: Caching catalog for tools-flannel-etcd-01.tools.eqiad.wmflabs
Notice: /Stage[main]/Base::Environment/Tidy[/var/tmp/core]: Tidying 0 files
Info: Applying configuration version '1551212181'
Notice: Applied catalog in 8.46 seconds
root@tools-flannel-etcd-01:~# iptables -L -n | grep 172.16.0.0 | grep 2379
root@tools-flannel-etcd-01:~#
root@tools-flannel-etcd-01:~# grep 172.16.1.118 /var/lib/puppet/client_data/catalog/tools-flannel-etcd-01.tools.eqiad.wmflabs.json
root@tools-flannel-etcd-01:~#
==>> https://phabricator.wikimedia.org/T113380 (Ferm doesn't update @resolve hostnames on IP change)
root@tools-flannel-etcd-01:~# grep CACHE /etc/default/ferm
CACHE=no
root@tools-flannel-etcd-01:~# journalctl -u ferm --full
-- Logs begin at Tue 2019-02-26 06:41:45 UTC, end at Tue 2019-02-26 20:35:41 UTC. --
Feb 26 20:34:50 tools-flannel-etcd-01 systemd[1]: Reloading LSB: ferm firewall configuration.
Feb 26 20:34:50 tools-flannel-etcd-01 ferm[2822]: Reloading Firewall configuration...Error in /etc/
Feb 26 20:34:50 tools-flannel-etcd-01 ferm[2822]: tools-worker-1001.tools.eqiad.wmflabs tools-worke
Feb 26 20:34:50 tools-flannel-etcd-01 systemd[1]: ferm.service: control process exited, code=exited
Feb 26 20:34:50 tools-flannel-etcd-01 systemd[1]: Reload failed for LSB: ferm firewall configuratio
Feb 26 20:34:50 tools-flannel-etcd-01 ferm[2822]: )
Feb 26 20:34:50 tools-flannel-etcd-01 ferm[2822]: )
Feb 26 20:34:50 tools-flannel-etcd-01 ferm[2822]: <--
Feb 26 20:34:50 tools-flannel-etcd-01 ferm[2822]: DNS query for 'tools-proxy-01.tools.eqiad.wmflabs
root@tools-flannel-etcd-01:~# journalctl -u ferm --full -no-pager
Failed to parse lines 'o-pager'
root@tools-flannel-etcd-01:~# journalctl -u ferm --full --no-pager
-- Logs begin at Tue 2019-02-26 06:41:45 UTC, end at Tue 2019-02-26 20:35:51 UTC. --
Feb 26 20:34:50 tools-flannel-etcd-01 systemd[1]: Reloading LSB: ferm firewall configuration.
Feb 26 20:34:50 tools-flannel-etcd-01 ferm[2822]: Reloading Firewall configuration...Error in /etc/ferm/conf.d/10_flannel-clients line 4:
Feb 26 20:34:50 tools-flannel-etcd-01 ferm[2822]: tools-worker-1001.tools.eqiad.wmflabs tools-worker-1002.tools.eqiad.wmflabs tools-worker-1003.tools.eqiad.wmflabs tools-worker-1004.tools.eqiad.wmflabs tools-worker-1005.tools.eqiad.wmflabs tools-worker-1006.tools.eqiad.wmflabs tools-worker-1007.tools.eqiad.wmflabs tools-worker-1008.tools.eqiad.wmflabs tools-worker-1009.tools.eqiad.wmflabs tools-worker-1010.tools.eqiad.wmflabs tools-worker-1011.tools.eqiad.wmflabs tools-worker-1012.tools.eqiad.wmflabs tools-worker-1013.tools.eqiad.wmflabs tools-worker-1014.tools.eqiad.wmflabs tools-worker-1015.tools.eqiad.wmflabs tools-worker-1016.tools.eqiad.wmflabs tools-worker-1017.tools.eqiad.wmflabs tools-worker-1018.tools.eqiad.wmflabs tools-worker-1019.tools.eqiad.wmflabs tools-worker-1020.tools.eqiad.wmflabs tools-worker-1021.tools.eqiad.wmflabs tools-worker-1022.tools.eqiad.wmflabs tools-worker-1023.tools.eqiad.wmflabs tools-worker-1025.tools.eqiad.wmflabs tools-worker-1026.tools.eqiad.wmflabs tools-worker-1027.tools.eqiad.wmflabs tools-worker-1028.tools.eqiad.wmflabs tools-flannel-etcd-01.tools.eqiad.wmflabs tools-flannel-etcd-02.tools.eqiad.wmflabs tools-flannel-etcd-03.tools.eqiad.wmflabs tools-proxy-01.tools.eqiad.wmflabs tools-proxy-02.tools.eqiad.wmflabs tools-proxy-03.tools.eqiad.wmflabs tools-proxy-04.tools.eqiad.wmflabs tools-bastion-03.tools.eqiad.wmflabs tools-bastion-05.tools.eqiad.wmflabs tools-checker-01.tools.eqiad.wmflabs tools-checker-02.tools.eqiad.wmflabs
Feb 26 20:34:50 tools-flannel-etcd-01 systemd[1]: ferm.service: control process exited, code=exited status=255
Feb 26 20:34:50 tools-flannel-etcd-01 systemd[1]: Reload failed for LSB: ferm firewall configuration.
Feb 26 20:34:50 tools-flannel-etcd-01 ferm[2822]: )
Feb 26 20:34:50 tools-flannel-etcd-01 ferm[2822]: )
Feb 26 20:34:50 tools-flannel-etcd-01 ferm[2822]: <--
Feb 26 20:34:50 tools-flannel-etcd-01 ferm[2822]: DNS query for 'tools-proxy-01.tools.eqiad.wmflabs' failed: NXDOMAIN
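The journal shows the ferm reload failing because one name in the `@resolve` list returned NXDOMAIN, which apparently kept the whole rule file, including the new worker-1027 address, from being applied. A sketch that lists unresolvable names before a reload is below. The function names and the `RESOLVER` variable are hypothetical. It assumes the single-line `@resolve((...))` layout seen in `10_flannel-clients`, and it uses `getent hosts`, which goes through NSS and so may not behave exactly like ferm's own DNS lookup.

```shell
#!/bin/sh
# Hypothetical helpers; not part of the original session.

# Prints, one per line, the hostnames inside @resolve((...)) in a ferm file.
resolve_hosts() {
    sed -n 's/.*@resolve((\([^)]*\))).*/\1/p' "$1" | tr -s ' ' '\n'
}

# Reads hostnames on stdin and prints those the resolver cannot look up.
# RESOLVER defaults to "getent hosts"; it is overridable for testing.
unresolvable_hosts() {
    while IFS= read -r host; do
        [ -n "$host" ] || continue
        ${RESOLVER:-getent hosts} "$host" >/dev/null 2>&1 || echo "$host"
    done
}
```

For example, `resolve_hosts /etc/ferm/conf.d/10_flannel-clients | unresolvable_hosts` prints one name per line for each entry that fails to resolve.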
==>> Removed tools-proxy-01 and tools-proxy-02 from the Hiera value
https://wikitech.wikimedia.org/w/index.php?title=Hiera%3ATools&type=revision&diff=1817552&oldid=1813808
==>> Run puppet again
root@tools-flannel-etcd-01:~# puppet agent -t
Info: Using configured environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
Info: Caching catalog for tools-flannel-etcd-01.tools.eqiad.wmflabs
Notice: /Stage[main]/Base::Environment/Tidy[/var/tmp/core]: Tidying 0 files
Info: Applying configuration version '1551213507'
Notice: /Stage[main]/Role::Toollabs::Etcd::Flannel/Ferm::Service[flannel-clients]/File[/etc/ferm/conf.d/10_flannel-clients]/content:
--- /etc/ferm/conf.d/10_flannel-clients 2019-01-14 15:12:54.089887964 +0000
+++ /tmp/puppet-file20190226-3263-1kzp2vo 2019-02-26 20:38:43.732459560 +0000
@@ -1,5 +1,5 @@
# Autogenerated by puppet. DO NOT EDIT BY HAND!
#
#
-&R_SERVICE(tcp, 2379, @resolve((tools-worker-1001.tools.eqiad.wmflabs tools-worker-1002.tools.eqiad.wmflabs tools-worker-1003.tools.eqiad.wmflabs tools-worker-1004.tools.eqiad.wmflabs tools-worker-1005.tools.eqiad.wmflabs tools-worker-1006.tools.eqiad.wmflabs tools-worker-1007.tools.eqiad.wmflabs tools-worker-1008.tools.eqiad.wmflabs tools-worker-1009.tools.eqiad.wmflabs tools-worker-1010.tools.eqiad.wmflabs tools-worker-1011.tools.eqiad.wmflabs tools-worker-1012.tools.eqiad.wmflabs tools-worker-1013.tools.eqiad.wmflabs tools-worker-1014.tools.eqiad.wmflabs tools-worker-1015.tools.eqiad.wmflabs tools-worker-1016.tools.eqiad.wmflabs tools-worker-1017.tools.eqiad.wmflabs tools-worker-1018.tools.eqiad.wmflabs tools-worker-1019.tools.eqiad.wmflabs tools-worker-1020.tools.eqiad.wmflabs tools-worker-1021.tools.eqiad.wmflabs tools-worker-1022.tools.eqiad.wmflabs tools-worker-1023.tools.eqiad.wmflabs tools-worker-1025.tools.eqiad.wmflabs tools-worker-1026.tools.eqiad.wmflabs tools-worker-1027.tools.eqiad.wmflabs tools-worker-1028.tools.eqiad.wmflabs tools-flannel-etcd-01.tools.eqiad.wmflabs tools-flannel-etcd-02.tools.eqiad.wmflabs tools-flannel-etcd-03.tools.eqiad.wmflabs tools-proxy-01.tools.eqiad.wmflabs tools-proxy-02.tools.eqiad.wmflabs tools-proxy-03.tools.eqiad.wmflabs tools-proxy-04.tools.eqiad.wmflabs tools-bastion-03.tools.eqiad.wmflabs tools-bastion-05.tools.eqiad.wmflabs tools-checker-01.tools.eqiad.wmflabs tools-checker-02.tools.eqiad.wmflabs)));
+&R_SERVICE(tcp, 2379, @resolve((tools-worker-1001.tools.eqiad.wmflabs tools-worker-1002.tools.eqiad.wmflabs tools-worker-1003.tools.eqiad.wmflabs tools-worker-1004.tools.eqiad.wmflabs tools-worker-1005.tools.eqiad.wmflabs tools-worker-1006.tools.eqiad.wmflabs tools-worker-1007.tools.eqiad.wmflabs tools-worker-1008.tools.eqiad.wmflabs tools-worker-1009.tools.eqiad.wmflabs tools-worker-1010.tools.eqiad.wmflabs tools-worker-1011.tools.eqiad.wmflabs tools-worker-1012.tools.eqiad.wmflabs tools-worker-1013.tools.eqiad.wmflabs tools-worker-1014.tools.eqiad.wmflabs tools-worker-1015.tools.eqiad.wmflabs tools-worker-1016.tools.eqiad.wmflabs tools-worker-1017.tools.eqiad.wmflabs tools-worker-1018.tools.eqiad.wmflabs tools-worker-1019.tools.eqiad.wmflabs tools-worker-1020.tools.eqiad.wmflabs tools-worker-1021.tools.eqiad.wmflabs tools-worker-1022.tools.eqiad.wmflabs tools-worker-1023.tools.eqiad.wmflabs tools-worker-1025.tools.eqiad.wmflabs tools-worker-1026.tools.eqiad.wmflabs tools-worker-1027.tools.eqiad.wmflabs tools-worker-1028.tools.eqiad.wmflabs tools-flannel-etcd-01.tools.eqiad.wmflabs tools-flannel-etcd-02.tools.eqiad.wmflabs tools-flannel-etcd-03.tools.eqiad.wmflabs tools-proxy-03.tools.eqiad.wmflabs tools-proxy-04.tools.eqiad.wmflabs tools-bastion-03.tools.eqiad.wmflabs tools-bastion-05.tools.eqiad.wmflabs tools-checker-01.tools.eqiad.wmflabs tools-checker-02.tools.eqiad.wmflabs)));
Info: Computing checksum on file /etc/ferm/conf.d/10_flannel-clients
Info: /Stage[main]/Role::Toollabs::Etcd::Flannel/Ferm::Service[flannel-clients]/File[/etc/ferm/conf.d/10_flannel-clients]: Filebucketed /etc/ferm/conf.d/10_flannel-clients to puppet with sum 1b18e2319c219e8818776c86b9c666b7
Notice: /Stage[main]/Role::Toollabs::Etcd::Flannel/Ferm::Service[flannel-clients]/File[/etc/ferm/conf.d/10_flannel-clients]/content: content changed '{md5}1b18e2319c219e8818776c86b9c666b7' to '{md5}8b9fe86136cc43596a65c5980d523eff'
Info: /Stage[main]/Role::Toollabs::Etcd::Flannel/Ferm::Service[flannel-clients]/File[/etc/ferm/conf.d/10_flannel-clients]: Scheduling refresh of Service[ferm]
Notice: /Stage[main]/Ferm/Service[ferm]: Triggered 'refresh' from 1 events
Notice: Applied catalog in 9.71 seconds
root@tools-flannel-etcd-01:~# iptables -L -n | grep 172.16.1.118
ACCEPT tcp -- 172.16.1.118 0.0.0.0/0 tcp dpt:2379
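The grep above confirms the per-host rule now exists. As a stricter variant, the sketch below matches whole fields instead of substrings, since a pattern like `172.16.1.118` given to grep treats each dot as "any character". The function name is hypothetical, and it assumes the column layout of `iptables -L -n` shown in this paste.

```shell
#!/bin/sh
# Hypothetical helper; not part of the original session.
# Reads `iptables -L -n` output on stdin; succeeds if an ACCEPT rule exists
# for exactly this source address and TCP destination port.
has_accept_rule() {
    awk -v src="$1" -v port="$2" '
        $1 == "ACCEPT" && $2 == "tcp" && $4 == src && $NF == ("dpt:" port) { found = 1 }
        END { exit (found ? 0 : 1) }
    '
}
```

Usage would be `iptables -L -n | has_accept_rule 172.16.1.118 2379 && echo present`.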
==>> flannel is working
root@tools-worker-1027:~# systemctl status flannel
● flannel.service - flannel overlay network
Loaded: loaded (/lib/systemd/system/flannel.service; enabled)
Active: active (running) since Tue 2019-02-26 19:48:11 UTC; 51min ago
Main PID: 654 (flanneld)
CGroup: /system.slice/flannel.service
└─654 /usr/bin/flanneld --etcd-endpoints=https://tools-flannel-etcd-01.tools.eqiad.wm...
root@tools-worker-1027:~# ls /run/flannel/subnet.env
/run/flannel/subnet.env
==>> docker is running
root@tools-worker-1027:~# systemctl status docker
● docker.service - Docker Application Container Engine
Loaded: loaded (/lib/systemd/system/docker.service; enabled)
Drop-In: /etc/systemd/system/docker.service.d
└─puppet-override.conf
Active: active (running) since Tue 2019-02-26 20:26:39 UTC; 13min ago
==>> Node is healthy
root@tools-k8s-master-01:~# kubectl get node tools-worker-1027.tools.eqiad.wmflabs
NAME STATUS AGE
tools-worker-1027.tools.eqiad.wmflabs Ready,SchedulingDisabled 2y