
Kubernetes worker nodes hanging
Closed, Resolved · Public

Description

Task to collect information on kubernetes worker nodes hanging. For each node, collect:

  1. Time when it started to hang
  2. Symptoms from graphite (high iowait? high CPU usage?)
  3. Labvirt node it was on
  4. If ssh was accessible
  5. Snippets from console log
  6. How it was 'fixed'

Table:

| date | node              | labvirt     | iowait | ssh | fix type             |
|------|-------------------|-------------|--------|-----|----------------------|
| 7/21 | tools-worker-1018 | labvirt1010 | yes    | no  | reboot               |
| 7/22 | tools-worker-1015 | labvirt1010 | yes    | no  | reboot               |
| 7/25 | tools-worker-1005 | labvirt1004 | yes    | yes | destroyed with fire  |

Details

Related Gerrit Patches:
operations/puppet (production): tools: Add check for high iowait

Event Timeline

Restricted Application added a project: Cloud-Services. Jul 21 2016, 5:29 PM
Restricted Application added a subscriber: Aklapper.
yuvipanda added a comment (edited). Jul 21 2016, 5:35 PM
  1. Time:
  2. Symptoms from graphite: high iowait
  3. labvirt it was on: labvirt1010
  4. ssh accessible: no
  5. console log (see the retrieval sketch after this list):

[1098481.420227] INFO: task jbd2/vda3-8:143 blocked for more than 120 seconds.
[1098481.425226] Not tainted 4.4.0-1-amd64 #1
[1098481.425617] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1098601.424262] INFO: task jbd2/vda3-8:143 blocked for more than 120 seconds.

6. Not fixed yet
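
Since ssh wasn't accessible, the console output above has to be pulled from outside the instance. A rough sketch, assuming the OpenStack CLI (or the older nova client) is available with the right project credentials:

openstack console log show tools-worker-1018 --lines 100
# or, with the older nova client:
nova console-log tools-worker-1018 | tail -n 100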

Change 300334 had a related patch set uploaded (by Yuvipanda):
tools: Add check for high iowait

https://gerrit.wikimedia.org/r/300334

Change 300334 merged by Yuvipanda:
tools: Add check for high iowait

https://gerrit.wikimedia.org/r/300334
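
The patch itself isn't quoted here; as a rough illustration only (not the actual check), the alert is driven by the per-worker graphite iowait metric, which can be inspected from the shell. The 10% threshold below is purely illustrative:

curl -s 'https://graphite-labs.wikimedia.org/render/?target=tools.tools-worker-1018.cpu.total.iowait&from=-30min&format=json' \
  | jq '.[0].datapoints[] | select(.[0] != null and .[0] > 10)'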

Found this stack trace in dmesg for tools-worker-1018:

[Thu Jul 21 11:10:11 2016] INFO: task jbd2/vda3-8:143 blocked for more than 120 seconds.
[Thu Jul 21 11:10:11 2016]       Not tainted 4.4.0-1-amd64 #1
[Thu Jul 21 11:10:11 2016] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Thu Jul 21 11:10:11 2016] jbd2/vda3-8     D ffff88023fd95d80     0   143      2 0x00000000
[Thu Jul 21 11:10:11 2016]  ffff880233a99100 ffff8802361f6f00 ffff8800ba7d0000 ffff8800ba7cfbc0
[Thu Jul 21 11:10:11 2016]  7fffffffffffffff ffffffff81590a90 ffff8800ba7cfc40 ffff880008952a00
[Thu Jul 21 11:10:11 2016]  ffffffff81590211 0000000000000000 ffffffff815931a5 7fffffffffffffff
[Thu Jul 21 11:10:11 2016] Call Trace:
[Thu Jul 21 11:10:11 2016]  [<ffffffff81590a90>] ? bit_wait_timeout+0xa0/0xa0
[Thu Jul 21 11:10:11 2016]  [<ffffffff81590211>] ? schedule+0x31/0x80
[Thu Jul 21 11:10:11 2016]  [<ffffffff815931a5>] ? schedule_timeout+0x235/0x2d0
[Thu Jul 21 11:10:11 2016]  [<ffffffff81590a90>] ? bit_wait_timeout+0xa0/0xa0
[Thu Jul 21 11:10:11 2016]  [<ffffffff8158f814>] ? io_schedule_timeout+0xb4/0x130
[Thu Jul 21 11:10:11 2016]  [<ffffffff810b8095>] ? prepare_to_wait+0x55/0x80
[Thu Jul 21 11:10:11 2016]  [<ffffffff81590aa7>] ? bit_wait_io+0x17/0x60
[Thu Jul 21 11:10:11 2016]  [<ffffffff8159058a>] ? __wait_on_bit+0x5a/0x90
[Thu Jul 21 11:10:11 2016]  [<ffffffff81590a90>] ? bit_wait_timeout+0xa0/0xa0
[Thu Jul 21 11:10:11 2016]  [<ffffffff8159063e>] ? out_of_line_wait_on_bit+0x7e/0xa0
[Thu Jul 21 11:10:11 2016]  [<ffffffff810b8400>] ? autoremove_wake_function+0x40/0x40
[Thu Jul 21 11:10:11 2016]  [<ffffffffa01d7151>] ? jbd2_journal_commit_transaction+0x1121/0x1900 [jbd2]
[Thu Jul 21 11:10:11 2016]  [<ffffffff810ad015>] ? put_prev_entity+0x35/0x710
[Thu Jul 21 11:10:11 2016]  [<ffffffff810dd419>] ? try_to_del_timer_sync+0x59/0x80
[Thu Jul 21 11:10:11 2016]  [<ffffffffa01db1cd>] ? kjournald2+0xdd/0x280 [jbd2]
[Thu Jul 21 11:10:11 2016]  [<ffffffff810b83c0>] ? wait_woken+0x90/0x90
[Thu Jul 21 11:10:11 2016]  [<ffffffffa01db0f0>] ? commit_timeout+0x10/0x10 [jbd2]
[Thu Jul 21 11:10:11 2016]  [<ffffffff81096cff>] ? kthread+0xdf/0x100
[Thu Jul 21 11:10:11 2016]  [<ffffffff81096c20>] ? kthread_park+0x50/0x50
[Thu Jul 21 11:10:11 2016]  [<ffffffff8159435f>] ? ret_from_fork+0x3f/0x70
[Thu Jul 21 11:10:11 2016]  [<ffffffff81096c20>] ? kthread_park+0x50/0x50

Repeats over and over again

First report of the error is at

[Thu Jul 21 11:06:11 2016] INFO: task jbd2/vda3-8:143 blocked for more than 120 seconds.
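
(The timestamps above are in dmesg -T format; finding the first report and counting the repeats is just a grep, e.g.:)

dmesg -T | grep 'blocked for more than 120 seconds' | head -1   # first report
dmesg -T | grep -c 'blocked for more than 120 seconds'          # number of repeats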
hashar added a subscriber: hashar. Jul 21 2016, 9:48 PM

jbd2_journal_commit_transaction and overall disk I/O are borked somehow.

We had that for a while on CI slaves at the end of June. A console trace is in P3278 and the task is T138281.

That was with Linux version 3.16.0-4-amd64 (debian-kernel@lists.debian.org) (gcc version 4.8.4 (Debian 4.8.4-1) ) #1 SMP Debian 3.16.7-ckt25-2 (2016-04-08)

Mentioned in SAL [2016-07-21T22:42:27Z] <chasemp> reboot tools-worker-1018 as stuck T141017

chasemp updated the task description. Jul 21 2016, 10:48 PM

A few notes on this. We saw iowait spike considerably, and diamond appeared to still be flushing stats during the incident even though the host itself was unavailable for investigation. ssh sessions would occasionally stay up briefly, but then cut out shortly thereafter.

iowait for all workers during incident

relative link https://graphite-labs.wikimedia.org/render/?width=1300&height=700&_salt=1469141746.904&target=cactiStyle(tools.tools-worker-10*.cpu.total.iowait)&target=tools.tools-worker-1001.iostat.vda.await&from=-24h&hideLegend=false

iowait for labvirts

https://graphite.wikimedia.org/render/?width=960&height=530&_salt=1469142700.402&target=cactiStyle(servers.labvirt10*.cpu.total.iowait)&from=-24h&hideLegend=false

The only labvirt to show anything is 1005 and worker 1018 is on 1010.

chasemp updated the task description. Jul 22 2016, 2:26 PM
chasemp added a comment (edited). Jul 22 2016, 2:29 PM

So, I just saw a warning about high iowait on worker-1015. I tried to ssh in, to no effect.

Saw this warning

PROBLEM - High iowait on tools-worker-1015 is CRITICAL: CRITICAL: tools.tools-worker-1015.cpu.total.iowait (>11.11%)

8:57 CDT according to my irc client

iowait had actually been having issues for a while, I believe, and the real kicker is that syslog was hosed, so there seem to be no recent logs.

syslog:Jul 22 12:17:56 tools-worker-1015 kubelet[2498]: E0722 12:17:56.892185    2498 file.go:53] Unable to read config path "/etc/kubernetes/manifests": path does not exist, ignoring
syslog:[... long run of NUL bytes (^@) where the log for the incident window should be ...]Jul 22 14:07:30 tools-worker-1015 rsyslogd: [origin software="rsyslogd" swVersion="8.4.2" x-pid="647" x-info="http://www.rsyslog.com"] start
syslog:Jul 22 14:07:30 tools-worker-1015 systemd-modules-load[170]: Inserted module 'act_mirred'

https://graphite-labs.wikimedia.org/render/?width=845&height=460&_salt=1469197524.819&target=tools.tools-worker-1015.cpu.total.iowait

Also of note from 1015: atop stopped recording at the same time as:

  • iowait increasing
  • leading to load increasing
  • iostat collector failing to report in

On tools-worker-1015 I can issue:

atop -r atop_20160722 -b 11:50

If I press 't' it jumps to the next collection at 12:00, then 12:10, and then to 14:07:30 (post reboot).

Even at the last collection cycle, 12:10:02, I see no procs in D-wait state. Something happened between 12:17:00 and 12:20:00 to kick this off.
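
(For reference, a sketch of the atop replay workflow used above; the log directory is the Debian default and is an assumption:)

atop -r /var/log/atop/atop_20160722 -b 11:50 -e 12:20
# 't' steps forward one sample, 'T' steps back, 'd' switches to the per-process disk view;
# procs in uninterruptible sleep show up with state 'D' in the generic view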

chasemp triaged this task as High priority. Jul 25 2016, 2:01 PM

Note: we are upgrading all kernels to be consistent (T140611).

Because of the log periods missing during the issue (though some procs, like diamond, are still sending info), we are going to try a temporary log aggregation solution so we can capture the issue: T141270

I caught 1005 in the act! It got a pod scheduled on it, but was stuck in Pending. I was able to ssh in, and docker ps was hung. It's still hung (even after a docker restart) and it has high iowait too.

docker is a zombie process!

root      1789  5.1  0.0      0     0 ?        Zsl  Jul15 774:50 [docker] <defunct>
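
(For the record, a quick way to spot both the defunct docker process and anything stuck in uninterruptible sleep:)

ps -eo pid,ppid,stat,wchan:30,comm | awk '$3 ~ /^[ZD]/'
# Z = zombie (dead but not reaped), D = uninterruptible sleep, usually blocked on I/O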

Can confirm, it's got:

[656761.057813] INFO: task lighttpd:15169 blocked for more than 120 seconds.
[656761.062759]       Not tainted 4.4.0-1-amd64 #1
[656761.064302] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables

and similar in console.

I looked at the iostat graphs and see a big spike, but only for vda and dm-0. You can find out what corresponds to 'dm-0' by doing:

dmsetup ls

In this case, the output was:

docker-253:0-1179649-3dcb21977466b1794bcdf096db8b7035f22e12f742eb18ecb7562e33d9e9fe4a	(253:18)
docker-253:0-1179649-10100b7a61aa408d703c8eb9fe5b0b24480cf04b627d7b87bd5cb225f38a1106	(253:8)
docker-253:0-1179649-caa5e4ddd4de835ea8d9097fdcd38203f57f6d9bacd3c4a5ea96c06d81e33dfd	(253:5)
vd-docker--storage	(253:0)
docker-253:0-1179649-934125c0b07dd64de8b817778aaaf13b1ceb700aa8a97301b9a43c26b728294d	(253:4)
docker-253:0-1179649-950de7484967476a36cc795d830030bd8ef1c01e0c6c5557d9c74c5c09d17236	(253:14)
docker-253:0-1179649-eb7a0c11034028c2cb604c38825434142351dd8458c80ed25d52e2a584152c11	(253:10)
docker-253:0-1179649-b4ecbcc1df4044a2b7583d2e21e9b0fd243b0886afe1ad9a0c0835b732d49900	(253:7)
docker-253:0-1179649-0999d45b5f50bdc3fa5f219218307a5c20087b8aa9851291363162f568419046	(253:12)
docker-253:0-1179649-c379e12daf15f81863016c392a30ae41095521fef104889ffef4078fda6235e7	(253:17)
docker-253:0-1179649-eac802c5ec1d4d31fca2fffc8fbc3cacc7c23184ecf5caefc7d6b3c18fb3b405	(253:9)
docker-253:0-1179649-619d0e6fbd95dc38e15f0bb1ec4af5e93b1b2a0c38e86f7400649154d3a70671	(253:19)
docker-253:0-1179649-b81a1db83c12a83af3cc4060c467ee28976e31c3783903b1e6e8e4abde0b53b7	(253:15)
docker-253:0-1179649-pool	(253:1)
docker-253:0-1179649-ba96e215c19b4df0125085cdbc2f6cd900610716beba12807673d5cb474e9ff7	(253:13)
docker-253:0-1179649-f0f25671a28562ae2c295e15c7c596997500883b0c23167fe00d941e7d32a149	(253:20)
docker-253:0-1179649-6a0a0e1ce0ce5a6bdaa97f65650d42874d275f5e8e766c938033ad08967fedd3	(253:16)
docker-253:0-1179649-9e643d26618a3139adbe3fbcbc912d2279397d0371c6685a13bbd12097c887d7	(253:11)
docker-253:0-1179649-e5f0cfdfc3e6adb24e5d7d0eac900218fbbcc8fae41da911f14c16f7d23265fa	(253:6)
docker-253:0-1179649-d465c528908f0f877184df510806f336fd4c5bf2e267b568bf07f99ca243ea53	(253:3)
docker-253:0-1179649-377a6c10d650cbc02bd9f6e68e0f0abfc0180f6f9190c731c988626e1f7c061f	(253:2)

So dm-0 is actually:

vd-docker--storage	(253:0)

Which is the lv on top of which docker mounts the loopback devices that are then used by the containers.
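
(The dm-0 mapping can also be read straight from sysfs instead of scanning the dmsetup output by hand:)

cat /sys/block/dm-0/dm/name      # prints the device-mapper name behind dm-0
lsblk /dev/dm-0                  # shows where it sits in the block device stack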

root@tools-worker-1005:/home/yuvipanda# lsblk
NAME                                                                                      MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
vda                                                                                       254:0    0   80G  0 disk 
├─vda1                                                                                    254:1    0 1007K  0 part 
├─vda2                                                                                    254:2    0  511M  0 part [SWAP]
├─vda3                                                                                    254:3    0   19G  0 part /
└─vda4                                                                                    254:4    0 60.5G  0 part 
  └─vd-docker--storage                                                                    253:0    0 60.5G  0 lvm  /var/lib/doc
loop0                                                                                       7:0    0  100G  0 loop 
└─docker-253:0-1179649-pool                                                               253:1    0  100G  0 dm   
  ├─docker-253:0-1179649-377a6c10d650cbc02bd9f6e68e0f0abfc0180f6f9190c731c988626e1f7c061f 253:2    0  100G  0 dm   
  ├─docker-253:0-1179649-d465c528908f0f877184df510806f336fd4c5bf2e267b568bf07f99ca243ea53 253:3    0  100G  0 dm   
  ├─docker-253:0-1179649-934125c0b07dd64de8b817778aaaf13b1ceb700aa8a97301b9a43c26b728294d 253:4    0  100G  0 dm   
  ├─docker-253:0-1179649-caa5e4ddd4de835ea8d9097fdcd38203f57f6d9bacd3c4a5ea96c06d81e33dfd 253:5    0  100G  0 dm   
  ├─docker-253:0-1179649-e5f0cfdfc3e6adb24e5d7d0eac900218fbbcc8fae41da911f14c16f7d23265fa 253:6    0  100G  0 dm   
  ├─docker-253:0-1179649-b4ecbcc1df4044a2b7583d2e21e9b0fd243b0886afe1ad9a0c0835b732d49900 253:7    0  100G  0 dm   
  ├─docker-253:0-1179649-10100b7a61aa408d703c8eb9fe5b0b24480cf04b627d7b87bd5cb225f38a1106 253:8    0  100G  0 dm   
  ├─docker-253:0-1179649-eac802c5ec1d4d31fca2fffc8fbc3cacc7c23184ecf5caefc7d6b3c18fb3b405 253:9    0  100G  0 dm   
  ├─docker-253:0-1179649-eb7a0c11034028c2cb604c38825434142351dd8458c80ed25d52e2a584152c11 253:10   0  100G  0 dm   
  ├─docker-253:0-1179649-9e643d26618a3139adbe3fbcbc912d2279397d0371c6685a13bbd12097c887d7 253:11   0  100G  0 dm   
  ├─docker-253:0-1179649-0999d45b5f50bdc3fa5f219218307a5c20087b8aa9851291363162f568419046 253:12   0  100G  0 dm   
  ├─docker-253:0-1179649-ba96e215c19b4df0125085cdbc2f6cd900610716beba12807673d5cb474e9ff7 253:13   0  100G  0 dm   
  ├─docker-253:0-1179649-950de7484967476a36cc795d830030bd8ef1c01e0c6c5557d9c74c5c09d17236 253:14   0  100G  0 dm   
  ├─docker-253:0-1179649-b81a1db83c12a83af3cc4060c467ee28976e31c3783903b1e6e8e4abde0b53b7 253:15   0  100G  0 dm   
  ├─docker-253:0-1179649-6a0a0e1ce0ce5a6bdaa97f65650d42874d275f5e8e766c938033ad08967fedd3 253:16   0  100G  0 dm   
  ├─docker-253:0-1179649-c379e12daf15f81863016c392a30ae41095521fef104889ffef4078fda6235e7 253:17   0  100G  0 dm   
  ├─docker-253:0-1179649-3dcb21977466b1794bcdf096db8b7035f22e12f742eb18ecb7562e33d9e9fe4a 253:18   0  100G  0 dm   
  ├─docker-253:0-1179649-619d0e6fbd95dc38e15f0bb1ec4af5e93b1b2a0c38e86f7400649154d3a70671 253:19   0  100G  0 dm   
  └─docker-253:0-1179649-f0f25671a28562ae2c295e15c7c596997500883b0c23167fe00d941e7d32a149 253:20   0  100G  0 dm   
loop1                                                                                       7:1    0    2G  0 loop 
└─docker-253:0-1179649-pool                                                               253:1    0  100G  0 dm   
  ├─docker-253:0-1179649-377a6c10d650cbc02bd9f6e68e0f0abfc0180f6f9190c731c988626e1f7c061f 253:2    0  100G  0 dm   
  ├─docker-253:0-1179649-d465c528908f0f877184df510806f336fd4c5bf2e267b568bf07f99ca243ea53 253:3    0  100G  0 dm   
  ├─docker-253:0-1179649-934125c0b07dd64de8b817778aaaf13b1ceb700aa8a97301b9a43c26b728294d 253:4    0  100G  0 dm   
  ├─docker-253:0-1179649-caa5e4ddd4de835ea8d9097fdcd38203f57f6d9bacd3c4a5ea96c06d81e33dfd 253:5    0  100G  0 dm   
  ├─docker-253:0-1179649-e5f0cfdfc3e6adb24e5d7d0eac900218fbbcc8fae41da911f14c16f7d23265fa 253:6    0  100G  0 dm   
  ├─docker-253:0-1179649-b4ecbcc1df4044a2b7583d2e21e9b0fd243b0886afe1ad9a0c0835b732d49900 253:7    0  100G  0 dm   
  ├─docker-253:0-1179649-10100b7a61aa408d703c8eb9fe5b0b24480cf04b627d7b87bd5cb225f38a1106 253:8    0  100G  0 dm   
  ├─docker-253:0-1179649-eac802c5ec1d4d31fca2fffc8fbc3cacc7c23184ecf5caefc7d6b3c18fb3b405 253:9    0  100G  0 dm   
  ├─docker-253:0-1179649-eb7a0c11034028c2cb604c38825434142351dd8458c80ed25d52e2a584152c11 253:10   0  100G  0 dm   
  ├─docker-253:0-1179649-9e643d26618a3139adbe3fbcbc912d2279397d0371c6685a13bbd12097c887d7 253:11   0  100G  0 dm   
  ├─docker-253:0-1179649-0999d45b5f50bdc3fa5f219218307a5c20087b8aa9851291363162f568419046 253:12   0  100G  0 dm   
  ├─docker-253:0-1179649-ba96e215c19b4df0125085cdbc2f6cd900610716beba12807673d5cb474e9ff7 253:13   0  100G  0 dm   
  ├─docker-253:0-1179649-950de7484967476a36cc795d830030bd8ef1c01e0c6c5557d9c74c5c09d17236 253:14   0  100G  0 dm   
  ├─docker-253:0-1179649-b81a1db83c12a83af3cc4060c467ee28976e31c3783903b1e6e8e4abde0b53b7 253:15   0  100G  0 dm   
  ├─docker-253:0-1179649-6a0a0e1ce0ce5a6bdaa97f65650d42874d275f5e8e766c938033ad08967fedd3 253:16   0  100G  0 dm   
  ├─docker-253:0-1179649-c379e12daf15f81863016c392a30ae41095521fef104889ffef4078fda6235e7 253:17   0  100G  0 dm   
  ├─docker-253:0-1179649-3dcb21977466b1794bcdf096db8b7035f22e12f742eb18ecb7562e33d9e9fe4a 253:18   0  100G  0 dm   
  ├─docker-253:0-1179649-619d0e6fbd95dc38e15f0bb1ec4af5e93b1b2a0c38e86f7400649154d3a70671 253:19   0  100G  0 dm   
  └─docker-253:0-1179649-f0f25671a28562ae2c295e15c7c596997500883b0c23167fe00d941e7d32a149 253:20   0  100G  0 dm   
root@tools-worker-1005:/home/yuvipanda#

This makes me suspect that the extra overhead from the lvm loopback driver is what is causing these issues (see T141126).

@mark pointed out a couple of things:

  1. iowait was at 100% on a couple of CPUs (0 and 2) (per-CPU view sketched below)
  2. There's atop which we can use to look at historical data!
  3. There's a lighttpd process in D state:
root      1789  4.8  0.0      0     0 ?        Zsl  Jul15 774:50 [docker] <defunct>
nagios    7357  0.0  0.0  23104  1760 ?        Ss   Jul15   0:00 /usr/sbin/nrpe -c /etc/nagios/nrpe.cfg -d
51451    15169  0.0  0.0  81180  5708 ?        Ds   Jul18   2:08 /usr/sbin/lighttpd -f /var/run/lighttpd/magnus-toolserver -D
51451    15176  0.0  0.3 342472 30884 ?        Ss   Jul18   0:00  \_ /usr/bin/php-cgi
51451    18329  0.0  0.2 348708 23672 ?        S    Jul23   0:02  |   \_ /usr/bin/php-cgi
51451    18414  0.0  0.2 348720 23648 ?        S    Jul23   0:03  |   \_ /usr/bin/php-cgi
51451    15177  0.0  0.3 342472 30904 ?        Ss   Jul18   0:00  \_ /usr/bin/php-cgi
51451    20111  0.0  0.3 351600 26428 ?        S    Jul23   0:09      \_ /usr/bin/php-cgi
51451    20133  0.0  0.3 351652 26468 ?        S    Jul23   0:09      \_ /usr/bin/php-cgi

Note that it's owned by root - it has in some ways broken out of the container it was in, but only by killing the docker daemon.
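
(The per-CPU iowait mentioned in point 1 is easiest to confirm with mpstat from the sysstat package, assuming it is installed:)

mpstat -P ALL 5 1    # one 5-second sample per CPU; look at the %iowait column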

The current working hypothesis is:

  1. We were running docker with the default storage backend it shipped with, which uses devicemapper over a loopback device.
  2. This is highly discouraged and is known to cause problems.
  3. Our I/O spikes are on the loopback device volume, so we suspect this is the cause (a quick way to verify which backend a node is on is sketched below).
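
A rough way to confirm which backend a given node is on (a sketch only; `docker info` may itself hang on an affected host, so the filesystem checks are the fallback):

docker info | grep -iA8 'storage driver'            # loop-lvm typically reports "Data loop file: ..." lines
ls -lh /var/lib/docker/devicemapper/devicemapper/   # the sparse data/metadata files only exist in loopback mode
losetup -a                                          # lists the loop devices backing them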

So I've gone ahead and done T141126 now.

So I've moved the following nodes to use the direct-lvm docker storage driver: tools-worker-[1001,1003,1006,1020-1021]

The rest are still on loopback, and 1005 is still vaguely stuck and depooled.

I've also disabled puppet on the nodes still on loopback, so we don't end up auto-migrating. The easiest and cleanest way to migrate is probably to just delete the nodes and recreate them, and this also gives people node creation practice :) I'll do more next week and make others do more as well.
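
For reference, direct-lvm amounts to pointing the devicemapper driver at a real thin pool instead of loopback files. A minimal sketch of the daemon options (the thin pool name is hypothetical; the actual puppetized config may differ):

DOCKER_OPTS="--storage-driver=devicemapper \
  --storage-opt dm.thinpooldev=/dev/mapper/docker-thinpool \
  --storage-opt dm.use_deferred_removal=true"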

yuvipanda updated the task description. Aug 4 2016, 7:15 PM

I've now moved all of the worker nodes to use direct-lvm; let's see if the hangs recur.

I'm going to close this now since T141673 covers the general case, and we moved to direct-lvm.

yuvipanda closed this task as Resolved. Aug 10 2016, 5:42 PM