castor.integration.eqiad.wmflabs is no longer reachable for some reason, which deadlocks the whole CI system.
Description
Status | Subtype | Assigned | Task
--- | --- | --- | ---
Resolved | | hashar | T133652 castor.integration.eqiad.wmflabs unreacheable deadlocking the whole CI
Resolved | | yuvipanda | T133654 wmflabs OpenStack is deadlocked (can't boot or delete instances)
Event Timeline
Mentioned in SAL [2016-04-26T08:06:49Z] <hashar> CI jobs deadlocked due to castor being unavailable | https://phabricator.wikimedia.org/T133652
Seems the /dev/vda disk is stalling somehow :(
castor login: [2863440.276096] INFO: task jbd2/vda3-8:113 blocked for more than 120 seconds.
[2863440.280709] Not tainted 3.19.0-2-amd64 #1
[2863440.281519] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[2863440.283320] INFO: task apt-get:13482 blocked for more than 120 seconds.
[2863440.284564] Not tainted 3.19.0-2-amd64 #1
[2863440.285378] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[2863560.284208] INFO: task jbd2/vda3-8:113 blocked for more than 120 seconds.
[2863560.289924] Not tainted 3.19.0-2-amd64 #1
[2863560.290503] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[2863560.291904] INFO: task kworker/u4:2:9168 blocked for more than 120 seconds.
[2863560.293398] Not tainted 3.19.0-2-amd64 #1
[2863560.293993] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[2863560.295207] INFO: task apt-get:13482 blocked for more than 120 seconds.
[2863560.296062] Not tainted 3.19.0-2-amd64 #1
[2863560.296667] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[2863680.296148] INFO: task jbd2/vda3-8:113 blocked for more than 120 seconds.
[2863680.301011] Not tainted 3.19.0-2-amd64 #1
[2863680.301397] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[2863680.302388] INFO: task kworker/u4:2:9168 blocked for more than 120 seconds.
[2863680.302966] Not tainted 3.19.0-2-amd64 #1
[2863680.303356] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[2863680.304189] INFO: task apt-get:13482 blocked for more than 120 seconds.
[2863680.304746] Not tainted 3.19.0-2-amd64 #1
[2863680.305124] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[2863800.304211] INFO: task jbd2/vda3-8:113 blocked for more than 120 seconds.
[2863800.309417] Not tainted 3.19.0-2-amd64 #1
[2863800.310186] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[2863800.311988] INFO: task kworker/u4:2:9168 blocked for more than 120 seconds.
[2863800.313380] Not tainted 3.19.0-2-amd64 #1
[2863800.313763] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
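Those hung-task warnings (jbd2, the ext4 journal thread, stuck waiting on the disk) are the typical signature of the backing block device no longer answering I/O. A minimal diagnostic sketch, assuming a console or SSH session on the affected instance is still possible:

```
# Pull the blocked-task warnings out of the kernel ring buffer
dmesg | grep -B1 -A2 'blocked for more than 120 seconds'

# Processes stuck in uninterruptible sleep (state D) confirm stalled I/O
ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /D/'

# The timeout the hung task detector uses before warning
# (the "echo 0 > ..." hint in the log merely silences the warning, it does not fix the stall)
cat /proc/sys/kernel/hung_task_timeout_secs
```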
Mentioned in SAL [2016-04-26T08:10:55Z] <hashar> soft rebooting castor instance | T133652
Mentioned in SAL [2016-04-26T08:12:48Z] <hashar> hard rebooting castor instance | T133652
Mentioned in SAL [2016-04-26T08:20:22Z] <hashar> shutoff instance castor, does not seem to be able to start again :( | T133652
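For reference, the soft reboot, hard reboot and shutoff above map roughly onto the following OpenStack client calls; this is only a sketch, since the actual operations were driven through the labs tooling rather than a raw nova CLI:

```
# Soft reboot: asks the guest OS to restart itself; tends to hang when disk I/O is stalled
nova reboot castor

# Hard reboot: power-cycles the instance at the hypervisor level
nova reboot --hard castor

# Last resort: shut the instance off entirely
nova stop castor
```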
The labs process got restarted and the castor instance managed to spawn. Jenkins refuses to add it back as a slave though :(
In Jenkins the castor slave thread seems to be blocked.
"Channel reader thread: castor" prio=5 BLOCKED hudson.plugins.sshslaves.SSHLauncher.afterDisconnect(SSHLauncher.java:1226) hudson.slaves.SlaveComputer$2.onClosed(SlaveComputer.java:495) hudson.remoting.Channel.terminate(Channel.java:868) hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:69)
I tried to kill the related threads via https://integration.wikimedia.org/ci/computer/monitoring/. Eventually I just renamed the slave to castor-old and back to castor, and that somehow unlocked Jenkins: it then managed to add the slave back over SSH.
https://integration.wikimedia.org/ci/computer/castor/ looks fine now.
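For future occurrences, forcing a disconnect and reconnect of the node from the Jenkins CLI might be worth trying before the rename trick. A sketch, assuming an account with node-management permissions; whether it would have cleared this particular wedged channel is untested:

```
# Fetch the CLI jar from the Jenkins master
wget https://integration.wikimedia.org/ci/jnlpJars/jenkins-cli.jar

# Tear down the (possibly wedged) remoting channel for the slave
java -jar jenkins-cli.jar -s https://integration.wikimedia.org/ci/ disconnect-node castor -m "stuck channel after labs outage"

# Ask Jenkins to relaunch the slave agent over SSH
java -jar jenkins-cli.jar -s https://integration.wikimedia.org/ci/ connect-node castor
```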
The root cause was Nova acting up; it was fixed by @yuvipanda, who "restarted nova-conductor & scheduler on labcontrol1001".
Then there was some oddity with Jenkins not being able to bring the slave back into the pool, due to a deadlock in the slave SSH disconnection handling. That was solved by renaming the slave twice, which got rid of the lock.
Roughly a five-hour outage overall.
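For the record, the labs-side fix quoted above boils down to something like the following on the controller host. This is only a sketch: the exact service names and the need for admin credentials on labcontrol1001 are assumptions.

```
# On labcontrol1001: restart the wedged Nova services
sudo service nova-conductor restart
sudo service nova-scheduler restart

# Quick sanity check that the scheduler answers again (admin credentials required)
nova list --all-tenants | head
```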