
castor.integration.eqiad.wmflabs unreachable, deadlocking the whole CI
Closed, Resolved · Public

Description

castor.integration.eqiad.wmflabs is no longer reachable for some reason, which deadlocks the whole CI system.

Event Timeline

hashar triaged this task as Unbreak Now! priority.

Mentioned in SAL [2016-04-26T08:06:49Z] <hashar> CI jobs deadlocked due to castor being unavailable | https://phabricator.wikimedia.org/T133652

Seems the /dev/vda disk is stalling somehow :(

castor login: [2863440.276096] INFO: task jbd2/vda3-8:113 blocked for more than 120 seconds.
[2863440.280709]       Not tainted 3.19.0-2-amd64 #1
[2863440.281519] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[2863440.283320] INFO: task apt-get:13482 blocked for more than 120 seconds.
[2863440.284564]       Not tainted 3.19.0-2-amd64 #1
[2863440.285378] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[2863560.284208] INFO: task jbd2/vda3-8:113 blocked for more than 120 seconds.
[2863560.289924]       Not tainted 3.19.0-2-amd64 #1
[2863560.290503] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[2863560.291904] INFO: task kworker/u4:2:9168 blocked for more than 120 seconds.
[2863560.293398]       Not tainted 3.19.0-2-amd64 #1
[2863560.293993] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[2863560.295207] INFO: task apt-get:13482 blocked for more than 120 seconds.
[2863560.296062]       Not tainted 3.19.0-2-amd64 #1
[2863560.296667] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[2863680.296148] INFO: task jbd2/vda3-8:113 blocked for more than 120 seconds.
[2863680.301011]       Not tainted 3.19.0-2-amd64 #1
[2863680.301397] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[2863680.302388] INFO: task kworker/u4:2:9168 blocked for more than 120 seconds.
[2863680.302966]       Not tainted 3.19.0-2-amd64 #1
[2863680.303356] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[2863680.304189] INFO: task apt-get:13482 blocked for more than 120 seconds.
[2863680.304746]       Not tainted 3.19.0-2-amd64 #1
[2863680.305124] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[2863800.304211] INFO: task jbd2/vda3-8:113 blocked for more than 120 seconds.
[2863800.309417]       Not tainted 3.19.0-2-amd64 #1
[2863800.310186] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[2863800.311988] INFO: task kworker/u4:2:9168 blocked for more than 120 seconds.
[2863800.313380]       Not tainted 3.19.0-2-amd64 #1
[2863800.313763] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
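
The hung-task warnings above all follow the same pattern. As an illustration only (this was not part of the incident tooling), a short Python helper could pull them out of dmesg:

```
#!/usr/bin/env python3
"""Illustrative helper: scan dmesg output for kernel hung-task warnings
like the ones quoted above. Names and usage are hypothetical."""
import re
import subprocess

# Matches e.g. "INFO: task jbd2/vda3-8:113 blocked for more than 120 seconds."
HUNG_TASK = re.compile(
    r"INFO: task (?P<name>\S+):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds"
)

def blocked_tasks():
    """Return (task name, pid, seconds) tuples reported by the kernel."""
    dmesg = subprocess.run(["dmesg"], capture_output=True, text=True, check=True)
    return [m.group("name", "pid", "secs") for m in HUNG_TASK.finditer(dmesg.stdout)]

if __name__ == "__main__":
    for name, pid, secs in blocked_tasks():
        print(f"task {name} (pid {pid}) blocked for more than {secs}s")
```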

Mentioned in SAL [2016-04-26T08:10:55Z] <hashar> soft rebooting castor instance | T133652

Mentioned in SAL [2016-04-26T08:12:48Z] <hashar> hard rebooting castor instance | T133652

Mentioned in SAL [2016-04-26T08:20:22Z] <hashar> shutoff instance castor, does not seem to be able to start again :( | T133652
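
For reference, a soft or hard reboot like the ones logged above can also be driven against the Nova API. Below is a minimal sketch using python-novaclient with placeholder credentials; the reboots during this incident were not necessarily issued this way:

```
"""Minimal sketch of a Nova hard reboot via python-novaclient/keystoneauth1.
All connection details are placeholders, not values from this incident."""
from keystoneauth1 import loading, session
from novaclient import client

loader = loading.get_plugin_loader("password")
auth = loader.load_from_options(
    auth_url="https://keystone.example.org:5000/v3",  # placeholder
    username="admin",                                  # placeholder
    password="secret",                                 # placeholder
    project_name="integration",
    user_domain_name="Default",
    project_domain_name="Default",
)
nova = client.Client("2", session=session.Session(auth=auth))

# Find the stuck instance by name and ask Nova for a hard reboot.
server = nova.servers.find(name="castor")
server.reboot(reboot_type="HARD")
```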

It won't come back. I am going to create a new instance.

The Labs processes got restarted and the castor instance managed to spawn. Jenkins refuses to add it back as a slave though :(

In Jenkins the castor slave thread seems to be blocked.

"Channel reader thread: castor" prio=5 BLOCKED
	hudson.plugins.sshslaves.SSHLauncher.afterDisconnect(SSHLauncher.java:1226)
	hudson.slaves.SlaveComputer$2.onClosed(SlaveComputer.java:495)
	hudson.remoting.Channel.terminate(Channel.java:868)
	hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:69)
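
A blocked channel reader thread like the one above also shows up in the Jenkins thread dump. As a hypothetical outside check, assuming the standard /threadDump page is reachable and using placeholder credentials:

```
"""Hypothetical check for the blocked channel reader thread via Jenkins'
/threadDump page (admin access required; credentials are placeholders)."""
import requests

JENKINS = "https://integration.wikimedia.org/ci"
AUTH = ("admin-user", "api-token")  # placeholder credentials

resp = requests.get(f"{JENKINS}/threadDump", auth=AUTH, timeout=30)
resp.raise_for_status()

# Print any line mentioning the castor slave's channel reader thread.
for line in resp.text.splitlines():
    if "Channel reader thread: castor" in line:
        print(line.strip())
```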

I tried to kill the related threads via https://integration.wikimedia.org/ci/computer/monitoring/ but eventually I just renamed the slave to castor-old and back to castor, which somehow unlocked Jenkins. It then managed to add the slave back over SSH.

https://integration.wikimedia.org/ci/computer/castor/ looks fine now.
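
As a quick sanity check, the JSON API behind that computer page can confirm the slave is back online. A small sketch, assuming anonymous read access to the API:

```
"""Check that the castor slave is online via the Jenkins computer JSON API."""
import requests

url = "https://integration.wikimedia.org/ci/computer/castor/api/json"
info = requests.get(url, timeout=30).json()

# The computer API exposes an "offline" flag and the offline reason, if any.
print("offline:", info.get("offline"))
print("offlineCauseReason:", info.get("offlineCauseReason"))
```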

hashar added a subscriber: yuvipanda.

The root cause was Nova acting up; it was fixed by @yuvipanda, who "restarted nova-conductor & scheduler on labcontrol1001".

Then there was an oddity with Jenkins being unable to bring the slave back into the pool, due to a deadlock in the slave SSH disconnection. That was solved by renaming the slave twice, which got rid of the lock.

Roughly a five-hour outage overall.

Restricted Application added a subscriber: Jay8g.