
castor.integration.eqiad.wmflabs unreachable, deadlocking the whole CI
Closed, Resolved · Public

Description

castor.integration.eqiad.wmflabs is no longer reachable for some reason, which deadlocks the whole CI system.

Event Timeline

hashar triaged this task as Unbreak Now! priority.

Mentioned in SAL [2016-04-26T08:06:49Z] <hashar> CI jobs deadlocked due to castor being unavailable | https://phabricator.wikimedia.org/T133652

Seems the /dev/vda disk is stalling somehow :(

castor login: [2863440.276096] INFO: task jbd2/vda3-8:113 blocked for more than 120 seconds.
[2863440.280709]       Not tainted 3.19.0-2-amd64 #1
[2863440.281519] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[2863440.283320] INFO: task apt-get:13482 blocked for more than 120 seconds.
[2863440.284564]       Not tainted 3.19.0-2-amd64 #1
[2863440.285378] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[2863560.284208] INFO: task jbd2/vda3-8:113 blocked for more than 120 seconds.
[2863560.289924]       Not tainted 3.19.0-2-amd64 #1
[2863560.290503] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[2863560.291904] INFO: task kworker/u4:2:9168 blocked for more than 120 seconds.
[2863560.293398]       Not tainted 3.19.0-2-amd64 #1
[2863560.293993] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[2863560.295207] INFO: task apt-get:13482 blocked for more than 120 seconds.
[2863560.296062]       Not tainted 3.19.0-2-amd64 #1
[2863560.296667] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[2863680.296148] INFO: task jbd2/vda3-8:113 blocked for more than 120 seconds.
[2863680.301011]       Not tainted 3.19.0-2-amd64 #1
[2863680.301397] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[2863680.302388] INFO: task kworker/u4:2:9168 blocked for more than 120 seconds.
[2863680.302966]       Not tainted 3.19.0-2-amd64 #1
[2863680.303356] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[2863680.304189] INFO: task apt-get:13482 blocked for more than 120 seconds.
[2863680.304746]       Not tainted 3.19.0-2-amd64 #1
[2863680.305124] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[2863800.304211] INFO: task jbd2/vda3-8:113 blocked for more than 120 seconds.
[2863800.309417]       Not tainted 3.19.0-2-amd64 #1
[2863800.310186] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[2863800.311988] INFO: task kworker/u4:2:9168 blocked for more than 120 seconds.
[2863800.313380]       Not tainted 3.19.0-2-amd64 #1
[2863800.313763] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
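
The hung-task warnings above all follow the same pattern. As an illustration only (this was not part of the incident tooling), a short Python helper could pull them out of dmesg:

```
#!/usr/bin/env python3
"""Illustrative helper: scan dmesg output for kernel hung-task warnings
like the ones quoted above. Names and usage are hypothetical."""
import re
import subprocess

# Matches e.g. "INFO: task jbd2/vda3-8:113 blocked for more than 120 seconds."
HUNG_TASK = re.compile(
    r"INFO: task (?P<name>\S+):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds"
)

def blocked_tasks():
    """Return (task name, pid, seconds) tuples reported by the kernel."""
    dmesg = subprocess.run(["dmesg"], capture_output=True, text=True, check=True)
    return [m.group("name", "pid", "secs") for m in HUNG_TASK.finditer(dmesg.stdout)]

if __name__ == "__main__":
    for name, pid, secs in blocked_tasks():
        print(f"task {name} (pid {pid}) blocked for more than {secs}s")
```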

Mentioned in SAL [2016-04-26T08:10:55Z] <hashar> soft rebooting castor instance | T133652

Mentioned in SAL [2016-04-26T08:12:48Z] <hashar> hard rebooting castor instance | T133652

Mentioned in SAL [2016-04-26T08:20:22Z] <hashar> shutoff instance castor, does not seem to be able to start again :( | T133652
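
For reference, a soft or hard reboot like the ones logged above can also be driven against the Nova API. Below is a minimal sketch using python-novaclient with placeholder credentials; the reboots during this incident were not necessarily issued this way:

```
"""Minimal sketch of a Nova hard reboot via python-novaclient/keystoneauth1.
All connection details are placeholders, not values from this incident."""
from keystoneauth1 import loading, session
from novaclient import client

loader = loading.get_plugin_loader("password")
auth = loader.load_from_options(
    auth_url="https://keystone.example.org:5000/v3",  # placeholder
    username="admin",                                  # placeholder
    password="secret",                                 # placeholder
    project_name="integration",
    user_domain_name="Default",
    project_domain_name="Default",
)
nova = client.Client("2", session=session.Session(auth=auth))

# Find the stuck instance by name and ask Nova for a hard reboot.
server = nova.servers.find(name="castor")
server.reboot(reboot_type="HARD")
```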

It won't come back. I am going to create a new instance.

The Labs processes got restarted and the castor instance managed to spawn. Jenkins refuses to add it back as a slave though :(

In Jenkins the castor slave thread seems to be blocked.

"Channel reader thread: castor" prio=5 BLOCKED
	hudson.plugins.sshslaves.SSHLauncher.afterDisconnect(SSHLauncher.java:1226)
	hudson.slaves.SlaveComputer$2.onClosed(SlaveComputer.java:495)
	hudson.remoting.Channel.terminate(Channel.java:868)
	hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:69)
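
A blocked channel reader thread like the one above also shows up in the Jenkins thread dump. As a hypothetical outside check, assuming the standard /threadDump page is reachable and using placeholder credentials:

```
"""Hypothetical check for the blocked channel reader thread via Jenkins'
/threadDump page (admin access required; credentials are placeholders)."""
import requests

JENKINS = "https://integration.wikimedia.org/ci"
AUTH = ("admin-user", "api-token")  # placeholder credentials

resp = requests.get(f"{JENKINS}/threadDump", auth=AUTH, timeout=30)
resp.raise_for_status()

# Print any line mentioning the castor slave's channel reader thread.
for line in resp.text.splitlines():
    if "Channel reader thread: castor" in line:
        print(line.strip())
```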

I tried to kill the related threads via https://integration.wikimedia.org/ci/computer/monitoring/ but eventually I just renamed the slave to castor-old and back to castor, which somehow unlocked Jenkins. It then managed to add the slave back over SSH.

https://integration.wikimedia.org/ci/computer/castor/ looks fine now.
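
As a quick sanity check, the JSON API behind that computer page can confirm the slave is back online. A small sketch, assuming anonymous read access to the API:

```
"""Check that the castor slave is online via the Jenkins computer JSON API."""
import requests

url = "https://integration.wikimedia.org/ci/computer/castor/api/json"
info = requests.get(url, timeout=30).json()

# The computer API exposes an "offline" flag and the offline reason, if any.
print("offline:", info.get("offline"))
print("offlineCauseReason:", info.get("offlineCauseReason"))
```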

hashar added a subscriber: yuvipanda.

The root cause was Nova acting up; it was fixed by @yuvipanda, who "restarted nova-conductor & scheduler on labcontrol1001".

Then there was an oddity with Jenkins being unable to bring the slave back into the pool, due to a deadlock in the slave SSH disconnection. That was solved by renaming the slave twice, which got rid of the lock.

Roughly a five-hour outage overall.

Restricted Application added a subscriber: Jay8g.