Postmortem: Nodepool can't add slaves to Jenkins due to config plugin directory reaching 32k inodes
Closed, Resolved (Public)

Description

On 2016-02-16 21:11:25 Nodepool started raising alarms while attempting to add a slave in Jenkins:

JenkinsException: Error in request.Possibly authentication failed [500]

The pool of nodes was quickly exhausted and no builds could run anymore.

[22:38:56]  <paladox>	it seems that https://integration.wikimedia.org/zuul/ has frozen again.
[22:52:18]  <paladox>	It seems there is a big queue at https://integration.wikimedia.org/zuul/ because rake-jessie is not working. hashar.
[22:53:19]  <legoktm>	are we out of nodepool slaves?
[22:54:25]  <legoktm>	Feb 16 22:52:09 labnodepool1001 nodepoold[1596]: JenkinsException: Error in request.Possibly authentication failed [500]
[22:56:25]  <+greg-g>	hashar: ^^
[22:56:29]  <legoktm>	paladox: it's building more slaves as we speak, just have to wait a bit
[22:56:30]  <hashar>	!log contint: Nodepool instances pool exhausted
[22:56:42]  <legoktm>	I can see it building more slaves right now


[22:56:45]  <+hashar>	must be some labs issue
[22:57:11]  <legoktm>	hashar: journald has a bunch of exceptions, I think jenkins was returning 500 errors to nodepool?
[22:57:47]  <+hashar>	looking at /var/log/nodepool/nodepool.log on labnodepool1001.eqiad.wmnet
[22:58:11]  <+hashar>	yeah apparently Nodepool could not authenticate with Jenkins
[22:58:32]  <+hashar>	first event on 21:18 UTC
[23:01:16]  <+hashar>	so why the hell does nodepool cant authenticate with Jenkins
[23:02:49]  <+hashar>	!log Nodepool can not authenticate with Jenkins anymore. Thus it can not add slaves it spawned.
[23:11:27]  <+hashar>	I am gonna nuke Jenkins
[23:14:37]  <+hashar>	!log Jenkins: Could not create rootDir /var/lib/jenkins/config-history/nodes/ci-jessie-wikimedia-34969/2016-02-16_22-40-23
[23:14:46]  <+hashar>	CAUSE THERE IS ONLY 32K INODES PER DIR !!!!!!!!!!!!!
[23:15:07]  <+hashar>	found via https://integration.wikimedia.org/ci/log/Warnings/


[23:17:13]  <+hashar>	!log Jenkins accepting slave creations again. Root cause is /var/lib/jenkins/config-history/nodes/ has reached the 32k inode limit.
[23:17:40]  <+hashar>	2016-02-16 23:16:40,691 INFO nodepool.NodeLauncher: Node id: 35052 added to jenkins
[23:18:16]  <+hashar>	!log jenkins@gallium find /var/lib/jenkins/config-history/nodes -maxdepth 1 -type d -name 'ci-jessie*' -exec rm -vfR {} \;

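The Jenkins master has a plugin that keeps a history of config changes, and that history includes slaves. Each slave Nodepool adds gets its own entry under /var/lib/jenkins/config-history/nodes/. Once that directory held more than 32k entries, it hit the 32k inode limit and the file system refused to create new ones. Jenkins could then no longer save the history for a new slave, so it answered Nodepool's request to add the slave with a 500 error.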
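A check along the following lines could have flagged the problem before Jenkins started failing. This is a minimal sketch, not what was deployed: the function names, the 80% threshold and the `ASSUMED_LIMIT` value are illustrative assumptions, and the exact per-directory limit depends on the file system in use.

```python
import os

# Assumption: the "32k" figure quoted in this task. The real limit
# depends on the file system, so treat this as a tunable value.
ASSUMED_LIMIT = 32000


def count_subdirs(path):
    """Count the immediate subdirectories of path (symlinks are not followed)."""
    with os.scandir(path) as entries:
        return sum(1 for entry in entries if entry.is_dir(follow_symlinks=False))


def is_near_limit(path, limit=ASSUMED_LIMIT, threshold=0.8):
    """Return True when path holds at least threshold * limit subdirectories."""
    return count_subdirs(path) >= threshold * limit
```

Calling `is_near_limit("/var/lib/jenkins/config-history/nodes")` from a periodic monitoring job would give a warning while there is still room to clean up.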

Event Timeline

hashar raised the priority of this task from to Needs Triage.
hashar updated the task description.
hashar added subscribers: hashar, Legoktm, greg.
hashar changed the task status from Open to Stalled. Jun 6 2016, 9:48 PM
Paladox changed the task status from Stalled to Open. Sep 2 2016, 2:29 PM

I guess we can now go forward with this task, reopening it now.

hashar claimed this task.

That is solved. The workaround is to garbage collect the node history.
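The task does not say how the garbage collection is implemented. As a rough sketch only, a periodic cleanup could remove per-node history directories that have not been modified for a while. The `ci-jessie*` pattern comes from the `find` command in the log above; the function names, the 7-day retention and the dry-run default are assumptions made for illustration.

```python
import fnmatch
import os
import shutil
import time


def stale_node_dirs(root, pattern="ci-jessie*", max_age_days=7, now=None):
    """List subdirectories of root matching pattern whose mtime is older than max_age_days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    stale = []
    with os.scandir(root) as entries:
        for entry in entries:
            if not entry.is_dir(follow_symlinks=False):
                continue
            if not fnmatch.fnmatchcase(entry.name, pattern):
                continue
            if entry.stat(follow_symlinks=False).st_mtime < cutoff:
                stale.append(entry.path)
    return sorted(stale)


def collect_garbage(root, pattern="ci-jessie*", max_age_days=7, dry_run=True, now=None):
    """Return the stale directories; delete them only when dry_run is False."""
    stale = stale_node_dirs(root, pattern, max_age_days, now)
    if not dry_run:
        for path in stale:
            shutil.rmtree(path)
    return stale
```

Running `collect_garbage("/var/lib/jenkins/config-history/nodes")` first in its default dry-run mode shows what would be removed before anything is deleted.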