Jenkins should auto-depool nodes if they run out of disk space on specific partitions
Closed, Resolved, Public

Description

On https://integration.wikimedia.org/ci/computer/configure it says:

 This monitors the available disk space of $JENKINS_HOME on each agent, and if it gets below a threshold, the agent will be marked offline.

This directory is where all your builds are performed, so if it fills up, all the builds will fail.

But it doesn't seem like that works for the root partition, which is what filled up in T201077.

Event Timeline

Legoktm triaged this task as High priority. Aug 4 2018, 5:30 AM
Legoktm created this task.

I suggested something similar in T193661#4175728 but it dropped off my radar.

Here is the start of a groovy script that I was playing with at that time:

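import hudson.util.RemotingDiagnostics

// Runs on each agent: print the used-space percentage of /srv.
// df takes the field list as --output=FIELD; with a space, "pcnt" is read as a file operand.
def groovy_script = 'println "df --output=pcnt /srv".execute().text'

for (slave in hudson.model.Hudson.instance.slaves) {
  def computer = slave.computer
  println(computer.getName())
  def channel = computer.getChannel()
  // Agents that are not connected have no channel, so skip them
  if (channel == null) { continue }
  println(RemotingDiagnostics.executeGroovy(groovy_script, channel))
}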
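
To act on that output, the percentage has to be parsed and compared against a limit. A minimal sketch of that step follows. The helper names `parseUsedPercent` and `isOverThreshold` are hypothetical and not taken from the deployed script, and the default of 95 is only borrowed from the "/srv > 95% full" patch title:

```groovy
// Hypothetical helper: pull the used-space percentage out of the text
// printed by `df --output=pcnt <path>`, which looks like "Use%\n 42%\n".
// Returns null when no percentage is found.
Integer parseUsedPercent(String dfOutput) {
    def matcher = (dfOutput =~ /(\d+)%/)
    return matcher.find() ? matcher.group(1).toInteger() : null
}

// Hypothetical policy check: true only when usage is strictly above the limit.
// The default of 95 is an assumption based on the patch title.
boolean isOverThreshold(String dfOutput, int thresholdPercent = 95) {
    Integer used = parseUsedPercent(dfOutput)
    return used != null && used > thresholdPercent
}
```

The header line `Use%` is skipped because the pattern requires digits directly before the percent sign.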

Change 451078 had a related patch set uploaded (by Thcipriani; owner: Thcipriani):
[integration/config@master] WIP: Disconnect nodes where /srv > 95% full

https://gerrit.wikimedia.org/r/451078

Change 451078 merged by jenkins-bot:
[integration/config@master] Disconnect nodes where /srv > 95% full

https://gerrit.wikimedia.org/r/451078

thcipriani claimed this task.

Deployed a groovy script that is limited to running on integration-slave-* nodes.

Initially it wasn't limited to integration-slave-* nodes, and it marked castor02 as offline (T202341), so it seems to be working correctly.
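
For illustration, restricting the check to one family of nodes could be a simple name-prefix filter. This is a sketch only: `isManagedNode` is a hypothetical name, and the deployed script may select nodes differently (for example by label):

```groovy
// Hypothetical filter: only nodes whose name starts with the given prefix
// are candidates for the disk check. The prefix default is an assumption
// derived from the "integration-slave-*" pattern mentioned in this task.
boolean isManagedNode(String nodeName, String prefix = 'integration-slave-') {
    return nodeName != null && nodeName.startsWith(prefix)
}
```

With a filter like this, a node such as castor02 would be skipped instead of being marked offline.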

I'll mark this as resolved, but we can fiddle with parameters of this script further to ensure that it catches what it needs to.

Awesome, will it alert us anywhere (e.g. IRC) that it depooled a node? Or will we just rely on the shinken disk space alerts for that?

Currently this relies on shinken disk space alerts, but it's a good point that it would be good to have it alert via IRC.

I'll reopen and make an adjustment to the job to fail if one of the nodes is offline due to /srv being full and have that alert in IRC.

I think doing a string comparison with computer.getOfflineCauseReason (https://javadoc.jenkins-ci.org/hudson/model/Computer.html#getOfflineCauseReason--) should work here.
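
A rough sketch of that string comparison, assuming the depool script writes a recognizable phrase into the offline reason. The marker text `'disk space'` and the function name `nodesOfflineForDisk` are hypothetical, since the real message is not shown in this task:

```groovy
// Returns the names of nodes whose offline reason contains the marker.
// Each element only needs `name` and `offlineCauseReason` properties, so this
// accepts Jenkins Computer objects (getName(), getOfflineCauseReason())
// as well as plain maps for testing.
List<String> nodesOfflineForDisk(nodes, String marker = 'disk space') {
    def matching = nodes.findAll { it.offlineCauseReason?.contains(marker) }
    return matching.collect { it.name }
}
```

In the job, the result could be computed from `jenkins.model.Jenkins.instance.computers`, and a non-empty list could be turned into a build failure so that the usual notifications fire.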

thcipriani lowered the priority of this task from High to Medium. Aug 21 2018, 2:43 PM

Change 454306 had a related patch set uploaded (by Thcipriani; owner: Thcipriani):
[integration/config@master] Notify if node disconnected due to disk space

https://gerrit.wikimedia.org/r/454306

Change 454436 had a related patch set uploaded (by Thcipriani; owner: Thcipriani):
[integration/config@master] Take node offline if / is full

https://gerrit.wikimedia.org/r/454436

Jenkins monitors instances and would already disconnect them when the disk space is too low.

https://integration.wikimedia.org/ci/computer/configure has:

[x] Free Disk Space
Free Space Threshold:  300MB

[x] Free Swap Space
[x] Free Temp Space
Free Space Threshold:  1GB

So probably we should just raise the 300MB threshold.

Jenkins' built-in monitoring seemingly only covers the available disk space of $JENKINS_HOME, but checking / as well might have prevented T202457.
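
Checking more than one partition could look roughly like the sketch below. It assumes the used-space percentages have already been collected per mount point. The name `fullPartitions` and the 95 default are hypothetical, since the task does not state the limit used for /:

```groovy
// Hypothetical: `usage` maps a mount point (e.g. '/', '/srv') to its
// used-space percentage. Returns the mount points strictly above the limit.
List<String> fullPartitions(Map<String, Integer> usage, int thresholdPercent = 95) {
    def over = usage.findAll { path, used -> used > thresholdPercent }
    return over.keySet().toList()
}
```

A node would then be taken offline when the returned list is non-empty, and the list itself could be included in the offline reason.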

Change 454306 merged by jenkins-bot:
[integration/config@master] Notify if node disconnected due to disk space

https://gerrit.wikimedia.org/r/454306

Change 454436 merged by jenkins-bot:
[integration/config@master] Take node offline if / is full

https://gerrit.wikimedia.org/r/454436

The job should now notify on IRC and via email.