Maniphest T201224

Jenkins should auto-depool nodes if they run out of disk space on specific partitions
Closed, Resolved (Public)

Authored By

	Legoktm
	Aug 4 2018, 5:30 AM

Description

On https://integration.wikimedia.org/ci/computer/configure it says:

 This monitors the available disk space of $JENKINS_HOME on each agent, and if it gets below a threshold, the agent will be marked offline.

This directory is where all your builds are performed, so if it fills up, all the builds will fail.

But it doesn't seem like that works for the root partition, which is what filled up in T201077.

Details

Subject	Repo	Branch	Lines +/-
Take node offline if / is full	integration/config	master	+7 -5
Notify if node disconnected due to disk space	integration/config	master	+44 -18
Disconnect nodes where /srv > 95% full	integration/config	master	+57 -0

Related Objects

Mentioned In: T193661: Alert in -releng when permanent hosts have low disk space
Mentioned Here: T202457: mediawiki-quibble docker jobs fails due to disk full
T202341: castor02 integration node /srv disk is full
T193661: Alert in -releng when permanent hosts have low disk space
T201077: MediaWiki core test failure: The table 'archive' is full

Event Timeline

Legoktm triaged this task as High priority. Aug 4 2018, 5:30 AM

Legoktm created this task.

Restricted Application added a subscriber: Aklapper. Aug 4 2018, 5:30 AM

I suggested something similar in T193661#4175728 but it dropped off my radar.

Here is the start of a groovy script that I was playing with at that time:

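import hudson.util.RemotingDiagnostics

// Script run on each agent: print the percentage of /srv in use.
def groovyScript = 'println "df --output=pcent /srv".execute().text'

for (slave in hudson.model.Hudson.instance.slaves) {
  def computer = slave.computer
  // Offline agents have no channel, so there is nothing to run the script on.
  def channel = computer?.getChannel()
  if (channel == null) {
    continue
  }
  println(computer.getName())
  println(RemotingDiagnostics.executeGroovy(groovyScript, channel))
}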

Change 451078 had a related patch set uploaded (by Thcipriani; owner: Thcipriani):
[integration/config@master] WIP: Disconnect nodes where /srv > 95% full

https://gerrit.wikimedia.org/r/451078

gerritbot added a project: Patch-For-Review. Aug 7 2018, 7:01 PM

Change 451078 merged by jenkins-bot:
[integration/config@master] Disconnect nodes where /srv > 95% full

https://gerrit.wikimedia.org/r/451078

Deployed a groovy script that is limited to running on integration-slave-* nodes.

Initially it wasn't limited to integration-slave-* nodes and it marked castor02 as offline (T202341) so it seems to be working correctly.

I'll mark this as resolved, but we can fiddle with parameters of this script further to ensure that it catches what it needs to.
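For readers following along, the decision described above (only integration-slave-* nodes, disconnect when /srv is more than 95% full) can be sketched roughly as below. This is an illustration, not the script merged in change 451078: the helper names parseUsedPercent and shouldDisconnect and the node name integration-slave-example are made up, and the input is assumed to be the output of `df --output=pcent /srv`. In the Jenkins script console the result would presumably feed a call such as Computer.setTemporarilyOffline.

```groovy
// Sketch only: helper names are hypothetical and not taken from integration/config.

// Extract the used percentage from `df --output=pcent <path>` output,
// which is a "Use%" header line followed by a value such as " 97%".
int parseUsedPercent(String dfOutput) {
    def matcher = dfOutput =~ /(\d+)%/
    if (!matcher.find()) {
        throw new IllegalArgumentException("No percentage found in: ${dfOutput}")
    }
    return matcher.group(1).toInteger()
}

// Decide whether a node should be disconnected. Only nodes whose name starts
// with the given prefix are considered, so hosts such as castor02 are skipped.
boolean shouldDisconnect(String nodeName, String dfOutput,
                         int threshold = 95, String prefix = 'integration-slave-') {
    if (!nodeName.startsWith(prefix)) {
        return false
    }
    return parseUsedPercent(dfOutput) > threshold
}

println shouldDisconnect('integration-slave-example', 'Use%\n 97%\n')  // true
println shouldDisconnect('castor02', 'Use%\n 99%\n')                   // false
```

Keeping the prefix as a parameter makes the castor02 case explicit: a node outside the prefix is never disconnected, however full its disk is.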

Awesome, will it alert us anywhere (e.g. IRC) that it depooled a node? Or will we just rely on the shinken disk space alerts for that?

In T201224#4517385, @Legoktm wrote:

Awesome, will it alert us anywhere (e.g. IRC) that it depooled a node? Or will we just rely on the shinken disk space alerts for that?

Currently this relies on the shinken disk space alerts, but you're right that it should also alert via IRC.

I'll reopen this and adjust the job so that it fails when a node is offline because /srv is full. That failure will then be reported in IRC.

I think doing a string comparison with computer.getOfflineCauseReason (https://javadoc.jenkins-ci.org/hudson/model/Computer.html#getOfflineCauseReason--) should work here.
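A rough sketch of that string comparison, under some assumptions: the marker text 'disk space' and the helper name nodesOfflineForDisk are hypothetical, and the map stands in for the reasons that computer.getOfflineCauseReason() would return for each node. The real job would compare against whatever message the disconnect script set.

```groovy
// Sketch only: the marker text and helper name are hypothetical.

// Given node name -> offline cause reason, return the names of the nodes
// whose reason mentions the marker text.
List<String> nodesOfflineForDisk(Map<String, String> offlineReasons,
                                 String marker = 'disk space') {
    def matching = offlineReasons.findAll { name, reason ->
        reason != null && reason.contains(marker)
    }
    return matching.keySet().toList()
}

// Made-up sample data standing in for the per-node offline reasons.
def reasons = [
    'integration-slave-a': '/srv is low on disk space',
    'integration-slave-b': 'Disconnected for maintenance',
    'integration-slave-c': '',
]

def affected = nodesOfflineForDisk(reasons)
if (affected) {
    // A real job would fail here (for example by throwing) so that the
    // usual failure notifications fire.
    println "Offline due to disk space: ${affected.join(', ')}"
}
```

With the sample data only integration-slave-a is reported, since the other two reasons do not contain the marker.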

Restricted Application added a project: Release-Engineering-Team (Kanban). Aug 21 2018, 2:42 PM

thcipriani lowered the priority of this task from High to Medium. Aug 21 2018, 2:43 PM

Change 454306 had a related patch set uploaded (by Thcipriani; owner: Thcipriani):
[integration/config@master] Notify if node disconnected due to disk space

https://gerrit.wikimedia.org/r/454306

Change 454436 had a related patch set uploaded (by Thcipriani; owner: Thcipriani):
[integration/config@master] Take node offline if / is full

https://gerrit.wikimedia.org/r/454436

Jenkins monitors instances and would already disconnect them when the disk space is too low.

https://integration.wikimedia.org/ci/computer/configure has:

[x] Free Disk Space
Free Space Threshold:  300MB

[x] Free Swap Space
[x] Free Temp Space
Free Space Threshold:  1GB

So probably we should just raise the 300MB threshold.

In T201224#4521814, @hashar wrote:

Jenkins monitors instances and would already disconnect them when the disk space is too low.

That check seemingly only covers the available disk space of $JENKINS_HOME, whereas checking / as well might have prevented T202457.
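Checking more than one partition in a single pass could look something like the sketch below. It assumes the agent output comes from `df --output=pcent,target / /srv`, that mount points contain no spaces, and that the same 95% threshold applies to / (the thread only states it for /srv). The helper names parseDf and fullMounts are hypothetical.

```groovy
// Sketch only: helper names and the threshold for / are assumptions.

// Parse `df --output=pcent,target ...` output into mount point -> used percent.
// The header line ("Use% Mounted on") has no digits before '%' and is skipped.
Map<String, Integer> parseDf(String dfOutput) {
    Map<String, Integer> usage = [:]
    dfOutput.readLines().each { line ->
        def m = line =~ /^\s*(\d+)%\s+(\S+)/
        if (m.find()) {
            usage[m.group(2)] = m.group(1).toInteger()
        }
    }
    return usage
}

// Return the mount points whose usage is above the threshold.
List<String> fullMounts(String dfOutput, int threshold = 95) {
    def over = parseDf(dfOutput).findAll { mount, pct -> pct > threshold }
    return over.keySet().toList()
}

// Made-up sample output with / nearly full and /srv healthy.
def sample = 'Use% Mounted on\n 97% /\n 42% /srv\n'
println fullMounts(sample)  // [/]
```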

Change 454306 merged by jenkins-bot:
[integration/config@master] Notify if node disconnected due to disk space

https://gerrit.wikimedia.org/r/454306

Change 454436 merged by jenkins-bot:
[integration/config@master] Take node offline if / is full

https://gerrit.wikimedia.org/r/454436

The job should now notify on IRC and via email.

Krinkle awarded a token. Sep 1 2018, 1:43 AM

hashar mentioned this in T193661: Alert in -releng when permanent hosts have low disk space. Oct 29 2018, 12:37 PM
