Jenkins files under /var/lib/jenkins/config-history/config need to be garbage collected
Closed, ResolvedPublic

Description

The filesystem on gallium limits each directory to 32000 inodes. Jenkins configuration changes are saved under /var/lib/jenkins/config-history/config, and the plugin can no longer save new entries there.

From https://integration.wikimedia.org/ci/log/Warnings/

java.lang.RuntimeException: Could not create rootDir /var/lib/jenkins/config-history/config/2016-02-10_22-36-03

The dir has reached 32k inodes.
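To see how close a directory is to that limit, one can count its immediate subdirectories. This is a minimal sketch, assuming GNU find; it runs against a scratch directory standing in for the real path, and the 32000 threshold is simply the figure quoted above.

```shell
#!/bin/sh
# Sketch: count the immediate subdirectories of a directory and compare with a limit.
# The scratch directory stands in for /var/lib/jenkins/config-history/config.
limit=32000                      # figure quoted in the task description
dir=$(mktemp -d)
mkdir "$dir/2016-02-10_22-36-01" "$dir/2016-02-10_22-36-02" "$dir/2016-02-10_22-36-03"

# -mindepth 1 skips the directory itself, -maxdepth 1 avoids descending further.
count=$(find "$dir" -mindepth 1 -maxdepth 1 -type d | wc -l)
count=$((count))                 # normalise padding that some wc implementations add
echo "$count of $limit subdirectories used"

rm -rf "$dir"
```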

hashar updated the task description.
hashar raised the priority of this task from to Needs Triage.
hashar added a subscriber: hashar.
Restricted Application added subscribers: StudiesWorld, Aklapper. Feb 10 2016, 10:47 PM

On gallium:

find /var/lib/jenkins/config-history/config -type f -wholename '*/2015*' -delete

find /var/lib/jenkins/config-history/config -type d -name '2015*' -delete

Down to 9k entries

# ls -ld /var/lib/jenkins/config-history/config
drwxrwsr-x 9203
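The two-pass cleanup above (files first, then the emptied directories) can be rehearsed in a scratch directory before touching the real tree. This is only a sketch: it assumes GNU find, and the history.xml file name is a placeholder for whatever the plugin stores in each timestamped directory.

```shell
#!/bin/sh
# Sketch: rehearse the two-pass cleanup in a scratch directory (GNU find assumed).
config=$(mktemp -d)              # stands in for /var/lib/jenkins/config-history/config
mkdir "$config/2015-12-31_23-59-59" "$config/2016-02-10_22-36-03"
# history.xml is a placeholder file name.
touch "$config/2015-12-31_23-59-59/history.xml" "$config/2016-02-10_22-36-03/history.xml"

# Pass 1: remove regular files whose full path contains "/2015".
find "$config" -type f -wholename '*/2015*' -delete
# Pass 2: remove the now-empty 2015 directories (-delete cannot remove non-empty ones).
find "$config" -type d -name '2015*' -delete

left=$(ls "$config")
echo "$left"
rm -rf "$config"
```

Running the second pass without the first would leave the 2015 directories in place, since they would still contain files.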

Now we need to file a bug upstream and puppetize a cron job.

hashar triaged this task as Normal priority. Feb 10 2016, 11:18 PM
hashar removed hashar as the assignee of this task. Feb 11 2016, 5:44 PM
hashar set Security to None.

The plugin has a few configuration settings in https://integration.wikimedia.org/ci/configure :

  • Max number of history entries to keep
  • Max number of days to keep history entries
  • Max number of history entries to show per page
  • System configuration exclude file pattern = queue|nodeMonitors|UpdateCenter|global-build-stats
  • Do not save duplicate history

I have manually deleted some old history from 2012/2013. The plugin no longer saves history under the job directory but in a global path. Thus: find /var/lib/jenkins/jobs/*/config-history -delete

I have manually deleted the legacy nodes via: rm -fR /var/lib/jenkins/config-history/nodes/*_deleted_*

I then tried to add to the system exclude

  • nodes/ci-jessie-wikimedia
  • ci-jessie-wikimedia

Neither works. Gotta dig in the code :(

The regex is only used for XML files directly in the Jenkins home:

boolean isSaveable(final Saveable item, final XmlFile xmlFile) {

    if (item instanceof TopLevelItem) {
        return true;
    }

    if (xmlFile.getFile().getParentFile().equals(getJenkinsHome())) {
        return checkRegex(xmlFile);
    }
    // ... (rest of the method not quoted)
}

And it only matches against the base name :-(

private boolean checkRegex(final XmlFile xmlFile) {
    if (excludeRegexpPattern != null) {
        final Matcher matcher = excludeRegexpPattern.matcher(xmlFile.getFile().getName());
        return !matcher.find();
    } else {
        return true;
    }
}
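To illustrate why neither exclude entry could match, here is a rough shell re-creation of the two checks quoted above. It is a sketch, not the plugin's code: the would_save helper, the "not-checked" label, and the node name ci-jessie-wikimedia-1234 are made up for the example, and what the plugin does with files outside the Jenkins home is not covered by the quoted snippet.

```shell
#!/bin/sh
# Sketch of the two quoted checks, reimplemented in shell for illustration only.
jenkins_home=/var/lib/jenkins
exclude='queue|nodeMonitors|UpdateCenter|global-build-stats|nodes/ci-jessie-wikimedia|ci-jessie-wikimedia'

# Hypothetical helper: prints "excluded", "kept" or "not-checked" for an XML path.
would_save() {
    file=$1
    # The regex is only consulted for files sitting directly in the Jenkins home...
    if [ "$(dirname "$file")" = "$jenkins_home" ]; then
        # ...and only against the base name, which never contains a directory part.
        if basename "$file" | grep -Eq "$exclude"; then
            echo excluded
        else
            echo kept
        fi
    else
        # Not covered by the quoted snippet: the exclude regex is simply not applied.
        echo not-checked
    fi
}

would_save /var/lib/jenkins/queue.xml
would_save /var/lib/jenkins/config.xml
would_save /var/lib/jenkins/nodes/ci-jessie-wikimedia-1234/config.xml
```

The node's config.xml lives two levels below the Jenkins home and its base name is just config.xml, so the added patterns are never tested against anything containing the node name.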

The plugin does not track slaves that are either EphemeralNode or AbstractCloudSlave, but those types are for plugins implementing cloud-based slaves. Nodepool uses the Jenkins core API, which spawns regular slaves, so that is a dead end.

Since this has hit /var/lib/jenkins/config-history/config/ and the nodes config as well, let's apply the limits globally. I set:

Max number of history entries to keep: 1000
Max number of days to keep history entries: 90

We will see in a few days / weeks what is going on.

hashar removed hashar as the assignee of this task. Feb 19 2016, 8:27 PM

I did the immediate cleanup, but it will pile up again until we have a garbage collector. Deleting files without restarting Jenkins apparently confuses the plugin though :-(

Moving bug back to pool.

In Puppet we have the tidy resource type, which can garbage collect files matching a wildcard/path.

Example from modules/statistics/manifests/compute.pp:

# Clean up R temporary files which have not been accessed in a week.
tidy { '/tmp':
    matches => 'Rtmp*',
    age     => '1w',
    rmdirs  => true,
    backup  => false,
    recurse => 1,
}

Online doc for puppet 4.5 (we have 3.4.3) https://docs.puppet.com/puppet/latest/reference/types/tidy.html

Change 295641 had a related patch set uploaded (by Hashar):
contint: tidy Nodepool slaves config history

https://gerrit.wikimedia.org/r/295641

Change 295641 merged by Dzahn:
contint: tidy Nodepool slaves config history

https://gerrit.wikimedia.org/r/295641

The puppet tidy type got merged ( https://gerrit.wikimedia.org/r/#/c/295641 ). I can not see it being applied on gallium though. Will wait a day or so and check again.

So puppet.log has:

Debug: /User[mnoushad]: Autorequiring Group[wikidev]

<LONG PAUSE>
Info: /Stage[main]/Role::Ci::Master/Tidy[history of nodepool slaves config]: File does not exist
Info: /Stage[main]/Role::Ci::Master/Tidy[history of nodepool slaves config]: File does not exist
Info: /Stage[main]/Role::Ci::Master/Tidy[history of nodepool slaves config]: File does not exist
Info: /Stage[main]/Role::Ci::Master/Tidy[history of nodepool slaves config]: File does not exist
Info: /Stage[main]/Role::Ci::Master/Tidy[history of nodepool slaves config]: File does not exist
Info: /Stage[main]/Role::Ci::Master/Tidy[history of nodepool slaves config]: File does not exist
Info: /Stage[main]/Role::Ci::Master/Tidy[history of nodepool slaves config]: File does not exist
Info: /Stage[main]/Role::Ci::Master/Tidy[history of nodepool slaves config]: File does not exist
Info: /Stage[main]/Role::Ci::Master/Tidy[history of nodepool slaves config]: File does not exist
Info: /Stage[main]/Role::Ci::Master/Tidy[history of nodepool slaves config]: File does not exist
Info: /Stage[main]/Role::Ci::Master/Tidy[history of nodepool slaves config]: File does not exist
Info: /Stage[main]/Role::Ci::Master/Tidy[history of nodepool slaves config]: File does not exist
Info: /Stage[main]/Role::Ci::Master/Tidy[history of nodepool slaves config]: File does not exist
Info: /Stage[main]/Role::Ci::Master/Tidy[history of nodepool slaves config]: File does not exist
Info: /Stage[main]/Role::Ci::Master/Tidy[history of nodepool slaves config]: File does not exist
Info: /Stage[main]/Role::Ci::Master/Tidy[history of nodepool slaves config]: File does not exist
Info: /Stage[main]/Role::Ci::Master/Tidy[history of nodepool slaves config]: File does not exist
Info: /Stage[main]/Role::Ci::Master/Tidy[history of nodepool slaves config]: File does not exist
Info: Applying configuration version '1469041402'

strace of puppet shows:

lstat("/var/lib/jenkins/config-history/nodes/ci-jessie-wikimedia-187562_deleted_20160719_103741_374/2016-07-19_10-37-41", {st_mode=S_IFDIR|S_ISGID|0775, st_size=4096, ...}) = 0
openat(AT_FDCWD, "/var/lib/jenkins/config-history/nodes/ci-jessie-wikimedia-187562_deleted_20160719_103741_374/2016-07-19_10-37-41", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 4
lseek(4, 0, SEEK_SET)                   = 0
getdents(4, /* 3 entries */, 32768)     = 80
getdents(4, /* 0 entries */, 32768)     = 0
close(4)

And obviously the files do not get deleted :(

Change 300085 had a related patch set uploaded (by Hashar):
Revert "contint: tidy Nodepool slaves config history"

https://gerrit.wikimedia.org/r/300085

Change 300092 had a related patch set uploaded (by Hashar):
contint: tidy Nodepool slaves config history

https://gerrit.wikimedia.org/r/300092

Take two with tmpreaper which looks easier / sane :]

Change 300085 merged by Dzahn:
Revert "contint: tidy Nodepool slaves config history"

https://gerrit.wikimedia.org/r/300085

Mentioned in SAL [2016-08-07T12:47:31Z] <hashar> root cause of CI outage is T126552

I am wondering, can we set this as high priority please?

greg added a subscriber: greg.

There's not much use in petitioning about specific priority settings. As this only becomes a problem about once every few months, the priority setting is fine. It will be addressed though. The outstanding patch is: https://gerrit.wikimedia.org/r/#/c/300092/

Unrelatedly, I'm adding the Wikimedia-Incident project to this task as this was the cause of at least the outage on Sunday August 7th (see email titled "CI outage (solved)" on the operations mailing list).

hashar raised the priority of this task from Normal to High.

We need someone familiar with tmpreaper to review the patch https://gerrit.wikimedia.org/r/#/c/300092/ and then we can get it deployed and check whether tmpreaper properly clears the files out.

Meanwhile I guess that is blocked on review by Operations

Setting priority to High since this causes a CI outage every ~32k Nodepool instance spawns, or roughly once per month.

Change 300092 abandoned by Hashar:
contint: tidy Nodepool slaves config history

Reason:
I am not entirely convinced by tmpreaper. It will garbage collect other directories under /var/lib/jenkins/config-history/nodes which I would like to preserve.

Will go with a cron running find -path 'that I want' -delete instead.

https://gerrit.wikimedia.org/r/300092

Change 308165 had a related patch set uploaded (by Hashar):
contint: tidy Nodepool slaves config history

https://gerrit.wikimedia.org/r/308165

https://gerrit.wikimedia.org/r/#/c/300092/ had a typo and I eventually found out it would delete the non-Nodepool slaves as well.

https://gerrit.wikimedia.org/r/308165 goes with an hourly cron that uses find -path -delete which is way easier to understand.

Change 308165 merged by Dzahn:
contint: tidy Nodepool slaves config history

https://gerrit.wikimedia.org/r/308165

hashar closed this task as Resolved. Sep 2 2016, 2:27 PM

From the Gerrit change:

on gallium: Notice: /Stage[main]/Role::Ci::Master/Cron[tidy_jenkins_ephemeral_nodes_configs]/ensure: created

@gallium:~# crontab -u jenkins -l | grep history
35 * * * * /usr/bin/find /var/lib/jenkins/config-history/nodes -path '/var/lib/jenkins/config-history/nodes/ci-*' -mmin +60 -delete > /dev/null 2>&1

Ran the command manually once: there were 6409 files before and 55 files left afterwards.
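The cron's find expression can be tried out on a scratch tree to check that it only touches old entries under ci-* and leaves other nodes alone. This is a sketch assuming GNU find and GNU touch; the directory and file names below (including the "gallium" node directory and history.xml) are invented for the example.

```shell
#!/bin/sh
# Sketch: rehearse the cron's find expression in a scratch tree (GNU find/touch assumed).
nodes=$(mktemp -d)               # stands in for /var/lib/jenkins/config-history/nodes
mkdir -p "$nodes/ci-jessie-wikimedia-1/2016-09-01_10-00-00" \
         "$nodes/ci-jessie-wikimedia-2/2016-09-02_14-00-00" \
         "$nodes/gallium/2016-09-01_10-00-00"
old="$nodes/ci-jessie-wikimedia-1/2016-09-01_10-00-00/history.xml"
fresh="$nodes/ci-jessie-wikimedia-2/2016-09-02_14-00-00/history.xml"
keep="$nodes/gallium/2016-09-01_10-00-00/history.xml"
touch -d '2 hours ago' "$old" "$keep"   # older than the 60 minute cutoff
touch "$fresh"                          # modified just now

# In -path patterns "*" also matches "/", so "ci-*" covers everything below
# those directories. The directories here were created moments ago, so they are
# newer than the cutoff and only the old file under ci-* is removed.
find "$nodes" -path "$nodes/ci-*" -mmin +60 -delete > /dev/null 2>&1

remaining=$(cd "$nodes" && find . -type f | sort)
echo "$remaining"
rm -rf "$nodes"
```

The equally old file under the non ci-* directory survives, which is the behaviour that ruled out the tmpreaper approach.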