
Jenkins: Figure out long term solution for /tmp management
Closed, Resolved · Public

Description

While VM-based testing will eventually make this problem obsolete, depending on how far off that is, this is a high-priority problem.

There are various kinds of jobs, programs, utilities and other scripts run inside a Jenkins job that may produce output in /tmp. In some cases this is configurable and can be disabled. In other cases, it's just an artefact of a lower-level program and it's really not feasible (nor reasonable) to make it configurable.

We shouldn't keep the contents of /tmp around forever.

I propose one or both of:

  1. Set the $TMPDIR environment variable to something that is cleaned up.
    • Dedicated to the job. E.g. some Jenkins plugin that runs globally on all jobs and nukes the tmp dir after the job is run.
    • Dedicated to jenkins-slave. E.g. some generic /tmp/jenkins-slave dir that is purged by a cron job we run on all contint slaves, which rm -rf's items older than 6 hours.
  2. Purge everything in /tmp older than 24 hours (see the sketch after this list).
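
A minimal sketch of how both options could fit together; the /tmp/jenkins-slave location and the 6-hour/24-hour windows come from the proposal above, while the wrapper script and cron schedules are hypothetical:

# Hypothetical per-build wrapper (option 1): give each build its own
# temp dir under /tmp/jenkins-slave and remove it when the build exits.
# JOB_NAME and BUILD_NUMBER are standard Jenkins environment variables.
export TMPDIR="/tmp/jenkins-slave/${JOB_NAME:-job}-${BUILD_NUMBER:-0}"
mkdir -p "$TMPDIR"
trap 'rm -rf "$TMPDIR"' EXIT

# Hypothetical cron entries for all contint slaves (option 2): purge
# per-slave temp entries older than 6 hours, and anything else in /tmp
# older than 24 hours.
*/30 * * * * root find /tmp/jenkins-slave -mindepth 1 -mmin +360 -delete
0 * * * *    root find /tmp -mindepth 1 -maxdepth 1 -mtime +1 -exec rm -rf {} +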

See also:

Details

Reference
bz72011

Event Timeline

bzimport raised the priority of this task to Needs Triage. Nov 22 2014, 3:51 AM
bzimport set Reference to bz72011.
bzimport added a subscriber: Unknown Object (MLST).

Seems like a duplicate of Bug 68563 - Jenkins: point TMP/TEMP to workspace and delete it after build completion

Krinkle updated the task description.
Krinkle set Security to None.
Krinkle removed a subscriber: Unknown Object (MLST).
Krinkle triaged this task as Medium priority. Nov 26 2014, 4:11 AM

Change 181047 had a related patch set uploaded (by Aude):
Cleanup SiteListFileCache test files in tearDown

https://gerrit.wikimedia.org/r/181047

Krinkle raised the priority of this task from Medium to High. Feb 26 2015, 2:56 AM
Krinkle removed a project: Patch-For-Review.

Raising priority, and recommending we give serious thought to something like T89327: Consider running tmpreaper on Jenkins jobs' tmpfs (until we have disposable VMs per job). There are just too many things cropping up in npm (see the subtasks of this task). Every week something new comes up.
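
For reference, a hedged sketch of what that tmpreaper run could look like as a cron entry on each slave; the schedule, the --protect pattern and the 24-hour window are illustrative, not a tested configuration:

# Illustrative cron entry: hourly, remove anything under /tmp not
# modified (--mtime; tmpreaper defaults to atime) in the last 24 hours,
# sparing X11 sockets and lock files via --protect.
17 * * * * root tmpreaper --mtime --protect '/tmp/.X*' 24h /tmp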

It's costing a lot of time and effort to track these down and clean them up manually. Jenkins is frequently auto-depooling slaves without warning because they have less than 1GB of /tmp space left. It takes about 1-2 days for an average slave's temp space to clog up.

Our slaves have a large /mnt disk (~70GB), but the root disk (which /tmp is part of) is quite tiny at only 7GB, of which ~5GB tends to be in use by the standard provisioning and binaries, leaving merely 1GB for /var and /tmp. In theory that should be enough for a dozen concurrent builds, but because of left-over garbage in temp and a build-up of logs, things are piling up.

Current disk usage of slaves:

$ dsh-ci-slaves 'df -h'

integration-slave1001.eqiad.wmflabs: Filesystem Size Used Avail Use% Mounted on
integration-slave1001.eqiad.wmflabs: /dev/vda1 9.4G 3.6G 5.4G 40% /
integration-slave1001.eqiad.wmflabs: udev 3.9G 12K 3.9G 1% /dev
integration-slave1001.eqiad.wmflabs: tmpfs 799M 304K 799M 1% /run
integration-slave1001.eqiad.wmflabs: none 5.0M 0 5.0M 0% /run/lock
integration-slave1001.eqiad.wmflabs: none 3.9G 0 3.9G 0% /run/shm
integration-slave1001.eqiad.wmflabs: /dev/mapper/vd-var 2.0G 776M 1.2G 41% /var
integration-slave1001.eqiad.wmflabs: /dev/mapper/vd-second--local--disk 64G 28G 34G 46% /mnt
integration-slave1001.eqiad.wmflabs: /dev/mapper/vd-log 2.0G 234M 1.7G 13% /var/log
integration-slave1001.eqiad.wmflabs: tmpfs 512M 0 512M 0% /mnt/home/jenkins-deploy/tmpfs
integration-slave1001.eqiad.wmflabs: labstore.svc.eqiad.wmnet:/scratch 7.3T 1.3T 6.0T 17% /data/scratch
integration-slave1001.eqiad.wmflabs: labstore.svc.eqiad.wmnet:/project/integration/project 30T 15T 16T 49% /data/project
integration-slave1001.eqiad.wmflabs: labstore.svc.eqiad.wmnet:/keys 960M 47M 913M 5% /public/keys
integration-slave1001.eqiad.wmflabs: labstore1003.eqiad.wmnet:/dumps 44T 9.8T 34T 23% /public/dumps
integration-slave1001.eqiad.wmflabs: labstore.svc.eqiad.wmnet:/project/integration/home 30T 15T 16T 49% /home

integration-slave1002.eqiad.wmflabs: Filesystem Size Used Avail Use% Mounted on
integration-slave1002.eqiad.wmflabs: /dev/vda1 7.6G 4.1G 3.2G 57% /
integration-slave1002.eqiad.wmflabs: udev 3.9G 12K 3.9G 1% /dev
integration-slave1002.eqiad.wmflabs: tmpfs 1.6G 320K 1.6G 1% /run
integration-slave1002.eqiad.wmflabs: none 5.0M 0 5.0M 0% /run/lock
integration-slave1002.eqiad.wmflabs: none 3.9G 0 3.9G 0% /run/shm
integration-slave1002.eqiad.wmflabs: /dev/vda2 1.9G 1.1G 750M 59% /var
integration-slave1002.eqiad.wmflabs: labstore.svc.eqiad.wmnet:/scratch 7.3T 1.3T 6.0T 17% /data/scratch
integration-slave1002.eqiad.wmflabs: labstore.svc.eqiad.wmnet:/project/integration/project 30T 15T 16T 49% /data/project
integration-slave1002.eqiad.wmflabs: labstore.svc.eqiad.wmnet:/keys 960M 47M 913M 5% /public/keys
integration-slave1002.eqiad.wmflabs: labstore1003.eqiad.wmnet:/dumps 44T 9.8T 34T 23% /public/dumps
integration-slave1002.eqiad.wmflabs: labstore.svc.eqiad.wmnet:/project/integration/home 30T 15T 16T 49% /home
integration-slave1002.eqiad.wmflabs: /dev/mapper/vd-second--local--disk 68G 36G 29G 56% /mnt
integration-slave1002.eqiad.wmflabs: tmpfs 512M 0 512M 0% /mnt/home/jenkins-deploy/tmpfs

integration-slave1003.eqiad.wmflabs: Filesystem Size Used Avail Use% Mounted on
integration-slave1003.eqiad.wmflabs: /dev/vda1 7.6G 4.1G 3.1G 57% /
integration-slave1003.eqiad.wmflabs: udev 3.9G 12K 3.9G 1% /dev
integration-slave1003.eqiad.wmflabs: tmpfs 799M 320K 799M 1% /run
integration-slave1003.eqiad.wmflabs: none 5.0M 0 5.0M 0% /run/lock
integration-slave1003.eqiad.wmflabs: none 3.9G 0 3.9G 0% /run/shm
integration-slave1003.eqiad.wmflabs: /dev/vda2 1.9G 924M 902M 51% /var
integration-slave1003.eqiad.wmflabs: labstore.svc.eqiad.wmnet:/project/integration/home 30T 15T 16T 49% /home
integration-slave1003.eqiad.wmflabs: labstore.svc.eqiad.wmnet:/keys 960M 47M 913M 5% /public/keys
integration-slave1003.eqiad.wmflabs: labstore.svc.eqiad.wmnet:/scratch 7.3T 1.3T 6.0T 17% /data/scratch
integration-slave1003.eqiad.wmflabs: labstore1003.eqiad.wmnet:/dumps 44T 9.8T 34T 23% /public/dumps
integration-slave1003.eqiad.wmflabs: labstore.svc.eqiad.wmnet:/project/integration/project 30T 15T 16T 49% /data/project
integration-slave1003.eqiad.wmflabs: /dev/mapper/vd-second--local--disk 68G 31G 34G 48% /mnt
integration-slave1003.eqiad.wmflabs: tmpfs 512M 0 512M 0% /mnt/home/jenkins-deploy/tmpfs

integration-slave1004.eqiad.wmflabs: Filesystem Size Used Avail Use% Mounted on
integration-slave1004.eqiad.wmflabs: /dev/vda1 9.4G 3.9G 5.1G 43% /
integration-slave1004.eqiad.wmflabs: udev 3.9G 12K 3.9G 1% /dev
integration-slave1004.eqiad.wmflabs: tmpfs 799M 320K 799M 1% /run
integration-slave1004.eqiad.wmflabs: none 5.0M 0 5.0M 0% /run/lock
integration-slave1004.eqiad.wmflabs: none 3.9G 0 3.9G 0% /run/shm
integration-slave1004.eqiad.wmflabs: /dev/mapper/vd-var 2.0G 799M 1.1G 42% /var
integration-slave1004.eqiad.wmflabs: labstore.svc.eqiad.wmnet:/keys 960M 47M 913M 5% /public/keys
integration-slave1004.eqiad.wmflabs: labstore.svc.eqiad.wmnet:/project/integration/project 30T 15T 16T 49% /data/project
integration-slave1004.eqiad.wmflabs: labstore1003.eqiad.wmnet:/dumps 44T 9.8T 34T 23% /public/dumps
integration-slave1004.eqiad.wmflabs: labstore.svc.eqiad.wmnet:/scratch 7.3T 1.3T 6.0T 17% /data/scratch
integration-slave1004.eqiad.wmflabs: /dev/mapper/vd-second--local--disk 64G 29G 33G 47% /mnt
integration-slave1004.eqiad.wmflabs: labstore.svc.eqiad.wmnet:/project/integration/home 30T 15T 16T 49% /home
integration-slave1004.eqiad.wmflabs: /dev/mapper/vd-log 2.0G 183M 1.7G 10% /var/log
integration-slave1004.eqiad.wmflabs: tmpfs 512M 0 512M 0% /mnt/home/jenkins-deploy/tmpfs

integration-slave1005.eqiad.wmflabs: Filesystem Size Used Avail Use% Mounted on
integration-slave1005.eqiad.wmflabs: /dev/vda1 9.3G 5.3G 3.6G 61% /
integration-slave1005.eqiad.wmflabs: none 4.0K 0 4.0K 0% /sys/fs/cgroup
integration-slave1005.eqiad.wmflabs: udev 3.9G 12K 3.9G 1% /dev
integration-slave1005.eqiad.wmflabs: tmpfs 799M 5.8M 793M 1% /run
integration-slave1005.eqiad.wmflabs: none 5.0M 0 5.0M 0% /run/lock
integration-slave1005.eqiad.wmflabs: none 3.9G 0 3.9G 0% /run/shm
integration-slave1005.eqiad.wmflabs: none 100M 0 100M 0% /run/user
integration-slave1005.eqiad.wmflabs: /dev/mapper/vd-var 2.0G 764M 1.1G 42% /var
integration-slave1005.eqiad.wmflabs: /dev/mapper/vd-second--local--disk 64G 27G 35G 44% /mnt
integration-slave1005.eqiad.wmflabs: /dev/mapper/vd-log 2.0G 139M 1.7G 8% /var/log
integration-slave1005.eqiad.wmflabs: tmpfs 512M 0 512M 0% /mnt/home/jenkins-deploy/tmpfs
integration-slave1005.eqiad.wmflabs: labstore.svc.eqiad.wmnet:/keys 960M 47M 913M 5% /public/keys
integration-slave1005.eqiad.wmflabs: labstore1003.eqiad.wmnet:/dumps 44T 9.8T 34T 23% /public/dumps
integration-slave1005.eqiad.wmflabs: labstore.svc.eqiad.wmnet:/scratch 7.3T 1.3T 6.0T 17% /data/scratch
integration-slave1005.eqiad.wmflabs: labstore.svc.eqiad.wmnet:/project/integration/home 30T 15T 16T 49% /home

integration-slave1006.eqiad.wmflabs: Filesystem Size Used Avail Use% Mounted on
integration-slave1006.eqiad.wmflabs: /dev/vda1 7.4G 5.5G 1.6G 78% /
integration-slave1006.eqiad.wmflabs: none 4.0K 0 4.0K 0% /sys/fs/cgroup
integration-slave1006.eqiad.wmflabs: udev 3.9G 12K 3.9G 1% /dev
integration-slave1006.eqiad.wmflabs: tmpfs 799M 5.8M 793M 1% /run
integration-slave1006.eqiad.wmflabs: none 5.0M 0 5.0M 0% /run/lock
integration-slave1006.eqiad.wmflabs: none 3.9G 0 3.9G 0% /run/shm
integration-slave1006.eqiad.wmflabs: none 100M 0 100M 0% /run/user
integration-slave1006.eqiad.wmflabs: /dev/vda2 1.9G 919M 858M 52% /var
integration-slave1006.eqiad.wmflabs: /dev/mapper/vd-second--local--disk 68G 34G 31G 53% /mnt
integration-slave1006.eqiad.wmflabs: tmpfs 512M 0 512M 0% /mnt/home/jenkins-deploy/tmpfs
integration-slave1006.eqiad.wmflabs: labstore.svc.eqiad.wmnet:/scratch 7.3T 1.3T 6.0T 17% /data/scratch
integration-slave1006.eqiad.wmflabs: labstore1003.eqiad.wmnet:/dumps 44T 9.8T 34T 23% /public/dumps
integration-slave1006.eqiad.wmflabs: labstore.svc.eqiad.wmnet:/project/integration/home 30T 15T 16T 49% /home
integration-slave1006.eqiad.wmflabs: labstore.svc.eqiad.wmnet:/keys 960M 47M 913M 5% /public/keys

integration-slave1007.eqiad.wmflabs: Filesystem Size Used Avail Use% Mounted on
integration-slave1007.eqiad.wmflabs: /dev/vda1 7.4G 5.4G 1.7G 77% /
integration-slave1007.eqiad.wmflabs: none 4.0K 0 4.0K 0% /sys/fs/cgroup
integration-slave1007.eqiad.wmflabs: udev 3.9G 12K 3.9G 1% /dev
integration-slave1007.eqiad.wmflabs: tmpfs 799M 5.8M 793M 1% /run
integration-slave1007.eqiad.wmflabs: none 5.0M 0 5.0M 0% /run/lock
integration-slave1007.eqiad.wmflabs: none 3.9G 0 3.9G 0% /run/shm
integration-slave1007.eqiad.wmflabs: none 100M 0 100M 0% /run/user
integration-slave1007.eqiad.wmflabs: /dev/vda2 1.9G 980M 798M 56% /var
integration-slave1007.eqiad.wmflabs: /dev/mapper/vd-second--local--disk 68G 30G 35G 46% /mnt
integration-slave1007.eqiad.wmflabs: tmpfs 512M 0 512M 0% /mnt/home/jenkins-deploy/tmpfs
integration-slave1007.eqiad.wmflabs: labstore.svc.eqiad.wmnet:/keys 960M 47M 913M 5% /public/keys
integration-slave1007.eqiad.wmflabs: labstore.svc.eqiad.wmnet:/project/integration/home 30T 15T 16T 49% /home
integration-slave1007.eqiad.wmflabs: labstore1003.eqiad.wmnet:/dumps 44T 9.8T 34T 23% /public/dumps
integration-slave1007.eqiad.wmflabs: labstore.svc.eqiad.wmnet:/scratch 7.3T 1.3T 6.0T 17% /data/scratch

integration-slave1008.eqiad.wmflabs: Filesystem Size Used Avail Use% Mounted on
integration-slave1008.eqiad.wmflabs: /dev/vda1 7.4G 5.4G 1.7G 77% /
integration-slave1008.eqiad.wmflabs: none 4.0K 0 4.0K 0% /sys/fs/cgroup
integration-slave1008.eqiad.wmflabs: udev 3.9G 12K 3.9G 1% /dev
integration-slave1008.eqiad.wmflabs: tmpfs 799M 5.8M 793M 1% /run
integration-slave1008.eqiad.wmflabs: none 5.0M 0 5.0M 0% /run/lock
integration-slave1008.eqiad.wmflabs: none 3.9G 0 3.9G 0% /run/shm
integration-slave1008.eqiad.wmflabs: none 100M 0 100M 0% /run/user
integration-slave1008.eqiad.wmflabs: /dev/vda2 1.9G 900M 877M 51% /var
integration-slave1008.eqiad.wmflabs: /dev/mapper/vd-second--local--disk 68G 30G 35G 46% /mnt
integration-slave1008.eqiad.wmflabs: tmpfs 512M 0 512M 0% /mnt/home/jenkins-deploy/tmpfs
integration-slave1008.eqiad.wmflabs: labstore.svc.eqiad.wmnet:/project/integration/home 30T 15T 16T 49% /home
integration-slave1008.eqiad.wmflabs: labstore.svc.eqiad.wmnet:/scratch 7.3T 1.3T 6.0T 17% /data/scratch
integration-slave1008.eqiad.wmflabs: labstore.svc.eqiad.wmnet:/keys 960M 47M 913M 5% /public/keys
integration-slave1008.eqiad.wmflabs: labstore1003.eqiad.wmnet:/dumps 44T 9.8T 34T 23% /public/dumps

integration-slave1010.eqiad.wmflabs: Filesystem Size Used Avail Use% Mounted on
integration-slave1010.eqiad.wmflabs: /dev/vda1 18G 6.8G 11G 41% /
integration-slave1010.eqiad.wmflabs: none 4.0K 0 4.0K 0% /sys/fs/cgroup
integration-slave1010.eqiad.wmflabs: udev 3.9G 12K 3.9G 1% /dev
integration-slave1010.eqiad.wmflabs: tmpfs 799M 3.1M 796M 1% /run
integration-slave1010.eqiad.wmflabs: none 5.0M 0 5.0M 0% /run/lock
integration-slave1010.eqiad.wmflabs: none 3.9G 0 3.9G 0% /run/shm
integration-slave1010.eqiad.wmflabs: none 100M 0 100M 0% /run/user
integration-slave1010.eqiad.wmflabs: /dev/mapper/vd-second--local--disk 61G 13G 46G 22% /mnt
integration-slave1010.eqiad.wmflabs: tmpfs 512M 296K 512M 1% /mnt/home/jenkins-deploy/tmpfs
integration-slave1010.eqiad.wmflabs: labstore.svc.eqiad.wmnet:/project/integration/home 30T 15T 16T 49% /home
integration-slave1010.eqiad.wmflabs: labstore1003.eqiad.wmnet:/dumps 44T 9.8T 34T 23% /public/dumps
integration-slave1010.eqiad.wmflabs: labstore.svc.eqiad.wmnet:/keys 960M 47M 913M 5% /public/keys
integration-slave1010.eqiad.wmflabs: labstore.svc.eqiad.wmnet:/scratch 7.3T 1.3T 6.0T 17% /data/scratch

The default main disk (/dev/vda1) for new labs instances has increased in size: it's now 18GB instead of 7GB. We should re-create all our slaves to make use of this.

To do this while keeping the current pool of slaves running, @Andrew has doubled our quota, allowing an additional 9 instances of the m1.large type to be created alongside the existing ones while we switch over.

I'll work on re-creating and provisioning all our instances tomorrow.

hashar claimed this task.

This is hardly an issue any more, compared to what it was at the end of 2014 / beginning of 2015. Complete resolution will be achieved once we have migrated all jobs to Nodepool disposable instances.

Meanwhile, there is little value in keeping this bug open and idle.