
Improve housekeeping of files in /tmp on Hadoop workers
Closed, Resolved (Public)

Description

We receive periodic alerts about the root volume on Hadoop workers becoming full.
For example:

btullis@an-worker1092:~$ df -h /
Filesystem                           Size  Used Avail Use% Mounted on
/dev/mapper/an--worker1092--vg-root   55G   49G  3.2G  94% /

Upon investigation, we find that most of the space is taken up in the /tmp directory.

--- / -----------------------------------------------------------------
   36.2 GiB [##########] /tmp
    6.3 GiB [#         ] /var
    4.5 GiB [#         ] /usr
    1.6 GiB [          ] /opt
    8.4 MiB [          ] /etc
    7.2 MiB [          ] /home
   36.0 KiB [          ] /root

The vast majority of this used space in /tmp is consumed by jar files.

btullis@an-worker1092:~$ sudo find /tmp -name '*.jar' -exec du -ch {} + | grep total$
34G	total

What is more, most of those are more than 30 days old.

btullis@an-worker1092:~$ sudo find /tmp -name '*.jar' -mtime +30 -exec du -ch {} + | grep total$
32G	total

Many different jars are affected, although different versions of refinery-hive make up the vast majority of them.

btullis@an-worker1092:~$ sudo find /tmp -name '*.jar' -mtime +30 | awk -F / '{print $NF}' | sort | uniq -c | sort -rn | head -n 20
    413 refinery-hive-0.2.1-shaded.jar
    136 refinery-hive-0.2.54-shaded.jar
     97 refinery-hive-0.2.48-shaded.jar
     96 refinery-hive-0.2.42-shaded.jar
     51 refinery-hive-0.2.30-shaded.jar
     17 refinery-hive-0.0.91-SNAPSHOT.jar
     12 refinery-hive-0.2.59-shaded.jar
      7 refinery-job-0.2.1-shaded.jar
      7 org.apache.iceberg_iceberg-spark-runtime-3.3_2.12-1.6.1.jar
      7 iceberg-spark-runtime-3.3_2.12-1.6.1.jar
      6 unused-1.0.0.jar
      6 refinery-job-0.2.54-SNAPSHOT-shaded.jar
      6 org.spark-project.spark_unused-1.0.0.jar
      4 mysql-connector-j-8.2.0.jar
      3 zstd-jni-1.4.8-1.jar
      3 spark-token-provider-kafka-0-10_2.12-3.1.2.jar
      3 spark-sql-kafka-0-10_2.12-3.1.2.jar
      3 spark-avro_2.12-3.1.2.jar
      3 snappy-java-1.1.8.2.jar
      3 snakeyaml-1.26.jar

It would be good if we could work out how best to prevent this gradual build-up of jar files in /tmp on the an-worker nodes.

Event Timeline

I obtained the full list of how much space jars more than 30 days old consume on each worker.

btullis@cumin1003:~$ sudo cumin --force --no-progress --no-color -o txt A:hadoop-worker 'find /tmp -name "*.jar" -mtime +30 -exec du -ch {} + | grep total$' | awk 'x==1 { print $0 } /_____FORMATTED_OUTPUT_____/ { x=1 }' | sort -rh -k2
135 hosts will be targeted:
an-worker[1065-1069,1078-1177,1187-1208].eqiad.wmnet,analytics[1070-1077].eqiad.wmnet
FORCE mode enabled, continuing without confirmation
an-worker1105.eqiad.wmnet: 37G	total
an-worker1117.eqiad.wmnet: 36G	total
an-worker1097.eqiad.wmnet: 36G	total
an-worker1162.eqiad.wmnet: 35G	total
an-worker1154.eqiad.wmnet: 35G	total
an-worker1133.eqiad.wmnet: 35G	total
an-worker1107.eqiad.wmnet: 35G	total
an-worker1095.eqiad.wmnet: 35G	total
an-worker1091.eqiad.wmnet: 35G	total
an-worker1089.eqiad.wmnet: 35G	total
an-worker1160.eqiad.wmnet: 34G	total
an-worker1152.eqiad.wmnet: 34G	total
an-worker1140.eqiad.wmnet: 34G	total
an-worker1131.eqiad.wmnet: 34G	total
an-worker1128.eqiad.wmnet: 34G	total
an-worker1125.eqiad.wmnet: 34G	total
an-worker1122.eqiad.wmnet: 34G	total
an-worker1121.eqiad.wmnet: 34G	total
an-worker1113.eqiad.wmnet: 34G	total
an-worker1111.eqiad.wmnet: 34G	total
an-worker1110.eqiad.wmnet: 34G	total
an-worker1109.eqiad.wmnet: 34G	total
an-worker1106.eqiad.wmnet: 34G	total
an-worker1099.eqiad.wmnet: 34G	total
an-worker1094.eqiad.wmnet: 34G	total
an-worker1093.eqiad.wmnet: 34G	total
an-worker1086.eqiad.wmnet: 34G	total
an-worker1081.eqiad.wmnet: 34G	total
an-worker1175.eqiad.wmnet: 33G	total
an-worker1161.eqiad.wmnet: 33G	total
an-worker1159.eqiad.wmnet: 33G	total
an-worker1153.eqiad.wmnet: 33G	total
an-worker1151.eqiad.wmnet: 33G	total
an-worker1149.eqiad.wmnet: 33G	total
an-worker1136.eqiad.wmnet: 33G	total
an-worker1132.eqiad.wmnet: 33G	total
an-worker1130.eqiad.wmnet: 33G	total
an-worker1126.eqiad.wmnet: 33G	total
an-worker1124.eqiad.wmnet: 33G	total
an-worker1120.eqiad.wmnet: 33G	total
an-worker1118.eqiad.wmnet: 33G	total
an-worker1116.eqiad.wmnet: 33G	total
an-worker1115.eqiad.wmnet: 33G	total
an-worker1114.eqiad.wmnet: 33G	total
an-worker1108.eqiad.wmnet: 33G	total
an-worker1098.eqiad.wmnet: 33G	total
an-worker1087.eqiad.wmnet: 33G	total
an-worker1084.eqiad.wmnet: 33G	total
an-worker1083.eqiad.wmnet: 33G	total
an-worker1158.eqiad.wmnet: 32G	total
an-worker1150.eqiad.wmnet: 32G	total
an-worker1147.eqiad.wmnet: 32G	total
an-worker1145.eqiad.wmnet: 32G	total
an-worker1137.eqiad.wmnet: 32G	total
an-worker1135.eqiad.wmnet: 32G	total
an-worker1129.eqiad.wmnet: 32G	total
an-worker1123.eqiad.wmnet: 32G	total
an-worker1112.eqiad.wmnet: 32G	total
an-worker1103.eqiad.wmnet: 32G	total
an-worker1096.eqiad.wmnet: 32G	total
an-worker1092.eqiad.wmnet: 32G	total
an-worker1088.eqiad.wmnet: 32G	total
an-worker1082.eqiad.wmnet: 32G	total
an-worker1079.eqiad.wmnet: 32G	total
an-worker1078.eqiad.wmnet: 32G	total
an-worker1157.eqiad.wmnet: 31G	total
an-worker1144.eqiad.wmnet: 31G	total
an-worker1141.eqiad.wmnet: 31G	total
an-worker1139.eqiad.wmnet: 31G	total
an-worker1138.eqiad.wmnet: 31G	total
an-worker1080.eqiad.wmnet: 31G	total
analytics1071.eqiad.wmnet: 31G	total
an-worker1146.eqiad.wmnet: 30G	total
an-worker1142.eqiad.wmnet: 30G	total
an-worker1134.eqiad.wmnet: 30G	total
an-worker1127.eqiad.wmnet: 30G	total
an-worker1090.eqiad.wmnet: 30G	total
an-worker1143.eqiad.wmnet: 29G	total
analytics1073.eqiad.wmnet: 29G	total
analytics1072.eqiad.wmnet: 29G	total
analytics1070.eqiad.wmnet: 29G	total
an-worker1104.eqiad.wmnet: 28G	total
analytics1076.eqiad.wmnet: 26G	total
analytics1075.eqiad.wmnet: 26G	total
an-worker1085.eqiad.wmnet: 25G	total
analytics1077.eqiad.wmnet: 25G	total
analytics1074.eqiad.wmnet: 25G	total
an-worker1176.eqiad.wmnet: 24G	total
an-worker1067.eqiad.wmnet: 14G	total
an-worker1066.eqiad.wmnet: 14G	total
an-worker1069.eqiad.wmnet: 13G	total
an-worker1065.eqiad.wmnet: 13G	total
an-worker1193.eqiad.wmnet: 6.4G	total
an-worker1198.eqiad.wmnet: 6.0G	total
an-worker1102.eqiad.wmnet: 5.9G	total
an-worker1192.eqiad.wmnet: 5.8G	total
an-worker1191.eqiad.wmnet: 5.7G	total
an-worker1189.eqiad.wmnet: 5.6G	total
an-worker1188.eqiad.wmnet: 5.5G	total
an-worker1207.eqiad.wmnet: 5.4G	total
an-worker1203.eqiad.wmnet: 5.4G	total
an-worker1199.eqiad.wmnet: 5.4G	total
an-worker1196.eqiad.wmnet: 5.4G	total
an-worker1190.eqiad.wmnet: 5.4G	total
an-worker1187.eqiad.wmnet: 5.3G	total
an-worker1204.eqiad.wmnet: 5.2G	total
an-worker1197.eqiad.wmnet: 5.1G	total
an-worker1195.eqiad.wmnet: 5.1G	total
an-worker1205.eqiad.wmnet: 4.9G	total
an-worker1194.eqiad.wmnet: 4.9G	total
an-worker1208.eqiad.wmnet: 4.8G	total
an-worker1200.eqiad.wmnet: 4.5G	total
an-worker1201.eqiad.wmnet: 4.4G	total
an-worker1206.eqiad.wmnet: 3.7G	total
an-worker1173.eqiad.wmnet: 3.4G	total
an-worker1202.eqiad.wmnet: 3.1G	total
an-worker1172.eqiad.wmnet: 2.9G	total
an-worker1170.eqiad.wmnet: 2.9G	total
an-worker1174.eqiad.wmnet: 2.8G	total
an-worker1171.eqiad.wmnet: 2.4G	total
an-worker1169.eqiad.wmnet: 2.0G	total
an-worker1168.eqiad.wmnet: 1013M	total
an-worker1167.eqiad.wmnet: 975M	total
an-worker1166.eqiad.wmnet: 932M	total
an-worker1156.eqiad.wmnet: 414M	total
an-worker1155.eqiad.wmnet: 335M	total
an-worker1177.eqiad.wmnet: 255M	total
an-worker1148.eqiad.wmnet: 255M	total
an-worker1068.eqiad.wmnet: 255M	total
an-worker1165.eqiad.wmnet: 207M	total
an-worker1164.eqiad.wmnet: 207M	total
an-worker1119.eqiad.wmnet: 207M	total
an-worker1163.eqiad.wmnet: 128M	total

I have done the following.

btullis@cumin1003:~$ sudo cumin A:hadoop-worker 'find /tmp -name "*.jar" -mtime +30 -delete'
135 hosts will be targeted:
an-worker[1065-1069,1078-1177,1187-1208].eqiad.wmnet,analytics[1070-1077].eqiad.wmnet
OK to proceed on 135 hosts? Enter the number of affected hosts to confirm or "q" to quit: 135
===== NO OUTPUT =====                                                                                                                                                                                              
PASS |█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (135/135) [00:05<00:00, 22.60hosts/s]
FAIL |                                                                                                                                                                           |   0% (0/135) [00:05<?, ?hosts/s]
100.0% (135/135) success ratio (>= 100.0% threshold) for command: 'find /tmp -name ...time +30 -delete'.
100.0% (135/135) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
btullis@cumin1003:~$

Mentioned in SAL (#wikimedia-operations) [2025-10-24T08:43:43Z] <gehel> cleanup old jar files on an-worker nodes - T396582 - sudo cumin A:hadoop-worker 'find /tmp -name *.jar -mtime +30 -delete'

It looks like it is not only jar files taking up space. At the moment, an-worker1161 is low on disk space, with a single directory taking 3.4G of what looks like SQLite databases from an AddLink job.

gehel@an-worker1161:~$ sudo ls -lh /tmp/ozge/addalink/shard_nlwiki/release
total 3.4G
-rw-r----- 1 ozge wikidev  74M Sep  5 11:22 anchors.pkl
-rw-r----- 1 ozge wikidev 139M Sep  5 12:28 anchors.sqlite
-rw-r----- 1 ozge wikidev  124 Sep  5 11:26 anchors.sqlite.checksum
-rw-r----- 1 ozge wikidev  64M Sep  5 11:26 anchors.sqlite.gz
-rw-r----- 1 ozge wikidev 2.7M Sep  5 12:50 nlwiki.linkmodel.joblib
-rw-r----- 1 ozge wikidev 4.6M Sep  5 12:50 nlwiki.linkmodel.json
-rw-r----- 1 ozge wikidev   88 Sep  5 12:50 nlwiki.linkmodel.json.checksum
-rw-r----- 1 ozge wikidev  56M Sep  5 11:26 pageids.pkl
-rw-r----- 1 ozge wikidev 161M Sep  5 12:32 pageids.sqlite
-rw-r----- 1 ozge wikidev  124 Sep  5 11:32 pageids.sqlite.checksum
-rw-r----- 1 ozge wikidev  71M Sep  5 11:32 pageids.sqlite.gz
-rw-r----- 1 ozge wikidev  35M Sep  5 11:32 redirects.pkl
-rw-r----- 1 ozge wikidev  74M Sep  5 12:38 redirects.sqlite
-rw-r----- 1 ozge wikidev  126 Sep  5 11:34 redirects.sqlite.checksum
-rw-r----- 1 ozge wikidev  36M Sep  5 11:34 redirects.sqlite.gz
-rw-r----- 1 ozge wikidev 993M Sep  5 11:34 w2vfiltered.pkl
-rw-r----- 1 ozge wikidev 1.2G Sep  5 12:40 w2vfiltered.sqlite
-rw-r----- 1 ozge wikidev  128 Sep  5 12:07 w2vfiltered.sqlite.checksum
-rw-r----- 1 ozge wikidev 555M Sep  5 12:07 w2vfiltered.sqlite.gz
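
To spot large consumers like this under /tmp regardless of file type, something along these lines works with nothing but standard coreutils (a sketch, not a command captured from this investigation):

# Show the 20 largest entries up to two levels deep under /tmp;
# -x keeps du on the root filesystem so it doesn't cross mount points,
# and sort -rh orders the human-readable sizes correctly.
sudo du -xh --max-depth=2 /tmp | sort -rh | head -n 20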
bking renamed this task from Improve housekeeping of jar files in /tmp on Hadoop workers to Improve housekeeping of files in /tmp on Hadoop workers. (Oct 27 2025, 6:32 PM)

I want to share a bit more context based on this Slack thread.

As @Gehel mentioned above, not all the files are jars. I removed some jars from an-worker1128 today, but that only recovered a few GB.

Tmpreaper can clean up /tmp on a regular basis, and it's available from the Debian repos. (Actually, it looks like WMF has used it in the past.)

I've never used it, but we had a similar tool called tmpwatch in my early sysadmin days, and it worked pretty well.

@BTullis or @Stevemunene, what do you think about installing tmpreaper, or using some other means, to automatically clean out any files in /tmp that are older than 30 days?
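
For reference, a tmpreaper invocation along these lines should preview what it would remove; this is a sketch based on the Debian package's manpage rather than anything tested on these hosts:

# Dry run: report what would be removed from /tmp without deleting,
# judging age by mtime (not atime) with a 30-day threshold.
sudo tmpreaper --test --mtime 30d /tmp

The Debian package also installs a daily cron job driven by /etc/tmpreaper.conf, where the threshold would be set with something like TMPREAPER_TIME=30d.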

Checked on a random host; there are >5G of files in /tmp older than 30 days:

gehel@an-worker1161:~$ sudo nice ionice -c3 find /tmp -mtime +30 -printf "%s\n" | sumCol 1
5675191027
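
(sumCol looks like a local helper; an equivalent using only standard tools would be:)

# Sum the byte sizes printed by find and print the total,
# as the sumCol helper above appears to do.
sudo nice ionice -c3 find /tmp -mtime +30 -printf '%s\n' \
  | awk '{ bytes += $1 } END { print bytes }'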

In the meantime, I ran:

brouberol@cumin1003:~$ sudo cumin A:hadoop-worker 'find /tmp -name "*.jar" -mtime +30 -delete'

Change #1199233 had a related patch set uploaded (by Gehel; author: Gehel):

[operations/puppet@production] Hadoop: Introduce tmpreaper to cleanup /tmp

https://gerrit.wikimedia.org/r/1199233

Change #1199233 abandoned by Gehel:

[operations/puppet@production] Hadoop: Introduce tmpreaper to cleanup /tmp

Reason:

The modern solution is systemd!

https://gerrit.wikimedia.org/r/1199233

Change #1199239 had a related patch set uploaded (by Gehel; author: Gehel):

[operations/puppet@production] Hadoop: cleanup /tmp with systemd::tmpfile

https://gerrit.wikimedia.org/r/1199239

Change #1199239 merged by Gehel:

[operations/puppet@production] Hadoop: cleanup /tmp with systemd::tmpfile

https://gerrit.wikimedia.org/r/1199239
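
The systemd::tmpfile Puppet define drops a rule file under /etc/tmpfiles.d/ (tmp-hadoop.conf, as used in the test below). The exact rule is in the Gerrit change; a minimal sketch in tmpfiles.d(5) format, assuming a 30-day age, would be:

# /etc/tmpfiles.d/tmp-hadoop.conf (sketch; actual rule is in the patch)
# Type Path Mode User Group Age
d      /tmp 1777 root root  30d

With an Age field set, systemd-tmpfiles --clean removes entries under the path only when all of their atime, mtime, and ctime timestamps are older than the threshold, which is why recently accessed files survive a clean.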

Code merged. Manual test on an-worker1161:

gehel@an-worker1161:~$ sudo nice ionice -c3 find /tmp -mtime +30 | wc -l
18921
gehel@an-worker1161:~$ sudo run-puppet-agent
[...]
gehel@an-worker1161:~$ sudo systemd-tmpfiles --clean /etc/tmpfiles.d/tmp-hadoop.conf 
gehel@an-worker1161:~$ sudo systemd-tmpfiles --remove /etc/tmpfiles.d/tmp-hadoop.conf 
gehel@an-worker1161:~$ sudo nice ionice -c3 find /tmp -mtime +30 | wc -l
14800

Some files have been cleaned up (around 4,100 of the 18,921), but not all. Not sure what's going on here.

Change #1199334 had a related patch set uploaded (by Gehel; author: Gehel):

[operations/puppet@production] hadoop: cleanup /tmp from directories as well as files

https://gerrit.wikimedia.org/r/1199334

Change #1199334 merged by Gehel:

[operations/puppet@production] hadoop: cleanup /tmp from directories as well as files

https://gerrit.wikimedia.org/r/1199334
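
With that follow-up merged, re-running the earlier manual test should confirm that old directories are now aged out along with files (a sketch of the verification; output not captured here):

sudo run-puppet-agent
sudo systemd-tmpfiles --clean /etc/tmpfiles.d/tmp-hadoop.conf
sudo nice ionice -c3 find /tmp -mtime +30 | wc -l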