"No space left on device" on wsexport-prod01.eqiad
Closed, Resolved · Public

Assigned To

Authored By

	Tpt
	Feb 26 2020, 7:05 AM

Description

There does not seem to be any garbage collection for the tmp directory on our prod and dev servers, which leads to 100% disk usage and failed format conversions. I mitigated the problem by removing all the existing "garbage" by hand.

We should add a cron task (or something similar) to avoid this problem in the future.

First reported here: https://fr.wikisource.org/w/index.php?title=Sujet:Vhfanbvh1oj5hsde&topic_showPostId=vhhdnn0jeshqvw9h#flow-post-vhhdnn0jeshqvw9h

Related Objects

Mentioned Here: T190185: tmp file leak from tools.wsexport

Event Timeline

Tpt created this task. Feb 26 2020, 7:05 AM

Restricted Application added a project: Community-Tech. Feb 26 2020, 7:05 AM

Restricted Application added a subscriber: Aklapper.

Tpt updated the task description. Feb 26 2020, 7:05 AM

Would be good to have an idea about what is leaving stuff in /tmp in the first place. Cleaning out /tmp (especially also after failures) should be standard for any piece of software.

Samwilson claimed this task. Feb 28 2020, 1:12 AM

Samwilson edited projects, added Community-Tech (Kanban-Q3-2019-20); removed Community-Tech.

Samwilson moved this task from Ready to In Development on the Community-Tech (Kanban-Q3-2019-20) board. Mar 3 2020, 2:00 AM

The temp-file cleanup was calling glob( $this->dir . '/*.*' ), so it only found files with a dot in their names.
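The effect of a '*.*' pattern is easy to reproduce in a shell, whose globbing behaves like the PHP call for this particular case. A minimal sketch, using a scratch directory and two made-up file names (neither comes from the tool):

```shell
#!/bin/sh
# Scratch directory with one dotted and one dotless name (both hypothetical).
scratch=$(mktemp -d)
touch "$scratch/book.epub" "$scratch/c12345"

# '*.*' only matches names that contain a dot; '*' matches both entries.
dotted=$(cd "$scratch" && ls -d *.* | wc -l | tr -d ' ')
all=$(cd "$scratch" && ls -d * | wc -l | tr -d ' ')
echo "matched by *.*: $dotted; matched by *: $all"

rm -r "$scratch"
```

Any temp file or directory created without an extension would be skipped by the narrower pattern and so would never be purged.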

PR: https://github.com/wsexport/tool/pull/223

I wondered about adding a cronjob to clear the cache, because the current regime only runs on 5% of requests. But maybe it's not required; we can keep an eye on it after this fix.
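To make the "5% of requests" regime concrete, here is a rough shell sketch of probabilistic cleanup. The function name and the way the roll is drawn are assumptions for illustration only; the tool's real logic lives in its PHP code and may differ.

```shell
#!/bin/bash
# should_cleanup: succeed for roughly 5 in every 100 calls.
# An explicit roll (0-99) can be passed in to make the decision repeatable.
should_cleanup() {
    roll=${1:-$((RANDOM % 100))}
    [ "$roll" -lt 5 ]
}

if should_cleanup; then
    echo "this request would also purge old temp files"
else
    echo "this request skips cleanup"
fi
```

With this kind of scheme the cleanup frequency depends on traffic, which is the reason a time-based cron job is a natural alternative.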

The output of df -h is now in the DB backup script, so we can see in those cron emails how much space there is (in lieu of actual low-space notifications I guess).
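If an explicit warning is wanted instead of reading df -h output by eye, something along these lines could be added to the same script. This is only a sketch: the function name and the 90% threshold are assumptions, not what the backup script actually contains.

```shell
#!/bin/sh
# Print the Use% figure for a mount point as a bare integer.
# 'df -P' gives POSIX-format output with one line per filesystem.
disk_use_percent() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

threshold=90   # hypothetical warning level
used=$(disk_use_percent /)
if [ "$used" -ge "$threshold" ]; then
    echo "WARNING: / is ${used}% full"
else
    echo "/ is ${used}% full"
fi
```

Because cron mails whatever a job prints, the WARNING line would show up in the same emails as the df -h output.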

Above patch is merged and is deployed to the test site. I'm not sure what can be done to QA this, so if nothing breaks shall we deploy it to prod tomorrow?

This is now live on production. I'll keep monitoring the space usage. It looks like it's helped with some of it, but there are still temp calibre config files being created, e.g. /tmp/systemd-private-373cd78772cd45f8ac0e7d10f26d367b-apache2.service-zyCwxV/tmp/calibre_4.10.1_tmp_z8wdJ3 and possibly not cleaned up.

Also, locally, I have found that if there is an exception creating an ebook, the temp files are not necessarily cleaned up. I am not sure how common this will be, and how much space this would take up. Perhaps the cron suggestion in T246197#5935464 would handle this.
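For illustration, the general pattern for cleaning up even when a conversion fails looks like this in shell. The function name is hypothetical and `false` stands in for a failing conversion; in the tool's PHP the analogous construct would be a try/finally block.

```shell
#!/bin/sh
# convert_with_cleanup: run a command with a private temp dir and always
# remove that dir afterwards. The body is a subshell, so the EXIT trap
# fires when the function finishes, whether the command succeeded or not.
convert_with_cleanup() (
    workdir=$(mktemp -d) || exit 1
    trap 'rm -rf "$workdir"' EXIT
    echo "$workdir"          # report the path so the caller can inspect it
    "$@"                     # the (hypothetical) conversion command
    exit $?
)

# 'false' simulates a conversion that fails part-way through.
leftover=$(convert_with_cleanup false)
status=$?
echo "conversion exit status: $status"
if [ -d "$leftover" ]; then
    echo "temp dir leaked"
else
    echo "temp dir removed"
fi
```

This does not help when the process is killed outright (for example by SIGKILL), so a periodic sweep is still a useful backstop.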

Calibre has a few environment variables of interest:

CALIBRE_CONFIG_DIRECTORY - sets the directory where configuration files are stored/read.
CALIBRE_TEMP_DIR - sets the temporary directory used by calibre
CALIBRE_CACHE_DIRECTORY - sets the directory calibre uses to cache persistent data between sessions

I've set the first of these (to at least get rid of the warning), and cleared out the other tmp files.

Setting CALIBRE_TEMP_DIR to /tmp/calibre-temp results in Permission denied: '/tmp/calibre-temp/calibre_4.10.1_tmp_wWOQGQ', which can be resolved but doesn't seem to actually fix the problem that Calibre's not cleaning up its temp directories (it just moves these to a different location).

Do any of you know if it's acceptable to add /tmp/systemd-private-373cd78772cd45f8ac0e7d10f26d367b-apache2.service-zyCwxV/ to /usr/lib/tmpfiles.d/tmp.conf (by default it's excluded)?

I found this quote from Kovid on an old bug report about temp files: "temp files are guaranteed to be deleted only when calibre is shutdown cleanly."

I'm not sure how that info applies to the server/headless version. It might be that when the conversion process crashes, the cleanup never happens.

To me, that's the more serious issue. Might it be possible to have a cronjob that removes calibre tmp files that are over 24 hours old?

Yes, it seems we have to do it manually then. The process-specific tmp directory created by systemd will change its name, so I guess we just scan all of /tmp? (Sorry, I'm not familiar with how these work!)

I've set up a daily job in /etc/cron.daily/calibre-cleanup:

#!/bin/bash

find /tmp -name '*calibre*' -user www-data -mtime +2 -exec rm -r {} \;

This might make sysadmins cry though, so if any of you are, please tell me the proper way. :-)

I've documented that script.

MBinder_WMF edited projects, added Community-Tech (Kanban-2019-20-Q4); removed Community-Tech (Kanban-Q3-2019-20). Mar 31 2020, 10:26 PM

MBinder_WMF moved this task from Ready to In Development on the Community-Tech (Kanban-2019-20-Q4) board. Mar 31 2020, 10:30 PM

In T246197#5959509, @Samwilson wrote:

This might make sysadmins cry though, so if any of you are, please tell me the proper way. :-)

Your script seems reasonable. On the Toolforge grid engine nodes we use tmpreaper to do this sort of cleanup. We configure it with these options in /etc/tmpreaper.conf:

/etc/tmpreaper.conf

TMPREAPER_TIME=1d
TMPREAPER_PROTECT_EXTRA='/tmp/*.webgrid-{generic,lighttpd}'
TMPREAPER_DIRS='/tmp/.'
TMPREAPER_DELAY='256'
TMPREAPER_ADDITIONALOPTIONS=''

We added that actually because of wsexport: T190185: tmp file leak from tools.wsexport

Thanks @bd808.

Although, it actually looks like /etc/cron.daily/calibre-cleanup is not being run daily. Just now, I logged in and the disk was full again:

samwilson@wsexport-prod01:~$ df -h
Filesystem      Size  Used Avail Use% Mounted on
udev            3.9G     0  3.9G   0% /dev
tmpfs           798M   81M  718M  11% /run
/dev/vda2        19G   18G   55M 100% /
tmpfs           3.9G  8.0K  3.9G   1% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           3.9G     0  3.9G   0% /sys/fs/cgroup
tmpfs           798M     0  798M   0% /run/user/0
tmpfs           798M     0  798M   0% /run/user/3205

then I ran sudo /etc/cron.daily/calibre-cleanup and it's okay again:

samwilson@wsexport-prod01:~$ df -h
Filesystem      Size  Used Avail Use% Mounted on
udev            3.9G     0  3.9G   0% /dev
tmpfs           798M   81M  718M  11% /run
/dev/vda2        19G   14G  4.1G  77% /
tmpfs           3.9G  136K  3.9G   1% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           3.9G     0  3.9G   0% /sys/fs/cgroup
tmpfs           798M     0  798M   0% /run/user/0
tmpfs           798M     0  798M   0% /run/user/3205

I'm not sure what's going on. run-parts --test /etc/cron.daily says it'd be run.

I've restarted the cron service; that didn't help.

Any ideas?

In T246197#6041885, @Samwilson wrote:

I'm not sure what's going on. run-parts --test /etc/cron.daily says it'd be run.

I've restarted the cron service; that didn't help.

Any ideas?

$ cat /etc/crontab
# /etc/crontab: system-wide crontab
# Unlike any other crontab you don't have to run the `crontab'
# command to install the new version when you edit this file
# and files in /etc/cron.d. These files also have username fields,
# that none of the other crontabs do.

SHELL=/bin/sh
PATH=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin

# Example of job definition:
# .---------------- minute (0 - 59)
# |  .------------- hour (0 - 23)
# |  |  .---------- day of month (1 - 31)
# |  |  |  .------- month (1 - 12) OR jan,feb,mar,apr ...
# |  |  |  |  .---- day of week (0 - 6) (Sunday=0 or 7) OR sun,mon,tue,wed,thu,fri,sat
# |  |  |  |  |
# *  *  *  *  * user-name command to be executed
17 *    * * *   root    cd / && run-parts --report /etc/cron.hourly
25 6    * * *   root    test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.daily )
47 6    * * 7   root    test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.weekly )
52 6    1 * *   root    test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.monthly )
$ ls /usr/sbin/anacron
ls: cannot access '/usr/sbin/anacron': No such file or directory

It looks like it should fire sometime around 06:25 UTC each day.

I tried running it manually in a way very similar to what the /etc/crontab does and got this output:

$ cd /
$ run-parts --verbose --regex '^calibre-cleanup$' /etc/cron.daily
run-parts: executing /etc/cron.daily/calibre-cleanup
find: ‘/tmp/systemd-private-373cd78772cd45f8ac0e7d10f26d367b-apache2.service-zyCwxV/tmp/calibre_4.13.0_tmp_Ebpqka’: No such file or directory
find: ‘/tmp/systemd-private-373cd78772cd45f8ac0e7d10f26d367b-apache2.service-zyCwxV/tmp/calibre_4.13.0_tmp_eUOrpM’: No such file or directory
find: ‘/tmp/systemd-private-373cd78772cd45f8ac0e7d10f26d367b-apache2.service-zyCwxV/tmp/calibre_4.13.0_tmp_d5RC_4’: No such file or directory
find: ‘/tmp/systemd-private-373cd78772cd45f8ac0e7d10f26d367b-apache2.service-zyCwxV/tmp/calibre_4.13.0_tmp_wSjHW1’: No such file or directory
find: ‘/tmp/systemd-private-373cd78772cd45f8ac0e7d10f26d367b-apache2.service-zyCwxV/tmp/calibre_4.13.0_tmp_xKWc4s’: No such file or directory
find: ‘/tmp/systemd-private-373cd78772cd45f8ac0e7d10f26d367b-apache2.service-zyCwxV/tmp/calibre_4.13.0_tmp_DQ013T’: No such file or directory
find: ‘/tmp/systemd-private-373cd78772cd45f8ac0e7d10f26d367b-apache2.service-zyCwxV/tmp/calibre_4.13.0_tmp_84i1Bj’: No such file or directory
find: ‘/tmp/systemd-private-373cd78772cd45f8ac0e7d10f26d367b-apache2.service-zyCwxV/tmp/calibre_4.13.0_tmp_CLNdww’: No such file or directory
find: ‘/tmp/systemd-private-373cd78772cd45f8ac0e7d10f26d367b-apache2.service-zyCwxV/tmp/calibre_4.13.0_tmp_ZgHWMM’: No such file or directory
find: ‘/tmp/systemd-private-373cd78772cd45f8ac0e7d10f26d367b-apache2.service-zyCwxV/tmp/calibre_4.13.0_tmp_r9yQ6o’: No such file or directory
find: ‘/tmp/systemd-private-373cd78772cd45f8ac0e7d10f26d367b-apache2.service-zyCwxV/tmp/calibre_4.13.0_tmp_BB5iHc’: No such file or directory
find: ‘/tmp/systemd-private-373cd78772cd45f8ac0e7d10f26d367b-apache2.service-zyCwxV/tmp/calibre_4.13.0_tmp_NjhQwK’: No such file or directory
find: ‘/tmp/systemd-private-373cd78772cd45f8ac0e7d10f26d367b-apache2.service-zyCwxV/tmp/calibre_4.13.0_tmp_lcxaMr’: No such file or directory
find: ‘/tmp/systemd-private-373cd78772cd45f8ac0e7d10f26d367b-apache2.service-zyCwxV/tmp/calibre_4.13.0_tmp_F2eLDU’: No such file or directory
find: ‘/tmp/systemd-private-373cd78772cd45f8ac0e7d10f26d367b-apache2.service-zyCwxV/tmp/calibre_4.13.0_tmp_wWWkyt’: No such file or directory
find: ‘/tmp/systemd-private-373cd78772cd45f8ac0e7d10f26d367b-apache2.service-zyCwxV/tmp/calibre_4.13.0_tmp_2SQdfk’: No such file or directory
find: ‘/tmp/systemd-private-373cd78772cd45f8ac0e7d10f26d367b-apache2.service-zyCwxV/tmp/calibre_4.13.0_tmp_ygW8ro’: No such file or directory
find: ‘/tmp/systemd-private-373cd78772cd45f8ac0e7d10f26d367b-apache2.service-zyCwxV/tmp/calibre_4.13.0_tmp_8d_JVm’: No such file or directory
find: ‘/tmp/systemd-private-373cd78772cd45f8ac0e7d10f26d367b-apache2.service-zyCwxV/tmp/calibre_4.13.0_tmp_Bi71QW’: No such file or directory
find: ‘/tmp/systemd-private-373cd78772cd45f8ac0e7d10f26d367b-apache2.service-zyCwxV/tmp/calibre_4.13.0_tmp_vqbqkH’: No such file or directory
find: ‘/tmp/systemd-private-373cd78772cd45f8ac0e7d10f26d367b-apache2.service-zyCwxV/tmp/calibre_4.13.0_tmp_DsgB8J’: No such file or directory
find: ‘/tmp/systemd-private-373cd78772cd45f8ac0e7d10f26d367b-apache2.service-zyCwxV/tmp/calibre_4.13.0_tmp_mo2kmQ’: No such file or directory
find: ‘/tmp/systemd-private-373cd78772cd45f8ac0e7d10f26d367b-apache2.service-zyCwxV/tmp/calibre_4.13.0_tmp_pYjdn_’: No such file or directory
run-parts: /etc/cron.daily/calibre-cleanup exited with return code 1

A second run in the same manner only reported run-parts: executing /etc/cron.daily/calibre-cleanup. There are still a ton of calibre tmp directories under /tmp/systemd-private-373cd78772cd45f8ac0e7d10f26d367b-apache2.service-zyCwxV/tmp, but on random inspection they look to have mtimes newer than the 48 hour limit in your find command.
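One possible explanation for the "No such file or directory" messages and the return code 1 (an assumption, not something confirmed in this thread): find runs rm -r on a matching directory and then tries to descend into the directory it has just deleted. Adding -prune before -exec tells find not to descend. A sketch in a scratch tree, with made-up directory names and without the -user www-data test from the real script:

```shell
#!/bin/sh
# Scratch tree standing in for the systemd private tmp dir (names hypothetical).
root=$(mktemp -d)
mkdir -p "$root/calibre_old_tmp_aaa" "$root/calibre_new_tmp_bbb"
touch "$root/calibre_old_tmp_aaa/page.html"
# Backdate the first directory by four days (GNU touch syntax).
touch -d '4 days ago' "$root/calibre_old_tmp_aaa"

# -prune stops find from descending into a directory it is about to delete.
find "$root" -name '*calibre*' -mtime +2 -prune -exec rm -r {} \;
find_status=$?

remaining=$(ls "$root")
echo "find exit status: $find_status; remaining: $remaining"
rm -r "$root"
```

Note also that GNU find discards fractional days for -mtime, so +2 only matches entries last modified at least three full days ago, which is somewhat longer than 48 hours.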

*shrug* I'm not sure what the heck is going on.

Thanks for looking into it @bd808. I haven't been able to figure out what's going on.

We could move the Calibre temp directories to within the tool's own temp directory, and then clean them up with the same mechanism that we're using for other temp files. This patch is WIP of this idea: https://github.com/wsexport/tool/pull/227
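Roughly, the idea is the following. This is a shell sketch only, not what the PR does: the directory layout and the leftover name are invented, and a real sweep would also filter by age.

```shell
#!/bin/sh
# Hypothetical tool-owned temp directory; the real path is whatever the
# tool is configured to use.
tool_tmp=$(mktemp -d)

# Point Calibre at a subdirectory of the tool's temp dir.
export CALIBRE_TEMP_DIR="$tool_tmp/calibre"
mkdir -p "$CALIBRE_TEMP_DIR"

# Simulate what a crashed conversion might leave behind.
mkdir "$CALIBRE_TEMP_DIR/calibre_x_tmp_example"

# One sweep of the tool's temp dir now covers Calibre's leftovers as well.
find "$tool_tmp" -mindepth 1 -maxdepth 1 -exec rm -r {} +
left=$(ls -A "$tool_tmp" | wc -l | tr -d ' ')
echo "entries left in tool temp dir: $left"
rmdir "$tool_tmp"
```

The sweep no longer needs to know about the systemd-private path, whose name can change.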

This would have the advantage of working wherever the tool is installed, and feels a bit nicer in some ways (but perhaps I should keep trying to figure out why the cron cleanup isn't working as it should...).

So I'm not sure what's happened, but after a few weeks of disk usage hovering around 95% it's now more like around 70%, which is what the cron cleanup command gets it to when run manually. So I don't think there's anything more required here. I'll leave the above patch for now, because it doesn't seem to be required, but we can always come back here if need be in the future.

There's not much to review here, but if anyone wants to pass an eye over the script and documentation, that'd be great.

All looks fine after a couple of weeks. Closing.
