Page MenuHomePhabricator

tmpreaper possible race condition
Open, NormalPublic

Description

We had a couple of cronspam from tmpreaper:

# mw1206
/etc/cron.daily/tmpreaper:
error: Failed to lstat() `/tmp/./merge-mine-kQP8sw': No such file or directory
error: Failed to lstat() `/tmp/./merge-old-kjkfob': No such file or directory
error: Failed to lstat() `/tmp/./merge-your-2LN2xR': No such file or directory

# mw1288
/etc/cron.daily/tmpreaper:
error: Failed to lstat() `/tmp/./puppet20161122-58128-8m6vuv': No such file or directory

Looks like tmpreaper has a race condition with short lived files, for which it tries to perform an lstat after the file is deleted.
For mw1288 for example puppet runs at minute 28 and the email was sent at minute 29, good indication that the missing file was a temporary file from puppet.

It looks the same of https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=763858 the tmpreaper version in those hosts is 1.6.13+nmu1.

Event Timeline

Volans created this task.Nov 22 2016, 10:44 AM
jbond added a subscriber: jbond.EditedMar 6 2019, 9:01 PM

when monitoring the tmp dir i see many short lived tmp files and a few long long lived files. running lsof shows that the short lived files get created by hhvm (at least on mw1347). My assumption is that the long lived ones where also created by hhvm and is theses files i.e. the ones created on Mar 2 in the output below which are causing the problem, however i'm unsure where the race condition is. tmpreaper is configured to only remove files with a ctime > 7days with a 256 second splay. My best guess is that something in media wiki (or something else although i check all the other cron jobs and couldn't see anything) is also reaping theses files after 7 days. if its doing so without the 256 second splay it would account for the somewhat spasmodic nature of our alerts

jbond@mw1347:~$ while true ; do date ; ls -la /tmp/ | egrep '\s\w{6}$' ; sleep 1  ; done
Wed Mar  6 20:50:33 UTC 2019
total 149812
-rw-------  1 www-data www-data         0 Mar  6 20:50 0hibnA
-rw-------  1 www-data www-data         0 Mar  2 00:35 2ipJ2O
-rw-------  1 www-data www-data         0 Mar  6 20:50 H1hn1o
-rw-------  1 www-data www-data         0 Mar  2 00:35 MdLJjP
Wed Mar  6 20:50:34 UTC 2019
total 149812
-rw-------  1 www-data www-data         0 Mar  2 00:35 2ipJ2O
-rw-------  1 www-data www-data         0 Mar  2 00:35 MdLJjP
Wed Mar  6 20:50:35 UTC 2019
total 149812
-rw-------  1 www-data www-data         0 Mar  2 00:35 2ipJ2O
-rw-------  1 www-data www-data         0 Mar  6 20:50 2XvV1G
-rw-------  1 www-data www-data         0 Mar  6 20:50 6kDywU
-rw-------  1 www-data www-data         0 Mar  6 20:50 7ECaJN
-rw-------  1 www-data www-data         0 Mar  2 00:35 MdLJjP
-rw-------  1 www-data www-data         0 Mar  6 20:50 vIhKkA
Wed Mar  6 20:50:36 UTC 2019
total 149812
-rw-------  1 www-data www-data         0 Mar  2 00:35 2ipJ2O
-rw-------  1 www-data www-data         0 Mar  2 00:35 MdLJjP
Wed Mar  6 20:50:37 UTC 2019
total 149812
-rw-------  1 www-data www-data         0 Mar  2 00:35 2ipJ2O
-rw-------  1 www-data www-data         0 Mar  6 20:50 kGOipb
-rw-------  1 www-data www-data         0 Mar  2 00:35 MdLJjP
-rw-------  1 www-data www-data         0 Mar  6 20:50 n0f18c
Wed Mar  6 20:50:38 UTC 2019
total 149816
-rw-------  1 www-data www-data         0 Mar  2 00:35 2ipJ2O
-rw-------  1 www-data www-data         0 Mar  6 20:50 79WV18
-rw-------  1 www-data www-data         0 Mar  2 00:35 MdLJjP
Wed Mar  6 20:50:39 UTC 2019
total 149816
-rw-------  1 www-data www-data         0 Mar  2 00:35 2ipJ2O
-rw-------  1 www-data www-data         0 Mar  6 20:50 8CiEQr
-rw-------  1 www-data www-data         0 Mar  6 20:50 lRkrEo
-rw-------  1 www-data www-data         0 Mar  2 00:35 MdLJjP
while true ; do sudo lsof +D /tmp/ | egrep '/tmp/\w{6}$' ; done
hhvm    1443 www-data   79r   REG  253,0   3137639 6818806 /tmp/DJdNDv
hhvm    1443 www-data  270u   REG  253,0         0 6818806 /tmp/cxttuh
hhvm    1443 www-data  326u   REG  253,0         0 6819342 /tmp/HmunJS
elukey added a comment.EditedJul 29 2019, 6:18 PM

Following the Debian Bug, it seems that in https://sources.debian.org/src/tmpreaper/1.6.13+nmu1/tmpreaper.c/?hl=452#L422 we could add a simple check to avoid this. From

		if (lstat (ent->d_name, &sb)) {
		    message (LOG_ERROR,
			     "Failed to lstat() `%s/%s': %s\n",
			     dirname, ent->d_name, strerror (errno));
		    continue;
		}

To:

		if (lstat (ent->d_name, &sb)) {
		    if(errno != ENOENT) {
                            message (LOG_ERROR,
                                             "Failed to lstat() `%s/%s': %s\n",
			                      dirname, ent->d_name, strerror (errno));
                    }
		    continue;
		}

The patch could be applied, the package rebuilt and then uploaded to stretch-wikimedia (waiting for upstream to fix it). Thoughts?

@elukey I am up for attempting to patch it and upload to stretch-wikimedia, I will try to do it next week

jijiki moved this task from Backlog to Next up on the serviceops board.
elukey added a subscriber: faidon.Jul 30 2019, 11:28 AM

I have created tmpreaper_1.6.13+nmu1+deb9u1+wmf1_amd64.deb on boron, with the following patch:

elukey@boron:~/tmpreaper-1.6.13+nmu1+deb9u1$ cat patches/no-log-enoent.patch
Index: tmpreaper-1.6.13+nmu1+deb9u1/tmpreaper.c
===================================================================
--- tmpreaper-1.6.13+nmu1+deb9u1.orig/tmpreaper.c
+++ tmpreaper-1.6.13+nmu1+deb9u1/tmpreaper.c
@@ -545,9 +545,11 @@ cleanupDirectory (const char	   * dirnam
                     strcpy(dst, ent->d_name);
                     strcat(dst, "/.");
 		    if (lstat (dst, &sb2)) {
-			message (LOG_ERROR,
-				 "Failed to lstat() `%s/%s/.': %s\n",
-				 dirname, ent->d_name, strerror (errno));
+			if (errno != ENOENT) {
+				message (LOG_ERROR,
+					 "Failed to lstat() `%s/%s/.': %s\n",
+					 dirname, ent->d_name, strerror (errno));
+			}
 			free(dst);
 			continue;
 		    }

If the version number is acceptable we could test it, but then there is another question - do we want to maintain another package like tmpreaper? This means to rebase security patches on top of ours, etc.. @faidon, any suggestion ? :)

We can start by responding to Debian bug #763858 with your fix and see if the maintainer is willing to incorporate this!

We can start by responding to Debian bug #763858 with your fix and see if the maintainer is willing to incorporate this!

I wanted to do some testing first on a couple of nodes to make sure that it works properly, but yeah sounds like a good idea :) I'll coordinate with @jijiki!

The patch seems sane, but I'm wondering whether we actually need to pursue this further? tmpreaper is dead upstream (the Debian maintainer keeps it alive a little for security fixes, but the origin of the codebase is a 20 years old tmpwatch RPM from Red Hat) and has significant bit rot on modern systems (https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=881725). Notably we only use it on app servers, it seems to have been added back in 2015 to address core dumps from HHVM clogging up /tmp.

Given that we're moving away from HHVM, we can simply remove tmpreaper usage along with it? Arguably PHP might also dump core, but that applies to every other service we run, so in that case we should rather look into a solution which monitors/mitigates excessive /tmp usage in general.

elukey added a comment.Aug 5 2019, 8:11 AM

The patch seems sane, but I'm wondering whether we actually need to pursue this further? tmpreaper is dead upstream (the Debian maintainer keeps it alive a little for security fixes, but the origin of the codebase is a 20 years old tmpwatch RPM from Red Hat) and has significant bit rot on modern systems (https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=881725). Notably we only use it on app servers, it seems to have been added back in 2015 to address core dumps from HHVM clogging up /tmp.
Given that we're moving away from HHVM, we can simply remove tmpreaper usage along with it? Arguably PHP might also dump core, but that applies to every other service we run, so in that case we should rather look into a solution which monitors/mitigates excessive /tmp usage in general.

Makes sense!

@elukey @MoritzMuehlenhoff I have added tmpreaper's removal as an actionable in our HHVM removal process (T229792), shall we mark this as resolved or invalid?

Toolforge/Toollabs also uses tmpreaper (but not the puppetised version with the tmpreaper Puppet class). I'm adding @Andrew and @aborrero for comments whether we should keep it open for this or whether it's not worth tracking there.

jijiki moved this task from Next up to Externally Blocked on the serviceops board.Aug 6 2019, 9:54 AM
Andrew added a comment.Aug 6 2019, 2:49 PM

From the man page:

Unless  your  machine  is  one  with  lots  of  relatively
       untrusted users, such as an ISP or school, you don't need this program; `find ... -exec rm
       ...' works just as well when you don't have to be concerned about people trying to exploit
       the race condition on you.

So that suggests that it's probably wise to keep using it on cloud VMs, as long as it still works. That said, I'm not sure that we couldn't just > /dev/null the cron job.

elukey added a comment.Aug 7 2019, 6:35 AM

So that suggests that it's probably wise to keep using it on cloud VMs, as long as it still works. That said, I'm not sure that we couldn't just > /dev/null the cron job.

The main reason would be that any failure to reported by tmpreaper would be completely ignored, and if we consider it important it might not be the best. Anyway, I am fine with any solution that we want to follow, no strong opinions :)