Page MenuHomePhabricator

Grid Engine lighttpd error "opening temp-file failed: No such file or directory"
Closed, ResolvedPublicSpike

Description

The webservice error.log is filling with these errors:

2019-03-07 01:14:54: (chunk.c.553) opening temp-file failed: No such file or directory

Something about our temp files being cleaned up while still in use? It seems that the files in ~/app/var/cache/images/ are okay. I'm sure it's possible to make the error show us something more informative.


Analysis in T217815#5577987 points to tmpreaper removing the Grid Engine managed TMPDIR once empty of other files.

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript

Some files fail to load (the response is truncated with no error). e.g. https://tools.wmflabs.org/svgtranslate/File:Hawaii_Island_topographic_map-fr.svg

These don't seem to result in the above error though, so maybe that's a separate issue.

Samwilson claimed this task.

I restarted the web service and this seems to have gone away. If it comes back I'll reopen this.

Are we sure that the temp files that get opened are closed or "released?" Maybe it's not so much that it can't find an existing temp file but instead can't get a new file handle because they've been used up?

Niharika renamed this task from opening temp-file failed: No such file or directory to [4 hours] opening temp-file failed: No such file or directory.Mar 19 2019, 11:25 PM
Niharika triaged this task as Medium priority.
Niharika added a project: Spike.
Restricted Application changed the subtype of this task from "Task" to "Spike". · View Herald TranscriptJun 16 2019, 8:30 AM

I has this problem too. Solved by restarting webservice.

I, too, had this problem with https://tools.wmflabs.org/global-search. Just as suggested here, restarting the webservice did the trick. Whatever the issue is, it's certainly not unique to SVG Translate.

I'm having the same issue and opened T225966 for this yesterday.

I'm not currently working on this.

I'm having the same problem with the wikiwho tool. Responses get truncated at 64 KB. If I make requests that give smaller responses, everything works fine.

@bd808 From my research this has something to do with Nginx (which I think we use?) and/or fastcgi. Solutions point to a permissions issue, just as the error message suggests:

This makes me believe we need Toolforge admin intervention. In the case of wikiwho, we're not writing to any files directly. It's just a simple proxy script, see https://github.com/wikimedia/WhoWroteThat/blob/a44adeeaeff3b4da9f1b5264ef844aedcd6b3f15/public_html/wikiWhoProxy.php for the source. Instead, it seems the webserver uses temporary files when the response is over 64 KB, and it can't create any.

Any ideas?

@bd808 From my research this has something to do with Nginx (which I think we use?) and/or fastcgi.

Nginx is used as the "front proxy" in Toolforge (and Cloud VPS), but it is not used inside of any of the webservice managed services. Webservices use lighttpd instead.

Webservices using PHP do use /usr/bin/php-cgi as a fastcgi container running under the lighttpd parent process. The lighttpd configuration for this looks something like:

fastcgi.server += ( ".php" =>
        ((
                "bin-path" => "/usr/bin/php-cgi",
                "socket" => "/var/run/lighttpd/php.socket.wikiwho",
                "max-procs" => 2,
                "bin-environment" => (
                        "PHP_FCGI_CHILDREN" => "2",
                        "PHP_FCGI_MAX_REQUESTS" => "500"
                ),
                "bin-copy-environment" => (
                        "PATH", "SHELL", "USER"
                ),
                "broken-scriptfilename" => "enable",
                "allow-x-send-file" => "enable"
         ))
)

The $HOME/error.log file for a tool is written by lighttpd, so an error message in that file like 2019-03-07 01:14:54: (chunk.c.553) opening temp-file failed: No such file or directory would be referring to lighttpd's chunk.c source file.

Solutions point to a permissions issue, just as the error message suggests:

This makes me believe we need Toolforge admin intervention. In the case of wikiwho, we're not writing to any files directly. It's just a simple proxy script, see https://github.com/wikimedia/WhoWroteThat/blob/a44adeeaeff3b4da9f1b5264ef844aedcd6b3f15/public_html/wikiWhoProxy.php for the source. Instead, it seems the webserver uses temporary files when the response is over 64 KB, and it can't create any.

According to /data/project/wikiwho/service.manifest, the wikiwho tool is running using the grid engine backend and its lighttpd flavor. This means that the tool is running lighttpd 1.4.45-1 on Debian Stretch. Looking at the source on github, we can see that line 553 of the chunk.c file is indeed a log message for a failure opening a temp file inside the chunkqueue_append_mem_to_tempfile function.

In chunkqueue_append_mem_to_tempfile, the error is logged when (NULL == dst_c && NULL == (dst_c = chunkqueue_get_append_tempfile(dest))) is true. The pseudo global errno at the point of this log event is 2 (ENOENT). This is based on the "No such file or directory" message from strerror(errno) in the log message:

$ errno -s "No such file or directory"
ENOENT 2 No such file or directory

So it appears that chunkqueue_get_append_tempfile(dest) is returning NULL and that something is setting errno=2 during or prior to this call. Tracing through chunkqueue_get_append_tempfile, it looks like the most likely cause of ENOENT would be a failure of the open(2) call which is embedded in one of the mkstemp(3) calls.

Things get a bit muddier from here because of the cq->tempdirs data structure. The file path given to mkstemp might be dynamically constructed from one of the entries in this list or if the list is empty it might be the literal "/var/tmp/lighttpd-upload-XXXXXX". Tracing back through more code in the lighttpd source, it looks like the list of temporary directories is initialized with the contents of the TMPDIR environment variable. Thanks to the magic of the /proc filesystem we can find out what that is in the context of the running lighttpd process:

$ cat /proc/$(cat /var/run/lighttpd/wikiwho.pid)/environ | tr '\0' '\n' | grep TMPDIR
TMPDIR=/tmp/8244039.1.webgrid-lighttpd

It seems likely that mkstemp('/tmp/8244039.1.webgrid-lighttpd/lighttpd-upload-XXXXXX') was the call that failed and set errno=ENOENT. And...

$ ls /tmp/8244039.1.webgrid-lighttpd
ls: cannot access '/tmp/8244039.1.webgrid-lighttpd': No such file or directory

So... why is this directory missing? We can reasonably assume did exist at some point because other directories fitting the /tmp/nnnnnnn.n.webgrid-lighttpd pattern do exist on the grid node in question at the time I am writing this. It turns out that the web exec nodes for the grid engine have tmpreaper installed to clean up leaked files in /tmp (T190185). The current configuration for tmpreaper is:

TMPREAPER_TIME=1d

TMPREAPER_PROTECT_EXTRA=''
TMPREAPER_DIRS='/tmp/.'
TMPREAPER_DELAY='256'
TMPREAPER_ADDITIONALOPTIONS=''

I think what is happening here is that ~1 day after a lighttpd webservice is launched on the grid engine tmpreaper deletes the configured temporary directory for that lighttpd process. This would also nicely explain why restarting the webservice "fixes" the problem (for about 24 hours).

Change 543266 had a related patch set uploaded (by BryanDavis; owner: Bryan Davis):
[operations/puppet@production] toolforge: exclude grid engine TMPDIR directories from tmpreaper

https://gerrit.wikimedia.org/r/543266

bd808 renamed this task from [4 hours] opening temp-file failed: No such file or directory to Grid Engine lighttpd error "opening temp-file failed: No such file or directory".Oct 16 2019, 2:58 AM
bd808 updated the task description. (Show Details)

Basic test of the concepts applied in the proposed configuration change:

$ mkdir /tmp/bd808-tmpreaper-test
$ cd /tmp/bd808-tmpreaper-test
$ touch -at 1910150000 delete_me
$ mkdir keep_this_dir
$ touch -at 1910150000 keep_this_dir
$ touch -at 1910150000 keep_this_dir/delete_me_too
$ find .
.
./keep_this_dir
./keep_this_dir/delete_me_too
./delete_me
$ /usr/sbin/tmpreaper --test --showdeleted -v -v 1d --protect '/tmp/bd808-tmpreaper-test/*_this_{dir}' /tmp/bd808-tmpreaper-test
debug: Grace period is `86400' seconds.
debug: This is PID (23774) being run by UID (3518), EUID (3518).
debug: getcwd() is: /tmp/bd808-tmpreaper-test
(PID 23774) Pretending to clean up directory `/tmp/bd808-tmpreaper-test'.
debug: safe_chdir() : Before chdir(), `dirname' is inode `262150'.
debug: safe_chdir() : Before chdir(), `dirname' on device `64768'.
debug: safe_chdir() : After chdir(), `dirname' is inode `262150'.
debug: safe_chdir() : After chdir(), `dirname' on device `64768'.
debug: (PID 23775) Found directory entry `keep_this_dir'.
(PID 23775) Pretending to clean up directory `keep_this_dir'.
debug: safe_chdir() : Before chdir(), `dirname' is inode `262153'.
debug: safe_chdir() : Before chdir(), `dirname' on device `64768'.
debug: safe_chdir() : After chdir(), `dirname' is inode `262153'.
debug: safe_chdir() : After chdir(), `dirname' on device `64768'.
debug: (PID 23776) Found directory entry `delete_me_too'.
Pretending to remove file `keep_this_dir/delete_me_too'.
rm keep_this_dir/delete_me_too
(PID 23775) Back from recursing down `keep_this_dir'.
Entry matching `--protect' pattern skipped. `/tmp/bd808-tmpreaper-test/keep_this_dir'
debug: (PID 23775) Found directory entry `delete_me'.
Pretending to remove file `/tmp/bd808-tmpreaper-test/delete_me'.
rm /tmp/bd808-tmpreaper-test/delete_me

This appears to make tmpreaper retain otherwise empty directories with stale atime if they match a protected pattern. Stale files inside such protected directories are still cleaned up as desired.

Change 543266 merged by Bstorm:
[operations/puppet@production] toolforge: exclude grid engine TMPDIR directories from tmpreaper

https://gerrit.wikimedia.org/r/543266

Puppet is rolling this out now (and doesn't seem to be doing evil).

Puppet is rolling this out now (and doesn't seem to be doing evil).

@Samwilson, @MusikAnimal, @Mbch331: the fix that I made should prevent the directory removal for new webservice jobs, but it will not retroactively recreate any TMPDIR that was already deleted by tmpreaper. You will need to restart your webservices to ensure that their TMPDIR exists.

I am going to start a webservice on the grid engine backend myself today to use for verifying this fix. Once I have verified that the TMPDIR exists >48 hours after the job is launched (tmpreaper cleans up files older than 24 hours once per day so worst case cleanup time is around 48 hours) I will email cloud-announce to let everyone using the grid engine backend for webservice that we tracked down and fixed this annoying bug.

The fix seems to work!

Screen Shot 2019-10-18 at 14.57.27.png (170×566 px, 28 KB)

Rather than notifying maintainers that the can restart their grid engine based webservices to recreate any TMPDIR that tempreaper may have removed, I am going to use the instructions from https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin#Restarting_all_webservices to restart all grid based webservices.

Mentioned in SAL (#wikimedia-cloud) [2019-10-18T21:29:54Z] <bd808> Rescheduled all grid engine webservice jobs (T217815)

Thanks! I am no longer seeing this error. I'm still having trouble with some very large responses, but it seems it is an unrelated issue. I'll investigate and file a new task if needed.