Grid Engine lighttpd error "opening temp-file failed: No such file or directory"
Closed, ResolvedPublicSpike
Actions

Assigned To

Authored By

	Samwilson
	Mar 7 2019, 1:17 AM

Description

The webservice error.log is filling with these errors:

2019-03-07 01:14:54: (chunk.c.553) opening temp-file failed: No such file or directory

Something about our temp files being cleaned up while still in use? It seems that the files in ~/app/var/cache/images/ are okay. I'm sure it's possible to make the error show us something more informative.

Analysis in T217815#5577987 points to tmpreaper removing the Grid Engine managed TMPDIR once empty of other files.

Details

	Subject	Repo	Branch	Lines +/-
	toolforge: exclude grid engine TMPDIR directories from tmpreaper	operations/puppet	production	+2 -1

Customize query in gerrit

Related Objects
Search...

		Status	Subtype	Assigned	Task
		Resolved		MusikAnimal	T231492 Add a tool to proxy API requests to WhoColor API
		Resolved	Spike	bd808	T217815 Grid Engine lighttpd error "opening temp-file failed: No such file or directory"

Event Timeline

Samwilson created this task.Mar 7 2019, 1:17 AM

Restricted Application added a project: Community-Tech. · View Herald TranscriptMar 7 2019, 1:17 AM

Restricted Application added a subscriber: Aklapper. · View Herald Transcript

Some files fail to load (the response is truncated with no error). e.g. https://tools.wmflabs.org/svgtranslate/File:Hawaii_Island_topographic_map-fr.svg

These don't seem to result in the above error though, so maybe that's a separate issue.

I restarted the web service and this seems to have gone away. If it comes back I'll reopen this.

Niharika moved this task from Backlog to Done on the SVG Translate Tool board.Mar 12 2019, 5:42 PM

Samwilson mentioned this in T216568: [BUG] Image doesn't render and "Language settings" link disappears.Mar 14 2019, 6:47 AM

Niharika reopened this task as Open.Mar 17 2019, 8:25 PM

Niharika moved this task from New & TBD Tickets to Needs Discussion on the Community-Tech board.

Are we sure that the temp files that get opened are closed or "released?" Maybe it's not so much that it can't find an existing temp file but instead can't get a new file handle because they've been used up?

Niharika renamed this task from opening temp-file failed: No such file or directory to [4 hours] opening temp-file failed: No such file or directory.Mar 19 2019, 11:25 PM

Niharika triaged this task as Medium priority.

Niharika added a project: Spike.

Samwilson moved this task from Done to In progress on the SVG Translate Tool board.Mar 26 2019, 1:50 AM

Samwilson mentioned this in T219183: Login broken on SVG Translate production tool.Mar 26 2019, 2:47 AM

Niharika added a project: Community-Tech-Sprint.Apr 2 2019, 11:36 PM

Niharika moved this task from Needs Discussion to In Sprint 🏃‍♀️🏃‍♂️ on the Community-Tech board.

Samwilson moved this task from Ready to In Development on the Community-Tech-Sprint board.Apr 4 2019, 6:37 AM

Niharika removed projects: Community-Tech-Sprint, Community-Tech.Apr 4 2019, 11:24 PM

MBH merged a task: T225338: Response from my tool on Toolforge being truncated to 64 kb.Jun 16 2019, 8:30 AM

MBH added subscribers: MBH, aborrero, Base.

Restricted Application changed the subtype of this task from "Task" to "Spike". · View Herald TranscriptJun 16 2019, 8:30 AM

I has this problem too. Solved by restarting webservice.

I, too, had this problem with https://tools.wmflabs.org/global-search. Just as suggested here, restarting the webservice did the trick. Whatever the issue is, it's certainly not unique to SVG Translate.

MusikAnimal mentioned this in T225587: Search results page stops loading midway.Jun 16 2019, 11:15 PM

I'm having the same issue and opened T225966 for this yesterday.

Mbch331 added a project: Toolforge.Jun 18 2019, 8:59 PM

MusikAnimal mentioned this in T225966: Toolforge: File upload to PHP webservice fails because temp file can't be opened.Jun 19 2019, 1:47 AM

I'm not currently working on this.

bd808 edited projects, added Tools; removed Toolforge.Aug 8 2019, 3:48 AM

I'm having the same problem with the wikiwho tool. Responses get truncated at 64 KB. If I make requests that give smaller responses, everything works fine.

@bd808 From my research this has something to do with Nginx (which I think we use?) and/or fastcgi. Solutions point to a permissions issue, just as the error message suggests:

This makes me believe we need Toolforge admin intervention. In the case of wikiwho, we're not writing to any files directly. It's just a simple proxy script, see https://github.com/wikimedia/WhoWroteThat/blob/a44adeeaeff3b4da9f1b5264ef844aedcd6b3f15/public_html/wikiWhoProxy.php for the source. Instead, it seems the webserver uses temporary files when the response is over 64 KB, and it can't create any.

Any ideas?

MusikAnimal added a parent task: T231492: Add a tool to proxy API requests to WhoColor API.Oct 15 2019, 5:12 PM

MusikAnimal mentioned this in T231492: Add a tool to proxy API requests to WhoColor API.

In T217815#5576255, @MusikAnimal wrote:

@bd808 From my research this has something to do with Nginx (which I think we use?) and/or fastcgi.

Nginx is used as the "front proxy" in Toolforge (and Cloud VPS), but it is not used inside of any of the webservice managed services. Webservices use lighttpd instead.

Webservices using PHP do use /usr/bin/php-cgi as a fastcgi container running under the lighttpd parent process. The lighttpd configuration for this looks something like:

fastcgi.server += ( ".php" =>
        ((
                "bin-path" => "/usr/bin/php-cgi",
                "socket" => "/var/run/lighttpd/php.socket.wikiwho",
                "max-procs" => 2,
                "bin-environment" => (
                        "PHP_FCGI_CHILDREN" => "2",
                        "PHP_FCGI_MAX_REQUESTS" => "500"
                ),
                "bin-copy-environment" => (
                        "PATH", "SHELL", "USER"
                ),
                "broken-scriptfilename" => "enable",
                "allow-x-send-file" => "enable"
         ))
)

The $HOME/error.log file for a tool is written by lighttpd, so an error message in that file like 2019-03-07 01:14:54: (chunk.c.553) opening temp-file failed: No such file or directory would be referring to lighttpd's chunk.c source file.

Solutions point to a permissions issue, just as the error message suggests:

https://stackoverflow.com/a/4515898/604142

https://serverfault.com/a/824520/190562

This makes me believe we need Toolforge admin intervention. In the case of wikiwho, we're not writing to any files directly. It's just a simple proxy script, see https://github.com/wikimedia/WhoWroteThat/blob/a44adeeaeff3b4da9f1b5264ef844aedcd6b3f15/public_html/wikiWhoProxy.php for the source. Instead, it seems the webserver uses temporary files when the response is over 64 KB, and it can't create any.

According to /data/project/wikiwho/service.manifest, the wikiwho tool is running using the grid engine backend and its lighttpd flavor. This means that the tool is running lighttpd 1.4.45-1 on Debian Stretch. Looking at the source on github, we can see that line 553 of the chunk.c file is indeed a log message for a failure opening a temp file inside the chunkqueue_append_mem_to_tempfile function.

In chunkqueue_append_mem_to_tempfile, the error is logged when (NULL == dst_c && NULL == (dst_c = chunkqueue_get_append_tempfile(dest))) is true. The pseudo global errno at the point of this log event is 2 (ENOENT). This is based on the "No such file or directory" message from strerror(errno) in the log message:

$ errno -s "No such file or directory"
ENOENT 2 No such file or directory

So it appears that chunkqueue_get_append_tempfile(dest) is returning NULL and that something is setting errno=2 during or prior to this call. Tracing through chunkqueue_get_append_tempfile, it looks like the most likely cause of ENOENT would be a failure of the open(2) call which is embedded in one of the mkstemp(3) calls.

Things get a bit muddier from here because of the cq->tempdirs data structure. The file path given to mkstemp might be dynamically constructed from one of the entries in this list or if the list is empty it might be the literal "/var/tmp/lighttpd-upload-XXXXXX". Tracing back through more code in the lighttpd source, it looks like the list of temporary directories is initialized with the contents of the TMPDIR environment variable. Thanks to the magic of the /proc filesystem we can find out what that is in the context of the running lighttpd process:

$ cat /proc/$(cat /var/run/lighttpd/wikiwho.pid)/environ | tr '\0' '\n' | grep TMPDIR
TMPDIR=/tmp/8244039.1.webgrid-lighttpd

It seems likely that mkstemp('/tmp/8244039.1.webgrid-lighttpd/lighttpd-upload-XXXXXX') was the call that failed and set errno=ENOENT. And...

$ ls /tmp/8244039.1.webgrid-lighttpd
ls: cannot access '/tmp/8244039.1.webgrid-lighttpd': No such file or directory

So... why is this directory missing? We can reasonably assume did exist at some point because other directories fitting the /tmp/nnnnnnn.n.webgrid-lighttpd pattern do exist on the grid node in question at the time I am writing this. It turns out that the web exec nodes for the grid engine have tmpreaper installed to clean up leaked files in /tmp (T190185). The current configuration for tmpreaper is:

TMPREAPER_TIME=1d

TMPREAPER_PROTECT_EXTRA=''
TMPREAPER_DIRS='/tmp/.'
TMPREAPER_DELAY='256'
TMPREAPER_ADDITIONALOPTIONS=''

I think what is happening here is that ~1 day after a lighttpd webservice is launched on the grid engine tmpreaper deletes the configured temporary directory for that lighttpd process. This would also nicely explain why restarting the webservice "fixes" the problem (for about 24 hours).

Change 543266 had a related patch set uploaded (by BryanDavis; owner: Bryan Davis):
[operations/puppet@production] toolforge: exclude grid engine TMPDIR directories from tmpreaper

https://gerrit.wikimedia.org/r/543266

gerritbot added a project: Patch-For-Review.Oct 16 2019, 2:50 AM

bd808 claimed this task.Oct 16 2019, 2:55 AM

bd808 edited projects, added cloud-services-team (Kanban); removed Spike.

bd808 merged a task: T225966: Toolforge: File upload to PHP webservice fails because temp file can't be opened.

bd808 added a subscriber: zhuyifei1999.

bd808 renamed this task from [4 hours] opening temp-file failed: No such file or directory to Grid Engine lighttpd error "opening temp-file failed: No such file or directory".Oct 16 2019, 2:58 AM

bd808 updated the task description. (Show Details)

Basic test of the concepts applied in the proposed configuration change:

$ mkdir /tmp/bd808-tmpreaper-test
$ cd /tmp/bd808-tmpreaper-test
$ touch -at 1910150000 delete_me
$ mkdir keep_this_dir
$ touch -at 1910150000 keep_this_dir
$ touch -at 1910150000 keep_this_dir/delete_me_too
$ find .
.
./keep_this_dir
./keep_this_dir/delete_me_too
./delete_me
$ /usr/sbin/tmpreaper --test --showdeleted -v -v 1d --protect '/tmp/bd808-tmpreaper-test/*_this_{dir}' /tmp/bd808-tmpreaper-test
debug: Grace period is `86400' seconds.
debug: This is PID (23774) being run by UID (3518), EUID (3518).
debug: getcwd() is: /tmp/bd808-tmpreaper-test
(PID 23774) Pretending to clean up directory `/tmp/bd808-tmpreaper-test'.
debug: safe_chdir() : Before chdir(), `dirname' is inode `262150'.
debug: safe_chdir() : Before chdir(), `dirname' on device `64768'.
debug: safe_chdir() : After chdir(), `dirname' is inode `262150'.
debug: safe_chdir() : After chdir(), `dirname' on device `64768'.
debug: (PID 23775) Found directory entry `keep_this_dir'.
(PID 23775) Pretending to clean up directory `keep_this_dir'.
debug: safe_chdir() : Before chdir(), `dirname' is inode `262153'.
debug: safe_chdir() : Before chdir(), `dirname' on device `64768'.
debug: safe_chdir() : After chdir(), `dirname' is inode `262153'.
debug: safe_chdir() : After chdir(), `dirname' on device `64768'.
debug: (PID 23776) Found directory entry `delete_me_too'.
Pretending to remove file `keep_this_dir/delete_me_too'.
rm keep_this_dir/delete_me_too
(PID 23775) Back from recursing down `keep_this_dir'.
Entry matching `--protect' pattern skipped. `/tmp/bd808-tmpreaper-test/keep_this_dir'
debug: (PID 23775) Found directory entry `delete_me'.
Pretending to remove file `/tmp/bd808-tmpreaper-test/delete_me'.
rm /tmp/bd808-tmpreaper-test/delete_me

This appears to make tmpreaper retain otherwise empty directories with stale atime if they match a protected pattern. Stale files inside such protected directories are still cleaned up as desired.

bd808 moved this task from Inbox to Doing on the cloud-services-team (Kanban) board.Oct 16 2019, 3:16 AM

Change 543266 merged by Bstorm:
[operations/puppet@production] toolforge: exclude grid engine TMPDIR directories from tmpreaper

https://gerrit.wikimedia.org/r/543266

Puppet is rolling this out now (and doesn't seem to be doing evil).

In T217815#5580832, @Bstorm wrote:

Puppet is rolling this out now (and doesn't seem to be doing evil).

@Samwilson, @MusikAnimal, @Mbch331: the fix that I made should prevent the directory removal for new webservice jobs, but it will not retroactively recreate any TMPDIR that was already deleted by tmpreaper. You will need to restart your webservices to ensure that their TMPDIR exists.

I am going to start a webservice on the grid engine backend myself today to use for verifying this fix. Once I have verified that the TMPDIR exists >48 hours after the job is launched (tmpreaper cleans up files older than 24 hours once per day so worst case cleanup time is around 48 hours) I will email cloud-announce to let everyone using the grid engine backend for webservice that we tracked down and fixed this annoying bug.

The fix seems to work!

Screen Shot 2019-10-18 at 14.57.27.png (170×566 px, 28 KB)

Rather than notifying maintainers that the can restart their grid engine based webservices to recreate any TMPDIR that tempreaper may have removed, I am going to use the instructions from https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin#Restarting_all_webservices to restart all grid based webservices.

Mentioned in SAL (#wikimedia-cloud) [2019-10-18T21:29:54Z] <bd808> Rescheduled all grid engine webservice jobs (T217815)

bd808 closed this task as Resolved.Oct 18 2019, 10:45 PM

Thanks! I am no longer seeing this error. I'm still having trouble with some very large responses, but it seems it is an unrelated issue. I'll investigate and file a new task if needed.

Fix announced on the cloud@lists.wikimedia.org list.

	F30853955: Screen Shot 2019-10-18 at 14.57.27.png
	Oct 18 2019, 9:02 PM

Grid Engine lighttpd error "opening temp-file failed: No such file or directory"Closed, ResolvedPublicSpikeActions

Description

Details

Related ObjectsSearch...

Event Timeline

Grid Engine lighttpd error "opening temp-file failed: No such file or directory"
Closed, ResolvedPublicSpike
Actions

Related Objects
Search...