Page MenuHomePhabricator

lighttpd does not correctly close connections (CLOSE_WAIT)
Closed, ResolvedPublic

Description

This happened for croptool on 5 jul 2015:

Last lines in error.log:

[2015-07-05 03:34:03] exiftool.INFO: Exiftool executes command /data/project/croptool/vendor/phpexiftool/exiftool/exiftool -overwrite_original -quiet -TagsFromFile '/data/project/croptool/public_html//files/c1ba468a60b8a2fdbe02cdd36721d9fedf2f8835.jpg' -all:all '/data/project/croptool/public_html//files/c1ba468a60b8a2fdbe02cdd36721d9fedf2f8835_cropped.jpg' [] []
2015-07-05 11:48:57: (server.c.1398) [note] sockets disabled, connection limit reached

Last lines in access.log:

10.68.17.145 tools.wmflabs.org - [05/Jul/2015:07:10:04 +0000] "GET /croptool/backend.php?action=exists&site=commons.wikimedia.org&title=Florazolam%20(cropped).jpg HTTP/1.1" 200 62 "https://tools.wmflabs.org/croptool/?title=Florazolam.jpg" "Mozilla/5.0 (X11; Linux i686; rv:31.0) Gecko/20100101 Firefox/31.0 Iceweasel/31.7.0"
10.68.17.145 tools.wmflabs.org - [05/Jul/2015:07:12:38 +0000] "HEAD /croptool/ HTTP/1.1" 200 0 "-" "Mozilla/5.0+(compatible; UptimeRobot/2.0; http://www.uptimerobot.com/)"
10.68.17.145 tools.wmflabs.org - [05/Jul/2015:07:16:45 +0000] "GET /croptool/ HTTP/1.1" 200 21933 "-" "Domain Re-Animator Bot (http://domainreanimator.com) - support@domainreanimator.com"
10.68.17.145 tools.wmflabs.org - [05/Jul/2015:07:17:38 +0000] "HEAD /croptool/ HTTP/1.1" 200 0 "-" "Mozilla/5.0+(compatible; UptimeRobot/2.0; http://www.uptimerobot.com/)"
10.68.17.145 tools.wmflabs.org - [05/Jul/2015:07:22:38 +0000] "HEAD /croptool/ HTTP/1.1" 200 0 "-" "Mozilla/5.0+(compatible; UptimeRobot/2.0; http://www.uptimerobot.com/)"

Note that the 'connection limit reached' message is way later than any requests in access.log.

There's a whole set of connections to the proxy open:

tcp        1      0 tools-webgrid-lighttpd-1210.tools.eqiad.wmflabs:56235 tools-webproxy-02.tools.eqiad.wmflabs:41828 CLOSE_WAIT  tools.croptool 17100020
tcp        1      0 tools-webgrid-lighttpd-1210.tools.eqiad.wmflabs:56235 tools-webproxy-02.tools.eqiad.wmflabs:56003 CLOSE_WAIT  tools.croptool 17135240
tcp        1      0 tools-webgrid-lighttpd-1210.tools.eqiad.wmflabs:56235 tools-webproxy-02.tools.eqiad.wmflabs:33872 CLOSE_WAIT  tools.croptool 17084311
tcp        1      0 tools-webgrid-lighttpd-1210.tools.eqiad.wmflabs:56235 tools-webproxy-02.tools.eqiad.wmflabs:54753 CLOSE_WAIT  tools.croptool 17150293
tcp        1      0 tools-webgrid-lighttpd-1210.tools.eqiad.wmflabs:56235 tools-webproxy-02.tools.eqiad.wmflabs:44893 CLOSE_WAIT  tools.croptool 17129109

So this suggests either someone keeping the connection open, or the webproxy misbehaving and doing that.

$ sudo netstat -e -v -W 2>/dev/null | grep croptool | grep proxy | cut -b146- > croptools-inodes
$ for i in `cat croptools-inodes`; do sudo find /proc/6703/fd -lname "socket:\[$i\]" -printf %A@; echo; done > croptools-timestamps
$ for i in `cat croptools-timestamps`; do date "+%d/%b/%Y:%H:%M:%S" --date="@$i"; done > croptools-formatteddates

Strangely enough, most of the connections seem to be later than the last access.log entry, /and/ they seem to be batched:

$ cat croptools-formatteddates | sort | uniq -c
      1 04/Jul/2015:23:09:01
     15 05/Jul/2015:07:39:01
     31 05/Jul/2015:08:09:01
     66 05/Jul/2015:09:09:01
     54 05/Jul/2015:10:09:01
     38 05/Jul/2015:10:39:01
     37 05/Jul/2015:11:09:01
     35 05/Jul/2015:11:39:01
     22 05/Jul/2015:11:58:34

with a whole batch of connections every half-hour.

Tools-webproxy-02 doesn't seem to see this connection, though, so I'm a bit at a loss to what's happening here.

Restarting the webservice solved the issue.

Using

#!/bin/bash
for i in `qconf -sel | grep webgrid`
do
    echo $i
    echo --------------------
    ssh $i "sudo netstat -e -v -W 2>/dev/null | grep CLOSE_WAIT | sed -e 's/.*CLOSE_WAIT\s*//' | cut -d' ' -f1 | sort | uniq -c"
    echo --------------------
    echo
done

we see a few servers have a few (<=5) connections open, which is probably just actual traffic. However, two tools have large amounts of open connections:"

tools-webgrid-lighttpd-1402.eqiad.wmflabs
--------------------
    103 tools.blockcalc
--------------------
(...)
tools-webgrid-lighttpd-1408.eqiad.wmflabs
--------------------
    150 tools.geohack
--------------------

Event Timeline

valhallasw raised the priority of this task from to Needs Triage.
valhallasw updated the task description. (Show Details)
valhallasw added projects: Tools, Toolforge.
valhallasw added subscribers: valhallasw, Danmichaelo, Stemoc.
Restricted Application added a subscriber: Aklapper. · View Herald Transcript

The nginx/webproxy log doesn't help too much. I think the dates/times mentioned above might just be wrong (I based the find on https://superuser.com/questions/565991/how-to-determine-the-socket-connection-up-time-on-linux)

As for the socket state,
"CLOSE_WAIT means that the local end of the connection has received a FIN from the other end, but the OS is waiting for the program at the local end to actually close its connection."

so that suggests this is lighttpd forgetting to close connections? not sure...

valhallasw renamed this task from croptools down to lighttpd does not correctly close connections (CLOSE_WAIT).Jul 6 2015, 2:08 PM
valhallasw updated the task description. (Show Details)
valhallasw removed a project: Tools.
valhallasw set Security to None.
valhallasw updated the task description. (Show Details)

Strange. Same problem today. 298 CLOSE_WAIT connections at tools-webgrid-lighttpd-1206.eqiad.wmflabs

tcp        1      0 tools-webgrid-lighttpd-1206.tools.eqiad.wmflabs:41146 tools-webproxy-02.tools.eqiad.wmflabs:37825 CLOSE_WAIT  tools.croptool 26528931
tcp        1      0 tools-webgrid-lighttpd-1206.tools.eqiad.wmflabs:41146 tools-webproxy-02.tools.eqiad.wmflabs:53480 CLOSE_WAIT  tools.croptool 26538356
tcp        1      0 tools-webgrid-lighttpd-1206.tools.eqiad.wmflabs:41146 tools-webproxy-02.tools.eqiad.wmflabs:56666 CLOSE_WAIT  tools.croptool 26538419

and equally many

unix  3      [ ]         STREAM     CONNECTING    0        /tmp/php.socket.croptool-0
unix  3      [ ]         STREAM     CONNECTING    0        /tmp/php.socket.croptool-1
unix  3      [ ]         STREAM     CONNECTING    0        /tmp/php.socket.croptool-0

Restart helped.

Note that I enabled https://tools.wmflabs.org/croptool/server-statistics . When the same problem occured six days ago, the status page was still accessible and showed > 200 'fastcgi.active-requests'.

So I added a small script that checks the value of 'fastcgi.active-requests' every minute. If the number exceeds 150, it restarts the server and writes a log entry (to 'check-server.log'). No new anomalies since then.

Happened for tools.wikidata-game today.

This happened for tools.meta today.

The seven-year-old issue http://redmine.lighttpd.net/issues/1829 might actually be the same one. I think I ignored it last time (because it's 7 years old), but we are actually running lighttpd 1.4.19 on tool labs.

One simple test that gives me somewhat comparable behavior is a php sleep, with the client disconnecting. That gives me a CLOSE_WAIT until php returns from the sleep(). Could it be that there are php processes in the background waiting for something? e.g. a slow database query?

An idea would be to enable mod_status to check in what state lighttpd thinks the connection is.

fwiw, this seems to be a fairly regular occurrence for some tools. I grepped all error.logs for "connection limit reached", and these tools seem to be regulars:

 263 _data_project_catscan2_error.log.conlim
 878 _data_project_joanjoc_error.log.conlim
6323 _data_project_geohack_error.log.conlim
7457 _data_project_persondata_error.log.conlim
9226 _data_project_spellcheck_error.log.conlim

Although not all of them have recent issues.

This killed erwin85 tools today.

2016-01-21 18:47:39: (log.c.166) server started
2016-01-21 19:31:10: (log.c.166) server started
2016-01-21 19:31:44: (server.c.1512) server stopped by UID = 0 PID = 23274
2016-01-21 19:47:57: (server.c.1512) server stopped by UID = 0 PID = 3845
/usr/local/bin/lighttpd-starter: line 41: /var/run/lighttpd/erwin85.conf~: Permission denied
/usr/local/bin/lighttpd-starter: line 88: /var/run/lighttpd/erwin85.conf~: Permission denied
/usr/local/bin/lighttpd-starter: line 96: /var/run/lighttpd/erwin85.conf~: Permission denied
mv: accessing `/var/run/lighttpd/erwin85.conf': Permission denied
2016-01-21 19:47:59: (configfile.c.958) opening configfile  /var/run/lighttpd/erwin85.conf failed: Permission denied
/usr/local/bin/lighttpd-starter: line 41: /var/run/lighttpd/erwin85.conf~: Permission denied
/usr/local/bin/lighttpd-starter: line 88: /var/run/lighttpd/erwin85.conf~: Permission denied
/usr/local/bin/lighttpd-starter: line 96: /var/run/lighttpd/erwin85.conf~: Permission denied
mv: accessing `/var/run/lighttpd/erwin85.conf': Permission denied
[...]
2016-01-21 22:02:41: (configfile.c.958) opening configfile  /var/run/lighttpd/erwin85.conf failed: Permission denied
/usr/local/bin/lighttpd-starter: line 41: /var/run/lighttpd/erwin85.conf~: Permission denied
/usr/local/bin/lighttpd-starter: line 88: /var/run/lighttpd/erwin85.conf~: Permission denied
/usr/local/bin/lighttpd-starter: line 96: /var/run/lighttpd/erwin85.conf~: Permission denied
mv: accessing `/var/run/lighttpd/erwin85.conf': Permission denied
2016-01-21 22:13:38: (configfile.c.958) opening configfile  /var/run/lighttpd/erwin85.conf failed: Permission denied
2016-01-21 22:14:12: (log.c.166) server started
2016-01-25 02:26:36: (server.c.1398) [note] sockets disabled, connection limit reached
2016-01-25 11:28:29: (server.c.1512) server stopped by UID = 0 PID = 19052
2016-01-25 11:28:32: (log.c.166) server started

Nothing suspicious happened at the time:

10.68.21.49 tools.wmflabs.org - [24/Jan/2016:21:00:20 +0000] "GET /erwin85/randomarticle.php?lang=en&family=wikipedia&categories=Physics HTTP/1.1" 302 3283 "-" "Faraday v0.8.9"
10.68.21.49 tools.wmflabs.org - [25/Jan/2016:01:58:44 +0000] "GET /erwin85 HTTP/1.1" 301 0 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
10.68.21.49 tools.wmflabs.org - [25/Jan/2016:02:26:10 +0000] "GET /erwin85/css/Tool.css HTTP/1.1" 200 235 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

Yahoo and Google crawlers driving the webservers crazy?

FYI: lighttpd 1.4.40 (released yesterday) addresses "lighttpd does not correctly close connections (CLOSE_WAIT)" and has more robust connection handling and cleanup, as well as detection and propagation of client aborts.

That's great news! Unfortunately, debian is not bundling 1.4.40 yet, even in sid , so we will have to wait for that to happen, unless we find the time to build-and-test packages ourselves...

FYI: lighttpd 1.4.45 is in sid unstable
https://buildd.debian.org/status/package.php?p=lighttpd&suite=sid
and should be in stretch in another week or so.

lighttpd 1.4.45 is on target to be part of the next Debian major release.

Alternatively, you can use Debian packages from stbuehler (another lighttpd maintainer)
https://build.opensuse.org/package/show/home:stbuehler:lighttpd-1.4.x-nightlies/lighttpd

Alternatively, you can use Debian packages from stbuehler (another lighttpd maintainer)
https://build.opensuse.org/package/show/home:stbuehler:lighttpd-1.4.x-nightlies/lighttpd

Unfortunately, tool labs doesn't use debs from those sources; please see local package policy.

Backporting sid sounds more likely.

bd808 claimed this task.
bd808 subscribed.

We are using lighttpd/1.4.45 on the job grid as of https://wikitech.wikimedia.org/wiki/News/Toolforge_Trusty_deprecation