
Query service fails with "Too many open files"
Closed, ResolvedPublic

Description

All queries on the Wikidata query service are currently returning Error 502 (bad gateway).

This was previously reported on [[d:WD:DEV#Query_service_returns_502_Bad_Gateway_for_all_queries]].

Event Timeline

Restricted Application added subscribers: Zppix, Aklapper.

Investigating:

  • May 03 09:33:48 wdqs1002 bash[13378]: java.io.IOException: Too many open files
  • May 03 09:33:48 wdqs1002 bash[13378]: at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
  • High number of TCP connections in TIME_WAIT / CLOSE

Restarting blazegraph and the updater seems to fix the issue, but I need to dig more into this, if only to have better metrics to understand what is happening next time.
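
For reference, the kind of check involved here is counting open file descriptors per JVM from /proc. A minimal sketch, not taken from the task (the pgrep filter is an assumption, and sudo is needed to read another user's /proc/<pid>/fd):

  # count open file descriptors per java process
  for pid in $(pgrep java); do
      echo "pid $pid: $(sudo ls /proc/$pid/fd | wc -l) open FDs"
  done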

hoo renamed this task from Query service is down to Query service fails with "Too many open files". May 3 2016, 12:24 PM

WDQS down again. 320K file handles taken by Java processes. More investigation needed.

With the updater shut down, the number of open pipes seems to stay stable at 5-8K. I did not take those measurements for long enough to be confident about the correlation, but it's at least a lead. Of course, keeping the updater down is not a solution...
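
A way to count just the pipe FDs, as a sketch (<pid> is a placeholder for the updater's process id; pipes show up as FIFO entries in lsof):

  # count pipe-type file descriptors held by a given process
  sudo lsof -p <pid> 2>/dev/null | grep -c FIFO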

Mentioned in SAL [2016-05-03T17:18:08Z] <gehel> restarting blazegraph on wdqs1002 (T134238)

@Gehel what about unsetting the file descriptor limit for wdqs? If those are due to real use and not an FD leak, I mean.
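
For context, raising (rather than removing) the limit would look roughly like the sketch below; the unit name wdqs-blazegraph and the value 65536 are assumptions, not how the service is actually deployed:

  # bump the FD limit for the blazegraph service via a systemd override
  sudo mkdir -p /etc/systemd/system/wdqs-blazegraph.service.d
  printf '[Service]\nLimitNOFILE=65536\n' | \
      sudo tee /etc/systemd/system/wdqs-blazegraph.service.d/limits.conf
  sudo systemctl daemon-reload && sudo systemctl restart wdqs-blazegraph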

Previous analysis is mostly wrong. File descriptors seem to be under control and are NOT jumping from 5k to 100k (if that were the case, I would be wary of raising the limit that far).

Looks like the updater is slowly leaking FDs, at a rate of 2-3 per minute. Eventually the leaked FDs accumulate, but for now restarting the updater every day or two should keep the problem away. I'll try to see where it comes from.
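
A quick way to confirm that leak rate is to sample the updater's FD count once a minute. A sketch, assuming the updater process can be found by matching runUpdate.sh on its command line:

  # print a timestamped FD count for the updater every minute
  pid=$(pgrep -of runUpdate.sh)
  while true; do
      echo "$(date -u +%H:%M:%S) $(sudo ls /proc/$pid/fd | wc -l)"
      sleep 60
  done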

Mentioned in SAL [2016-05-03T21:11:17Z] <gehel> restarting wdqs1002 (T134238)

So the FD issue is probably a side issue that was detected because we were looking more closely than usual. The main cause of this outage is most probably a traffic peak, as seen in Grafana, made worse by the fact that we run on a single server. If that's really the case, it shows that we run too close to our load limit (peaks are around 2-3x normal load).

Mentioned in SAL [2016-05-04T12:34:11Z] <gehel> restarting blazegraph (T134238)

Mentioned in SAL [2016-05-04T16:19:14Z] <gehel> restarting blazegraph (T134238)

Probably 0 at the moment - I don't think there are any interactive graphs out there that use SPARQL yet. Only a few non-interactive samples (btw, there is another bug there - the order of latitude & longitude seems to have been switched).

I looked into it myself very briefly today (although I'm constantly distracted as I'm at a conference).

When the service via query.wikidata.org is slow or even down, I usually still get very fast responses when querying the server locally (via curl on the host, using the nginx on port 80). I suspect we have a problem where either nginx is not allowing more than N active workers for the misc web Varnish (while still allowing them for other hosts, or specifically for localhost), or Varnish doesn't want to keep more than N open connections to the service and thus makes the other connections wait.
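
For reference, the local check described above looks roughly like this (the /sparql path and the Host header are assumptions about the local nginx setup):

  # query the local nginx directly, bypassing Varnish
  curl -sG 'http://localhost/sparql' \
       -H 'Host: query.wikidata.org' \
       -H 'Accept: application/sparql-results+json' \
       --data-urlencode 'query=SELECT * WHERE { ?s ?p ?o } LIMIT 1'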

Random things I see (and have no idea if they are relevant to the issue):

  • We have holes in the Graphite data for unrelated metrics (for example load average). Those holes do not completely coincide with blazegraph being unavailable, but there seems to be some correlation.
  • Thread dumps taken while blazegraph was unresponsive show only a few threads doing actual work, no stuck threads and no deadlocks at the JVM level (see the jstack sketch after this list).
  • blazegraph stops sending any logs (unusual; blazegraph is usually pretty verbose).
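
The thread dumps mentioned in the list can be taken with standard JDK tooling, roughly as follows (the pgrep pattern and the blazegraph user are assumptions):

  # take a thread dump of the blazegraph JVM without restarting it
  pid=$(pgrep -of blazegraph)
  sudo -u blazegraph jstack -l "$pid" > /tmp/blazegraph-threads.$(date +%s).txt
  # alternatively, "kill -3 $pid" makes the JVM print the dump to its own stdout/stderr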

The number of established connections on nginx is low, and they all seem to be coming from Varnish (I checked a sample over a few minutes):

gehel@wdqs1002:/etc/nginx$ sudo netstat -pn | grep 10.64.32.183:80 | grep ESTABLISHED | wc -l
3

The number of connections in TIME_WAIT is consistent with the number of requests seen in the last minute in the nginx access.log.
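
A sketch of the kind of cross-check used here (the access.log path and timestamp format assume a standard nginx setup):

  # local port-80 connections grouped by TCP state
  sudo netstat -tn | grep ':80 ' | awk '{print $6}' | sort | uniq -c

  # requests logged by nginx during the previous minute
  grep -c "$(date -u -d '1 minute ago' '+%d/%b/%Y:%H:%M')" /var/log/nginx/access.log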

TIME_WAIT seems to be normal, but I also saw a very high number of CLOSE_WAIT connections when Blazegraph was stalling. I'm not sure if that's the cause or the symptom, but this seems to be what was keeping the file handles open and causing "too many open files".

CLOSE_WAIT is when the TCP stack is waiting for the application to acknowledge that the connection has been terminated. So yes, a high number of CLOSE_WAIT connections is an indication of a problem. If blazegraph (or Jetty, more precisely) is frozen, that could explain it. If I read the code / stack traces correctly, blazegraph uses only async IO, so the reasons I can see for a delay in acknowledging CLOSE_WAIT are the HTTP thread pool being overloaded or the JVM being completely frozen (GC going crazy?). There are probably other scenarios, but I can't think of any at the moment.
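
For reference, two checks that would help distinguish those scenarios (the pgrep pattern and the gc.log path are assumptions; the GC flags are the usual ones for the Java 7/8 JVMs of that era):

  # CLOSE_WAIT sockets held specifically by the blazegraph process
  sudo lsof -n -p "$(pgrep -of blazegraph)" | grep -c CLOSE_WAIT

  # GC logging options to add to the blazegraph JVM to rule out a GC storm:
  #   -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/wdqs/gc.log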

See T134523 for the cause of FD leaks in Updater.

Looks like the real cause is here: https://jira.blazegraph.com/browse/BLZG-1884

I'll add the patch as soon as it's released.

Smalyshev added a subscriber: aude.
Smalyshev added a subscriber: Gehel.

Mentioned in SAL [2016-05-07T12:32:44Z] <hoo> Restarted blazegraph on wdqs1002 (Unresponsive, even locally: java.io.IOException: Too many open files) T134238