
Query service fails with "Too many open files"
Closed, ResolvedPublic

Description

All queries on the Wikidata query service are currently returning Error 502 (bad gateway).

This was previously reported on [[d:WD:DEV#Query_service_returns_502_Bad_Gateway_for_all_queries]].

Event Timeline

Restricted Application added subscribers: Zppix, Aklapper.

Investigating:

  • May 03 09:33:48 wdqs1002 bash[13378]: java.io.IOException: Too many open files
  • May 03 09:33:48 wdqs1002 bash[13378]: at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
  • High number of TCP connections in TIME_WAIT / CLOSE

Restarting blazegraph and the updater seems to fix the issue, but I need to dig more into this, if only to have better metrics to understand what is happening next time.
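
For reference, the kind of check involved here is counting open file descriptors per JVM from /proc. A minimal sketch, not taken from the task (the pgrep filter is an assumption, and sudo is needed to read another user's /proc/<pid>/fd):

  # count open file descriptors per java process
  for pid in $(pgrep java); do
      echo "pid $pid: $(sudo ls /proc/$pid/fd | wc -l) open FDs"
  done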

hoo renamed this task from Query service is down to Query service fails with "Too many open files". May 3 2016, 12:24 PM

WDQS down again. 320K file handles taken by Java processes. More investigation needed.

With the updater shut down, the number of open pipes seems to stay stable at 5-8K. I did not take those measurements for long enough to be confident about the correlation, but it's at least a lead. Of course, keeping the updater down is not a solution...
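
A way to count just the pipe FDs, as a sketch (<pid> is a placeholder for the updater's process id; pipes show up as FIFO entries in lsof):

  # count pipe-type file descriptors held by a given process
  sudo lsof -p <pid> 2>/dev/null | grep -c FIFO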

Mentioned in SAL [2016-05-03T17:18:08Z] <gehel> restarting blazegraph on wdqs1002 (T134238)

@Gehel what about unsetting the file descriptor limit for wdqs? If those are due to real use and not an FD leak, I mean.
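
For context, raising (rather than removing) the limit would look roughly like the sketch below; the unit name wdqs-blazegraph and the value 65536 are assumptions, not how the service is actually deployed:

  # bump the FD limit for the blazegraph service via a systemd override
  sudo mkdir -p /etc/systemd/system/wdqs-blazegraph.service.d
  printf '[Service]\nLimitNOFILE=65536\n' | \
      sudo tee /etc/systemd/system/wdqs-blazegraph.service.d/limits.conf
  sudo systemctl daemon-reload && sudo systemctl restart wdqs-blazegraph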

Previous analysis is mostly wrong. File descriptors seem to be under control and are NOT jumping from 5k to 100k (if that were the case, I would be wary of raising the limit that far).

Looks like the updater is slowly leaking FDs, at a rate of 2-3 per minute. Eventually the leaked FDs accumulate, but for now restarting the updater every day or two should keep the problem away. I'll try to see where it comes from.
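
A quick way to confirm that leak rate is to sample the updater's FD count once a minute. A sketch, assuming the updater process can be found by matching runUpdate.sh on its command line:

  # print a timestamped FD count for the updater every minute
  pid=$(pgrep -of runUpdate.sh)
  while true; do
      echo "$(date -u +%H:%M:%S) $(sudo ls /proc/$pid/fd | wc -l)"
      sleep 60
  done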

Mentioned in SAL [2016-05-03T21:11:17Z] <gehel> restarting wdqs1002 (T134238)

So the FD issue is probably a side issue that was detected because we were looking more closely than usual. The main cause of this outage is most probably a traffic peak, as seen in Grafana, made worse by the fact that we run on a single server. If that's really the case, it shows that we run too close to our load limit (peaks are around 2-3x normal load).

Mentioned in SAL [2016-05-04T12:34:11Z] <gehel> restarting blazegraph (T134238)

Mentioned in SAL [2016-05-04T16:19:14Z] <gehel> restarting blazegraph (T134238)

Probably 0 at the moment - I don't think there are any interactive graphs out there that use SPARQL yet. Only a few non-interactive samples (btw, there is another bug there - the order of latitude & longitude seems to have been switched).

I looked into it myself very briefly today (although I'm constantly distracted as I'm at a conference).

When the service via query.wikidata.org is slow or even down, I usually still get very fast responses when querying the server locally (via curl on the host, using the nginx on port 80). I suspect we have a problem where either nginx is not allowing more than N active workers for the misc web Varnish (while still allowing them for other hosts, or specifically for localhost), or Varnish doesn't want to keep more than N open connections to the service and thus makes the other connections wait.
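
For reference, the local check described above looks roughly like this (the /sparql path and the Host header are assumptions about the local nginx setup):

  # query the local nginx directly, bypassing Varnish
  curl -sG 'http://localhost/sparql' \
       -H 'Host: query.wikidata.org' \
       -H 'Accept: application/sparql-results+json' \
       --data-urlencode 'query=SELECT * WHERE { ?s ?p ?o } LIMIT 1'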

Random things I see (and have no idea if they are relevant to the issue):

  • We have holes in the Graphite data for unrelated metrics (for example load average). Those holes do not completely coincide with blazegraph being unavailable, but there seems to be some correlation.
  • Thread dumps taken while blazegraph was unresponsive show only a few threads doing actual work, no stuck threads and no deadlocks at the JVM level (see the jstack sketch after this list).
  • blazegraph stops sending any logs (unusual; blazegraph is usually pretty verbose).
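
The thread dumps mentioned in the list can be taken with standard JDK tooling, roughly as follows (the pgrep pattern and the blazegraph user are assumptions):

  # take a thread dump of the blazegraph JVM without restarting it
  pid=$(pgrep -of blazegraph)
  sudo -u blazegraph jstack -l "$pid" > /tmp/blazegraph-threads.$(date +%s).txt
  # alternatively, "kill -3 $pid" makes the JVM print the dump to its own stdout/stderr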

The number of established connections on nginx is low, and they all seem to be coming from Varnish (I checked a sample over a few minutes):

gehel@wdqs1002:/etc/nginx$ sudo netstat -pn | grep 10.64.32.183:80 | grep ESTABLISHED | wc -l
3

The number of connections in TIME_WAIT is consistent with the number of requests seen in the last minute in the nginx access.log.
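
A sketch of the kind of cross-check used here (the access.log path and timestamp format assume a standard nginx setup):

  # local port-80 connections grouped by TCP state
  sudo netstat -tn | grep ':80 ' | awk '{print $6}' | sort | uniq -c

  # requests logged by nginx during the previous minute
  grep -c "$(date -u -d '1 minute ago' '+%d/%b/%Y:%H:%M')" /var/log/nginx/access.log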

TIME_WAIT seems to be normal, but I also saw a very high number of CLOSE_WAIT connections when Blazegraph was stalling. I'm not sure if that's the cause or the symptom, but this seems to be what was keeping the file handles open and causing "too many open files".

CLOSE_WAIT is when the TCP stack is waiting for the application to acknowledge that the connection has been terminated. So yes, a high number of CLOSE_WAIT connections is an indication of a problem. If blazegraph (or Jetty, more precisely) is frozen, that could explain it. If I read the code / stack traces correctly, blazegraph uses only async IO, so the reasons I can see for a delay in acknowledging CLOSE_WAIT are the HTTP thread pool being overloaded or the JVM being completely frozen (GC going crazy?). There are probably other scenarios, but I can't think of any at the moment.
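
For reference, two checks that would help distinguish those scenarios (the pgrep pattern and the gc.log path are assumptions; the GC flags are the usual ones for the Java 7/8 JVMs of that era):

  # CLOSE_WAIT sockets held specifically by the blazegraph process
  sudo lsof -n -p "$(pgrep -of blazegraph)" | grep -c CLOSE_WAIT

  # GC logging options to add to the blazegraph JVM to rule out a GC storm:
  #   -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/wdqs/gc.log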

See T134523 for the cause of FD leaks in Updater.

Looks like the real cause is here: https://jira.blazegraph.com/browse/BLZG-1884

I'll add the patch as soon as it's released.

Smalyshev added a subscriber: aude.
Smalyshev added a subscriber: Gehel.

Mentioned in SAL [2016-05-07T12:32:44Z] <hoo> Restarted blazegraph on wdqs1002 (Unresponsive, even locally: java.io.IOException: Too many open files) T134238