All queries on the Wikidata query service are currently returning Error 502 (bad gateway).
This was previously reported on [[d:WD:DEV#Query_service_returns_502_Bad_Gateway_for_all_queries]].
Investigating:
Restarting Blazegraph and the updater seems to fix the issue, but I need to dig deeper into this, if only to have better metrics to understand what is happening next time.
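For reference, the restart boils down to something like the following; the unit names are assumptions about how the services are packaged on these hosts, not confirmed in this task:
```
# Restart Blazegraph and the updater (unit names are assumptions).
sudo systemctl restart wdqs-blazegraph
sudo systemctl restart wdqs-updater
```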
WDQS is down again: 320K file handles are held by Java processes. More investigation needed.
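A quick way to see how many file descriptors each Java process holds is to count the entries under /proc; a minimal sketch (run with root so every fd directory is readable):
```
# Count open file descriptors per Java process.
for pid in $(pgrep -f java); do
  echo "$pid: $(sudo ls /proc/$pid/fd | wc -l) fds"
done
```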
With the updater shut down, the number of open pipes seems to stay stable at 5-8K. I did not take those measurements long enough to be confident about the correlation, but it's at least a lead. Of course, keeping the updater down is not a solution...
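To check whether the pipe count really stays flat with the updater down, one could sample it periodically; a rough sketch, assuming the Blazegraph PID is discoverable via pgrep:
```
# Log the number of open pipes held by the Blazegraph process once a minute.
pid=$(pgrep -f blazegraph | head -1)
while true; do
  echo "$(date -u +%FT%TZ) $(sudo ls -l /proc/$pid/fd 2>/dev/null | grep -c 'pipe:')"
  sleep 60
done
```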
Mentioned in SAL [2016-05-03T17:18:08Z] <gehel> restarting blazegraph on wdqs1002 (T134238)
@Gehel what about unsetting the file descriptor limit for wdqs? If those descriptors are due to real use and not an fd leak, I mean.
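If we did want to raise it, a systemd drop-in would be the usual way; a hypothetical sketch only, where the unit name and the value are assumptions, not a confirmed fix:
```
# Hypothetical drop-in raising the fd limit for the Blazegraph service.
sudo mkdir -p /etc/systemd/system/wdqs-blazegraph.service.d
printf '[Service]\nLimitNOFILE=65536\n' | \
  sudo tee /etc/systemd/system/wdqs-blazegraph.service.d/limits.conf
sudo systemctl daemon-reload
sudo systemctl restart wdqs-blazegraph
```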
My previous analysis is mostly wrong. File descriptors seem to be under control and are NOT jumping from 5k to 100k (if they were, I would be wary of raising the limit that far).
Looks like the updater is slowly leaking FDs, at a rate of 2-3 per minute. Eventually the leaked ones accumulate, but for now restarting the updater every day or two would keep the problem at bay. I'll try to see where the leak comes from.
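A crude way to confirm that leak rate is to take two fd counts an hour apart; a sketch, assuming the updater process can be found with pgrep:
```
# Estimate the updater's fd leak rate over one hour.
pid=$(pgrep -f wdqs-updater | head -1)
before=$(sudo ls /proc/$pid/fd | wc -l)
sleep 3600
after=$(sudo ls /proc/$pid/fd | wc -l)
echo "leaked ~$(( (after - before) / 60 )) fds per minute"
```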
So the FD issue is probably a side issue that we only detected because we were looking more closely than usual. The main cause of this outage is most probably a traffic peak, as seen in Grafana, made worse by the fact that we run on a single server. If that's really the case, it shows that we run too close to our load limit (peaks are around 2-3x normal load).
@Yurik How much load / how many queries does what was enabled in https://gerrit.wikimedia.org/r/#/c/284091/ 3554c7b6c7137c6e6e6bd77379d0d705f209d118 contribute to the situation?
Probably 0 at the moment - I don't think there are any interactive graphs out there that use SPARQL yet, only a few non-interactive samples. (BTW, there is another bug there - the order of latitude & longitude seems to have been switched.)
I looked into it myself very briefly today (although I'm constantly distracted as I'm at a conference).
When the service via query.wikidata.org is slow or even down, I usually still get very fast responses when querying the server locally (via curl on the host, using the nginx at port 80). I suspect we have a problem where either nginx is not allowing more than N active workers for the misc web Varnish (while still serving other hosts, or specifically localhost), or Varnish doesn't want to open more than N connections to the service and thus makes the other connections wait.
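For the record, the local check looks roughly like this; the /sparql path matches the public endpoint, but the exact path behind the local nginx is an assumption:
```
# Query the local nginx directly, bypassing Varnish (path is an assumption).
curl -s -G 'http://localhost/sparql' \
  --data-urlencode 'query=SELECT * WHERE { ?s ?p ?o } LIMIT 1' \
  -H 'Accept: application/sparql-results+json'
```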
Random things I see (and have no idea if they are relevant to the issue):
The number of established connections on nginx is low, and they all seem to be coming from Varnish (I checked a sample over a few minutes):
```
gehel@wdqs1002:/etc/nginx$ sudo netstat -pn | grep 10.64.32.183:80 | grep ESTABLISHED | wc -l
3
```
The number of connections in TIME_WAIT is consistent with the number of requests seen in the last minute of the nginx access.log.
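The comparison was roughly the following; the access.log path and its timestamp format are assumptions about the local nginx setup:
```
# Sockets in TIME_WAIT on port 80 vs. requests logged in the previous minute.
sudo netstat -tn | grep ':80 ' | grep -c TIME_WAIT
grep -c "$(date '+%d/%b/%Y:%H:%M' -d '1 minute ago')" /var/log/nginx/access.log
```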
TIME_WAIT seems to be normal, but I also saw a very high number of CLOSE_WAIT connections when Blazegraph was stalling. I'm not sure whether that is the cause or a symptom, but it seems to be what was holding the file handles and causing "too many open files".
CLOSE_WAIT means the TCP stack is waiting for the application to acknowledge that the connection is terminated, so yes, a high number of CLOSE_WAIT connections is an indication of a problem. If Blazegraph (or Jetty, more precisely) is frozen, that could explain it. If I read the code / stack traces correctly, Blazegraph uses only async IO, so the reasons I can see for a delay in acknowledging CLOSE_WAIT are an overloaded HTTP thread pool or a completely frozen JVM (GC going crazy?). There are probably other scenarios, but I can't think of any at the moment.
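If a frozen JVM is suspected, the standard JDK tools give a quick answer; a sketch, assuming the Blazegraph PID is discoverable and the commands are run as the user that owns the process ("qtp" is the default Jetty thread-pool naming prefix):
```
# Check GC activity (ten 1-second samples) and look at the Jetty thread pool.
pid=$(pgrep -f blazegraph | head -1)
jstat -gcutil "$pid" 1000 10    # rapidly rising FGC/FGCT would point at GC thrashing
jstack "$pid" | grep -c 'qtp'   # rough count of Jetty (QueuedThreadPool) threads
```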
Looks like the real cause is here: https://jira.blazegraph.com/browse/BLZG-1884
I'll add the patch as soon as it's released.
Mentioned in SAL [2016-05-07T12:32:44Z] <hoo> Restarted blazegraph on wdqs1002 (Unresponsive, even locally: java.io.IOException: Too many open files) T134238