
502 Bad Gateway errors while trying to run simple queries with the Wikidata Query Service
Closed, Resolved. Public.

Description

Since yesterday, we have been getting 502 Bad Gateway errors when running various kinds of queries with the Wikidata Query Service.

The cause is unknown at the moment.

Error message

ERROR: <html>
<head><title>502 Bad Gateway</title></head>
<body bgcolor="white">
<center><h1>502 Bad Gateway</h1></center>
<hr><center>nginx/1.11.1</center>
</body>
</html>

See also

Event Timeline

abian created this task. Sep 25 2016, 11:17 AM
Restricted Application added a project: Discovery. Sep 25 2016, 11:17 AM
Restricted Application added a subscriber: Aklapper.
abian updated the task description. Sep 25 2016, 11:26 AM
abian updated the task description. Sep 25 2016, 11:31 AM
Esc3300 updated the task description. Sep 25 2016, 11:51 AM
Esc3300 added a subscriber: Esc3300.

Maybe related: T146529

Karima added a subscriber: Karima. Sep 25 2016, 12:43 PM
edsu added a subscriber: edsu. Sep 25 2016, 1:31 PM
Sjoerddebruin triaged this task as High priority. Sep 25 2016, 1:45 PM

Lots of reports coming in.

Mentioned in SAL (#wikimedia-operations) [2016-09-25T14:30:27Z] <gehel> putting wdqs1002 in maintenance mode, server looks unstable, investigating... - T146576

Esc3300 updated the task description. Sep 25 2016, 2:31 PM

According to icinga/grafana, several servers have lag (including that one).

Gehel added a comment. Sep 25 2016, 2:38 PM

Logs on wdqs1002 show errors when opening sockets:

Sep 25 14:35:08 wdqs1002 bash[10426]: 2016-09-25 14:30:56.279:WARN:oejs.ServerConnector:qtp927028538-34-acceptor-0@6b2731c9-ServerConnector@74cdad37{HTTP/1.1}{localhost:9999}:
Sep 25 14:35:08 wdqs1002 bash[10426]: java.io.IOException: Too many open files
Sep 25 14:35:08 wdqs1002 bash[10426]: at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
Sep 25 14:35:08 wdqs1002 bash[10426]: at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:250)
Sep 25 14:35:08 wdqs1002 bash[10426]: at org.eclipse.jetty.server.ServerConnector.accept(ServerConnector.java:377)
Sep 25 14:35:08 wdqs1002 bash[10426]: at org.eclipse.jetty.server.AbstractConnector$Acceptor.run(AbstractConnector.java:500)
Sep 25 14:35:08 wdqs1002 bash[10426]: at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
Sep 25 14:35:08 wdqs1002 bash[10426]: at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
Sep 25 14:35:08 wdqs1002 bash[10426]: at java.lang.Thread.run(Thread.java:745)
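
The "Too many open files" exception above indicates that the process reached its file-descriptor limit, so Jetty could no longer accept new connections. A minimal sketch for checking how close a process is to that limit on Linux, assuming `/proc` is readable; the function names are illustrative and are not part of any WDQS tooling:

```python
import os
import re


def parse_nofile_soft_limit(limits_text):
    """Return the soft 'Max open files' limit from /proc/<pid>/limits text, or None."""
    for line in limits_text.splitlines():
        match = re.match(r"Max open files\s+(\S+)\s+(\S+)", line)
        if match:
            soft = match.group(1)
            return None if soft == "unlimited" else int(soft)
    return None


def count_open_fds(pid):
    """Count entries in /proc/<pid>/fd (Linux only; needs permission to read it)."""
    return len(os.listdir(f"/proc/{pid}/fd"))


def fd_usage(pid):
    """Return (open_fds, soft_limit) for the given process."""
    with open(f"/proc/{pid}/limits") as handle:
        soft_limit = parse_nofile_soft_limit(handle.read())
    return count_open_fds(pid), soft_limit


if __name__ == "__main__":
    # Inspect this script's own process; pass another PID to inspect a server.
    used, limit = fd_usage(os.getpid())
    print(f"{used} open file descriptors, soft limit {limit}")
```

A count that keeps climbing toward the soft limit would suggest a descriptor leak rather than a limit that is simply set too low, though the logs here do not say which applies.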

Mentioned in SAL (#wikimedia-operations) [2016-09-25T14:39:29Z] <gehel> restarting blazegraph on wdqs1002 - T146576

EricJ added a subscriber: EricJ. Sep 25 2016, 2:48 PM

I was getting the same issue intermittently when hitting the sparql endpoint programmatically.

The message was:
Failed to load resource: the server responded with a status of 502 ()

My logging in the JavaScript console:
GET https://query.wikidata.org/sparql?format=JSON&query=%23+query+for+www.dages…e%3Alabel+%7B+bd%3AserviceParam+wikibase%3Alanguage+%22en%22.+%7D%0A%7D%0A 502 ()

I was able to reproduce the issue in the query service (manifesting itself intermittently with a blue bar, as described above).
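
For clients hitting the endpoint programmatically, intermittent 502 responses like these can be absorbed by retrying with a growing delay. A hedged sketch using only the Python standard library; `fetch_with_retry`, `run_sparql`, the retry counts and the User-Agent string are assumptions made for illustration, not anything the service prescribes:

```python
import time
import urllib.error
import urllib.parse
import urllib.request

# Gateway-level failures that are usually transient.
RETRYABLE = {502, 503, 504}


def fetch_with_retry(fetch, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds or a non-retryable error occurs.

    fetch must raise urllib.error.HTTPError on HTTP failures.
    """
    for attempt in range(attempts):
        try:
            return fetch()
        except urllib.error.HTTPError as err:
            last_try = attempt == attempts - 1
            if err.code not in RETRYABLE or last_try:
                raise
            # Exponential backoff: base_delay, 2x, 4x, ...
            sleep(base_delay * 2 ** attempt)


def run_sparql(query, endpoint="https://query.wikidata.org/sparql"):
    """Send one SPARQL query and return the raw JSON response body."""
    params = urllib.parse.urlencode({"format": "json", "query": query})
    request = urllib.request.Request(
        endpoint + "?" + params,
        headers={"User-Agent": "example-retry-sketch/0.1"},  # hypothetical value
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return response.read().decode("utf-8")
```

Usage would look like `fetch_with_retry(lambda: run_sparql("SELECT * WHERE { ?s ?p ?o } LIMIT 1"))`. Retrying only helps with transient failures; it does not fix an unstable backend and, if done aggressively, can add to the load.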

Gehel added a comment. Sep 25 2016, 2:54 PM

Restarting blazegraph on wdqs1002 seems to have solved the issue. I'll dig into the logs to see if I find something that would explain the issue.

Gehel added a comment. Sep 25 2016, 3:03 PM

It seems that wdqs1002 started showing replication lag issues around 16:00 UTC on Saturday Sept 24. Shortly after, there is a hole in that metric in Graphite. HTTP 502 went up much later, around 19:00 UTC.

Addshore moved this task from incoming to monitoring on the Wikidata board. Sep 26 2016, 11:07 AM

Thanks for fixing this. Afterwards, I occasionally got stale data, and the "data updated" timestamp didn't always reflect that.

Multichill added a subscriber: hoo. Oct 1 2016, 2:21 PM

As this one is still open, I'll just continue in this bug. For the past couple of minutes it has been impossible to run queries. I get this all the time:

ERROR: <html>
<head><title>504 Gateway Time-out</title></head>
<body bgcolor="white">
<center><h1>504 Gateway Time-out</h1></center>
<hr><center>nginx/1.11.3</center>
</body>
</html>

@hoo mentioned that someone is hammering the service with faulty queries.

hoo added a comment. Oct 1 2016, 2:37 PM

We had a few thousand queries that were TOOL: auxiliary_matcher; those caused syntax errors. I'm not sure that's enough to make the service unstable.

I can't see anything else obviously wrong at a glance.

Magnus added a subscriber: Magnus. Oct 1 2016, 4:09 PM

Oops, forgot the #
Fixed now.
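
In SPARQL, `#` starts a comment that runs to the end of the line, so a tool label only parses if it is written as a comment. A small sketch of tagging a query that way, assuming the `TOOL:` label is meant purely as a comment; `tag_query` is a hypothetical helper, not code from the tool discussed here:

```python
def tag_query(query, tool_name):
    """Prepend a SPARQL comment line identifying the calling tool."""
    if "\n" in tool_name or "\r" in tool_name:
        # A line break would end the comment and leak text into the query.
        raise ValueError("tool_name must be a single line")
    return f"# TOOL: {tool_name}\n{query}"


if __name__ == "__main__":
    print(tag_query("SELECT * WHERE { ?s ?p ?o } LIMIT 1", "auxiliary_matcher"))
```

Without the leading `#`, the label line would be read as part of the query itself, which would be consistent with the syntax errors reported above.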

From looking at https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?panelId=15&fullscreen and doing some testing, I think this particular issue is solved. I suggest closing this as resolved and creating new tickets for the remaining, unrelated issues.

Gehel closed this task as Resolved. Oct 2 2016, 7:28 AM
Gehel claimed this task.

As @jcrespo pointed out, the current issue is different from the one raised here. I created T147130 to track the new issue, and I'm closing this one.