
WCQS 500 errors
Closed, Resolved · Public · 8 Estimated Story Points · BUG REPORT

Description

I am experiencing 500 Internal Server Error responses again in WCQS. This happens after a query is run, and the following message is returned in the output (in the UI):

Unknown error
<html>
<head><title>500 Internal Server Error</title></head>
<body bgcolor="white">
<center><h1>500 Internal Server Error</h1></center>
<hr><center>nginx/1.14.2</center>
</body>
</html>
<!-- a padding to disable MSIE and Chrome friendly error page -->
<!-- a padding to disable MSIE and Chrome friendly error page -->
<!-- a padding to disable MSIE and Chrome friendly error page -->
<!-- a padding to disable MSIE and Chrome friendly error page -->
<!-- a padding to disable MSIE and Chrome friendly error page -->
<!-- a padding to disable MSIE and Chrome friendly error page -->

I know this often resolves itself, but anecdotally this seems to be a common occurrence for WCQS: judging from the Telegram channel, these intermittent errors happen fairly often, and they are not nearly as common for WDQS. Is this a known issue, and is this downtime tracked?

AC:

  • This error is understood and corrected
  • The error message is understandable by users

Event Timeline

Seems to be OK now, but I agree with @Dominicbm: these intermittent errors will crop up for 10-20 minutes and then go away, but it's not clear whether they are being noticed or tracked by the folks on the search team. Any insights would be most welcome. Thanks.

As @Fuzheado notes, it usually resolves itself shortly, but the tool itself feels unreliable. As an example, I've been using it heavily today. It went down for me for 20-30 minutes around 9:30 am today in my time zone, then began working again, and now I am encountering the same errors around 12:15 pm. This is a lot of random downtime.

Edited to add: the second downtime lasted nearly an hour before WCQS came back.

@Gehel My intention was to have one ticket for the reliability issues and one for the message itself, if that makes sense. I just wanted to point this out because the other ticket was merged into this one: the new AC for this ticket, which was supposed to be about the error rather than the message, suggest that it could be closed once the error message is improved, even if the error itself keeps occurring.

Usually the first stop for this kind of error is reviewing the ATS Backends <-> Origin Servers Overview dashboard, which suggests a low rate of 5xxs: typically 1-5% of requests fail. A quick review of the last few 500 responses on one of the servers showed they were all malformed queries. We may need to look into more specific timespans rather than the generic 500 errors. Modifying one of the dashboard queries[1] to return the success rate per 15 minutes, running it against Thanos to cover all DCs, and looking for periods of low success, the following time periods should be reviewed:

2022-04-16T17:30-18:10
2022-04-17T08:30-10:00
2022-04-22T16:26-17:12
2022-04-22T19:09-19:36
2022-04-26T16:20-17:37
2022-05-04T19:50-21:42

If this turns up the problem, we could consider how it could be turned into an alert.

[1]

sum(increase(trafficserver_backend_requests_seconds_count{status=~"2[0-9][0-9]", cluster=~"cache_text", backend=~"wcqs\\.discovery\\.wmnet"}[15m])) by (backend)
/
sum(increase(trafficserver_backend_requests_seconds_count{status=~"[25][0-9][0-9]", cluster=~"cache_text", backend=~"wcqs\\.discovery\\.wmnet"}[15m])) by (backend)
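
For reference, here is a hedged sketch of how the modified expression could be evaluated over one of the suspect windows via the Prometheus-compatible range-query API that Thanos exposes; the endpoint host below is a placeholder, not the actual internal address:

# Sketch only: run the success-rate expression from [1] over one of the
# low-success windows. Host name, start/end and step are illustrative values.
QUERY='sum(increase(trafficserver_backend_requests_seconds_count{status=~"2[0-9][0-9]", cluster=~"cache_text", backend=~"wcqs\\.discovery\\.wmnet"}[15m])) by (backend) / sum(increase(trafficserver_backend_requests_seconds_count{status=~"[25][0-9][0-9]", cluster=~"cache_text", backend=~"wcqs\\.discovery\\.wmnet"}[15m])) by (backend)'
curl -sG 'https://thanos-query.example.org/api/v1/query_range' \
  --data-urlencode "query=${QUERY}" \
  --data-urlencode 'start=2022-04-16T17:00:00Z' \
  --data-urlencode 'end=2022-04-16T19:00:00Z' \
  --data-urlencode 'step=15m'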

Not finding anything decisive yet, but will continue looking. It occurred to me that if it's happening consistently for an individual user but not in general, it could somehow be related to their authentication cookie. It seems plausible that clearing the auth cookie could fix things and help narrow down the cause, if the problem is related to auth.

Following the thread of something related to auth, I've found that the application server (jetty) which hosts the app has never had its logging properly set up. Logs only come from the embedded applications; the application server itself ends up with bare-minimum logging into the host-local journald, where it's mostly forgotten about. Currently working out how jetty should be configured for logging to work as expected. This likely means there are wdqs errors that go unnoticed as well. Hoping that with proper logging in place we start to get more details about whatever is causing these 500s.

The default logging is quite minimal; looking through the logs turns up a few unexplained errors that could be related. It's not clear that any of these are symptoms of the same problem, but lacking more information the best bet seems to be to look into them.

0-3 per day. These don't seem to be new: the logs go back to Mar 23 and this first shows up on Mar 24. The frequency is quite low.

May 10 15:34:21 wcqs1001 wcqs-blazegraph[29631]: 2022-05-10 15:34:21.036:WARN:oejs.HttpChannel:qtp968514068-231008: /oauth/check_auth java.io.IOException: java.util.concurrent.TimeoutException: Idle timeout expired: 30000/30000 ms
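
For the record, a minimal sketch of counting these from the host-local journal, assuming the wcqs-blazegraph systemd unit name matches the prefix in the log line above:

# Rough per-day count of idle-timeout warnings from the local journal on a
# wcqs host; the unit name is assumed from the log prefix above.
journalctl -u wcqs-blazegraph --since '2022-03-23' --no-pager \
  | grep 'Idle timeout expired' \
  | awk '{print $1, $2}' | sort | uniq -c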

Occurred on multiple days over the last month, but not with any regularity. The value inside the quotes is sometimes an HTML error page, sometimes the value included below. This suggests error responses are being interpreted as a valid response (and then failing validation later):

May 05 05:47:54 wcqs1002 wcqs-blazegraph[24508]: javax.servlet.ServletException: javax.servlet.ServletException: com.github.scribejava.core.exceptions.OAuthException: Response body is incorrect. Can't extract token and secret from this: 'upstream connect error or disconnect/reset before headers. reset reason: overflow'

Change 790775 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[wikidata/query/rdf@master] Separate logging deps into dedicated package

https://gerrit.wikimedia.org/r/790775

Change 790775 merged by jenkins-bot:

[wikidata/query/rdf@master] Separate logging deps into dedicated package

https://gerrit.wikimedia.org/r/790775

Change 792266 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[operations/puppet@production] rdf query service: Apply WARN log level only to com.bigdata

https://gerrit.wikimedia.org/r/792266

Honestly, if you are seeing few errors, it may also be that heavy users (like myself) are putting off our projects until the tool is more usable. If another data point is useful, I was using WCQS again today, and only submitted a couple of simple queries before I got another internal server error tonight.

Change 792266 merged by Ryan Kemper:

[operations/puppet@production] rdf query service: Apply WARN log level only to com.bigdata

https://gerrit.wikimedia.org/r/792266

@Dominicbm @Fuzheado Understood that you all are still seeing these failures. We're still having trouble reproducing it on our end (we don't have great intuition as to why, but it's possible that the issue only surfaces based on the geoDNS routing). We've added some further logging; next time any of you run into this issue, if you could drop by the #wikimedia-search IRC channel and let us know, we can take a look at the logs and see if we can glean anything. Meanwhile, on our end we'll keep trying to home in on the source of the failure.

Mentioned in SAL (#wikimedia-operations) [2022-05-19T17:51:34Z] <ryankemper> T306899 Rolled wdqs and wcqs deploys to adjust logging settings. Hoping this gives us more visibility on the 500 errors WCQS users have been experiencing.

> @Dominicbm @Fuzheado Understood that you all are still seeing these failures. We're still having trouble reproducing it on our end (we don't have great intuition as to why, but it's possible that the issue only surfaces based on the geoDNS routing). We've added some further logging; next time any of you run into this issue, if you could drop by the #wikimedia-search IRC channel and let us know, we can take a look at the logs and see if we can glean anything. Meanwhile, on our end we'll keep trying to home in on the source of the failure.

Thanks, it's appreciated; I will head to IRC next time. I expect to run more WCQS queries soon, as I have a WMF grant project just approved which requires it. Could I ask, since this ticket seems to be tracking reliability issues, why T306919 was closed as a duplicate? (It might even be appropriate to put the IRC channel in the error message, if that is useful to the team...)

@Dominicbm We made "the error message is understandable by users" part of the acceptance criteria of this ticket, so I think that's why T306919 was closed as a duplicate. Basically this ticket covers both fixing the actual problem and seeing what can be done to make the error message itself more comprehensible.

Reviewed the logs again looking for patterns. Not much, but at least logstash is now aggregating logs from the various hosts. I can see that the /oauth/check_auth java.io.IOException: java.util.concurrent.TimeoutException: Idle timeout expired: 30000/30000 ms errors come in infrequently, but often bunched up a bit. From the last week: on May 22 it came in 5 times from 11:45 until 15:02, on May 26 three times from 14:22 to 14:23, twice on May 27 at 8:07, once on the 28th at 14:20, and once on the 31st at 10:58.

Still no strong proof that these timeouts are the 500s some users are seeing. Additionally, still no success in reproducing the errors; I have run multiple example queries on most weekdays for a few weeks now, but they always work as expected.

Lacking better ideas on how to align the errors with whatever request causes them, I've started up tcpdump on all the wcqs instances. They will store up to 100 1GB files per instance before starting to overwrite the initial files. The overall goal here is to match requests from the tcpdump pcap with unexplained error messages like 'Idle timeout expired'.

tcpdump -ni lo -W 100 -C 1gb -w /srv/T306899/lo.pcap
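
Once an error timestamp is known, something along these lines could be used to dig the matching request out of the captures; the rotated file name and the port filter below are illustrative assumptions, not the actual values used on the hosts:

# Sketch only: scan one of the rotated capture files for plaintext HTTP
# requests hitting the internal auth endpoint near a suspect timestamp.
tcpdump -r /srv/T306899/lo.pcap00 -nn -A 'tcp port 8888' \
  | grep -B3 -A10 '/oauth/check_auth'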

@Dominicbm, @Fuzheado: did you see this issue again? Do you have a timestamp that we could correlate with the logs?

Since we did not get any feedback on this error, we're assuming that it has resolved itself. If not, please re-open and ping us with a timestamp of when the error occurs.

> @Dominicbm, @Fuzheado: did you see this issue again? Do you have a timestamp that we could correlate with the logs?

I have been away from WCQS for a while, so the lack of observed errors on my part was due more to non-usage than to confirming (or not) whether it was resolved. But I was just trying to use it again today. I ran a few dozen queries (at most) over the last few hours, not at a high rate, and none that were causing timeouts or other errors. After a while (I hadn't done a query in about an hour at that point), I got the usual error around 01:29 UTC on Aug 2, 2022. Reopening with this specific info per @Gehel's suggestion.

Screen Shot 2022-08-01 at 9.29.31 PM.png (394 KB)

> @Dominicbm We made "the error message is understandable by users" part of the acceptance criteria of this ticket, so I think that's why T306919 was closed as a duplicate. Basically this ticket covers both fixing the actual problem and seeing what can be done to make the error message itself more comprehensible.

I also want to note that T306919 was closed as a duplicate and "the error message is understandable by users" was added to the acceptance criteria, per above, but the error message part doesn't seem to have been addressed before this ticket was closed the first time, when the error wasn't reproduced: the same raw HTML markup is still being shown in the user interface.

Experienced the same error again today; here is an exact timestamp (of the response): Wed, 03 Aug 2022 17:15:19 GMT.

@DAbad, this looks like it may be related to WCQS authentication, which may be relevant to current platform work.

> Experienced the same error again today; here is an exact timestamp (of the response): Wed, 03 Aug 2022 17:15:19 GMT.

This lines up nicely with a message from logging. The message comes from jetty, which is the JVM application runner that hosts both blazegraph and the oauth middleware.

Aug 3, 2022 @ 17:15:19.203	wcqs1002	WARNING	/oauth/check_auth java.io.IOException: java.util.concurrent.TimeoutException: Idle timeout expired: 30000/30000 ms

Leaving the commons-query.wikimedia.org browser tab open for a few hours and re-running queries every 30-60 minutes or so reproduced a 500 after a few hours. Related JS console errors follow. Timestamps are PDT. It's unclear if the errors at 13:00 and 13:10 are directly related, but I'm including them since they were there.

13:00:42.779 /#%23Depictions%20of%20Douglas%20Adams%0A%23shows%20M-entities%20that%20depict%20Douglas%20Adams%0ASELECT%20%3Ffile%20WHERE%20%7B%0A%20%20%3Ffile%20wdt%3AP180%20wd%3AQ42%20.%0A%7D:1 Access to XMLHttpRequest at 'https://commons.wikimedia.org/wiki/Special:OAuth/authenticate?oauth_token=<redacted a>' (redirected from 'https://commons-query.wikimedia.org/sparql?query=prefix%20schema:%20%3Chttp://schema.org/%3E%20SELECT%20*%20WHERE%20%7B%3Chttp://www.wikidata.org%3E%20schema:dateModified%20?y%7D&nocache=27659280') from origin 'https://commons-query.wikimedia.org' has been blocked by CORS policy: Response to preflight request doesn't pass access control check: Redirect is not allowed for a preflight request.

13:00:42.782 commons.wikimedia.org/wiki/Special:OAuth/authenticate?oauth_token=<redacted a>:1          Failed to load resource: net::ERR_FAILED

13:10:42.786 /#%23Depictions%20of%20Douglas%20Adams%0A%23shows%20M-entities%20that%20depict%20Douglas%20Adams%0ASELECT%20%3Ffile%20WHERE%20%7B%0A%20%20%3Ffile%20wdt%3AP180%20wd%3AQ42%20.%0A%7D:1 Access to XMLHttpRequest at 'https://commons.wikimedia.org/wiki/Special:OAuth/authenticate?oauth_token=<redacted b>' (redirected from 'https://commons-query.wikimedia.org/sparql?query=prefix%20schema:%20%3Chttp://schema.org/%3E%20SELECT%20*%20WHERE%20%7B%3Chttp://www.wikidata.org%3E%20schema:dateModified%20?y%7D&nocache=27659290') from origin 'https://commons-query.wikimedia.org' has been blocked by CORS policy: Response to preflight request doesn't pass access control check: Redirect is not allowed for a preflight request.

13:10:42.787 commons.wikimedia.org/wiki/Special:OAuth/authenticate?oauth_token=<redacted b>:1          Failed to load resource: net::ERR_FAILED

13:12:36.726 /#%23Depictions%20of%20Douglas%20Adams%0A%23shows%20M-entities%20that%20depict%20Douglas%20Adams%0ASELECT%20%3Ffile%20WHERE%20%7B%0A%20%20%3Ffile%20wdt%3AP180%20wd%3AQ42%20.%0A%7D:1 Access to XMLHttpRequest at 'https://commons.wikimedia.org/wiki/Special:OAuth/authenticate?oauth_token=<redacted c>' (redirected from 'https://commons-query.wikimedia.org/sparql?query=%23Depictions%20of%20Douglas%20Adams%0A%23shows%20M-entities%20that%20depict%20Douglas%20Adams%0ASELECT%20%3Ffile%20WHERE%20%7B%0A%20%20%3Ffile%20wdt%3AP180%20wd%3AQ42%20.%0A%7D') from origin 'https://commons-query.wikimedia.org' has been blocked by CORS policy: Response to preflight request doesn't pass access control check: Redirect is not allowed for a preflight request.

13:12:36.749 commons.wikimedia.org/wiki/Special:OAuth/authenticate?oauth_token=<redacted c>:1          Failed to load resource: net::ERR_FAILED

13:13:06.992 /sparql:1          Failed to load resource: the server responded with a status of 500 ()

Correlated errors from server logs (13:00 PDT == 20:00 UTC):

Aug 3, 2022 @ 20:13:06.938	wcqs1002	WARNING	/oauth/check_auth java.io.IOException: java.util.concurrent.TimeoutException: Idle timeout expired: 30000/30000 ms

I've tracked down one source of 500 errors; it's unclear if the original report here is for the same thing.

Reproduction:

curl -XPOST https://commons-query.wikimedia.org/any-url-doesnt-matter -d 'foo=bar'

Reason:
This request includes a Content-Length header which nginx ends up passing along to the /oauth/check_auth endpoint. Jetty (hosting the endpoint) sees the Content-Length header and starts waiting for the content to arrive, which never does. After 30s jetty times out the request. This most likely means all requests with the query in the request body, rather than in a URL query string, receive this 500 error.
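
If that diagnosis is right, the reproduction above should hang for roughly the 30s idle timeout before the 500 comes back, which is easy to check by timing it:

# Time the reproduction from above; with the stalled auth subrequest the 500
# should only arrive after roughly the 30s jetty idle timeout.
time curl -s -o /dev/null -w '%{http_code}\n' \
  -XPOST https://commons-query.wikimedia.org/any-url-doesnt-matter \
  -d 'foo=bar'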

Resolution:
Whitelist the set of headers that will be passed along to the /oauth/* endpoints to include only the Host and Cookie headers.

Caveats:
While this will fix the timeout, I suspect it will simply fail the request at a different point. At least in my reproduction case, the reason the UI is issuing a POST request with the query in the body is that the GET request was rejected due to attempting to re-auth during an XHR, and the browser refused to show the response to the JavaScript. The UI JavaScript interprets this as the request having never been sent and re-issues the same request over POST. Once this timeout issue is fixed, that POST request will have the same CORS problems, and it's unlikely we will be able to change MediaWiki's Special:OAuth CORS headers for this use case.

Possible Solutions:
Gergo suggested that perhaps we can store an OAuth 1-related token in the cookies. When the JWT expires after 2 hours and requires a re-auth, we might be able to re-validate the previously stored OAuth 1 token rather than going through the full redirect bounce, which has CORS issues. This will require more investigation and review of OAuth 1 flows to determine if it is viable.

I have also had many of these errors. I sometimes work for hours with the same session, and I agree it might be timeouts.
In my experience, I can continue right away if I open a new browser session.

Change 825434 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[wikidata/query/rdf@master] mw-oauth-proxy: Remember access token and use to refresh session

https://gerrit.wikimedia.org/r/825434

Change 825922 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[wikidata/query/rdf@master] mw-oauth-proxy: Update nginx test integration to work with POST

https://gerrit.wikimedia.org/r/825922

Change 825925 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[operations/puppet@production] query_service: Avoid passing content body to internal auth endpoints

https://gerrit.wikimedia.org/r/825925

Change 825925 merged by Bking:

[operations/puppet@production] query_service: Avoid passing content body to internal auth endpoints

https://gerrit.wikimedia.org/r/825925

Change 825434 merged by jenkins-bot:

[wikidata/query/rdf@master] mw-oauth-proxy: Remember access token and use to refresh session

https://gerrit.wikimedia.org/r/825434

Change 825922 merged by jenkins-bot:

[wikidata/query/rdf@master] mw-oauth-proxy: Update nginx test integration to work with POST

https://gerrit.wikimedia.org/r/825922

Mentioned in SAL (#wikimedia-operations) [2022-08-30T05:03:45Z] <ryankemper> T306899 T316496 Deployed WCQS 0.3.115. That should (hopefully) resolve these tickets.

Some quick testing makes this look successful. Using curl to perform a POST no longer 500s:

curl 'https://commons-query.wikimedia.org/sparql' \
  -XPOST \
  -H 'cookie: wcqsOauth=<secret>; wcqsSession=<secret>' \
  -d 'query=prefix%20schema:%20%3Chttp://schema.org/%3E%20SELECT%20*%20WHERE%20%7B%3Chttp://www.wikidata.org%3E%20schema:dateModified%20?y%7D&nocache=27701073'

Additionally, the underlying issue, that the JWT would expire, also looks resolved. I tested by opening the UI in a browser tab along with the network inspector and leaving it for many hours. The UI performs a regular request every 10 minutes to ask about update lag; every couple of hours those requests return a 307 response that includes a new JWT, and the requests continue to work.
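
For completeness, a rough command-line spot check of that refresh behaviour, under the assumption that a still-valid wcqsOauth cookie alone (with the session JWT expired or omitted) is enough to trigger the refresh; the cookie value is a placeholder:

# Expect a 307 carrying a fresh session cookie rather than a 500; this is a
# sketch, not a verified test, and assumes the wcqsOauth token is still valid.
curl -si 'https://commons-query.wikimedia.org/sparql?query=SELECT%20*%20WHERE%20%7B%7D%20LIMIT%201' \
  -H 'cookie: wcqsOauth=<secret>' \
  | grep -iE '^(HTTP/|set-cookie:)'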

Change 830924 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[wikidata/query/rdf@master] Store OAuth1AccessToken fully within cookies

https://gerrit.wikimedia.org/r/830924

Change 830924 merged by jenkins-bot:

[wikidata/query/rdf@master] Store OAuth1AccessToken fully within cookies

https://gerrit.wikimedia.org/r/830924