
cxserver: https://cxserver.wikimedia.org/v2/suggest/source/Paneer/ca?sourcelanguages=en occasionally fails with HTTP 503
Closed, ResolvedPublicBUG REPORT

Description

What happens?:
cxserver is throwing the following error, which indicates something is wrong; the root cause is unknown. It is coming from an internal API call to http://localhost:6500/w/api.php

Frequency:
195 errors in the last 24 hours.

<!DOCTYPE html>
<html lang="en" dir="ltr">
<meta charset="utf-8">
<title>Wikimedia Error</title>
<style>
* { margin: 0; padding: 0; }
body { background: #fff; font: 15px/1.6 sans-serif; color: #333; }
.content { margin: 7% auto 0; padding: 2em 1em 1em; max-width: 640px; }
img { float: left; margin: 0 2em 2em 0; }
a img { border: 0; }
h1 { margin-top: 1em; font-size: 1.2em; }
p { margin: 0.7em 0 1em 0; }
a { color: #0645AD; text-decoration: none; }
a:hover { text-decoration: underline; }
</style>
<div class="content" role="main">
<a href="https://www.wikimedia.org"><img src="https://www.wikimedia.org/static/images/wmf.png" srcset="https://www.wikimedia.org/static/images/wmf-2x.png 2x" alt=Wikimedia width=135 height=135></a>
<h1>Service Temporarily Unavailable</h1>
<p>Our servers are currently under maintenance or experiencing a technical problem. Please <a href="" title="Reload this page" onclick="location.reload(false); return false">try again</a> in a few&nbsp;minutes.</p>
</div>
</html>

Full error log: https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-syslog-2021.06.21?id=NRy-LXoBQzT5HGEiuoS_

Outcome

Investigation concluded that the large number of errors in cxserver was caused by a failure elsewhere. No further action was taken.

Event Timeline

Nikerabbit renamed this task from cxserver throws maintenance error to cxserver: https://cxserver.wikimedia.org/v2/suggest/source/Paneer/ca?sourcelanguages=en occasionally fails with HTTP 503.Jun 21 2021, 11:59 AM

Most errors are coming from ServiceRunner checks, but I was able to reproduce these errors by just repeatedly calling https://cxserver.wikimedia.org/v2/suggest/source/Paneer/ca?sourcelanguages=en myself.
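The repro described above amounts to polling the endpoint and counting how often the generic maintenance page comes back. A minimal sketch (the helper names, attempt count, and the heuristic of matching the error page's `<title>` are my assumptions, not part of the task):

```python
import urllib.error
import urllib.request

URL = ("https://cxserver.wikimedia.org/v2/suggest/source/Paneer/ca"
       "?sourcelanguages=en")

def looks_like_wikimedia_503(status: int, body: str) -> bool:
    """Heuristic: the generic Wikimedia maintenance page, not a cxserver reply."""
    return status == 503 and "<title>Wikimedia Error</title>" in body

def probe(url: str = URL, attempts: int = 20) -> int:
    """Call the endpoint repeatedly and count generic-503 responses."""
    failures = 0
    for _ in range(attempts):
        try:
            with urllib.request.urlopen(url) as resp:
                status, body = resp.status, resp.read().decode("utf-8", "replace")
        except urllib.error.HTTPError as e:
            # urllib raises on 4xx/5xx; the error object still carries the body.
            status, body = e.code, e.read().decode("utf-8", "replace")
        if looks_like_wikimedia_503(status, body):
            failures += 1
    return failures
```

Running `probe()` against production would occasionally return a non-zero count, matching the intermittent failures reported here.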

CXServer does not serve this kind of error page; it's coming from somewhere higher up. But I can't find any logs on cxserver indicating what could be going wrong, so I am unable to find the root cause. I need advice on how to debug this further.

The message comes from mw-api: http://localhost:6500/w/api.php is the local address of the https://wikitech.wikimedia.org/wiki/Envoy#Services_Proxy mw-api listener. You can see errors arising from the service-proxy itself if you look for logs with kubernetes.container_name: cxserver-production-tls-proxy. The following relates to the error linked in the description:

https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-syslog-2021.06.21?id=w9a-LXoBStjVNP_PyjQr

It's likely that the error is coming from the wbsearchentities or wbgetentities API and cxserver's error handling just forwards the whole error reply through its own API. Tagging Wikidata in case they are aware of any issues related to these APIs.

I'm not sure of anything that would be causing this.
What sort of requests is cxserver making to the Wikidata API?

Usually there is a request id of some kind in the error message, but this one has none. I wonder if this is some kind of rate-limiting or timeout issue?

Change 702077 had a related patch set uploaded (by Nikerabbit; author: Nikerabbit):

[translatewiki@master] Add systemd timer for autoexport

https://gerrit.wikimedia.org/r/702077

Ignore the above; I accidentally pasted the wrong bug number.

My current understanding:

  • This error does not originate from cxserver.
  • Usually API errors contain an API response with error codes.
  • The HTTP 503 error is hiding the underlying issue, and it gives no error codes or request ids to explore further. I do not know how to find the underlying issue.
  • This issue affects not only CXServer but also other projects using the tls-proxy, at a rate of over 100 per minute: https://logstash.wikimedia.org/goto/67bdc723be54629059656d3034460bad
  • Requests are failing not only to Wikidata but also to Commons and other wikis.
  • One simple query that failed is https://wikidata.org/w/api.php?action=wbgetentities&props=labels&ids=Q1139546&languages=en&format=json – this should not fail for any reason related to the query itself.
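The distinction drawn above — a well-formed MediaWiki API error (which carries an error code) versus the opaque HTML 503 page (which carries nothing) — can be made mechanical. A sketch, assuming the standard `{"error": {"code": ..., "info": ...}}` shape of MediaWiki JSON error replies (the helper name is mine):

```python
import json
from typing import Optional

def api_error_code(body: str) -> Optional[str]:
    """Return the MediaWiki API error code if the body is a JSON error
    reply, or None for anything else (e.g. the bare HTML 503 page,
    which is exactly the case that left this task with nothing to go on)."""
    try:
        payload = json.loads(body)
    except ValueError:
        # Not JSON at all: the HTML maintenance page lands here.
        return None
    if isinstance(payload, dict) and isinstance(payload.get("error"), dict):
        return payload["error"].get("code")
    # Valid JSON but a successful reply (e.g. {"entities": {...}}).
    return None
```

A body for which this returns `None` with HTTP 503 is the uninformative case described in this thread: no error code, no request id, nothing to correlate with server-side logs.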

Is this a general issue with the tls-proxy, where some fraction of requests fail? Or, if not, and this is just the way it reports all errors from upstream, how can we make the tls-proxy log the actual issue somewhere (or where is that log, if it already exists)?

Picking up from the IRC conversation yesterday, @RLazarus figured out that the response body looks like it is https://gerrit.wikimedia.org/g/operations/mediawiki-config/+/master/errorpages/503.html
At the time this issue was opened (June 21) we did have some database issues, so the increased rate of 503s from the apiservers is most likely due to that.

Looking at the last two weeks, the picture has changed from June to now, with only a handful of requests failing for cxserver, most of them due to upstream connection failure ("UF" in the response flags field). Those errors might happen from time to time because the service-proxy creates persistent connections, which might then get closed server-side or dropped due to network issues. But as that is happening at a very low rate, we have not dug into it further so far.
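For readers not fluent in Envoy access logs: the "UF" mentioned above is one of Envoy's response flags, a comma-separated field in each access-log entry. A small lookup sketch covering a few common flags (descriptions paraphrased from Envoy's documentation; the function name is mine):

```python
# A few common Envoy access-log response flags; "UF" is the one seen
# in the cxserver tls-proxy logs discussed in this task.
ENVOY_RESPONSE_FLAGS = {
    "UF": "upstream connection failure",
    "UH": "no healthy upstream hosts",
    "UT": "upstream request timeout",
    "NR": "no route configured for the request",
}

def describe_flags(flags: str) -> list:
    """Translate a comma-separated response-flags field ("-" means none)
    into human-readable descriptions."""
    return [ENVOY_RESPONSE_FLAGS.get(f, f"unknown flag {f!r}")
            for f in flags.split(",") if f and f != "-"]
```

So a log line carrying `UF` points at the proxy failing to connect to its upstream, consistent with the persistent-connection explanation given above.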

Thanks a lot!

It seems these errors have reduced greatly or vanished now (at least in the last 24 hours!).

@Nikerabbit We can wait for a day or two and then move it to Done.

Moving to done. Errors are no longer happening, or at least not frequently.

Nikerabbit claimed this task.