
cxserver: https://cxserver.wikimedia.org/v2/suggest/source/Paneer/ca?sourcelanguages=en occasionally fails with HTTP 503
Closed, ResolvedPublicBUG REPORT

Description

What happens?:
cxserver is throwing the following error, which indicates something is wrong; the root cause is unknown. It is coming from an internal API call to http://localhost:6500/w/api.php

Frequency:
195 errors in the last 24 hours.

<!DOCTYPE html>
<html lang="en" dir="ltr">
<meta charset="utf-8">
<title>Wikimedia Error</title>
<style>
* { margin: 0; padding: 0; }
body { background: #fff; font: 15px/1.6 sans-serif; color: #333; }
.content { margin: 7% auto 0; padding: 2em 1em 1em; max-width: 640px; }
img { float: left; margin: 0 2em 2em 0; }
a img { border: 0; }
h1 { margin-top: 1em; font-size: 1.2em; }
p { margin: 0.7em 0 1em 0; }
a { color: #0645AD; text-decoration: none; }
a:hover { text-decoration: underline; }
</style>
<div class="content" role="main">
<a href="https://www.wikimedia.org"><img src="https://www.wikimedia.org/static/images/wmf.png" srcset="https://www.wikimedia.org/static/images/wmf-2x.png 2x" alt=Wikimedia width=135 height=135></a>
<h1>Service Temporarily Unavailable</h1>
<p>Our servers are currently under maintenance or experiencing a technical problem. Please <a href="" title="Reload this page" onclick="location.reload(false); return false">try again</a> in a few&nbsp;minutes.</p>
</div>
</html>

Full error log: https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-syslog-2021.06.21?id=NRy-LXoBQzT5HGEiuoS_

Outcome

Investigation concluded that the large number of errors in cxserver was caused by a failure elsewhere. No further action was taken.

Event Timeline

Nikerabbit renamed this task from cxserver throws maintenance error to cxserver: https://cxserver.wikimedia.org/v2/suggest/source/Paneer/ca?sourcelanguages=en occasionally fails with HTTP 503.Jun 21 2021, 11:59 AM

Most errors are coming from ServiceRunner checks, but I was able to reproduce these errors by just repeatedly calling https://cxserver.wikimedia.org/v2/suggest/source/Paneer/ca?sourcelanguages=en myself.
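The repro described above amounts to polling the endpoint and counting how often the generic maintenance page comes back. A minimal sketch (the helper names, attempt count, and the heuristic of matching the error page's `<title>` are my assumptions, not part of the task):

```python
import urllib.error
import urllib.request

URL = ("https://cxserver.wikimedia.org/v2/suggest/source/Paneer/ca"
       "?sourcelanguages=en")

def looks_like_wikimedia_503(status: int, body: str) -> bool:
    """Heuristic: the generic Wikimedia maintenance page, not a cxserver reply."""
    return status == 503 and "<title>Wikimedia Error</title>" in body

def probe(url: str = URL, attempts: int = 20) -> int:
    """Call the endpoint repeatedly and count generic-503 responses."""
    failures = 0
    for _ in range(attempts):
        try:
            with urllib.request.urlopen(url) as resp:
                status, body = resp.status, resp.read().decode("utf-8", "replace")
        except urllib.error.HTTPError as e:
            # urllib raises on 4xx/5xx; the error object still carries the body.
            status, body = e.code, e.read().decode("utf-8", "replace")
        if looks_like_wikimedia_503(status, body):
            failures += 1
    return failures
```

Running `probe()` against production would occasionally return a non-zero count, matching the intermittent failures reported here.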

CXServer does not serve this kind of error page; it's coming from somewhere higher up. But I can't find any logs on cxserver indicating what could be going wrong, so I am unable to find the root cause. I need advice on how to debug this further.

The message comes from mw-api: http://localhost:6500/w/api.php is the local address of the https://wikitech.wikimedia.org/wiki/Envoy#Services_Proxy mw-api listener. You can see errors arising from the service-proxy itself if you look for logs with kubernetes.container_name: cxserver-production-tls-proxy. The following relates to the error linked in the description:

https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-syslog-2021.06.21?id=w9a-LXoBStjVNP_PyjQr

It's likely that the error is coming from the wbsearchentities or wbgetentities API and cxserver's error handling just forwards the whole error reply through its own API. Tagging Wikidata in case they are aware of any issues related to these APIs.

I'm not sure of anything that would be causing this.
What sort of requests is cxserver making to the Wikidata API?

Usually there is a request id of some kind in the error message, but this one has none. I wonder if this is some kind of rate-limiting or timeout issue?

Change 702077 had a related patch set uploaded (by Nikerabbit; author: Nikerabbit):

[translatewiki@master] Add systemd timer for autoexport

https://gerrit.wikimedia.org/r/702077

Ignore the above; I accidentally pasted the wrong bug number.

My current understanding:

  • This error does not originate from cxserver.
  • Usually API errors contain an API response with error codes.
  • The HTTP 503 error is hiding the underlying issue, and it gives no error codes or request ids to explore further. I do not know how to find the underlying issue.
  • This issue affects not only CXServer but also other projects using the tls-proxy, at a rate of over 100 per minute: https://logstash.wikimedia.org/goto/67bdc723be54629059656d3034460bad
  • Requests are failing not only to Wikidata but also to Commons and other wikis.
  • One simple query that failed is https://wikidata.org/w/api.php?action=wbgetentities&props=labels&ids=Q1139546&languages=en&format=json – this should not fail for any reason related to the query itself.
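The distinction drawn above — a well-formed MediaWiki API error (which carries an error code) versus the opaque HTML 503 page (which carries nothing) — can be made mechanical. A sketch, assuming the standard `{"error": {"code": ..., "info": ...}}` shape of MediaWiki JSON error replies (the helper name is mine):

```python
import json
from typing import Optional

def api_error_code(body: str) -> Optional[str]:
    """Return the MediaWiki API error code if the body is a JSON error
    reply, or None for anything else (e.g. the bare HTML 503 page,
    which is exactly the case that left this task with nothing to go on)."""
    try:
        payload = json.loads(body)
    except ValueError:
        # Not JSON at all: the HTML maintenance page lands here.
        return None
    if isinstance(payload, dict) and isinstance(payload.get("error"), dict):
        return payload["error"].get("code")
    # Valid JSON but a successful reply (e.g. {"entities": {...}}).
    return None
```

A body for which this returns `None` with HTTP 503 is the uninformative case described in this thread: no error code, no request id, nothing to correlate with server-side logs.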

Is this a general issue with the tls-proxy, where some fraction of requests fail? Or, if not, and this is just the way it reports all errors from upstream, how can we make the tls-proxy log the actual issue somewhere (or where is that log, if it already exists)?

Picking up from the IRC conversation yesterday, @RLazarus figured out that the response body looks like it is https://gerrit.wikimedia.org/g/operations/mediawiki-config/+/master/errorpages/503.html
At the time this issue was opened (June 21) we did have some database issues, so the increased rate of 503s from the apiservers is most likely due to that.

Looking at the last two weeks, the picture has changed from June to now, with only a handful of requests failing for cxserver, most of them due to upstream connection failure ("UF" in the response flags field). Those errors might happen from time to time because the service-proxy creates persistent connections, which might then get closed server-side or dropped due to network issues. But as that is happening at a very low rate, we have not dug into it further so far.
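For readers not fluent in Envoy access logs: the "UF" mentioned above is one of Envoy's response flags, a comma-separated field in each access-log entry. A small lookup sketch covering a few common flags (descriptions paraphrased from Envoy's documentation; the function name is mine):

```python
# A few common Envoy access-log response flags; "UF" is the one seen
# in the cxserver tls-proxy logs discussed in this task.
ENVOY_RESPONSE_FLAGS = {
    "UF": "upstream connection failure",
    "UH": "no healthy upstream hosts",
    "UT": "upstream request timeout",
    "NR": "no route configured for the request",
}

def describe_flags(flags: str) -> list:
    """Translate a comma-separated response-flags field ("-" means none)
    into human-readable descriptions."""
    return [ENVOY_RESPONSE_FLAGS.get(f, f"unknown flag {f!r}")
            for f in flags.split(",") if f and f != "-"]
```

So a log line carrying `UF` points at the proxy failing to connect to its upstream, consistent with the persistent-connection explanation given above.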

Thanks a lot!

It seems these errors have reduced greatly or vanished now (at least in the last 24 hours!).

@Nikerabbit We can wait for a day or two and then move it to Done.

Moving to done. Errors are no longer happening, or at least not frequently.

Nikerabbit claimed this task.