repeated 503 errors for 90 minutes now
Closed, Resolved (Public)

Assigned To

Authored By

	doctaxon
	Sep 23 2016, 8:44 AM

Description

Full error output from a bot job running on tools.taxonbot:

/ format json / maxlag 5 / action query / prop info / titles {:Erika Sunnegårdh}
<!DOCTYPE html>
<html lang=en>
<meta charset=utf-8>
<title>Wikimedia Error</title>
<style>
* { margin: 0; padding: 0; }
body { background: #fff; font: 15px/1.6 sans-serif; color: #333; }
.content { margin: 7% auto 0; padding: 2em 1em 1em; max-width: 560px; }
.footer { clear: both; margin-top: 14%; border-top: 1px solid #e5e5e5; background: #f9f9f9; padding: 2em 0; font-size: 0.8em; text-align: center; }
img { float: left; margin: 0 2em 2em 0; }
a img { border: 0; }
h1 { margin-top: 1em; font-size: 1.2em; }
p { margin: 0.7em 0 1em 0; }
a { color: #0645AD; text-decoration: none; }
a:hover { text-decoration: underline; }
code { font-family: sans-serif; }
.text-muted { color: #777; }
</style>
<div class="content" role="main">
<a href="//www.wikimedia.org"><img src="//www.wikimedia.org/static/images/wmf.png" srcset="//www.wikimedia.org/static/images/wmf-2x.png 2x" alt=Wikimedia width=135 height=135></a>
<h1>Error</h1>
<p>Our servers are currently under maintenance or experiencing a technical problem. This is probably temporary and should be fixed&nbsp;soon.<br>The error message at the bottom of this page should contain more information.<br>Please <a href="" title="Reload this page" onclick="window.location.reload(false); return false">try again</a> accordingly in a few&nbsp;minutes.</p>
</div>
<div class="footer">
<p>If you report this error to the Wikimedia System Administrators, please include the details below.</p>
<p class="text-muted"><code>
Request from 10.68.23.58 via cp1065 cp1065, Varnish XID 3997275082<br>Error: 503, Service Unavailable at Fri, 23 Sep 2016 08:28:19 GMT</code></p></div></html>
missing value to go with key
    while executing
"dict get [json::json2dict $json] {*}$args"
    (procedure "get" line 2)
    invoked from within
"get $json query pages"
    (procedure "page" line 2)
    invoked from within
"page [post $wiki {*}$query / prop info / titles $lemma]"
    (procedure "redirect" line 3)
    invoked from within
"redirect $item"
    ("foreach" body line 1)
    invoked from within
"foreach item $znline {if {![missing $item] && ![redirect $item] && $item ni $bkl} {lappend nline2 \[\[$item\]\]}}"
    invoked from within
"if ![empty line] {
                                regsub -all -nocase -- {\[\[(?!Datei:|File:|:)} $line \[\[: nline
                                if {$listformat eq {SHORTLIST}} {
                                        regexp -- {(\d\d\...."
    ("foreach" body line 3)
    invoked from within
"foreach line $locportal {
                        lassign {} nline2 nline12
                        if ![empty line] {
                                regsub -all -nocase -- {\[\[(?!Datei:|File:|:)} $line \[\[: nline
                        ..."
    (body of "dict with")
    invoked from within
"dict with data {
#puts \n$portal\n$data
                if {$listformat eq {SHORTLIST}} {set dateformat %d.%m.} else {set dateformat {%d. %b}}
                set altdate [clock ..."
    ("foreach" body line 7)
    invoked from within
"foreach {portal data} $e {
if {$portal ne {Benutzer:Nobart/Neue Filme} && !$aaaa} {continue} else {incr aaaa}
#puts $portal
#if $offset {break}
#if {$..."
    (file "./NeueArtikel4.tcl" line 33)

This error has been occurring at intervals of seconds to minutes for about 90 minutes now.
I suspect this part is the interesting one: Request from 10.68.23.58 via cp1065 cp1065, Varnish XID 3997275082<br>Error: 503, Service Unavailable at Fri, 23 Sep 2016 08:28:19 GMT

cp1065: is there a problem reaching the proxy?
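
A note on the trace above: the "missing value to go with key" failure appears to come from feeding the HTML error page into the JSON parser. A minimal sketch of a guard, written in Python rather than the bot's Tcl, and assuming the client can see the HTTP status and Content-Type of the reply (the names `NotJsonError` and `pages_from_reply` are hypothetical, not part of the bot):

```python
import json


class NotJsonError(Exception):
    """Raised when the API reply is not a usable JSON document."""


def pages_from_reply(status, content_type, body):
    """Return the query.pages mapping, or raise NotJsonError for error pages."""
    # A 503 from the cache layer arrives as text/html, so check before parsing.
    if status != 200 or "json" not in content_type.lower():
        raise NotJsonError(f"HTTP {status}, Content-Type {content_type!r}")
    try:
        data = json.loads(body)
    except json.JSONDecodeError as exc:
        raise NotJsonError(f"unparseable body: {exc}") from exc
    return data["query"]["pages"]
```

With a guard like this the caller can log the failed request and retry it, instead of aborting the whole run with a parser error.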

Related Objects

		Status	Subtype	Assigned	Task
		Resolved		Joe	T146451 repeated 503 errors for 90 minutes now
		Resolved		Joe	T147773 Restart HHVM on API appservers every about 48 hours

Event Timeline

doctaxon created this task.Sep 23 2016, 8:44 AM

Restricted Application added a project: Cloud-Services.Sep 23 2016, 8:44 AM

Restricted Application added subscribers: Luke081515, Aklapper.

doctaxon triaged this task as High priority.Sep 23 2016, 8:45 AM

Adding Traffic so they can give it a quick look.

jcrespo unsubscribed.Sep 23 2016, 10:34 AM

Changed the priority because many bot scripts that Wikipedia users rely on for their work have to run. The unbreak is open.

Request from 10.68.23.58 via cp1055 cp1055, Varnish XID 3441681239<br>Error: 503, Service Unavailable at Fri, 23 Sep 2016 11:15:26 GMT

I got this error from cp1055 now, too.
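
Since the cache host differs between reports, it may help to pull the details out of the error footer automatically. A rough sketch in Python, assuming the footer always has the shape quoted above (the regular expression and the name `parse_error_footer` are illustrative only):

```python
import re

# Matches the footer of the Wikimedia error page as quoted in this task.
FOOTER_RE = re.compile(
    r"Request from (?P<client>\S+) via (?P<cache>cp\d+)(?: cp\d+)*, "
    r"Varnish XID (?P<xid>\d+)<br>"
    r"Error: (?P<status>\d{3}), (?P<reason>.+?) at (?P<when>.+? GMT)"
)


def parse_error_footer(text):
    """Return the footer fields as a dict, or None if no footer is found."""
    match = FOOTER_RE.search(text)
    return match.groupdict() if match else None
```

Counting the `cache` field over many failures would show whether the errors cluster on one proxy or are spread across all of them.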

Restricted Application added subscribers: Jay8g, TerraCodes.Sep 23 2016, 11:19 AM

@doctaxon can you indicate the full url you are trying?

from chat about the topic:

11:26 < wikibugs> Labs, Tool-Labs, Operations, Traffic: repeated 503 errors
                  for 90 minutes now on cp1065 -
                  https://phabricator.wikimedia.org/T146451#2661965
                  (jcrespo) @doctaxon can you indicate the full url you are
                  trying?
11:29 < doctaxon> jynus: full url? I am trying it by API
11:29 < doctaxon> format json / maxlag 5 / action query / prop info / titles
                  {:Kurt Couto}
11:29 < jynus> yes, but mediawiki api requires a host and a url, which one
               are you using?
11:30 < jynus> e.g. https://en.wikipedia.org/w/api.php?action=my_call
11:31 < jynus> maybe you are not using https, which is required
11:31 < doctaxon> this is an example of many:
https://de.wikipedia.org/w/index.php?title=Kurt_Couto&action=info
11:32 < jynus> ok
11:32 < jynus> thank you, can you add that to the ticket so more people see
               it?
11:32 < jynus> if we have the url, we can see the logs, so it is important
11:32 < Steinsplitter> doctaxon: is the API answering you? (I haven't been
                       following along), I'm not getting any reply from the API
11:34 < doctaxon> Steinsplitter: see error line in T146451
11:34 < stashbot> T146451: repeated 503 errors for 90 minutes now on cp1065
                  - https://phabricator.wikimedia.org/T146451
11:35 < doctaxon> this is what I got in the shell, on bastion and the grid
11:35 < Steinsplitter> same problem here.
11:35 < doctaxon> jynus: I got the error now from cp1053
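
For reference, the parameter list from the chat can be turned into the full API URL that jynus asked for. A small sketch in Python, assuming the request goes to `/w/api.php` on de.wikipedia.org over HTTPS as a GET (the bot itself seems to use a `post` procedure, so its actual request may differ; `api_url` is a hypothetical helper):

```python
from urllib.parse import urlencode


def api_url(host, **params):
    """Build an HTTPS api.php URL for the given wiki host and query parameters."""
    return f"https://{host}/w/api.php?" + urlencode(params)
```

Calling `api_url("de.wikipedia.org", action="query", prop="info", titles="Kurt Couto", format="json", maxlag=5)` gives `https://de.wikipedia.org/w/api.php?action=query&prop=info&titles=Kurt+Couto&format=json&maxlag=5`, which is the kind of URL that can be matched against the server logs.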

doctaxon added a subscriber: Steinsplitter.Sep 23 2016, 11:38 AM

doctaxon added a subscriber: Giftpflanze.

I am getting the problem when accessing the Wikimedia Commons API via Labs or the Labs grid engine.

For example, when attempting to get image info for File:@Baldwin_School_Auditorium.jpg via the API.

@Steinsplitter do you get the data correctly if you try from your computer?

Is the error related to the cache proxies, given that there are reports from cp1065, cp1053, cp1055 ...?

In T146451#2661974, @doctaxon wrote:

Is the error related to the cache proxies, given that there are reports from cp1065, cp1053, cp1055 ...?

It depends. All the errors might come from one cache backend for example, but at this point I guess it's more probable the issue is upstream from that.

Who is responsible for that?

In T146451#2661972, @Joe wrote:

@Steinsplitter do you get the data correctly if you try from your computer?

Yes, ~ 40 successful attempts.

next error trying this:
https://de.wikipedia.org/w/index.php?title=Offshore-Windpark_Borssele&action=info

/ format json / maxlag 5 / action query / prop info / titles {Offshore-Windpark Borssele}
<!DOCTYPE html>
<html lang=en>
<meta charset=utf-8>
<title>Wikimedia Error</title>
<style>
* { margin: 0; padding: 0; }
body { background: #fff; font: 15px/1.6 sans-serif; color: #333; }
.content { margin: 7% auto 0; padding: 2em 1em 1em; max-width: 560px; }
.footer { clear: both; margin-top: 14%; border-top: 1px solid #e5e5e5; background: #f9f9f9; padding: 2em 0; font-size: 0.8em; text-align: center; }
img { float: left; margin: 0 2em 2em 0; }
a img { border: 0; }
h1 { margin-top: 1em; font-size: 1.2em; }
p { margin: 0.7em 0 1em 0; }
a { color: #0645AD; text-decoration: none; }
a:hover { text-decoration: underline; }
code { font-family: sans-serif; }
.text-muted { color: #777; }
</style>
<div class="content" role="main">
<a href="//www.wikimedia.org"><img src="//www.wikimedia.org/static/images/wmf.png" srcset="//www.wikimedia.org/static/images/wmf-2x.png 2x" alt=Wikimedia width=135 height=135></a>
<h1>Error</h1>
<p>Our servers are currently under maintenance or experiencing a technical problem. This is probably temporary and should be fixed&nbsp;soon.<br>The error message at the bottom of this page should contain more information.<br>Please <a href="" title="Reload this page" onclick="window.location.reload(false); return false">try again</a> accordingly in a few&nbsp;minutes.</p>
</div>
<div class="footer">
<p>If you report this error to the Wikimedia System Administrators, please include the details below.</p>
<p class="text-muted"><code>
Request from 10.68.23.58 via cp1067 cp1067, Varnish XID 2356430101<br>Error: 503, Service Unavailable at Fri, 23 Sep 2016 11:57:01 GMT</code></p></div></html>
missing value to go with key
    while executing
"dict get [json::json2dict $json] {*}$args"
    (procedure "get" line 2)
    invoked from within
"get $json query pages"
    (procedure "page" line 2)
    invoked from within
"page [post $wiki {*}$query / prop info / titles $lemma]"
    (procedure "missing" line 3)
    invoked from within
"missing $lemma"
    ("foreach" body line 13)
    invoked from within
"foreach {hit lemma ts} [lrange [join [lsort -unique -decreasing $titles]] 1 end] {
                        if {$lemma eq {/leer/}} {continue}
                        if {$hit == 3} {
#                               puts..."
    (body of "dict with")
    invoked from within
"dict with data {
#puts \n$portal\n$data
                if {$listformat eq {SHORTLIST}} {set dateformat %d.%m.} else {set dateformat {%d. %b}}
                set altdate [clock ..."
    ("foreach" body line 7)
    invoked from within
"foreach {portal data} $e {
if {$portal ne {Portal:Politikwissenschaft/Neue Artikel} && !$aaaa} {continue} else {incr aaaa}
#puts $portal
#if $offset {..."
    (file "./NeueArtikel4.tcl" line 33)

@doctaxon do you get an error consistently for that URL? If so, where are you trying from?

I still can't reproduce your problem, which seems not to be limited to the API servers after all.

No, it is not consistent but random. So far it has always been API info queries; only the titles parameter differs.
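
Because the failures are random rather than tied to one request, a client-side retry is a plausible workaround until the cause is found. A minimal sketch in Python, assuming a caller-supplied `fetch()` that returns a `(status, body)` pair (`call_with_retry` and its parameters are hypothetical and not taken from the bot):

```python
import time


def call_with_retry(fetch, attempts=5, base_delay=1.0, retry_on=(503,), sleep=time.sleep):
    """Call fetch() until the status is not retryable or attempts run out.

    The delay doubles after every failed attempt: base_delay, 2x, 4x, ...
    """
    for attempt in range(attempts):
        status, body = fetch()
        if status not in retry_on:
            return status, body
        # Do not sleep after the final attempt; just hand back the last reply.
        if attempt < attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return status, body
```

This only hides the symptom on the bot side; the errors still need to be fixed upstream.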

• ema subscribed.Sep 23 2016, 12:10 PM

It has been running fine for 8 minutes now.

I've tried reproducing the issue for a while without success. @Joe restarted hhvm on mw1280-90 due to memory leaks, perhaps that helped?

zhuyifei1999 subscribed.Sep 23 2016, 12:36 PM

Okay, I suppose the problem has been solved. What did you do to solve it?

@doctaxon: nothing, except for @Joe's restart of the HHVMs mentioned above.

@doctaxon I tracked down mw1203 and mw1280-1290 as potential sources of the problem because of how much CPU/RAM they were consuming, and issued a rolling restart of those servers, as logged in the SAL. I noticed that after restarting the service on the first three, the 5xx count on api.php went down significantly, so I decided to wait half an hour before continuing with the restarts. It seems that one of the aforementioned servers was the cause of the (still limited) number of errors you were seeing.

Clearly our monitoring should be more fine-grained and also check the number of 500s we serve per host. Across the whole 50-machine cluster, the number of errors was still too low to alarm us.
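
To illustrate the per-host idea: a single bad server can have a high error rate while the cluster-wide rate stays below any alert threshold. A rough sketch in Python, assuming request logs are available as `(host, status)` pairs (the thresholds and the name `noisy_hosts` are made up for this example and do not describe the actual monitoring setup):

```python
from collections import Counter


def noisy_hosts(records, min_requests=100, max_error_rate=0.05):
    """Return hosts whose own 5xx rate exceeds max_error_rate, sorted by name."""
    total = Counter()
    errors = Counter()
    for host, status in records:
        total[host] += 1
        if 500 <= status <= 599:
            errors[host] += 1
    # Ignore hosts with too few requests to give a meaningful rate.
    return sorted(
        host
        for host, count in total.items()
        if count >= min_requests and errors[host] / count > max_error_rate
    )
```

With 50 healthy hosts serving 100 requests each and one host failing 30 of its 100 requests, the cluster-wide error rate is under 1%, yet that one host is flagged.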

Top! Thank you very much!

Joe closed this task as Resolved.Sep 23 2016, 1:05 PM

Joe claimed this task.

Hi, I think a restart is needed again; there are too many 503 errors on several proxy servers such as cp1053, cp1054 and cp1067.

Reasonable bot operation is no longer possible.

If these errors keep recurring, I suggest a technical check of these proxies.

Under load (different API URLs), the error occurs about every 1.5 minutes (!)

(Sorry, but what is an "Unbreak Now!" error report good for if nobody here unbreaks it now? I haven't even got a reply here ... :-( )

Mentioned in SAL (#wikimedia-operations) [2016-10-07T08:20:55Z] <_joe_> restarting hhvm on a few api appservers, due to memory leaks (T146451)

• ema moved this task from Backlog to Radar/Not for service by Traffic on the Traffic board.Oct 7 2016, 8:32 AM

elukey subscribed.Oct 7 2016, 9:06 AM

All the restarts have just finished; the cluster should be in much better shape now.

Joe closed this task as Resolved.Oct 7 2016, 9:45 AM

BBlack renamed this task from repeated 503 errors for 90 minutes now on cp1065 to repeated 503 errors for 90 minutes now.Oct 7 2016, 2:18 PM

(took the cache host out of the title to prevent confusion in future Phab searches for problems on specific cache hosts, since it didn't turn out to be relevant).

Joe mentioned this in T147773: Restart HHVM on API appservers every about 48 hours.Oct 10 2016, 6:36 AM

Joe created subtask T147773: Restart HHVM on API appservers every about 48 hours.

Hi!

I have the same problems again. I think HHVM on the API appservers has to be restarted again due to the memory leak.

Thank you ...

TheresNoTime subscribed.Oct 17 2016, 3:31 PM

en.wp just returned 503 for me. cp1052

Thibaut120094 awarded a token.Oct 17 2016, 3:34 PM

Thibaut120094 subscribed.

yuvipanda removed projects: Cloud-Services, Toolforge.Oct 17 2016, 3:39 PM

TheresNoTime awarded a token.Oct 17 2016, 3:46 PM

Starting-to-pile-on problem report at en.WP. In summary: The world is ending. :D

Xaosflux subscribed.Oct 17 2016, 4:05 PM

Cameron11598 subscribed.Oct 17 2016, 4:15 PM

DatGuy subscribed.Oct 17 2016, 4:33 PM

BethNaught subscribed.Oct 17 2016, 4:51 PM

Anomie merged a task: T148448: Api cluster issues.Oct 17 2016, 5:52 PM

Anomie added subscribers: Zppix, Paladox.

JEumerus subscribed.Oct 17 2016, 6:44 PM

Betacommand subscribed.Oct 17 2016, 8:51 PM

greg subscribed.Oct 17 2016, 9:20 PM

Jnorton7558 subscribed.Oct 17 2016, 11:54 PM

For the record, yesterday's problem is different from the one we had before; it's still a memleak but of a different nature.

if no ticket is open for that, I'll open one this morning.

stwalkerster subscribed.Oct 18 2016, 11:43 AM

Joe closed subtask T147773: Restart HHVM on API appservers every about 48 hours as Resolved.Oct 19 2016, 8:38 AM

Joe closed this task as Resolved.Oct 19 2016, 11:17 AM

Hi Joe!

For about two weeks (or longer) I have been getting a very similar error, but this one is a 502 Bad Gateway, and it happens several times within 5 minutes (!) when doing simple API queries or the API login action. Something seems to be going very wrong here.

These are some of the errors:

/ format json / maxlag 5 / action query / prop templates / titles {Geschichte des Verkehrs} / tltemplates {Vorlage:QS-Transport und Verkehr}
<html>
<head><title>502 Bad Gateway</title></head>
<body bgcolor="white">
<center><h1>502 Bad Gateway</h1></center>
<hr><center>nginx/1.11.6</center>
</body>
</html>

/ format json / maxlag 5 / action query / prop templates / titles {Geschichte des Verkehrs} / tltemplates Vorlage:QS-Übersetzung
<html>
<head><title>502 Bad Gateway</title></head>
<body bgcolor="white">
<center><h1>502 Bad Gateway</h1></center>
<hr><center>nginx/1.11.6</center>
</body>
</html>

/ format json / maxlag 5 / action query / prop templates / titles {Geschichte des Verkehrs} / tltemplates Vorlage:QS-Unternehmen
<html>
<head><title>502 Bad Gateway</title></head>
<body bgcolor="white">
<center><h1>502 Bad Gateway</h1></center>
<hr><center>nginx/1.11.6</center>
</body>
</html>

/ format json / maxlag 5 / action query / prop revisions / rvprop content / rvlimit 1 / titles Wikipedia:Löschprüfung / rvsection 20 / utf8
<html>
<head><title>502 Bad Gateway</title></head>
<body bgcolor="white">
<center><h1>502 Bad Gateway</h1></center>
<hr><center>nginx/1.11.6</center>
</body>
</html>

doctaxon added a comment.Nov 26 2016, 4:55 AM

This comment was removed by doctaxon.

Now I get a very similar error since about two weeks (or longer), but this 502 Bad Gateway

@doctaxon: This task is about 503 errors. Please file separate tasks for separate bugs.
Restoring previous "resolved" status of this task.

doctaxon mentioned this in T151686: several 502 Bad Gateway.Nov 26 2016, 6:46 PM
