repeated 503 errors for 90 minutes now
Closed, Resolved · Public

Description

Full error output from a bot job running as tools.taxonbot:

/ format json / maxlag 5 / action query / prop info / titles {:Erika Sunnegårdh}
<!DOCTYPE html>
<html lang=en>
<meta charset=utf-8>
<title>Wikimedia Error</title>
<style>
* { margin: 0; padding: 0; }
body { background: #fff; font: 15px/1.6 sans-serif; color: #333; }
.content { margin: 7% auto 0; padding: 2em 1em 1em; max-width: 560px; }
.footer { clear: both; margin-top: 14%; border-top: 1px solid #e5e5e5; background: #f9f9f9; padding: 2em 0; font-size: 0.8em; text-align: center; }
img { float: left; margin: 0 2em 2em 0; }
a img { border: 0; }
h1 { margin-top: 1em; font-size: 1.2em; }
p { margin: 0.7em 0 1em 0; }
a { color: #0645AD; text-decoration: none; }
a:hover { text-decoration: underline; }
code { font-family: sans-serif; }
.text-muted { color: #777; }
</style>
<div class="content" role="main">
<a href="//www.wikimedia.org"><img src="//www.wikimedia.org/static/images/wmf.png" srcset="//www.wikimedia.org/static/images/wmf-2x.png 2x" alt=Wikimedia width=135 height=135></a>
<h1>Error</h1>
<p>Our servers are currently under maintenance or experiencing a technical problem. This is probably temporary and should be fixed&nbsp;soon.<br>The error message at the bottom of this page should contain more information.<br>Please <a href="" title="Reload this page" onclick="window.location.reload(false); return false">try again</a> accordingly in a few&nbsp;minutes.</p>
</div>
<div class="footer">
<p>If you report this error to the Wikimedia System Administrators, please include the details below.</p>
<p class="text-muted"><code>
Request from 10.68.23.58 via cp1065 cp1065, Varnish XID 3997275082<br>Error: 503, Service Unavailable at Fri, 23 Sep 2016 08:28:19 GMT</code></p></div></html>
missing value to go with key
    while executing
"dict get [json::json2dict $json] {*}$args"
    (procedure "get" line 2)
    invoked from within
"get $json query pages"
    (procedure "page" line 2)
    invoked from within
"page [post $wiki {*}$query / prop info / titles $lemma]"
    (procedure "redirect" line 3)
    invoked from within
"redirect $item"
    ("foreach" body line 1)
    invoked from within
"foreach item $znline {if {![missing $item] && ![redirect $item] && $item ni $bkl} {lappend nline2 \[\[$item\]\]}}"
    invoked from within
"if ![empty line] {
                                regsub -all -nocase -- {\[\[(?!Datei:|File:|:)} $line \[\[: nline
                                if {$listformat eq {SHORTLIST}} {
                                        regexp -- {(\d\d\...."
    ("foreach" body line 3)
    invoked from within
"foreach line $locportal {
                        lassign {} nline2 nline12
                        if ![empty line] {
                                regsub -all -nocase -- {\[\[(?!Datei:|File:|:)} $line \[\[: nline
                        ..."
    (body of "dict with")
    invoked from within
"dict with data {
#puts \n$portal\n$data
                if {$listformat eq {SHORTLIST}} {set dateformat %d.%m.} else {set dateformat {%d. %b}}
                set altdate [clock ..."
    ("foreach" body line 7)
    invoked from within
"foreach {portal data} $e {
if {$portal ne {Benutzer:Nobart/Neue Filme} && !$aaaa} {continue} else {incr aaaa}
#puts $portal
#if $offset {break}
#if {$..."
    (file "./NeueArtikel4.tcl" line 33)

This error has been occurring at intervals of seconds to minutes for about 90 minutes now.
I suppose this is the interesting part: Request from 10.68.23.58 via cp1065 cp1065, Varnish XID 3997275082<br>Error: 503, Service Unavailable at Fri, 23 Sep 2016 08:28:19 GMT

cp1065: problems reaching the proxy?
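
The trace above shows the HTML error page being handed straight to the JSON parser, which is why the script stops with "missing value to go with key" instead of reporting the 503. As a rough sketch only (in Python rather than the bot's Tcl, and with hypothetical names such as `parse_api_reply` and `ApiResponseError` that are not part of the bot), a client could check the reply before decoding it:

```python
import json


class ApiResponseError(Exception):
    """Raised when a reply is not the JSON document the caller expected."""


def parse_api_reply(status, content_type, body):
    """Decode an API reply, or raise ApiResponseError with some context.

    status, content_type and body are whatever the HTTP client reported;
    how they are obtained is outside the scope of this sketch.
    """
    if status != 200:
        raise ApiResponseError(f"HTTP {status} instead of an API reply")
    if "json" not in content_type.lower():
        raise ApiResponseError(f"unexpected content type: {content_type}")
    try:
        return json.loads(body)
    except json.JSONDecodeError as exc:
        raise ApiResponseError(f"body is not valid JSON: {exc}") from exc


def get_pages(reply):
    """Rough counterpart of `get $json query pages` from the trace."""
    try:
        return reply["query"]["pages"]
    except (KeyError, TypeError) as exc:
        raise ApiResponseError("reply has no query.pages element") from exc
```

With a guard like this, a 503 page surfaces as a clear "HTTP 503" error that the caller can log or retry, rather than as a parser failure deep inside the script.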

Event Timeline

doctaxon created this task. Sep 23 2016, 8:44 AM
Restricted Application added a project: Cloud-Services. Sep 23 2016, 8:44 AM
Restricted Application added subscribers: Luke081515, Aklapper.
doctaxon triaged this task as High priority. Sep 23 2016, 8:45 AM
jcrespo added a subscriber: jcrespo.

Adding Traffic so they can give it a quick look.

jcrespo removed a subscriber: jcrespo. Sep 23 2016, 10:34 AM
doctaxon raised the priority of this task from High to Unbreak Now!. Sep 23 2016, 11:19 AM

Changed the priority because a lot of bot scripts that Wikipedia users rely on have to run. The problem is still open.

Request from 10.68.23.58 via cp1055 cp1055, Varnish XID 3441681239<br>Error: 503, Service Unavailable at Fri, 23 Sep 2016 11:15:26 GMT

I got this error from cp1055 now, too.
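
Since the footer line names the cache host, the XID and the status, it may help to pull those fields out of each error page and count which hosts appear. The following is only a sketch: the regular expression is written against the two footer lines quoted in this task and is an assumption about the format, not a documented one, and `parse_footer` and `tally_cache_hosts` are hypothetical names.

```python
import re
from collections import Counter

# Pattern inferred from the footer lines quoted in this task (an assumption).
FOOTER_RE = re.compile(
    r"Request from (?P<client>\S+) via (?P<cache>\S+) \S+, "
    r"Varnish XID (?P<xid>\d+)(?:<br>|\s)*"
    r"Error: (?P<status>\d{3}), (?P<reason>.+?) at (?P<when>.+? GMT)"
)


def parse_footer(text):
    """Return the footer fields as a dict, or None if no footer is found."""
    match = FOOTER_RE.search(text)
    return match.groupdict() if match else None


def tally_cache_hosts(texts):
    """Count how often each cache host shows up in a list of error pages."""
    counts = Counter()
    for text in texts:
        info = parse_footer(text)
        if info is not None:
            counts[info["cache"]] += 1
    return counts
```

A tally spread across many cache hosts would suggest the hosts named in the footer are only where the error was reported, not necessarily where it originated.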

Restricted Application added subscribers: Jay8g, TerraCodes. Sep 23 2016, 11:19 AM

@doctaxon can you indicate the full url you are trying?

From the chat about this topic:

11:26 < wikibugs> Labs, Tool-Labs, Operations, Traffic: repeated 503 errors
                  for 90 minutes now on cp1065 -
                  https://phabricator.wikimedia.org/T146451#2661965
                  (jcrespo) @doctaxon can you indicate the full url you are
                  trying?
11:29 < doctaxon> jynus: full url? I am trying it by API
11:29 < doctaxon> format json / maxlag 5 / action query / prop info / titles
                  {:Kurt Couto}
11:29 < jynus> yes, but mediawiki api requires a host and a url, which one
               are you using?
11:30 < jynus> e.g. https://en.wikipedia.org/w/api.php?action=my_call
11:31 < jynus> maybe you are not using https, which is required
11:31 < doctaxon> this is an example of many:
https://de.wikipedia.org/w/index.php?title=Kurt_Couto&action=info
11:32 < jynus> ok
11:32 < jynus> thank you, can you add that to the ticket so more people see
               it?
11:32 < jynus> if we have the url, we can see the logs, so it is important
11:32 < Steinsplitter> doctaxon: does the api respond for you? (I haven't
                       been following along), I get no response from the api
11:34 < doctaxon> Steinsplitter: see error line in T146451
11:34 < stashbot> T146451: repeated 503 errors for 90 minutes now on cp1065
                  - https://phabricator.wikimedia.org/T146451
11:35 < doctaxon> this is what I got in the shell, on bastion and the grid
11:35 < Steinsplitter> same problem here.
11:35 < doctaxon> jynus: I got the error now from cp1053
doctaxon added a subscriber: Giftpflanze.
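
The chat above asks for the full request URL. As an illustration only, the parameter list the bot prints can be turned into one like this; it assumes the bot talks to `https://de.wikipedia.org/w/api.php` (the task only shows an `index.php` URL on that host), and `build_api_url` is a hypothetical helper, not something the bot contains.

```python
from urllib.parse import urlencode


def build_api_url(host, params):
    """Build an HTTPS api.php URL for the given host and query parameters."""
    return f"https://{host}/w/api.php?" + urlencode(params)


# Parameters as printed in the chat; the leading colon of the title is omitted.
example_params = {
    "format": "json",
    "maxlag": 5,
    "action": "query",
    "prop": "info",
    "titles": "Kurt Couto",
}
example_url = build_api_url("de.wikipedia.org", example_params)
```

Posting a URL in this form on the task lets the people reading the server logs search for the exact request.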

I'm getting the problem when accessing the Wikimedia Commons API via Labs or the Labs grid engine.

For example, when attempting to get image info for File:@Baldwin_School_Auditorium.jpg via the API.

Joe added a subscriber: Joe. Sep 23 2016, 11:43 AM

@Steinsplitter do you get the data correctly if you try from your computer?

Is the error related to the cache proxies, given that there are reports from cp1065, cp1053, cp1055 ...?

Joe added a comment. Sep 23 2016, 11:44 AM

Is the error related to the cache proxies, given that there are reports from cp1065, cp1053, cp1055 ...?

It depends. All the errors might come from one cache backend for example, but at this point I guess it's more probable the issue is upstream from that.

Who is responsible for that?

@Steinsplitter do you get the data correctly if you try from your computer?

Yes, ~ 40 successful attempts.

Next error, trying this:
https://de.wikipedia.org/w/index.php?title=Offshore-Windpark_Borssele&action=info

/ format json / maxlag 5 / action query / prop info / titles {Offshore-Windpark Borssele}
<!DOCTYPE html>
<html lang=en>
<meta charset=utf-8>
<title>Wikimedia Error</title>
<style>
* { margin: 0; padding: 0; }
body { background: #fff; font: 15px/1.6 sans-serif; color: #333; }
.content { margin: 7% auto 0; padding: 2em 1em 1em; max-width: 560px; }
.footer { clear: both; margin-top: 14%; border-top: 1px solid #e5e5e5; background: #f9f9f9; padding: 2em 0; font-size: 0.8em; text-align: center; }
img { float: left; margin: 0 2em 2em 0; }
a img { border: 0; }
h1 { margin-top: 1em; font-size: 1.2em; }
p { margin: 0.7em 0 1em 0; }
a { color: #0645AD; text-decoration: none; }
a:hover { text-decoration: underline; }
code { font-family: sans-serif; }
.text-muted { color: #777; }
</style>
<div class="content" role="main">
<a href="//www.wikimedia.org"><img src="//www.wikimedia.org/static/images/wmf.png" srcset="//www.wikimedia.org/static/images/wmf-2x.png 2x" alt=Wikimedia width=135 height=135></a>
<h1>Error</h1>
<p>Our servers are currently under maintenance or experiencing a technical problem. This is probably temporary and should be fixed&nbsp;soon.<br>The error message at the bottom of this page should contain more information.<br>Please <a href="" title="Reload this page" onclick="window.location.reload(false); return false">try again</a> accordingly in a few&nbsp;minutes.</p>
</div>
<div class="footer">
<p>If you report this error to the Wikimedia System Administrators, please include the details below.</p>
<p class="text-muted"><code>
Request from 10.68.23.58 via cp1067 cp1067, Varnish XID 2356430101<br>Error: 503, Service Unavailable at Fri, 23 Sep 2016 11:57:01 GMT</code></p></div></html>
missing value to go with key
    while executing
"dict get [json::json2dict $json] {*}$args"
    (procedure "get" line 2)
    invoked from within
"get $json query pages"
    (procedure "page" line 2)
    invoked from within
"page [post $wiki {*}$query / prop info / titles $lemma]"
    (procedure "missing" line 3)
    invoked from within
"missing $lemma"
    ("foreach" body line 13)
    invoked from within
"foreach {hit lemma ts} [lrange [join [lsort -unique -decreasing $titles]] 1 end] {
                        if {$lemma eq {/leer/}} {continue}
                        if {$hit == 3} {
#                               puts..."
    (body of "dict with")
    invoked from within
"dict with data {
#puts \n$portal\n$data
                if {$listformat eq {SHORTLIST}} {set dateformat %d.%m.} else {set dateformat {%d. %b}}
                set altdate [clock ..."
    ("foreach" body line 7)
    invoked from within
"foreach {portal data} $e {
if {$portal ne {Portal:Politikwissenschaft/Neue Artikel} && !$aaaa} {continue} else {incr aaaa}
#puts $portal
#if $offset {..."
    (file "./NeueArtikel4.tcl" line 33)
Joe added a comment. Sep 23 2016, 12:08 PM

@doctaxon do you get an error consistently for that URL? If so, from where are you trying?

I still can't reproduce your problem, which seems not to be limited to the API servers after all.

doctaxon added a comment. Edited · Sep 23 2016, 12:09 PM

No, it's not consistent but random. So far it has always been API info queries; only the titles parameter differs.
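
Because the failures are random rather than tied to one request, a bot can usually ride them out by retrying. The sketch below shows one common approach, exponential backoff on 503; the attempt count and delays are arbitrary assumptions, and `fetch_with_retry` and its `fetch` callback are hypothetical names rather than anything from the bot.

```python
import time

# Status codes treated as transient in this sketch (an assumption).
RETRYABLE = frozenset({503})


def fetch_with_retry(fetch, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it returns a non-retryable status.

    fetch must return a (status, body) tuple. The delay doubles after each
    failed attempt: base_delay, 2 * base_delay, 4 * base_delay, ...
    The last (status, body) is returned even if every attempt failed.
    """
    status, body = None, None
    for attempt in range(attempts):
        status, body = fetch()
        if status not in RETRYABLE:
            return status, body
        if attempt < attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return status, body
```

Retrying hides the symptom from the bot's users but does not replace reporting it, since a rising retry count is still a sign that something upstream is wrong.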

ema added a subscriber: ema. Sep 23 2016, 12:10 PM

It has been running fine for 8 minutes now.

ema added a comment. Edited · Sep 23 2016, 12:28 PM

I've tried reproducing the issue for a while without success. @Joe restarted hhvm on mw1280-90 due to memory leaks, perhaps that helped?

Okay, I suppose the problem has been solved. What did you do to solve it?

ema added a comment. Sep 23 2016, 12:53 PM

@doctaxon: nothing, except for @Joe's restart of the HHVMs mentioned above.

Joe added a comment. Sep 23 2016, 12:55 PM

@doctaxon I tracked down mw1203 and mw1280-1290 as potential sources of problems because of how much CPU/RAM they were consuming, and issued a rolling restart of those servers, as logged in the SAL. I noticed that after restarting the service on the first three, the 5xx count on api.php went down significantly, so I decided to wait half an hour before continuing with the restarts. It seems that one of the aforementioned servers was the cause of the (still limited) number of errors you were seeing.

Clearly our monitoring should be more fine-grained and also check the number of 500s we serve per host. Over the whole 50-machine cluster, the number of errors was still too low to alert us.
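
To illustrate the point about per-host checks, here is a small sketch of how one failing server can stay below a cluster-wide alert threshold while standing out clearly on its own. All numbers, thresholds and host names in it are made up for the example, and `cluster_error_rate` and `flag_hosts` are hypothetical helpers, not the monitoring actually in use.

```python
def cluster_error_rate(per_host):
    """Overall error rate for a dict of {host: (requests, errors)}."""
    total = sum(requests for requests, _ in per_host.values())
    errors = sum(errs for _, errs in per_host.values())
    return errors / total if total else 0.0


def flag_hosts(per_host, host_threshold=0.05, min_requests=100):
    """Return the hosts whose own error rate exceeds host_threshold.

    Hosts with fewer than min_requests requests are skipped so that a
    handful of errors on an idle host does not trigger a flag.
    """
    flagged = []
    for host, (requests, errors) in sorted(per_host.items()):
        if requests >= min_requests and errors / requests > host_threshold:
            flagged.append(host)
    return flagged


# Made-up data: 49 healthy hosts and one host failing 15% of its requests.
example = {f"host{n:02d}": (10000, 5) for n in range(1, 50)}
example["host50"] = (10000, 1500)
```

In this made-up data the cluster-wide rate is about 0.35%, which a 1% aggregate alert would miss, while `flag_hosts(example)` singles out the one bad host.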

Top! Thank you very much!

Joe closed this task as Resolved. Sep 23 2016, 1:05 PM
Joe claimed this task.
doctaxon reopened this task as Open. Edited · Oct 7 2016, 7:06 AM

Hi, I think a restart is needed again; there are too many 503 errors on several proxy servers such as cp1053, cp1054 and cp1067.

Reasonable bot operation is no longer possible.

doctaxon added a comment. Edited · Oct 7 2016, 7:09 AM

If these errors keep recurring, I suggest a technical check of these proxies.

doctaxon added a comment. Edited · Oct 7 2016, 7:52 AM

When sending traffic (different API URLs), the error occurs about every 1.5 minutes (!)

(Sorry, but what is the point of an "Unbreak Now!" report if nobody is here to unbreak it now? I haven't even had a reply here ... :-( )

Mentioned in SAL (#wikimedia-operations) [2016-10-07T08:20:55Z] <_joe_> restarting hhvm on a few api appservers, due to memory leaks (T146451)

ema moved this task from Triage to Watching on the Traffic board. Oct 7 2016, 8:32 AM
elukey added a subscriber: elukey. Oct 7 2016, 9:06 AM
Joe added a comment. Oct 7 2016, 9:06 AM

All the restarts have just finished; the cluster should be in much better shape now.

Joe closed this task as Resolved. Oct 7 2016, 9:45 AM
BBlack renamed this task from "repeated 503 errors for 90 minutes now on cp1065" to "repeated 503 errors for 90 minutes now". Oct 7 2016, 2:18 PM
BBlack added a subscriber: BBlack. Oct 7 2016, 2:21 PM

(took the cache host out of the title to prevent confusion in future Phab searches for problems on specific cache hosts, since it didn't turn out to be relevant).

doctaxon reopened this task as Open. Oct 17 2016, 3:27 PM

Hi!

I've got the same problems again. I think HHVM on the API appservers has to be restarted again due to a memory leak.

Thank you ...

Samtar added a subscriber: Samtar. Oct 17 2016, 3:31 PM

en.wp just returned 503 for me. cp1052

Izno added a subscriber: Izno. Edited · Oct 17 2016, 4:04 PM

Starting-to-pile-on problem report at en.WP. In summary: The world is ending. :D

DatGuy added a subscriber: DatGuy. Oct 17 2016, 4:33 PM
greg added a subscriber: greg. Oct 17 2016, 9:20 PM
Joe added a comment. Oct 18 2016, 7:09 AM

For the record, yesterday's problem is different from the one we had before; it's still a memleak but of a different nature.

If no ticket is open for that, I'll open one this morning.

Joe closed this task as Resolved. Oct 19 2016, 11:17 AM
doctaxon reopened this task as Open. Edited · Nov 26 2016, 4:54 AM

Hi Joe!

For about two weeks (or longer) I have been getting a very similar error, but this 502 Bad Gateway happens several times within 5 minutes (!) when doing simple API queries or an API action login. It seems that something is going very wrong here.

These are some of the errors:

/ format json / maxlag 5 / action query / prop templates / titles {Geschichte des Verkehrs} / tltemplates {Vorlage:QS-Transport und Verkehr}
<html>
<head><title>502 Bad Gateway</title></head>
<body bgcolor="white">
<center><h1>502 Bad Gateway</h1></center>
<hr><center>nginx/1.11.6</center>
</body>
</html>

/ format json / maxlag 5 / action query / prop templates / titles {Geschichte des Verkehrs} / tltemplates Vorlage:QS-Übersetzung
<html>
<head><title>502 Bad Gateway</title></head>
<body bgcolor="white">
<center><h1>502 Bad Gateway</h1></center>
<hr><center>nginx/1.11.6</center>
</body>
</html>

/ format json / maxlag 5 / action query / prop templates / titles {Geschichte des Verkehrs} / tltemplates Vorlage:QS-Unternehmen
<html>
<head><title>502 Bad Gateway</title></head>
<body bgcolor="white">
<center><h1>502 Bad Gateway</h1></center>
<hr><center>nginx/1.11.6</center>
</body>
</html>

/ format json / maxlag 5 / action query / prop revisions / rvprop content / rvlimit 1 / titles Wikipedia:Löschprüfung / rvsection 20 / utf8
<html>
<head><title>502 Bad Gateway</title></head>
<body bgcolor="white">
<center><h1>502 Bad Gateway</h1></center>
<hr><center>nginx/1.11.6</center>
</body>
</html>
This comment was removed by doctaxon.
Aklapper closed this task as Resolved. Nov 26 2016, 2:27 PM

For about two weeks (or longer) I have been getting a very similar error, but this 502 Bad Gateway

@doctaxon: This task is about 503 errors. Please file separate tasks for separate bugs.
Restoring previous "resolved" status of this task.