
Frequent server errors (503 and 502), happened several times in the last 2 days
Closed, Invalid, Public

Description

In the last 2 days, after pressing the "Publish changes" or "Show preview" button in edit mode (while logged in), I have seen the following error message 3 or 4 times.

Error

Our servers are currently under maintenance or experiencing a technical problem. Please try again in a few minutes.

See the error message at the bottom of this page for more information.

If you report this error to the Wikimedia System Administrators, please include the details below.

Request from x.x.x.x via cp3064 cp3064, Varnish XID 1019756636
Error: 503, Backend fetch failed at Sat, 11 Dec 2021 14:50:09 GMT

Were the backend servers actually restarted, or were they under too much traffic load?

Event Timeline

Ade56facc renamed this task from Frequent server errors (503), happened several times in the last 2 days to Frequent backend server errors (503), happened several times in the last 2 days. Dec 11 2021, 3:11 PM

Hi, the same happened to me while uploading some files via https://commons.wikimedia.org/wiki/Special:Upload

Request from - via cp3054.esams.wmnet, ATS/8.0.8
Error: 502, Next Hop Connection Failed at 2021-12-11 15:28:33 GMT

for a 48 MB file.

Request from - via cp3054.esams.wmnet, ATS/8.0.8
Error: 502, Next Hop Connection Failed at 2021-12-11 15:29:10 GMT

for a 41 MB file.

Again now:

Request from 92.145.93.28 via cp3054 cp3054, Varnish XID 529632607
Error: 503, Backend fetch failed at Sat, 11 Dec 2021 23:18:47 GMT

for https://archive.org/download/GustaveFlaubert_LEducationSentimentale/Gustave_Flaubert_-_L_Education_sentimentale_P1_Chap04_V2.mp3
48.9 MB

One more

Request from 92.145.93.28 via cp3054 cp3054, Varnish XID 49490611
Error: 503, Backend fetch failed at Sun, 12 Dec 2021 13:15:28 GMT

for https://www.archive.org/download/walden_version_2_nb_1501_librivox/walden_13_thoreau_128kb.mp3 54.3 MB

Several times the same error with https://archive.org/download/awakening_librivox/awakening_02_chopin.mp3 48.42 MB

Request from 92.145.93.28 via cp3054 cp3054, Varnish XID 252198530
Error: 503, Backend fetch failed at Sun, 12 Dec 2021 15:51:49 GMT

However, the file uploaded fine. This is a pain.

Around 13..14 UTC (1 .. 2 p.m.) I noticed long delays (6 .. 10 seconds) when trying to "Publish" or "Show preview" an article.

In one case the delay was so long that the browser aborted the request (I guess after 15 .. 20 seconds).

ema triaged this task as High priority. Dec 15 2021, 8:56 AM

Hello @Ade56facc and @Yann, thanks for the bug reports.

We may be looking at two different problems here, at least judging from the fact that some of the error codes you reported are 503 originating from Varnish and some are 502 coming from ATS.

All errors reported by @Yann were sent by the cache server with hostname cp3054, strongly suggesting that the ATS 502 errors are from ats-tls and not ats-be. If ats-be were implicated we'd likely see different hostnames in the errors due to the algorithm used to select backend caches.

@Ade56facc: To further diagnose the problem and figure out whether or not we are looking at two distinct issues it would be useful to know which articles you were editing when getting the errors, or at least their approximate size in bytes. My suspicion is that the problem here is about "large" POST requests timing out.
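The triage reasoning above (which server emitted the error, which status code, which cache host) can be applied mechanically to the error footers pasted in this task. The following is only a rough sketch: the regular expressions are inferred from the handful of footers quoted here, not from any documented format, and `parse_footer` and `summarize` are hypothetical helper names.

```python
import re
from collections import Counter

# Patterns inferred from the footers quoted in this task (an assumption,
# not a documented format). The host is cut at the first dot so that
# "cp3054.esams.wmnet" and "cp3054 cp3054" both yield "cp3054".
REQUEST_RE = re.compile(
    r"Request from (?P<client>\S+) via (?P<host>[\w-]+)\S*(?: \S+)?, "
    r"(?P<server>Varnish|ATS)"
)
ERROR_RE = re.compile(r"Error: (?P<status>\d{3}), (?P<reason>.+?) at (?P<when>.+)")


def parse_footer(request_line, error_line):
    """Extract cache host, emitting server, status and reason from one footer."""
    req = REQUEST_RE.search(request_line)
    err = ERROR_RE.search(error_line)
    if req is None or err is None:
        return None  # footer does not look like the ones seen in this task
    return {
        "host": req.group("host"),
        "server": req.group("server"),
        "status": int(err.group("status")),
        "reason": err.group("reason"),
    }


def summarize(footers):
    """Count reports per (server, status, host) to spot distinct problems."""
    return Counter((f["server"], f["status"], f["host"]) for f in footers)


# Footers copied from the reports in this task.
SAMPLE = [
    ("Request from x.x.x.x via cp3064 cp3064, Varnish XID 1019756636",
     "Error: 503, Backend fetch failed at Sat, 11 Dec 2021 14:50:09 GMT"),
    ("Request from - via cp3054.esams.wmnet, ATS/8.0.8",
     "Error: 502, Next Hop Connection Failed at 2021-12-11 15:28:33 GMT"),
    ("Request from - via cp3054.esams.wmnet, ATS/8.0.8",
     "Error: 502, Next Hop Connection Failed at 2021-12-11 15:29:10 GMT"),
]

if __name__ == "__main__":
    parsed = [parse_footer(req, err) for req, err in SAMPLE]
    for key, count in summarize(parsed).items():
        print(key, count)
```

Grouping by server and status separates the Varnish 503s from the ATS 502s, and the host column shows whether all reports come from a single cache server.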

ema renamed this task from Frequent backend server errors (503), happened several times in the last 2 days to Frequent server errors (503 and 502), happened several times in the last 2 days. Dec 15 2021, 9:03 AM


In both cases (December 10..11 and 13) I was editing an article of about 64 KB (wiki source code) on en.wikipedia.org, nothing special or too large.

My Internet speed was good (see below), and when those events happened I was only trying to POST changes (Publish or Show Preview) to an article. There was no other Internet or PC background activity: Firefox 95, with 5 or 6 tabs open on wiki pages, was the only application in use, my PC's disk was idle, and almost no CPU was in use (Windows 10, 8 GB RAM).

Shortly before and after the reported slowdowns I browsed (without editing) a few other small or medium-sized wiki articles. On December 11, browsing was a bit slow, around 1 .. 2 s before a wiki page started loading. It was more like an initial delay: once the first bytes arrived, the rest came quickly and each page, including all images, finished downloading in much less than 1 s without further delays. On December 13 (00.30 .. 2 PM UTC), by contrast, browsing wiki pages was also sometimes much slower than usual, by a few (2 .. 4) seconds. I guess there were traffic spikes.

In practice I noticed real problems only in editing mode, after trying to "Publish" or "Show Preview" changes to an article.

For changes to the above-mentioned wiki article, it usually takes around 1.5 .. 2 seconds to send (POST) the data to the web server and receive the updated page, so experiencing delays of 4 .. 10 .. 20 seconds was quite unusual.
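The difference described above between an initial delay and the actual download time can be measured rather than estimated. Below is a minimal sketch using only the Python standard library. `measure_fetch` is a hypothetical helper, it issues a plain GET (not the edit POST), and the time until the first body byte is only an approximation of the server-side delay.

```python
import time
import urllib.request


def measure_fetch(url, opener=urllib.request.urlopen, timeout=30.0, chunk_size=65536):
    """Return (seconds_to_first_byte, seconds_total, bytes_read) for one GET.

    `opener` is injectable so the function can be exercised without network
    access; by default it is urllib.request.urlopen.
    """
    start = time.perf_counter()
    with opener(url, timeout=timeout) as response:
        # Reading a single byte blocks until the body starts to arrive.
        first = response.read(1)
        first_byte_at = time.perf_counter()
        size = len(first)
        # Drain the rest of the body to time the full download.
        while True:
            chunk = response.read(chunk_size)
            if not chunk:
                break
            size += len(chunk)
    end = time.perf_counter()
    return first_byte_at - start, end - start, size


# Example use (requires network access; the URL is only an illustration):
# ttfb, total, size = measure_fetch("https://en.wikipedia.org/wiki/Main_Page")
# print(f"first byte after {ttfb:.2f}s, {size} bytes in {total:.2f}s")
```

A large first-byte time with a short remaining download would match the "initial delay" pattern reported for December 11.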

BCornwall changed the task status from Open to Stalled. May 1 2023, 8:55 PM
BCornwall subscribed.

Hello!

This is quite an old ticket. We're sorry that this fell through the cracks; the number of tickets we receive can easily overwhelm our small team!

In any case, would someone be willing to tell whether this is still an issue? We have since replaced portions of the stack with other technologies.

As there's been no response, I'm going to close this ticket. Please do re-open it if this is still occurring, as we'd love to fix it!