
503 errors for several Wikipedia pages
Closed, Resolved, Public

Description

I keep getting 503 errors:

I tried going to my watchlist on enwiki, and saw the following error
Special:Watchlist - Request from [redacted] via cp1089 cp1089, Varnish XID 928187350 Error: 503, Backend fetch failed at...

Furthermore, it appears that user script imports are randomly not working, instead logging
GET [script url] net::ERR_ABORTED 503

Potentially related: going to the main page of the Simple English Wikipedia alternates between working and showing an error, such as
...via cp1089 cp1089, Varnish XID 921831419 Error: 503, Backend fetch failed at...

Event Timeline

DannyS712 updated the task description.

I was getting this just now (for watchlists and any search I did) for a couple of minutes, but it is working again now.

Request from [x] via cp1089 cp1089, Varnish XID 1067058835
Error: 503, Backend fetch failed at Fri, 03 May 2019 05:13:42 GMT

Aklapper renamed this task from Multiple wikipedia errors to 503 errors for several Wikipedia pages. May 3 2019, 9:43 AM
Aklapper edited projects, added WMF-General-or-Unknown; removed MediaWiki-General.

The errors are back - enwiki's main page just gave me "via cp1089 cp1089, Varnish XID 805012932 Error: 503, Backend fetch failed"

Dzahn triaged this task as High priority. May 6 2019, 10:18 PM
QEDK raised the priority of this task from High to Needs Triage. May 19 2019, 8:01 AM
QEDK subscribed.

Widespread occurrence, VPT threads and lots of users affected. No point making a "Me too" here but is there an estimate on when it will be fixed?

@Marostegui did a restart on cp1081 at 2019-05-19T05:09 which helped. Most recently I'm getting the issue from cp1087.

I just came to report that this was happening again, eg: "via cp1087 cp1087, Varnish XID 381950846"

The problem is happening on Chinese Wikipedia, mainly when saving pages and sometimes when accessing pages. Some visual features (including the interface) sometimes cannot be rendered properly.
Update: From what I observe, the problem I was facing has eased for the moment.

@jijiki did a restart on cp1087 at 2019-09-19T08:13 which should help for now.

The problem is happening in Wikidata

The problem is happening in Indonesian Wikipedia

@jijiki No, it is alright now in id.wikipedia.

CDanis claimed this task.

Thanks! We now believe this is resolved.

@CDanis just got another one trying to save on meta wiki:
... via cp1081 cp1081, Varnish XID 295561815 Error: 503, Backend fetch failed at Sat, 25 May 2019 08:08:27 GMT

... cp1075 cp1075, Varnish XID 1035109169
Error: 503, Backend fetch failed at Mon, 03 Jun 2019 07:56:08 GMT

Just a question: is this intermittent behaviour expected, or is there something actually being fixed every time we bring it up? (No snark, genuine question.)

Just got it again

via cp1075 cp1075, Varnish XID 1031998560
Error: 503, Backend fetch failed at Mon, 03 Jun 2019 08:11:59 GMT

The traffic team restarted two Varnish backends, so the issue should be fixed now. Thanks a lot for the reports, and please let us know if anything strange still happens :)

In T222418#5229410, @Ankit-Maity wrote:

Just a question: is this intermittent behaviour expected, or is there something actually being fixed every time we bring it up? (No snark, genuine question.)

Genuine answer: I wasn't involved in this, but to the best of my knowledge the incidents above are not (very) related to each other, except in the symptoms that readers or editors experience: an increase in 503 errors, which can have many different causes, normally a problem in the traffic/cache serving/connectivity/upper layers of the infrastructure (i.e. not the application).

Some can be due to maintenance or software/hardware problems, but others can be due to external triggers. For example (not a real example), if one person, on purpose or by accident, generates an incredible amount of load (e.g. more than all the other millions of users at the same time), some other users, in some locations, can experience instability for some time. Work to minimize those cases is always ongoing, but new problems appear all the time that require new solutions and technologies, and sometimes those take a long time to implement. Allow me to be vague so as not to tip people off on how to abuse those weaknesses, but as soon as incidents are solved, analyzed, and fixes or protections are put in place, we are pretty transparent: if you are interested in the details, they are published in subpages of https://wikitech.wikimedia.org/wiki/Incident_documentation . We DO understand that these errors are annoying, and we try to avoid problems at all costs.

Reporting these issues is actually helpful (as said above, the subsequent "+1"s not so much) for understanding the impact and being able to respond quickly, beyond the alerts and monitoring we already have in place. I would suggest, however, that a separate ticket each time would probably be preferred, as these issues (503 errors) are normally not related to each other. Software issues ("XXX error") are sometimes related if they return the same errors, while infrastructure problems all tend to share a few error codes.

Hope that simplified explanation helps.

That explanation certainly helps

In T222418#5229410, @Ankit-Maity wrote:

Just a question: is this intermittent behaviour expected, or is there something actually being fixed every time we bring it up? (No snark, genuine question.)

Genuine answer: I wasn't involved in this, but to the best of my knowledge the incidents above are not (very) related to each other, except in the symptoms that readers or editors experience: an increase in 503 errors, which can have many different causes, normally a problem in the traffic/cache serving/connectivity/upper layers of the infrastructure (i.e. not the application).
...
Reporting these issues is actually helpful (as said above, the subsequent "+1"s not so much) for understanding the impact and being able to respond quickly, beyond the alerts and monitoring we already have in place. I would suggest, however, that a separate ticket each time would probably be preferred, as these issues (503 errors) are normally not related to each other. Software issues ("XXX error") are sometimes related if they return the same errors, while infrastructure problems all tend to share a few error codes.

It's good to know that there are things to do when issues occur and this would be a good pointer for any more rushes to the village pump the next time we have (touch wood) an outage.

For extra context, 503 errors can also happen randomly: the current stats say that 99.999512% of requests are successful, and getting to 100% is almost impossible. That means that out of roughly 200 000 requests, one may fail. The real issue arises when the percentage drops (e.g. to 99.8% availability or lower) and people suffer repeated errors when reading or editing.
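To make the arithmetic above concrete, here is a minimal Python sketch that converts a success percentage into an approximate "one failure per N requests" figure. The function names are hypothetical and chosen for illustration only; the two percentages are the ones quoted in this comment.

```python
def failure_rate(success_percent: float) -> float:
    """Return the fraction of requests that fail for a given success percentage."""
    return (100.0 - success_percent) / 100.0


def requests_per_failure(success_percent: float) -> float:
    """Return how many requests are served, on average, per failed request."""
    return 1.0 / failure_rate(success_percent)


# The two availability figures mentioned in the comment above.
for pct in (99.999512, 99.8):
    n = requests_per_failure(pct)
    print(f"{pct}% success -> about 1 failure per {n:,.0f} requests")
```

At 99.999512% this works out to roughly one failure in a bit over 200 000 requests, while at 99.8% it is about one failure in 500 requests, which is why users start to notice repeated errors at that level.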

There is a team (Site Reliability Engineering) entirely dedicated to keeping the "nines" high.