503 errors for several Wikipedia pages
Closed, ResolvedPublic

Assigned To

Authored By

	DannyS712
	May 3 2019, 5:07 AM

Description

I keep getting 503 errors:

I tried going to my watchlist on enwiki, and saw the following error
Special:Watchlist - Request from [redacted] via cp1089 cp1089, Varnish XID 928187350 Error: 503, Backend fetch failed at...

Furthermore, it appears that user script imports are randomly not working, instead logging
GET [script url] net::ERR_ABORTED 503

Potentially related, going to the main page of the simple english wikipedia alternates between working and an error, such as
...via cp1089 cp1089, Varnish XID 921831419 Error: 503, Backend fetch failed at...

Related Objects

Duplicates Merged Here: T223762: Try to visit some pages on zhwikiversity but get a 503 error
T223763: HTTP 503 when viewing some JavaScript page with action=raw&ctype=text/javascript

Event Timeline

DannyS712 created this task. May 3 2019, 5:07 AM

Restricted Application added a subscriber: Aklapper. May 3 2019, 5:07 AM

DannyS712 updated the task description. May 3 2019, 5:09 AM

DannyS712 updated the task description.

DannyS712 updated the task description. May 3 2019, 5:12 AM

I was getting this just now (for watchlists and any search I did) for a couple of minutes, but it is working again now.

Request from [x] via cp1089 cp1089, Varnish XID 1067058835
Error: 503, Backend fetch failed at Fri, 03 May 2019 05:13:42 GMT
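For anyone collating the reports pasted in this task, here is a minimal Python sketch that pulls the cache host, Varnish XID, status code and timestamp out of one pasted message, and tallies reports per cache host. The regular expression is modelled only on the messages quoted here, so the real error page may not always match it. The names `REPORT_RE`, `parse_report` and `hosts_by_report_count` are hypothetical helpers and not part of any Wikimedia tooling.

```python
import re
from collections import Counter
from email.utils import parsedate_to_datetime

# Assumption: the pattern follows the messages quoted in this task only.
# The "Error: ..." part and the timestamp are optional because several
# reports here are truncated.
REPORT_RE = re.compile(
    r"\b(?P<host>cp\d+)\b.*?Varnish XID (?P<xid>\d+)"
    r"(?:\s+Error: (?P<status>\d{3}), Backend fetch failed"
    r"(?: at (?P<when>\w{3}, \d{2} \w{3} \d{4} \d{2}:\d{2}:\d{2} GMT))?)?",
    re.DOTALL,
)


def parse_report(text):
    """Return the fields of one pasted error message, or None if no match."""
    match = REPORT_RE.search(text)
    if match is None:
        return None
    status = match.group("status")
    when = match.group("when")
    return {
        "host": match.group("host"),
        "xid": int(match.group("xid")),
        "status": int(status) if status else None,
        # parsedate_to_datetime understands the HTTP-style date in the message.
        "when": parsedate_to_datetime(when) if when else None,
    }


def hosts_by_report_count(reports):
    """Count how many of the pasted reports name each cache host."""
    parsed = (parse_report(report) for report in reports)
    return Counter(item["host"] for item in parsed if item is not None)
```

Feeding it the messages from this thread would show at a glance which cache hosts (cp1089, cp1081, cp1087, cp1075) the reports cluster on. That is only a rough signal from user reports and no substitute for the server-side dashboards.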

Taiwania_Justo subscribed. May 3 2019, 5:49 AM

Krenair subscribed. May 3 2019, 5:50 AM

abi_ subscribed. May 3 2019, 5:58 AM

Aklapper renamed this task from Multiple wikipedia errors to 503 errors for several Wikipedia pages. May 3 2019, 9:43 AM

Aklapper edited projects, added WMF-General-or-Unknown; removed MediaWiki-General.

The errors are back - enwiki's main page just gave me "via cp1089 cp1089, Varnish XID 805012932 Error: 503, Backend fetch failed"

Samwilson added a project: SRE. May 4 2019, 11:33 PM

Krenair added a project: Traffic. May 4 2019, 11:33 PM

possible continuation of https://wikitech.wikimedia.org/wiki/Incident_documentation/20190503-varnish

Krenair added subscribers: Joe, • ema. May 4 2019, 11:59 PM

Dzahn triaged this task as High priority. May 6 2019, 10:18 PM

Widespread occurrence, VPT threads and lots of users affected. No point making a "Me too" here but is there an estimate on when it will be fixed?

@Marostegui did a restart on cp1081 at 2019-05-19T05:09 which helped. Most recently I'm getting the issue from cp1087.

I just came to report that this was happening again, eg: "via cp1087 cp1087, Varnish XID 381950846"

Capankajsmilyo subscribed. May 19 2019, 8:08 AM

JJMC89 merged a task: T223763: HTTP 503 when viewing some JavaScript page with action=raw&ctype=text/javascript. May 19 2019, 8:11 AM

JJMC89 merged a task: T223762: Try to visit some pages on zhwikiversity but get a 503 error.

JJMC89 added subscribers: Xiplus, Wang_Qiliang, Stang.

JJMC89 added a subscriber: Ericliu1912.

The problem is happening on Chinese Wikipedia, mainly when saving pages and sometimes when accessing pages. Some visual features (including the interface) sometimes cannot be rendered properly.
Update: From what I observe, the problem has eased for the moment.

QEDK merged a task: T223763: HTTP 503 when viewing some JavaScript page with action=raw&ctype=text/javascript. May 19 2019, 8:14 AM

QEDK added a subscriber: RazeSoldier.

@jijiki did a restart on cp1087 at 2019-05-19T08:13 which should help for now.

The problem is happening in Wikidata

The problem is happening in Indonesian Wikipedia

@ReaperDawn @GerardM are you still getting 503s?

For posterity:

https://grafana.wikimedia.org/d/000000352/varnish-failed-fetches?orgId=1&from=1558225397000&to=1558268597000&var-datasource=eqiad%20prometheus%2Fops&var-cache_type=text&var-server=All&var-layer=backend

https://logstash.wikimedia.org/goto/94066e323078ed2e1a0ea6b24f801361

cp1081 was restarted at 05:09:48 UTC
cp1087 was restarted at 08:17:41 UTC

@jijiki No, it is alright now in id.wikipedia.

Thanks! We now believe this is resolved.

QEDK awarded a token. May 19 2019, 1:33 PM

Wang_Qiliang awarded a token. May 19 2019, 2:04 PM

@CDanis just got another one trying to save on meta wiki:
... via cp1081 cp1081, Varnish XID 295561815 Error: 503, Backend fetch failed at Sat, 25 May 2019 08:08:27 GMT

... cp1075 cp1075, Varnish XID 1035109169
Error: 503, Backend fetch failed at Mon, 03 Jun 2019 07:56:08 GMT

Just a question: is this intermittent behaviour expected, or is something actually being fixed every time we bring it up? (No snark, genuine question.)

QEDK rescinded a token. Jun 3 2019, 8:01 AM

Just got it again

via cp1075 cp1075, Varnish XID 1031998560
Error: 503, Backend fetch failed at Mon, 03 Jun 2019 08:11:59 GMT

The traffic team restarted two Varnish backends, the issue should be fixed now. Thanks a lot for the reports, please let us know if anything strange still happens :)

In T222418#5229410, @Ankit-Maity wrote:

Just a question: is this intermittent behaviour expected, or is something actually being fixed every time we bring it up? (No snark, genuine question.)

Genuine answer: I wasn't involved in this, but to the best of my knowledge the incidents above are not (very) related to each other, except in the symptoms that readers or editors suffer: an increase in 503 errors, which can have many different causes, normally a problem in the traffic/cache serving/connectivity/upper parts of the infrastructure (i.e. not the application).

Some can be due to maintenance or software/hardware problems, but others can be due to external triggers. For example (not a real example), if one person, on purpose or by accident, generates an incredible amount of load (e.g. more than all the other millions of users at the same time), some other users, in some locations, can experience instability for some time. Work to minimize those cases is always ongoing, but new problems appear all the time that require new solutions and technologies, and sometimes it takes a long time to implement those. Allow me to be vague so as not to tip people off on how to abuse those weaknesses, but as soon as incidents are solved, analyzed and fixed, or protections are put in place, we are pretty transparent, and (if you are interested in the details) they are published as subpages of: https://wikitech.wikimedia.org/wiki/Incident_documentation . We DO understand those are indeed annoying and try to avoid problems at all costs.

Reporting these issues is actually helpful (as said above, the subsequent "+1"s not so much) to understand the impact and to be able to respond quickly, beyond the alerts and monitoring we already have in place. I would suggest, however, that a separate ticket each time would probably be preferred, as these (503 errors) are normally issues not related to each other. Software issues ("XXX error") are sometimes related if they return the same errors, while infrastructure problems all tend to share a few error codes.

Hope that simplified explanation helps.

• ema moved this task from Backlog to Caching on the Traffic board. Jun 3 2019, 3:11 PM

That explanation certainly helps

In T222418#5230418, @jcrespo wrote:

In T222418#5229410, @Ankit-Maity wrote:

Just a question: is this intermittent behaviour expected, or is something actually being fixed every time we bring it up? (No snark, genuine question.)

Genuine answer: I wasn't involved in this, but to the best of my knowledge the incidents above are not (very) related to each other, except in the symptoms that readers or editors suffer: an increase in 503 errors, which can have many different causes, normally a problem in the traffic/cache serving/connectivity/upper parts of the infrastructure (i.e. not the application).
...
Reporting these issues is actually helpful (as said above, the subsequent "+1"s not so much) to understand the impact and to be able to respond quickly, beyond the alerts and monitoring we already have in place. I would suggest, however, that a separate ticket each time would probably be preferred, as these (503 errors) are normally issues not related to each other. Software issues ("XXX error") are sometimes related if they return the same errors, while infrastructure problems all tend to share a few error codes.

It's good to know that there are things to do when issues occur and this would be a good pointer for any more rushes to the village pump the next time we have (touch wood) an outage.

For extra context, 503 errors can also happen randomly: the current stats say that 99.999512% of requests are successful, and getting to 100% is almost impossible. That means that out of roughly 200 000 requests, one may fail. The real issue arises when the percentage drops (e.g. 99.8% availability and lower) and people suffer repeated errors when reading or editing.
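To make that arithmetic concrete, here is a small Python sketch that converts an availability percentage into "one failure per N requests". The function name `requests_per_failure` is made up for illustration, and the percentages are simply the two figures quoted above. It uses `Decimal` so that the tiny failure fraction is not distorted by floating-point rounding.

```python
from decimal import Decimal


def requests_per_failure(availability_percent: str) -> Decimal:
    """Average number of requests per failed request at a given availability.

    availability_percent is passed as a string (e.g. "99.999512") so that
    Decimal can represent it exactly.
    """
    failure_fraction = 1 - Decimal(availability_percent) / 100
    if failure_fraction <= 0:
        raise ValueError("availability must be below 100%")
    return 1 / failure_fraction


# Figures quoted in this task: normal operation vs. a degraded period.
normal = requests_per_failure("99.999512")
degraded = requests_per_failure("99.8")
print(f"normal:   about 1 failure in {normal:,.0f} requests")
print(f"degraded: about 1 failure in {degraded:,.0f} requests")
```

At 99.999512% the failure fraction is 0.00000488, which works out to roughly one failed request in 205 000, consistent with the "one in 200 000" figure. At 99.8% it is one in 500, which is why users start seeing repeated errors.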

There is a team (Site Reliability Engineering) entirely dedicated to keeping the "nines" high.

DannyS712 moved this task from Unsorted to Resolved tasks (others) on the User-DannyS712 board. Jun 4 2019, 3:29 PM

Stang unsubscribed. Nov 3 2021, 3:03 AM
