
"Obama" page on Beta Cluster often responds with 500 or 503
Open, Normal, Public

Description

The Barack Obama page often fails with a 500 or 503 on the beta cluster. When it does respond successfully, it takes a long while.

503

Response

Request from 73.181.4.128 via deployment-cache-text05 deployment-cache-text05, Varnish XID 47912111
Error: 503, Backend fetch failed at Tue, 04 Dec 2018 16:11:26 GMT

Request headers

:authority: en.wikipedia.beta.wmflabs.org
:method: GET
:path: /wiki/Barack_Obama
:scheme: https
accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
accept-encoding: gzip, deflate, br
accept-language: en-US,en;q=0.9,es;q=0.8
cache-control: no-cache
cookie: GeoIP=US:CO:Denver:39.64:-104.91:v4; centralauth_User=Teststephen; centralauth_Token=665ec74188d005d16b67ad882b316d43; enwikiUserID=13929; enwikiUserName=Teststephen; forceHTTPS=true; enwikiSession=2lcfsqgsllg4bdgffetht05f2ita40at; WMF-Last-Access=04-Dec-2018; WMF-Last-Access-Global=04-Dec-2018; centralauth_Session=6517540bfa7994929a6fb367269c938e
pragma: no-cache
upgrade-insecure-requests: 1
user-agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/70.0.3538.77 Chrome/70.0.3538.77 Safari/537.36

Response headers

age: 0
cache-control: private, s-maxage=0, max-age=0, must-revalidate
content-encoding: gzip
content-type: text/html; charset=utf-8
date: Tue, 04 Dec 2018 16:11:26 GMT
server: Varnish
server-timing: cache;desc="int-local"
status: 503
vary: Accept-Encoding
via: 1.1 varnish (Varnish/5.1), 1.1 varnish (Varnish/5.1)
x-analytics: WMF-Last-Access=04-Dec-2018;WMF-Last-Access-Global=04-Dec-2018;https=1
x-cache: deployment-cache-text05 int, deployment-cache-text05 miss
x-cache-status: int-local
x-client-ip: 73.181.4.128
x-varnish: 47912110, 6366645

500

Request headers

:authority: en.m.wikipedia.beta.wmflabs.org
:method: GET
:path: /wiki/Barack_Obama
:scheme: https
accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
accept-encoding: gzip, deflate, br
accept-language: en-US,en;q=0.9,es;q=0.8
cache-control: no-cache
cookie: GeoIP=US:CO:Denver:39.64:-104.91:v4; enwikiUserID=13929; enwikiUserName=Teststephen; centralauth_User=Teststephen; centralauth_Token=665ec74188d005d16b67ad882b316d43; loginnotify_prevlogins=2018-1mfuhjz-l9zoaixm1nkzty7bhnbwk9sa8y2qzth; forceHTTPS=true; WMF-Last-Access-Global=04-Dec-2018; WMF-Last-Access=04-Dec-2018; enwikiSession=771a5abtfs0iv63r0vrua2sbdsjdbeg6; centralauth_Session=443538f13517d189190bb40f6a96cba7
pragma: no-cache
upgrade-insecure-requests: 1
user-agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/70.0.3538.77 Chrome/70.0.3538.77 Safari/537.36

Response headers

age: 0
cache-control: private, s-maxage=0, max-age=0, must-revalidate
date: Tue, 04 Dec 2018 16:21:40 GMT
server: Varnish
server-timing: cache;desc="miss"
status: 500
via: 1.1 varnish (Varnish/5.1)
x-analytics: WMF-Last-Access=04-Dec-2018;WMF-Last-Access-Global=04-Dec-2018;https=1
x-cache: deployment-cache-text05 miss
x-cache-status: miss
x-client-ip: 73.181.4.128
x-varnish: 15058873
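
The two responses above differ in their x-cache header: the 503 lists two entries and the 500 lists one. A minimal sketch for splitting that header into (host, status) pairs so the two cases can be compared side by side. The function name `parse_x_cache` is hypothetical, and the sketch only splits the value; it assumes nothing about what each status means.

```python
def parse_x_cache(value):
    """Split an x-cache header value into (host, status) pairs, in header order."""
    pairs = []
    for entry in value.split(","):
        # Each entry looks like "<cache host> <status>".
        host, _, status = entry.strip().partition(" ")
        pairs.append((host, status))
    return pairs


# Header values copied from the 503 and 500 responses in the description.
print(parse_x_cache("deployment-cache-text05 int, deployment-cache-text05 miss"))
print(parse_x_cache("deployment-cache-text05 miss"))
```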

Event Timeline

Restricted Application added a subscriber: Aklapper. Mar 5 2018, 1:34 PM

The Beta Cluster continues to be very slow. The 503s appear to always come from deployment-cache-text04:

Request from 73.252.38.252 via deployment-cache-text04 deployment-cache-text04, Varnish XID 80272967
Error: 503, Backend fetch failed at Tue, 06 Mar 2018 20:53:05 GMT

This seems to always happen on deployment-cache-text04:

Request from 73.252.38.252 via deployment-cache-text04 deployment-cache-text04, Varnish XID 116922198
Error: 503, Backend fetch failed at Thu, 08 Mar 2018 19:55:45 GMT

https://en.m.wikipedia.beta.wmflabs.org/wiki/Main_Page seems to always work while the Obama article fails:

Request from 73.252.38.252 via deployment-cache-text04 deployment-cache-text04, Varnish XID 117707771
Error: 503, Backend fetch failed at Thu, 08 Mar 2018 19:58:13 GMT

Request from 73.252.38.252 via deployment-cache-text04 deployment-cache-text04, Varnish XID 80022381
Error: 503, Backend fetch failed at Thu, 08 Mar 2018 19:59:27 GMT

It's consistently text04:

Request from 73.252.38.252 via deployment-cache-text04 deployment-cache-text04, Varnish XID 161469920
Error: 503, Backend fetch failed at Wed, 21 Mar 2018 13:33:01 GMT
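
The "consistently text04" observation can be checked mechanically once a few reports have been collected. A minimal sketch, assuming the reports are pasted as plain text in the two-line format quoted in this task. The names `REPORT_RE` and `tally_reports` are hypothetical, and the regular expression is inferred from the samples here, not from Varnish documentation.

```python
import re
from collections import Counter

# Matches the two-line error report format quoted in this task.
REPORT_RE = re.compile(
    r"Request from (?P<client>\S+) via (?P<host>\S+) \S+, Varnish XID (?P<xid>\d+)\s+"
    r"Error: (?P<status>\d{3}), (?P<reason>.+?) at (?P<when>.+? GMT)"
)


def tally_reports(text):
    """Count error reports per (cache host, status) pair."""
    return Counter(
        (m.group("host"), int(m.group("status")))
        for m in REPORT_RE.finditer(text)
    )


# Two of the reports quoted above.
SAMPLE = """\
Request from 73.252.38.252 via deployment-cache-text04 deployment-cache-text04, Varnish XID 80272967
Error: 503, Backend fetch failed at Tue, 06 Mar 2018 20:53:05 GMT

Request from 73.252.38.252 via deployment-cache-text04 deployment-cache-text04, Varnish XID 161469920
Error: 503, Backend fetch failed at Wed, 21 Mar 2018 13:33:01 GMT
"""

print(tally_reports(SAMPLE))
```

A tally with a single key supports the claim that one cache host is involved, though it says nothing about which backend behind that host is failing.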

This is still an issue:

Request from 73.252.38.252 via deployment-cache-text04 deployment-cache-text04, Varnish XID 236099863
Error: 503, Backend fetch failed at Thu, 12 Apr 2018 17:10:35 GMT

All of beta is currently down.

jcrespo added subscribers: Joe, jcrespo. Edited Apr 12 2018, 5:17 PM

Giuseppe mentioned some test stretch patches on beta. It may be unrelated, but I'm adding him so he is aware of the ongoing issues.

thcipriani triaged this task as High priority. Apr 12 2018, 6:18 PM
thcipriani added a subscriber: thcipriani.

Hrm, everything from load.php is failing. I don't know if this is necessarily deployment-cache-text04's problem since IIRC that's just a Varnish machine. The backends sending back 503s could be the real culprit; looking now.

thcipriani lowered the priority of this task from High to Normal. Apr 12 2018, 7:05 PM

Well the deployment-mediawiki-07 backend was the cause of 503s today. I changed the appserver backend in hiera to deployment-mediawiki05 via https://horizon.wikimedia.org/project/prefixpuppet/?tab=prefix_puppet__puppet-deployment-cache-text and beta came back. Changing priority back to normal.

@Joe could you take a look at the 503s when you get a chance?

This is hard to explain. When deployment-cache-text04 used deployment-mediawiki-07 as a backend, this page was coming back with a 503:

https://en.m.wikipedia.beta.wmflabs.org/w/load.php?debug=true&lang=en&modules=ext.cite.styles%7Cmediawiki.hlist%7Cmediawiki.ui.button%2Cicon%7Cskins.minerva.base.reset%2Cstyles%7Cskins.minerva.content.styles%7Cskins.minerva.icons.images%7Cskins.minerva.tablet.styles&only=styles&skin=minerva

But I can curl that page locally and it seems to work fine (contains CSS, reports 200):

root@deployment-mediawiki-07:~# curl -I -H 'Host: en.wikipedia.beta.wmflabs.org' 'http://deployment-mediawiki-07/w/load.php?debug=true&lang=en&modules=ext.cite.styles%7Cmediawiki.hlist%7Cmediawiki.ui.button%2Cicon%7Cskins.minerva.base.reset%2Cstyles%7Cskins.minerva.content.styles%7Cskins.minerva.icons.images%7Cskins.minerva.tablet.styles&only=styles&skin=minerva'
HTTP/1.1 200 OK
Date: Thu, 12 Apr 2018 19:09:40 GMT
Server: deployment-mediawiki-07.deployment-prep.eqiad.wmflabs
X-Powered-By: HHVM/3.18.6-dev
Access-Control-Allow-Origin: *
Vary: Accept-Encoding
Pragma: no-cache
X-Content-Type-Options: nosniff
Cache-control: private, no-cache, must-revalidate
ETag: W/"13pyhf8"
Backend-Timing: D=132643 t=1523560180654502
Content-Type: text/css; charset=utf-8
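
The same direct-to-backend check can be scripted so that it is easy to repeat against each appserver. A minimal sketch, assuming plain HTTP on port 80 as in the curl command above. The function names `build_head_request` and `probe` are hypothetical. Note that this only tests the path from wherever the script runs, so a 200 from the backend itself does not show that the cache host can reach it.

```python
import urllib.error
import urllib.request


def build_head_request(backend, vhost, path):
    """Build a HEAD request aimed at one backend with an explicit Host header."""
    req = urllib.request.Request(f"http://{backend}{path}", method="HEAD")
    req.add_header("Host", vhost)
    return req


def probe(backend, vhost, path, timeout=10):
    """Send the request and return (status, headers). Needs network access to the backend."""
    req = build_head_request(backend, vhost, path)
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status, dict(resp.headers)
    except urllib.error.HTTPError as err:
        # urlopen raises on 4xx/5xx; the status and headers are still useful here.
        return err.code, dict(err.headers)
```

Calling `probe("deployment-mediawiki-07", "en.wikipedia.beta.wmflabs.org", "/w/load.php?...")` from the cache host rather than from the appserver would exercise the same network path that Varnish uses.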

Restored deployment-mediawiki-07 as the appserver backend. It seems the ferm service is having trouble starting on that machine, so the previous Varnish 503s were just Varnish giving up on reaching deployment-mediawiki-07.

Ferm still refuses to start; it complains that the DNS query for 'deployment-prometheus01.deployment-prep.eqiad.wmflabs' failed: NXDOMAIN. Getting it to run once was enough to let requests through iptables.
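
Since a single unresolvable hostname was enough to keep the firewall service from starting, it may help to check ahead of time which hostnames no longer resolve. A minimal sketch using only the standard library. The function name `unresolvable` is hypothetical, and the list of hostnames to check is an assumption; it would have to be taken from the actual firewall rules on the host.

```python
import socket


def unresolvable(hostnames, resolve=socket.getaddrinfo):
    """Return the hostnames that fail DNS resolution.

    `resolve` is injectable so the check can be exercised without network access.
    """
    failed = []
    for name in hostnames:
        try:
            resolve(name, None)
        except socket.gaierror:
            # Covers NXDOMAIN and other resolver failures.
            failed.append(name)
    return failed
```

For example, `unresolvable(["deployment-prometheus01.deployment-prep.eqiad.wmflabs"])` run on the affected appserver would be expected to return that name for as long as the NXDOMAIN condition persists.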

Well the deployment-mediawiki-07 backend was the cause of 503s today. I changed the appserver backend in hiera to deployment-mediawiki05

So TIL we're actually only using one of these appservers as the backend for web requests (currently deployment-mediawiki-07), another for API requests (currently deployment-mediawiki05), and so on. I expected beta to spread web/API requests across these different servers. In particular, I would expect 'cache::app_directors:' to include all hosts (regardless of service). So I take it that deployment-mediawiki04 is unused (not receiving any traffic). Why do we have such instances?

@EddieGP: I'm not sure what changed with the addition of mediawiki07, but I can confirm that mediawiki04 was definitely serving traffic as of Tue/Wed; I had used it to test some HHVM changes.

I think this task is still being worked on but in case it helps, here's another report from the Obama page this morning:

Request from 73.252.38.252 via deployment-cache-text04 deployment-cache-text04, Varnish XID 247943572
Error: 503, Backend fetch failed at Mon, 16 Apr 2018 13:24:12 GMT

...And a page preview request from the Dog page consistently shows (note: deployment-mediawiki04):

https://en.wikipedia.beta.wmflabs.org/api/rest_v1/page/summary/Origin_of_the_domestic_dog
{
  "type": "https://mediawiki.org/wiki/HyperSwitch/errors/internal_http_error",
  "method": "post",
  "detail": "Error: connect EHOSTUNREACH 10.68.19.128:80",
  "uri": "http://deployment-mediawiki04.deployment-prep.eqiad.wmflabs/w/api.php"
}

...And another page preview request from the same page (note: deployment-mediawiki-07):

https://en.wikipedia.beta.wmflabs.org/api/rest_v1/page/summary/Precambrian
{
  "type": "https://mediawiki.org/wiki/HyperSwitch/errors/internal_http_error",
  "method": "post",
  "detail": "504: internal_http_error",
  "uri": "http://deployment-mediawiki-07.deployment-prep.eqiad.wmflabs/w/api.php"
}
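
The two error bodies above point at different backends and fail in different ways. A minimal sketch for pulling the backend host and a rough failure kind out of such a body, assuming only the `detail` and `uri` fields seen in these two samples. The function name `summarize_error` and the labels it returns are hypothetical, not part of any HyperSwitch API.

```python
import json
from urllib.parse import urlsplit


def summarize_error(body):
    """Return (backend host, failure kind) for one error body in the format above."""
    err = json.loads(body)
    host = urlsplit(err["uri"]).hostname
    detail = err["detail"]
    prefix = detail.split(":", 1)[0].strip()
    if "EHOSTUNREACH" in detail:
        kind = "unreachable"  # the connection to the backend never opened
    elif prefix.isdigit():
        kind = "http_" + prefix  # the backend was reached but the request failed
    else:
        kind = "other"
    return host, kind


# Error body copied from the Precambrian summary request above.
sample = """{
  "type": "https://mediawiki.org/wiki/HyperSwitch/errors/internal_http_error",
  "method": "post",
  "detail": "504: internal_http_error",
  "uri": "http://deployment-mediawiki-07.deployment-prep.eqiad.wmflabs/w/api.php"
}"""
print(summarize_error(sample))
```

Separating "unreachable" from HTTP-level failures keeps network or firewall problems apart from errors raised by the application itself.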
Joe added a comment. Apr 16 2018, 1:51 PM

@Niedzielski Interestingly, when requesting the /summary/Precambrian page, I see a successful request to the API cluster, so the error is not a 503 on the part of MediaWiki but rather some error in the API query or in the data.

So I think you're looking at a different type of error, one that has to be investigated within MediaWiki rather than at the appserver level.

Could the summary endpoint issue be a network or caching problem related to this task? I was wondering because it seems like the Node.js service is issuing a request to 10.68.19.128 which fails with EHOSTUNREACH. I've opened T192287 to track this issue separately but please merge it back into this if I'm mistaken. /cc @bearND @Mholloway

@Niedzielski I'll look into this on the mobileapps side today. It's possible there's a problem in the config for the beta cluster. Just to be clear, is the failure in the original case you reported for the "Barack Obama" page or the "Obama" redirect page? Or does it happen in either case?

Thanks @Mholloway. This task is specific to the Barack Obama page. The page summary issue is in T192287 and occurs on certain summary responses (all happened to be found on page previews linked from the Dog page).

Here's the latest report:

Request from 73.252.38.252 via deployment-cache-text04 deployment-cache-text04, Varnish XID 254949111
Error: 503, Backend fetch failed at Wed, 18 Apr 2018 13:07:40 GMT
Krinkle renamed this task from Beta cluster Obama page often responds with 503 to "Obama" page on Beta Cluster often responds with 503. Apr 18 2018, 3:41 PM
Niedzielski renamed this task from "Obama" page on Beta Cluster often responds with 503 to "Obama" page on Beta Cluster often responds with 500 or 503. Dec 4 2018, 4:24 PM
Niedzielski updated the task description.
Niedzielski updated the task description.