Investigate why mobileapps in k8s "/{domain}/v1/data/css/mobile/site" endpoint takes way longer than on scb to complete
Closed, Resolved, Public

Description

We got the following alert:

PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29

This is consistent with https://grafana.wikimedia.org/d/5CmeRcnMz/mobileapps?panelId=20&fullscreen&orgId=1&refresh=5m&from=now-1h&to=now&var-dc=codfw%20prometheus%2Fk8s&var-service=mobileapps&var-container_name=All, which shows a suspiciously flat line for --domain_v1_data_css_mobile_site. The quantiles aren't much better (https://grafana.wikimedia.org/d/5CmeRcnMz/mobileapps?panelId=37&fullscreen&orgId=1&refresh=5m&from=now-1h&to=now&var-dc=codfw%20prometheus%2Fk8s&var-service=mobileapps&var-container_name=All): again suspiciously flat.

This is indicative of a timeout of some sort; however, it's still unclear why this is happening. The migration was paused and the percentage of traffic reaching k8s was dropped back to 10%. Even if those requests fail, restbase will retry, and the retry will probably be sent to an scb host, which will reply, and then the response will be cached.
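
To illustrate why a flat latency line points at a timeout: when nearly every request runs until it is cut off, the median and the high quantiles all collapse onto the same value. Below is a minimal sketch of that check. The function name, the 5% tolerance, and the 120 s timeout in the usage lines are assumptions made for illustration, not values taken from the dashboards or the service config.

```python
from statistics import quantiles


def looks_like_timeout(latencies_ms, timeout_ms, tolerance=0.05):
    """Return True when both the median and the p99 sit within
    `tolerance` (as a fraction) of the given timeout value.

    `timeout_ms` is a hypothetical client or server timeout.
    """
    # quantiles(..., n=100) returns 99 cut points; index 49 is p50, index 98 is p99.
    cuts = quantiles(latencies_ms, n=100, method="inclusive")
    p50, p99 = cuts[49], cuts[98]
    band = timeout_ms * tolerance
    return all(abs(p - timeout_ms) <= band for p in (p50, p99))


# Every sample is pinned at the (assumed) 120 s timeout: the flat-line case.
pinned = looks_like_timeout([120000, 119990, 120010, 120005], timeout_ms=120000)

# Mostly fast responses with one slow outlier: not the flat-line case.
healthy = looks_like_timeout([200, 250, 300, 280, 120000], timeout_ms=120000)
```

A healthy endpoint normally shows a clear gap between the median and the p99, so a check like this only fires when the whole distribution is stuck at the cutoff.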

Related Objects

Status | Assigned
Stalled | None
Resolved | None
Resolved | akosiaris
Resolved | Jdforrester-WMF
Resolved | Jdforrester-WMF
Resolved | Reedy
Resolved | Reedy
Resolved | Bawolff
Resolved | Anomie
Resolved | Bawolff
Resolved | Bawolff
Resolved | Legoktm
Resolved | Lucas_Werkmeister_WMDE
Resolved | Bawolff
Resolved | sbassett
Resolved | sbassett
Resolved | Jdforrester-WMF
Resolved | sbassett
Resolved | sbassett
Resolved | Reedy
Resolved | Reedy
Resolved | Jdforrester-WMF
Resolved | Reedy
Resolved | Reedy
Resolved | Reedy
Resolved | Jdforrester-WMF
Resolved | Jdforrester-WMF
Resolved | Reedy
Resolved | Reedy
Resolved | Reedy
Resolved | Jdforrester-WMF
Resolved | hashar
Resolved | hashar
Resolved | Jdforrester-WMF
Resolved | hashar
Declined | MoritzMuehlenhoff
Invalid | thcipriani
Resolved | mmodell
Resolved | hashar
Resolved | Joe
Resolved | JMeybohm
Resolved | JMeybohm
Duplicate | Dzahn
Declined | Dzahn
Resolved | Jdforrester-WMF
Open | None
Resolved | akosiaris
Resolved | Jdforrester-WMF
Resolved | akosiaris
Resolved | Mholloway

Event Timeline

I believe the culprit is this line: https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/services/mobileapps/+/refs/heads/master/lib/css.js#15 which is making a request to the public MediaWiki load.php endpoint rather than requesting from an internal appserver as it should. Will push a fix shortly.
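
For anyone following along, here is a rough sketch of the request-template idea: the service sends the load.php request to an internal endpoint and carries the wiki's domain in the Host header, instead of building a URL on the public domain. Everything in the block is hypothetical and only illustrates the pattern: the internal hostname, the template layout, the placeholder syntax, and the helper name. The actual mobileapps config syntax may differ.

```python
from urllib.parse import urlencode

# Hypothetical template: an internal endpoint (made-up hostname) plus
# placeholders that are filled in per request.
LOAD_PHP_TEMPLATE = {
    "method": "get",
    "uri": "http://appserver.internal.example/w/load.php",
    "headers": {"host": "{domain}"},
    "query": {"modules": "{modules}", "only": "styles"},
}


def expand_template(template, **params):
    """Fill `{name}` placeholders in the template's headers and query."""
    def fill(mapping):
        return {key: value.format(**params) for key, value in mapping.items()}

    query = fill(template.get("query", {}))
    return {
        "method": template["method"].upper(),
        "url": template["uri"] + ("?" + urlencode(query) if query else ""),
        "headers": fill(template.get("headers", {})),
    }


# The module name is illustrative only.
request = expand_template(
    LOAD_PHP_TEMPLATE, domain="en.wikipedia.org", modules="site.styles"
)
```

The point of the indirection is that the target host lives in configuration, so each environment can point the same code at whatever endpoint it is allowed to reach.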

I'll have a look at the networking part. It's quite possible that this is a failure to talk to the edge caches, as we forbid that in kubernetes.
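
One way to tell a blocked path from a slow upstream is a plain TCP probe with a short timeout, run from inside the pod. Below is a minimal sketch; the function name and the 3-second default are assumptions. If egress is silently dropped, the connection attempt would be expected to hang until the timeout rather than fail fast, which would match the alert above.

```python
import socket


def probe(host, port, timeout_s=3.0):
    """Try to open a TCP connection and report what happened."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return "connected"
    except socket.timeout:
        # No answer at all within timeout_s (e.g. packets dropped on the way).
        return "timed out"
    except OSError as exc:
        # Actively refused, unreachable, name resolution failure, etc.
        return f"failed: {exc.__class__.__name__}"
```

For example, `probe("en.wikipedia.org", 443)` from a host with open egress should return "connected", while "timed out" from inside a pod would suggest the traffic is being filtered.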

Wow, this was quick! Thanks, @Mholloway!

Change 613213 had a related patch set uploaded (by Mholloway; owner: Michael Holloway):
[mediawiki/services/mobileapps@master] Site-CSS: Fix load.php request to use request templates

https://gerrit.wikimedia.org/r/613213

Change 613237 had a related patch set uploaded (by Mholloway; owner: Michael Holloway):
[operations/deployment-charts@master] mobileapps: add request template for load.php requests to config

https://gerrit.wikimedia.org/r/613237

Change 613237 merged by jenkins-bot:
[operations/deployment-charts@master] mobileapps: add request template for load.php requests to config

https://gerrit.wikimedia.org/r/613237

mobileapps chart change for the load.php request template has been deployed.

Change 613645 had a related patch set uploaded (by Mholloway; owner: Michael Holloway):
[operations/puppet@production] Add mw_resource_loader_uri to Node.js service config vars

https://gerrit.wikimedia.org/r/613645

Change 613646 had a related patch set uploaded (by Mholloway; owner: Michael Holloway):
[mediawiki/services/mobileapps/deploy@master] Add request template for load.php

https://gerrit.wikimedia.org/r/613646

Change 613213 merged by jenkins-bot:
[mediawiki/services/mobileapps@master] Site-CSS: Fix load.php request to use request templates

https://gerrit.wikimedia.org/r/613213

Change 613645 merged by Alexandros Kosiaris:
[operations/puppet@production] Add mw_resource_loader_uri to Node.js service config vars

https://gerrit.wikimedia.org/r/613645

Change 613646 merged by jenkins-bot:
[mediawiki/services/mobileapps/deploy@master] Add request template for load.php

https://gerrit.wikimedia.org/r/613646

Change 614754 had a related patch set uploaded (by Mholloway; owner: Michael Holloway):
[operations/deployment-charts@master] Update mobileapps to 2020-07-17-172246-production

https://gerrit.wikimedia.org/r/614754

Change 614754 merged by jenkins-bot:
[operations/deployment-charts@master] Update mobileapps to 2020-07-17-172246-production

https://gerrit.wikimedia.org/r/614754

The fix has been rolled out and /data/css/mobile/site response times are now looking much more reasonable (~200-300 ms).

Mholloway renamed this task from Investigate why mobileapps in k8s "/{domain}/v1/data/css/mobile/base" endpoint takes way longer than on scb to complete to Investigate why mobileapps in k8s "/{domain}/v1/data/css/mobile/site" endpoint takes way longer than on scb to complete. Jul 20 2020, 3:14 PM