
Mobileapps is often throttled on codfw
Closed, Resolved · Public

Assigned To
Authored By
akosiaris
Apr 5 2022, 3:12 PM

Description

While mobileapps in eqiad has no issues, mobileapps in codfw (where the majority of processing happens, because restbase-async and changeprop are active there by design) is consistently, and at times severely, throttled.

30 day graphs show:

avg[1]

image.png (1×1 px, 316 KB)
max[2]
image.png (1×1 px, 354 KB)

Increasing the CPU limits would quite probably improve latencies, and possibly error rates for the more expensive endpoints too.

[1] https://grafana.wikimedia.org/d/5CmeRcnMz/mobileapps?orgId=1&from=now-30d&to=now&viewPanel=78
[2] https://grafana.wikimedia.org/d/5CmeRcnMz/mobileapps?orgId=1&from=now-30d&to=now&viewPanel=94
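For anyone wanting to reason about what "throttled" means here: the exact query behind the Grafana panels is not given in this task, so the following is only a minimal sketch under the assumption that throttling is derived from the kernel's CFS bandwidth counters (`nr_periods` and `nr_throttled` in a container's cgroup `cpu.stat`). The function names and the sample counter values are hypothetical and purely illustrative.

```python
def parse_cpu_stat(text: str) -> dict[str, int]:
    """Parse the 'key value' lines of a cgroup cpu.stat file into a dict."""
    stats: dict[str, int] = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def throttled_fraction(before: dict[str, int], after: dict[str, int]) -> float:
    """Fraction of CFS enforcement periods in which the cgroup was throttled,
    computed between two snapshots of the cumulative counters."""
    periods = after["nr_periods"] - before["nr_periods"]
    throttled = after["nr_throttled"] - before["nr_throttled"]
    if periods <= 0:
        # No enforcement periods elapsed between the snapshots.
        return 0.0
    return throttled / periods


if __name__ == "__main__":
    # Hypothetical snapshots, NOT real mobileapps data.
    snapshot_a = parse_cpu_stat("nr_periods 1000\nnr_throttled 100\n")
    snapshot_b = parse_cpu_stat("nr_periods 2000\nnr_throttled 350\n")
    print(f"{throttled_fraction(snapshot_a, snapshot_b):.0%}")  # prints 25%
```

A persistently high fraction suggests the container regularly exhausts its CPU quota within a period, which is the situation raising the CPU limit is meant to relieve.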

Event Timeline


Change 777369 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/deployment-charts@master] mobileapps: Increase CPU limits

https://gerrit.wikimedia.org/r/777369

Change 777369 merged by jenkins-bot:

[operations/deployment-charts@master] mobileapps: Increase CPU limits

https://gerrit.wikimedia.org/r/777369

After the patch was merged and deployed, we have happier graphs!

Before

avg

image.png (1×1 px, 316 KB)
max
image.png (1×1 px, 354 KB)

After

avg

image.png (378×611 px, 24 KB)
max
image.png (386×612 px, 29 KB)

And there are signs, correlating with the deployment, that errors may have decreased too

image.png (1×1 px, 96 KB)

As well as a decrease in some latencies

image.png (344×807 px, 64 KB)
(arrows pointing to quantiles that seem to have dropped)

For the latencies specifically, it's a bit early to tell with a high degree of confidence, but it is looking good. I'll have another look in 24h or so.

And now that enough time has passed, the rate of errors is indeed lower. I've arbitrarily drawn a couple of lines at around the 90+th percentile to showcase it easily. There is also less variation, so this isn't exactly scientific, but I'd say good enough.

image.png (569×2 px, 68 KB)

We are definitely in a better place.

Latency-wise, the p99 doesn't show major improvements, although there are some notable ones like /page/summary

image.png (563×2 px, 303 KB)

The major changes are in p50

image.png (566×2 px, 341 KB)
where we see /page/talk latency drop very noticeably, by ~65%. The rest of the endpoints have minor changes, if any at all.

and p90

image.png (558×2 px, 432 KB)
where /page/summary and /page/talk have dropped from a consistently high latency to a more varied distribution with a lower average over time. The majority of the other endpoints see noticeable changes, albeit less pronounced.

I am resolving this as a followup from https://wikitech.wikimedia.org/wiki/Incident_documentation/2022-03-27_api. This will decrease errors from mobileapps in the future, and responses will be given to changeprop more promptly, lowering the number of retries.