
Raise mw-api-int replicas for increased load from mobileapps
Closed, Resolved, Public

Description

When the feature to use core page HTML for mobileapps was turned on, we saw requests to mw-api-int almost double.


While some of this increase is probably due to the missing headers causing pregenerated content to be ignored and requested directly from mw-api-int, we should be prepared for a similar increase in requests.

mw-api-int currently has 160 replicas (907 CPUs) and we don't have the available capacity to double that at the moment.

I propose we raise the number of replicas by 50% to 240, and accelerate T351074 (Move servers from the appserver/api cluster to kubernetes) so we can do some emergency scaling if it's not enough. Releasing the hold from T355544 (Migrate hosts from codfw row A/B ASW to new LSW devices) seems necessary, even if it means re-imaging these servers one more time, which we would need to do for renaming anyway.
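As a rough sanity check on the figures above, here is a minimal sketch of the CPU cost of the proposed bump. It assumes the 907 CPUs are spread evenly across the 160 replicas and that per-replica CPU stays the same after scaling; the function name is purely illustrative.

```python
def scaled_cpu_footprint(replicas, total_cpus, increase_pct):
    """Return (new_replicas, new_total_cpus, extra_cpus) after scaling.

    Assumes every replica requests the same share of total_cpus.
    """
    new_replicas = replicas * (100 + increase_pct) // 100
    # Scale the CPU total in proportion to the replica count.
    new_total = total_cpus * new_replicas / replicas
    return new_replicas, new_total, new_total - total_cpus


# Proposed change: +50% replicas.
new_replicas, new_total, extra = scaled_cpu_footprint(160, 907, 50)
print(f"{new_replicas} replicas, {new_total:.1f} CPUs total, {extra:.1f} extra")

# For comparison: doubling, which we do not have capacity for.
dbl_replicas, dbl_total, dbl_extra = scaled_cpu_footprint(160, 907, 100)
print(f"{dbl_replicas} replicas, {dbl_total:.1f} CPUs total, {dbl_extra:.1f} extra")
```

Under those assumptions, the 50% increase needs roughly half the additional CPUs that doubling would.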

Event Timeline

Clement_Goubert created this task.

Content-Transform-Team can you give us an estimate of the increase in requests a successful rollout would have?

The best metric I can think of is the number of outgoing requests from PCS (the mobileapps service) to RESTBase, as provided by Envoy telemetry:
mobileapps->restbase-for-services->200
https://grafana.wikimedia.org/goto/jy38wstSz?orgId=1

More specifically, this looks like roughly 600-900 req/s of additional traffic.


Thanks for the info, so roughly +30% at the high end. I think we should do the 50% increase and adjust from there.
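To make the headroom explicit, here is a small sketch comparing the estimated +30% against the proposed +50%. It assumes the replica count needs to grow linearly with request volume, which is a simplification; the variable names are illustrative.

```python
import math

CURRENT_REPLICAS = 160
ESTIMATED_INCREASE_PCT = 30   # high end of the estimate
PROPOSED_INCREASE_PCT = 50

# Replicas needed if capacity must grow in step with traffic.
needed = math.ceil(CURRENT_REPLICAS * (100 + ESTIMATED_INCREASE_PCT) / 100)
proposed = CURRENT_REPLICAS * (100 + PROPOSED_INCREASE_PCT) // 100
headroom = proposed - needed

print(f"needed: {needed}, proposed: {proposed}, headroom: {headroom} replicas")
```

The difference is the margin available for adjusting replicas back down once real consumption is known.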

@Clement_Goubert We fixed the 2 blocking issues and we are ready to try another switchover of PCS outgoing traffic. Should we wait for the workers increase before deploying?

I'll prepare the patch. Going to 240 replicas would put us at around 700 CPUs available on the wikikube cluster, which is a little less than I like but should be ok.

Just in case it ends up causing problems with deployments, @hnowlan or @kamila could you prepare to reimage ~6 appservers or api_appservers in both codfw and eqiad please? That should cover most of the capacity increase.

Change 1007584 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] mw-api-int: Increase replicas to 240 total

https://gerrit.wikimedia.org/r/1007584

Change 1007584 merged by jenkins-bot:

[operations/deployment-charts@master] mw-api-int: Increase replicas to 240 total

https://gerrit.wikimedia.org/r/1007584

Change 1007893 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] mw-api-int: Hold eqiad back on resources

https://gerrit.wikimedia.org/r/1007893

Change 1007893 merged by jenkins-bot:

[operations/deployment-charts@master] mw-api-int: Hold eqiad back on resources

https://gerrit.wikimedia.org/r/1007893

@Jgiannelos I've applied the resource increase, so you can proceed next week. I'm keeping this task open for now, since we'll probably adjust replicas back down once we have a good handle on the actual resource consumption of the switchover.