
Set API and appserver weights in eqiad
Closed, Resolved (Public)

Description

In the 2022-03-27 API incident, we found that the api-https weights are all set to 30, even though the newer machines can handle more traffic. This caused the cluster to saturate unevenly. We should raise the weights on the newer servers.

Event Timeline

herron triaged this task as Medium priority. Mar 28 2022, 4:21 PM

eqiad: the servers can be divided into 4 groups by procurement ticket:

group A - T159963 - mw1307 through mw1348 - R430 Dual Intel Xeon E5-2650 CPU (2.2GHz/12 cores per cpu)

group B - T228214 - mw1349 through mw1384 - R440 Intel Xeon Silver 4114 2.2GHz

group C - T233639 - mw1385 through mw1413 - R440 Intel Xeon Silver 4214 2.2GHz

group D - T271155 - mw1414 through mw1456 - R440 Intel Xeon Silver 4214 2.2GHz
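The eqiad grouping above can be sketched as a small lookup, which may help when checking which hardware generation a given host belongs to. This is only an illustration: the host ranges are taken from the list above, while the names `EQIAD_GROUPS` and `eqiad_group` are hypothetical and not part of any existing tooling.

```python
import re

# Host-number ranges per procurement group, copied from the eqiad list above.
# range() excludes its end value, so each upper bound is "last host + 1".
EQIAD_GROUPS = {
    "A": range(1307, 1349),  # T159963, R430 Xeon E5-2650
    "B": range(1349, 1385),  # T228214, R440 Xeon Silver 4114
    "C": range(1385, 1414),  # T233639, R440 Xeon Silver 4214
    "D": range(1414, 1457),  # T271155, R440 Xeon Silver 4214
}


def eqiad_group(hostname):
    """Return the group letter for an eqiad mw host, or None if it is in no group."""
    match = re.match(r"mw(\d+)", hostname)
    if match is None:
        return None
    number = int(match.group(1))
    for group, host_range in EQIAD_GROUPS.items():
        if number in host_range:
            return group
    return None
```

For example, `eqiad_group("mw1320")` falls in group A, the only eqiad group on the older R430 hardware.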


codfw:

group A - T151779 - mw2251 through mw2258 - R430 Xeon E5 2650 - 2.2GHz/12core - already set to 25

group B - T183404 - mw2259 through mw2290 - R430 Xeon E5 2650 - 2.3GHz/10 cores - already set to 25

group C - T231255 - mw2291 through mw2376 - R440 Intel Xeon Silver 4214 2.2GHz - already set to 30


Based on this I would say we should set mw1307 through mw1348 to 25 (or 20?) and then we'd be done here.
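To get a feel for what 25 versus 30 means in practice, here is a rough sketch of the arithmetic. It assumes the load balancer hands out traffic in proportion to the configured weights, which is an assumption about the scheduler rather than something stated in this task. The four-host pool is hypothetical and far smaller than the real one.

```python
from fractions import Fraction


def traffic_shares(weights):
    """Map each host to its fraction of total traffic, assuming
    traffic is distributed in proportion to weight."""
    total = sum(weights.values())
    return {host: Fraction(weight, total) for host, weight in weights.items()}


# Hypothetical pool: two older (group A) hosts at 25, two newer hosts at 30.
pool = {"mw1307": 25, "mw1308": 25, "mw1414": 30, "mw1415": 30}
shares = traffic_shares(pool)

# Traffic an older host receives relative to a newer one.
old_vs_new = shares["mw1307"] / shares["mw1414"]
print(old_vs_new)  # 5/6
```

Under that assumption, an older host at 25 receives 5/6 (about 83%) of the traffic a newer host at 30 receives; at 20 it would be 2/3.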

There is a new type of server now:

group D - mw2416, mw2417 and mw2418 - R440 - Xeon Silver 4210R 2.4GHz (40 processors, 128GB RAM); that's only 40 processors vs 48, but more RAM (128GB vs 96GB)

These are assigned as API servers and are of the new type: Dell PowerEdge R440 - ConfigC 202107

role assignments: https://gerrit.wikimedia.org/r/c/operations/puppet/+/785918

procurement ticket: https://phabricator.wikimedia.org/T286516

@RLazarus @Joe Just saw this again in the history after a while. re: https://config-master.wikimedia.org/pybal/eqiad/api-https

My suggestion was to set mw1307 through mw1348 to 25 (or 20?) because they are in group A as listed above.

Presumably after that this ticket would then be resolved. Should I do it? Would you want to check as well?

That sounds right to me; it would give us the same distribution as codfw, which is probably as much work as we need to do on this. I don't think it's worth investing time into any deliberate benchmarking, but if the cluster happens to saturate unevenly again in a future incident, we can tweak opportunistically.

By the way, we should do the same thing for the appservers-https cluster: https://config-master.wikimedia.org/pybal/eqiad/appservers-https. If you don't mind lowering those Xeon E5s to 25 at the same time, we can just repurpose this ticket to cover both. (But if it'll be a longer-lived item, we can open a new task for it, either way works for me.)

Thanks for grabbing this! If it turns out you don't have time after all, just let me know and I can pick it up.

Dzahn renamed this task from Set API server weights to Set API and appserver weights in eqiad. Aug 24 2022, 8:41 PM

Mentioned in SAL (#wikimedia-operations) [2022-08-24T21:20:05Z] <mutante> setting weight to 25 (from 30) for appservers and API servers in the range mw1307 through mw1348 because they are of an older hardware type (not changing weights of jobrunners/videoscalers even if in this range) (T304800)

@RLazarus Thank you! Done! I changed the weight from 30 to 25 for every server in the range mw1307 through mw1348 that was in either the appserver or the api_appserver cluster. I left the weights of jobrunners/videoscalers unchanged, even if they are within this range.
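For reference, the selection rule for this change can be sketched as follows. This is not the tool that was actually used to make the change, only an illustration of the rule. The `inventory` records, their field names and the cluster assignments of the sample hosts are hypothetical; the range, the cluster names `appserver` and `api_appserver`, and the weights 30 and 25 come from this task.

```python
import re

OLD_HW_RANGE = range(1307, 1349)  # mw1307 through mw1348 (T159963)
TARGET_CLUSTERS = {"appserver", "api_appserver"}
NEW_WEIGHT = 25


def reweight(hosts):
    """Return a copy of `hosts` with the weight lowered for older
    appservers and API servers; all other entries keep their weight."""
    updated = []
    for entry in hosts:
        entry = dict(entry)  # copy, so the input list is left untouched
        match = re.match(r"mw(\d+)", entry["host"])
        in_range = match is not None and int(match.group(1)) in OLD_HW_RANGE
        if in_range and entry["cluster"] in TARGET_CLUSTERS:
            entry["weight"] = NEW_WEIGHT
        updated.append(entry)
    return updated


# Hypothetical sample data; cluster assignments are made up for illustration.
inventory = [
    {"host": "mw1307", "cluster": "api_appserver", "weight": 30},
    {"host": "mw1320", "cluster": "jobrunner", "weight": 30},
    {"host": "mw1348", "cluster": "appserver", "weight": 30},
    {"host": "mw1414", "cluster": "appserver", "weight": 30},
]
result = reweight(inventory)
```

In this sample only the first and third entries drop to 25: the jobrunner is in range but in a different cluster, and mw1414 is outside the range.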

https://config-master.wikimedia.org/pybal/eqiad/api-https

https://config-master.wikimedia.org/pybal/eqiad/appservers-https

https://config-master.wikimedia.org/pybal/eqiad/jobrunner

And here is the Netbox query showing that this is the correct range, taken from procurement ticket T159963:

https://netbox.wikimedia.org/dcim/devices/?q=mw13&serial=&asset_tag=&mac_address=&console_ports=&console_server_ports=&power_ports=&power_outlets=&interfaces=&pass_through_ports=&has_primary_ip=&virtual_chassis_member=&local_context_data=&cf_purchase_date=&cf_ticket=+T159963

Dzahn claimed this task.