Investigate low resource usage on elastic1061-67
Closed, Resolved · Public

Description

On 2019-12-17 13:44 UTC both server load and QPS dropped on this set of servers and have stayed low for the following month.

per-server QPS (click "show all 49" to see problem): https://grafana.wikimedia.org/explore?orgId=1&left=%5B%221576589990313%22,%221576590608996%22,%22eqiad%20prometheus%2Fops%22,%7B%22expr%22:%22sum(clamp_min(deriv(elasticsearch_indices_search_query_total%7Bexported_cluster%3D%5C%22production-search-eqiad%5C%22%7D%5B2m%5D),%200))%20by%20(instance)%22,%22context%22:%22explore%22%7D,%7B%22mode%22:%22Metrics%22%7D,%7B%22ui%22:%5Btrue,true,true,%22none%22%5D%7D%5D
The above graph can also be zoomed out to 30 days to see that QPS dropped and stayed low.
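
For reference, the per-server QPS expression embedded in that Grafana link decodes to the following PromQL (the two-minute derivative of Elasticsearch's query counter, clamped at zero and summed per instance):

sum(clamp_min(deriv(elasticsearch_indices_search_query_total{exported_cluster="production-search-eqiad"}[2m]), 0)) by (instance)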

This is suspiciously well correlated with some SAL entries:

13:53 gehel@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
13:51 gehel@cumin1001: START - Cookbook sre.hosts.decommission
13:49 gehel@cumin1001: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
13:48 gehel@cumin1001: START - Cookbook sre.hosts.decommission

@Gehel Any idea what might have happened here?

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2020-01-09T17:37:53Z] <volans> confctl set/weight=10 for elastic10[53-67] - T242348

EBernhardson claimed this task.

Looks like the problem was load balancer weights. Setting them to the same value across the cluster evened out resource usage and narrowed the gap between the busiest and most idle servers (previously 120 to 900 QPS per server, now 300 to 700 with one outlier at 200). Overall the graph shows a much better distribution of work across the cluster.
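
As a rough illustration of why the weights mattered: ipvs weighted schedulers hand out connections roughly in proportion to each real server's weight, so a server whose weight silently fell back to 1 receives about a tenth of the traffic of a weight-10 peer. A minimal Python sketch, using illustrative server counts, weights, and total QPS rather than measured values:

def expected_qps(weights, total_qps):
    # Split total QPS in proportion to each server's weight.
    total_weight = sum(weights.values())
    return {host: total_qps * w / total_weight for host, w in weights.items()}

weights = {'old-%02d' % n: 10 for n in range(34)}       # weight explicitly set to 10
weights.update({'new-%02d' % n: 1 for n in range(15)})  # weight silently fell back to 1

shares = expected_qps(weights, total_qps=20000)
print(round(shares['old-00']))  # ~563 QPS on a weight-10 server
print(round(shares['new-00']))  # ~56 QPS on a weight-1 server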

Follow-up on how the weights got set to 0 in the first place:

10:04 < volans> a couple of months ago the defaults in conftool were changed because the 'services' were removed from hiera, and at that time the default weight was per-service. With the services removed, a new default had to be set per server, and the only reasonable default was 0, so 0 was used.
10:04 < volans> That explains why new server objects are now added to etcd with weight=0
10:05 < volans> The other side of the picture is pybal, which has never really supported weight 0 for a bunch of $reasons
10:05 < volans> and it was decided a long time ago that some architectural changes to its code were needed to make it properly support weight 0
10:05 < volans> that hasn't happened yet.
10:06 < volans> the current pybal code has:
10:06 < volans> if server.weight:
10:06 < volans> cmd += ' -w %d' % server.weight
10:06 < volans> and clearly 0 is falsy in Python, so it doesn't enter that if and doesn't pass the weight to ipvs
10:07 < volans> from man ipvsadm:
10:07 < volans> "The valid values of weight are 0 through to 65535. The default is 1."
10:07 < volans> so, by not passing the weight down to ipvs, we ended up with 1
10:08 < volans> although from these bits alone it seems a super easy patch could fix it, it's actually more complex than that because of a bunch of downstream behaviours in different scenarios
10:08 < volans> IIRC _joe_ has a script that simplifies the process of adding new servers with the right weight, I'll ask where that is
10:09 < volans> otherwise the short gist is: when adding new servers, before setting pooled=yes, don't forget to set the weight first
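
To make the pybal behaviour described above concrete, here is a minimal Python sketch. It is not actual pybal code: the command builder, the service address, and the second hostname are made up for illustration. It only shows how a weight of 0 never reaches ipvsadm and therefore falls back to the documented default of 1.

class Server:
    def __init__(self, host, weight):
        self.host = host
        self.weight = weight

def build_ipvsadm_cmd(server, service='10.2.2.30:9243'):
    # Hypothetical command builder mirroring the snippet quoted above.
    cmd = 'ipvsadm -a -t %s -r %s' % (service, server.host)
    if server.weight:  # 0 is falsy, so weight 0 never adds '-w'
        cmd += ' -w %d' % server.weight
    return cmd

print(build_ipvsadm_cmd(Server('elastic1061.eqiad.wmnet', weight=0)))
# ipvsadm -a -t 10.2.2.30:9243 -r elastic1061.eqiad.wmnet   (no -w: ipvsadm defaults to 1)
print(build_ipvsadm_cmd(Server('elastic1040.eqiad.wmnet', weight=10)))
# ipvsadm -a -t 10.2.2.30:9243 -r elastic1040.eqiad.wmnet -w 10

As the transcript notes, the seemingly obvious patch (testing the weight against None rather than for truthiness) is not enough on its own because of downstream behaviours that depend on weight 0.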