Stop using auto_expand_replicas on indices hosted by the cirrussearch cluster
Closed, Declined · Public · 1 Estimated Story Points

Description

auto_expand_replicas is a nice feature for "small" clusters: it automatically scales the number of index replicas as the cluster topology changes.
While it might make sense for cirrussearch to set this to "0-2" by default for small installations, it makes little sense for the WMF production setup.
In T400160#11106744 we have some evidence that this setting could increase the computational cost of some cluster operations (it is apparently evaluated quite frequently to determine whether the number of replicas has to change).

I don't see a valid reason for us to keep this setting: we can't possibly run the cluster with two nodes, so "0-2" makes little sense.
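
To illustrate why "0-2" is effectively a constant for us, here is a minimal Python sketch of how a "min-max" range could resolve to a replica count. This is a simplified model based on my understanding of the feature, not the actual Elasticsearch/OpenSearch code; the function name `desired_replicas` is hypothetical, and it ignores allocation filtering entirely.

```python
from typing import Optional


def desired_replicas(setting: str, data_nodes: int) -> Optional[int]:
    """Simplified model of resolving an auto_expand_replicas range.

    Assumption: the target is one replica per additional data node,
    clamped to the configured [min, max] range. Returns None when no
    value can satisfy the range with the nodes available.
    """
    low_text, high_text = setting.split("-", 1)
    low = int(low_text)
    cap = data_nodes - 1  # a shard copy cannot share a node with its primary
    high = cap if high_text == "all" else min(int(high_text), cap)
    wanted = max(low, min(cap, high))
    return wanted if low <= wanted <= high else None


# With "0-2", only clusters of one or two nodes ever get fewer than 2 replicas.
for nodes in (1, 2, 3, 50):
    print(nodes, desired_replicas("0-2", nodes))
```

Under this model any cluster with three or more data nodes always resolves to 2, so the range only does useful work on installations far smaller than ours.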

AC:

  • decide if it's worth the effort
  • stop creating indices with auto_expand_replicas on the cirrussearch cluster

Event Timeline

Restricted Application added a subscriber: Aklapper.

Change #1182192 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[operations/mediawiki-config@master] cirrus: Stop using auto_expand_replicas

https://gerrit.wikimedia.org/r/1182192

Change #1182192 abandoned by Ebernhardson:

[operations/mediawiki-config@master] cirrus: Stop using auto_expand_replicas

Reason:

On review, this isn't going to do what we need. To avoid the hostname regex checks when deciding capacity, we need to fully eliminate auto_expand_replicas, and that would have to happen on the cirrus side.

https://gerrit.wikimedia.org/r/1182192

It turns out we can't avoid the regex hostname checks by setting a zero-width range on auto_expand_replicas; we would have to get rid of it entirely. That is possible, but it seems more error-prone than is worthwhile, as there are a variety of places that change the replica count, do some work, then change it back. Having multiple code paths through all of that seems less than ideal.

Moving forward, I think our solution will partly be to remember that we can't have a large ban list, and that it should go away as soon as reasonable. T399900 can also potentially monitor the situation and remind us of the above. Plausibly we could also upstream some code to improve performance: we suspect a medium-sized LRU cache for compiled regex automatons would help. Even a cache over (pattern, string) -> match, if appropriately sized, could be a large improvement.
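
As a rough illustration of the two caching ideas, here is a minimal Python sketch. It uses Python's `re` module and `functools.lru_cache` as stand-ins for the upstream Java regex automatons, so it only shows the shape of the idea and is not a proposed patch. The function names, cache sizes, and hostnames are all hypothetical.

```python
import re
from functools import lru_cache


@lru_cache(maxsize=256)
def compiled(pattern: str) -> "re.Pattern[str]":
    """First idea: cache the compiled form of each ban-list pattern."""
    return re.compile(pattern)


@lru_cache(maxsize=4096)
def name_matches(pattern: str, hostname: str) -> bool:
    """Second idea: cache the (pattern, string) -> match result itself."""
    return compiled(pattern).fullmatch(hostname) is not None


def is_banned(hostname: str, ban_list: tuple) -> bool:
    """Return True if any ban-list pattern matches the hostname."""
    return any(name_matches(pattern, hostname) for pattern in ban_list)


# Hypothetical ban list and hostnames, for illustration only.
BAN_LIST = (r"node-1\d\.example", r"node-30\.example")

if __name__ == "__main__":
    for _ in range(3):  # repeated evaluations are served from the cache
        print(is_banned("node-12.example", BAN_LIST))
    print(name_matches.cache_info())
```

The second cache needs roughly (patterns × hostnames) entries to stay warm, which is another reason to keep the ban list short.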