Context on outage leading to this ticket
A recent outage (https://wikitech.wikimedia.org/wiki/Incident_documentation/2021-09-13_cirrussearch_restart) was caused as a side effect of puppet automatically restarting affected systemd units. In this incident, puppet was run manually 6 hosts at a time across the whole fleet. Combined with the puppet runs completing extremely quickly, this led to an unacceptably high number of hosts restarting their Elasticsearch services concurrently.
During the outage, the catalog was applied in just over 20 seconds per run; with a batch size of 6 hosts and 71 hosts in total, 12 batches were run, which took around 4-6 minutes in total.
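The arithmetic behind these figures can be checked with a rough sketch. It assumes each batch takes about as long as one catalog apply (20 seconds is used as a lower bound); any overhead between batches is not modelled, which is presumably why the observed total was 4-6 minutes rather than exactly 4.

```python
import math

TOTAL_HOSTS = 71     # fleet size from the incident description
BATCH_SIZE = 6       # hosts per batch during the incident
APPLY_SECONDS = 20   # lower bound: "just over 20 seconds" per catalog apply

# Number of batches needed to cover every host (the last batch is partial).
batches = math.ceil(TOTAL_HOSTS / BATCH_SIZE)

# Lower bound on the whole rollout, ignoring any overhead between batches.
min_total_seconds = batches * APPLY_SECONDS

print(f"{batches} batches, at least {min_total_seconds / 60:.0f} minutes")
```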
Running puppet manually rather than waiting for the automated once-every-30-minutes puppet agent runs was the correct decision, but the way it was run (6 hosts at a time across the whole fleet) was not. There were two main issues:
(1) Given how quickly the puppet runs completed, 6 hosts at a time was too many.
(2) Our Elasticsearch indices are configured such that we can sustain 3 Elasticsearch hosts being restarted concurrently per cluster (for example, this is the pace at which we do our normal rolling operations). In theory, 6 concurrent hosts split across eqiad and codfw should be safe. However, because the run was done as "6 hosts across the whole fleet" rather than "3 hosts in eqiad and 3 hosts in codfw", it is likely that the cumin command led to 6 eqiad hosts being restarted concurrently, and then 6 codfw hosts being restarted concurrently (assuming it uses lexicographic ordering). Fix this by explicitly running the cumin commands on elastic2* (for codfw) and elastic1* (for eqiad) rather than just elastic* (for everything).
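The effect of lexicographic ordering can be illustrated with a small simulation. This is only a sketch: the host names below are hypothetical (12 per datacenter, not the real host list), and it assumes batches are taken in sorted order, which is the assumption stated above rather than confirmed cumin behaviour.

```python
from collections import Counter

def batches(hosts, size):
    """Split hosts into consecutive batches of at most `size`, in sorted order."""
    ordered = sorted(hosts)
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]

def datacenter(host):
    """elastic1* hosts are in eqiad, elastic2* hosts are in codfw."""
    return "eqiad" if host.startswith("elastic1") else "codfw"

def max_concurrent_per_dc(hosts, size):
    """Largest number of hosts from a single datacenter in any one batch."""
    worst = 0
    for batch in batches(hosts, size):
        counts = Counter(datacenter(h) for h in batch)
        worst = max(worst, max(counts.values()))
    return worst

# Hypothetical fleet, for illustration only.
eqiad = [f"elastic1{n:03d}" for n in range(1, 13)]
codfw = [f"elastic2{n:03d}" for n in range(1, 13)]

# "6 hosts across the whole fleet": every batch lands in a single datacenter.
fleet_wide = max_concurrent_per_dc(eqiad + codfw, 6)

# "3 hosts per datacenter": each datacenter is targeted separately.
per_dc = max(max_concurrent_per_dc(eqiad, 3), max_concurrent_per_dc(codfw, 3))

print(fleet_wide, per_dc)
```

With sorted batching, the fleet-wide run puts all 6 hosts of a batch in the same datacenter, twice the sustainable 3, while targeting each datacenter separately stays at 3.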
What to actually do for this ticket
- Iron out the appropriate general procedure and document it here: https://wikitech.wikimedia.org/wiki/Search#Administration (perhaps a new sub-section, Deploying puppet changes impacting elasticsearch)
Suggestion: An appropriate procedure would be
- Disable puppet across elastic*, and then use cumin to run puppet on 3 hosts concurrently (explicitly run on eqiad xor codfw, i.e. elastic1* xor elastic2* respectively) with a sleep of 3-5 minutes between batches included in the command, OR
- Disable puppet across elastic*, and then use cumin to run puppet on only one host at a time, without requiring a sleep (or with just a brief sleep of ~1 minute).
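To compare the two suggested options, the rough duration of a rollout within one datacenter can be estimated as follows. This is a sketch under stated assumptions: the 36-host datacenter size is hypothetical, each batch is assumed to take about 20 seconds (the apply time seen during the incident), and the sleep is assumed to happen only between batches.

```python
import math

def estimate_seconds(n_hosts, batch_size, sleep_seconds, apply_seconds=20):
    """Estimate rollout time: one apply per batch, plus a sleep between batches."""
    n_batches = math.ceil(n_hosts / batch_size)
    return n_batches * apply_seconds + (n_batches - 1) * sleep_seconds

HOSTS_IN_DC = 36  # hypothetical datacenter size, for illustration only

# Option 1: 3 hosts at a time, with a 3 minute or 5 minute sleep.
option1_short = estimate_seconds(HOSTS_IN_DC, batch_size=3, sleep_seconds=180)
option1_long = estimate_seconds(HOSTS_IN_DC, batch_size=3, sleep_seconds=300)

# Option 2: 1 host at a time, with a ~1 minute sleep or no sleep at all.
option2_sleep = estimate_seconds(HOSTS_IN_DC, batch_size=1, sleep_seconds=60)
option2_nosleep = estimate_seconds(HOSTS_IN_DC, batch_size=1, sleep_seconds=0)

for label, secs in [
    ("3 at a time, 3 min sleep", option1_short),
    ("3 at a time, 5 min sleep", option1_long),
    ("1 at a time, 1 min sleep", option2_sleep),
    ("1 at a time, no sleep", option2_nosleep),
]:
    print(f"{label}: ~{secs / 60:.0f} minutes")
```

Under these assumptions both options finish a datacenter in well under an hour, so the choice can be made on safety rather than on speed.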