
cirrussearch: iron out and document procedure for puppet changes triggering restart
Closed, Resolved · Public · 3 Estimated Story Points

Description

Context on outage leading to this ticket

A recent outage (https://wikitech.wikimedia.org/wiki/Incident_documentation/2021-09-13_cirrussearch_restart) was caused as a side effect of puppet automatically restarting affected systemd units. In this incident, puppet was run manually 6 hosts at a time across the whole fleet, which (together with the puppet run completing extremely quickly) led to an unacceptably high number of hosts having their elasticsearch services restarted concurrently.

During the outage, the catalog was applied in just over 20 seconds; the batch size of 6 hosts at a time for 71 total hosts meant 12 batches were done, which took around 4-6 minutes in total.

Running puppet manually rather than waiting for the automated once-every-30-minutes puppet agent runs was the correct decision, but the way it was run (6 hosts at a time across the whole fleet) was not. There were two main issues:

(1) Given how quickly the puppet runs completed, 6 hosts at a time was too many.
(2) Our Elasticsearch indices are configured such that we can sustain 3 Elasticsearch hosts restarting concurrently (for example, this is the pace of our normal rolling operations). In theory 6 concurrent hosts split evenly across eqiad and codfw would be safe, but because the command targeted "6 hosts across the whole fleet" rather than "3 hosts each in eqiad and codfw", cumin likely restarted 6 eqiad hosts concurrently and then 6 codfw hosts concurrently (assuming it uses lexicographic ordering). Fix this by explicitly running the cumin commands against elastic1* (eqiad) and elastic2* (codfw) rather than just elastic* (everything).

What to actually do for this ticket

Suggestion: an appropriate procedure would be either of the following (see the sketch after the list):

  • Disable puppet across elastic*, then use cumin to run puppet on 3 hosts concurrently (explicitly targeting eqiad or codfw, i.e. elastic1* or elastic2* respectively, never both at once) with a sleep of 3-5 minutes included in the command, OR
  • Disable puppet across elastic*, then use cumin to run puppet on only one host at a time, without requiring a sleep (or with just a brief sleep of ~1 minute).
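
A rough sketch of the first option with cumin (host patterns, the disable reason, and the sleep value are illustrative; the commands mirror the disable-puppet / run-puppet-agent wrappers used later in this task):

# Disable puppet everywhere first, so the merged change isn't picked up by scheduled agent runs
sudo cumin 'elastic*' 'sudo disable-puppet "rolling out <change>"'
# Then run puppet one datacenter at a time, 3 hosts per batch, with the sleep baked into the command
sudo cumin -b 3 'elastic1*' 'sudo run-puppet-agent --force && sleep 300'   # eqiad
sudo cumin -b 3 'elastic2*' 'sudo run-puppet-agent --force && sleep 300'   # codfw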

Event Timeline

My impression is that puppet should never, under any circumstances, be allowed to stop or restart an elasticsearch instance. It seems it would be much safer if we enforce that spicerack is the only way to restart the elastic service.

This seems sensible to me, and I think it's already the intent per the comment at https://gerrit.wikimedia.org/g/operations/puppet/+/280ac57fded9eb5028e36fffbe24049f357461ff/modules/elasticsearch/manifests/instance.pp#310:

# Note that we don't notify the Elasticsearch service of changes to its
# config files because you need to be somewhat careful when restarting it.
# So, for now at least, we'll be restarting it manually.

I would look into systemd::service (more specifically the restart option, which defaults to false); IIRC it should be what's needed to change the current behavior.

My suggestion would be to do something along these lines:

  1. Disable puppet on all the hosts involved in the change before merging it. I usually check the targets via cumin (using something like c:profile::blabla, etc.), but in this case it could have been as easy as using the cumin alias (A:elastic, IIUC).
  2. Enable + run puppet on the canary/canaries (A:elastic-canary) and inspect the output of the puppet run. Anything suspicious at this stage could trigger several follow-ups (rollback, an additional change on top of the current one, etc.), and only one or a few hosts (depending on how big the canary set is) would be affected. This gives the operator time to understand what's wrong without rushing to stop puppet on the other nodes.
  3. If the puppet run looks good, roll out the change to the other nodes in small batches with adequate sleep time (a rough sketch with cumin follows below).

The procedure can of course be expanded, but this is what I have been using so far when deploying to production, and I have always found it useful. In this case, running puppet on the canary should have returned a note about elastic being restarted, which might have saved the day (sometimes understanding exactly what puppet did is not trivial, I know).
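
Purely as a sketch of those three steps (the A:elastic / A:elastic-canary aliases are the ones mentioned above; batch size, sleep, and the disable reason are illustrative):

# 1. Before merging, stop the scheduled agent runs on all affected hosts
sudo cumin 'A:elastic' 'sudo disable-puppet "deploying <change>"'
# 2. Run puppet on the canaries only and read the output carefully before going any further
sudo cumin 'A:elastic-canary' 'sudo run-puppet-agent --force'
# 3. If the canary run looks good, roll out to the rest in small batches with some breathing room
sudo cumin -b 3 'A:elastic' 'sudo run-puppet-agent --force && sleep 180'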

Lemme know your thoughts @RKemper :)

Yeah, the way I see it there are basically two problems that led to the outage:

(1) The proximate cause: puppet unexpectedly restarting the es services
(2) The operator error of not noticing the service restart on the canary and then not rolling back the change

For #1 we will want to explicitly tell puppet not to restart / notify elasticsearch on config changes. This seems to be the intent of the current code, as mentioned above. It's also possible that we stumbled upon some unexpected behavior (or possibly an actual bug, although that seems highly unlikely) of systemctl daemon-reload itself (i.e. the issue may not be in puppet-land). Most probably, however, something in puppet changed out from under us.

For #2, totally agreed with what elukey said above. In the case of this outage I did run puppet on a single host first, but I was kind of on autopilot and didn't pick up on the service restart and its obvious implications, so for all intents and purposes I didn't use a canary. While issue #1 is the proximate cause, #2 is in a certain sense the root cause: even if puppet unexpectedly restarts something, that should still be caught by our procedure and a rollback should be initiated immediately, as noted by elukey and others.

I updated (removed) the original suggestions in the ticket description body, since if we're ever confronted with a case where we know puppet will lead to a restart, the appropriate procedure is to use our sre.elasticsearch.rolling-operation cookbook, which handles all the tricky bits such as pausing writes on indices, waiting for green cluster status before proceeding to the next server batch, and so on.

So in conclusion, I definitely agree that the proper procedure is: pause puppet on all hosts before merging -> run puppet on a canary, pausing to verify that everything looks correct -> proceed to the rest of the fleet in batches (batches of 3 is a good size for elasticsearch). And if anything looks off on the canary (in this case, the unexpected service restart), an immediate rollback or some other ameliorative action should be taken.

For #1 I think that using systemd::service instead of service should be enough, but there may be some nuances that need to be tested/reviewed.

Last but not least: this is a great follow-up, but please keep in mind that human errors/mistakes happen even to the best SREs, so don't feel bad about the outage! The important part is trying to prevent it in the future. This follow-up will surely help others too; in fact I'd suggest that people who joined the WMF recently take a look at it (we could bring this up during the next SRE meeting).

Change 739379 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] elasticsearch: disallow puppet to restart

https://gerrit.wikimedia.org/r/739379

Just deployed https://gerrit.wikimedia.org/r/c/operations/puppet/+/721644 / https://gerrit.wikimedia.org/r/c/operations/puppet/+/739375, so I was able to poke around a bit regarding puppet/systemd restart behavior wrt elasticsearch. I can see that changing the unit file leads to a systemd daemon-reload:

Notice: /Stage[main]/Elasticsearch/Systemd::Unit[elasticsearch_6@.service]/File[/lib/systemd/system/elasticsearch_6@.service]/content:
--- /lib/systemd/system/elasticsearch_6@.service        2021-09-13 16:27:05.411764148 +0000
+++ /tmp/puppet-file20211116-31995-1qu6ef8      2021-11-16 23:20:04.323367486 +0000
@@ -18,7 +18,7 @@
 Group=elasticsearch

 # Force creation of /var/run/elasticsearch (see T276198)
-ExecPreStart=systemd-tmpfiles --create /usr/lib/tmpfiles.d/elasticsearch.conf
+ExecStartPre=systemd-tmpfiles --create /usr/lib/tmpfiles.d/elasticsearch.conf
 ExecStart=/usr/share/elasticsearch/bin/elasticsearch -p ${PID_DIR}/%i.pid --quiet

 # StandardOutput is configured to redirect to journalctl since

Info: Computing checksum on file /lib/systemd/system/elasticsearch_6@.service
Info: /Stage[main]/Elasticsearch/Systemd::Unit[elasticsearch_6@.service]/File[/lib/systemd/system/elasticsearch_6@.service]: Filebucketed /lib/systemd/system/elasticsearch_6@.service to puppet with sum 2a7fffcadfa83baea7348e104e872914
Notice: /Stage[main]/Elasticsearch/Systemd::Unit[elasticsearch_6@.service]/File[/lib/systemd/system/elasticsearch_6@.service]/content: content changed '{md5}2a7fffcadfa83baea7348e104e872914' to '{md5}297ce1c2d02c48085693cdbd29ae8e8d'
Info: /Stage[main]/Elasticsearch/Systemd::Unit[elasticsearch_6@.service]/File[/lib/systemd/system/elasticsearch_6@.service]: Scheduling refresh of Exec[systemd daemon-reload for elasticsearch_6@.service]
Notice: /Stage[main]/Elasticsearch/Systemd::Unit[elasticsearch_6@.service]/Exec[systemd daemon-reload for elasticsearch_6@.service]: Triggered 'refresh' from 1 event
Notice: Applied catalog in 24.05 seconds

systemd daemon-reload will

Reload systemd manager configuration. This will rerun all generators (see systemd.generator(7)), reload all unit files, and recreate the entire dependency tree.

So it's a "soft reload", taking changed configurations from filesystem and regenerating dependency trees

The "reload all unit files" part is presumably why the systemd daemon-reload causes systemd to see the new changes to the unit file (in this case, the ExecStartPre change):

Nov 16 23:20:04 elastic1049 systemd[1]: [/lib/systemd/system/elasticsearch_6@.service:21] Executable path is not absolute, ignoring: systemd-tmpfiles --create /usr/lib/tmpfiles.d/elasticsearch.conf

So in this case the systemd unit reloaded but didn't restart. A daemon-reload is the behavior I would expect, so at least in this case puppet did the right thing.
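
For what it's worth, a quick way to sanity-check this on a host (the instance name below is illustrative): NeedDaemonReload says whether the on-disk unit file still differs from what systemd has loaded, and ActiveEnterTimestamp only changes if the service was actually (re)started.

systemctl show 'elasticsearch_6@production-search-eqiad.service' \
    --property=NeedDaemonReload,ActiveEnterTimestamp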

Based on my reading of https://puppet.com/docs/puppet/5.5/types/service.html it's still not clear why the restart in the previous incident was triggered, given that we don't subscribe to the unit file, so in theory there shouldn't be a notify/subscribe event based on its change.

But regardless, it looks like with clever use of the puppet service resource's restart attribute we should be able to make it impossible for puppet to restart the elasticsearch service:

(from https://puppet.com/docs/puppet/5.5/types/service.html)

You can provide an explicit command for restarting with the restart attribute. (Otherwise, if hasrestart=>false (default) then the service's stop and start commands will be used)

So we can see with https://gerrit.wikimedia.org/g/operations/puppet/+/280ac57fded9eb5028e36fffbe24049f357461ff/modules/elasticsearch/manifests/instance.pp#313 that we currently leave restart unset. I'd propose that we set it to a no-op command, so that puppet runs that no-op instead of doing a stop followed by a start.

Not 100% sure that's the route we want to take, but it seems like a way to really ensure that puppet won't restart the service. I opened up a patch for that here: https://gerrit.wikimedia.org/r/c/operations/puppet/+/739379

Change 739379 merged by Ryan Kemper:

[operations/puppet@production] elasticsearch: disallow puppet to restart

https://gerrit.wikimedia.org/r/739379

Mentioned in SAL (#wikimedia-operations) [2021-11-18T00:17:56Z] <ryankemper> T290902 Disabling puppet across all elastic*: ryankemper@cumin1001:~$ sudo cumin '*elastic*' 'sudo disable-puppet "Merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/739379"'

Mentioned in SAL (#wikimedia-operations) [2021-11-18T00:18:43Z] <ryankemper> T290902 Merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/739379; running puppet agent on arbitrary elastic host: ryankemper@elastic1051:~$ sudo run-puppet-agent --force

Mentioned in SAL (#wikimedia-operations) [2021-11-18T00:20:33Z] <ryankemper> T290902 Test host looks good, proceeding to rest of fleet ryankemper@cumin1001:~$ sudo cumin -b 4 '*elastic*' 'sudo run-puppet-agent --force'

This seems mostly resolved. I added a small section[1] to the search page on wikitech to fill out the AC (acceptance criteria).