Page MenuHomePhabricator

Spicerack downtime methods fail when the admin reason includes an apostrophe
Closed, ResolvedPublic

Description

@BTullis discovered today that the sre.druid.roll-restart-workers cookbook fails when it tries to downtime hosts, giving this output:

btullis@cumin1001:~$ sudo cookbook sre.druid.roll-restart-workers analytics

[...]

Scheduling downtime on Icinga server alert1001.wikimedia.org for hosts: an-druid[1001-1005],druid[1001-1003]
----- OUTPUT of 'bash -c 'echo -n.../rw/icinga.cmd '' -----                                                                                                                                                        
bash: -c: line 0: unexpected EOF while looking for matching `"'                                                                                                                                                    
bash: -c: line 1: syntax error: unexpected end of file                                                                                                                                                             
================                                                                                                                                                                                                   
PASS |                                                                                                                                                                   |   0% (0/1) [00:00<?, ?hosts/s]          
FAIL |███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00,  2.92hosts/s]
100.0% (1/1) of nodes failed to execute command 'bash -c 'echo -n.../rw/icinga.cmd '': alert1001.wikimedia.org
0.0% (0/1) success ratio (< 100.0% threshold) for command: 'bash -c 'echo -n.../rw/icinga.cmd ''. Aborting.
0.0% (0/1) success ratio (< 100.0% threshold) of nodes successfully executed all commands. Aborting.
Exception raised while executing cookbook sre.druid.roll-restart-workers:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/spicerack/_menu.py", line 234, in run
    raw_ret = runner.run()
  File "/srv/deployment/spicerack/cookbooks/sre/druid/roll-restart-workers.py", line 79, in run
    with self.icinga_hosts.downtimed(self.reason, duration=timedelta(minutes=60)):

[...]

  File "/usr/lib/python3/dist-packages/spicerack/remote.py", line 668, in _execute
    raise RemoteExecutionError(ret, "Cumin execution failed")
spicerack.remote.RemoteExecutionError: Cumin execution failed (exit_code=2)
END (FAIL) - Cookbook sre.druid.roll-restart-workers (exit_code=99) for Druid analytics cluster: Roll restart of Druid's jvm daemons. - btullis@cumin1001

(Full output: P16993)

And the extended log includes this line, giving the failing command:

2021-08-10 16:49:52,268 btullis 32037 [INFO clustershell.py:853 in ev_timer] Completed command 'bash -c 'echo -n "[1628614191] SCHEDULE_HOST_DOWNTIME;an-druid1001;1628614191;1628617791;1;0;3600;btullis@cumin1001;Roll restart of Druid's jvm daemons." > /var/lib/icinga/rw/icinga.cmd ''

(Full /var/log/spicerack/sre/druid/roll-restart-workers-extended.log for this execution: P16994)

I'm pretty sure that in https://gerrit.wikimedia.org/r/705500 I introduced a shell-escaping regression, causing the downtime methods to fail when the given admin reason includes an apostrophe (in this case Druid's should have been escaped to Druid\'s.) Will confirm that's the issue, and send a patch. Thanks @BTullis for uncovering this!

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript

Change 711420 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/cookbooks@master] Workaround quote escaping bug

https://gerrit.wikimedia.org/r/711420

Change 711420 merged by Btullis:

[operations/cookbooks@master] Workaround quote escaping bug

https://gerrit.wikimedia.org/r/711420

The patch at https://gerrit.wikimedia.org/r/711420 is only a workaround for a single cookbook, removing the apostrophe. Not a fix for the underlying bug.
However, I can confirm that the cookbook runs when the apostrophe is removed.

Change 712784 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/software/spicerack@master] icinga: Use shlex to quote the command string for bash -c.

https://gerrit.wikimedia.org/r/712784

Change 712784 merged by jenkins-bot:

[operations/software/spicerack@master] icinga: Use shlex to quote the command string for bash -c.

https://gerrit.wikimedia.org/r/712784

Leaving this open until the next version of Spicerack is released with the fix, then we should be all set.

Ack, I'll make a new release in the next few days, claiming the task.

New spicerack release done, I've deployed it to the cumin hosts and tested that I can now downtime with an apostrophe in the reason message. Thanks for the fix.