
More structured cookbooks to reboot hosts
Open, MediumPublic

Description

High level plan:

  • Create some common infrastructure for a rolling reboot cookbook in the form of library functions, which are then only glued together with as little code duplication as possible.
  • For each service a specific cookbook is instantiated in the form of sre.<service>.reboot (e.g. sre.swift.reboot), plus a separate cookbook to reboot one-off hosts as sre.hosts.reboot-single.

What should the general reboot framework cover? (A rough sketch of the batching loop follows the list below.)

  • targets a set of hosts addressed by a Cumin alias or a Cumin host globbing
  • it takes a target batch size of hosts, with an upper bound specified both as an absolute number and as a percentage (since cluster sizes can differ per DC)
  • the batch is downtimed and depooled
  • reboots are triggered for all servers in the batch in parallel
  • for the batch of rebooted hosts, each host's availability is checked (Icinga checks recovering plus potential service-specific metrics/tool/consistency checks)
  • once the whole batch is recovered, repool and start with the next batch
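
Roughly, such a generic library function could look like the sketch below. This is only an illustration of the batching loop described above: the callback names (depool, repool, reboot, is_healthy) and the timeout values are made up and are not existing Spicerack APIs.

import math
import time
from typing import Callable, List, Sequence


def batch_size(hosts: Sequence[str], max_hosts: int, max_percent: int) -> int:
    """Effective batch size, bounded both by an absolute number and a percentage."""
    return max(1, min(max_hosts, math.floor(len(hosts) * max_percent / 100)))


def rolling_reboot(
    hosts: Sequence[str],
    max_hosts: int,
    max_percent: int,
    depool: Callable[[List[str]], None],
    repool: Callable[[List[str]], None],
    reboot: Callable[[List[str]], None],
    is_healthy: Callable[[str], bool],
    poll_seconds: int = 30,
    timeout_seconds: int = 1800,
) -> None:
    """Downtime/depool a batch, reboot it in parallel, wait for recovery, repool."""
    size = batch_size(hosts, max_hosts, max_percent)
    for start in range(0, len(hosts), size):
        batch = list(hosts[start:start + size])
        depool(batch)    # the same step would also set the Icinga downtime
        reboot(batch)    # reboots are triggered for the whole batch in parallel
        deadline = time.monotonic() + timeout_seconds
        while not all(is_healthy(host) for host in batch):
            if time.monotonic() > deadline:
                raise RuntimeError(f'Batch {batch} did not recover within the timeout')
            time.sleep(poll_seconds)
        repool(batch)    # only once the whole batch has recovered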

When adding a new service-specific cookbook (e.g. thumbor), the new sre.reboot.thumbor cookbook would only need to define the acceptable level of hosts being down and the criteria to confirm a host is back up and healthy. And once it has been initially vetted, the reboot process is much more future-proof than a generic cookbook which might be misused.
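
For illustration, a hypothetical sre.reboot.thumbor cookbook built on the rolling_reboot() sketch above would then only have to supply those two service-specific pieces; the thresholds and the health check below are invented, not the real Thumbor criteria.

from sre_reboot_lib import rolling_reboot  # hypothetical module holding the sketch above


def thumbor_is_healthy(host: str) -> bool:
    """Criteria to confirm a host is back up: Icinga green plus a service probe."""
    ...  # e.g. check Icinga status and request a test thumbnail
    return True


def run(hosts):
    rolling_reboot(
        hosts,
        max_hosts=2,        # acceptable absolute number of hosts down at once
        max_percent=10,     # acceptable share of the cluster down at once
        depool=lambda batch: print('downtime + depool', batch),
        repool=lambda batch: print('repool', batch),
        reboot=lambda batch: print('reboot', batch),
        is_healthy=thumbor_is_healthy,
    )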

Once such a generic framework exists, the existing cookbooks could also be converted over:

  • sre.elasticsearch.rolling-reboot: Rolling reboot of elasticsearch servers
  • sre.hadoop.reboot-workers: Reboot Hadoop worker nodes
  • sre.maps.reboot: Maps reboot cookbook
  • sre.wdqs.reboot: WDQS reboot cookbook

(and potentially also fold in sre.hosts.upgrade-and-reboot one way or the other; the upgrade part is a bit of an antipattern)

Available tools in Spicerack:

  • Since the first development of that CR, Spicerack now has better support for running Cumin commands on clusters behind load balancers; see LBRemoteCluster. It should be used to manage any cluster behind a load balancer, for the additional features and safety nets it provides.
  • In case we want this new cookbook to work only on hosts not behind a load balancer, we should enforce that by checking with confctl that this is indeed the case (a sketch of such a check follows this list).
  • What about hosts that are part of clusters that have a more specialized cookbook that should be used instead? Should we detect it and tell the user the other cookbook should be used? At the same time all of those cookbooks restart the whole cluster AFAIK.
  • Still in the case that we want this new cookbook to work only on hosts not behind a load balancer, we should use a less generic name than reboot, which implies that it works with any host.
  • As for the Icinga downtime, we have a way to force a recheck of all the services, which should be useful to prevent false alarms.
  • Again for Icinga, we can now get the status of the checks for a given set of hosts, allowing us to quickly see whether they are in an optimal state. See get_status().
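
To make the confctl and Icinga bullets above concrete, here is a hedged sketch of how those checks might be glued into a cookbook. The method names (confctl(), get(), icinga_hosts(), recheck_all_services(), get_status(), .optimal, .failed_services) reflect my reading of the Spicerack APIs referenced above and should be treated as assumptions, not as the definitive interface.

# Hedged sketch only: Spicerack method names and signatures are assumptions.
def ensure_not_load_balanced(spicerack, hostnames):
    """Refuse to act on hosts that confctl knows about (i.e. behind a load balancer)."""
    confctl = spicerack.confctl('node')
    behind_lb = [obj.name for obj in confctl.get(name='|'.join(hostnames))]
    if behind_lb:
        raise RuntimeError(
            'Hosts managed by the load balancers, use the service-specific cookbook '
            '(or LBRemoteCluster) instead: {}'.format(behind_lb))


def check_recovery(spicerack, hostnames):
    """Force an Icinga recheck, then verify the hosts are back in an optimal state."""
    icinga = spicerack.icinga_hosts(hostnames)
    icinga.recheck_all_services()
    status = icinga.get_status()
    if not status.optimal:
        raise RuntimeError('Hosts not yet recovered: {}'.format(status.failed_services))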


Event Timeline

Quick stupid idea: 1) Insert a hook after downtime for custom code. 2) Have a configured way to tell which hosts load which class in the hierarchy, be it an abort ("this host should never be rebooted") or some other functionality ("depool from pybal"). 3) Start writing reboot modules for all hosts until we have complete coverage. 4) Profit!
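
A minimal sketch of that idea follows; the class names and the host-to-handler mapping are made up purely for illustration.

# Sketch of the hook + class-hierarchy idea; names and mapping are invented.
class RebootHandler:
    """Default behaviour: nothing pooled, nothing special to do after the downtime."""

    def post_downtime_hook(self, host):
        pass

    def depool(self, host):
        pass

    def repool(self, host):
        pass


class NeverReboot(RebootHandler):
    """Abort: this host should never be rebooted automatically."""

    def post_downtime_hook(self, host):
        raise RuntimeError('{} must not be rebooted by a cookbook'.format(host))


class PybalPooled(RebootHandler):
    """Depool from pybal before the reboot, repool afterwards."""

    def depool(self, host):
        print('depooling {} from pybal'.format(host))

    def repool(self, host):
        print('repooling {} in pybal'.format(host))


# Configured way to tell which hosts load which class in the hierarchy.
HANDLERS = {
    'lvs*': NeverReboot,
    'thumbor*': PybalPooled,
}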

As the CR specifically mentions operating on a single host, I don't think that LBRemoteCluster comes into play here. I do think it would be great to have cookbooks which can safely reboot every host in a cluster or just restart the critical services. However, there is also a use case to just reboot an individual host.

Reading between the lines, I think the following concerns exist in relation to the linked CR:

  • we may remove icinga downtime before the host is green
  • there is no pool/depool option

For the first issue we can use the soon to be released HostStatus class.

For the repool/depool aspect of things I think all hosts should have /usr/local/bin/(de)pool installed on the server, which does the correct thing to depool itself, or nothing if it doesn't have a pooled service. This would provide a very simple interface for cookbooks and users to depool an individual host, or with Cumin a selection of hosts, which IMO is a big win in general.

Currently 55% of servers are missing the pool/depool command, and a quick eyeball suggests that most of them are missing it because there is nothing to pool/depool, or because it would IMO be fairly trivial to add a pool/depool script. The two that I think may be problematic would be the DNS auth servers and the LVS servers. Of course I'm also expecting some historical context to potentially raise its head here.
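
As a sketch of how a cookbook could consume that interface (run_sync() is the usual Spicerack remote-execution call; the /usr/local/bin/(de)pool scripts being present everywhere is exactly the coverage gap described above):

import time

# Sketch only: assumes /usr/local/bin/(de)pool exists on the target and is a
# no-op on hosts that have nothing pooled, as proposed above.
def depool_and_wait(remote_hosts, grace_seconds=30):
    """Depool a host before rebooting it and give in-flight requests time to drain."""
    remote_hosts.run_sync('/usr/local/bin/depool')
    time.sleep(grace_seconds)


def repool(remote_hosts):
    """Repool the host once it is back and healthy."""
    remote_hosts.run_sync('/usr/local/bin/pool')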

For the repool/depool aspect of things I think all hosts should have /usr/local/bin/(de)pool

For the record, I would definitely work on this piece if people agree it's a good idea.

As the CR specifically mentions operating on a single host I don't think that LBRemoteCluster comes into play here.

For smaller clusters we might still cross a threshold of depooled hosts when doing a single reboot, while some hosts may already be depooled for other reasons, so it might be useful here too.

My main point is that we either should support the ones that need depool too or add some sort of protection to prevent the cookbook run on those hosts.
I'd like some input from the service owners of those clusters.

If the final goal of this cookbook is to only ever reboot a single host, then we should name it differently too, like reboot-host or similar.

Of course I'm also expecting some historical context to potentially raise its head here.

Cannot speak for other clusters, but while databases are not impossible, we are very far away from being able to do that fully automatically. But we are happy to have someone/guide someone working on it, if you want to start on it. 0:-D

As the CR specifically mentions operating on a single host I don't think that LBRemoteCluster comes into play here.

For smaller clusters we might still cross a threshold of depooled hosts when doing a single reboot, while some hosts may already be depooled for other reasons, so it might be useful here too.

My main point is that we either should support the ones that need depool too or add some sort of protection to prevent the cookbook run on those hosts.
I'd like some input from the service owners of those clusters.

I'm generally sceptical of a generic clustered rolling reboot script. There are a number of subtle issues (starting with depool thresholds for small clusters, up to some clusters needing a specific pace, as e.g. caches can get voided), and from my PoV having a very generic rolling-reboot script might lead us to an outage. I would rather envision making cookbooks for rolling cluster restarts explicit. We already have the cookbooks mentioned by Riccardo in the task description, and ideally we create some common infrastructure for a rolling reboot cookbook in the form of library functions which are then only glued together with as little code duplication as possible.

One example. Let's assume there's a generic library function (or set of) which

  • targets a set of hosts addressed by a Cumin alias or a Cumin host globbing
  • it takes a target batch size of hosts, with an upper bound specified both as an absolute number and as a percentage (since cluster sizes can differ per DC)
  • the batch is downtimed and depooled
  • reboots are triggered
  • for the batch of rebooted hosts, each host's availability is checked (Icinga checks recovering plus potential service-specific metrics/tool/consistency checks)
  • once the whole batch is recovered, repool and start with the next batch

Let's say we now add a new cookbook to restart Thumbor. The new sre.reboot.thumbor cookbook would only need to define the acceptable level of hosts being down and the criteria to confirm a host is back up and healthy.
And once it has been initially vetted, the reboot process is much more future-proof than a generic cookbook which might be misused.

Once such a generic framework exists, the existing cookbooks for Elastic / Hadoop could also be converted over.

If the final goal of this cookbook is to only ever reboot a single host, then we should name it differently too, like reboot-host or similar.

That is the intention in fact. Happy to rename it either way.

And in that scheme from above the cookbook could be named sre.reboot.single-host?

MoritzMuehlenhoff renamed this task from Create cookbook to reboot hosts to More structured cookbooks to reboot hosts. May 27 2020, 11:57 AM
MoritzMuehlenhoff updated the task description.

Change 596187 had a related patch set uploaded (by Muehlenhoff; owner: Muehlenhoff):
[operations/cookbooks@master] Add a Spicerack cook book to reboot hosts

https://gerrit.wikimedia.org/r/596187

Change 596187 merged by Muehlenhoff:
[operations/cookbooks@master] Add a Spicerack cook book to reboot hosts

https://gerrit.wikimedia.org/r/596187

For posterity, one traceback triggered by some terminal sequence magic in Puppet's "Ruby 2.1 is deprecated" warning:

Exception raised while executing cookbook sre.hosts.reboot-single:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/spicerack/cookbook.py", line 410, in _run
    ret = self.module.run(args, self.spicerack)
  File "/srv/deployment/spicerack/cookbooks/sre/hosts/reboot-single.py", line 50, in run
    puppet.wait_since(reboot_time)
  File "/usr/lib/python3/dist-packages/spicerack/decorators.py", line 94, in wrapper
    return func(*args, **kwargs)  # type: ignore
  File "/usr/lib/python3/dist-packages/spicerack/puppet.py", line 254, in wait_since
    last_run = datetime.utcfromtimestamp(int(output.message().decode()))
ValueError: invalid literal for int() with base 10: "\x1b[1;33mWarning: Support for ruby version 2.1.5 is deprecated and will be removed in a future release. See https://puppet.com/docs/puppet/latest/system_requirements.html for a list of supported rub
END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99)

For posterity, one traceback triggered by some terminal sequence magic in Puppet's "Ruby 2.1 is deprecated" warning:

@MoritzMuehlenhoff I know what it is: we run a puppet command to get the last Puppet run and then parse the output. We can add some stderr redirection to /dev/null, but then we risk hiding other errors. Thoughts?
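
One possible middle ground, instead of redirecting stderr to /dev/null wholesale, would be to strip terminal escape sequences and only parse the trailing numeric token. This is a hypothetical tweak to the output parsing, not the actual Spicerack change:

import re
from datetime import datetime

# Colour/formatting escape codes like the ones in the Puppet deprecation warning.
ANSI_ESCAPE = re.compile(r'\x1b\[[0-9;]*m')


def parse_last_run(raw: bytes) -> datetime:
    """Extract the epoch timestamp of the last Puppet run from possibly noisy output."""
    text = ANSI_ESCAPE.sub('', raw.decode(errors='replace'))
    match = re.search(r'(\d+)\s*$', text)   # the timestamp is the last token printed
    if match is None:
        raise ValueError('No epoch timestamp found in Puppet output: {!r}'.format(text))
    return datetime.utcfromtimestamp(int(match.group(1)))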

Do we need to fix it, or was it a specific odd case with a very old version of Ruby?

Do we need to fix it, or was it a specific odd case with a very old version of Ruby?

I think we can safely ignore it; this is only an issue on Jessie (as its Ruby is so old) and the cookbook still works fine (except that it logs a failure). I mostly just added it to the task in case someone else is searching for the error; let's not waste time on fixing Jessie-specific bugs.