
Support downtiming services in our cookbooks
Closed, Resolved (Public)

Description

It currently is not possible to downtime services via the downtime cookbook (as it expects hosts) but that would be really helpful.

Event Timeline

JMeybohm triaged this task as Medium priority. Mar 18 2021, 9:49 AM
JMeybohm created this task.

@JMeybohm thanks for the task, this is surely something we want to add support for.
There's also a catch that I'm not sure how to solve right now: services in Icinga are identified by their description, which is usually either hardcoded or partially auto-generated by Puppet and might contain host-specific data. So hardcoding the check descriptions in the cookbooks might be too brittle and prone to break very often.

We are already able to parse the Icinga status.dat file and get from it the list of existing checks for a given host, so that might help in some ways. But we should be careful to design an API robust enough to loosely couple what's hardcoded in the cookbook with the related service checks we want to match (maybe by matching services based on a pattern?).
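As a rough illustration of the pattern-matching idea, here is a minimal sketch that matches service descriptions against glob patterns. The function name and the sample descriptions are hypothetical; a real implementation would read the descriptions from the parsed status.dat data rather than from a hardcoded list.

```python
from fnmatch import fnmatchcase


def match_services(descriptions, patterns):
    """Return the service descriptions that match at least one glob pattern."""
    return [
        desc for desc in descriptions
        if any(fnmatchcase(desc, pattern) for pattern in patterns)
    ]


# Hypothetical descriptions, standing in for what status.dat lists for one host.
host_services = [
    "Check systemd state",
    "Disk space",
    "PyBal backends health check on lvs1001",
    "puppet last run",
]

# The cookbook hardcodes only the stable part of each description,
# leaving the host-specific suffix to the wildcard.
matched = match_services(host_services, ["PyBal backends health check*", "Disk *"])
print(matched)
```

With this approach the cookbook does not need to know the host-specific part of a description, at the cost of possibly matching more checks than intended if a pattern is too broad.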

fwiw, a possibly desired UX would be something like

$ sre.downtime.service 'service1|service2|service3' or $ sre.downtime.service service1 [service2] [service3]

where each service has the form <service_name>.svc.<site>.wmnet, which is a hostname. That is, we are OK with downtiming the entire "fake" (in Icinga terms) host and all of its services; there is no need for complex pattern matching or for identifying services one by one.
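To make the two proposed invocations concrete, here is a small sketch of how a cookbook could accept either form and check that every name looks like a .svc. hostname. The function name, the regular expression, and the example hostnames are assumptions for illustration only, not part of any existing cookbook.

```python
import re

# Assumed shape of a service hostname: <service_name>.svc.<site>.wmnet
SVC_HOST_RE = re.compile(r"^[a-z0-9-]+\.svc\.[a-z0-9]+\.wmnet$")


def parse_service_hosts(args):
    """Accept 'a|b|c' or separate arguments and return validated hostnames."""
    # Split every argument on '|' so both invocation styles end up as one flat list.
    hosts = [name for arg in args for name in arg.split("|") if name]
    invalid = [host for host in hosts if not SVC_HOST_RE.match(host)]
    if invalid:
        raise ValueError(f"Not a .svc. hostname: {', '.join(invalid)}")
    return hosts


# Both invocation styles yield the same list of hostnames.
piped = parse_service_hosts(["service1.svc.eqiad.wmnet|service2.svc.codfw.wmnet"])
separate = parse_service_hosts(["service1.svc.eqiad.wmnet", "service2.svc.codfw.wmnet"])
print(piped == separate)
```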

Doh, I think we have a naming clash here :)

  • service: as in Icinga single service belonging to an Icinga host
  • service: as in a WMF .svc. service but treated as a Host in Icinga terms

Indeed, if what we're looking for is to downtime Icinga hosts that are not real hosts, that's much easier to do and we should certainly add it. I can look into that.

Doh, I think we have a naming clash here :)

I figured, hence the comment.

  • service: as in Icinga single service belonging to an Icinga host
  • service: as in a WMF .svc. service but treated as a Host in Icinga terms

Indeed, if what we're looking for is to downtime Icinga hosts that are not real hosts, that's much easier to do and we should certainly add it. I can look into that.

Many thanks!

Change 674549 had a related patch set uploaded (by Volans; owner: Volans):
[operations/software/spicerack@master] icinga: add new IcingaHosts class

https://gerrit.wikimedia.org/r/674549

@JMeybohm with the above patch, once merged and deployed, you'll be able to use icinga_hosts(["foo.bar.baz", ...], verbatim_hosts=True) to get an IcingaHosts instance that will not mangle the hostnames you give it. At that point they must be valid Icinga host definitions, but not necessarily hostnames.
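A minimal sketch of how a cookbook might use this follows. Only icinga_hosts(..., verbatim_hosts=True) comes from the comment above; the admin_reason() and downtime(reason, duration=...) calls are assumptions based on the Spicerack API documentation, and the helper name is hypothetical. The spicerack object is taken as a parameter so the sketch does not depend on a Spicerack installation.

```python
from datetime import timedelta


def downtime_service_hosts(spicerack, hostnames, message, hours=1):
    """Downtime Icinga hosts by their exact names, without hostname mangling."""
    # verbatim_hosts=True keeps the names exactly as given (see the comment above).
    icinga_hosts = spicerack.icinga_hosts(hostnames, verbatim_hosts=True)
    # Assumption: admin_reason() and downtime() behave as in the Spicerack docs.
    reason = spicerack.admin_reason(message)
    icinga_hosts.downtime(reason, duration=timedelta(hours=hours))
    return icinga_hosts
```

Inside a cookbook, spicerack would be the instance handed to the cookbook by the framework, and hostnames could be a list of .svc. names.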

Change 674549 merged by jenkins-bot:
[operations/software/spicerack@master] icinga: add new IcingaHosts class

https://gerrit.wikimedia.org/r/674549

@JMeybohm this is now all supported.
We have an sre.hosts.remove-downtime cookbook that, when run with --force, asks the user whether they want to proceed with the hosts without verifying them against PuppetDB:

$ sudo cookbook -d sre.hosts.remove-downtime --force "cumin1001.mgmt"
DRY-RUN: Executing cookbook with args: Namespace(config_file='/etc/spicerack/config.yaml', cookbook='sre.hosts.remove-downtime', cookbook_args=['--force', 'cumin1001.mgmt'], dry_run=True, list=False, verbose=False)
>>> Will remove downtime for 1 unverified hosts: cumin1001.mgmt
Type "go" to proceed or "abort" to interrupt the execution
[...SNIP...]

It's also supported directly in Spicerack, for use in any cookbook, via the IcingaHosts class (https://doc.wikimedia.org/spicerack/master/api/spicerack.icinga.html#spicerack.icinga.IcingaHosts), which can be obtained by calling icinga_hosts (https://doc.wikimedia.org/spicerack/master/api/index.html#spicerack.Spicerack.icinga_hosts) on the Spicerack instance available within any cookbook.

I'm resolving the task but feel free to reopen if it doesn't cover all your use cases and/or you encounter any issue.