
[wmcs-cookbooks] Downtime alerts from cloudcumins
Open, Medium, Public

Description

Currently, the functions in wmcs_libs/alerts.py ssh to alert1001.wikimedia.org to downtime alerts. This works only if you run the cookbook from your laptop and you have global root privileges.

We have several cookbooks that we want to run from cloudcumins (e.g. wmcs.ceph.roll_reboot_osds or wmcs.openstack.roll_reboot_cloudgws) that require downtiming some alerts on wmcs-managed physical hosts. At the moment, from cloudcumins you can SSH into those hosts (with the cloud_cumin_master key), but you cannot silence alerts related to those hosts.

Some thoughts:

  • we can probably ignore Icinga alerts, as we want to move away from them anyway
  • we need a way to silence Prometheus alerts only for wmcs-managed hosts
    • is there a way to give limited access to the Prometheus API/CLI or do we need a separate Prometheus instance?

Event Timeline

fnegri triaged this task as Medium priority. Sep 27 2023, 1:51 PM
fnegri added a parent task: Restricted Task.
fnegri added a parent task: Restricted Task.

is there a way to give limited access to the Prometheus API/CLI or do we need a separate Prometheus instance?

AFAIK Alertmanager doesn't have any authz/n and we're managing access via firewall right now. But I hope to be corrected and that they added new features that could be leveraged for this.

The easiest option at the moment is to add the cloudcumin hosts to profile::alertmanager::api::rw.
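With read-write API access from cloudcumins, a cookbook could create silences over HTTP instead of SSHing to the alert host. The following is only a rough sketch of what a silence body for the Alertmanager v2 API (`POST /api/v2/silences`) might look like. The `build_silence` helper, the use of the `instance` label, and the host name in the usage note are assumptions for illustration and are not taken from spicerack or wmcs-cookbooks.

```python
import re
from datetime import datetime, timedelta, timezone


def build_silence(hosts, duration, comment, created_by):
    """Build a silence body for Alertmanager's POST /api/v2/silences.

    Hypothetical helper: it assumes alerts carry an ``instance`` label of
    the form "host" or "host:port", which may not hold for every alert.
    """
    # Timezone-aware UTC timestamps, serialized as RFC 3339 strings.
    start = datetime.now(tz=timezone.utc)
    end = start + duration
    # Escape host names so that any dots in FQDNs are matched literally.
    alternatives = "|".join(re.escape(host) for host in hosts)
    pattern = "^(" + alternatives + ")(:[0-9]+)?$"
    return {
        "matchers": [
            {
                "name": "instance",
                "value": pattern,
                "isRegex": True,
                "isEqual": True,
            },
        ],
        "startsAt": start.isoformat(),
        "endsAt": end.isoformat(),
        "createdBy": created_by,
        "comment": comment,
    }
```

Calling `build_silence(["cloudcephosd1001"], timedelta(hours=1), "roll reboot", "cookbook")` returns a dict that can be serialized with `json.dumps` and posted to the API. Scoping the matcher to an explicit host list keeps the silence limited to the hosts the cookbook is actually touching.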

Change 965468 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] cloud_management: add am profile for silences

https://gerrit.wikimedia.org/r/965468

Change 965468 merged by David Caro:

[operations/puppet@production] cloud_management: add cloudcumins to am api rw

https://gerrit.wikimedia.org/r/965468

One problem is that spicerack uses this to construct the silence start/end dates:

spicerack/alertmanager.py
start = datetime.utcnow().astimezone(tz=timezone.utc)

However, this seems to do a double timezone conversion: datetime.utcnow() returns the current UTC time as a timezone-unaware (naive) object, and astimezone() then assumes that value is local time and applies the offset a second time:

# It is currently 13:42 EET, so 11:42 UTC
>>> from datetime import datetime, timezone
>>> datetime.utcnow().astimezone()
datetime.datetime(2024, 1, 16, 11, 42, 36, 960520, tzinfo=datetime.timezone(datetime.timedelta(seconds=7200), 'EET'))
>>> datetime.utcnow().astimezone(tz=timezone.utc)
datetime.datetime(2024, 1, 16, 9, 43, 27, 417246, tzinfo=datetime.timezone.utc)

That needs to be fixed so we can use the same code on cloudcumin hosts (which can't SSH to the Alertmanager box) and on local laptops (which tend not to be in UTC).
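One way to avoid the double conversion is to ask for a timezone-aware UTC time directly rather than converting a naive one. This is a minimal sketch of that approach, not necessarily what the spicerack fix does; `silence_window` and `naive_utc_to_aware` are hypothetical names used only for illustration.

```python
from datetime import datetime, timedelta, timezone


def silence_window(duration):
    """Return timezone-aware (start, end) datetimes in UTC."""
    # datetime.now(tz=...) yields an aware datetime, so no later
    # astimezone() call has to guess which zone a naive value was in.
    start = datetime.now(tz=timezone.utc)
    return start, start + duration


def naive_utc_to_aware(naive):
    """Label a naive datetime that is already in UTC as UTC."""
    if naive.tzinfo is not None:
        raise ValueError("expected a naive datetime")
    # replace() attaches the zone without shifting the clock time,
    # whereas astimezone() would treat the naive value as local time.
    return naive.replace(tzinfo=timezone.utc)
```

Both helpers give the same result regardless of the local timezone of the machine running them, which is the property needed to share code between cloudcumin hosts and laptops.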

Change 991016 had a related patch set uploaded (by Majavah; author: Majavah):

[cloud/wmcs-cookbooks@main] openstack: cloudvirt: safe_reboot: Downtime during reboot

https://gerrit.wikimedia.org/r/991016

Change 991017 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/software/spicerack@master] alertmanager: fix timezone bug

https://gerrit.wikimedia.org/r/991017

Change 991017 merged by jenkins-bot:

[operations/software/spicerack@master] alertmanager: fix timezone bug

https://gerrit.wikimedia.org/r/991017

Change 991016 merged by jenkins-bot:

[cloud/wmcs-cookbooks@main] openstack: cloudvirt: safe_reboot: Downtime during reboot

https://gerrit.wikimedia.org/r/991016