
Spicerack: add support for Alertmanager
Open, Medium, Public

Description

As some SRE teams are starting to define alerts in alertmanager, we should add some support for it in Spicerack, at least to support silencing alerts and removing silences, equivalent to the Icinga downtime feature.

Related documentation is available on wikitech.

Keep in mind that the current v2 of the API is at version 0.0.1 and has very limited features. It does support silencing alerts, although it requires keeping state, as the silence ID is required to be able to delete a silence later.

Related blog post with some useful info.

Event Timeline

Volans triaged this task as Medium priority. Oct 13 2021, 11:01 AM
Volans created this task.
Restricted Application added a subscriber: Aklapper.
Volans added subscribers: jbond, fgiunchedi.

I had a chat with @jbond about this yesterday, putting the summary here for future reference for those who will work on this.

In Spicerack we currently expose an icinga_hosts accessor that allows interacting with Icinga for a given set of hosts.
As we will need to support both systems for the foreseeable future, we probably want to expose a single interface to cookbook users for the most common operations (downtime, check if all is "green") while still allowing access to the specific lower-level interfaces to Icinga and Alertmanager separately if needed.

To put this into practical terms, something like:

class Spicerack:
    def alerting_hosts():  # was monitoring_hosts()
    def icinga_hosts():
    def alertmanager_hosts():

So that when using the combined accessor, actions will be performed in both systems automatically.
The existing HostsStatus and HostStatus classes currently defined in icinga.py should probably be generalized a bit to make them work with both systems and be moved to the monitoring module (or elsewhere).

There are some open questions that should probably be discussed a bit more in depth with SRE Observability:

  • Have a consistent way to match a host in alertmanager, ensuring that a given label is consistently either the hostname or the FQDN (but always the same) and can be reliably used to filter alerts for a given host.
    • For alerts that are not attached to physical hosts, the label should match the hosts that we have in Icinga ($hostname.mgmt, example.wikimedia.org, 192.168.1.1, etc...)
    • Will this sort of mapping that has a concept of hosts and services connected to it be maintained going forward?
  • What label to select when downtiming (silencing in alertmanager terms), only SRE-related alerts or all of them? When reimaging or doing hardware maintenance it seems we would want all of them, but in other use cases maybe we just need to downtime some of them?
  • Silences in alertmanager can only be deleted via their ID, which can either be saved when creating the downtime (the call can return the ID; in the case of the context manager it will just be kept locally, in the case of a downtime call it will be returned to the caller) or found by filtering the silences, but the latter poses the risk of deleting someone else's silence that matches the criteria. So we'll need to decide what to do and whether we want to add some particular unique identifier to the silences (see the API sketch after this list).
  • As the /alerts API returns just the alerting items and not the ones defined, the concept of being in an "optimal" state in Alertmanager is not having any alert that matches the host (or other criteria). How can we ensure there are alerts configured for a given host that are being checked?
    • In the case of a reimage for example, we wait for Icinga to be in optimal state after the final reboot to ensure that Icinga has caught up and all services are up and running. How do we ensure the same in the prometheus+alertmanager world, where the data is pulled and the alerts show up only if critical? Would a direct query to prometheus for a generic metric (say uptime) to check that there is recent data be enough to reasonably consider the host checked by alertmanager?
    • How long does it take Alertmanager to alert for a new host that just started sending metrics?
  • To also support the downtime of specific services (in Icinga terms), to match the current capabilities of the Icinga module, it would be beneficial to have the service names be just that and not include the source (they are currently e.g. Icinga/Some alert name).
  • The current authentication to the alertmanager API is host-based, which means that we can't rely on the authentication to support Spicerack's dry-run mode; the module has to do all it can to prevent read-write calls when in dry-run mode because it can't switch to read-only credentials once the host is authorized for read-write operations. Will that change in the near future? Will some form of authentication and TLS be added to the alertmanager APIs?
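To make the silence-ID constraint concrete, below is a minimal sketch against the Alertmanager v2 API (host, port, label name and matcher values are assumptions, not the final design): the POST returns the silence ID, which must be stored somewhere to be able to delete the silence later.

import requests
from datetime import datetime, timedelta, timezone

AM_API = "http://alertmanager.example.org:9093/api/v2"  # hypothetical endpoint

def create_silence(instance_regex: str, author: str, comment: str, hours: int = 4) -> str:
    """Create a silence matching an instance regex and return its ID."""
    now = datetime.now(timezone.utc)
    payload = {
        "matchers": [{"name": "instance", "value": instance_regex, "isRegex": True}],
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(hours=hours)).isoformat(),
        "createdBy": author,
        "comment": comment,
    }
    response = requests.post(f"{AM_API}/silences", json=payload)
    response.raise_for_status()
    return response.json()["silenceID"]  # the only handle to delete the silence later

def delete_silence(silence_id: str) -> None:
    """Delete a silence; the API only allows deletion by ID."""
    requests.delete(f"{AM_API}/silence/{silence_id}").raise_for_status()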

Have a consistent way to match a host in alertmanager, ensuring that a given label is consistently either the hostname or the FQDN (but always the same) and can be reliably used to filter alerts for a given host.

In relation to this, and labels in general, I wonder if we should try to define some parity with ECS.

Silences in alertmanager can only be deleted via their ID

Something else we discussed around this is whether we should create one silence with a dynamically generated regex, or a one-to-one mapping of hosts/services to silences. I.e. if working on cp100[1-5], each of which has e.g. 5 separate services, should I create:

  • 1 silence matching hostname: ^cp100[1-5]\.
  • 5 silences, each one matching e.g. hostname:cp1001
  • 25 silences, each matching e.g. hostname:cp1001 && service:ssh
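For reference, the three options roughly translate into the following matcher sets (a hypothetical sketch; the hostname and service label names are assumptions):

# Option 1: a single silence with one regex matcher covering all five hosts.
single_silence = [{"name": "hostname", "value": r"^cp100[1-5]\.", "isRegex": True}]

# Option 2: one silence per host, i.e. 5 silences and 5 IDs to keep track of.
per_host_silences = [
    [{"name": "hostname", "value": f"cp100{i}", "isRegex": False}] for i in range(1, 6)
]

# Option 3: one silence per host/service pair, i.e. 25 silences and 25 IDs.
services = ["ssh", "https", "ntp", "puppet", "smart"]  # hypothetical service names
per_service_silences = [
    [
        {"name": "hostname", "value": f"cp100{i}", "isRegex": False},
        {"name": "service", "value": service, "isRegex": False},
    ]
    for i in range(1, 6)
    for service in services
]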

As the /alerts API returns just the alerting items and not the ones defined, the concept of being in an "optimal" state in Alertmanager is not having any alert that matches the host (or other criteria). How can we ensure there are alerts configured for a given host that are being checked?

I don't think we should use alertmanager to make sure checks are configured and actually firing; for that we should directly query whatever tool we are using for active measurements. Specifically, as the most basic check we should query the tool that is performing the host ping (or equivalent) check to ensure that a host is up. So from this PoV I think the current IcingaHosts.wait_for_optimal would still be the best way to check that a host is "optimal".

In addition to this, having an equivalent wait_for_optimal method for the prometheus interface would be good; however, this will likely be harder to define and we may have to just pick some prometheus metric we know should exist on every host, e.g. something exported from node_exporter. When we see these metrics arrive in a healthy state, we can consider the host healthy from a prometheus perspective. Finally, we could check that nothing has alerted in alertmanager, i.e. the monitoring interface would be something like:

class Monitoring:
  def wait_for_optimal(self):
     self.icinga_hosts.wait_for_optimal()
     self.prometheus_hosts.wait_for_optimal()
     # plus something based on checking that
     # self.alertmanager_hosts.active_alerts == 0

With this we can attest that:

  • a host is up
  • systemd is reporting all services are started
  • prometheus is able to poll the machine for some metrics.

As an interface provided by spicerack I think this is enough, and it should be up to cookbook authors to perform any additional service-specific checks to ensure a host/service is actually up and optimal.

In the case of a reimage for example, we wait for Icinga to be in optimal state after […]
How to ensure the same in the prometheus+alertmanager […]

This feels like jumping the gun or perhaps asking the wrong question. Currently, at a very basic level, something will need to be configured to perform at least checks like:

  • does the host respond to ping
  • are all services started and healthy
  • is SSH reachable

Currently these tests are configured in Icinga, so we should directly ask Icinga and make sure these checks are configured and returning data before moving on to the next stage of checking whether a host is configured, up and optimal. In a future world some tool other than Icinga may perform this basic set of tests; at that point we should have a requirement to ensure there is an easy way for spicerack to verify that a host is configured and that a similar set of checks is being performed correctly.

I think the big thing that is missing between Icinga and some future with alertmanager is that with Icinga we know ahead of time all the checks being performed; with alertmanager things can get a bit more complicated. Checks are configured in multiple different places and potentially on multiple different systems, so it's harder for us to say "these are all the checks that are scheduled for this host, make sure they are all good before proceeding". However, as things get more complicated this just gets more difficult to generalise, and I think it's reasonable to instead define what we actually guarantee, e.g.:

def wait_for_optimal():
  """Wait for a host to be optimal, where optimal is defined as:
   * responding to ping
   * no faults in systemd
   * something from prometheus
   * some other rather generic test which would be easy to generalise, i.e. something listening on this list of ports

   Further service-specific checks for "optimal" state should be performed by the caller."""

This could even be a simple prometheus exporter which we implement ourselves; all it does is return a metric optimal=True|False, and it is up to the service owner to write some script to determine this state, which could be as simple as the following or something very complex:

#!/bin/bash
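# Hypothetical example: report the host as "optimal" only when something is listening on port 22.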
netcat -z localhost 22 && printf optimal=True || printf optimal=False

To also support the downtime of specific services (in Icinga terms), to match the current capabilities of the Icinga module, it would be beneficial to have the service names be just that and not include the source (they are currently e.g. Icinga/Some alert name)

To add to this, I think the source is useful but it may be better stored as a label, i.e. currently title: "Icinga/Some alert name" would become title: "Some alert name", labels: { source: icinga }

I had a chat with @jbond about this yesterday, putting the summary here for future reference for those who will work on this.

In Spicerack we currently expose an icinga_hosts accessor that allows interacting with Icinga for a given set of hosts.
As we will need to support both systems for the foreseeable future, we probably want to expose a single interface to cookbook users for the most common operations (downtime, check if all is "green") while still allowing access to the specific lower-level interfaces to Icinga and Alertmanager separately if needed.

To put this into practical terms, something like:

class Spicerack:
    def monitoring_hosts():
    def icinga_hosts():
    def alertmanager_hosts():

So that when using the monitoring one, actions will be performed in both systems automatically.
The existing HostsStatus and HostStatus classes currently defined in icinga.py should probably be generalized a bit to make them work with both systems and be moved to the monitoring module (or elsewhere).

There are some open questions that should probably be discussed a bit more in depth with SRE Observability:

  • Have a consistent way to match a host in alertmanager, ensuring that a given label is consistently either the hostname or the FQDN (but always the same) and can be reliably used to filter alerts for a given host.

Safe to assume the "hostname" (I used quotes because I think it makes sense to keep hostname.mgmt too) will be in instance and not the FQDN (librenms is the exception here but we'll fix that). Regarding whether the port is/will be in instance, I expanded T293198 to include solutions and their tradeoffs; please let me know what you think! Something else to keep in mind is the use of =~, which might be helpful/useful.

  • For alerts that are not attached to physical hosts, the label should match the hosts that we have in Icinga ($hostname.mgmt, example.wikimedia.org, 192.168.1.1, etc...)

Yes, I'm expecting this scheme (or something very close) to be maintained.

  • Will this sort of mapping that has a concept of hosts and services connected to it be maintained going forward?

There's nothing like that built into AM/Prometheus the way it is in Icinga; in that sense alerts are not attached to any host and there's no host/service distinction. Though we can inhibit certain alerts when other alerts are firing, for example. What's the use case you have in mind?

  • What label to select when downtiming (silencing in alertmanager terms), only SRE-related alerts or all of them? When reimaging or doing hardware maintenance it seems we would want all of them, but in other use cases maybe we just need to downtime some of them?

Agreed, for hw maintenance / reimage all alerts for the affected host(s) should be silenced; for other use cases I'd say it's up to the author/team.

  • Silences in alertmanager can only be deleted via their ID, which can either be saved when creating the downtime (the call can return the ID; in the case of the context manager it will just be kept locally, in the case of a downtime call it will be returned to the caller) or found by filtering the silences, but the latter poses the risk of deleting someone else's silence that matches the criteria. So we'll need to decide what to do and whether we want to add some particular unique identifier to the silences.
  • As the /alerts API returns just the alerting items and not the ones defined, the concept of being in an "optimal" state in Alertmanager is not having any alert that matches the host (or other criteria). How can we ensure there are alerts configured for a given host that are being checked?
    • In the case of a reimage for example, we wait for Icinga to be in optimal state after the final reboot to ensure that Icinga has caught up and all services are up and running. How do we ensure the same in the prometheus+alertmanager world, where the data is pulled and the alerts show up only if critical? Would a direct query to prometheus for a generic metric (say uptime) to check that there is recent data be enough to reasonably consider the host checked by alertmanager?

(essentially echoing what John pointed out) checking the up metric for job=node and instance=host:9100 will signal if/when prometheus is pulling node-exporter metrics; we could also check node-specific metrics such as node_boot_time_seconds to monitor uptime.
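As a rough illustration (a sketch only; the Prometheus URL and the node-exporter port are assumptions), that check could be a single instant query against the Prometheus HTTP API:

import requests

PROMETHEUS_API = "http://prometheus.example.org/api/v1/query"  # hypothetical endpoint

def node_exporter_is_scraped(hostname: str) -> bool:
    """Return True if Prometheus is currently pulling node-exporter metrics for the host."""
    query = f'up{{job="node", instance="{hostname}:9100"}}'
    result = requests.get(PROMETHEUS_API, params={"query": query}).json()
    samples = result["data"]["result"]
    # No samples means the target is unknown; a value of "0" means the scrape is failing.
    return bool(samples) and samples[0]["value"][1] == "1"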

  • How long does it take Alertmanager to alert for a new host that just started sending metrics?

It largely depends on the alert itself, i.e. its for clause; this article explains it better than I can: https://pracucci.com/prometheus-understanding-the-delays-on-alerting.html

  • To also support the downtime of specific services (in Icinga terms), to match the current capabilities of the Icinga module, it would be beneficial to have the service names be just that and not include the source (they are currently e.g. Icinga/Some alert name)

Could you expand on why? If we're downtiming icinga service foo then on alertmanager we're downtiming Icinga/foo (for the same host)

  • The current authentication to the alertmanager API is host-based, which means that we can't rely on the authentication to support Spicerack's dry-run mode; the module has to do all it can to prevent read-write calls when in dry-run mode because it can't switch to read-only credentials once the host is authorized for read-write operations. Will that change in the near future? Will some form of authentication and TLS be added to the alertmanager APIs?

There are no plans for that at the moment

Have a consistent way to match a host in alertmanager, ensuring that a given label is consistently either the hostname or the FQDN (but always the same) and can be reliably used to filter alerts for a given host.

In relation to this, and labels in general, I wonder if we should try to define some parity with ECS.

That's an interesting thought. (Thinking out loud) the label names will require adjusting from their ECS names, since label names in prometheus must match the regex [a-zA-Z_][a-zA-Z0-9_]*
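As an illustration of what that adjustment could look like (a hypothetical sketch, not an agreed-upon mapping), ECS field names such as host.name would need their dots replaced to satisfy that regex:

import re

def ecs_to_prometheus_label(ecs_field: str) -> str:
    """Map an ECS field name (e.g. "host.name") to a valid Prometheus label name."""
    label = re.sub(r"[^a-zA-Z0-9_]", "_", ecs_field)
    # Label names cannot start with a digit.
    return f"_{label}" if label[0].isdigit() else label

assert ecs_to_prometheus_label("host.name") == "host_name"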

Silences in alertmanager can only be deleted via their ID

Something else we discussed around this is whether we should create one silence with a dynamically generated regex, or a one-to-one mapping of hosts/services to silences. I.e. if working on cp100[1-5], each of which has e.g. 5 separate services, should I create:

  • 1 silence matching hostname: ^cp100[1-5]\.
  • 5 silences, each one matching e.g. hostname:cp1001
  • 25 silences, each matching e.g. hostname:cp1001 && service:ssh

Off the top of my head the single silence is probably easier to reason about(?); possibly host-based (i.e. one silence per host) might be ok too.

As the /alerts API returns just the alerting items and not the ones defined, the concept of being in an "optimal" state in Alertmanager is not having any alert that matches the host (or other criteria). How can we ensure there are alerts configured for a given host that are being checked?

I don't think we should use alertmanager to make sure checks are configured and actually firing; for that we should directly query whatever tool we are using for active measurements. Specifically, as the most basic check we should query the tool that is performing the host ping (or equivalent) check to ensure that a host is up. So from this PoV I think the current IcingaHosts.wait_for_optimal would still be the best way to check that a host is "optimal".

In addition to this, having an equivalent wait_for_optimal method for the prometheus interface would be good; however, this will likely be harder to define and we may have to just pick some prometheus metric we know should exist on every host, e.g. something exported from node_exporter. When we see these metrics arrive in a healthy state, we can consider the host healthy from a prometheus perspective. Finally, we could check that nothing has alerted in alertmanager, i.e. the monitoring interface would be something like

(agreed, see my answer above)

Something else that occurred to me is the ALERTS meta-metric: Prometheus exports that metric for all alerts that are "pending" (i.e. might fire once they are past their for clause) and "firing". In other words, checking Prometheus for ALERTS{instance=...} a little while after the host is back up will give you a good indication of its state. Another solution is of course to check alertmanager's API for alerts that are currently firing (and have notified).
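To illustrate (a sketch with a hypothetical Prometheus endpoint; the instance regex is an assumption), the ALERTS check could look like:

import requests

PROMETHEUS_API = "http://prometheus.example.org/api/v1/query"  # hypothetical endpoint

def pending_or_firing_alerts(hostname: str) -> list:
    """Return the names of the alerts that are pending or firing for the given host."""
    query = f'ALERTS{{instance=~"{hostname}(:[0-9]+)?", alertstate=~"pending|firing"}}'
    result = requests.get(PROMETHEUS_API, params={"query": query}).json()
    return [sample["metric"]["alertname"] for sample in result["data"]["result"]]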

In the case of a reimage for example, we wait for Icinga to be in optimal state after […]
How to ensure the same in the prometheus+alertmanager […]

This feels like jumping the gun or perhaps asking the wrong question. Currently, at a very basic level, something will need to be configured to perform at least checks like:

  • does the host respond to ping
  • are all services started and healthy
  • is SSH reachable

Currently these tests are configured in Icinga, so we should directly ask Icinga and make sure these checks are configured and returning data before moving on to the next stage of checking whether a host is configured, up and optimal. In a future world some tool other than Icinga may perform this basic set of tests; at that point we should have a requirement to ensure there is an easy way for spicerack to verify that a host is configured and that a similar set of checks is being performed correctly.

I think the big thing that is missing between Icinga and some future with alertmanager is that with Icinga we know ahead of time all the checks being performed; with alertmanager things can get a bit more complicated. Checks are configured in multiple different places and potentially on multiple different systems, so it's harder for us to say "these are all the checks that are scheduled for this host, make sure they are all good before proceeding". However, as things get more complicated this just gets more difficult to generalise, and I think it's reasonable to instead define what we actually guarantee, e.g.

Indeed, I think for basic host alerts waiting a few minutes for alerts to fire should be good enough; for higher level alerts I think passing the alert/tags to check and for how long should do it too

To also support the downtime of specific services (in Icinga terms), to match the current capabilities of the Icinga module, it would be beneficial to have the service names be just that and not include the source (they are currently e.g. Icinga/Some alert name)

To add to this, I think the source is useful but it may be better stored as a label, i.e. currently title: "Icinga/Some alert name" would become title: "Some alert name", labels: { source: icinga }

I'm not opposed to the change but would like to understand the use case better.

  • To also support the downtime of specific services (in Icinga terms), to match the current capabilities of the Icinga module, it would be beneficial to have the service names be just that and not include the source (they are currently e.g. Icinga/Some alert name)

Could you expand on why? If we're downtiming icinga service foo then on alertmanager we're downtiming Icinga/foo (for the same host)

I expect users to write cookbooks or to call the downtime cookbook for service foo, and they don't necessarily need to know whether the alert is actually triggered by Icinga or AM (it's also possible that at some point we'll start alerting from AM for the alerts currently in Icinga). So it would be easier if we could just have the same title no matter where the alert actually lives.

John's proposal seems very reasonable: to have a source label on the alerts that tells us which monitoring system they are coming from (icinga, librenms, alertmanager/prometheus itself, etc.).

  • To also support the downtime of specific services (in Icinga terms), to match the current capabilities of the Icinga module, it would be beneficial to have the service names be just that and not include the source (they are currently e.g. Icinga/Some alert name)

Could you expand on why? If we're downtiming icinga service foo then on alertmanager we're downtiming Icinga/foo (for the same host)

I expect users to write cookbooks or to call the downtime cookbook for service foo, and they don't necessarily need to know whether the alert is actually triggered by Icinga or AM (it's also possible that at some point we'll start alerting from AM for the alerts currently in Icinga). So it would be easier if we could just have the same title no matter where the alert actually lives.

Ok, that's fair, we can add a source label to alerts from icinga/prometheus (via icinga-exporter and alert relabeling rules respectively).

Today @jbond and I joined the office hours of SRE Observability and discussed a bit the plan for the above.

We agreed to split this into 2 phases: a first one where only the support for silences in alertmanager will be provided, and a second one where the capability to check if a host is in an optimal state will also be added.

For Phase 1 I think we need an interface more or less like the following (existing items are marked with # already exists):

# in spicerack/__init__.py
# in Spicerack class
def icinga_hosts(self, target_hosts: TypeHosts, *, verbatim_hosts: bool = False) -> IcingaHosts:  # already exists
def alertmanager_hosts(self, target_hosts: TypeHosts, *, verbatim_hosts: bool = False) -> AlertmanagerHosts:
def alerting_hosts(self, target_hosts: TypeHosts, *, verbatim_hosts: bool = False) -> AlertingHosts:

# in spicerack/alertmanager.py
class AlertmanagerHosts:
  def __init__(self, target_hosts: TypeHosts, *, verbatim_hosts: bool = False, dry_run: bool = True) -> None:
  @contextmanager
  def downtimed(self, reason: Reason, *, duration: timedelta = timedelta(hours=4), remove_on_error: bool = False) -> Iterator[None]:  # implementation can be copied from spicerack.icinga.IcingaHosts.downtimed(), the intersection is so low there is no point in doing a base class at this point 
  def downtime(self, reason: Reason, *, duration: timedelta = timedelta(hours=4)) -> str:  # returns the downtime ID or IDs, depending on the implementation
  def remove_downtime(self, downtime_id: str) -> None:  # or remove_downtimes(self, ids: Iterable[str]) if we want to allow to remove more at once

# in spicerack/alerting.py
class AlertingHosts:
  # for each method should call both IcingaHosts and AlertmanagerHosts methods
  def __init__(self, icinga_hosts: IcingaHosts, alertmanager_hosts: AlertmanagerHosts) -> None:
  @contextmanager
  def downtimed(self, reason: Reason, *, duration: timedelta = timedelta(hours=4), remove_on_error: bool = False) -> Iterator[None]:
  def downtime(self, reason: Reason, *, duration: timedelta = timedelta(hours=4)) -> str:  # returns the downtime ID or IDs, depending on the implementation
  def remove_downtime(self, downtime_id: str) -> None:  # or remove_downtimes(self, ids: Iterable[str]) if we want to allow to remove more at once
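As a usage sketch (hypothetical cookbook code, assuming the interface lands as proposed above), a cookbook would then downtime a set of hosts in both systems with something like:

from datetime import timedelta

# `spicerack` and `reason` are the usual objects available to a cookbook run.
alerting = spicerack.alerting_hosts(["cp1001.eqiad.wmnet", "cp1002.eqiad.wmnet"])

# Context-manager form: the downtime is removed automatically on exit.
with alerting.downtimed(reason, duration=timedelta(hours=2)):
    do_maintenance()  # hypothetical maintenance step

# Explicit form: keep the returned ID(s) to remove the downtime later.
downtime_id = alerting.downtime(reason, duration=timedelta(hours=2))
alerting.remove_downtime(downtime_id)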

TBD if we should include the support for downtiming services in this first iteration, basically services_downtimed(), downtime_services() and remove_service_downtimes().

As for the removal of the downtime, we might need to implement both by ID and by filtering; I can see both being useful: by ID in the context manager, to be sure we're removing only the one(s) it has created, and by filtering for those who want to run the remove-downtime cookbook.

I had a chat with @jbond about this yesterday, putting the summary here for future reference for those who will work on this.

In Spicerack we currently expose an icinga_hosts accessor that allows interacting with Icinga for a given set of hosts.
As we will need to support both systems for the foreseeable future, we probably want to expose a single interface to cookbook users for the most common operations (downtime, check if all is "green") while still allowing access to the specific lower-level interfaces to Icinga and Alertmanager separately if needed.

To put this into practical terms, something like:

class Spicerack:
    def monitoring_hosts():
    def icinga_hosts():
    def alertmanager_hosts():

So that when using the monitoring one, actions will be performed in both systems automatically.
The existing HostsStatus and HostStatus classes currently defined in icinga.py should probably be generalized a bit to make them work with both systems and be moved to the monitoring module (or elsewhere).

There are some open questions that should probably be discussed a bit more in depth with SRE Observability:

  • Have a consistent way to match a host in alertmanager, ensuring that a given label is consistently either the hostname or the FQDN (but always the same) and can be reliably used to filter alerts for a given host.

Safe to assume the "hostname" (I used quotes because I think it makes sense to keep hostname.mgmt too) will be in instance and not the FQDN (librenms is the exception here but we'll fix that). Regarding whether the port is/will be in instance, I expanded T293198 to include solutions and their tradeoffs; please let me know what you think! Something else to keep in mind is the use of =~, which might be helpful/useful.

I stand corrected: I realized instance for k8s has FQDNs, not hostnames, something we ought to change IMHO.

Today @jbond and I joined the office hours of SRE Observability and discussed a bit the plan for the above.

We agreed to split this into 2 phases: a first one where only the support for silences in alertmanager will be provided, and a second one where the capability to check if a host is in an optimal state will also be added.

For Phase 1 I think we need an interface more or less like the following (existing items are marked with # already exists)

Thank you for the guidance, I've published the basic scaffolding for the alertmanager bits at https://gerrit.wikimedia.org/r/c/operations/software/spicerack/+/765480. I'll iterate on that, e.g. adding AlertingHosts and so on, let me know what you think!

Change 765480 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/software/spicerack@master] Introduce 'alertmanager' and 'alerting' modules

https://gerrit.wikimedia.org/r/765480

Change 765480 merged by Filippo Giunchedi:

[operations/software/spicerack@master] Introduce 'alertmanager' and 'alerting' modules

https://gerrit.wikimedia.org/r/765480

The code for silencing itself is merged now, I'd imagine there are other followup steps to get the new code shipped (?)

The code for silencing itself is merged now, I'd imagine there are other followup steps to get the new code shipped (?)

Yep, I'll take care of releasing and deploying a new spicerack release and then start modifying the cookbooks to use the new interface.

Change 769063 had a related patch set uploaded (by Volans; author: Volans):

[operations/software/spicerack@master] alertmanager: catch already deleted silence

https://gerrit.wikimedia.org/r/769063

Change 769067 had a related patch set uploaded (by Volans; author: Volans):

[operations/cookbooks@master] sre.hosts.downtime: conver to use the new alerting

https://gerrit.wikimedia.org/r/769067

Change 769063 merged by jenkins-bot:

[operations/software/spicerack@master] alertmanager: catch already deleted silence

https://gerrit.wikimedia.org/r/769063

Change 769067 merged by jenkins-bot:

[operations/cookbooks@master] sre.hosts.downtime: conver to use the new alerting

https://gerrit.wikimedia.org/r/769067

Change 779485 had a related patch set uploaded (by Volans; author: Volans):

[operations/software/spicerack@master] alertmanager: fix and improve donwtime

https://gerrit.wikimedia.org/r/779485

Change 779485 merged by jenkins-bot:

[operations/software/spicerack@master] alertmanager: fix and improve donwtime

https://gerrit.wikimedia.org/r/779485

I think we're in good shape wrt spicerack and alertmanager support, is there anything else left to do for this task?

Yes, the whole phase 2 mentioned in T293209#7698301 is still a TODO:

  • Although we do allow specifying additional matchers while downtiming, that functionality doesn't have a shared API exposed in the AlertingHosts class. We should find a way to expose it in a coherent manner so that all services matching some value/regex could be downtimed via the AlertingHosts class independently of whether they are in Icinga, Alertmanager or both (as when a check is migrated between the two systems).
  • Have a way to get the host status, basically generalizing what's in https://doc.wikimedia.org/spicerack/master/api/spicerack.icinga.html#spicerack.icinga.HostsStatus and https://doc.wikimedia.org/spicerack/master/api/spicerack.icinga.html#spicerack.icinga.HostStatus, ideally moving them to alerting, if at all possible (some parts might be too Icinga-specific).
  • Have a way to check if the host is in an optimal status, so that we can have in AlertingHosts the generalization of https://doc.wikimedia.org/spicerack/master/api/spicerack.icinga.html#spicerack.icinga.IcingaHosts.wait_for_optimal that will do that on both systems. I know that this is a bit more tricky right now due to the nature of Alertmanager, but at the very least we should check that some basic metric that exists on all hosts is present in prometheus with fresh data. This will still not prevent us from considering optimal a host in which one of the prometheus exporters is broken or an alert is checking the wrong metric. If we're going to introduce Cloudflare's Pint then it will greatly simplify this check, which will just become a check that no alerts are firing for the given hosts, with Pint ensuring the consistency of the checks and the presence and freshness of the data in prometheus.

Thank you @Volans, the items all make sense to me.

Yes, the whole phase 2 mentioned in T293209#7698301 is still a TODO:

  • Although we do allow specifying additional matchers while downtiming, that functionality doesn't have a shared API exposed in the AlertingHosts class. We should find a way to expose it in a coherent manner so that all services matching some value/regex could be downtimed via the AlertingHosts class independently of whether they are in Icinga, Alertmanager or both (as when a check is migrated between the two systems).

I don't know offhand how to best achieve this, off the top of my head I'm imagining whichever API we add to AlertingHosts would be a noop on Icinga and act on AM only (?)

Agreed, Pint will help for sure here too. Checking for host "up" could I think be implemented as checking whether we have (fresh) node-exporter metrics for the host; that could be done today. To more closely match what Icinga does we could start pinging the hosts on their production IPs and check those metrics. IMHO the former option is preferable since it is higher level and node-exporter runs across the fleet; it also comes in handy for the next point re: optimal status. To implement "optimal status" we could check (implementable "today") whether the ALERTS{instance=...} prometheus metric is set to 1, meaning an alert for the host is either alertstate=pending (i.e. might fire) or alertstate=firing (meaning it will reach/has reached alertmanager). What do you think?
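Putting the two ideas together, a Phase 2 wait_for_optimal on the Prometheus/Alertmanager side could look roughly like the following (a sketch only; node_exporter_is_scraped() and pending_or_firing_alerts() are the hypothetical helpers sketched in earlier comments):

import time

def wait_for_optimal(hostname: str, timeout: int = 900, interval: int = 30) -> None:
    """Poll Prometheus until node-exporter data is fresh and no alert is pending or firing."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if node_exporter_is_scraped(hostname) and not pending_or_firing_alerts(hostname):
            return
        time.sleep(interval)
    raise RuntimeError(f"{hostname} did not reach an optimal state within {timeout}s")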

I don't know offhand how to best achieve this, off the top of my head I'm imagining whichever API we add to AlertingHosts would be a noop on Icinga and act on AM only (?)

No, the Icinga module already has support for this, see:

To implement "optimal status" we could check (implementable "today") whether the ALERTS{instance=...} prometheus metric is set to 1, meaning an alert for the host is either alertstate=pending (i.e. might fire) or alertstate=firing (meaning it will reach/has reached alertmanager). What do you think?

It makes sense to me. It would be nice if we could also have the list of failed services (either alerting or about to alert shortly) like what is now reported by Icinga ( https://doc.wikimedia.org/spicerack/master/api/spicerack.icinga.html#spicerack.icinga.HostStatus.failed_services and https://doc.wikimedia.org/spicerack/master/api/spicerack.icinga.html#spicerack.icinga.HostsStatus.failed_services ).
As for the pending state I see pros and cons. IIRC in icinga right now we consider a host optimal even if something is in SOFT alerting state, but that is surely up for discussion and it would probably make sense to have this consistent across the two systems, although their meanings are slightly different due to the more atomic nature of icinga checks vs the more frequent datapoints in prometheus.

I don't know offhand how to best achieve this, off the top of my head I'm imagining whichever API we add to AlertingHosts would be a noop on Icinga and act on AM only (?)

No, the Icinga module already has support for this, see:

*nod* thanks!

To implement "optimal status" we could check (implementable "today") whether the ALERTS{instance=...} prometheus metric is set to 1, meaning an alert for the host is either alertstate=pending (i.e. might fire) or alertstate=firing (meaning it will reach/has reached alertmanager). What do you think?

It makes sense to me. It would be nice if we could also have the list of failed services (either alerting or about to alert shortly) like what is now reported by Icinga ( https://doc.wikimedia.org/spicerack/master/api/spicerack.icinga.html#spicerack.icinga.HostStatus.failed_services and https://doc.wikimedia.org/spicerack/master/api/spicerack.icinga.html#spicerack.icinga.HostsStatus.failed_services ).

There's no concept of "service" but we could certainly list the alerts related to the host that are firing/about to fire

As for the pending state I see pros and cons. IIRC in icinga right now we consider a host optimal even if something is in SOFT alerting state, but that is surely up for discussion and it would probably make sense to have this consistent across the two systems, although their meanings are slightly different due to the more atomic nature of icinga checks vs the more frequent datapoints in prometheus.

Agreed, definitely up for discussion. I'm leaning towards not listing pending alerts but could be convinced otherwise (and it is easy to change anyways)