
Alerts "instance" label and port number
Open, Medium, Public

Description

Intro

The instance label is automatically added by Prometheus and is (typically) in the form hostname:port of the target the metrics have been fetched from. For example, node-exporter listens on port 9100, thus all of its metrics have instance=HOST:9100. Icinga compatibility alerts (prefixed with Icinga/) don't have the port (though we could change that easily). I have asked Karma upstream about this situation in https://github.com/prymitive/karma/issues/3938
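To illustrate the default behaviour described above, here is a minimal sketch of a node-exporter scrape job (hostnames are hypothetical): with no relabeling, Prometheus copies the target address, port included, into the instance label.

```yaml
# Minimal sketch; hostnames are hypothetical. Without any relabeling,
# Prometheus sets instance to the target address verbatim, so every metric
# scraped here carries e.g. instance="hostA.example.org:9100".
scrape_configs:
  - job_name: node
    static_configs:
      - targets:
          - 'hostA.example.org:9100'
          - 'hostB.example.org:9100'
```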

Problem statement

From the alerts dashboard we'd like to allow filtering/grouping by host (e.g. to show all active alerts for a single host). In the dashboard UI, clicking a label adds that label to the current filters; therefore clicking a host:port label will show all alerts for that specific port, not the host. Showing all alerts per host would mean changing the filter from instance=HOST:PORT to instance=~^HOST:.* (for example).

Solutions

Below is a list of possible solutions and the tradeoffs involved:

1. Strip port from instance at ingestion time

In this case instance would have no port at ingestion time (i.e. Prometheus stores metrics without the port in instance). This solution is quite invasive (dashboards would likely need to be adapted), we'd have 100% new metrics since the instance label changes, and having the port in instance does have its use cases (e.g. when co-hosting multiple instances of the same software).
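A sketch of what this could look like per job (job name and regex are illustrative, not our actual configuration):

```yaml
# Sketch: strip the port at ingestion time. The port is dropped from
# __address__ before it becomes the instance label, so stored metrics
# get instance="HOST" (and all series are new, as noted above).
scrape_configs:
  - job_name: node
    relabel_configs:
      - source_labels: [__address__]
        regex: '([^:]+):\d+'
        target_label: instance
        replacement: '$1'
```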

2. Strip port from instance for outgoing alerts

We would strip the port from instance only when sending alerts to alertmanager. The solution is not invasive and allows for the easy grouping mentioned above. Downsides include the fact that the alert's labels no longer reflect the underlying expression's labels, leading to potential confusion. Another point of confusion might be when metrics with different ports (but the same host) are alerting (e.g. search has multiple ES instances on the same hw).
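Prometheus applies alert_relabel_configs to alerts just before they are sent to Alertmanager, so this variant could be sketched roughly as follows (regex illustrative):

```yaml
# Sketch: strip the port from instance only on outgoing alerts;
# the stored metrics keep instance="HOST:PORT" untouched.
alerting:
  alert_relabel_configs:
    - source_labels: [instance]
      regex: '([^:]+):\d+'
      target_label: instance
      replacement: '$1'
```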

3. Add a new label host based on instance to alerts

We would add a new label host to alerts (adding it to the metrics is possible too, but we'd incur the metrics churn described above). The solution has the advantage of a brand new label (i.e. no confusion), however the hostname would be shown twice, once in instance and once in host.
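This could likewise be done with alert relabeling, deriving host from instance on outgoing alerts only (regex illustrative; the optional port group also covers Icinga alerts that lack a port):

```yaml
# Sketch: add a host label derived from instance on outgoing alerts;
# instance itself (host:port) is left as-is, hence the duplication.
alerting:
  alert_relabel_configs:
    - source_labels: [instance]
      regex: '([^:]+)(?::\d+)?'
      target_label: host
```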

4. Keep port in instance

In this case we strive for consistency between alerts and their underlying metrics, and would add a (bogus) port to Icinga / LibreNMS alerts. While the grouping is achieved via a different filter (i.e. non-default from the dashboard UI), this is the least invasive solution and the most "consistent" one. For "quality of life" we could ask the dashboard UI (Karma) upstream whether they are willing to implement different filters on click; this way we could still have one-click filtering/grouping to select all alerts for a given host.

5. Add a new label host based on instance to metrics

We would add a new host label based on instance to each job in Prometheus. The upside is that we don't have to worry about discrepancies between alerts and metrics (e.g. as in solution 3); downsides include (similar to solution 1): changing each job in puppet, new metrics being created (i.e. losing the history of the current metrics), and in some contexts (e.g. k8s) the instance label might not be a (real) hostname.
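Per job this could be sketched as target relabeling off __address__ (job name and regex illustrative); since the label set changes, all series are new:

```yaml
# Sketch: add a host label to every series of a job at scrape time.
# Has to be repeated for each job, and creates new series.
scrape_configs:
  - job_name: node
    relabel_configs:
      - source_labels: [__address__]
        regex: '([^:]+):\d+'
        target_label: host
```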

Event Timeline

lmata triaged this task as Medium priority.Nov 16 2021, 4:46 PM

@fgiunchedi from my reading it seems like the recommended approach[1] is to relabel the instance label to only contain the hostname. Looks like some folks accomplish that with a regex relabel, https://stackoverflow.com/a/63414542, would something like that be possible?

[1]: https://www.robustperception.io/controlling-the-instance-label

> @fgiunchedi from my reading it seems like the recommended approach[1] is to relabel the instance label to only contain the hostname. Looks like some folks accomplish that with a regex relabel, https://stackoverflow.com/a/63414542, would something like that be possible?

The SO link mentions the job configuration (i.e. stripping the port before ingestion), are you referring to that solution? I'm asking because it is possible but quite invasive; for example, most/all metrics would be new metrics, and we'd most likely have to adapt/change dashboards.

For more context, the solution I've suggested here is sort of a middle ground in which we'd change the labels (or add a new one) on outgoing alerts only, while the underlying metrics remain the same. This solution has different trade-offs of course: for example, only alerts would have e.g. a host label without port (confusing, a downside IMHO), while one of the advantages is being able to select all alerts related to a host by clicking the corresponding label in the alerts dashboard (whereas today you need to click on the instance label and then change the filter to read instance=~^foobar).

I keep going back and forth in my mind on which solution has the best tradeoffs, any feedback welcome!

This is my first experience with a prometheus setup, so please take all my suggestions with a grain of salt :).

> The SO link mentions the job configuration (i.e. stripping the port before ingestion), are you referring to that solution? I'm asking because it is possible but quite invasive; for example, most/all metrics would be new metrics, and we'd most likely have to adapt/change dashboards.

Yes, I was referring to stripping the port during scraping, using relabel_configs[1]. It is a very good point that this would affect any existing queries which expect a host:port combo.

> For more context, the solution I've suggested here is sort of a middle ground in which we'd change the labels (or add a new one) on outgoing alerts only, while the underlying metrics remain the same. This solution has different trade-offs of course: for example, only alerts would have e.g. a host label without port (confusing, a downside IMHO), while one of the advantages is being able to select all alerts related to a host by clicking the corresponding label in the alerts dashboard (whereas today you need to click on the instance label and then change the filter to read instance=~^foobar).

My primary motivation for suggesting the change was the argument made in this blog post, https://www.robustperception.io/controlling-the-instance-label. Namely, that the purpose of the instance label is to uniquely identify the target, and for VMs or physical boxes this can be simplified to the hostname, which is how we typically identify the target. You could definitely add an additional host label, but I have two concerns with that approach. One, it makes the instance label largely superfluous. Second, and most important, from my reading it deviates from what prometheus recommends, which will make our installation more atypical.

> I keep going back and forth in my mind on which solution has the best tradeoffs, any feedback welcome!

Certainly a tough decision!

fgiunchedi renamed this task from Strip port from "instance" label on outgoing alertmanager alerts to Alerts "instance" label and port number.Feb 1 2022, 11:11 AM
fgiunchedi updated the task description.

> This is my first experience with a prometheus setup, so please take all my suggestions with a grain of salt :).

For sure! Thanks for your feedback.

> The SO link mentions the job configuration (i.e. stripping the port before ingestion), are you referring to that solution? I'm asking because it is possible but quite invasive; for example, most/all metrics would be new metrics, and we'd most likely have to adapt/change dashboards.

> Yes, I was referring to stripping the port during scraping, using relabel_configs[1]. It is a very good point that this would affect any existing queries which expect a host:port combo.

Agreed, I've listed this as solution #1 in the description (feel free to edit; anyone with access to the task can tweak the description).

> For more context, the solution I've suggested here is sort of a middle ground in which we'd change the labels (or add a new one) on outgoing alerts only, while the underlying metrics remain the same. This solution has different trade-offs of course: for example, only alerts would have e.g. a host label without port (confusing, a downside IMHO), while one of the advantages is being able to select all alerts related to a host by clicking the corresponding label in the alerts dashboard (whereas today you need to click on the instance label and then change the filter to read instance=~^foobar).

> My primary motivation for suggesting the change was the argument made in this blog post, https://www.robustperception.io/controlling-the-instance-label. Namely, that the purpose of the instance label is to uniquely identify the target, and for VMs or physical boxes this can be simplified to the hostname, which is how we typically identify the target. You could definitely add an additional host label, but I have two concerns with that approach. One, it makes the instance label largely superfluous. Second, and most important, from my reading it deviates from what prometheus recommends, which will make our installation more atypical.

I agree with your concerns here; even adding host to alerts only seems like a repetition of instance (this is solution #3 above).

> I keep going back and forth in my mind on which solution has the best tradeoffs, any feedback welcome!

> Certainly a tough decision!

Indeed! I've expanded on the solutions I've come up with and their tradeoffs in the description -- I hope that provides some more clarity.

@fgiunchedi for number (1), is the stripping option all or nothing? You mention it might not be wanted where a host has multiple instances of the same service. Would it be possible to choose when we change the instance label? For instance, on physical hosts we change it to the hostname, whereas for kubernetes containers we change it to the pod name. This way the instance label always refers to the unique identity we want to monitor.

> @fgiunchedi for number (1), is the stripping option all or nothing? You mention it might not be wanted where a host has multiple instances of the same service. Would it be possible to choose when we change the instance label? For instance, on physical hosts we change it to the hostname, whereas for kubernetes containers we change it to the pod name. This way the instance label always refers to the unique identity we want to monitor.

Yes, technically the relabeling configuration is per-job so we can be selective; my main concern with being selective, though, would be ending up with the port in some cases but not others :|

In my (admittedly short) thinking about this issue, I prefer option 1 because:

  • In the majority of our dashboards, we match any port number (e.g. {instance=~"hostname:.+"}).
  • Port numbers aren't human friendly.
  • Job names should be sufficient for the multi-instance exporter use case and have the added benefit of being more human-friendly.
  • For k8s pods, the instance label is even less important. I would expect more use of kubernetes_namespace and/or kubernetes_pod_name and/or pod_template_hash being the key human-friendly labels.
  • Upstream seems to recommend altering the instance label over adding other labels. @jhathaway's assessment is how I read upstream's recommendations as well:

> My primary motivation for suggesting the change was the argument made in this blog post, https://www.robustperception.io/controlling-the-instance-label. Namely, that the purpose of the instance label is to uniquely identify the target, and for VMs or physical boxes this can be simplified to the hostname, which is how we typically identify the target. You could definitely add an additional host label, but I have two concerns with that approach. One, it makes the instance label largely superfluous. Second, and most important, from my reading it deviates from what prometheus recommends, which will make our installation more atypical.

If option 1 is a non-starter for other reasons, then I would choose option 2, or 4 with quality of life improvements. Although the mismatched labels would likely be occasionally confusing, the human interfaces become much more ergonomic.

Under the assumption that the only issue with having the port in the instance label is filtering in Karma, I'd choose option 3 as it seems to be the least invasive and least confusion prone.

AIUI option 1 is not really an option because we would no longer be able to distinguish co-hosted versions of the same software and that is something we do/need. I think option 2 is out because of the same reason option 1 is, option 4 (patching Karma) does not look promising with the response from upstream and 5 needs a bunch of changes without us actually needing the host label anywhere else than in Karma (if I'm not misunderstanding).

If I'm not misunderstanding things, I agree with @JMeybohm here that options (1) and (2) are both losing information and should be discarded.

One question I have is whether we have considered an additional option, assuming it's possible to do.
Basically it would be similar to (3), but splitting instance into host and port (or monitoring_port) when sending the data to alerts. That would retain all the information and allow aggregating/filtering per host, but also by the monitoring port if needed, without the duplication of data in the karma UI.
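The split described above could be sketched with two alert relabeling rules (label names per the suggestion; regexes illustrative):

```yaml
# Sketch: on outgoing alerts, derive host and monitoring_port from
# instance="HOST:PORT" so both remain filterable without duplication.
alerting:
  alert_relabel_configs:
    - source_labels: [instance]
      regex: '([^:]+):\d+'
      target_label: host
    - source_labels: [instance]
      regex: '[^:]+:(\d+)'
      target_label: monitoring_port
```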

Thank you all for the feedback -- truly appreciated!

> In my (admittedly short) thinking about this issue, I prefer option 1 because:

>   • In the majority of our dashboards, we match any port number (e.g. {instance=~"hostname:.+"}).
>   • Port numbers aren't human friendly.
>   • Job names should be sufficient for the multi-instance exporter use case and have the added benefit of being more human-friendly.
>   • For k8s pods, the instance label is even less important. I would expect more use of kubernetes_namespace and/or kubernetes_pod_name and/or pod_template_hash being the key human-friendly labels.
>   • Upstream seems to recommend altering the instance label over adding other labels. @jhathaway's assessment is how I read upstream's recommendations as well:

> My primary motivation for suggesting the change was the argument made in this blog post, https://www.robustperception.io/controlling-the-instance-label. Namely, that the purpose of the instance label is to uniquely identify the target, and for VMs or physical boxes this can be simplified to the hostname, which is how we typically identify the target. You could definitely add an additional host label, but I have two concerns with that approach. One, it makes the instance label largely superfluous. Second, and most important, from my reading it deviates from what prometheus recommends, which will make our installation more atypical.

> If option 1 is a non-starter for other reasons, then I would choose option 2, or 4 with quality of life improvements. Although the mismatched labels would likely be occasionally confusing, the human interfaces become much more ergonomic.

IMHO option 1 is indeed a non-starter, mostly because of the engineering time to be spent fixing dashboards (and, in general, the assumption that instance has a port). Ack on options 2 and 4 though; I'm definitely warming up to the idea of always having the port in instance as a convention.

> Under the assumption that the only issue with having the port in the instance label is filtering in Karma, I'd choose option 3 as it seems to be the least invasive and least confusion prone.

> AIUI option 1 is not really an option because we would no longer be able to distinguish co-hosted versions of the same software and that is something we do/need. I think option 2 is out because of the same reason option 1 is, option 4 (patching Karma) does not look promising with the response from upstream and 5 needs a bunch of changes without us actually needing the host label anywhere else than in Karma (if I'm not misunderstanding).

All correct, except for the last bit: a host label would come in handy for silencing/filtering alerts in general (e.g. this question came up when discussing the spicerack/alertmanager integration in T293209).

> If I'm not misunderstanding things, I agree with @JMeybohm here that options (1) and (2) are both losing information and should be discarded.

> One question I have is whether we have considered an additional option, assuming it's possible to do.
> Basically it would be similar to (3), but splitting instance into host and port (or monitoring_port) when sending the data to alerts. That would retain all the information and allow aggregating/filtering per host, but also by the monitoring port if needed, without the duplication of data in the karma UI.

Yes, splitting host/port would be possible to do, and we could hide the instance label by default in karma.
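Assuming Karma's labels config behaves as its documentation describes, hiding instance by default might look something like this (a sketch, not a tested configuration):

```yaml
# Sketch of Karma's config file: strip the instance label from the UI
# so only the derived host/port labels are shown on alerts.
labels:
  strip:
    - instance
```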


Overall it seems to me option 3 (and the "instance always has a port" part of option 4) is the least invasive and easiest to try (not necessarily the simplest though), possibly including both host and port as @Volans suggested.

+1 from me, for what it's worth, to what Filippo said above.

> AIUI option 1 is not really an option because we would no longer be able to distinguish co-hosted versions of the same software and that is something we do/need. I think option 2 is out because of the same reason option 1 is, option 4 (patching Karma) does not look promising with the response from upstream and 5 needs a bunch of changes without us actually needing the host label anywhere else than in Karma (if I'm not misunderstanding).

@JMeybohm would it be possible to provide an example of co-hosted software where you would want to retain the <host:port> instance label?

> @JMeybohm would it be possible to provide an example of co-hosted software where you would want to retain the <host:port> instance label?

I don't have anything in mind really. My conclusion derived from the description of this task.

> @JMeybohm would it be possible to provide an example of co-hosted software where you would want to retain the <host:port> instance label?

> I don't have anything in mind really. My conclusion derived from the description of this task.

Elasticsearch comes to mind in this case; e.g. https://thanos.wikimedia.org/graph?g0.expr=elasticsearch_breakers_tripped%7Binstance%3D~%22%5Eelastic1051.*%22%7D&g0.tab=1&g0.stacked=0&g0.range_input=1h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D

There are other labels that make the metrics unique of course (and ironically the metrics above do include a host label, I'm assuming from the elasticsearch exporter itself).

Update on this: I briefly tried adding a host label without the port in the o11y pontoon stack, but didn't much like the result offhand (not saying we shouldn't do it though!). I've kicked the can down the road a bit for now (i.e. punting the problem), though the feedback/ideas are still valid.