
Migrate mysql icinga alerts to alert manager
Open, Medium, Public

Description

Modernizing our alerting infra.

Event Timeline

Change 825294 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/alerts@master] data-persistence: Add alert for replication lag

https://gerrit.wikimedia.org/r/825294

As mentioned on IRC, and per @Ladsgroup, I raised a few issues with the *current* replication Prometheus metrics. Migrating to Prometheus-based alerting is very possible and desirable IMHO; I only raised issues with the previous patch as it is at the moment (or, more concretely, with the metrics currently obtained from prometheus-mysqld-exporter):

  • Whether the alert is present, and its severity (warning, critical, paging), is currently controlled by a relatively complex set of options coming from Hiera, the host's role/profile, and the instance status (being a master or a replica, being a standalone host, etc.). I believe the new alert will be applied to all hosts, but the config should depend, as it does now, on the Puppet config. For example, we don't care about lag on the s1 eqiad master, because it shouldn't replicate (at the moment). A production host, such as an s1 MediaWiki replica, could page (or not), but it would certainly be a critical metric; whereas for something like cloud, a passive misc host, or a backup source, those hosts lag all the time (within reason) and we don't care much. The current status on Icinga is far from perfect, but at least it catches some of these different behaviours; an alert applied uniformly to all hosts will create a lot of spam.
  • Most important issue: the Prometheus lag metric only works while replication is running, as it is based on the Seconds_Behind_Master value from SHOW SLAVE STATUS, rather than the accurate pt-heartbeat method used by Icinga and MediaWiki. If the MariaDB server crashes, or it is restarted and its replication is not up to date, or replication is not correctly set up, or it is just stopped for any reason and we forget to re-enable it, the graphs will show <no data>. This is a known defect documented on T141968. The reason this hasn't yet been implemented is that it may require reconfiguration of prometheus-mysqld-exporter, and most likely either maintaining a fork of its Go code or implementing it as a separate exporter, to follow MediaWiki and WMF conventions (e.g. per-datacenter tracking). This can be seen in the graphs: when MySQL replication is stopped there are blank areas (https://grafana.wikimedia.org/goto/7m9I9yZ4z?orgId=1), where in reality (and from the point of view of Icinga) there should be a line climbing at one second per second. Totally possible (and desired) to have this on Prometheus, but not yet a reality :-(.
  • Reliability: the check_replication.pl script has been iterated on over the years and has a lot of fallbacks and bug fixes; e.g. if pt-heartbeat checking fails, it falls back to Seconds_Behind_Master automatically and warns rather than sending a critical; it snipes queries left over from failures in STATEMENT-based replication; and other subtleties based on bugs found over the years. This doesn't mean the current script has to continue to be used, but the Prometheus exporter has been, many times in the past, quite unreliable (e.g. failing to be restarted after extended mysqld downtime, among other issues). Losing general metrics for hours or days may not be too problematic, but it is for such an important input to alerting.
  • Features: the old replication check script takes the replication status into account and has relatively complex behaviour depending on whether replication is running or not; this is not captured in the current metrics monitoring.

None of these are blockers that prevent the work, but they are missing features that would create a regression if the proposed method is not iterated on further (which will require some work, and that is why it has not been done earlier). As someone who also receives the effects of pages, this is my €0.02 towards making sure the alerting is as reliable as, or more reliable than, the current method.

Change 825294 merged by jenkins-bot:

[operations/alerts@master] data-persistence: Add alert for replication lag

https://gerrit.wikimedia.org/r/825294

Thank you @jcrespo for the extensive write-up and the insights! I'll comment inline below.

As mentioned on IRC, and per @Ladsgroup, I raised a few issues with the *current* replication Prometheus metrics. Migrating to Prometheus-based alerting is very possible and desirable IMHO; I only raised issues with the previous patch as it is at the moment (or, more concretely, with the metrics currently obtained from prometheus-mysqld-exporter):

  • Whether the alert is present, and its severity (warning, critical, paging), is currently controlled by a relatively complex set of options coming from Hiera, the host's role/profile, and the instance status (being a master or a replica, being a standalone host, etc.). I believe the new alert will be applied to all hosts, but the config should depend, as it does now, on the Puppet config. For example, we don't care about lag on the s1 eqiad master, because it shouldn't replicate (at the moment). A production host, such as an s1 MediaWiki replica, could page (or not), but it would certainly be a critical metric; whereas for something like cloud, a passive misc host, or a backup source, those hosts lag all the time (within reason) and we don't care much. The current status on Icinga is far from perfect, but at least it catches some of these different behaviours; an alert applied uniformly to all hosts will create a lot of spam.

One solution I can think of for this is to write/configure the alerts (which basically boil down to YAML) directly from Puppet. We already do this for prometheus::blackbox::check::http, since alerts can vary by team. It will for sure be a good occasion to clean up / simplify some of the logic (hopefully!). Similarly, we should already be able to restrict the alert logic (e.g. severity) based on metric labels (e.g. as @Ladsgroup did in https://gerrit.wikimedia.org/r/c/operations/alerts/+/835117).
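To illustrate the label-based scoping idea, a rule like the following could restrict paging to core production replicas. The metric name matches upstream prometheus-mysqld-exporter's slave_status collector, but the shard/role labels and the thresholds are purely illustrative assumptions, not what production exposes today:

```yaml
# Sketch only: severity scoped by (hypothetical) shard and role labels,
# instead of one blanket rule for every mysql host.
groups:
  - name: mariadb_replication_lag
    rules:
      - alert: MysqlReplicationLagCritical
        # mysql_slave_status_seconds_behind_master comes from the
        # exporter's slave_status collector; shard/role are assumed to
        # be attached at scrape time (e.g. via Puppet-generated targets).
        expr: mysql_slave_status_seconds_behind_master{role="replica", shard=~"s[1-8]"} > 300
        for: 10m
        labels:
          severity: critical
          team: data-persistence
        annotations:
          summary: "Replication lag on {{ $labels.instance }} ({{ $labels.shard }})"
```

Passive misc hosts or backup sources would then get their own rule (or none at all) with laxer thresholds, mirroring the per-role logic currently expressed in Hiera.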

  • Most important issue: the Prometheus lag metric only works while replication is running, as it is based on the Seconds_Behind_Master value from SHOW SLAVE STATUS, rather than the accurate pt-heartbeat method used by Icinga and MediaWiki. If the MariaDB server crashes, or it is restarted and its replication is not up to date, or replication is not correctly set up, or it is just stopped for any reason and we forget to re-enable it, the graphs will show <no data>. This is a known defect documented on T141968. The reason this hasn't yet been implemented is that it may require reconfiguration of prometheus-mysqld-exporter, and most likely either maintaining a fork of its Go code or implementing it as a separate exporter, to follow MediaWiki and WMF conventions (e.g. per-datacenter tracking). This can be seen in the graphs: when MySQL replication is stopped there are blank areas (https://grafana.wikimedia.org/goto/7m9I9yZ4z?orgId=1), where in reality (and from the point of view of Icinga) there should be a line climbing at one second per second. Totally possible (and desired) to have this on Prometheus, but not yet a reality :-(.

+1 on checking heartbeat; I'll comment on the related task about what we could do on the mysqld-exporter side nowadays.
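For reference, upstream mysqld-exporter's collect.heartbeat module exposes pt-heartbeat timestamps as metrics, from which lag can be derived even while replication is stopped. A rough sketch of a recording rule, assuming the upstream metric names apply unchanged to WMF's heartbeat schema (which is exactly the open question above):

```yaml
# Sketch: lag derived from pt-heartbeat timestamps. Unlike
# Seconds_Behind_Master, this difference keeps growing while
# replication is stopped or broken, instead of going to <no data>.
- record: mysql:heartbeat_lag_seconds
  expr: >
    mysql_heartbeat_now_timestamp_seconds
    - mysql_heartbeat_stored_timestamp_seconds
```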

  • Reliability: the check_replication.pl script has been iterated on over the years and has a lot of fallbacks and bug fixes; e.g. if pt-heartbeat checking fails, it falls back to Seconds_Behind_Master automatically and warns rather than sending a critical; it snipes queries left over from failures in STATEMENT-based replication; and other subtleties based on bugs found over the years. This doesn't mean the current script has to continue to be used, but the Prometheus exporter has been, many times in the past, quite unreliable (e.g. failing to be restarted after extended mysqld downtime, among other issues). Losing general metrics for hours or days may not be too problematic, but it is for such an important input to alerting.
  • Features: the old replication check script takes the replication status into account and has relatively complex behaviour depending on whether replication is running or not; this is not captured in the current metrics monitoring.

Thank you for pointing this out; the current logic is certainly something to take into account when porting the alerts over.

None of these are blockers that prevent the work, but they are missing features that would create a regression if the proposed method is not iterated on further (which will require some work, and that is why it has not been done earlier). As someone who also receives the effects of pages, this is my €0.02 towards making sure the alerting is as reliable as, or more reliable than, the current method.

+1 on making sure we are improving the situation with the new alerting; thank you again for your feedback!

HTH!

Change 963980 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/alerts@master] MySQL: adding rule set to iterate over

https://gerrit.wikimedia.org/r/963980

As mentioned on IRC, and per @Ladsgroup, I raised a few issues with the *current* replication Prometheus metrics. Migrating to Prometheus-based alerting is very possible and desirable IMHO; I only raised issues with the previous patch as it is at the moment (or, more concretely, with the metrics currently obtained from prometheus-mysqld-exporter):

  • Whether the alert is present, and its severity (warning, critical, paging), is currently controlled by a relatively complex set of options coming from Hiera, the host's role/profile, and the instance status (being a master or a replica, being a standalone host, etc.). I believe the new alert will be applied to all hosts, but the config should depend, as it does now, on the Puppet config. For example, we don't care about lag on the s1 eqiad master, because it shouldn't replicate (at the moment). A production host, such as an s1 MediaWiki replica, could page (or not), but it would certainly be a critical metric; whereas for something like cloud, a passive misc host, or a backup source, those hosts lag all the time (within reason) and we don't care much. The current status on Icinga is far from perfect, but at least it catches some of these different behaviours; an alert applied uniformly to all hosts will create a lot of spam.

One solution I can think of for this is to write/configure the alerts (which basically boil down to YAML) directly from Puppet. We already do this for prometheus::blackbox::check::http, since alerts can vary by team. It will for sure be a good occasion to clean up / simplify some of the logic (hopefully!). Similarly, we should already be able to restrict the alert logic (e.g. severity) based on metric labels (e.g. as @Ladsgroup did in https://gerrit.wikimedia.org/r/c/operations/alerts/+/835117).

This seems to be a proper solution; I've added an example of what critical severity could look like.

  • Most important issue: the Prometheus lag metric only works while replication is running, as it is based on the Seconds_Behind_Master value from SHOW SLAVE STATUS, rather than the accurate pt-heartbeat method used by Icinga and MediaWiki. If the MariaDB server crashes, or it is restarted and its replication is not up to date, or replication is not correctly set up, or it is just stopped for any reason and we forget to re-enable it, the graphs will show <no data>. This is a known defect documented on T141968. The reason this hasn't yet been implemented is that it may require reconfiguration of prometheus-mysqld-exporter, and most likely either maintaining a fork of its Go code or implementing it as a separate exporter, to follow MediaWiki and WMF conventions (e.g. per-datacenter tracking). This can be seen in the graphs: when MySQL replication is stopped there are blank areas (https://grafana.wikimedia.org/goto/7m9I9yZ4z?orgId=1), where in reality (and from the point of view of Icinga) there should be a line climbing at one second per second. Totally possible (and desired) to have this on Prometheus, but not yet a reality :-(.

+1 on checking heartbeat; I'll comment on the related task about what we could do on the mysqld-exporter side nowadays.

Upon comparing our pt-heartbeat with the 2.2.9 upstream branch, it seems that most of the added logic is designed to support datacenters and sections. We can still access that info through labels in the metrics. Discarding our fork of pt-heartbeat would also allow us to bump to a more recent version in production (currently our $VERSION = '3.5.5';). If we need to add more information to that scraping (here is how mysqld-exporter does it), it seems doable to maintain a trivial fork with that logic.
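If section and datacenter become plain metric labels, the aggregation the forked pt-heartbeat did in Perl could instead be expressed in PromQL. The `shard` and `datacenter` labels here are assumptions about what such a labelling scheme might look like; the metric names are upstream mysqld-exporter's heartbeat collector:

```yaml
# Sketch: worst heartbeat-derived lag per section and datacenter,
# assuming hypothetical shard/datacenter labels on the heartbeat
# metrics exposed by the collect.heartbeat module.
- record: shard:heartbeat_lag_seconds:max
  expr: >
    max by (shard, datacenter) (
      mysql_heartbeat_now_timestamp_seconds
      - mysql_heartbeat_stored_timestamp_seconds
    )
```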

  • Reliability: the check_replication.pl script has been iterated on over the years and has a lot of fallbacks and bug fixes; e.g. if pt-heartbeat checking fails, it falls back to Seconds_Behind_Master automatically and warns rather than sending a critical; it snipes queries left over from failures in STATEMENT-based replication; and other subtleties based on bugs found over the years. This doesn't mean the current script has to continue to be used, but the Prometheus exporter has been, many times in the past, quite unreliable (e.g. failing to be restarted after extended mysqld downtime, among other issues). Losing general metrics for hours or days may not be too problematic, but it is for such an important input to alerting.
  • Features: the old replication check script takes the replication status into account and has relatively complex behaviour depending on whether replication is running or not; this is not captured in the current metrics monitoring.

Thank you for pointing this out; the current logic is certainly something to take into account when porting the alerts over.

Here is a breakdown of the different aspects of those questions:

  • Reliability: the check_replication.pl script has been iterated on over the years and has a lot of fallbacks and bug fixes; e.g. if pt-heartbeat checking fails, it falls back to Seconds_Behind_Master automatically and warns rather than sending a critical,

This seems doable with the heartbeat collector in mysqld-exporter, as it does not seem to prevent collecting Seconds_Behind_Master while scraping heartbeat data. There should be a way to compose the different nuances of several metrics into a synthetic indicator that evaluates replication health.
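As a sketch of such a composed indicator (metric names are from upstream mysqld-exporter's slave_status and heartbeat collectors; the threshold and the way the conditions are OR-ed together are illustrative assumptions, not a finished design):

```yaml
# Sketch: replication is "unhealthy" if either replication thread is
# stopped, or if heartbeat-derived lag exceeds a threshold. The
# slave_status booleans catch stopped/broken replication, which
# Seconds_Behind_Master alone cannot report.
- alert: MysqlReplicationUnhealthy
  expr: >
    mysql_slave_status_slave_sql_running == 0
    or mysql_slave_status_slave_io_running == 0
    or (mysql_heartbeat_now_timestamp_seconds
        - mysql_heartbeat_stored_timestamp_seconds) > 600
  for: 5m
  labels:
    severity: warning
```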

it snipes queries left over from failures in STATEMENT-based replication, and other subtleties based on bugs found over the years.

I have not been able to find this in the script, but I think this can be an opportunity for us to reimplement that logic for Prometheus in a homegrown exporter.

This doesn't mean the current script has to continue to be used, but the Prometheus exporter has been, many times in the past, quite unreliable (e.g. failing to be restarted after extended mysqld downtime, among other issues). Losing general metrics for hours or days may not be too problematic, but it is for such an important input to alerting.

We do have a real need for reliability; we could run the two approaches in parallel at first, to ensure we're able to reach the same level of granularity and analytical depth. The Prometheus exporter has been in use for quite some time now, and I think most caveats are either fixed or fixable.

  • Features: the old replication check script takes the replication status into account and has relatively complex behaviour depending on whether replication is running or not; this is not captured in the current metrics monitoring.

This could also be handled by a homegrown Prometheus exporter dedicated to our specifics.

None of these are blockers that prevent the work, but they are missing features that would create a regression if the proposed method is not iterated on further (which will require some work, and that is why it has not been done earlier). As someone who also receives the effects of pages, this is my €0.02 towards making sure the alerting is as reliable as, or more reliable than, the current method.

+1 on making sure we are improving the situation with the new alerting; thank you again for your feedback!

HTH!

I hope this contributes to the general conversation.

ABran-WMF triaged this task as Medium priority.

We should also add a new check to monitor for depooled replicas.