
Migrate mysql icinga alerts to alert manager
Open, Medium, Public

Description

Modernizing our alerting infra.

Event Timeline

Change 825294 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/alerts@master] data-persistence: Add alert for replication lag

https://gerrit.wikimedia.org/r/825294

As mentioned on IRC, and per @Ladsgroup, I raised a few issues with the *current* replication Prometheus metrics. Migrating to Prometheus-based alerting is very possible and desirable IMHO; I only raised issues with the previous patch as it is at the moment (or, more concretely, with the metrics currently obtained from prometheus-mysqld-exporter):

  • Whether the alert is present, and its severity (warning, critical, paging), is currently controlled by a relatively complex set of options coming from Hiera, the host's role/profile, and the instance status (being a master or a replica, being a standalone host, etc.). I believe the new alert will be applied to all hosts, but the config should depend, as it does now, on the Puppet config. For example, we don't care about lag on the s1 eqiad master, because it shouldn't replicate (at the moment). A production host, such as an s1 MediaWiki replica, could page (or not), but it would certainly be a critical metric; whereas for something like cloud, a passive misc host, or a backup source, those hosts lag all the time (within reason) and we don't care much. The current status on Icinga is far from perfect, but at least it catches some of these different behaviours; an alert applied uniformly to all hosts will create a lot of spam.
  • Most important issue: the Prometheus lag metric only works while replication is running, as it is based on the Seconds_Behind_Master value from SHOW SLAVE STATUS, rather than the accurate pt-heartbeat method used by Icinga and MediaWiki. If the MariaDB server crashes, or it is restarted and its replication is not up to date, or replication is not correctly set up, or it is just stopped for any reason and we forget to re-enable it, the graphs will show <no data>. This is a known defect documented on T141968. The reason this hasn't yet been implemented is that it may require reconfiguration of prometheus-mysqld-exporter, and most likely either maintaining a fork of its Go code or implementing it as a separate exporter, to follow MediaWiki and WMF conventions (e.g. per-datacenter tracking). This can be seen in the graphs: when MySQL replication is stopped there are blank areas (https://grafana.wikimedia.org/goto/7m9I9yZ4z?orgId=1), where in reality (and from the point of view of Icinga) there should be a line climbing at one second per second. Totally possible (and desired) to have this on Prometheus, but not yet a reality :-(.
  • Reliability: the check_replication.pl script has been iterated on over the years and has a lot of fallbacks and bug fixes; e.g. if pt-heartbeat checking fails, it falls back to Seconds_Behind_Master automatically and warns rather than sending a critical; it snipes queries left over from failures in STATEMENT-based replication; and other subtleties based on bugs found over the years. This doesn't mean the current script has to continue to be used, but the Prometheus exporter has been, many times in the past, quite unreliable (e.g. failing to be restarted after extended mysqld downtime, among other issues). Losing general metrics for hours or days may not be too problematic, but it is for such an important input to alerting.
  • Features: the old replication check script takes the replication status into account and has relatively complex behaviour depending on whether replication is running or not; this is not captured in the current metrics monitoring.

None of these are blockers that prevent the work, but they are missing features that would create a regression if the proposed method is not iterated on further (which will require some work, and that is why it has not been done earlier). As someone who also receives the effects of pages, this is my €0.02 towards making sure the alerting is as reliable as, or more reliable than, the current method.

Change 825294 merged by jenkins-bot:

[operations/alerts@master] data-persistence: Add alert for replication lag

https://gerrit.wikimedia.org/r/825294

Thank you @jcrespo for the extensive write-up and the insights! I'll comment inline below.

As mentioned on IRC, and per @Ladsgroup, I raised a few issues with the *current* replication Prometheus metrics. Migrating to Prometheus-based alerting is very possible and desirable IMHO; I only raised issues with the previous patch as it is at the moment (or, more concretely, with the metrics currently obtained from prometheus-mysqld-exporter):

  • Whether the alert is present, and its severity (warning, critical, paging), is currently controlled by a relatively complex set of options coming from Hiera, the host's role/profile, and the instance status (being a master or a replica, being a standalone host, etc.). I believe the new alert will be applied to all hosts, but the config should depend, as it does now, on the Puppet config. For example, we don't care about lag on the s1 eqiad master, because it shouldn't replicate (at the moment). A production host, such as an s1 MediaWiki replica, could page (or not), but it would certainly be a critical metric; whereas for something like cloud, a passive misc host, or a backup source, those hosts lag all the time (within reason) and we don't care much. The current status on Icinga is far from perfect, but at least it catches some of these different behaviours; an alert applied uniformly to all hosts will create a lot of spam.

One solution I can think of for this is to write/configure the alerts (which basically boil down to YAML) directly from Puppet. We already do this for prometheus::blackbox::check::http, since alerts can vary by team. It will for sure be a good occasion to clean up / simplify some of the logic (hopefully!). Similarly, we should already be able to restrict the alert logic (e.g. severity) based on metric labels (e.g. as @Ladsgroup did in https://gerrit.wikimedia.org/r/c/operations/alerts/+/835117).
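To illustrate the label-based scoping idea, a rule like the following could restrict paging to core production replicas. The metric name matches upstream prometheus-mysqld-exporter's slave_status collector, but the shard/role labels and the thresholds are purely illustrative assumptions, not what production exposes today:

```yaml
# Sketch only: severity scoped by (hypothetical) shard and role labels,
# instead of one blanket rule for every mysql host.
groups:
  - name: mariadb_replication_lag
    rules:
      - alert: MysqlReplicationLagCritical
        # mysql_slave_status_seconds_behind_master comes from the
        # exporter's slave_status collector; shard/role are assumed to
        # be attached at scrape time (e.g. via Puppet-generated targets).
        expr: mysql_slave_status_seconds_behind_master{role="replica", shard=~"s[1-8]"} > 300
        for: 10m
        labels:
          severity: critical
          team: data-persistence
        annotations:
          summary: "Replication lag on {{ $labels.instance }} ({{ $labels.shard }})"
```

Passive misc hosts or backup sources would then get their own rule (or none at all) with laxer thresholds, mirroring the per-role logic currently expressed in Hiera.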

  • Most important issue: the Prometheus lag metric only works while replication is running, as it is based on the Seconds_Behind_Master value from SHOW SLAVE STATUS, rather than the accurate pt-heartbeat method used by Icinga and MediaWiki. If the MariaDB server crashes, or it is restarted and its replication is not up to date, or replication is not correctly set up, or it is just stopped for any reason and we forget to re-enable it, the graphs will show <no data>. This is a known defect documented on T141968. The reason this hasn't yet been implemented is that it may require reconfiguration of prometheus-mysqld-exporter, and most likely either maintaining a fork of its Go code or implementing it as a separate exporter, to follow MediaWiki and WMF conventions (e.g. per-datacenter tracking). This can be seen in the graphs: when MySQL replication is stopped there are blank areas (https://grafana.wikimedia.org/goto/7m9I9yZ4z?orgId=1), where in reality (and from the point of view of Icinga) there should be a line climbing at one second per second. Totally possible (and desired) to have this on Prometheus, but not yet a reality :-(.

+1 on checking heartbeat; I'll comment on the related task about what we could do on the mysqld-exporter side nowadays.
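For reference, upstream mysqld-exporter's collect.heartbeat module exposes pt-heartbeat timestamps as metrics, from which lag can be derived even while replication is stopped. A rough sketch of a recording rule, assuming the upstream metric names apply unchanged to WMF's heartbeat schema (which is exactly the open question above):

```yaml
# Sketch: lag derived from pt-heartbeat timestamps. Unlike
# Seconds_Behind_Master, this difference keeps growing while
# replication is stopped or broken, instead of going to <no data>.
- record: mysql:heartbeat_lag_seconds
  expr: >
    mysql_heartbeat_now_timestamp_seconds
    - mysql_heartbeat_stored_timestamp_seconds
```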

  • Reliability: the check_replication.pl script has been iterated on over the years and has a lot of fallbacks and bug fixes; e.g. if pt-heartbeat checking fails, it falls back to Seconds_Behind_Master automatically and warns rather than sending a critical; it snipes queries left over from failures in STATEMENT-based replication; and other subtleties based on bugs found over the years. This doesn't mean the current script has to continue to be used, but the Prometheus exporter has been, many times in the past, quite unreliable (e.g. failing to be restarted after extended mysqld downtime, among other issues). Losing general metrics for hours or days may not be too problematic, but it is for such an important input to alerting.
  • Features: the old replication check script takes the replication status into account and has relatively complex behaviour depending on whether replication is running or not; this is not captured in the current metrics monitoring.

Thank you for pointing this out; the current logic is certainly something to take into account when porting the alerts over.

None of these are blockers that prevent the work, but they are missing features that would create a regression if the proposed method is not iterated on further (which will require some work, and that is why it has not been done earlier). As someone who also receives the effects of pages, this is my €0.02 towards making sure the alerting is as reliable as, or more reliable than, the current method.

+1 on making sure we are improving the situation with the new alerting; thank you again for your feedback!

HTH!

Change 963980 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/alerts@master] MySQL: adding rule set to iterate over

https://gerrit.wikimedia.org/r/963980

As mentioned on IRC, and per @Ladsgroup, I raised a few issues with the *current* replication Prometheus metrics. Migrating to Prometheus-based alerting is very possible and desirable IMHO; I only raised issues with the previous patch as it is at the moment (or, more concretely, with the metrics currently obtained from prometheus-mysqld-exporter):

  • Whether the alert is present, and its severity (warning, critical, paging), is currently controlled by a relatively complex set of options coming from Hiera, the host's role/profile, and the instance status (being a master or a replica, being a standalone host, etc.). I believe the new alert will be applied to all hosts, but the config should depend, as it does now, on the Puppet config. For example, we don't care about lag on the s1 eqiad master, because it shouldn't replicate (at the moment). A production host, such as an s1 MediaWiki replica, could page (or not), but it would certainly be a critical metric; whereas for something like cloud, a passive misc host, or a backup source, those hosts lag all the time (within reason) and we don't care much. The current status on Icinga is far from perfect, but at least it catches some of these different behaviours; an alert applied uniformly to all hosts will create a lot of spam.

One solution I can think of for this is to write/configure the alerts (which basically boil down to YAML) directly from Puppet. We already do this for prometheus::blackbox::check::http, since alerts can vary by team. It will for sure be a good occasion to clean up / simplify some of the logic (hopefully!). Similarly, we should already be able to restrict the alert logic (e.g. severity) based on metric labels (e.g. as @Ladsgroup did in https://gerrit.wikimedia.org/r/c/operations/alerts/+/835117).

This seems to be a proper solution; I've added an example of what critical severity could look like.

  • Most important issue: the Prometheus lag metric only works while replication is running, as it is based on the Seconds_Behind_Master value from SHOW SLAVE STATUS, rather than the accurate pt-heartbeat method used by Icinga and MediaWiki. If the MariaDB server crashes, or it is restarted and its replication is not up to date, or replication is not correctly set up, or it is just stopped for any reason and we forget to re-enable it, the graphs will show <no data>. This is a known defect documented on T141968. The reason this hasn't yet been implemented is that it may require reconfiguration of prometheus-mysqld-exporter, and most likely either maintaining a fork of its Go code or implementing it as a separate exporter, to follow MediaWiki and WMF conventions (e.g. per-datacenter tracking). This can be seen in the graphs: when MySQL replication is stopped there are blank areas (https://grafana.wikimedia.org/goto/7m9I9yZ4z?orgId=1), where in reality (and from the point of view of Icinga) there should be a line climbing at one second per second. Totally possible (and desired) to have this on Prometheus, but not yet a reality :-(.

+1 on checking heartbeat; I'll comment on the related task about what we could do on the mysqld-exporter side nowadays.

Upon comparing our pt-heartbeat with the 2.2.9 upstream branch, it seems that most of the added logic is designed to support datacenters and sections. We can still access that info through labels in the metrics. Discarding our fork of pt-heartbeat would also allow us to bump to a more recent version in production (currently our $VERSION = '3.5.5';). If we need to add more information to that scraping (here is how mysqld-exporter does it), it seems doable to maintain a trivial fork with that logic.
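If section and datacenter become plain metric labels, the aggregation the forked pt-heartbeat did in Perl could instead be expressed in PromQL. The `shard` and `datacenter` labels here are assumptions about what such a labelling scheme might look like; the metric names are upstream mysqld-exporter's heartbeat collector:

```yaml
# Sketch: worst heartbeat-derived lag per section and datacenter,
# assuming hypothetical shard/datacenter labels on the heartbeat
# metrics exposed by the collect.heartbeat module.
- record: shard:heartbeat_lag_seconds:max
  expr: >
    max by (shard, datacenter) (
      mysql_heartbeat_now_timestamp_seconds
      - mysql_heartbeat_stored_timestamp_seconds
    )
```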

  • Reliability: the check_replication.pl script has been iterated on over the years and has a lot of fallbacks and bug fixes; e.g. if pt-heartbeat checking fails, it falls back to Seconds_Behind_Master automatically and warns rather than sending a critical; it snipes queries left over from failures in STATEMENT-based replication; and other subtleties based on bugs found over the years. This doesn't mean the current script has to continue to be used, but the Prometheus exporter has been, many times in the past, quite unreliable (e.g. failing to be restarted after extended mysqld downtime, among other issues). Losing general metrics for hours or days may not be too problematic, but it is for such an important input to alerting.
  • Features: the old replication check script takes the replication status into account and has relatively complex behaviour depending on whether replication is running or not; this is not captured in the current metrics monitoring.

Thank you for pointing this out; the current logic is certainly something to take into account when porting the alerts over.

Here is a breakdown of the different aspects of those questions:

  • Reliability: the check_replication.pl script has been iterated on over the years and has a lot of fallbacks and bug fixes; e.g. if pt-heartbeat checking fails, it falls back to Seconds_Behind_Master automatically and warns rather than sending a critical,

This seems doable with the heartbeat collector in mysqld-exporter, as it does not seem to prevent collecting Seconds_Behind_Master while scraping heartbeat data. There should be a way to compose the different nuances of several metrics into a synthetic indicator that evaluates replication health.
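As a sketch of such a composed indicator (metric names are from upstream mysqld-exporter's slave_status and heartbeat collectors; the threshold and the way the conditions are OR-ed together are illustrative assumptions, not a finished design):

```yaml
# Sketch: replication is "unhealthy" if either replication thread is
# stopped, or if heartbeat-derived lag exceeds a threshold. The
# slave_status booleans catch stopped/broken replication, which
# Seconds_Behind_Master alone cannot report.
- alert: MysqlReplicationUnhealthy
  expr: >
    mysql_slave_status_slave_sql_running == 0
    or mysql_slave_status_slave_io_running == 0
    or (mysql_heartbeat_now_timestamp_seconds
        - mysql_heartbeat_stored_timestamp_seconds) > 600
  for: 5m
  labels:
    severity: warning
```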

it snipes queries left over from failures in STATEMENT-based replication, and other subtleties based on bugs found over the years.

I have not been able to find this in the script, but I think this can be an opportunity for us to reimplement that logic for Prometheus in a homegrown exporter.

This doesn't mean the current script has to continue to be used, but the Prometheus exporter has been, many times in the past, quite unreliable (e.g. failing to be restarted after extended mysqld downtime, among other issues). Losing general metrics for hours or days may not be too problematic, but it is for such an important input to alerting.

We do have a real need for reliability; we could run the two approaches in parallel at first, to ensure we're able to reach the same level of granularity and analytical depth. The Prometheus exporter has been in use for quite some time now, and I think most caveats are either fixed or fixable.

  • Features: the old replication check script takes the replication status into account and has relatively complex behaviour depending on whether replication is running or not; this is not captured in the current metrics monitoring.

This could also be handled by a homegrown Prometheus exporter dedicated to our specifics.

None of these are blockers that prevent the work, but they are missing features that would create a regression if the proposed method is not iterated on further (which will require some work, and that is why it has not been done earlier). As someone who also receives the effects of pages, this is my €0.02 towards making sure the alerting is as reliable as, or more reliable than, the current method.

+1 on making sure we are improving the situation with the new alerting; thank you again for your feedback!

HTH!

I hope this contributes to the general conversation.

ABran-WMF triaged this task as Medium priority.

We should also add a new check to monitor for depooled replicas.