
Migrate mysql icinga alerts to alert manager - pt-heartbeat + scaffolding
Open, In Progress, Medium, Public

Description

  • Create the prometheus http exporter scaffolding
  • Implement the custom pt-heartbeat monitoring
  • Create the related alert rule(s)

Event Timeline

The alerts should be configurable by lag and by role from puppet - that means, for example, that I don't want alerts for backup sources lagging less than 4h, as I regularly stop replication on those while taking the backups. E.g. core db hosts vs misc vs test hosts, etc.

I will suggest a hierarchy and ask @jcrespo @Marostegui @Ladsgroup for validation → let's try to keep a good signal/noise ratio.

Change #1046712 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/puppet@production] mariadb: prometheus config tweak for db1125

https://gerrit.wikimedia.org/r/1046712

Change #1046712 merged by Arnaudb:

[operations/puppet@production] mariadb: prometheus config tweak for db1125

https://gerrit.wikimedia.org/r/1046712

ABran-WMF changed the task status from Open to In Progress. Wed, Jun 19, 9:12 AM
ABran-WMF triaged this task as Medium priority.

@Ladsgroup had a neat suggestion: that we just try the current exporter. I think it'll save a few custom checks (if not all) down the line. https://grafana.wikimedia.org/goto/-AH4B_8SR?orgId=1 shows pt-heartbeat as seen from the exporter's point of view. I don't see a clear difference from the current icinga/perl implementation. We could maybe add the section label directly through the monitoring config script, to keep the exporter's config as generic as possible.

To validate that patch set, we'll need to tweak its account's ACL; the exporter currently logs: msg="Error from scraper" scraper=slave_status err="Error 1227: Access denied; you need (at least one of) the SUPER, SLAVE MONITOR privilege(s) for this operation"
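For reference, a minimal sketch of the kind of grant that would address that error, assuming the exporter connects with a dedicated account (the account name and host below are placeholders, not the real ones):

-- Sketch only: grant the least-powerful privilege named in the error above.
-- 'prometheus'@'localhost' is a placeholder for whatever account the exporter actually uses;
-- SLAVE MONITOR is a grantable global privilege in recent MariaDB versions.
GRANT SLAVE MONITOR ON *.* TO 'prometheus'@'localhost';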

> I don't see a clear difference from the current icinga/perl implementation.

In the past there were two additional fields, for the section and the datacenter, so there were two rows being written at the same time, one from each master. I don't know if that changed for the Orchestrator setup's needs; I wasn't involved since its original setup.

> so there were two rows being written at the same time, one from each master.

If you check that screengrab of the query, afaict we're seeing the same info added to the metric, just coming from the config side instead. That's why I was a bit perplexed!

> I don't know if that changed for the Orchestrator setup's needs; I wasn't involved since its original setup.

I haven't seen anything related to this specifically, but I'll keep it in mind

> I don't see a clear difference from the current icinga/perl implementation.
>
> In the past there were two additional fields, for the section and the datacenter, so there were two rows being written at the same time, one from each master. I don't know if that changed for the Orchestrator setup's needs; I wasn't involved since its original setup.

That hasn't changed - both masters write their own heartbeats and they get replicated.

If eqiad is primary:
eqiad hosts get eqiad's heartbeat
codfw hosts get eqiad's and codfw's heartbeats

If codfw is primary:
eqiad hosts get eqiad's and codfw's heartbeats
codfw hosts get codfw's heartbeat
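To make that concrete, here is an illustration of what the table then contains on a codfw replica while eqiad is primary (server_id values and timestamps are made up; the datacenter and shard columns are the ones discussed further down):

SELECT server_id, datacenter, shard, ts FROM heartbeat.heartbeat ORDER BY ts DESC;
-- Illustrative output only (values made up):
-- +-----------+------------+-------+----------------------------+
-- | server_id | datacenter | shard | ts                         |
-- +-----------+------------+-------+----------------------------+
-- | 171970637 | eqiad      | s1    | 2024-06-19T09:12:01.000410 |  <- row from the eqiad (primary DC) master
-- | 180359173 | codfw      | s1    | 2024-06-19T09:12:00.998801 |  <- row from the local codfw master
-- +-----------+------------+-------+----------------------------+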

> I don't see a clear difference from the current icinga/perl implementation.
>
> In the past there were two additional fields, for the section and the datacenter, so there were two rows being written at the same time, one from each master. I don't know if that changed for the Orchestrator setup's needs; I wasn't involved since its original setup.
>
> That hasn't changed - both masters write their own heartbeats and they get replicated.
>
> If eqiad is primary:
> eqiad hosts get eqiad's heartbeat
> codfw hosts get eqiad's and codfw's heartbeats
>
> If codfw is primary:
> eqiad hosts get eqiad's and codfw's heartbeats
> codfw hosts get codfw's heartbeat

@ABran-WMF If that is true, then the prometheus metric and the one used by the app would differ, for example, in the case of a network cut (or lag) between DCs (e.g. a master down). That doesn't mean the metric will not be useful (although it would be nice to see how it decides which row to use when it has several to calculate from), but it may still additionally require a custom check for e.g. local-DC lag vs mediawiki-observed lag. It is under failure that those differences are important. :-D

To clarify: mediawiki always checks the lag compared to the primary dc.

> I don't see a clear difference from the current icinga/perl implementation.
>
> In the past there were two additional fields, for the section and the datacenter, so there were two rows being written at the same time, one from each master. I don't know if that changed for the Orchestrator setup's needs; I wasn't involved since its original setup.
>
> That hasn't changed - both masters write their own heartbeats and they get replicated.
>
> If eqiad is primary:
> eqiad hosts get eqiad's heartbeat
> codfw hosts get eqiad's and codfw's heartbeats
>
> If codfw is primary:
> eqiad hosts get eqiad's and codfw's heartbeats
> codfw hosts get codfw's heartbeat

Based on the above, I would add another edge case:

When round replication is set up, just before and after a DC switchover, both DCs will get the heartbeats from both masters.

These subtleties are non-trivial, Arnaud, please help us document them (or change them; there is no good or bad option, just decisions) :-)

> I don't see a clear difference from the current icinga/perl implementation.
>
> In the past there were two additional fields, for the section and the datacenter, so there were two rows being written at the same time, one from each master. I don't know if that changed for the Orchestrator setup's needs; I wasn't involved since its original setup.
>
> That hasn't changed - both masters write their own heartbeats and they get replicated.
>
> If eqiad is primary:
> eqiad hosts get eqiad's heartbeat
> codfw hosts get eqiad's and codfw's heartbeats
>
> If codfw is primary:
> eqiad hosts get eqiad's and codfw's heartbeats
> codfw hosts get codfw's heartbeat
>
> Based on the above, I would add another edge case:
>
> When round replication is set up, just before and after a DC switchover, both DCs will get the heartbeats from both masters.

Correct, that will happen for a few days, and after the DC switchover is done, the non-active DC's heartbeats are cleaned up from the active DC.
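For illustration only, the cleanup amounts to something like the following once codfw has become primary (a sketch; the actual procedure, and how it interacts with replication, may well differ):

-- Sketch only, not the real procedure: on the active DC, drop the rows written by
-- the DC that is no longer primary, so only the active DC's heartbeats remain there.
DELETE FROM heartbeat.heartbeat WHERE datacenter = 'eqiad';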

Correct me if I'm wrong, but the heartbeat updates are coming from this script, which is called by that service.
So that would not change at all (or at least not in this iteration, nor in this group of tasks 😄), and the pt-heartbeat ←→ mediawiki relationship would stay 100% the same. What I'm aiming at here is the way we're alerting on those metrics: if you check mysqld-exporter's code, it does not update that ts at all, as that's not its job. I think there was some confusion around my intentions in that comment; I hope I'm a bit clearer now.

@ABran-WMF It very likely does change it, because, as you shared, the exporter does:

SELECT UNIX_TIMESTAMP(ts), UNIX_TIMESTAMP(%s), server_id from `heartbeat`.`heartbeat`

And then applies a series of logic to discard some rows:

https://github.com/prometheus/mysqld_exporter/blob/dd8afce2a46663a5112e53469f53ca56d50a63e2/collector/heartbeat.go#L116

Which at the very least discards the dc and shard columns.

While the current check does:

https://github.com/wikimedia/operations-puppet/blob/production/modules/icinga/files/check_mariadb.pl#L208

SELECT 1 as ok, greatest(0, TIMESTAMPDIFF(MICROSECOND, ts, UTC_TIMESTAMP(6)) - 500000) AS lag FROM heartbeat.heartbeat WHERE 1=1 AND datacenter = ? ORDER BY ts DESC LIMIT 1

Sure, you could later implement the logic with server_id, which is similar, but already at collection time you are getting something logically different.

For example, for codfw you will now get 2 rows and won't know which one is the right one to use, while the perl check should get you only 1.

Meanwhile, Mediawiki does its own thing (!):

https://github.com/wikimedia/mediawiki/blob/61d41a0309e036b159e0a7938d373805fa5e80ef/includes/libs/rdbms/database/replication/MysqlReplicationReporter.php#L147

"SELECT TIMESTAMPDIFF(MICROSECOND,ts,UTC_TIMESTAMP(6)) AS us_ago " .
"FROM heartbeat.heartbeat WHERE server_id != @@server_id ORDER BY ts DESC LIMIT 1"

Again, I am not saying what you are doing is bad or not useful - I think it is useful. I am just saying that it is different, and will behave differently from the existing check (as it will register different rows than the ones originally checked). All of that without changing pt-heartbeat-wikimedia, which I wouldn't have any issue with changing if that was needed.

I think this should be enabled - it will be useful to have this in prometheus (the same way it is useful to have seconds behind master). I just don't think it will, by itself, behave the same as the existing checks.

My only and whole point is that this is a difficult task with many edge cases!

@jcrespo Thank you for the clarification, I clearly see the point you were making! Indeed, I was missing the metric aggregation part. I think the best angle will then be to enable this probe on the exporter and to also implement the query as we have it in check_mariadb.pl. @Ladsgroup @Marostegui feel free to challenge this idea as well.
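For reference, a rough sketch of what such a query could look like if the check_mariadb.pl semantics are kept (the per-datacenter filter and the 0.5s grace period are taken from the perl query quoted above; returning server_id as well is only an idea, not an agreed-on implementation):

-- Sketch only: keep check_mariadb.pl's per-datacenter filter and 500ms grace period,
-- but also return server_id so a collector could label the resulting metric per origin master.
SELECT server_id,
       GREATEST(0, TIMESTAMPDIFF(MICROSECOND, ts, UTC_TIMESTAMP(6)) - 500000) AS lag_us
FROM heartbeat.heartbeat
WHERE datacenter = ?   -- the primary datacenter, passed in by the check
ORDER BY ts DESC
LIMIT 1;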

I wasn't worried about immediate alerts. I know those won't change for now.

It is also ok to simplify and try to embrace a more standard model, with some coordination at the MediaWiki layer (with the help of Amir, I think that is at least a possibility - at the time, the way we were asked to implement it was rather complex, there were different people driving it, and multi-dc was not a thing). I would start by rolling that out to all hosts and then comparing the differences in behavior over a long period of time and simulations, even during outages and simulated master failures. There is a lot of hidden knowledge in the nuances of the existing scripts, and my worst possible outcome would be to lose it and not take into account the years of tweaking and tuning that had been implemented there, making monitoring worse.

In any case, I believe we will still need some extra metrics, like latency, event setup monitoring, and failover metrics, that are not in the exporter. For example, right now if heartbeat fails, the check uses seconds behind master as a failover, and that is not always simple to do with just a formula.

I met with ABran on a quick call; we shared thoughts and, I think, understood each other. I also encouraged him to talk to the rest of the DBA team about decisions that are still open and about future improvements.

I was asked to give my opinion on this:

  • First thing: I don't have a lot of context regarding mariadb alerting, its infrastructure, and the planned changes. I tried my best to read the documents and comments, but I might be missing something super obvious. My apologies if I say something stupid.
  • Second: I'm aware of mediawiki's replication lag measurement issues (T327852: MediaWiki replication lags are inflated with an average of half a second). I hope to fix it, but there are so many things to do.
  • My main point: I want to take a step back and ask what the purpose of measuring replication lag is here. If it's only for alerting (and paging), then it doesn't matter much how we measure it. We are trying to measure the lag with pretty high precision, but we are picking an arbitrary number as the paging threshold (5s/10s), because in practice even 1s above or below the actual replag doesn't make a real difference in alerting. What happens in almost all cases is that replication breaks and we get paged for it within half a minute of the event, so trying to account for 36ms of cross-dc replag latency doesn't add much value IMHO.
    • We should definitely strive for more accuracy in graphs, or in mediawiki, where it's the basis for forcing bots to wait before their next query and so on. But for alerting? Not really.
    • So, when it doesn't matter which way, my personal preference is to pick the simplest one, and here (for alerting) I think seconds behind master is the best (completely ignoring pt-heartbeat - as I said, it has its own use cases, but I'm not sure this is one). Unless there are cases where seconds behind master would give a wildly off value, in which case pt-heartbeat is good too - but use whatever requires the least amount of engineering effort.
  • Last thing: we already alert based on seconds behind master by querying Prometheus from icinga: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1034124/2/modules/profile/manifests/mariadb/replication_lag.pp This can be moved to alertmanager 1:1 with basically no change (a lot of our alerting is custom stuff querying mariadb, but this doesn't seem to be one of those cases?). I think the biggest complexity is making sure this alerting is only applied to databases where it is set via puppet (when a host has profile::mariadb::replication_lag added). I understand the need to improve things, but I personally prefer doing one change at a time to avoid having too many moving parts, especially for something that's as scary as alerting.

> We are trying to measure the lag with pretty high precision

None of this is something I worry about - in fact, right now we don't alert/page until a server has failed its replication check 10 times over several seconds. The largest concern is that the prometheus model imposes some restrictions on how the alerting logic works: instead of being a custom check with branching decisions based on configuration and state, it now has to be, essentially, a float or a formula based on one or a few floats. This creates some logical issues I already discussed in T315866#8194791 that are not precision-based but accuracy-related (alerting when it shouldn't, and not alerting when it should): on secondary dcs, on primary servers, on failed heartbeat table deployments, on connection overloads, etc. This can already be seen in the fact that on prometheus-based alerts and others we sometimes get spammed with false alerts or not notified at all, while icinga alerts have been tuned for years. However, the model has to change due to the tooling change, and that is why we still have to figure out the best path forward.

> I understand the need to improve things, but I personally prefer doing one change at a time to avoid having too many moving parts, especially for something that's as scary as alerting.

I am not asking to improve things - I am actually asking to keep (more or less) the existing logic for alerting, or, if it has to be modified, to do it in an informed, deliberate way. A simple "we start using prometheus" without all the tweaks and edge cases that were implemented on icinga over the years will lead to regressions, excessive paging and unnoticed outages. The change is forced by the workflow change.

> Last thing: we already alert based on seconds behind master by querying Prometheus from icinga

That check was created to alert on sustained, small lag (2 seconds of lag for 1 hour), and it has not done a great job at it. It should be removed, because it never worked for the initial gap it was going to fix (T253120, T252952) - in part because of what I am mentioning now, in part because it wasn't followed up on afterwards, in part because it was a hard job to do. This is literally the mistake I want to avoid.

Thanks @Ladsgroup @jcrespo for those considerations - they help a lot in defining alerting thresholds. I was unaware of T253120 and T252952 in that context. It seems relevant, then, to first test a more vanilla approach.

We have custom checks in place that don't need to go away until the whole icinga stack is migrated. We can sign off on the alerting rules we're 100% confident in and keep the ones that need refining in testing for a longer period of time. What I mean here is that we're in the proper context to assess whether those tweaks need to keep existing. Given the importance of replication monitoring, I'll still implement the check in the exporter (a rough prototype is already mostly done), but this won't be a priority in this task; it will be handled along the way in the next tasks. I'll take note of this in T367284, where I'll be continuing to implement custom checks anyway.

Instead, I'll first focus on tweaking mysqld-exporter's configuration and adding some alerting rules.

Change #1047938 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/alerts@master] mysql: replication lag monitoring threshold and severity change

https://gerrit.wikimedia.org/r/1047938

These are the kinds of things I wanted to avoid with a naive approach:

  • Alerts based on seconds_behind_master (a regression over the current alerts, which use pt-heartbeat and fall back to seconds behind master only as a failover). I dare you to stop replication on a host (or have its replication stalled/failed) and get a lag alert - that is the big flaw of seconds_behind_master, because of how unreliable it is (see the illustration after this list).
  • Alerts based on hardcoded script defaults, which cannot be controlled on puppet per role and differ from those set up on hiera (another regression).
  • The behavior is very different and was not compared with the current setup, or it would have been noticed that the current replica checks only alert after 10 failures have happened in a row (retries => 10,) to avoid false-positive "spikes", which means they will not just warn after 30 seconds/300 seconds. This is important because sometimes mariadb will show a random large integer here (not important why that happens, but it does; it has to do with GTID), and you don't want to wake up because of this.
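To illustrate the first point: Seconds_Behind_Master comes from SHOW SLAVE STATUS and goes NULL when the replication threads are stopped or broken, so a plain threshold on the exported value never fires for a fully stopped replica (output abridged and illustrative):

SHOW SLAVE STATUS\G
-- Abridged, illustrative output on a replica whose replication has been stopped:
--        Slave_IO_Running: No
--       Slave_SQL_Running: No
--   Seconds_Behind_Master: NULL      <- no number left to compare against a lag threshold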

I completely agree with you - we'll pay attention to avoid such regressions and approximations during the migration! The good news is that we can iterate on and compose our alert thresholds as much as we need before declaring this migration "done".

The first patch I've sent tries to offer at least as much sensitivity as we currently have. We might have to tune those thresholds a bit, depending on the noise they make or fail to make. I've defined the suggested thresholds by querying our metrics from the last few weeks.

Change #1047983 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/alerts@master] mysql: pt-heartbeat alerting rules

https://gerrit.wikimedia.org/r/1047983

Change #1048006 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/puppet@production] mariadb: enable pt-heartbeat monitoring through mysqld-exporter

https://gerrit.wikimedia.org/r/1048006