
Reduce false positives on database pages
Closed, ResolvedPublic

Description

In September 2017, 50% of ops-related pages came from databases, while the only real outage didn't page. We have to do a deep review of our monitoring practices to understand how we monitor our databases and whether we should change our model, not only the technology, but also the human model behind it.

For databases, some actionable steps we can take are:

  • Application-bad states (code problems) should be better exposed to the people that can do something about them, not to ops (T177778). E.g. if a server starts lagging and it is not a MySQL issue (such as broken replication, a hardware problem, or strange query patterns), it makes no sense to send that information to operators; it should go to code releasers and the many code owners (for trivial issues causing lag, such as maintenance).
  • Single server issues should not page. MediaWiki should be reliable enough so that if a single server starts lagging or disappears, it uses the rest of the servers to continue operation. This is not 100% true right now: servers that are lagged or inaccessible keep being tried for every query (or a percentage of them), and in some cases requests start timing out (as happened during network maintenance). We can accept some small SPOFs for the time being (like the master), but we shouldn't have a SPOF in each database server; they should really be depooled. Independently of that, single servers should never page. Can we use Prometheus for those aggregations? Should we create scripts that are topology-aware? (See the sketch after this list.) We can keep paging for masters, due to their importance, but we definitely should do something about the large number of replicas we have.
  • Move paging checks to service checks. In a distributed environment this is not trivial (due to network splits, etc.), but even an imperfect solution will be better than the current model. Single MediaWiki-based checks for every role and every shard would be enough to verify that the databases are working correctly, even if some of them have issues. Checks like the number of "mysql" processes can be kept, but they are normally useless: when they fire it is normally a new host or a restart, while crashing hosts are usually restarted automatically, so those are not caught.
  • Once individual checks are made non-paging, we can have more subtle checks like the ones proposed at T172489#3501437
  • We have to review how each individual person reacts to pages (aka the human factor). If we have perfect alert coverage but no one reacts to it, it is useless. We should know how people react to pages and why they get ignored (not one's ownership, not having enough knowledge of other services, not having enough documentation). With the growth of the ops team, maybe it no longer makes sense to have all of ops paged for all alerts; maybe it makes sense to have a large enough pool of people to make sure at least someone can react. Could we have some kind of escalation? E.g. owners or a random subgroup > ops in the right timezone > all ops. Could we have some kind of rotation? (This is a general idea, not only for databases.) Should we have some kind of "post-mortem" of monthly alerts, not to blame anyone, but to identify weaknesses (e.g. "there is instability on Varnish 5.0, we should pour more resources there", "there are too many false positives, can we do something about it?").
  • We should review how actionable alerts are *in practice*. I do not care if the datacenter is on fire if I cannot do anything about it, no matter how large the issue is. I do not think many of our checks are actionable, and in some cases they can lead to worse issues if acted on literally. The write-back (WB) cache being in write-through (WT) mode is bad for replication performance, but forcing WB with a broken BBU is even worse.
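
A minimal sketch of what a topology-aware, aggregate paging check could look like, assuming Prometheus as the metrics source (queried over its HTTP API for mysqld_exporter's mysql_slave_status_seconds_behind_master metric). The Prometheus URL, the SECTIONS topology and the thresholds are hypothetical placeholders, not our actual configuration:

```
#!/usr/bin/env python3
"""Sketch: page on aggregate replica health per section, never on a single host."""
import requests

PROMETHEUS = "http://prometheus.example.org:9090/api/v1/query"  # placeholder URL
LAG_THRESHOLD = 30     # seconds of lag before a replica counts as unhealthy
PAGE_FRACTION = 0.25   # page only if more than 25% of a section's replicas are unhealthy

# Hypothetical topology: section -> list of replica hostnames
SECTIONS = {
    "s1": ["db1001", "db1002", "db1003", "db1004"],
    "s2": ["db2001", "db2002", "db2003"],
}

def replica_lag(host):
    """Return seconds of replication lag for a host, or None if unknown."""
    query = f'mysql_slave_status_seconds_behind_master{{instance=~"{host}.*"}}'
    resp = requests.get(PROMETHEUS, params={"query": query}, timeout=5).json()
    result = resp.get("data", {}).get("result", [])
    return float(result[0]["value"][1]) if result else None

def check_section(section, hosts):
    """Page only when a significant fraction of a section's replicas is unhealthy."""
    unhealthy = [h for h in hosts
                 if (lag := replica_lag(h)) is None or lag > LAG_THRESHOLD]
    if len(unhealthy) / len(hosts) > PAGE_FRACTION:
        return f"PAGE {section}: {len(unhealthy)}/{len(hosts)} replicas unhealthy"
    if unhealthy:
        return f"WARN (no page) {section}: lagging/unreachable: {unhealthy}"
    return f"OK {section}"

if __name__ == "__main__":
    for section, hosts in SECTIONS.items():
        print(check_section(section, hosts))
```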

Event Timeline

Single server issues should not page. MediaWiki should be reliable enough so that if a single server starts lagging or disappears, it uses the rest of the servers to continue operation. This is not 100% true right now: servers that are lagged or inaccessible keep being tried for every query (or a percentage of them), and in some cases requests start timing out (as happened during network maintenance). We can accept some small SPOFs for the time being (like the master), but we shouldn't have a SPOF in each database server; they should really be depooled. Independently of that, single servers should never page.

Fully agreed. I don't think this is contentious. :) We avoid that pretty much everywhere else in our infrastructure (i.e. single server issues get critical alerts, but do not page), and we should implement the same for DB paging as much as possible. Some of this may require changes inside MediaWiki, so we should discuss with the MediaWiki platform team.

Can we use Prometheus for those aggregations? Should we create scripts that are topology-aware? We can keep paging for masters, due to their importance, but we definitely should do something about the large number of replicas we have.

I think the only reason this was originally set up the way it (still) is today was insufficient time to implement something better with a much smaller team, and it seemed better than nothing at all at the time. But I don't think there's a reason we can't fix/prioritize this now, especially for something that generates so much noise for many people outside work hours.

And indeed, newer technologies like Prometheus may help here as well. I'm pretty sure it can be used to alert on aggregate conditions.

Move paging checks to service checks. In a distributed environment this is not trivial (due to network splits, etc.), but even an imperfect solution will be better than the current model. Single MediaWiki-based checks for every role and every shard would be enough to verify that the databases are working correctly, even if some of them have issues. Checks like the number of "mysql" processes can be kept, but they are normally useless: when they fire it is normally a new host or a restart, while crashing hosts are usually restarted automatically, so those are not caught.

Agreed also. I think this should probably be our first step, actually. Once we have this, it by itself removes the need for a lot of the current single-server checks, and those can then be made non-paging or removed entirely.

Do you have suggestion(s) on what functionality (or URL) in MediaWiki to do a monitoring check against, which would uncover most or all DB issues while having few false positives?

Once individual checks are made non-paging, we can have more subtle checks like the ones proposed at T172489#3501437

Why not? :)

Application-bad states (code problems) should be better exposed to the people that can do something about them, not to ops (T177778). E.g. if a server starts lagging and it is not a MySQL issue (such as broken replication, a hardware problem, or strange query patterns), it makes no sense to send that information to operators; it should go to code releasers and the many code owners (for trivial issues causing lag, such as maintenance).

Yes and no... that should be an interaction between both teams I think, and ideally be received by both teams, so both parties are aware. Some things can be mitigated short-term by Ops/DBAs and addressed more long-term in code/architecture. (One side ignoring such problems would be a different matter which should be fixed at the management level, not in alerting strategy. :)

And of course it's usually tricky to make a clean and failure-proof separation of the two cases - we're dealing with unplanned error conditions in the first place.

We have to review how each individual person reacts to pages (aka the human factor). If we have perfect alert coverage but no one reacts to it, it is useless. We should know how people react to pages and why they get ignored (not one's ownership, not having enough knowledge of other services, not having enough documentation). With the growth of the ops team, maybe it no longer makes sense to have all of ops paged for all alerts; maybe it makes sense to have a large enough pool of people to make sure at least someone can react. Could we have some kind of escalation? E.g. owners or a random subgroup > ops in the right timezone > all ops. Could we have some kind of rotation? (This is a general idea, not only for databases.) Should we have some kind of "post-mortem" of monthly alerts, not to blame anyone, but to identify weaknesses (e.g. "there is instability on Varnish 5.0, we should pour more resources there", "there are too many false positives, can we do something about it?").

Yes, I think this in particular will need discussion with the entire team. Our original paging strategy may need to be revisited now that the team has grown much bigger, people have become more specialized, and some sub-teams have formed or split off. This will be tricky.

We should review how actionable alerts are *in practice*. I do not care if the datacenter is on fire if I cannot do anything about it, no matter how large the issue is. I do not think many of our checks are actionable, and in some cases they can lead to worse issues if acted on literally. The write-back (WB) cache being in write-through (WT) mode is bad for replication performance, but forcing WB with a broken BBU is even worse.

I'm not entirely sure what your point is here, or at least with this example. Surely knowing that a RAID controller is in another mode than intended is actionable and useful? There may be individual cases where that /is/ intended, but then we should make sure that's not an alert but an explicitly acknowledged/supported condition. :)

Do you have suggestion(s) on what functionality (or URL) in MediaWiki to do a monitoring check against, which would uncover most or all DB issues while having few false positives?

Using a MediaWiki function to check the database lag, and another to get the last 50 recentchanges entries, would catch most database issues, on each of the replica sets and each of the roles. I think there are exposed API calls for both. But first MediaWiki has to be "fixed".
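
For reference, a minimal sketch of such a check, assuming the two API modules this seems to refer to (meta=siteinfo with siprop=dbrepllag for replication lag, and list=recentchanges for the last 50 edits). The wiki URL, the lag threshold, and the Icinga-style exit codes are illustrative, not the real check:

```
#!/usr/bin/env python3
"""Sketch: MediaWiki-level service check for replication lag and recentchanges."""
import sys
import requests

API = "https://en.wikipedia.org/w/api.php"  # in practice, one URL per wiki/section
MAX_LAG = 30                                # illustrative lag threshold (seconds)

def check_db_lag():
    """CRITICAL if any replica reported by MediaWiki lags beyond MAX_LAG."""
    params = {"action": "query", "meta": "siteinfo", "siprop": "dbrepllag",
              "sishowalldb": 1, "format": "json"}
    lags = requests.get(API, params=params, timeout=10).json()["query"]["dbrepllag"]
    bad = [d for d in lags if d["lag"] is None or d["lag"] > MAX_LAG]
    return (2, f"CRITICAL: lagging replicas: {bad}") if bad else (0, "OK: lag within limits")

def check_recent_changes():
    """CRITICAL if MediaWiki cannot return the last 50 recentchanges entries."""
    params = {"action": "query", "list": "recentchanges", "rclimit": 50, "format": "json"}
    changes = requests.get(API, params=params, timeout=10).json()["query"]["recentchanges"]
    if len(changes) == 50:
        return (0, "OK: recentchanges readable")
    return (2, f"CRITICAL: only {len(changes)} recentchanges entries returned")

if __name__ == "__main__":
    results = [check_db_lag(), check_recent_changes()]
    for _, message in results:
        print(message)
    sys.exit(max(status for status, _ in results))  # Icinga-style exit code
```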

Change 384183 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] base/icinga: if on labs, don't page for mysql procs

https://gerrit.wikimedia.org/r/384183

The change I uploaded above intends to disable paging for the specific check "mysql procs running" if a host is in labs/labtest. It uses existing regexes in Hiera that should match all labs hosts (the same ones that also set the cluster). This is just for the mysql process and just for labs, but I think it's a start. It's a reaction to the recent SMS we got when the mysql/maria role was used on labtest hosts and then the process was stopped. I think it should be easy to agree that something called "labs" or "test" means it can't be important enough to send actual SMS, right? A more specific ticket for that is T178008.

Application-bad states (code problems) should be better exposed to the people that can do something about them, not to ops (T177778). E.g. if a server starts lagging and it is not a MySQL issue (such as broken replication, a hardware problem, or strange query patterns), it makes no sense to send that information to operators; it should go to code releasers and the many code owners (for trivial issues causing lag, such as maintenance).

There are certain cases where Ops would still have to be paged: it is good to let code releasers know that "they" broke something, but they might not be able to fix it.
I do agree that they need to receive the page, but Ops should probably receive it too.
There are multiple exceptions where this doesn't apply, e.g. newly released code generates lots and lots of new queries and a traffic increase in our DBs; there is not much Ops can do about that, the releasers just need to revert the code.

Single server issues should not page. MediaWiki should be reliable enough so that if a single server starts lagging or disappears, it uses the rest of the servers to continue operation. This is not 100% true right now: servers that are lagged or inaccessible keep being tried for every query (or a percentage of them), and in some cases requests start timing out (as happened during network maintenance). We can accept some small SPOFs for the time being (like the master), but we shouldn't have a SPOF in each database server; they should really be depooled. Independently of that, single servers should never page. Can we use Prometheus for those aggregations? Should we create scripts that are topology-aware? We can keep paging for masters, due to their importance, but we definitely should do something about the large number of replicas we have.

Agreed. We should page for masters and we should not page for a single replica. Instead, we should page based on a % of the replicas.
I.e. if a replica goes down, I don't care. And I don't care whether it was its network, MySQL, a disk, or the storage that went down. The service is down. OK. The code (or whatever we use in between) should be smart enough to depool it, and I will take care of it when I wake up or when I have time.
If 25% (or whatever % we agree on) of the replicas are out, maybe we need to start caring and we need to get a page before it is too late.

This could be even smarter and should probably check whether we have recently done a MediaWiki deploy, i.e. if replication is broken on all the hosts and there has been a deploy in the last 5 minutes (or whatever), also page the code releasers.
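
A minimal sketch of that routing idea; last_deploy_time() is a hypothetical helper (it would have to read the deployment server or the server admin log), and the thresholds are the illustrative ones from the comments above:

```
#!/usr/bin/env python3
"""Sketch: route a broken-replication alert to code releasers after a recent deploy."""
import time

DEPLOY_WINDOW = 5 * 60  # seconds: "a deploy in the last 5 minutes (or whatever)"

def last_deploy_time():
    """Hypothetical helper: timestamp of the most recent MediaWiki deploy."""
    raise NotImplementedError

def route_replication_alert(broken_replica_fraction):
    """Decide who gets paged for broken replication across a section."""
    targets = []
    if broken_replica_fraction >= 0.25:       # aggregate threshold from above
        targets.append("ops-oncall")
    if time.time() - last_deploy_time() < DEPLOY_WINDOW:
        targets.append("code-releasers")      # likely caused by the deploy
    return targets
```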

We have to review how each individual person reacts to pages (aka the human factor). If we have perfect alert coverage but no one reacts to it, it is useless. We should know how people react to pages and why they get ignored (not one's ownership, not having enough knowledge of other services, not having enough documentation). With the growth of the ops team, maybe it no longer makes sense to have all of ops paged for all alerts; maybe it makes sense to have a large enough pool of people to make sure at least someone can react. Could we have some kind of escalation? E.g. owners or a random subgroup > ops in the right timezone > all ops. Could we have some kind of rotation? (This is a general idea, not only for databases.) Should we have some kind of "post-mortem" of monthly alerts, not to blame anyone, but to identify weaknesses (e.g. "there is instability on Varnish 5.0, we should pour more resources there", "there are too many false positives, can we do something about it?").

This is hard to address, as if you are not quite familiar with the service you might be worried that you could make things worse.
Escalating is probably safer than touching it, but again, that depends on each individual and their experience and confidence. As the infrastructure grows, this will get harder and harder.

Ideally, we should have auto-remedies in place for most of the basic things. Not everything can be fixed with an auto-remedy, but it should only page if it is _really_ important. Otherwise, it would either fix itself, or some mechanism should be in place to make it irrelevant until the owner comes online.
There are things this doesn't apply to, e.g. DB primary masters.
But ideally everything should be autonomous enough that no one needs to be disrupted out of core hours (or only rarely).
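
A minimal sketch of the "auto-remedy first, page only if really important" flow; depool() and is_master() are hypothetical placeholders rather than existing tooling:

```
#!/usr/bin/env python3
"""Sketch: try an automatic depool before paging a human for a broken replica."""

def is_master(host):
    """Hypothetical: primaries are excluded from automatic remediation."""
    raise NotImplementedError

def depool(host):
    """Hypothetical: remove the replica from the MediaWiki load balancer."""
    raise NotImplementedError

def handle_unhealthy_replica(host):
    """Auto-remedy a broken replica; only escalate to a page when we must."""
    if is_master(host):
        return "PAGE"    # primaries stay human-handled, as noted above
    try:
        depool(host)
        return "TICKET"  # owner picks it up during working hours
    except Exception:
        return "PAGE"    # remediation failed; a human is needed
```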

We should review how actionable alerts are *in practice*. I do not care if the datacenter is on fire if I cannot do anything about it, no matter how large the issue is. I do not think many of our checks are actionable, and in some cases they can lead to worse issues if acted on literally. The write-back (WB) cache being in write-through (WT) mode is bad for replication performance, but forcing WB with a broken BBU is even worse.

Agreed, but that also depends on the situation: who is managing it, whether it is a weekend or not, the time of the day, whether you can call someone, whether you can do something about it.
This comes back again to the fact that we should have enough capacity/systems in place not to care about a single host :-)

Change 384183 merged by Rush:
[operations/puppet@production] base/icinga: if mysql is in labtest never send pages

https://gerrit.wikimedia.org/r/384183

Change 538837 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: make core_test hosts not page on replication/process issues

https://gerrit.wikimedia.org/r/538837

Change 538837 merged by Jcrespo:
[operations/puppet@production] mariadb: make core_test hosts not page on replication/process issues

https://gerrit.wikimedia.org/r/538837

Change 539064 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: Disable paging for mariadb disk space on core test hosts

https://gerrit.wikimedia.org/r/539064

Change 539064 merged by Jcrespo:
[operations/puppet@production] mariadb: Disable paging for mariadb disk space on core test hosts

https://gerrit.wikimedia.org/r/539064

jcrespo claimed this task.

With this https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/595153/ I think we have the alerting system as we want it. I am going to declare this resolved, not because no further improvements are possible, but because for the first time in years we have the coverage and the paging that we prefer.

More checks will be needed, and fixes to existing ones, but based on the scope of the original write-up, I think we are done.

Further improvements will be done in separate tickets, but there are no concrete ideas right now.