In September 2017, 50% of ops-related pages came from databases, yet the only real outage didn't page. We need a deep review of our monitoring practices to understand how we monitor our databases and whether we should change our model, not only the technology, but also the human model behind it.
For databases, some actionable steps are:
- Application-induced bad states (code problems) should be better exposed to the people who can do something about them, not to ops (T177778). E.g. if a server starts lagging and it is not a MySQL issue (broken replication, a hardware problem, strange query patterns), it makes no sense to send that information to operators; it should go to code releasers and the many code owners (for trivial issues causing lag, such as maintenance). A routing sketch follows the list below.
- Single server issues should not page. MediaWiki should be reliable enough that if a single server starts lagging or disappears, it uses the rest of the servers to continue operating. This is not 100% the case right now: servers that are lagged or inaccessible keep being tried for every query (or a percentage of them), and in some cases requests start timing out (as happened during network maintenance). We can accept some small SPOFs for the time being (such as the master), but we shouldn't have a SPOF for each database server; failed servers should be really depooled. Independently of that, single servers should never page. Can we use Prometheus for those aggregations? Should we create scripts that are topology-aware? We can keep paging for masters, due to their importance, but we definitely should do something about the large number of replicas we have (see the aggregation sketch after this list).
- Move paging checks to service checks. In a distributed environment this is not trivial (due to network splits, etc.), but even an imperfect solution will be better than the current model. A single check per role and per shard, based on MediaWiki's view of the databases, would be enough to verify that the databases are working correctly even if some individual hosts have issues (see the service-check sketch after this list). Checks like the number of "mysql" processes can be kept, but they are normally useless: when they fire, it is usually a new host or a planned restart, while crashed hosts are usually restarted automatically, so real crashes are not caught.
- Once individual checks no longer page, we can add more subtle checks like the ones proposed in T172489#3501437.
- We have to review how each individual person reacts to pages (the human factor). If we have perfect alert coverage but no one reacts to it, it is useless. We should understand how people react to pages and why they get ignored (alerts outside one's ownership, not having enough knowledge of other services, not having enough documentation). With the growth of the ops team, maybe it no longer makes sense to page all of ops for every alert; maybe it makes sense to have a pool of people large enough to ensure that at least someone can react. Could we have some kind of escalation? E.g. owners or a random subgroup > ops in the right timezone > all ops; could we have some kind of rotation? (This is a general idea, not only for databases.) Should we have some kind of monthly "post-mortem" of alerts, not to blame anyone, but to identify weaknesses (e.g. "there is instability in Varnish 5.0, we should pour more resources there", "there are too many false positives, can we do something about it?").
- We should review how actionable *in practice* our alerts are. I do not care if the datacenter is on fire if I cannot do anything about it, no matter how large the issue is. I do not think many of our checks are actionable, and in some cases they can lead to worse issues if acted upon literally: the RAID write-back (WB) cache being in write-through (WT) mode is bad for replication performance, but a broken BBU with write-back forced is even worse (see the actionability sketch after this list).
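As a rough illustration of the routing idea from the first point, here is a minimal sketch (all helper and target names are made up, none of this exists today) of how a lag alert could first be classified and then sent to ops or to code owners depending on its cause:

```python
#!/usr/bin/env python3
"""Hypothetical sketch: route replication-lag alerts to the people who can
act on them. The helpers are placeholders; the point is that a lag alert
should be classified (MySQL/hardware problem vs. code-induced lag) before
deciding who gets notified."""

from enum import Enum, auto


class LagCause(Enum):
    REPLICATION_BROKEN = auto()   # MySQL-level problem -> ops
    HARDWARE = auto()             # failing disk, BBU, etc. -> ops
    QUERY_PATTERN = auto()        # bad query or maintenance job -> code owners


def classify_lag(host: str) -> LagCause:
    """Placeholder: in reality this would look at SHOW SLAVE STATUS,
    hardware state and the currently running queries."""
    raise NotImplementedError


def route_lag_alert(host: str, lag_seconds: float) -> str:
    """Return the notification target for a lag alert on `host`."""
    cause = classify_lag(host)
    if cause in (LagCause.REPLICATION_BROKEN, LagCause.HARDWARE):
        return "ops-oncall"    # ops can actually fix these
    # Code-induced lag: notify deployers / code owners instead of paging ops.
    return "code-owners"
```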
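For the per-shard aggregation, a minimal topology-aware sketch could look like the following; the topology, host names and threshold are made-up examples, and the health data could come from Prometheus or from the MediaWiki db config:

```python
#!/usr/bin/env python3
"""Hypothetical sketch of a topology-aware aggregation: a single unhealthy
replica never pages, but losing a master or a large fraction of a shard's
replicas does."""

# Example topology: shard -> role -> list of hosts (made-up names).
TOPOLOGY = {
    "s1": {"master": ["db1001"], "replica": ["db1002", "db1003", "db1004"]},
    "s2": {"master": ["db2001"], "replica": ["db2002", "db2003"]},
}

# Minimum fraction of healthy replicas per shard before we page (assumption).
REPLICA_HEALTH_THRESHOLD = 0.5


def page_worthy_problems(shard: str, healthy: dict[str, bool]) -> list[str]:
    """Return the list of page-worthy problems for a shard, given a
    host -> is_healthy map (e.g. built from Prometheus query results)."""
    problems = []

    # Masters are still a SPOF, so keep paging for them individually.
    for master in TOPOLOGY[shard]["master"]:
        if not healthy.get(master, False):
            problems.append(f"{shard}: master {master} is down")

    # Replicas only page in aggregate.
    replicas = TOPOLOGY[shard]["replica"]
    up = sum(1 for host in replicas if healthy.get(host, False))
    if up / len(replicas) < REPLICA_HEALTH_THRESHOLD:
        problems.append(f"{shard}: only {up}/{len(replicas)} replicas healthy")

    return problems


if __name__ == "__main__":
    # One lagging replica in s1: no page. Master down in s2: page.
    print(page_worthy_problems("s1", {"db1001": True, "db1002": False,
                                      "db1003": True, "db1004": True}))
    print(page_worthy_problems("s2", {"db2001": False, "db2002": True,
                                      "db2003": True}))
```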
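For the service-check idea, one possible shape is a single check per (shard, role) that runs a trivial read query through whatever entry point MediaWiki itself uses; the endpoints and monitoring user below are assumptions, and PyMySQL is just one possible client:

```python
#!/usr/bin/env python3
"""Hypothetical sketch of a service-level check: page only if a whole
(shard, role) is unusable, not for individual hosts. Exit codes follow the
Icinga/Nagios convention."""

import sys

import pymysql  # assumption: PyMySQL is available on the monitoring host

# Made-up service endpoints; in reality these would come from the
# MediaWiki db config.
SERVICE_ENDPOINTS = {
    ("s1", "replica"): ("s1-replica.db.example.org", 3306),
    ("s1", "master"): ("s1-master.db.example.org", 3306),
}


def check_service(shard: str, role: str) -> bool:
    """Return True if a trivial read query works through the service."""
    host, port = SERVICE_ENDPOINTS[(shard, role)]
    try:
        conn = pymysql.connect(host=host, port=port, user="monitor",
                               connect_timeout=5, read_timeout=5)
        with conn.cursor() as cur:
            cur.execute("SELECT 1")
            return cur.fetchone() == (1,)
    except pymysql.MySQLError:
        return False


if __name__ == "__main__":
    shard, role = sys.argv[1], sys.argv[2]
    if check_service(shard, role):
        print(f"OK: {shard}/{role} is serving queries")
        sys.exit(0)
    print(f"CRITICAL: {shard}/{role} is not serving queries")
    sys.exit(2)
```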
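And for actionability, a sketch of how the WB/WT check could take the BBU state into account, so that the alert always suggests something the responder can actually do; the controller-reading helpers are placeholders that would parse the vendor tool's output:

```python
#!/usr/bin/env python3
"""Hypothetical sketch of an actionable RAID cache check: the alert text
describes what the responder should do, and never pushes towards a worse
action (forcing write-back with a broken BBU)."""


def get_cache_policy(controller: int) -> str:
    """Placeholder: return 'WB' (write-back) or 'WT' (write-through)."""
    raise NotImplementedError


def bbu_is_healthy(controller: int) -> bool:
    """Placeholder: return True if the battery-backed cache unit is OK."""
    raise NotImplementedError


def raid_cache_alert(controller: int) -> tuple[int, str]:
    """Return (exit_code, message) following the Icinga/Nagios convention."""
    policy = get_cache_policy(controller)
    bbu_ok = bbu_is_healthy(controller)

    if policy == "WB" and bbu_ok:
        return 0, "OK: write-back cache enabled, BBU healthy"
    if policy == "WT" and bbu_ok:
        # Actionable: the responder can safely re-enable write-back.
        return 1, "WARNING: write-through mode with healthy BBU; re-enable WB"
    if not bbu_ok:
        # The fix here is a battery swap, not a cache-policy change:
        # forcing WB with a broken BBU risks data loss on power failure.
        return 1, "WARNING: BBU failed; replace battery, do NOT force WB"
    return 3, "UNKNOWN: unexpected controller state"
```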