
Improve database application performance monitoring visibility
Closed, Resolved, Public

Description

I sometimes hear from platform and performance team members (especially @Legoktm and @aaron) that they need to know when servers are lagging, because in some cases the cause is not an infrastructure issue but new code causing too much load, concurrency issues, or some other logic-level reason. I suppose this is especially interesting to them for making multi-datacenter possible.

This ticket, which does not yet have clear actionables, is a way to start a conversation and coordinate with them to produce alerts and metrics that are useful to them; however, we cannot send a page to all ops every time the application layer has a temporary hiccup. We can use existing technology (like Prometheus), introduce new tools, or change our model of monitoring/metrics/alerting/observability. It is especially important to separate "things are up but degraded" from "things are down".

Too many metrics/alerts will make people ignore them rather than focus on the important stuff. We also have to clearly identify false positives (for example, when a server goes down it is normal for it to create lag, but it is also depooled). This has been partially done with the MediaWiki-based monitoring on Graphite.
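
To make the distinction above concrete, here is a minimal sketch of how a check could classify a database server. It is only an illustration for discussion, not existing monitoring code: the names `ReplicaStatus`, `Severity`, `classify` and the 5-second threshold are all hypothetical, and it assumes the check already knows whether the server is pooled.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    OK = "ok"              # nothing to report
    DEGRADED = "degraded"  # up but lagging: notify code owners, do not page
    DOWN = "down"          # unreachable: infrastructure-level alert


@dataclass
class ReplicaStatus:
    reachable: bool      # did the health check manage to connect?
    pooled: bool         # is the server currently serving traffic?
    lag_seconds: float   # replication lag; ignored when unreachable


# Hypothetical threshold, chosen only for illustration.
LAG_THRESHOLD_SECONDS = 5.0


def classify(status: ReplicaStatus) -> Severity:
    """Map one server's status to an alert severity."""
    if not status.reachable:
        return Severity.DOWN
    # A depooled server is expected to lag (for example while it catches
    # up after going down), so lag on it is treated as a false positive.
    if not status.pooled:
        return Severity.OK
    if status.lag_seconds > LAG_THRESHOLD_SECONDS:
        return Severity.DEGRADED
    return Severity.OK
```

With this shape, only `Severity.DOWN` would be a candidate for paging ops, while `Severity.DEGRADED` could go to a dashboard or a non-paging channel for the people who own the code.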

Event Timeline

Aklapper renamed this task from Improve database aplication performance monitoring visibility to Improve database application performance monitoring visibility. Oct 12 2017, 10:26 AM
jcrespo assigned this task to aaron.

Some actionables were done: Graphite lag is now visible on Grafana. And of course, we have MySQL Prometheus dashboards. Logstash monitoring has also improved, with increased visibility.

I think we can resolve this and continue working on database monitoring metrics in T143896, unless someone else wants to bring up more concrete actionables.