I have occasionally heard from Platform and Performance team members (especially @Legoktm and @aaron) that they need to know when servers are lagging, because in some cases the cause is not an infrastructure issue but new code causing too much load, concurrency issues, or some other logic-level reason. I suppose this is especially relevant to them for making multi-datacenter possible.
This ticket, which does not yet have clear actionables, is a way to start a conversation and coordinate with them to produce alerts/metrics that are useful to them; however, we cannot page all of ops every time the application layer has a temporary hiccup. We can use existing technology (like Prometheus), introduce new tools, or change our model of monitoring/metrics/alerting/observability. It is especially important to distinguish "things are up but degraded" from "things are down".
Too many metrics/alerts will make people ignore them rather than focus on the important stuff. We also have to clearly identify false positives (for example, when a server goes down it is normal for lag to appear, but the server is also depooled). This has been partially done with the MediaWiki-based monitoring on Graphite.
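
To make the discussion more concrete, here is a rough sketch of the kind of classification described above: lag on a depooled server is treated as a false positive, and "degraded" is kept separate from "down". Everything in it is an assumption for illustration only: the `Replica` fields, the `classify` function, and the threshold values are hypothetical and do not reflect any existing check or agreed numbers.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical thresholds in seconds; real values would need to be agreed on.
DEGRADED_LAG_S = 5.0
DOWN_LAG_S = 60.0


@dataclass
class Replica:
    name: str
    lag_s: Optional[float]  # None means lag could not be measured
    pooled: bool            # False means the server is depooled


def classify(replica: Replica) -> str:
    """Return 'ignore', 'ok', 'degraded' or 'down' for one replica."""
    # A depooled server is expected to lag; do not alert on it.
    if not replica.pooled:
        return "ignore"
    # Unmeasurable or very high lag on a pooled server counts as down.
    if replica.lag_s is None or replica.lag_s >= DOWN_LAG_S:
        return "down"
    # Moderate lag: up but degraded, worth a non-paging alert or metric.
    if replica.lag_s >= DEGRADED_LAG_S:
        return "degraded"
    return "ok"
```

With something like this, only the "down" state would be a candidate for paging ops, while "degraded" could go to a dashboard or a non-paging channel that the interested teams watch.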