Create dashboards for stat servers
Closed, Resolved (Public)

Description

Per the parent ticket, the stat servers have been very unstable. A dashboard should surface information that can help us stabilize them.

Creating this ticket to build a dashboard that gives us more visibility into the stat servers' existing metrics.

Event Timeline

I've created the dashboard, but it's very much a WIP. Feel free to add suggestions here!

bking triaged this task as Medium priority. Aug 27 2024, 10:11 PM

I've made some adjustments to the dashboard; this link displays the cgroup metrics during the outage (approx. 1000-1300 UTC). Note that while the runaway cgroups arinaigum, trokhymovych, and rmaung are correctly identified, the much higher spike attributed to ifrahkh doesn't actually represent a problem.

All this to say that we don't have full confidence we're using the Jupyter notebook cgroup metrics correctly at the moment. Regardless, we believe that the combination of system load metrics and the (admittedly imperfect) notebook metrics should significantly help visibility and troubleshooting. As such, I'm closing out this ticket. Work on stabilizing the hosts continues in T372416 ...