Per this Slack thread, stat1011 went unresponsive today. As such, we need to continue improving the developer experience on stat hosts (as started in T373446). This includes:
- Tuning alert thresholds (we did not get an alert even after the host was unresponsive for some time). It might be a good idea to send alerts to our shared Slack channels as well.
- Performance optimizations and workload protections via cgroups (resource controls).
- Update dashboards to include user-specific I/O stats. (Removed as AC, because it was memory, not I/O, that turned out to be the scarce resource.)