The stats servers frequently become unresponsive (see this Slack thread for a summary of all the recent outages). Each time a host becomes unresponsive, the disruption affects multiple developers' workflows.
As of now, we are addressing this reactively: developers ping SREs, who then log in to a barely responsive box and manually find and kill processes.
Creating this ticket to add alerts for high load, so that the SRE team will at least be aware of the situation and perhaps be able to mitigate it more quickly.
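
As a starting point for discussion, here is a rough sketch of what a high-load check could look like. It compares the 5-minute load average per CPU against a threshold. The function names, the threshold of 2.0, and the choice of the 5-minute window are all assumptions for illustration, not settings from our monitoring stack; the real alert would presumably be defined in whatever alerting system the SRE team already uses.

```python
import os


def load_per_cpu(load_avg, cpu_count):
    """Normalize a load average by the number of CPUs (at least 1)."""
    return load_avg / max(cpu_count, 1)


def should_alert(load_avg, cpu_count, threshold=2.0):
    """Return True when per-CPU load meets or exceeds the threshold.

    The default threshold of 2.0 is a placeholder, not a tuned value.
    """
    return load_per_cpu(load_avg, cpu_count) >= threshold


if __name__ == "__main__":
    # os.getloadavg() returns the 1-, 5-, and 15-minute load averages (Unix only).
    _one_min, five_min, _fifteen_min = os.getloadavg()
    cpus = os.cpu_count() or 1
    if should_alert(five_min, cpus):
        print(f"ALERT: 5-minute load {five_min:.2f} on {cpus} CPUs")
    else:
        print(f"OK: 5-minute load {five_min:.2f} on {cpus} CPUs")
```

Normalizing by CPU count would let the same threshold apply to hosts of different sizes, and using the 5-minute average rather than the 1-minute one should reduce noise from short spikes.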