The status quo is that whenever an individual frack server dies, a page is sent to every opsen. What usually happens is that a bunch of people show up on IRC and ask "what is alnilam?", and when they eventually realize that it's a frack host, they ignore it, since they can't do much about it anyway.
This paging strategy made sense years ago, but the org has since evolved in several ways: the TechOps team has grown to the point that most people don't know much about frack and don't have access to it; fr-tech is now a vertical; and fr-tech-ops is now a team with some basic redundancy.
I think we should revisit our paging strategy for frack:
- Should we page for individual server failures rather than service failures?
- Should we page differently during different periods of the year (e.g. the busy end-of-year fundraising period)?
- Who should we page? Are random opsens more valuable than e.g. fr-tech software engineers at this point?
My goal is to have something that makes sense and that leads people to treat pages (fr-tech or not) as serious, actionable events, rather than training them to see pages as something they can ignore (i.e. alert fatigue).