Follow-up from T404047: `ssh login.toolforge.org` was failing because NFS on the bastion was wedged, and there were no alerts raised for the issue. This task tracks improving the situation.
Possible approaches:
- Extend the NFS checks (based on the count of processes in D state) from nfs-workers to bastions (i.e. add to https://gitlab.wikimedia.org/repos/cloud/toolforge/alerts/-/blob/main/kubernetes/worker_stuck.yaml?ref_type=heads)
- Cover "users can't ssh into bastion" as part of the toolforge-wide checks via toolschecker (cf. T313030, T357977 and the older attempt at https://gerrit.wikimedia.org/r/c/operations/puppet/+/755321/ too)
I (Filippo) think going for the second approach will be more future-proof since it covers the actual user experience, although it requires more work up front.
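For illustration, a user-experience check along the lines of the second approach could look roughly like the sketch below. This is not how toolschecker is implemented; the function names `build_ssh_command` and `probe`, the remote command `test -d "$HOME"`, and the timeout values are all assumptions made for the example. The idea is that a login which hangs on a wedged NFS home directory shows up as a timeout rather than a clean exit.

```python
import subprocess
from typing import Sequence


def build_ssh_command(host: str, connect_timeout: int = 10) -> list[str]:
    """Build a non-interactive ssh invocation (hypothetical helper).

    BatchMode=yes makes ssh fail instead of prompting for a password;
    ConnectTimeout bounds the TCP/handshake phase only.
    The remote command touches $HOME, which is assumed to live on NFS.
    """
    return [
        "ssh",
        "-o", "BatchMode=yes",
        "-o", f"ConnectTimeout={connect_timeout}",
        host,
        'test -d "$HOME"',
    ]


def probe(cmd: Sequence[str], timeout: float) -> bool:
    """Return True only if cmd exits 0 within `timeout` seconds.

    A hang (for example a login blocked on wedged NFS) is reported as
    a failure, since subprocess.run kills the child on timeout.
    """
    try:
        result = subprocess.run(
            cmd,
            stdin=subprocess.DEVNULL,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode == 0
```

A check endpoint could then call something like `probe(build_ssh_command("login.toolforge.org"), timeout=30)` and alert when it returns `False`. The overall timeout matters more than `ConnectTimeout` here, because with wedged NFS the connection itself may succeed while the session hangs afterwards.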