Yesterday, during the Ganeti outage (https://wikitech.wikimedia.org/wiki/Incident_documentation/20170629-ganeti), it was discovered that many management interfaces were not reachable. There are several levels of monitoring possible, in increasing order of implementation difficulty:
1. ICMP ping
2. Port 22 reachable
3. SSH working (without actually authenticating)
4. Remote console working (i.e. issue console com2 or vsp)
The easiest is probably the third level (SSH working), via an Icinga check (e.g. every half hour) that the SSH handshake happens and authentication would be possible. The fourth level (remote console working) would be very desirable but is definitely more difficult, as it needs either the management password or public key authentication to be implemented (T113557).
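
As a rough illustration of the "SSH working, without authenticating" level, here is a minimal Python sketch of what such a check could look like. It only opens a TCP connection and waits for the server's SSH identification line (`SSH-2.0-...`), so it covers the first step of the handshake and does not prove that key exchange or authentication would succeed. The function names `parse_banner` and `check_mgmt_ssh`, the timeout, and the output wording are all assumptions made for this example, not an existing check; only the 0 = OK / 2 = CRITICAL exit code convention is the standard one for Icinga plugins.

```python
import socket

# Standard Nagios/Icinga plugin exit codes.
OK, CRITICAL = 0, 2

# Servers may send a few free-text lines before the identification line,
# so look at more than just the first line (limit is an arbitrary choice).
MAX_PREAMBLE_LINES = 10


def parse_banner(line: bytes):
    """Return the protocol version from an SSH identification line, or None."""
    text = line.decode("ascii", errors="replace").strip()
    parts = text.split("-", 2)  # "SSH-2.0-OpenSSH_7.4" -> ["SSH", "2.0", ...]
    if len(parts) < 3 or parts[0] != "SSH":
        return None
    return parts[1]


def check_mgmt_ssh(host: str, port: int = 22, timeout: float = 10.0):
    """Connect to host:port and report whether an SSH banner is received.

    Returns a (status_code, message) tuple in Icinga plugin style.
    No authentication is attempted.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            reader = sock.makefile("rb")
            for _ in range(MAX_PREAMBLE_LINES):
                line = reader.readline(256)
                if not line:
                    break  # connection closed before any banner arrived
                version = parse_banner(line)
                if version is not None:
                    return OK, f"OK: {host}:{port} speaks SSH protocol {version}"
    except OSError as exc:  # covers refused connections and timeouts
        return CRITICAL, f"CRITICAL: {host}:{port} not reachable ({exc})"
    return CRITICAL, f"CRITICAL: {host}:{port} sent no SSH banner"
```

To use this as a plugin, a small wrapper would call `check_mgmt_ssh()` with the management hostname, print the message and exit with the returned status code. A connection failure here also covers the "port 22 reachable" level, since the TCP connect is the first thing to fail.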