We have a fairly steady rate of alerts firing every day for PROBLEM - SSH on $HOSTNAME.mgmt is CRITICAL.
I've extracted some statistics from IRC logs since 2022-01-01:
- 378 total alerts
- 4.7 per day on average
- 17 days without any alert
- 63 days with at least one alert -> 6 alerts on average per day that had alerts
- 33 unique hostnames (so it seems to affect specific hosts, see below), each with between 1 and 73 occurrences:
$ grep -o '.*\.mgmt is CRITICAL' \#wikimedia-operations.2022-*.log | grep PROBLEM | cut -d" " -f8 | sort | uniq -c | sort -n
      1 db1161.mgmt
      1 ganeti1023.mgmt
      1 ms-fe2008.mgmt
      1 mw1453.mgmt
      1 mw1454.mgmt
      1 mw1455.mgmt
      1 mw1456.mgmt
      1 thumbor2003.mgmt
      1 thumbor2005.mgmt
      1 wtp1041.mgmt
      2 aqs1007.mgmt
      2 aqs1009.mgmt
      3 analytics1067.mgmt
      4 kubernetes1002.mgmt
      4 kubernetes2001.mgmt
      5 db2086.mgmt
      5 thumbor2004.mgmt
      8 aqs1008.mgmt
      8 db2083.mgmt
      8 wtp1026.mgmt
      9 dumpsdata1002.mgmt
      9 restbase2011.mgmt
     10 wtp1027.mgmt
     12 restbase2010.mgmt
     16 db2090.mgmt
     17 mw2252.mgmt
     17 mw2254.mgmt
     23 contint1001.mgmt
     25 analytics1063.mgmt
     30 mw2257.mgmt
     32 dns5001.mgmt
     45 mw2258.mgmt
     73 kubernetes1004.mgmt
The same hostnames, sorted alphabetically:
$ grep -o '.*\.mgmt is CRITICAL' \#wikimedia-operations.2022-*.log | grep PROBLEM | cut -d" " -f8 | sort | uniq
analytics1063.mgmt
analytics1067.mgmt
aqs1007.mgmt
aqs1008.mgmt
aqs1009.mgmt
contint1001.mgmt
db1161.mgmt
db2083.mgmt
db2086.mgmt
db2090.mgmt
dns5001.mgmt
dumpsdata1002.mgmt
ganeti1023.mgmt
kubernetes1002.mgmt
kubernetes1004.mgmt
kubernetes2001.mgmt
ms-fe2008.mgmt
mw1453.mgmt
mw1454.mgmt
mw1455.mgmt
mw1456.mgmt
mw2252.mgmt
mw2254.mgmt
mw2257.mgmt
mw2258.mgmt
restbase2010.mgmt
restbase2011.mgmt
thumbor2003.mgmt
thumbor2004.mgmt
thumbor2005.mgmt
wtp1026.mgmt
wtp1027.mgmt
wtp1041.mgmt
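For reference, the per-day figures in the list above (total alerts, days with at least one alert, average per such day) could be reproduced with something like the sketch below. It is not the exact command I ran. The function name `alert_day_stats` is made up for illustration, and it assumes every matching log line carries a YYYY-MM-DD date somewhere, either in the file name prefix that grep adds or in the message timestamp.

```shell
#!/bin/sh
# Sketch: per-day statistics for ".mgmt is CRITICAL" PROBLEM alerts.
# Reads log lines on stdin.
# ASSUMPTION: each matching line contains a YYYY-MM-DD date; the first
# one found on the line is used as the day of the alert.
alert_day_stats() {
    awk '
        /PROBLEM/ && /\.mgmt is CRITICAL/ {
            # Explicit digit classes instead of {4}/{2} intervals so that
            # older awk implementations without interval support also work.
            if (match($0, /[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]/)) {
                day = substr($0, RSTART, RLENGTH)
                if (!(day in seen)) { seen[day] = 1; days++ }
                total++
            }
        }
        END {
            if (days > 0)
                printf "total=%d days_with_alerts=%d avg_per_alert_day=%.1f\n", total, days, total / days
            else
                print "total=0 days_with_alerts=0"
        }
    '
}
```

It can be fed with e.g. `grep -H PROBLEM \#wikimedia-operations.2022-*.log | alert_day_stats`. The number of days without any alert is then the length of the period in days minus `days_with_alerts`.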
I can check whether I can get the iDRAC versions of those hosts automatically via Redfish, provided they are recent enough to support the required Redfish API.
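A minimal sketch of what that Redfish query might look like, under several assumptions: the hosts expose the manager resource at Dell's usual path `/redfish/v1/Managers/iDRAC.Embedded.1`, credentials come from the hypothetical environment variables `REDFISH_USER` and `REDFISH_PASS`, and the management interfaces use certificates that curl cannot verify (hence `-k`). The function names are illustrative, and the sed-based extraction is only a stopgap; a proper JSON parser such as jq would be more robust where it is available.

```shell
#!/bin/sh
# Sketch: read the iDRAC firmware version of a host via Redfish.

# Pull the "FirmwareVersion" string out of a Redfish Manager JSON document
# read from stdin. Naive pattern match, not a real JSON parser.
extract_firmware_version() {
    sed -n 's/.*"FirmwareVersion"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Query one management host (argument: resolvable management hostname).
# ASSUMPTIONS: Dell manager path, basic auth from REDFISH_USER/REDFISH_PASS
# (hypothetical variable names), TLS verification disabled with -k.
idrac_firmware_version() {
    host=$1
    curl -sk --max-time 10 -u "${REDFISH_USER}:${REDFISH_PASS}" \
        "https://${host}/redfish/v1/Managers/iDRAC.Embedded.1" \
        | extract_firmware_version
}
```

Calling `idrac_firmware_version` in a loop over the management FQDNs of the 33 hosts would give one version per host. An empty result would suggest the host either does not support Redfish or could not be reached.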