
Management interface SSH icinga alerts
Open, Medium, Public

Description

We have a fairly steady rate of alerts firing every day for PROBLEM - SSH on $HOSTNAME.mgmt is CRITICAL.

I've extracted some statistics from IRC logs since 2022-01-01:

  • 378 total alerts
  • 4.7 per day on average
  • 17 days without any alert
  • 63 days with at least one alert -> an average of 6 alerts per day on days with alerts
  • 33 unique hostnames (so it seems to affect some specific hosts, see below), with recurrences between 1 and 73:
$ grep -o '.*\.mgmt is CRITICAL' \#wikimedia-operations.2022-*.log  | grep PROBLEM | cut -d" " -f8 | sort | uniq -c| sort -n
   1 db1161.mgmt
   1 ganeti1023.mgmt
   1 ms-fe2008.mgmt
   1 mw1453.mgmt
   1 mw1454.mgmt
   1 mw1455.mgmt
   1 mw1456.mgmt
   1 thumbor2003.mgmt
   1 thumbor2005.mgmt
   1 wtp1041.mgmt
   2 aqs1007.mgmt
   2 aqs1009.mgmt
   3 analytics1067.mgmt
   4 kubernetes1002.mgmt
   4 kubernetes2001.mgmt
   5 db2086.mgmt
   5 thumbor2004.mgmt
   8 aqs1008.mgmt
   8 db2083.mgmt
   8 wtp1026.mgmt
   9 dumpsdata1002.mgmt
   9 restbase2011.mgmt
  10 wtp1027.mgmt
  12 restbase2010.mgmt
  16 db2090.mgmt
  17 mw2252.mgmt
  17 mw2254.mgmt
  23 contint1001.mgmt
  25 analytics1063.mgmt
  30 mw2257.mgmt
  32 dns5001.mgmt
  45 mw2258.mgmt
  73 kubernetes1004.mgmt

Same hostnames grouped by hostname:

$ grep -o '.*\.mgmt is CRITICAL' \#wikimedia-operations.2022-*.log  | grep PROBLEM | cut -d" " -f8 | sort | uniq
analytics1063.mgmt
analytics1067.mgmt
aqs1007.mgmt
aqs1008.mgmt
aqs1009.mgmt
contint1001.mgmt
db1161.mgmt
db2083.mgmt
db2086.mgmt
db2090.mgmt
dns5001.mgmt
dumpsdata1002.mgmt
ganeti1023.mgmt
kubernetes1002.mgmt
kubernetes1004.mgmt
kubernetes2001.mgmt
ms-fe2008.mgmt
mw1453.mgmt
mw1454.mgmt
mw1455.mgmt
mw1456.mgmt
mw2252.mgmt
mw2254.mgmt
mw2257.mgmt
mw2258.mgmt
restbase2010.mgmt
restbase2011.mgmt
thumbor2003.mgmt
thumbor2004.mgmt
thumbor2005.mgmt
wtp1026.mgmt
wtp1027.mgmt
wtp1041.mgmt

I'll check whether I can get the iDRAC versions of those hosts automatically via Redfish, provided they are recent enough to support the required Redfish API.
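
For reference, the Redfish manager endpoint of a single host can also be spot-checked by hand. This is only a sketch, assuming curl and jq are available on a host with access to the mgmt network and that the standard root credentials work (db1161 is just one of the affected hosts):

$ # prompts for the iDRAC root password; -k skips TLS verification on the mgmt network
$ curl -sk -u root https://db1161.mgmt.eqiad.wmnet/redfish/v1/Managers/iDRAC.Embedded.1 \
    | jq -r '.FirmwareVersion'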

Event Timeline

Volans triaged this task as Medium priority. Mar 21 2022, 9:43 AM
Volans created this task.

I've run the following code to extract the firmware version (where possible):

import requests

# Re-enable weaker ciphers: some hosts otherwise failed with "SSL: DH_KEY_TOO_SMALL"
requests.packages.urllib3.util.ssl_.DEFAULT_CIPHERS += ':HIGH:!DH:!aNULL'


def get_version(hostname):
    """Return the iDRAC firmware version via Redfish, or an error string if the query fails."""
    netbox = spicerack.netbox_server(hostname)  # 'spicerack' is provided by the interactive shell
    redfish = spicerack.redfish(netbox.mgmt_fqdn, 'root')
    try:
        version = redfish.request('get', '/redfish/v1/Managers/iDRAC.Embedded.1').json()['FirmwareVersion']
    except Exception as e:
        version = f'unknown - {e}'
    return version


for host in hostnames:  # 'hostnames' is the list of affected hosts extracted above
    print(f'{host}: {get_version(host)}')

Those are the results:
ms-fe2008

analytics1063: 2.40.40.40
analytics1067: 2.40.40.40
aqs1007: unknown - Failed to perform GET request to https://aqs1007.mgmt.eqiad.wmnet/redfish/v1/Managers/iDRAC.Embedded.1
aqs1008: unknown - Failed to perform GET request to https://aqs1008.mgmt.eqiad.wmnet/redfish/v1/Managers/iDRAC.Embedded.1
aqs1009: unknown - Failed to perform GET request to https://aqs1009.mgmt.eqiad.wmnet/redfish/v1/Managers/iDRAC.Embedded.1
contint1001: unknown - GET https://contint1001.mgmt.eqiad.wmnet/redfish/v1/Managers/iDRAC.Embedded.1 returned HTTP 404
db1161: 4.40.00.00
db2083: 2.40.40.40
db2086: unknown - Failed to perform GET request to https://db2086.mgmt.codfw.wmnet/redfish/v1/Managers/iDRAC.Embedded.1
db2090: unknown - Failed to perform GET request to https://db2090.mgmt.codfw.wmnet/redfish/v1/Managers/iDRAC.Embedded.1
dns5001: 2.50.50.50
dumpsdata1002: unknown - Failed to perform GET request to https://dumpsdata1002.mgmt.eqiad.wmnet/redfish/v1/Managers/iDRAC.Embedded.1
ganeti1023: 5.00.00.00
kubernetes1002: 2.30.30.30
kubernetes1004: 2.30.30.30
kubernetes2001: unknown - Failed to perform GET request to https://kubernetes2001.mgmt.codfw.wmnet/redfish/v1/Managers/iDRAC.Embedded.1
mw1453: 5.00.00.00
mw1454: 5.00.00.00
mw1455: 5.00.00.00
mw1456: 5.00.00.00
mw2252: 2.40.40.40
mw2254: unknown - Failed to perform GET request to https://mw2254.mgmt.codfw.wmnet/redfish/v1/Managers/iDRAC.Embedded.1
mw2257: 2.40.40.40
mw2258: unknown - Failed to perform GET request to https://mw2258.mgmt.codfw.wmnet/redfish/v1/Managers/iDRAC.Embedded.1
restbase2010: unknown - Failed to perform GET request to https://restbase2010.mgmt.codfw.wmnet/redfish/v1/Managers/iDRAC.Embedded.1
restbase2011: unknown - Failed to perform GET request to https://restbase2011.mgmt.codfw.wmnet/redfish/v1/Managers/iDRAC.Embedded.1
thumbor2003: unknown - Failed to perform GET request to https://thumbor2003.mgmt.codfw.wmnet/redfish/v1/Managers/iDRAC.Embedded.1
thumbor2004: unknown - Failed to perform GET request to https://thumbor2004.mgmt.codfw.wmnet/redfish/v1/Managers/iDRAC.Embedded.1
thumbor2005: unknown - Failed to perform GET request to https://thumbor2005.mgmt.codfw.wmnet/redfish/v1/Managers/iDRAC.Embedded.1
wtp1026: unknown - Failed to perform GET request to https://wtp1026.mgmt.eqiad.wmnet/redfish/v1/Managers/iDRAC.Embedded.1
wtp1027: 2.40.40.40
wtp1041: unknown - Failed to perform GET request to https://wtp1041.mgmt.eqiad.wmnet/redfish/v1/Managers/iDRAC.Embedded.1

To summarize:

2.30.30.30: kubernetes1002, kubernetes1004
2.40.40.40: analytics1063, analytics1067, db2083, mw2252, mw2257, wtp1027
2.50.50.50: dns5001
4.40.00.00: db1161
5.00.00.00: ganeti1023, mw1453, mw1454, mw1455, mw1456
Unknown version: aqs1007, aqs1008, aqs1009, contint1001, db2086, db2090, dumpsdata1002, kubernetes2001, mw2254, mw2258, restbase2010, restbase2011, thumbor2003, thumbor2004, thumbor2005, wtp1026, wtp1041
Adding the versions to the occurrence counts:
 1 db1161.mgmt - 4.40.00.00
 1 ganeti1023.mgmt - 5.00.00.00
 1 ms-fe2008.mgmt - 
 1 mw1453.mgmt - 5.00.00.00
 1 mw1454.mgmt - 5.00.00.00
 1 mw1455.mgmt - 5.00.00.00
 1 mw1456.mgmt - 5.00.00.00
 1 thumbor2003.mgmt - unknown
 1 thumbor2005.mgmt - unknown
 1 wtp1041.mgmt - unknown
 2 aqs1007.mgmt - unknown
 2 aqs1009.mgmt - unknown
 3 analytics1067.mgmt - 2.40.40.40
 4 kubernetes1002.mgmt - 2.30.30.30
 4 kubernetes2001.mgmt - unknown
 5 db2086.mgmt - unknown
 5 thumbor2004.mgmt - unknown
 8 aqs1008.mgmt - unknown
 8 db2083.mgmt - 2.40.40.40
 8 wtp1026.mgmt - unknown
 9 dumpsdata1002.mgmt - 2.30.30.30
 9 restbase2011.mgmt - unknown
10 wtp1027.mgmt - 2.40.40.40
12 restbase2010.mgmt - unknown
16 db2090.mgmt - unknown
17 mw2252.mgmt - 2.40.40.40
17 mw2254.mgmt - unknown
23 contint1001.mgmt - unknown
25 analytics1063.mgmt - 2.40.40.40
30 mw2257.mgmt - 2.40.40.40
32 dns5001.mgmt - 2.50.50.50
45 mw2258.mgmt - unknown
73 kubernetes1004.mgmt - 2.30.30.30

Looks like an iDRAC upgrade would be beneficial for the hosts running 2.50.50.50 and below, as well as those with an unknown version.

Hostname            | Server type | Old version | New version | Note
db2083.mgmt         | R630        | 2.40        | 2.63        | iDRAC needs reset to get it to 2.82
db2086.mgmt         | R630        | 2.80        | 2.82        |
db2090.mgmt         | R630        |             |             | iDRAC needs reset before upgrade
kubernetes2001.mgmt |             |             |             | Server decom
ms-fe2008.mgmt      |             |             |             | Server decom
mw2252.mgmt         | R430        | 2.40        | 2.63        | iDRAC needs reset
mw2254.mgmt         | R430        |             |             | iDRAC needs reset
mw2257.mgmt         | R430        | 2.40        | 2.63        | iDRAC needs reset
mw2258.mgmt         | R430        |             |             | iDRAC needs reset
restbase2010.mgmt   |             |             |             | Server decom
restbase2011.mgmt   |             |             |             | Server decom
thumbor2003.mgmt    |             |             |             |
thumbor2004.mgmt    |             |             |             |
thumbor2005.mgmt    |             |             |             |

I don't know whether this approach has already been tried, but in case it helps I can share that I've had some success resolving this alert on aqs1008.mgmt recently: T311042: aqs1008.mgmt interface SSH check flapping

I installed the ipmitool package on the host and executed sudo ipmitool mc reset cold locally. The alert has been stable for almost a week since then, and I've subsequently purged the package.

It's possible that this technique would work more widely across the fleet, but maybe there are lots of different failure modes involved so it may not be that helpful. I thought I'd share in case we want to try it.

I installed the ipmitool package on the host and executed sudo ipmitool mc reset cold locally.

FWIW, ipmitool is installed on the cumin and puppetmaster hosts:

https://debmonitor.wikimedia.org/packages/ipmitool

and it should be possible to reset the DRACs remotely:

https://wikitech.wikimedia.org/wiki/Management_Interfaces#From_remote_IPMI

That way you would not have to install software and remove it again.
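
For example, a remote cold reset run from one of those hosts could look something like the following (an untested sketch; db2090 is just one of the flapping hosts listed above, and -E reads the iDRAC root password from the IPMI_PASSWORD environment variable):

$ # reset the BMC of db2090 out-of-band from a cumin/puppetmaster host
$ ipmitool -I lanplus -H db2090.mgmt.codfw.wmnet -U root -E mc reset cold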

Also, freeipmi is installed fleet-wide.

Cmjohnson claimed this task.
Cmjohnson added a subscriber: Cmjohnson.

Management flapping will be an ongoing issue; there is no need to keep this ticket open. If problems persist, please create tasks tagged for the appropriate DC Ops location.

@Cmjohnson Could you please add some more information on why mgmt flapping will be an ongoing issue?

Also, I see this is tagged ops-eqiad and ops-codfw; we could remove those tags until there are more concrete and specific actions per DC. Would that work?

Btw, kubernetes1004, which is #1 in the results, has been decommissioned.

Some fresh stats (May, June, July):

$ grep -o '.*SSH.*\.mgmt is CRITICAL' \#wikimedia-operations.2022-0{5,6,7}.log  | grep PROBLEM | cut -d" " -f8 | sort | uniq -c| sort -n
   1 db1110.mgmt
   1 ms-be1059.mgmt
   1 ms-be2041.mgmt
   2 db1109.mgmt
   3 wtp1026.mgmt
   5 furud.mgmt
   5 mw1321.mgmt
   8 restbase1018.mgmt
   8 restbase2012.mgmt
  11 analytics1061.mgmt
  11 wtp1037.mgmt
  20 wtp1044.mgmt
  26 wtp1045.mgmt
  27 labweb1002.mgmt
  31 wtp1048.mgmt
  32 wtp1046.mgmt
  35 druid1006.mgmt
  44 wtp1039.mgmt
  48 wtp1025.mgmt
  51 wtp1038.mgmt
  60 wtp1036.mgmt
  70 cp5012.mgmt
  91 pki2001.mgmt
  92 wtp1040.mgmt
  97 aqs1008.mgmt

Rack distribution shows it's not related to a specific location:

'604': ['cp5012'],
'A6':  ['wtp1025', 'pki2001'],
'B6':  ['aqs1008'],
'B8':  ['analytics1061', 'wtp1036'],
'C5':  ['wtp1037', 'wtp1039', 'wtp1038'],
'C7':  ['wtp1040'],
'D3':  ['wtp1044', 'wtp1045'],
'D4':  ['labweb1002', 'wtp1048', 'wtp1046'],
'D6':  ['druid1006'],

In codfw we have seen flapping mgmt interfaces fixed by one of two actions:

  • firmware / DRAC upgrades
  • DRAC hard resets

Let's try this with restbase2012, ms-be2041 and pki2001 (cc: @Papaul); then the remaining hosts will be ops-eqiad only and this ticket will be specific to a single DC, with the exception of cp5012.

akosiaris removed projects: ops-codfw, ops-eqiad, DC-Ops.

@Cmjohnson Could you please add some more information on why mgmt flapping will be an ongoing issue?

Also, I see this is tagged ops-eqiad and ops-codfw; we could remove those tags until there are more concrete and specific actions per DC. Would that work?

Btw, kubernetes1004, which is #1 in the results, has been decommissioned.

@Cmjohnson, I'll be bold and reopen the task, while removing the ops-eqiad and ops-codfw tags, leaving the DC-Ops one for visibility. Once the investigation in this task comes up with more concrete actionables, we can file specific tasks tagged accordingly.

Note that wtp10XX hosts will be resolved by T307220.

Also, freeipmi is installed fleet-wide.

Thanks @Volans - I've confirmed that this worked on an unresponsive druid1006.mgmt.

sudo bmc-device --cold-reset; echo $?

...without having to install ipmitool and then uninstall it.
ref: https://wikitech.wikimedia.org/wiki/Management_Interfaces#From_local_IPMI

I did try the remote ipmitool option suggested by @Dzahn, but it failed to connect in order to send the reset command. Having the freeipmi package already installed, with its bmc-device command available, was helpful.
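
To double-check that the mgmt interface actually recovered after the cold reset, the same SSH port that Icinga monitors can be probed directly. A hypothetical check, assuming the usual mgmt FQDN pattern for druid1006:

$ # prints the SSH banner if the interface is healthy, or times out if it is still unresponsive
$ nc -w 5 druid1006.mgmt.eqiad.wmnet 22 < /dev/null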