
Remote IPMI doesn't work for ~2% of the fleet
Closed, Resolved (Public)

Description

We rely on remote IPMI in a lot of cases but still often have issues with it.

An audit of the reachability of IPMI across the fleet found numerous hosts for which remote IPMI is not working, where one is not able to perform a chassis status query from puppetmaster1001 via ipmitool.
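For context, the remote check is essentially the chassis status call sketched below (the same invocation used in the audits further down in this task; <hostname> is a placeholder, and -E makes ipmitool read the password from the IPMI_PASSWORD environment variable):

$ ipmitool -I lanplus -H <hostname>.mgmt.eqiad.wmnet -U root -E chassis status
Chassis Power is on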

Several issues have been identified:

  • an IPMI misconfiguration where in the Lan_Channel section the Volatile_Access_Mode and Non_Volatile_Access_Mode (runtime value and value after next reboot) are set to Disabled instead of Always_Available.
  • an IPMI misconfiguration where in the Lan_Channel section the Volatile_Channel_Privilege_Limit and Non_Volatile_Channel_Privilege_Limit (runtime value and value after next reboot) are set to Operator instead of Administrator.
  • IPMI passwords getting out of sync with their iDRAC passwords; ssh root@$hostname racadm config -g cfgUserAdmin -o cfgUserAdminPassword -i 2 $password usually fixes this (see the command sketch after this list)
  • BMCs being unresponsive to IPMI but responsive to SSH (a racadm racreset usually fixes this)
  • BMCs being responsive to ping but unresponsive to SSH (this needs a power drain/cycle)
  • BMCs being unresponsive to ping (this either needs a power drain/cycle, or network debugging, e.g. bad cable)
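For the first two misconfigurations and the password sync, a hedged sketch of the fix commands (they mirror the ones used later in this task; the config changes are run locally on the host as root, with bmc-config instead of ipmi-config on trusty, and <hostname>/<password> are placeholders):

# Re-enable LAN access (Disabled -> Always_Available), runtime and persistent value
$ ipmi-config --category=core --commit \
    --key-pair="Lan_Channel:Volatile_Access_Mode=Always_Available" \
    --key-pair="Lan_Channel:Non_Volatile_Access_Mode=Always_Available"

# Raise the channel privilege limit (Operator -> Administrator)
$ ipmi-config --category=core --commit \
    --key-pair="Lan_Channel:Volatile_Channel_Privilege_Limit=Administrator" \
    --key-pair="Lan_Channel:Non_Volatile_Channel_Privilege_Limit=Administrator"

# Re-sync the IPMI password with the iDRAC one (-i 2 is the same user index as above)
$ ssh root@<hostname>.mgmt.eqiad.wmnet racadm config -g cfgUserAdmin -o cfgUserAdminPassword -i 2 '<password>'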

The list of remaining hosts right now is:

  • conf1003.mgmt.eqiad.wmnet: unresponsive to ping, but responsive BMC from within the machine (bad cable?)
  • db1063.mgmt.eqiad.wmnet: unresponsive to ping, but responsive BMC from within the machine (bad cable?)
  • kafka1018.mgmt.eqiad.wmnet: unresponsive to ping, but responsive BMC from within the machine (bad cable?)
  • kafka1020.mgmt.eqiad.wmnet: unresponsive to ping, but responsive BMC from within the machine (bad cable?)
  • db1053.mgmt.eqiad.wmnet: responsive to ping but unresponsive to SSH & bmc-config
  • restbase-dev1003.mgmt.eqiad.wmnet, see T169696
  • mw1196.mgmt.eqiad.wmnet, see T169360#3395989
  • sodium.mgmt.eqiad.wmnet, see T169360
  • labsdb1001.mgmt.eqiad.wmnet: Cisco, ignore
  • labsdb1003.mgmt.eqiad.wmnet: Cisco, ignore


Event Timeline


Updating the host provisioning docs in combination with a daily Icinga check sounds like the best approach to me.

Change 320246 had a related patch set uploaded (by Dzahn):
base: also install freeipmi on trusty hosts

https://gerrit.wikimedia.org/r/320246

Change 320246 merged by Dzahn:
base/ipmi: install freeipmi globally, move to ipmi module

https://gerrit.wikimedia.org/r/320246

@Volans ^ freeipmi-tools, freeipmi-ipmidetect and freeipmi-bmc-watchdog should now also be installed on all trusty hosts after the next puppet run (everywhere except VMs, though I did not remove them from existing VMs).

Aklapper renamed this task from Remote IPMI doens't work for ~17% of the fleet to Remote IPMI doesn't work for ~17% of the fleet. Dec 6 2016, 7:26 PM

Mentioned in SAL (#wikimedia-operations) [2016-12-16T11:26:41Z] <volans> enabling remote IPMI where it's not enabled T150160

@Dzahn Thanks for the work!
Unfortunately there are a bunch of hosts on which the freeipmi tools are not yet installed, and some where remote IPMI is already enabled but doesn't work and needs to be checked one by one.

I've enabled remote IPMI on all the other hosts from the list and verified it was working there.
I've updated P4379 to contain the latest failing list of hosts that I'm aware of.
Once those are fixed I'll do another sweep of the whole fleet.

@Dzahn could you please check why freeipmi was not installed on the ones listed in P4379 following your latest patches?

Change 331555 had a related patch set uploaded (by Dzahn):
mediawiki videoscaler: include mediawiki::common role

https://gerrit.wikimedia.org/r/331555

Change 331555 abandoned by Dzahn:
mediawiki videoscaler: include mediawiki::common role

Reason:
I don't see a pattern at all. For example, analytics1027 has the issue and clearly includes standard; or iridium and phab2001: they share a node regex and one has it, one does not...

https://gerrit.wikimedia.org/r/331555

I checked P4379 for distro version and all the affected hosts are trusty, so there is that pattern.

@Volans I looked more; it seems it is simply not installed on _any_ trusty host, which is a lot more than in P4379: it's 272 hosts per salt ..'G@lsb_distrib_codename:trusty', and none of them have it.

I tested replacing require_package with package: no difference. Then I tried removing the "is_virtual == false" check, and oh look, that does it:

http://puppet-compiler.wmflabs.org/5075/

Some non-VMs think they are VMs?!

But when I manually run facter, I can't even confirm that:

[phab2001:~] $ facter | grep virtual
is_virtual => false
virtual => physical

[iridium:~] $ facter | grep virtual
is_virtual => false
virtual => physical

wth..

Change 331579 had a related patch set uploaded (by Dzahn):
base: missing quotes around is_virtual 'false' for ipmi

https://gerrit.wikimedia.org/r/331579

see change above in context of:

19:43 < paravoid> mutante: it's "false", not false, be careful
19:43 < paravoid> the facts are strings still
19:43 < paravoid> and "false" != false

Change 331579 merged by Dzahn:
base: missing quotes around is_virtual 'false' for ipmi

https://gerrit.wikimedia.org/r/331579

@Volans After the merge above, I could confirm that the 3 freeipmi packages got installed on iridium (which was on the list) and nothing broke on phab2001, which already had them. After the next puppet run it should be on all trusty machines.

Do you still have the commands you ran for P4379? That should now be empty.

@Dzahn I've fixed all the ones that were fixable with this method. From a full run across the whole fleet I found some that are still failing IPMI and need more investigation.

Chassis status

The failing hosts can be divided into two groups when trying to get the status of the chassis.

1) Local run successful, remote failing

These hosts can successfully run ipmi-chassis --get-chassis-status locally, but fail when called remotely with Error: Unable to establish IPMI v2 / RMCP+ session:

cp1046.eqiad.wmnet
cp1047.eqiad.wmnet
cp1066.eqiad.wmnet
cp2009.codfw.wmnet
db1070.eqiad.wmnet
db1071.eqiad.wmnet
dbstore1001.eqiad.wmnet
es2019.codfw.wmnet
ganeti2001.codfw.wmnet
iridium.eqiad.wmnet
kafka1020.eqiad.wmnet
labsdb1001.eqiad.wmnet
labsdb1003.eqiad.wmnet
ms-be1007.eqiad.wmnet
ms-fe2005.codfw.wmnet
mw1189.eqiad.wmnet
mw1302.eqiad.wmnet
praseodymium.eqiad.wmnet
sarin.codfw.wmnet

2) Both local and remote run failing

These hosts fail the local run of ipmi-chassis --get-chassis-status with ipmi_cmd_get_chassis_status: internal system error, and the remote one with Error: Unable to establish IPMI v2 / RMCP+ session:

es1019.eqiad.wmnet
ms-be2002.codfw.wmnet
ms-be2010.codfw.wmnet
mw2098.codfw.wmnet
ocg1001.eqiad.wmnet

Remote Lan Channel

The failing hosts can be divided into two groups when getting the Lan Channel configuration (ipmi-config --section=Lan_Channel --key-pair="Lan_Channel:Volatile_Access_Mode=Always_Available" --key-pair="Lan_Channel:Non_Volatile_Access_Mode=Always_Available" --diff; on trusty hosts the command to use is bmc-config instead).

1) The configuration is already correct

cp1046.eqiad.wmnet
cp1047.eqiad.wmnet
cp1066.eqiad.wmnet
cp2009.codfw.wmnet
db1070.eqiad.wmnet
db1071.eqiad.wmnet
dbstore1001.eqiad.wmnet
es2019.codfw.wmnet
ganeti2001.codfw.wmnet
iridium.eqiad.wmnet
kafka1020.eqiad.wmnet
ms-be1007.eqiad.wmnet
ms-fe2005.codfw.wmnet
mw1189.eqiad.wmnet
mw1302.eqiad.wmnet
praseodymium.eqiad.wmnet
sarin.codfw.wmnet

2) The call to get the configuration fails with Unable to get Number of Users

es1019.eqiad.wmnet
labsdb1001.eqiad.wmnet
labsdb1003.eqiad.wmnet
ms-be2002.codfw.wmnet
ms-be2008.codfw.wmnet
ms-be2010.codfw.wmnet
mw2098.codfw.wmnet
ocg1001.eqiad.wmnet

Dzahn renamed this task from Remote IPMI doesn't work for ~17% of the fleet to Remote IPMI doesn't work for ~2% of the fleet. Jan 19 2017, 1:39 AM

For the record, when reinstalling dbstore1001 (T153768), which is mentioned at T150160#2951190 as one of the affected hosts, we ran into this issue and tried to troubleshoot it.
Along with Chris we tried several things:

  • Cold reset of the iDRAC (see the sketch after this list)
  • Update of the iDRAC firmware
  • Update of the BIOS firmware
  • Another cold reset
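One way to do such a reset over SSH is the racadm racreset mentioned in the task description (a hedged sketch; whether this exact command was used here is an assumption):

$ ssh root@dbstore1001.mgmt.eqiad.wmnet racadm racreset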

Nothing worked and we were still getting:

root@neodymium:~# ipmitool -I lanplus -H dbstore1001.mgmt.eqiad.wmnet -U root -E chassis power status
Unable to read password from environment
Password:
Error: Unable to establish IPMI v2 / RMCP+ session

Some debugging showed nothing really relevant:

root@neodymium:~# ipmitool -I lanplus -H 10.65.6.64 -U root -E chassis power status -vvvvvvvvvvv
Unable to read password from environment
Password:
>>    data    : 0x8e 0x04

>> sending packet (23 bytes)
 06 00 ff 07 00 00 00 00 00 00 00 00 00 09 20 18
 c8 81 00 38 8e 04 b5
<< received packet (30 bytes)
 06 00 ff 07 00 00 00 00 00 00 00 00 00 10 81 1c
 63 20 00 38 00 01 86 1c 03 00 00 00 00 02
>> sending packet (48 bytes)
 06 00 ff 07 06 10 00 00 00 00 00 00 00 00 20 00
 00 00 00 00 a4 a3 a2 a0 00 00 00 08 01 00 00 00
 01 00 00 08 01 00 00 00 02 00 00 08 01 00 00 00
<< received packet (52 bytes)
 06 00 ff 07 06 11 00 00 00 00 00 00 00 00 24 00
 00 00 04 00 a4 a3 a2 a0 00 12 00 02 00 00 00 08
 01 00 00 00 01 00 00 08 01 00 00 00 02 00 00 08
 01 00 00 00
<<OPEN SESSION RESPONSE
<<  Message tag                        : 0x00
<<  RMCP+ status                       : no errors
<<  Maximum privilege level            : admin
<<  Console Session ID                 : 0xa0a2a3a4
<<  BMC Session ID                     : 0x02001200
<<  Negotiated authenticatin algorithm : hmac_sha1
<<  Negotiated integrity algorithm     : hmac_sha1_96
<<  Negotiated encryption algorithm    : aes_cbc_128

>> Console generated random number (16 bytes)
 3e dc ec f6 2c 0f f0 02 51 49 3f 8f 11 4a 17 41
>> sending packet (48 bytes)
 06 00 ff 07 06 12 00 00 00 00 00 00 00 00 20 00
 00 00 00 00 00 12 00 02 3e dc ec f6 2c 0f f0 02
 51 49 3f 8f 11 4a 17 41 14 00 00 04 72 6f 6f 74
<< received packet (76 bytes)
 06 00 ff 07 06 13 00 00 00 00 00 00 00 00 3c 00
 00 00 00 00 a4 a3 a2 a0 93 30 ac 7c d9 e0 dc fa
 2d 63 18 73 ca 20 37 f4 44 45 4c 4c 52 00 10 37
 80 43 b4 c0 4f 48 30 32 35 60 68 e0 1d 51 06 e7
 58 46 62 5f e5 ea 87 c1 8b f6 8e a1
<<RAKP 2 MESSAGE
<<  Message tag                   : 0x00
<<  RMCP+ status                  : no errors
<<  Console Session ID            : 0xa0a2a3a4
<<  BMC random number             : 0x9330ac7cd9e0dcfa2d631873ca2037f4
<<  BMC GUID                      : 0x44454c4c520010378043b4c04f483032
<<  Key exchange auth code [sha1] : 0x356068e01d5106e75846625fe5ea87c18bf68ea1

bmc_rand (16 bytes)
 93 30 ac 7c d9 e0 dc fa 2d 63 18 73 ca 20 37 f4
>> rakp2 mac input buffer (62 bytes)
 a4 a3 a2 a0 00 12 00 02 3e dc ec f6 2c 0f f0 02
 51 49 3f 8f 11 4a 17 41 93 30 ac 7c d9 e0 dc fa
 2d 63 18 73 ca 20 37 f4 44 45 4c 4c 52 00 10 37
 80 43 b4 c0 4f 48 30 32 14 04 72 6f 6f 74
>> rakp2 mac key (20 bytes)
 77 72 34 21 45 70 72 75 32 43 35 00 00 00 00 00
 00 00 00 00
>> rakp2 mac as computed by the remote console (20 bytes)
 14 3c 74 84 a0 5f 96 5f 08 0f 2c 2d 55 ea 3e 22
 40 17 8c 3b
Error: Unable to establish IPMI v2 / RMCP+ session

Same command works for another host:

root@neodymium:~# ipmitool -I lanplus -H db1092.mgmt.eqiad.wmnet -U root -E chassis power status
Unable to read password from environment
Password:
Chassis Power is on

However, it works locally:

root@dbstore1001:~# ipmi-chassis --get-chassis-status
System Power                        : on
Power overload                      : false
<snip>

dbstore1001 has the remote issue; opened a ticket with Dell to troubleshoot. Updating the firmware to see if that will fix the issue.

Dzahn moved this task from Up next to Inbox on the observability board.

Mentioned in SAL (#wikimedia-operations) [2017-07-10T21:06:54Z] <volans> running IPMI auditing to update status of T150160

I've run the audit again with a small script on neodymium in my home directory, using cumin to grab the list of hostnames. The only requirement is to have IPMI_PASSWORD exported with the right password in the current environment (I set it with a leading space so that it doesn't get saved into the bash history at all, since I'm using HISTCONTROL=ignoreboth).
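A sketch of that setup (the leading space before export is what keeps the line out of the bash history with HISTCONTROL=ignoreboth; sudo -E in the commands below then preserves the environment so ipmitool -E can read IPMI_PASSWORD; the placeholder value is illustrative):

$  export IPMI_PASSWORD='<the management root password>'   # note the leading space before export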

Remote ipmitool chassis status

Here's what I've run in a screen:

$ sudo ./get_mgmt_hosts.py | xargs -n1 -I "mgmt" sudo -E bash -c "echo -n 'mgmt: '; ipmitool -I lanplus -H mgmt -U root -E chassis status > /dev/null 2>&1 && echo PASS || echo FAIL" > ipmi_audit.log

And here are the failed ones:

$ grep -c FAIL ipmi_audit.log
32
$ grep FAIL ipmi_audit.log | sort
bast3002.mgmt.esams.wmnet: FAIL
conf1003.mgmt.eqiad.wmnet: FAIL
cp1046.mgmt.eqiad.wmnet: FAIL
cp1047.mgmt.eqiad.wmnet: FAIL
cp2009.mgmt.codfw.wmnet: FAIL
cp4021.mgmt.ulsfo.wmnet: FAIL
db1053.mgmt.eqiad.wmnet: FAIL
db1063.mgmt.eqiad.wmnet: FAIL
db1071.mgmt.eqiad.wmnet: FAIL
db2044.mgmt.codfw.wmnet: FAIL
db2082.mgmt.codfw.wmnet: FAIL
dbstore1001.mgmt.eqiad.wmnet: FAIL
es2019.mgmt.codfw.wmnet: FAIL
ganeti2001.mgmt.codfw.wmnet: FAIL
gerrit2001.mgmt.codfw.wmnet: FAIL
iridium.mgmt.eqiad.wmnet: FAIL
kafka1018.mgmt.eqiad.wmnet: FAIL
kafka1020.mgmt.eqiad.wmnet: FAIL
labsdb1001.mgmt.eqiad.wmnet: FAIL
labsdb1003.mgmt.eqiad.wmnet: FAIL
ms-fe2005.mgmt.codfw.wmnet: FAIL
mw1189.mgmt.eqiad.wmnet: FAIL
mw1196.mgmt.eqiad.wmnet: FAIL
mw1302.mgmt.eqiad.wmnet: FAIL
naos.mgmt.codfw.wmnet: FAIL
ocg1001.mgmt.eqiad.wmnet: FAIL
praseodymium.mgmt.eqiad.wmnet: FAIL
restbase-dev1003.mgmt.eqiad.wmnet: FAIL
sarin.mgmt.codfw.wmnet: FAIL
scb2005.mgmt.codfw.wmnet: FAIL
sodium.mgmt.eqiad.wmnet: FAIL
stat1003.mgmt.eqiad.wmnet: FAIL

I've then re-run the audit only on the failed ones, grabbing the output:

$ grep FAIL ipmi_audit.log | sort | cut -d':' -f1 | xargs -n1 -I "mgmt" sudo -E bash -c "echo -e '#----- mgmt'; ipmitool -I lanplus -H mgmt -U root -E chassis status 2>&1" > ipmi_failed.log

And here's the output:

$ cat ipmi_failed.log
#----- bast3002.mgmt.esams.wmnet
Error: Unable to establish IPMI v2 / RMCP+ session
#----- conf1003.mgmt.eqiad.wmnet
Error: Unable to establish IPMI v2 / RMCP+ session
#----- cp1046.mgmt.eqiad.wmnet
Error: Unable to establish IPMI v2 / RMCP+ session
#----- cp1047.mgmt.eqiad.wmnet
Error: Unable to establish IPMI v2 / RMCP+ session
#----- cp2009.mgmt.codfw.wmnet
Error: Unable to establish IPMI v2 / RMCP+ session
#----- cp4021.mgmt.ulsfo.wmnet
Error: Unable to establish IPMI v2 / RMCP+ session
#----- db1053.mgmt.eqiad.wmnet
Error: Received an Unexpected Open Session Response
> Error: no response from RAKP 1 message
Set Session Privilege Level to ADMINISTRATOR failed
Error: Unable to establish IPMI v2 / RMCP+ session
#----- db1063.mgmt.eqiad.wmnet
Error: Unable to establish IPMI v2 / RMCP+ session
#----- db1071.mgmt.eqiad.wmnet
Error: Unable to establish IPMI v2 / RMCP+ session
#----- db2044.mgmt.codfw.wmnet
> Error: no response from RAKP 1 message
Set Session Privilege Level to ADMINISTRATOR failed
Error: Unable to establish IPMI v2 / RMCP+ session
#----- db2082.mgmt.codfw.wmnet
Error: Unable to establish IPMI v2 / RMCP+ session
#----- dbstore1001.mgmt.eqiad.wmnet
Error: Unable to establish IPMI v2 / RMCP+ session
#----- es2019.mgmt.codfw.wmnet
Error: Unable to establish IPMI v2 / RMCP+ session
#----- ganeti2001.mgmt.codfw.wmnet
Error: Unable to establish IPMI v2 / RMCP+ session
#----- gerrit2001.mgmt.codfw.wmnet
Error: Unable to establish IPMI v2 / RMCP+ session
#----- iridium.mgmt.eqiad.wmnet
Error: Unable to establish IPMI v2 / RMCP+ session
#----- kafka1018.mgmt.eqiad.wmnet
Error: Unable to establish IPMI v2 / RMCP+ session
#----- kafka1020.mgmt.eqiad.wmnet
Error: Unable to establish IPMI v2 / RMCP+ session
#----- labsdb1001.mgmt.eqiad.wmnet
Error: Unable to establish IPMI v2 / RMCP+ session
#----- labsdb1003.mgmt.eqiad.wmnet
Error: Unable to establish IPMI v2 / RMCP+ session
#----- ms-fe2005.mgmt.codfw.wmnet
Error: Unable to establish IPMI v2 / RMCP+ session
#----- mw1189.mgmt.eqiad.wmnet
Error: Unable to establish IPMI v2 / RMCP+ session
#----- mw1196.mgmt.eqiad.wmnet
Error: Unable to establish IPMI v2 / RMCP+ session
#----- mw1302.mgmt.eqiad.wmnet
Set Session Privilege Level to ADMINISTRATOR failed: Unknown (0x81)
Error: Unable to establish IPMI v2 / RMCP+ session
#----- naos.mgmt.codfw.wmnet
Error: Unable to establish IPMI v2 / RMCP+ session
#----- ocg1001.mgmt.eqiad.wmnet
Error: Unable to establish IPMI v2 / RMCP+ session
#----- praseodymium.mgmt.eqiad.wmnet
Error: Unable to establish IPMI v2 / RMCP+ session
#----- restbase-dev1003.mgmt.eqiad.wmnet
Error: Unable to establish IPMI v2 / RMCP+ session
#----- sarin.mgmt.codfw.wmnet
Error: Unable to establish IPMI v2 / RMCP+ session
#----- scb2005.mgmt.codfw.wmnet
Error: Unable to establish IPMI v2 / RMCP+ session
#----- sodium.mgmt.eqiad.wmnet
Error: Unable to establish IPMI v2 / RMCP+ session
#----- stat1003.mgmt.eqiad.wmnet
Error: Unable to establish IPMI v2 / RMCP+ session

Local ipmi-chassis --get-chassis-status

Checking if we can get a chassis status locally, it succeeds on all but 2 hosts.
Hosts that are down were excluded:

$ HOSTS="$(grep FAIL ipmi_audit.log | sort | cut -d'.' -f1 | sed 's/$/.*,/' | tr -d '\n' | sed 's/,$//')"
$ sudo cumin -b 2 -p 1 "${HOSTS}" "ipmi-chassis --get-chassis-status > /dev/null"

...SNIP...

===== NODE GROUP =====
(1) sodium.wikimedia.org
----- OUTPUT of 'ipmi-chassis --g...atus > /dev/null' -----
ipmi_cmd_get_chassis_status: driver timeout
===== NODE GROUP =====
(1) db2044.codfw.wmnet
----- OUTPUT of 'ipmi-chassis --g...atus > /dev/null' -----
ipmi_cmd_get_chassis_status: BMC busy

...SNIP...

87.5% (28/32) success ratio (>= 1.0% threshold) for command: 'ipmi-chassis --g...atus > /dev/null'.: bast3002.wikimedia.org,conf1003.eqiad.wmnet,cp2009.codfw.wmnet,cp[1046-1047].eqiad.wmnet,cp4021.ulsfo.wmnet,db2082.codfw.wmnet,db[1053,1063,1071].eqiad.wmnet,dbstore1001.eqiad.wmnet,es2019.codfw.wmnet,ganeti2001.codfw.wmnet,gerrit2001.wikimedia.org,iridium.eqiad.wmnet,kafka[1018,1020].eqiad.wmnet,labsdb[1001,1003].eqiad.wmnet,ms-fe2005.codfw.wmnet,mw[1189,1302].eqiad.wmnet,naos.codfw.wmnet,ocg1001.eqiad.wmnet,praseodymium.eqiad.wmnet,sarin.codfw.wmnet,scb2005.codfw.wmnet,stat1003.eqiad.wmnet

Local ipmi-config --diff

Checking if the configuration for remote access is correct with ipmi-config (bmc-config on trusty hosts): on 5 hosts it is wrong, and on another 2 it is unable to get the number of users.
Hosts that are down were excluded:

$ sudo cumin -b 2 -p 1 "${HOSTS}" "ipmi-config --section=Lan_Channel --key-pair="Lan_Channel:Volatile_Access_Mode=Always_Available" --key-pair="Lan_Channel:Non_Volatile_Access_Mode=Always_Available" --diff"

...SNIP...

===== NODE GROUP =====
(5) bast3002.wikimedia.org,cp4021.ulsfo.wmnet,db2082.codfw.wmnet,gerrit2001.wikimedia.org,naos.codfw.wmnet
----- OUTPUT of 'ipmi-config --se...Available --diff' -----
Lan_Channel:Volatile_Access_Mode - input=`Always_Available':actual=`Disabled'
Lan_Channel:Non_Volatile_Access_Mode - input=`Always_Available':actual=`Disabled'
===== NODE GROUP =====
(2) labsdb[1001,1003].eqiad.wmnet
----- OUTPUT of 'bmc-config --sec...Available --diff' -----
Unable to get Number of Users
================

I've tried to fix the 5 hosts with the wrong remote config:

sudo cumin -b 1 "bast3002.wikimedia.org,cp4021.ulsfo.wmnet,db2082.codfw.wmnet,gerrit2001.wikimedia.org,naos.codfw.wmnet" "ipmi-config --category=core --key-pair="Lan_Channel:Volatile_Access_Mode=Always_Available" --key-pair="Lan_Channel:Non_Volatile_Access_Mode=Always_Available" --commit"

And indeed the diff is shown as empty after that and now they are a PASS:

cp4021.mgmt.ulsfo.wmnet: PASS
bast3002.mgmt.esams.wmnet: PASS
gerrit2001.mgmt.codfw.wmnet: PASS
db2082.mgmt.codfw.wmnet: PASS
naos.mgmt.codfw.wmnet: PASS

@faidon I'm wondering if those configs got reset somehow in the recent reboots...

So I did the following:

  • mw1302: had Volatile_Channel_Privilege_Limit and Non_Volatile_Channel_Privilege_Limit set to Operator instead of Administrator; fixed with bmc-config
  • stat1003: had wrong DNS, fixed that
  • a bunch of the rest had the issue that I described in T160392 (IPMI password had gotten out of sync with iDRAC password); fixed with sshpass -e ssh root@$hostname racadm config -g cfgUserAdmin -o cfgUserAdminPassword -i 2 $password (a batch sketch follows this list)
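A hedged sketch of how that re-sync can be batched over the affected hosts (the single-host command is the one above; the loop, the mgmt_hosts.txt file and the variables are illustrative, with sshpass -e reading the current iDRAC password from the SSHPASS environment variable):

$ while read -r host; do
    sshpass -e ssh -o StrictHostKeyChecking=no "root@${host}" \
      racadm config -g cfgUserAdmin -o cfgUserAdminPassword -i 2 "${password}"
  done < mgmt_hosts.txt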

I'll update the list at the task's description.

@faidon I'm wondering if those configs got reset somehow in the recent reboots...

db2082 wasn't rebooted, though.

db1053.mgmt.eqiad.wmnet seems to work now, I can both ssh and get a chassis status from neodymium. Transient issue?

Chris fixed the cables for conf1003, kafka1018, kafka1020 and db1063. All fixed!

All hosts listed here and most of those in T169360 are fixed now. What isn't fixed is due to hardware trouble that is tracked separately (and it's just 5 hosts now, instead of ~2% :). Resolving!