
Remote IPMI doesn't work for ~2% of the fleet
Closed, Resolved (Public)

Description

We rely on remote IPMI in a lot of cases but still often have issues with it.

An audit of the reachability of IPMI across the fleet found numerous hosts for which remote IPMI is not working, where one is not able to perform a chassis status query from puppetmaster1001 via ipmitool.
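For context, the remote check is essentially the chassis status call sketched below (the same invocation used in the audits further down in this task; <hostname> is a placeholder, and -E makes ipmitool read the password from the IPMI_PASSWORD environment variable):

$ ipmitool -I lanplus -H <hostname>.mgmt.eqiad.wmnet -U root -E chassis status
Chassis Power is on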

Several issues have been identified:

  • an IPMI misconfiguration where in the Lan_Channel section the Volatile_Access_Mode and Non_Volatile_Access_Mode (runtime value and value after next reboot) are set to Disabled instead of Always_Available.
  • an IPMI misconfiguration where in the Lan_Channel section the Volatile_Channel_Privilege_Limit and Non_Volatile_Channel_Privilege_Limit (runtime value and value after next reboot) are set to Operator instead of Administrator.
  • IPMI passwords getting out of sync with their iDRAC passwords; ssh root@$hostname racadm config -g cfgUserAdmin -o cfgUserAdminPassword -i 2 $password usually fixes this (see the command sketch after this list)
  • BMCs being unresponsive to IPMI but responsive to SSH (a racadm racreset usually fixes this)
  • BMCs being responsive to ping but unresponsive to SSH (this needs a power drain/cycle)
  • BMCs being unresponsive to ping (this either needs a power drain/cycle, or network debugging, e.g. bad cable)
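For the first two misconfigurations and the password sync, a hedged sketch of the fix commands (they mirror the ones used later in this task; the config changes are run locally on the host as root, with bmc-config instead of ipmi-config on trusty, and <hostname>/<password> are placeholders):

# Re-enable LAN access (Disabled -> Always_Available), runtime and persistent value
$ ipmi-config --category=core --commit \
    --key-pair="Lan_Channel:Volatile_Access_Mode=Always_Available" \
    --key-pair="Lan_Channel:Non_Volatile_Access_Mode=Always_Available"

# Raise the channel privilege limit (Operator -> Administrator)
$ ipmi-config --category=core --commit \
    --key-pair="Lan_Channel:Volatile_Channel_Privilege_Limit=Administrator" \
    --key-pair="Lan_Channel:Non_Volatile_Channel_Privilege_Limit=Administrator"

# Re-sync the IPMI password with the iDRAC one (-i 2 is the same user index as above)
$ ssh root@<hostname>.mgmt.eqiad.wmnet racadm config -g cfgUserAdmin -o cfgUserAdminPassword -i 2 '<password>'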

The list of remaining hosts right now is:

  • conf1003.mgmt.eqiad.wmnet: unresponsive to ping, but responsive BMC from within the machine (bad cable?)
  • db1063.mgmt.eqiad.wmnet: unresponsive to ping, but responsive BMC from within the machine (bad cable?)
  • kafka1018.mgmt.eqiad.wmnet: unresponsive to ping, but responsive BMC from within the machine (bad cable?)
  • kafka1020.mgmt.eqiad.wmnet: unresponsive to ping, but responsive BMC from within the machine (bad cable?)
  • db1053.mgmt.eqiad.wmnet: responsive to ping but unresponsive to SSH & bmc-config
  • restbase-dev1003.mgmt.eqiad.wmnet, see T169696
  • mw1196.mgmt.eqiad.wmnet, see T169360#3395989
  • sodium.mgmt.eqiad.wmnet, see T169360
  • labsdb1001.mgmt.eqiad.wmnet: Cisco, ignore
  • labsdb1003.mgmt.eqiad.wmnet: Cisco, ignore


Event Timeline


Updating the host provisioning docs in combination with a daily Icinga check sounds like the best approach to me.

Change 320246 had a related patch set uploaded (by Dzahn):
base: also install freeipmi on trusty hosts

https://gerrit.wikimedia.org/r/320246

Change 320246 merged by Dzahn:
base/ipmi: install freeipmi globally, move to ipmi module

https://gerrit.wikimedia.org/r/320246

@Volans ^ freeipmi-tools, freeipmi-ipmidetect and freeipmi-bmc-watchdog should now also be installed on all trusty hosts after the next puppet run (everywhere except VMs, though I did not remove them from existing VMs).

Aklapper renamed this task from Remote IPMI doens't work for ~17% of the fleet to Remote IPMI doesn't work for ~17% of the fleet. Dec 6 2016, 7:26 PM

Mentioned in SAL (#wikimedia-operations) [2016-12-16T11:26:41Z] <volans> enabling remote IPMI where it's not enabled T150160

@Dzahn Thanks for the work!
Unfortunately there are a bunch of hosts on which the freeipmi tools are not yet installed, and some where remote IPMI is already enabled but doesn't work and needs to be checked one by one.

I've enabled remote IPMI on all the other hosts from the list and verified it was working there.
I've updated P4379 to contain the latest failing list of hosts that I'm aware of.
Once those are fixed I'll do another sweep of the whole fleet.

@Dzahn could you please check why freeipmi was not installed on the ones listed in P4379 following your latest patches?

Change 331555 had a related patch set uploaded (by Dzahn):
mediawiki videoscaler: include mediawiki::common role

https://gerrit.wikimedia.org/r/331555

Change 331555 abandoned by Dzahn:
mediawiki videoscaler: include mediawiki::common role

Reason:
I don't see a pattern at all. For example, analytics1027 has the issue and clearly includes standard; or iridium and phab2001: they share a node regex and one has it, one does not...

https://gerrit.wikimedia.org/r/331555

I checked P4379 for distro version and all the affected hosts are trusty, so there is that pattern.

@Volans I looked more; it seems it is simply not installed on _any_ trusty host, which is a lot more than in P4379: it's 272 hosts per salt ..'G@lsb_distrib_codename:trusty', and none of them have it.

I tested replacing require_package with package: no difference. Then I tried removing the "is_virtual == false" check, and oh look, that does it:

http://puppet-compiler.wmflabs.org/5075/

Some non-VMs think they are VMs?!

But when I manually run facter, I can't even confirm that:

[phab2001:~] $ facter | grep virtual
is_virtual => false
virtual => physical

[iridium:~] $ facter | grep virtual
is_virtual => false
virtual => physical

wth..

Change 331579 had a related patch set uploaded (by Dzahn):
base: missing quotes around is_virtual 'false' for ipmi

https://gerrit.wikimedia.org/r/331579

see change above in context of:

19:43 < paravoid> mutante: it's "false", not false, be careful
19:43 < paravoid> the facts are strings still
19:43 < paravoid> and "false" != false

Change 331579 merged by Dzahn:
base: missing quotes around is_virtual 'false' for ipmi

https://gerrit.wikimedia.org/r/331579

@Volans After the merge above, I could confirm that the 3 freeipmi packages got installed on iridium (which was on the list) and nothing broke on phab2001, which already had them. After the next puppet run it should be on all trusty machines.

Do you still have the commands you ran for P4379? That should now be empty.

@Dzahn I've fixed all the ones that were fixable with this method. From a full run across the whole fleet I found some that are still failing IPMI and need more investigation.

Chassis status

The failing hosts can be divided into two groups when trying to get the status of the chassis.

1) Local run successful, remote failing

These hosts can successfully run ipmi-chassis --get-chassis-status locally, but fail when called remotely with Error: Unable to establish IPMI v2 / RMCP+ session:

cp1046.eqiad.wmnet
cp1047.eqiad.wmnet
cp1066.eqiad.wmnet
cp2009.codfw.wmnet
db1070.eqiad.wmnet
db1071.eqiad.wmnet
dbstore1001.eqiad.wmnet
es2019.codfw.wmnet
ganeti2001.codfw.wmnet
iridium.eqiad.wmnet
kafka1020.eqiad.wmnet
labsdb1001.eqiad.wmnet
labsdb1003.eqiad.wmnet
ms-be1007.eqiad.wmnet
ms-fe2005.codfw.wmnet
mw1189.eqiad.wmnet
mw1302.eqiad.wmnet
praseodymium.eqiad.wmnet
sarin.codfw.wmnet

2) Both local and remote run failing

These hosts fail the local run of ipmi-chassis --get-chassis-status with ipmi_cmd_get_chassis_status: internal system error, and the remote one with Error: Unable to establish IPMI v2 / RMCP+ session:

es1019.eqiad.wmnet
ms-be2002.codfw.wmnet
ms-be2010.codfw.wmnet
mw2098.codfw.wmnet
ocg1001.eqiad.wmnet

Remote Lan Channel

The failing hosts can be divided into two groups when getting the Lan Channel configuration (ipmi-config --section=Lan_Channel --key-pair="Lan_Channel:Volatile_Access_Mode=Always_Available" --key-pair="Lan_Channel:Non_Volatile_Access_Mode=Always_Available" --diff; on trusty hosts the command to use is bmc-config instead).

1) The configuration is already correct

cp1046.eqiad.wmnet
cp1047.eqiad.wmnet
cp1066.eqiad.wmnet
cp2009.codfw.wmnet
db1070.eqiad.wmnet
db1071.eqiad.wmnet
dbstore1001.eqiad.wmnet
es2019.codfw.wmnet
ganeti2001.codfw.wmnet
iridium.eqiad.wmnet
kafka1020.eqiad.wmnet
ms-be1007.eqiad.wmnet
ms-fe2005.codfw.wmnet
mw1189.eqiad.wmnet
mw1302.eqiad.wmnet
praseodymium.eqiad.wmnet
sarin.codfw.wmnet

2) The call to get the configuration fails with Unable to get Number of Users

es1019.eqiad.wmnet
labsdb1001.eqiad.wmnet
labsdb1003.eqiad.wmnet
ms-be2002.codfw.wmnet
ms-be2008.codfw.wmnet
ms-be2010.codfw.wmnet
mw2098.codfw.wmnet
ocg1001.eqiad.wmnet

Dzahn renamed this task from Remote IPMI doesn't work for ~17% of the fleet to Remote IPMI doesn't work for ~2% of the fleet. Jan 19 2017, 1:39 AM

For the record, when reinstalling dbstore1001 (T153768), which is mentioned at T150160#2951190 as one of the affected hosts, we ran into this issue and tried to troubleshoot it.
Along with Chris we tried several things:

  • Cold reset of the iDRAC (see the sketch after this list)
  • Update of the iDRAC firmware
  • Update of the BIOS firmware
  • Another cold reset
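One way to do such a reset over SSH is the racadm racreset mentioned in the task description (a hedged sketch; whether this exact command was used here is an assumption):

$ ssh root@dbstore1001.mgmt.eqiad.wmnet racadm racreset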

Nothing worked and we were still getting:

root@neodymium:~# ipmitool -I lanplus -H dbstore1001.mgmt.eqiad.wmnet -U root -E chassis power status
Unable to read password from environment
Password:
Error: Unable to establish IPMI v2 / RMCP+ session

Some debugging showed nothing really relevant:

root@neodymium:~# ipmitool -I lanplus -H 10.65.6.64 -U root -E chassis power status -vvvvvvvvvvv
Unable to read password from environment
Password:
>>    data    : 0x8e 0x04

>> sending packet (23 bytes)
 06 00 ff 07 00 00 00 00 00 00 00 00 00 09 20 18
 c8 81 00 38 8e 04 b5
<< received packet (30 bytes)
 06 00 ff 07 00 00 00 00 00 00 00 00 00 10 81 1c
 63 20 00 38 00 01 86 1c 03 00 00 00 00 02
>> sending packet (48 bytes)
 06 00 ff 07 06 10 00 00 00 00 00 00 00 00 20 00
 00 00 00 00 a4 a3 a2 a0 00 00 00 08 01 00 00 00
 01 00 00 08 01 00 00 00 02 00 00 08 01 00 00 00
<< received packet (52 bytes)
 06 00 ff 07 06 11 00 00 00 00 00 00 00 00 24 00
 00 00 04 00 a4 a3 a2 a0 00 12 00 02 00 00 00 08
 01 00 00 00 01 00 00 08 01 00 00 00 02 00 00 08
 01 00 00 00
<<OPEN SESSION RESPONSE
<<  Message tag                        : 0x00
<<  RMCP+ status                       : no errors
<<  Maximum privilege level            : admin
<<  Console Session ID                 : 0xa0a2a3a4
<<  BMC Session ID                     : 0x02001200
<<  Negotiated authenticatin algorithm : hmac_sha1
<<  Negotiated integrity algorithm     : hmac_sha1_96
<<  Negotiated encryption algorithm    : aes_cbc_128

>> Console generated random number (16 bytes)
 3e dc ec f6 2c 0f f0 02 51 49 3f 8f 11 4a 17 41
>> sending packet (48 bytes)
 06 00 ff 07 06 12 00 00 00 00 00 00 00 00 20 00
 00 00 00 00 00 12 00 02 3e dc ec f6 2c 0f f0 02
 51 49 3f 8f 11 4a 17 41 14 00 00 04 72 6f 6f 74
<< received packet (76 bytes)
 06 00 ff 07 06 13 00 00 00 00 00 00 00 00 3c 00
 00 00 00 00 a4 a3 a2 a0 93 30 ac 7c d9 e0 dc fa
 2d 63 18 73 ca 20 37 f4 44 45 4c 4c 52 00 10 37
 80 43 b4 c0 4f 48 30 32 35 60 68 e0 1d 51 06 e7
 58 46 62 5f e5 ea 87 c1 8b f6 8e a1
<<RAKP 2 MESSAGE
<<  Message tag                   : 0x00
<<  RMCP+ status                  : no errors
<<  Console Session ID            : 0xa0a2a3a4
<<  BMC random number             : 0x9330ac7cd9e0dcfa2d631873ca2037f4
<<  BMC GUID                      : 0x44454c4c520010378043b4c04f483032
<<  Key exchange auth code [sha1] : 0x356068e01d5106e75846625fe5ea87c18bf68ea1

bmc_rand (16 bytes)
 93 30 ac 7c d9 e0 dc fa 2d 63 18 73 ca 20 37 f4
>> rakp2 mac input buffer (62 bytes)
 a4 a3 a2 a0 00 12 00 02 3e dc ec f6 2c 0f f0 02
 51 49 3f 8f 11 4a 17 41 93 30 ac 7c d9 e0 dc fa
 2d 63 18 73 ca 20 37 f4 44 45 4c 4c 52 00 10 37
 80 43 b4 c0 4f 48 30 32 14 04 72 6f 6f 74
>> rakp2 mac key (20 bytes)
 77 72 34 21 45 70 72 75 32 43 35 00 00 00 00 00
 00 00 00 00
>> rakp2 mac as computed by the remote console (20 bytes)
 14 3c 74 84 a0 5f 96 5f 08 0f 2c 2d 55 ea 3e 22
 40 17 8c 3b
Error: Unable to establish IPMI v2 / RMCP+ session

Same command works for another host:

root@neodymium:~# ipmitool -I lanplus -H db1092.mgmt.eqiad.wmnet -U root -E chassis power status
Unable to read password from environment
Password:
Chassis Power is on

However, it works locally:

root@dbstore1001:~# ipmi-chassis --get-chassis-status
System Power                        : on
Power overload                      : false
<snip>

dbstore1001 has the remote issue; opened a ticket with Dell to troubleshoot. Updating the firmware to see if that will fix the issue.

Dzahn moved this task from Up next to Inbox on the observability board.

Mentioned in SAL (#wikimedia-operations) [2017-07-10T21:06:54Z] <volans> running IPMI auditing to update status of T150160

I've run the audit again with a small script on neodymium in my home directory, using cumin to grab the list of hostnames. The only requirement is to have IPMI_PASSWORD exported with the right password in the current environment (I set it with a leading space so that it doesn't get saved into the bash history at all, since I'm using HISTCONTROL=ignoreboth).
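A sketch of that setup (the leading space before export is what keeps the line out of the bash history with HISTCONTROL=ignoreboth; sudo -E in the commands below then preserves the environment so ipmitool -E can read IPMI_PASSWORD; the placeholder value is illustrative):

$  export IPMI_PASSWORD='<the management root password>'   # note the leading space before export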

Remote ipmitool chassis status

Here's what I've run in a screen:

$ sudo ./get_mgmt_hosts.py | xargs -n1 -I "mgmt" sudo -E bash -c "echo -n 'mgmt: '; ipmitool -I lanplus -H mgmt -U root -E chassis status > /dev/null 2>&1 && echo PASS || echo FAIL" > ipmi_audit.log

And here are the failed ones:

$ grep -c FAIL ipmi_audit.log
32
$ grep FAIL ipmi_audit.log | sort
bast3002.mgmt.esams.wmnet: FAIL
conf1003.mgmt.eqiad.wmnet: FAIL
cp1046.mgmt.eqiad.wmnet: FAIL
cp1047.mgmt.eqiad.wmnet: FAIL
cp2009.mgmt.codfw.wmnet: FAIL
cp4021.mgmt.ulsfo.wmnet: FAIL
db1053.mgmt.eqiad.wmnet: FAIL
db1063.mgmt.eqiad.wmnet: FAIL
db1071.mgmt.eqiad.wmnet: FAIL
db2044.mgmt.codfw.wmnet: FAIL
db2082.mgmt.codfw.wmnet: FAIL
dbstore1001.mgmt.eqiad.wmnet: FAIL
es2019.mgmt.codfw.wmnet: FAIL
ganeti2001.mgmt.codfw.wmnet: FAIL
gerrit2001.mgmt.codfw.wmnet: FAIL
iridium.mgmt.eqiad.wmnet: FAIL
kafka1018.mgmt.eqiad.wmnet: FAIL
kafka1020.mgmt.eqiad.wmnet: FAIL
labsdb1001.mgmt.eqiad.wmnet: FAIL
labsdb1003.mgmt.eqiad.wmnet: FAIL
ms-fe2005.mgmt.codfw.wmnet: FAIL
mw1189.mgmt.eqiad.wmnet: FAIL
mw1196.mgmt.eqiad.wmnet: FAIL
mw1302.mgmt.eqiad.wmnet: FAIL
naos.mgmt.codfw.wmnet: FAIL
ocg1001.mgmt.eqiad.wmnet: FAIL
praseodymium.mgmt.eqiad.wmnet: FAIL
restbase-dev1003.mgmt.eqiad.wmnet: FAIL
sarin.mgmt.codfw.wmnet: FAIL
scb2005.mgmt.codfw.wmnet: FAIL
sodium.mgmt.eqiad.wmnet: FAIL
stat1003.mgmt.eqiad.wmnet: FAIL

I've then re-run the audit only on the failed ones, grabbing the output:

$ grep FAIL ipmi_audit.log | sort | cut -d':' -f1 | xargs -n1 -I "mgmt" sudo -E bash -c "echo -e '#----- mgmt'; ipmitool -I lanplus -H mgmt -U root -E chassis status 2>&1" > ipmi_failed.log

And here's the output:

$ cat ipmi_failed.log
#----- bast3002.mgmt.esams.wmnet
Error: Unable to establish IPMI v2 / RMCP+ session
#----- conf1003.mgmt.eqiad.wmnet
Error: Unable to establish IPMI v2 / RMCP+ session
#----- cp1046.mgmt.eqiad.wmnet
Error: Unable to establish IPMI v2 / RMCP+ session
#----- cp1047.mgmt.eqiad.wmnet
Error: Unable to establish IPMI v2 / RMCP+ session
#----- cp2009.mgmt.codfw.wmnet
Error: Unable to establish IPMI v2 / RMCP+ session
#----- cp4021.mgmt.ulsfo.wmnet
Error: Unable to establish IPMI v2 / RMCP+ session
#----- db1053.mgmt.eqiad.wmnet
Error: Received an Unexpected Open Session Response
> Error: no response from RAKP 1 message
Set Session Privilege Level to ADMINISTRATOR failed
Error: Unable to establish IPMI v2 / RMCP+ session
#----- db1063.mgmt.eqiad.wmnet
Error: Unable to establish IPMI v2 / RMCP+ session
#----- db1071.mgmt.eqiad.wmnet
Error: Unable to establish IPMI v2 / RMCP+ session
#----- db2044.mgmt.codfw.wmnet
> Error: no response from RAKP 1 message
Set Session Privilege Level to ADMINISTRATOR failed
Error: Unable to establish IPMI v2 / RMCP+ session
#----- db2082.mgmt.codfw.wmnet
Error: Unable to establish IPMI v2 / RMCP+ session
#----- dbstore1001.mgmt.eqiad.wmnet
Error: Unable to establish IPMI v2 / RMCP+ session
#----- es2019.mgmt.codfw.wmnet
Error: Unable to establish IPMI v2 / RMCP+ session
#----- ganeti2001.mgmt.codfw.wmnet
Error: Unable to establish IPMI v2 / RMCP+ session
#----- gerrit2001.mgmt.codfw.wmnet
Error: Unable to establish IPMI v2 / RMCP+ session
#----- iridium.mgmt.eqiad.wmnet
Error: Unable to establish IPMI v2 / RMCP+ session
#----- kafka1018.mgmt.eqiad.wmnet
Error: Unable to establish IPMI v2 / RMCP+ session
#----- kafka1020.mgmt.eqiad.wmnet
Error: Unable to establish IPMI v2 / RMCP+ session
#----- labsdb1001.mgmt.eqiad.wmnet
Error: Unable to establish IPMI v2 / RMCP+ session
#----- labsdb1003.mgmt.eqiad.wmnet
Error: Unable to establish IPMI v2 / RMCP+ session
#----- ms-fe2005.mgmt.codfw.wmnet
Error: Unable to establish IPMI v2 / RMCP+ session
#----- mw1189.mgmt.eqiad.wmnet
Error: Unable to establish IPMI v2 / RMCP+ session
#----- mw1196.mgmt.eqiad.wmnet
Error: Unable to establish IPMI v2 / RMCP+ session
#----- mw1302.mgmt.eqiad.wmnet
Set Session Privilege Level to ADMINISTRATOR failed: Unknown (0x81)
Error: Unable to establish IPMI v2 / RMCP+ session
#----- naos.mgmt.codfw.wmnet
Error: Unable to establish IPMI v2 / RMCP+ session
#----- ocg1001.mgmt.eqiad.wmnet
Error: Unable to establish IPMI v2 / RMCP+ session
#----- praseodymium.mgmt.eqiad.wmnet
Error: Unable to establish IPMI v2 / RMCP+ session
#----- restbase-dev1003.mgmt.eqiad.wmnet
Error: Unable to establish IPMI v2 / RMCP+ session
#----- sarin.mgmt.codfw.wmnet
Error: Unable to establish IPMI v2 / RMCP+ session
#----- scb2005.mgmt.codfw.wmnet
Error: Unable to establish IPMI v2 / RMCP+ session
#----- sodium.mgmt.eqiad.wmnet
Error: Unable to establish IPMI v2 / RMCP+ session
#----- stat1003.mgmt.eqiad.wmnet
Error: Unable to establish IPMI v2 / RMCP+ session

Local ipmi-chassis --get-chassis-status

Checking if we can get a chassis status locally, it succeeds on all but 2 hosts.
Hosts that are down were excluded:

$ HOSTS="$(grep FAIL ipmi_audit.log | sort | cut -d'.' -f1 | sed 's/$/.*,/' | tr -d '\n' | sed 's/,$//')"
$ sudo cumin -b 2 -p 1 "${HOSTS}" "ipmi-chassis --get-chassis-status > /dev/null"

...SNIP...

===== NODE GROUP =====
(1) sodium.wikimedia.org
----- OUTPUT of 'ipmi-chassis --g...atus > /dev/null' -----
ipmi_cmd_get_chassis_status: driver timeout
===== NODE GROUP =====
(1) db2044.codfw.wmnet
----- OUTPUT of 'ipmi-chassis --g...atus > /dev/null' -----
ipmi_cmd_get_chassis_status: BMC busy

...SNIP...

87.5% (28/32) success ratio (>= 1.0% threshold) for command: 'ipmi-chassis --g...atus > /dev/null'.: bast3002.wikimedia.org,conf1003.eqiad.wmnet,cp2009.codfw.wmnet,cp[1046-1047].eqiad.wmnet,cp4021.ulsfo.wmnet,db2082.codfw.wmnet,db[1053,1063,1071].eqiad.wmnet,dbstore1001.eqiad.wmnet,es2019.codfw.wmnet,ganeti2001.codfw.wmnet,gerrit2001.wikimedia.org,iridium.eqiad.wmnet,kafka[1018,1020].eqiad.wmnet,labsdb[1001,1003].eqiad.wmnet,ms-fe2005.codfw.wmnet,mw[1189,1302].eqiad.wmnet,naos.codfw.wmnet,ocg1001.eqiad.wmnet,praseodymium.eqiad.wmnet,sarin.codfw.wmnet,scb2005.codfw.wmnet,stat1003.eqiad.wmnet

Local ipmi-config --diff

Checking if the configuration for remote access is correct with ipmi-config (bmc-config on trusty hosts): on 5 hosts it is wrong, and on another 2 it is unable to get the number of users.
Hosts that are down were excluded:

$ sudo cumin -b 2 -p 1 "${HOSTS}" "ipmi-config --section=Lan_Channel --key-pair="Lan_Channel:Volatile_Access_Mode=Always_Available" --key-pair="Lan_Channel:Non_Volatile_Access_Mode=Always_Available" --diff"

...SNIP...

===== NODE GROUP =====
(5) bast3002.wikimedia.org,cp4021.ulsfo.wmnet,db2082.codfw.wmnet,gerrit2001.wikimedia.org,naos.codfw.wmnet
----- OUTPUT of 'ipmi-config --se...Available --diff' -----
Lan_Channel:Volatile_Access_Mode - input=`Always_Available':actual=`Disabled'
Lan_Channel:Non_Volatile_Access_Mode - input=`Always_Available':actual=`Disabled'
===== NODE GROUP =====
(2) labsdb[1001,1003].eqiad.wmnet
----- OUTPUT of 'bmc-config --sec...Available --diff' -----
Unable to get Number of Users
================

I've tried to fix the 5 hosts with the wrong remote config:

sudo cumin -b 1 "bast3002.wikimedia.org,cp4021.ulsfo.wmnet,db2082.codfw.wmnet,gerrit2001.wikimedia.org,naos.codfw.wmnet" "ipmi-config --category=core --key-pair="Lan_Channel:Volatile_Access_Mode=Always_Available" --key-pair="Lan_Channel:Non_Volatile_Access_Mode=Always_Available" --commit"

And indeed the diff is shown as empty after that and now they are a PASS:

cp4021.mgmt.ulsfo.wmnet: PASS
bast3002.mgmt.esams.wmnet: PASS
gerrit2001.mgmt.codfw.wmnet: PASS
db2082.mgmt.codfw.wmnet: PASS
naos.mgmt.codfw.wmnet: PASS

@faidon I'm wondering if those configs got reset somehow in the recent reboots...

So I did the following:

  • mw1302: had Volatile_Channel_Privilege_Limit and Non_Volatile_Channel_Privilege_Limit set to Operator instead of Administrator; fixed with bmc-config
  • stat1003: had wrong DNS, fixed that
  • a bunch of the rest had the issue that I described in T160392 (IPMI password had gotten out of sync with iDRAC password); fixed with sshpass -e ssh root@$hostname racadm config -g cfgUserAdmin -o cfgUserAdminPassword -i 2 $password (a batch sketch follows this list)
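A hedged sketch of how that re-sync can be batched over the affected hosts (the single-host command is the one above; the loop, the mgmt_hosts.txt file and the variables are illustrative, with sshpass -e reading the current iDRAC password from the SSHPASS environment variable):

$ while read -r host; do
    sshpass -e ssh -o StrictHostKeyChecking=no "root@${host}" \
      racadm config -g cfgUserAdmin -o cfgUserAdminPassword -i 2 "${password}"
  done < mgmt_hosts.txt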

I'll update the list at the task's description.

@faidon I'm wondering if those configs got reset somehow in the recent reboots...

db2082 wasn't rebooted, though.

db1053.mgmt.eqiad.wmnet seems to work now, I can both ssh and get a chassis status from neodymium. Transient issue?

Chris fixed the cables for conf1003, kafka1018, kafka1020 and db1063. All fixed!

All hosts listed here and most of those in T169360 are fixed now. What isn't fixed is due to hardware trouble that is tracked separately (and it's just 5 hosts now, instead of ~2% :). Resolving!