ms-be2002.codfw.wmnet has drac issues
Open, Normal, Public

Description

When attempting to enable IPMI, host ms-be2002.codfw.wmnet has issues.

Attempting to login remotely resulted in:

robh@puppetmaster1001:~$ ssh root@ms-be2002.mgmt.codfw.wmnet
The authenticity of host 'ms-be2002.mgmt.codfw.wmnet (10.193.1.38)' can't be established.
ECDSA key fingerprint is 06:2e:c0:d9:4e:03:41:e2:7a:cc:66:be:2d:6c:28:de.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'ms-be2002.mgmt.codfw.wmnet,10.193.1.38' (ECDSA) to the list of known hosts.
root@ms-be2002.mgmt.codfw.wmnet's password: 

License Violation: "Dedicated NIC"

It then closed the connection. It seems this hasn't gotten the updated license installed, or it has now entered a bad state.

This system will need the following troubleshooting:

    • fully remove the power cables, then boot the host back up
    • if that doesn't work, we'll need to flash it with a new drac license. Please test the first step, and assign back to @RobH if it doesn't work.

Downtime for the host will need to be coordinated with @fgiunchedi as it's a swift backend.

RobH created this task. Jan 19 2017, 1:17 AM

Host can be taken down at any time with a clean shutdown to make sure all services are stopped

System IDRAC license has expired

I switched the IDRAC from Dedicated to NIC2 so the server can be accessed in case there is something to do. This is just a temporary fix.

RobH added a comment. Feb 21 2017, 6:20 PM

is the zip of the license info. @Papaul: Next time you need me to pull this, please assign it to me so I won't miss it.

Please update the license on the system. While this can be done remotely, I fear it may make the drac unreachable (the network cable needs to be moved back to the dedicated port after the update), so it seems best to leave this for the on-site tech to do while they are at the datacenter.

That zip is set for only members of #acl*operations-team to be able to download it.

Papaul closed this task as "Resolved". Feb 22 2017, 8:32 PM

License and firmware update complete. I switched the IDRAC back to Dedicated.

RobH reopened this task as "Open". Feb 22 2017, 10:01 PM
RobH reassigned this task from Papaul to fgiunchedi.
RobH added a subscriber: Papaul.

Re-opening, because this was just part of the issues on this host.

Now that the drac is back online, the host still fails the following when run in the os:

robh@ms-be2002:~$ sudo ipmi-chassis --get-chassis-status
ipmi_cmd_get_chassis_status: internal system error

This may be due to IPMI not being enabled in the bios. So we'll need to reboot this into bios.
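When re-testing after a BIOS or firmware change, it can help to classify the result of the in-band query instead of reading the output by eye. The following is only a minimal sketch: `classify_ipmi_result` is a hypothetical helper (not part of FreeIPMI), and it only matches the error string quoted in this task.

```shell
#!/bin/sh
# Sketch: classify the result of an in-band IPMI chassis query.
# classify_ipmi_result is a hypothetical helper, not a FreeIPMI command.
#   $1: exit status of the ipmi-chassis command
#   $2: text the command printed (stdout and stderr combined)
classify_ipmi_result() {
    status=$1
    output=$2
    if [ "$status" -eq 0 ]; then
        echo "ok"
    elif printf '%s\n' "$output" | grep -q 'internal system error'; then
        # Same failure string as reported in this task
        echo "internal-system-error"
    else
        echo "other-failure"
    fi
}

# Usage on the host (needs root and FreeIPMI installed):
#   out=$(sudo ipmi-chassis --get-chassis-status 2>&1); rc=$?
#   classify_ipmi_result "$rc" "$out"
```

The classification does not say why the query failed; whether the cause is the BIOS setting, the firmware, or something else still has to be checked by hand.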

Double checking with @fgiunchedi that this isn't repooled and we can take it back offline as needed? Please advise and assign back to me.

fgiunchedi reassigned this task from fgiunchedi to RobH. Feb 23 2017, 9:46 AM

thanks for checking! ms-be hosts can be taken down, one at a time, at any time for brief periods (e.g. one day) via graceful shutdown to make sure all swift daemons and rsync are stopped.
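The graceful shutdown described above could be sketched roughly as follows. This is an illustration under assumptions, not the procedure actually used on this host: `shutdown_plan`, `run_plan` and `DRY_RUN` are hypothetical names, the rsync service name is assumed, and the script only prints the commands unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Sketch of a graceful shutdown for one swift backend. Dry run by default.
# shutdown_plan, run_plan and DRY_RUN are hypothetical names; the rsync
# service name is an assumption about this host.
shutdown_plan() {
    echo "swift-init all stop"   # stop all swift daemons
    echo "service rsync stop"    # stop rsync (assumed service name)
    echo "shutdown -h now"       # clean power-off
}

run_plan() {
    shutdown_plan | while IFS= read -r cmd; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "would run: $cmd"
        else
            # Stop at the first failing step; keep the loop's stdin intact
            sudo sh -c "$cmd" </dev/null || exit 1
        fi
    done
}

# Usage: run_plan              (prints the three steps)
#        DRY_RUN=0 run_plan    (actually executes them)
```

Only one ms-be host should be taken down this way at a time, as noted above.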

IPMI was disabled; it is now enabled.

RobH reassigned this task from RobH to Papaul. Mar 10 2017, 9:37 PM

This system still has the ipmi issue when run on the local OS:

robh@ms-be2002:~$ sudo ipmi-chassis --get-chassis-status
ipmi_cmd_get_chassis_status: internal system error

Please update the firmware of the idrac as well as the bios. I'll leave this for the onsite to do, since it can result in requiring onsite power cycling.

Alternatively, I'm happy to push the uploaded firmware, just need @Papaul to confirm being onsite when I do. (Whatever is easier!)

FWIW this host is slated for decom in some weeks; I wouldn't spend too much time on its idrac, especially if there are other hosts with a broken idrac that are not being decommissioned.

Mentioned in SAL (#wikimedia-operations) [2017-03-14T15:39:18Z] <godog> shut ms-be2002 for idrac / bios troubleshooting T155689