ms-be1034 not powering on
Open, Needs Triage, Public

Description

As per title, host isn't powering on

</>hpiLO-> power
status=0
status_tag=COMMAND COMPLETED
Thu Feb 11 08:21:14 2021

power: server power is currently: Off

</>hpiLO-> power on
status=0
status_tag=COMMAND COMPLETED
Thu Feb 11 08:21:15 2021

Server powering on .......

</>hpiLO-> power
status=0
status_tag=COMMAND COMPLETED
Thu Feb 11 08:21:32 2021

power: server power is currently: Off

</>hpiLO->
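
For anyone who wants to spot this failure mode automatically, here is a minimal sketch that scans an iLO console transcript like the one above and flags the case where `power on` was issued but the last reported state is still Off. The function names (`power_states`, `power_on_failed`) and the `TRANSCRIPT` constant are hypothetical, and the regular expressions are assumptions based only on the output pasted in this task, not on any iLO specification.

```python
import re
from typing import List

# Assumption: the state line looks like the one pasted in this task.
POWER_STATE_RE = re.compile(r"server power is currently:\s*(\w+)", re.IGNORECASE)
# Assumption: the prompt ends with "hpiLO->" as in the transcript above.
POWER_ON_CMD_RE = re.compile(r"hpiLO->\s*power on\b")


def power_states(transcript: str) -> List[str]:
    """Return every power state reported in the transcript, in order."""
    return [m.group(1).capitalize() for m in POWER_STATE_RE.finditer(transcript)]


def power_on_failed(transcript: str) -> bool:
    """True if 'power on' was issued and the last reported state is still Off."""
    states = power_states(transcript)
    issued = POWER_ON_CMD_RE.search(transcript) is not None
    return issued and bool(states) and states[-1] == "Off"


# Condensed copy of the session from this task's description.
TRANSCRIPT = """\
</>hpiLO-> power
power: server power is currently: Off
</>hpiLO-> power on
Server powering on .......
</>hpiLO-> power
power: server power is currently: Off
"""

if __name__ == "__main__":
    print(power_states(TRANSCRIPT))
    print(power_on_failed(TRANSCRIPT))
```

The check only looks at the text of the session; it does not talk to the iLO itself.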

Event Timeline

Cmjohnson closed this task as Declined. Thu, Feb 11, 4:42 PM
Cmjohnson added subscribers: wiki_willy, Cmjohnson.

@fgiunchedi I tried pulling the power and resetting; that would be the typical fix, but it didn't work. Historically, when this has happened we've had to replace the motherboard. This server is out of warranty and unfortunately will have to be decommissioned. On the bright side, there are now some spare 4TB disks. Including @wiki_willy.

Hi @fgiunchedi - since this server is at the 4-year mark, are you OK with decommissioning it? Thanks, Willy

wiki_willy added a parent task: Unknown Object (Task). Thu, Feb 11, 5:34 PM
fgiunchedi mentioned this in Unknown Object (Task). Fri, Feb 12, 8:58 AM
wiki_willy reopened this task as Open. Fri, Feb 12, 7:29 PM
wiki_willy assigned this task to Jclark-ctr.

Hi @fgiunchedi - @Jclark-ctr is going to use some parts from decommissioned servers to try and get the server back up. Thanks, Willy

@fgiunchedi I was able to get the server to boot with a minimal configuration (1 CPU, 1 DIMM).

Swapped both CPUs (so they will be matching speed) with those from a recently decommissioned server (ms-be1018) and reinstalled all memory.

Host is back up; feel free to ping me if you have any questions.

Nice work @Jclark-ctr, much appreciated.

>>! In T274488#6827226, @Jclark-ctr wrote:
> @fgiunchedi I was able to get the server to boot with a minimal configuration (1 CPU, 1 DIMM).
>
> Swapped both CPUs (so they will be matching speed) with those from a recently decommissioned server (ms-be1018) and reinstalled all memory.
>
> Host is back up; feel free to ping me if you have any questions.

Jclark-ctr closed this task as Resolved. Fri, Feb 12, 7:42 PM
elukey reopened this task as Open. Mon, Feb 15, 7:09 AM
elukey added a subscriber: elukey.

ms-be1034 is down again, same issue as the one described by Filippo... :(

Thank you for all the work! LMK how I can help, e.g. if speeding up the decom of one host in T272836 would help (as opposed to decom'ing all hosts at the same time).

Hi @Jclark-ctr - let's just move the hard drives over to the chassis of one of the decom'd hosts (assuming the decom'd host doesn't have any hw issues). It'll probably save some time compared to trying to figure out if it's the motherboard, CPU, etc. Thanks, Willy

@fgiunchedi would you be OK with a chassis swap using the recently decommissioned ms-be1018?

Yes, please proceed

Changed its Netbox status to failed so the Netbox report doesn't alert.

Any updates? Thank you

@fgiunchedi ms-be1017 has been swapped into the place of ms-be1034. Will we be resurrecting the name ms-be1017 or renaming it to ms-be1034? This will most likely need to be reimaged.

Thank you John! Good question, I think we should keep the ms-be1034 name and update Netbox accordingly (cc @Volans @crusnov on the topic of this kind of swap).

Followup question: did all disks (i.e. ssd + hdd) move? I'd assume so but better to double check

Ok, if we go with ms-be1034 as the hostname and we kept the disks, these are my thoughts:

  • BIOS settings: set the management IP to 10.65.4.90 (the ms-be1034.mgmt.eqiad.wmnet IP)
  • HW RAID configuration in BIOS: double check that it's the same as on ms-be1034, i.e. the correct one that accepts the disks as is.
    • Could it complain that the disk IDs got changed?
  • Update the MAC address for ms-be1034.eqiad.wmnet in the Puppet repo
  • Let's use ms-be1034's device in Netbox: https://netbox.wikimedia.org/dcim/devices/1537/
  • Decide which Serial Number, Asset Tag, Procurement Ticket and Purchase date to use in Netbox (ms-be1017 vs ms-be1034)
    • Maybe it is more correct to set the values to those of the chassis (ms-be1017), as we just kept the disks of ms-be1034?
  • Add to the comment field in Netbox the values from the point above for the device that was discarded, so that we keep them around.

@fgiunchedi do you plan to reimage it? If not, we'll need to run a Netbox script after Puppet has run successfully. Not 100% sure if anything needs to be changed on the host; maybe nothing.
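
As a small aid for the first checklist item, here is a minimal sketch of a check that the management hostname resolves to the expected IP once the BIOS/iLO settings are updated. The hostname and IP are the ones quoted in the list above; the function name `mgmt_ip_matches` and the injectable `resolver` parameter are hypothetical and only there so the check can be exercised without access to the internal DNS.

```python
import socket
from typing import Callable

# Values taken from the checklist above.
MGMT_HOSTNAME = "ms-be1034.mgmt.eqiad.wmnet"
EXPECTED_MGMT_IP = "10.65.4.90"


def mgmt_ip_matches(
    hostname: str,
    expected_ip: str,
    resolver: Callable[[str], str] = socket.gethostbyname,
) -> bool:
    """Return True if hostname resolves to expected_ip, False otherwise.

    Resolution errors (e.g. the name is not resolvable from where this
    runs) are treated as a mismatch rather than raised.
    """
    try:
        return resolver(hostname) == expected_ip
    except OSError:
        return False


if __name__ == "__main__":
    # Only meaningful from a host that can resolve *.mgmt.eqiad.wmnet.
    print(mgmt_ip_matches(MGMT_HOSTNAME, EXPECTED_MGMT_IP))
```

This only confirms that DNS and the expected address agree; whether the management interface actually answers on that address still has to be checked separately.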

Thank you for the feedback. I wasn't planning on reimaging the host, but I don't mind doing it either. If a reimage makes things easier then we should do it. Re: on-host changes, I think in theory none are needed since the hw should match (perhaps NIC names? not sure)

I can't think of anything specific that should be changed in the OS.
If you're not reimaging (and I think it's totally ok), then just run the https://netbox.wikimedia.org/extras/scripts/interface_automation.ImportPuppetDB/ Netbox script once the host is up and running and after Puppet has run successfully. You can do a dry run first (leaving the commit-changes option unchecked) and then re-run it with the changes committed. That script makes sure that all the interfaces and assigned IPs in Netbox are in sync with what's in PuppetDB.
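
To illustrate the dry-run-then-commit idea, here is a minimal sketch in plain Python. It is not the ImportPuppetDB script and does not call Netbox or PuppetDB; the function names (`plan_changes`, `sync`), the interface names, and the addresses (from the 192.0.2.0/24 documentation range) are all hypothetical. It only shows the pattern: compute the differences first, and apply them only when `commit` is set.

```python
from typing import Dict, List, Tuple

Change = Tuple[str, str, str]  # (action, interface, ip)


def plan_changes(puppetdb: Dict[str, str], netbox: Dict[str, str]) -> List[Change]:
    """Compare interface -> IP maps and list what would change in netbox."""
    changes: List[Change] = []
    for iface, ip in sorted(puppetdb.items()):
        if iface not in netbox:
            changes.append(("add", iface, ip))
        elif netbox[iface] != ip:
            changes.append(("update", iface, ip))
    for iface in sorted(set(netbox) - set(puppetdb)):
        changes.append(("remove", iface, netbox[iface]))
    return changes


def sync(puppetdb: Dict[str, str], netbox: Dict[str, str], commit: bool = False) -> List[Change]:
    """Return the planned changes; modify netbox in place only if commit is True."""
    changes = plan_changes(puppetdb, netbox)
    if commit:
        for action, iface, ip in changes:
            if action == "remove":
                netbox.pop(iface)
            else:
                netbox[iface] = ip
    return changes


if __name__ == "__main__":
    puppetdb_view = {"eno1": "192.0.2.10"}  # hypothetical data
    netbox_view = {"eth0": "192.0.2.10"}    # hypothetical data
    print(sync(puppetdb_view, netbox_view))               # dry run: report only
    print(sync(puppetdb_view, netbox_view, commit=True))  # apply the same plan
    print(netbox_view)
```

Reviewing the dry-run output before committing is the point of the two-step run: the second call applies exactly the plan that the first call reported, provided nothing changed in between.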

ayounsi removed a subscriber: ayounsi. Tue, Feb 23, 2:57 PM

@Volans @fgiunchedi
Another question. This could be an easy fix for the MAC address as well.
ms-be1034 was in a 10G rack.
ms-be1017 has no 10G NIC.

We could benefit a lot in eqiad from moving devices that don't need 10G to 1G, like @wiki_willy has talked about in SRE meetings. Is 10G needed for this host? If so, I can move the card over. The MAC address would possibly stay the same. I noticed a lot of ms-be hosts are racked in 1G racks.

10G is needed for this host (and for all ms-be hosts, for that matter). Please move the card over; that will indeed keep the MAC address! Thank you