db1162 crashed
Closed, Resolved (Public)

Description

db1162 is a new server that was pooled into production for the first time on 15th Feb, and it has just crashed.
The iDRAC is also unavailable, so I cannot check the HW logs.

We need on-site support to find out what caused the crash and to bring it back to life.

Event Timeline

Change 665646 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db1162: Disable notifications

https://gerrit.wikimedia.org/r/665646

Change 665646 merged by Marostegui:
[operations/puppet@production] db1162: Disable notifications

https://gerrit.wikimedia.org/r/665646

I pulled the power, drained the flea power, and plugged it back in, but the server will not even power up.

Created a dispatch with Dell SR1052419298

Update: this was scheduled for today, but when I sent the tech the access ticket I was told the dispatch had been re-assigned and that someone should have contacted me. That did not happen. I need to sort this out, so the visit will not happen until next week.

Thanks for the update. Much appreciated!

This has been moved to this coming Friday at 10am local time (1500UTC)

Was this done last Friday in the end?
Thanks

The motherboard was swapped on Friday, but that did not fix the issue. The Dell tech did more troubleshooting and determined that the backplane is bad. I am waiting on the part and for the tech to schedule a time with me to replace it.

Dell will be here tomorrow morning to replace the backplane.

db1162 is back online - updated Netbox and resolving the task

Thanks Chris - I can access the host now. I will reimage it and populate it with data on Monday.

Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts:

['db1162.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202103150605_marostegui_10267.log.

It looks like the host isn't rebooting via PXE - trying to force it manually

@Cmjohnson I am not able to PXE boot the host. Neither via the normal reimage process nor forcing PXE manually with:

[06:37:22] marostegui@cumin1001:~$ sudo ipmitool -I lanplus -H db1162.mgmt.eqiad.wmnet -U root -E chassis bootdev pxe
Unable to read password from environment
Password:
Set Boot Device to pxe

[06:37:25] marostegui@cumin1001:~$ ipmitool -I lanplus -H db1162.mgmt.eqiad.wmnet -U root -E chassis bootparam get 5
Unable to read password from environment
Password:
Boot parameter version: 1
Boot parameter 5 is valid/unlocked
Boot parameter data: 8004020000
 Boot Flags :
   - Boot Flag Valid
   - Options apply to only next boot
   - BIOS PC Compatible (legacy) boot
   - Boot Device Selector : Force PXE
   - Console Redirection control : System Default
   - Lock Out Sleep Button
   - BIOS verbosity : Request console redirection be enabled
   - BIOS Mux Control Override : BIOS uses recommended setting of the mux at the end of POST
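
For reference, the `Boot parameter data` value above can be decoded by hand. The following is a minimal sketch, assuming the standard IPMI 2.0 layout of boot parameter 5 (boot flags); only the first two data bytes are decoded, and the function name `decode_boot_flags` and the partial `BOOT_DEVICES` table are illustrative, not part of ipmitool.

```python
# Sketch: decode the first two bytes of IPMI boot parameter 5 (boot flags).
# Assumes the IPMI 2.0 layout; BOOT_DEVICES lists only some selector values.
BOOT_DEVICES = {
    0x0: "No override",
    0x1: "Force PXE",
    0x2: "Force boot from default hard drive",
    0x5: "Force boot from default CD/DVD",
    0x6: "Force boot into BIOS setup",
}


def decode_boot_flags(hex_data: str) -> dict:
    """Decode validity, persistence, boot type and boot device selector."""
    raw = bytes.fromhex(hex_data)
    if len(raw) < 2:
        raise ValueError("expected at least two bytes of boot flag data")
    data1, data2 = raw[0], raw[1]
    selector = (data2 >> 2) & 0x0F  # bits 5:2 of the second byte
    return {
        "valid": bool(data1 & 0x80),       # bit 7: boot flags valid
        "persistent": bool(data1 & 0x40),  # bit 6: 0 means next boot only
        "boot_type": "EFI" if data1 & 0x20 else "legacy",  # bit 5
        "device": BOOT_DEVICES.get(selector, f"selector 0x{selector:x}"),
    }


print(decode_boot_flags("8004020000"))
# {'valid': True, 'persistent': False, 'boot_type': 'legacy', 'device': 'Force PXE'}
```

This agrees with the first four boot flags ipmitool printed, which suggests the BMC did accept the PXE override.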

I also had to force a re-sync of the normal mgmt password before attempting the reimage, as otherwise the iDRAC wouldn't work.

The host keeps booting up from disk - can you please take a look?

Completed auto-reimage of hosts:

['db1162.eqiad.wmnet']

Of which those FAILED:

['db1162.eqiad.wmnet']

Change 672426 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/puppet@production] updating mac address for db1162 to reflect motherboard change

https://gerrit.wikimedia.org/r/672426

Change 672426 merged by Cmjohnson:
[operations/puppet@production] updating mac address for db1162 to reflect motherboard change

https://gerrit.wikimedia.org/r/672426

@Marostegui The MAC address for the NIC changed with the motherboard swap; I just merged the change. The install should work now. Can you try again and resolve this task when it works, please?
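
A stale MAC address like this can be caught before a reimage by comparing the configured value with what the NIC reports. The following is a minimal sketch of that comparison; the helper names `normalize_mac` and `mac_matches` are hypothetical, the addresses in the usage lines are placeholders, and how the two values are obtained (for example from the install server config and from the host itself) is left as an assumption.

```python
import re


def normalize_mac(mac: str) -> str:
    """Return a MAC address as lowercase, colon-separated hex pairs."""
    digits = re.sub(r"[^0-9A-Fa-f]", "", mac).lower()
    if len(digits) != 12:
        raise ValueError(f"not a MAC address: {mac!r}")
    return ":".join(digits[i:i + 2] for i in range(0, 12, 2))


def mac_matches(configured: str, observed: str) -> bool:
    """Compare two MAC addresses regardless of separator style or case."""
    return normalize_mac(configured) == normalize_mac(observed)


# Placeholder addresses, not the real ones for db1162.
print(mac_matches("AA-BB-CC-00-11-22", "aa:bb:cc:00:11:22"))  # True
print(mac_matches("aa:bb:cc:00:11:22", "aa:bb:cc:00:11:23"))  # False
```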

Thanks @Cmjohnson I will try today or tomorrow morning and will close when done.

Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts:

['db1162.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202103151454_marostegui_6279.log.

Completed auto-reimage of hosts:

['db1162.eqiad.wmnet']

and were ALL successful.

db1162 was reimaged nicely
Thank you Chris

I will clone and repool this host tomorrow.