
db1173 won't boot up
Closed, Resolved, Public

Description

While trying to reimage db1173 (old s6 master), the host won't get back up.
I have tried several things (including a hard reset, and a power-off followed by powering back on), but even though the ILO says the power is on, I don't see the server booting up at all.
Could you take a look, if possible this week? This is the host we'd fail over to in case of a primary master failure in s6.

If it cannot be done this week, let me know so I can prepare another host as a standby master.

Thanks!

Event Timeline

Marostegui created this task.
Marostegui moved this task from Triage to In progress on the DBA board.

@Marostegui at first glance, the server was hanging during the boot process at memory configuration. I did not get any hardware errors. I stripped the server down to minimum POST requirements (1 CPU and 1 DIMM) and added everything back from there; on the last reboot, DIMM A7 was reported as failed. A ticket has been placed with Dell, and the part may be here tomorrow or Friday. For now I replaced it with a spare from a decommissioned server, so you do not have to set up another backup server.

Marostegui lowered the priority of this task from High to Medium. (Jun 15 2022, 5:20 AM)

Thank you so much @Cmjohnson! I can indeed access the host now and I have reimaged it successfully. Decreasing the priority since the initial issue was triaged (so fast!). Once the part is there, just let me know a day and time and I can leave the host off for you.

@Cmjohnson did the part arrive? Do you want to install the new part in this host, or would you prefer to leave the DIMM that you already installed? Thank you!

Excellent, can you let me know a day and time that works for you to replace it? I can leave the host offline for you.

I can do it today if you can take it offline.

Mentioned in SAL (#wikimedia-operations) [2022-06-29T09:01:21Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db1173 for on-site maintenance T310595', diff saved to https://phabricator.wikimedia.org/P30603 and previous config saved to /var/cache/conftool/dbconfig/20220629-090120-root.json

Change 809540 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db1173: Disable notifications

https://gerrit.wikimedia.org/r/809540

@Jclark-ctr host offline, you can proceed whenever you want. Once you are done, please power it back on and I will take it from there.
Thanks a lot!

Change 809540 merged by Marostegui:

[operations/puppet@production] db1173: Disable notifications

https://gerrit.wikimedia.org/r/809540

Replaced DIMM A7 and powered the host on.

@Jclark-ctr completed the task. We will send the broken parts back to Dell.