
mw2139 failed to boot - hardware check
Closed, Resolved (Public)

Description

During T174431 (P6996) I noticed that mw2139.codfw.wmnet failed to install: it got stuck in the middle of the process, and I got no console output when connecting.

I retried twice, to no avail.

When I manually rebooted and watched the mgmt console, I saw a few lines of boot text, but after

"Scanning for devices. Please wait, this may take several minutes..." it just stopped and got stuck again.

It looks like a hardware failure. Please check whether you can boot it, and check the hardware.

Event Timeline

Dzahn created this task. May 10 2018, 8:56 PM
Restricted Application added a subscriber: Aklapper. May 10 2018, 8:56 PM
Dzahn triaged this task as Normal priority. May 10 2018, 8:56 PM

racadm getsel shows nothing besides the log-cleared entry:

/admin1-> racadm getsel
Record:      1
Date/Time:   01/15/2015 23:01:17
Source:      system
Severity:    Ok
Description: Log cleared.
------------------------------

After waiting a couple of hours and getting back on the console, it was blank again.
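
For anyone who wants to check a longer event log without reading it by eye, here is a minimal Python sketch that parses text in the same layout as the getsel output above and lists any record whose severity is not "Ok". It assumes the output has already been captured as a string; the names parse_sel and problem_records are made up for this example and are not part of racadm or any WMF tooling.

```python
import re

# Field lines as they appear in the getsel output pasted in this task.
FIELD_RE = re.compile(r"^(Record|Date/Time|Source|Severity|Description):\s*(.*)$")

SAMPLE = """\
/admin1-> racadm getsel
Record:      1
Date/Time:   01/15/2015 23:01:17
Source:      system
Severity:    Ok
Description: Log cleared.
------------------------------
"""


def parse_sel(text):
    """Split getsel-style text into a list of dicts, one per record."""
    records = []
    current = {}
    for raw in text.splitlines():
        line = raw.strip()
        # A line made only of dashes closes the current record.
        if line and set(line) == {"-"}:
            if current:
                records.append(current)
                current = {}
            continue
        match = FIELD_RE.match(line)
        if match:
            current[match.group(1)] = match.group(2).strip()
        # Anything else (prompt line, blank lines) is ignored.
    if current:
        records.append(current)
    return records


def problem_records(records):
    """Return the records whose Severity is anything other than Ok."""
    return [r for r in records if r.get("Severity", "").lower() != "ok"]


if __name__ == "__main__":
    records = parse_sel(SAMPLE)
    print(len(records), "record(s),", len(problem_records(records)), "with non-Ok severity")
```

On the log above this finds a single record (the 2015 "Log cleared." entry) and no problem records, which matches the reading that the SEL gives no hint about the hang.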

Papaul reassigned this task from Papaul to Dzahn. May 14 2018, 3:09 PM
Papaul added a subscriber: Papaul.

@Dzahn after unplugging the PSU from the server and booting the server, I get the error below. I have no option to reset the IDRAC when I go to the IDRAC settings section.

Troubleshooting

  • Replaced the IDRAC card with another IDRAC card from one of the decommissioned servers

Result

  • Same problem

Conclusion

  • We are looking at a bad main board

Options

Since the server is out of warranty (2018-1-19):

  • Replace the main board with one from one of the decommissioned servers
  • Decommission the server

Dzahn reassigned this task from Dzahn to Papaul. May 14 2018, 5:40 PM

@Papaul thank you! We should try the mainboard replacement, but only if it's a relatively easy thing to do and we have one around. If it causes a considerable amount of work/time or things have to be ordered, we should just decom it.

Until then I will keep it deactivated, and it doesn't have to be a high priority.

Change 433185 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] update MAC address of mw2139

https://gerrit.wikimedia.org/r/433185

Change 433185 merged by Dzahn:
[operations/puppet@production] update MAC address of mw2139

https://gerrit.wikimedia.org/r/433185

Papaul reassigned this task from Papaul to Dzahn. May 15 2018, 5:47 PM

@Dzahn I replaced the main board and updated the IDRAC and BIOS. It is all yours. I also installed the OS on the system.

Mentioned in SAL (#wikimedia-operations) [2018-05-15T19:03:51Z] <mutante> mw2139 - wmf-auto-reimage --conftoool --no-verify (T194426)

Mentioned in SAL (#wikimedia-operations) [2018-05-15T19:05:20Z] <mutante> mw2139 - wmf-auto-reimage --conftoool --new (because it got "Failed to icinga_downtime" and has a new mainboard (T194426)

Mentioned in SAL (#wikimedia-operations) [2018-05-15T23:10:49Z] <mutante> mw2139 - reimaged, scap pull, apache-fast-test baseurls from naos, repooled with confctl (T194426)

Dzahn closed this task as Resolved. May 15 2018, 11:12 PM
Dzahn added a subscriber: Muehlenhoff.

Thank you @Papaul! Works and is in use again now. Closing ticket as resolved.

(fyi @Muehlenhoff )

Vvjjkkii renamed this task from mw2139 failed to boot - hardware check to l6caaaaaaa. Jul 1 2018, 1:10 AM
Vvjjkkii reopened this task as Open.
Vvjjkkii removed Dzahn as the assignee of this task.
Vvjjkkii raised the priority of this task from Normal to High.
Vvjjkkii updated the task description.
Vvjjkkii removed subscribers: gerritbot, Aklapper.
CommunityTechBot renamed this task from l6caaaaaaa to mw2139 failed to boot - hardware check. Jul 2 2018, 5:52 AM
CommunityTechBot closed this task as Resolved.
CommunityTechBot assigned this task to Dzahn.
CommunityTechBot lowered the priority of this task from High to Normal.
CommunityTechBot updated the task description.
CommunityTechBot added subscribers: gerritbot, Aklapper.