ms-be1033 not powering up
Closed, Resolved · Public

Description

Yesterday, during T223126 (Install new PDUs into b5-eqiad), ms-be1033 powered off at the beginning of the window (down since 2019-05-16 12:54:56 according to icinga), and after the work was completed @Cmjohnson could not power it back on. Filing a task to track further diagnosis and next steps.

Event Timeline

Restricted Application added a project: Operations. · May 17 2019, 8:30 AM
fgiunchedi moved this task from Backlog to Doing on the User-fgiunchedi board. · May 21 2019, 9:59 AM

Since the host is certainly not coming back for another week, I'm going to de-weight it in swift.

Change 511670 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/software/swift-ring@master] eqiad-prod: start depool ms-be1033

https://gerrit.wikimedia.org/r/511670

Mentioned in SAL (#wikimedia-operations) [2019-05-21T13:25:44Z] <godog> swift eqiad-prod: start depool ms-be1033 - T223518

Change 511670 merged by Filippo Giunchedi:
[operations/software/swift-ring@master] eqiad-prod: start depool ms-be1033

https://gerrit.wikimedia.org/r/511670

Mentioned in SAL (#wikimedia-operations) [2019-05-23T13:00:26Z] <godog> swift eqiad-prod: ms-be1033 weight to 1500 - T223518

Mentioned in SAL (#wikimedia-operations) [2019-05-27T13:02:58Z] <godog> swift eqiad-prod: ms-be1033 weight to 0 - T223518

Steps I have taken

  • I took the server down to its bare minimum operating configuration (1 CPU and 1 DIMM) and it still will not boot. I created a support ticket with HP.

5338974069

jijiki added a subscriber: jijiki. · May 30 2019, 12:08 PM

I downtimed the host on icinga for another week

The HP technician will be here June 7 at 10:00 Ashburn time.

Cmjohnson closed this task as Resolved. · Jun 7 2019, 4:21 PM

The motherboard was replaced and the server is back up

Mentioned in SAL (#wikimedia-operations) [2019-06-08T11:58:20Z] <godog> stop swift processes on ms-be1033 - T223518

Mentioned in SAL (#wikimedia-operations) [2019-06-11T10:54:13Z] <godog> wipe fs on ms-be1033 data partitions - T223518

fgiunchedi reopened this task as Open. · Jun 11 2019, 12:53 PM
fgiunchedi claimed this task.

Mentioned in SAL (#wikimedia-operations) [2019-06-11T12:54:13Z] <godog> swift eqiad-prod: put back ms-be1033 - T223518

Mentioned in SAL (#wikimedia-operations) [2019-06-12T11:55:43Z] <godog> swift eqiad-prod: put back ms-be1033 - T223518

Mentioned in SAL (#wikimedia-operations) [2019-06-25T12:48:40Z] <godog> swift eqiad-prod: put back ms-be1033 - T223518

Mentioned in SAL (#wikimedia-operations) [2019-07-01T09:54:28Z] <godog> swift eqiad-prod eqiad-prod: put back ms-be1033 - T223518

fgiunchedi closed this task as Resolved. · Jul 1 2019, 10:02 AM

The last rebalance is underway now to put ms-be1033 fully back in service. Resolving.