
cloudvirt1013: server down for no reason (power issue?)
Closed, Resolved (Public)

Description

Cloudvirt1013 server has spontaneously shut down twice in the last week:

On 2019-12-22 at about 09:30 UTC, restarted by iLO
On 2019-12-27 at 10:45 UTC, restarted by me at the mgmt console in response to a page

Event Timeline

Mentioned in SAL (#wikimedia-cloud) [2019-12-22T09:45:10Z] <arturo> cloudvirt1013 is back (did it alone) T241313

aborrero moved this task from Inbox to Soon! on the cloud-services-team (Kanban) board.

I'm not sure it's actually true that I couldn't reach the iLO. I was pointing install_console at cloudvirt1013.eqiad.wmnet instead of cloudvirt1013.mgmt.eqiad.wmnet.
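To avoid mixing up the two names, here is a small Python sketch that derives the management name from a host name. It assumes the convention implied by the two hostnames above (a `mgmt` label inserted right after the short hostname). The helper name `mgmt_fqdn` is hypothetical and not part of any existing tooling.

```python
def mgmt_fqdn(host_fqdn):
    """Return the management-interface FQDN for a host FQDN.

    Assumption: the mgmt name is formed by inserting a 'mgmt' label
    after the short hostname, as in the two names quoted above.
    """
    short, sep, domain = host_fqdn.partition(".")
    if not sep:
        raise ValueError("expected a fully qualified name, got %r" % host_fqdn)
    # Leave the name alone if it already points at the mgmt interface.
    if domain.split(".")[0] == "mgmt":
        return host_fqdn
    return f"{short}.mgmt.{domain}"


print(mgmt_fqdn("cloudvirt1013.eqiad.wmnet"))
```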

Latest events in the iLO console:

$</>hpiLO-> show /system1/log1

status=0
status_tag=COMMAND COMPLETED
Sun Dec 22 05:08:05 2019



/system1/log1
  Targets
    record1
    record2
    record3
    record4
    record5
    record6
    record7
    record8
    record9
    record10
    record11
    record12
    record13
    record14
    record15
    record16
    record17
    record18
    record19
    record20
    record21
    record22
    record23
    record24
    record25
    record26
    record27
  Properties
  Verbs
    cd version exit show delete

$</>hpiLO-> show /system1/log1/record27

status=0
status_tag=COMMAND COMPLETED
Sun Dec 22 05:08:30 2019



/system1/log1/record27
  Targets
  Properties
    number=27
    severity=Critical
    date=12/22/2019
    time=09:38
    description=ASR Detected by System ROM
  Verbs
    cd version exit show


$</>hpiLO-> show /system1/log1/record26 

status=0
status_tag=COMMAND COMPLETED
Sun Dec 22 05:08:40 2019



/system1/log1/record26
  Targets
  Properties
    number=26
    severity=Caution
    date=12/22/2019
    time=04:24
    description=Smart Storage Battery has exceeded the maximum amount of devices supported (Battery 1, service information: 0x07). Action: 1. Remove additional devices. 2. Consult server troubleshooting guide. 3. Gather AHS log and contact Support
  Verbs
    cd version exit show


$</>hpiLO-> show /system1/log1/record25

status=0
status_tag=COMMAND COMPLETED
Sun Dec 22 05:09:06 2019



/system1/log1/record25
  Targets
  Properties
    number=25
    severity=Repaired
    date=10/24/2019
    time=07:10
    description=System Power Supplies Not Redundant
  Verbs
    cd version exit show
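For anyone triaging similar event logs, here is a minimal Python sketch that turns one `show /system1/log1/recordN` dump into a dict. It assumes the `key=value` layout under `Properties` seen in the output above. The function name `parse_ilo_record` is hypothetical and not part of any HPE tool.

```python
def parse_ilo_record(text):
    """Parse the Properties section of one iLO log record into a dict."""
    props = {}
    in_properties = False
    for raw in text.splitlines():
        line = raw.strip()
        if line == "Properties":
            in_properties = True
            continue
        if line == "Verbs":
            in_properties = False
            continue
        if in_properties and "=" in line:
            # Split on the first '=' only; descriptions may contain more.
            key, _, value = line.partition("=")
            props[key] = value
    return props


# Sample copied from the record27 output above.
SAMPLE = """\
/system1/log1/record27
  Targets
  Properties
    number=27
    severity=Critical
    date=12/22/2019
    time=09:38
    description=ASR Detected by System ROM
  Verbs
    cd version exit show
"""

record = parse_ilo_record(SAMPLE)
print(record["severity"], record["date"], record["time"], record["description"])
```

Collecting the parsed records in a list and filtering on `severity` would make it easier to spot the Critical and Caution entries among the 27 records.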

This just happened again -- suddenly down, restarted from iLO just fine.

Change 560837 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] cloud-vps: depool cloudvirt1013, pool cloudvirt1024

https://gerrit.wikimedia.org/r/560837

Change 560837 merged by Andrew Bogott:
[operations/puppet@production] cloud-vps: depool cloudvirt1013, pool cloudvirt1024

https://gerrit.wikimedia.org/r/560837

Mentioned in SAL (#wikimedia-cloud) [2019-12-27T11:07:48Z] <andrewbogott> migrating cyberbot-db-01 to cloudvirt1009 in response to T241313

Mentioned in SAL (#wikimedia-cloud) [2019-12-27T11:12:59Z] <andrewbogott> migrating osmit-test to cloudvirt1009 in response to T241313

Mentioned in SAL (#wikimedia-cloud) [2019-12-27T11:13:24Z] <andrewbogott> migrating deployment-aqs03 to cloudvirt1009 in response to T241313

I've drained all VMs off this server and put it in downtime until March 1st for investigation or repair. I don't have any good ideas about how to repair it.

cloudvirt1013, cloudvirt1014, and cloudvirt1023 are the only cloudvirts running

Linux 4.9.0-11-amd64 #1 SMP Debian 4.9.189-3+deb9u2 (2019-11-11)

cloudvirt1023 is held back as a spare, so not under load.

The kernel is probably unrelated: those hosts are running the new kernel only because of the post-crash reboot, and were running the standard kernel before that.

Note to DCOps: This system is already drained of VMs. We may simply need to set a downtime and shut it down for troubleshooting.

Mentioned in SAL (#wikimedia-cloud) [2020-01-23T20:17:52Z] <jeh> cloudvirt1013 set icinga downtime and powering down for hardware maintenance T241313

313-hpe smart storage battery 1 Failure - battery shutdown event code: 0x400
action: restart system

Needs a replacement BBU. @wiki_willy, can we order a new one?

wiki_willy added a subtask: Unknown Object (Task). Jan 23 2020, 8:44 PM
wiki_willy added a subscriber: RobH.

Sure, no problem @Jclark-ctr. I've opened up a procurement task via T243547 for @RobH to order a replacement bbu. Thanks, Willy

Replaced the BBU; no errors at this time. Closing procurement task T243547, as it is not needed at this time.

Change 567024 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] wmcs: eqiad1: repool cloudvirt1013

https://gerrit.wikimedia.org/r/567024

Mentioned in SAL (#wikimedia-cloud) [2020-01-24T12:52:52Z] <arturo> repooling cloudvirt1013 after HW got fixed (T241313)

Change 567024 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] wmcs: eqiad1: repool cloudvirt1013

https://gerrit.wikimedia.org/r/567024

Mentioned in SAL (#wikimedia-cloud) [2020-01-24T15:10:53Z] <jeh> remove icinga downtime for cloudvirt1013 T241313

Jclark-ctr closed subtask Unknown Object (Task) as Resolved. Feb 25 2020, 10:04 PM