cloudvirt1013: server down for no reason (power issue?)
Closed, Resolved, Public

Assigned To

	Jclark-ctr

Authored By

	aborrero
	Dec 22 2019, 9:45 AM

Description

Cloudvirt1013 server has spontaneously shut down twice in the last week:

On 2019-12-22 at about 09:30UTC, restarted by iLO
On 2019-12-27 at 10:45UTC, restarted by me at the mgmt console in response to a page

Details

	Subject	Repo	Branch	Lines +/-
	wmcs: eqiad1: repool cloudvirt1013	operations/puppet	production	+2 -1
	cloud-vps: depool cloudvirt1013, pool cloudvirt1024	operations/puppet	production	+3 -3


Related Objects

Status	Assigned	Task
		Unknown Object (Task)
Resolved	Cmjohnson	T138509 rack/setup/install/deploy labvirt1012 labvirt1013 labvirt1014 nodes (cloudvirt1012 cloudvirt1013 cloudvirt1014)
Resolved	Jclark-ctr	T241313 cloudvirt1013: server down for no reason (power issue?)
		Unknown Object (Task)

Event Timeline

aborrero created this task. Dec 22 2019, 9:45 AM

Restricted Application added a subscriber: Aklapper. Dec 22 2019, 9:45 AM

Mentioned in SAL (#wikimedia-cloud) [2019-12-22T09:45:10Z] <arturo> cloudvirt1013 is back (did it alone) T241313

aborrero triaged this task as High priority. Dec 22 2019, 9:45 AM

aborrero moved this task from Inbox to Soon! on the cloud-services-team (Kanban) board.

I'm not sure it's true that I couldn't reach the iLO: I was running install_console against cloudvirt1013.eqiad.wmnet instead of cloudvirt1013.mgmt.eqiad.wmnet.
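
For anyone who hits the same mix-up, here is a minimal sketch of deriving the management-interface name from a host name. It assumes the pattern seen in the comment above (short hostname, then `mgmt`, then the rest of the domain) holds in general; the `mgmt_fqdn` helper is hypothetical and not part of any existing tooling.

```python
def mgmt_fqdn(host_fqdn: str) -> str:
    """Turn host.site.wmnet into host.mgmt.site.wmnet.

    Assumption: the mgmt name is formed by inserting 'mgmt' right after
    the short hostname, as in the two names quoted in this task.
    """
    short, sep, domain = host_fqdn.partition(".")
    if not sep:
        raise ValueError(f"expected a fully qualified name, got {host_fqdn!r}")
    if domain.startswith("mgmt."):
        # Already a management name; return it unchanged.
        return host_fqdn
    return f"{short}.mgmt.{domain}"


print(mgmt_fqdn("cloudvirt1013.eqiad.wmnet"))
```

The resulting name is the one to pass to install_console when the goal is to reach the iLO rather than the host itself.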

aborrero merged a task: T241315: cloudvirt1013 apparent power loss. Dec 22 2019, 10:02 AM

aborrero added a subscriber: Andrew.

Latest events in the iLO console:

$</>hpiLO-> show /system1/log1

status=0
status_tag=COMMAND COMPLETED
Sun Dec 22 05:08:05 2019



/system1/log1
  Targets
    record1
    record2
    record3
    record4
    record5
    record6
    record7
    record8
    record9
    record10
    record11
    record12
    record13
    record14
    record15
    record16
    record17
    record18
    record19
    record20
    record21
    record22
    record23
    record24
    record25
    record26
    record27
  Properties
  Verbs
    cd version exit show delete

$</>hpiLO-> show /system1/log1/record27

status=0
status_tag=COMMAND COMPLETED
Sun Dec 22 05:08:30 2019



/system1/log1/record27
  Targets
  Properties
    number=27
    severity=Critical
    date=12/22/2019
    time=09:38
    description=ASR Detected by System ROM
  Verbs
    cd version exit show


$</>hpiLO-> show /system1/log1/record26 

status=0
status_tag=COMMAND COMPLETED
Sun Dec 22 05:08:40 2019



/system1/log1/record26
  Targets
  Properties
    number=26
    severity=Caution
    date=12/22/2019
    time=04:24
    description=Smart Storage Battery has exceeded the maximum amount of devices supported (Battery 1, service information: 0x07). Action: 1. Remove additional devices. 2. Consult server troubleshooting guide. 3. Gather AHS log and contact Support
  Verbs
    cd version exit show


$</>hpiLO-> show /system1/log1/record25

status=0
status_tag=COMMAND COMPLETED
Sun Dec 22 05:09:06 2019



/system1/log1/record25
  Targets
  Properties
    number=25
    severity=Repaired
    date=10/24/2019
    time=07:10
    description=System Power Supplies Not Redundant
  Verbs
    cd version exit show
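
Stepping through the records one at a time is tedious, so here is a rough sketch of turning one pasted `show /system1/log1/recordN` block into a dictionary for sorting or filtering by severity. It assumes the layout shown above (key=value lines between the `Properties` and `Verbs` headings); `parse_ilo_record` is a hypothetical helper, not an HPE tool.

```python
SAMPLE = """\
/system1/log1/record27
  Targets
  Properties
    number=27
    severity=Critical
    date=12/22/2019
    time=09:38
    description=ASR Detected by System ROM
  Verbs
    cd version exit show
"""


def parse_ilo_record(text: str) -> dict:
    """Collect the key=value pairs listed under 'Properties'."""
    props = {}
    in_properties = False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped == "Properties":
            in_properties = True
            continue
        if stripped == "Verbs":
            in_properties = False
            continue
        if in_properties and "=" in stripped:
            # Split on the first '=' only, so values may contain '='.
            key, _, value = stripped.partition("=")
            props[key] = value
    return props


record = parse_ilo_record(SAMPLE)
print(record["severity"], record["date"], record["time"], record["description"])
```

Values are kept as strings; converting `date` and `time` to a timestamp is left out because the time zone of the iLO clock is not stated in the output.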

This just happened again -- suddenly down, restarted from iLO just fine.

Change 560837 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] cloud-vps: depool cloudvirt1013, pool cloudvirt1024

https://gerrit.wikimedia.org/r/560837

gerritbot added a project: Patch-For-Review. Dec 27 2019, 10:57 AM

Change 560837 merged by Andrew Bogott:
[operations/puppet@production] cloud-vps: depool cloudvirt1013, pool cloudvirt1024

https://gerrit.wikimedia.org/r/560837

Mentioned in SAL (#wikimedia-cloud) [2019-12-27T11:07:48Z] <andrewbogott> migrating cyberbot-db-01 to cloudvirt1009 in response to T241313

Maintenance_bot removed a project: Patch-For-Review. Dec 27 2019, 11:10 AM

Mentioned in SAL (#wikimedia-cloud) [2019-12-27T11:12:59Z] <andrewbogott> migrating osmit-test to cloudvirt1009 in response to T241313

Mentioned in SAL (#wikimedia-cloud) [2019-12-27T11:13:24Z] <andrewbogott> migrating deployment-aqs03 to cloudvirt1009 in response to T241313

aborrero updated the task description. Dec 27 2019, 11:16 AM

I've drained all VMs off this server and put it in downtime until March 1st for investigation or repair. I don't have any good ideas about how to repair it.

Andrew updated the task description. Dec 27 2019, 3:26 PM

Andrew mentioned this in T241492: cloudvirt1014 crash. Dec 27 2019, 5:15 PM

cloudvirt1013, cloudvirt1014, and cloudvirt1023 are the only cloudvirts running:

Linux 4.9.0-11-amd64 #1 SMP Debian 4.9.189-3+deb9u2 (2019-11-11)

cloudvirt1023 is held back as a spare, so not under load.

The kernel is probably unrelated: they're running that new kernel because of the post-crash reboot, and were running the standard kernel before that.

aborrero added a parent task: T138509: rack/setup/install/deploy labvirt1012 labvirt1013 labvirt1014 nodes (cloudvirt1012 cloudvirt1013 cloudvirt1014). Jan 4 2020, 3:44 PM

Note to DCOps: This system is already drained of VMs. We may simply need to downtime it to shut down for troubleshooting.

wiki_willy assigned this task to Jclark-ctr. Jan 6 2020, 6:24 PM

wiki_willy added a project: ops-eqiad.

Restricted Application added a project: SRE. Jan 6 2020, 6:24 PM

wiki_willy moved this task from Backlog to Cloud Tasks on the ops-eqiad board. Jan 6 2020, 6:25 PM

bd808 edited projects, added cloud-services-team (Hardware); removed cloud-services-team (Kanban). Jan 9 2020, 10:22 PM

bd808 moved this task from Backlog to Hardware faults on the cloud-services-team (Hardware) board.

JHedden mentioned this in T242472: Degraded RAID on cloudvirt1013. Jan 10 2020, 10:44 PM

Andrew mentioned this in T243414: relocate/reimage cloudvirt1013 with 10G interfaces. Jan 22 2020, 1:58 PM

Mentioned in SAL (#wikimedia-cloud) [2020-01-23T20:17:52Z] <jeh> cloudvirt1013 set icinga downtime and powering down for hardware maintenance T241313

313-hpe smart storage battery 1 Failure - battery shutdown event code: 0x400
action: restart system

Needs a replacement BBU. @wiki_willy, can we order a new one?

Sure, no problem @Jclark-ctr. I've opened up a procurement task via T243547 for @RobH to order a replacement bbu. Thanks, Willy

Replaced the BBU; no errors at this time. Closing procurement task T243547, as it is not needed at this time.

Jclark-ctr closed this task as Resolved. Jan 23 2020, 10:11 PM

Change 567024 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] wmcs: eqiad1: repool cloudvirt1013

https://gerrit.wikimedia.org/r/567024

gerritbot added a project: Patch-For-Review. Jan 24 2020, 12:52 PM

Mentioned in SAL (#wikimedia-cloud) [2020-01-24T12:52:52Z] <arturo> repooling cloudvirt1013 after HW got fixed (T241313)

Change 567024 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] wmcs: eqiad1: repool cloudvirt1013

https://gerrit.wikimedia.org/r/567024

Maintenance_bot removed a project: Patch-For-Review. Jan 24 2020, 1:11 PM

Mentioned in SAL (#wikimedia-cloud) [2020-01-24T15:10:53Z] <jeh> remove icinga downtime for cloudvirt1013 T241313

Jclark-ctr closed subtask Unknown Object (Task) as Resolved. Feb 25 2020, 10:04 PM
