
cloudvirt1016: sudden reboot
Closed, Resolved (Public)

Description

On 2022-04-06 at about 19:00 UTC, the host cloudvirt1016 went down for no apparent reason. Icinga detected the outage and paged the WMCS team.

Inspection of logs inside the server (/var/log/syslog and friends) revealed no clues, which itself hints at a sudden power loss.
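The reasoning above (no shutdown messages before a silent stretch in the log suggests the power was cut) could be checked mechanically. What follows is only a rough sketch, not what was run on the host. It assumes each log line starts with an ISO 8601 timestamp, which traditional syslog timestamps do not, so those would need a different parser. The marker strings, the `find_unclean_gaps` helper, and the sample lines are all hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical marker strings; adjust to whatever the init system really logs.
SHUTDOWN_MARKERS = ("Reached target Shutdown", "systemd-shutdown")


def parse_ts(line):
    """Return the leading ISO 8601 timestamp of a log line, or None."""
    try:
        return datetime.fromisoformat(line.split(" ", 1)[0])
    except ValueError:
        return None


def find_unclean_gaps(lines, min_gap=timedelta(minutes=5)):
    """Return (last_seen, first_seen_after) pairs for silent gaps that were
    not preceded by any shutdown marker since the previous gap.

    Assumes all timestamps are consistently timezone-aware or naive.
    """
    gaps = []
    prev_ts = None
    saw_shutdown = False
    for line in lines:
        ts = parse_ts(line)
        if ts is None:
            continue  # skip lines without a parseable timestamp
        if prev_ts is not None and ts - prev_ts >= min_gap:
            if not saw_shutdown:
                gaps.append((prev_ts, ts))
            saw_shutdown = False  # a new uptime session starts after the gap
        if any(marker in line for marker in SHUTDOWN_MARKERS):
            saw_shutdown = True
        prev_ts = ts
    return gaps


# Synthetic sample lines, not taken from cloudvirt1016.
sample_unclean = [
    "2022-01-01T10:00:00+00:00 host kernel: synthetic line A",
    "2022-01-01T10:00:30+00:00 host kernel: synthetic line B",
    "2022-01-01T10:20:00+00:00 host kernel: synthetic first line after boot",
]
sample_clean = [
    "2022-01-01T10:00:00+00:00 host kernel: synthetic line A",
    "2022-01-01T10:00:30+00:00 host systemd[1]: Reached target Shutdown",
    "2022-01-01T10:20:00+00:00 host kernel: synthetic first line after boot",
]

unclean_gaps = find_unclean_gaps(sample_unclean)
clean_gaps = find_unclean_gaps(sample_clean)
```

A gap reported here is only a hint: a full disk or a stopped logging daemon would produce the same silence, so it needs cross-checking against the management controller's event log.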

To bring the host back online, we had to start it manually via the mgmt interface.

Related Objects

Status    Subtype  Assigned   Task
Resolved           Andrew
Resolved           Andrew
Resolved  Request  Cmjohnson

Event Timeline

Slight correction: it did not reboot itself. It went down and did not respond to the reboot cookbook. I manually power cycled it with IPMI.

Mentioned in SAL (#wikimedia-cloud) [2022-04-07T12:51:11Z] <wm-bot> Set cloudvirt 'cloudvirt1016.eqiad.wmnet' maintenance. (T305631) - cookbook ran by arturo@nostromo

It went down again today. Running
sudo cookbook sre.hosts.reboot-single cloudvirt1016.eqiad.wmnet
was unsuccessful, though IPMI got it back online.

How do we verify whether this was still in maintenance mode?

Looks like it failed quite suddenly again. This should be taken out of service for a memory test.

Andrew claimed this task.