NodeDown cloudvirt1063
Closed, ResolvedPublic

Description

Common information

Major outage that requires you to either restore the server or manually evacuate the VMs on it.

  • alertname: NodeDown
  • cluster: wmcs
  • instance: cloudvirt1063:9100
  • job: node
  • prometheus: ops
  • severity: page
  • site: eqiad
  • source: prometheus
  • team: wmcs

Firing alerts


Major outage that requires you to either restore the server or manually evacuate the VMs on it.

Event Timeline

Mentioned in SAL (#wikimedia-cloud) [2023-12-14T00:25:02Z] <andrewbogott> evacuating hosts from cloudvirt1063 and depooling. T353406

Mentioned in SAL (#wikimedia-cloud) [2023-12-21T16:51:58Z] <dhinus> puppet node deactivate cloudvirt1063.eqiad.wmnet T353406

fnegri claimed this task.
fnegri subscribed.

I deactivated the node to resolve the alerts until the server is replaced; the replacement is tracked in T353408: Cloudvirt1063.eqiad.wmnet overheating.