
Closed, Resolved · Public


Common information

Major outage: either restore the server or manually evacuate the VMs on it.

  • alertname: NodeDown
  • cluster: wmcs
  • instance: cloudvirt1063:9100
  • job: node
  • prometheus: ops
  • severity: page
  • site: eqiad
  • source: prometheus
  • team: wmcs

Firing alerts

Major outage: either restore the server or manually evacuate the VMs on it.

Event Timeline

Record:      6
Date/Time:   12/02/2023 12:14:27
Source:      system
Severity:    Critical
Description: CPU 2 has a thermal trip (over-temperature) event.
taavi@cloudcontrol1006 ~ $ os server list --all --host cloudvirt1063
| ID                                   | Name                          | Status | Networks                               | Image                                        | Flavor                          |
| fcefcce4-c97d-44cf-a077-310612ed0118 | tools-k8s-worker-96           | ACTIVE | lan-flat-cloudinstances2b= | debian-10.0-buster                           | g3.cores8.ram16.disk20.ephem140 |
| 826760c6-d153-4070-9674-f1207c3ec328 | copypatrolbackenddeploytest01 | ACTIVE | lan-flat-cloudinstances2b= | debian-12.0-bookworm (deprecated 2023-11-27) | g3.cores2.ram4.disk20           |
| fd43cf27-3b19-4035-8047-7529c69b3e53 | project-proxy-acme-chief-02   | ACTIVE | lan-flat-cloudinstances2b= | debian-12.0-bookworm (deprecated 2023-11-27) | g3.cores1.ram2.disk20           |
| 81c76451-4919-4105-9142-410add2e4cfb | canary1063-1                  | ACTIVE | lan-flat-cloudinstances2b= | debian-11.0-bullseye                         | g3.cores1.ram1.disk20           |
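To work out which VMs would need evacuating, the listing above can be reduced to ID/name pairs. This is a minimal sketch, assuming the table is available as plain text on stdin; `list_vms` is a hypothetical helper name and not part of any OpenStack tooling.

```shell
# Hypothetical helper: read a pipe-delimited server table on stdin and
# print "ID NAME" for each data row, skipping the header, borders and prompt.
list_vms() {
  awk -F'|' 'NF >= 3 && $2 !~ /^ *ID *$/ {
      gsub(/^ +| +$/, "", $2)   # trim padding around the ID column
      gsub(/^ +| +$/, "", $3)   # trim padding around the Name column
      print $2, $3
  }'
}
```

For example, saving the output to a file and running `list_vms < servers.txt` (file name is illustrative) gives one line per VM, which is easier to feed into whatever evacuation procedure is in use.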
taavi claimed this task.

This hasn't happened again, so closing.