NodeDown (cloudvirt1063)
Closed, Resolved | Public

Description

Common information

  • alertname: NodeDown
  • cluster: wmcs
  • instance: cloudvirt1063:9100
  • job: node
  • prometheus: ops
  • severity: page
  • site: eqiad
  • source: prometheus
  • team: wmcs

Event Timeline

Andrew renamed this task from NodeDown to NodeDown (cloudvirt1063). Jun 20 2024, 2:16 AM

I was unable to ssh to this host. I forced a reboot via racadm at 02:05 UTC; the host came back up normally and restarted all associated VMs. Affected VMs are:

+--------------------------------------+-------------------------------+--------+
| ID                                   | Name                          | Status |
+--------------------------------------+-------------------------------+--------+
| f0936015-b1ef-4b8c-ac6e-74766ed882ce | tools-checker-5               | ACTIVE |
| 452dc8d3-6ee0-412c-90ba-66f21f9d09c1 | tools-acme-chief-3            | ACTIVE |
| 93eb7e45-7bd5-461b-93b1-e0637e947de0 | owidm-prod-1                  | ACTIVE |
| ae9fa949-e4ce-4ffb-ad9f-4e5a3812d031 | toolsbeta-test-k8s-worker-11  | ACTIVE |
| 77140f83-1a12-43b7-9e47-0e779503a525 | tools-k8s-worker-nfs-24       | ACTIVE |
| 81856648-2bfb-4636-ab19-5aa83b308f13 | tools-k8s-worker-nfs-6        | ACTIVE |
| ae7e8bb8-533a-4b6a-bcb5-020f3b4e0d93 | canary1063-1                  | ACTIVE |
| 47b0da1d-e50a-42c1-8cd9-dc255cc2f1a3 | bastion-restricted-eqiad1-3   | ACTIVE |
| 6616fcbf-a49e-4e03-b735-84d31b535c08 | integration-agent-docker-1055 | ACTIVE |
| 270b6533-dc99-4e5d-a642-c61138b11891 | integration-agent-docker-1046 | ACTIVE |
| 63b82d38-3026-408f-8dcc-0ecbd1a3c870 | wm-bot-pg2-bookworm           | ACTIVE |
| 3df62375-e75d-4068-9c71-13519b6bf927 | wm-bot-bookworm               | ACTIVE |
| 1dc3fcee-cf27-4351-ad60-384478624ea3 | producer                      | ACTIVE |
| 0bd43b56-75ba-411b-980b-b8d8f06837a8 | humaniki-prod                 | ACTIVE |
| 37e23659-4516-4bd8-a9be-4cc55def5560 | gitlab-runner-addshore-1016   | ACTIVE |
| a3e945dc-3548-47fa-8ce3-bf1426ff3b15 | traffic-cpupload              | ACTIVE |
| 6f1171db-9d7d-466f-aaa7-cb14e1a6af41 | traffic-cache-atstext-buster  | ACTIVE |
| 7cb371bb-a53a-4e65-a1cf-f1a8264a9166 | metricsinfra-prometheus-2     | ACTIVE |
| 7da824f3-c9f3-4460-b9b2-2a894268a7ca | kitools                       | ACTIVE |
| e94378df-7a48-4b11-b44a-bf69aaf132bd | enc-1                         | ACTIVE |
| 0258b810-29af-448a-af5e-ed39e19286df | backend                       | ACTIVE |
| 138be95d-93ad-4a85-9245-a0a508711555 | metricsinfra-db-1             | ACTIVE |
| 2d2e8925-b50f-483a-82ee-e6a1c588e5be | wikidata-federated-properties | ACTIVE |
| d0b1d9d5-1aec-4a05-a2d7-6d8522a365dc | recommender                   | ACTIVE |
+--------------------------------------+-------------------------------+--------+

Syslog shows the outage clearly, but offers little explanation:

2024-06-20T01:48:38.956157+00:00 cloudvirt1063 neutron-linuxbridge-agent: 2024-06-20 01:48:38.954 1092266 INFO neutron.plugins.ml2.drivers.agent._common_agent [None req-a975fb7e-2de9-4765-97da-e0f032f22072 - - - - - -] Linux bridge agent Agent loop - iteration:51253 completed
2024-06-20T01:48:38.956400+00:00 cloudvirt1063 neutron-linuxbridge-agent: 2024-06-20 01:48:38.955 1092266 INFO neutron.plugins.ml2.drivers.agent._common_agent [None req-a975fb7e-2de9-4765-97da-e0f032f22072 - - - - - -] Linux bridge agent Agent loop - iteration:51254 started
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@ [long run of NUL bytes (^@) truncated]
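For anyone wanting to locate this kind of gap in a log without paging through it by hand, here is a minimal sketch. It assumes the `^@` characters above are literal NUL bytes in the file, and that a long run of them marks the point where the host stopped writing. The function names and the 16-byte threshold are illustrative choices, not part of any existing tooling.

```python
import re


def find_nul_runs(data: bytes, min_len: int = 16):
    """Return (offset, length) for each run of at least min_len NUL bytes."""
    pattern = re.compile(rb"\x00{%d,}" % min_len)
    return [(m.start(), m.end() - m.start()) for m in pattern.finditer(data)]


def last_line_before(data: bytes, offset: int) -> str:
    """Return the last complete text line that precedes the given byte offset."""
    head = data[:offset].rstrip(b"\n")
    return head.rsplit(b"\n", 1)[-1].decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Synthetic stand-in for a syslog file: two normal lines, a NUL run, then
    # a line written after the reboot.
    sample = b"iteration:51253 completed\niteration:51254 started\n" + b"\x00" * 64 + b"\nboot\n"
    for offset, length in find_nul_runs(sample):
        print(f"NUL run of {length} bytes at offset {offset}; "
              f"last line before it: {last_line_before(sample, offset)!r}")
```

To run this against a real file, read it in binary mode (`open(path, "rb").read()`) so the NUL bytes are not mangled by text decoding.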

Mentioned in SAL (#wikimedia-cloud-feed) [2024-06-20T02:29:13Z] <andrew@cloudcumin1001> START - Cookbook wmcs.openstack.cloudvirt.drain on host 'cloudvirt1063.eqiad.wmnet' (T368007)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-06-20T02:37:42Z] <andrew@cloudcumin1001> END (PASS) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=0) on host 'cloudvirt1063.eqiad.wmnet' (T368007)

This host is now drained and won't get any more VMs scheduled on it. I don't have any theory about why it died -- let's see if it repeats.

-------------------------------------------------------------------------------
Record:      27
Date/Time:   06/20/2024 02:45:43
Source:      system
Severity:    Critical
Description: CPU 2 has a thermal trip (over-temperature) event.
-------------------------------------------------------------------------------
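If more event-log records like the one above need to be scanned on future occurrences, a small parser can pull out the critical ones. This is a rough sketch that assumes records are separated by dashed lines and consist of `Key: value` pairs, exactly as pasted above. `parse_sel_records` and `critical_records` are hypothetical helper names.

```python
def parse_sel_records(text: str):
    """Split dashed-line-delimited event log text into a list of dicts."""
    records, current = [], {}
    for line in text.splitlines():
        if line.startswith("-----"):
            # A dashed line closes the record collected so far, if any.
            if current:
                records.append(current)
                current = {}
            continue
        # Split on the first colon only, so timestamps keep their colons.
        key, sep, value = line.partition(":")
        if sep:
            current[key.strip()] = value.strip()
    if current:
        records.append(current)
    return records


def critical_records(records):
    """Keep only records whose Severity field is exactly 'Critical'."""
    return [r for r in records if r.get("Severity") == "Critical"]


if __name__ == "__main__":
    sample = """\
-------------------------------------------------------------------------------
Record:      27
Date/Time:   06/20/2024 02:45:43
Source:      system
Severity:    Critical
Description: CPU 2 has a thermal trip (over-temperature) event.
-------------------------------------------------------------------------------
"""
    for rec in critical_records(parse_sel_records(sample)):
        print(rec["Date/Time"], rec["Description"])
```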