
2024-09-21 NodeDown cloudvirt1063
Closed, Resolved · Public

Description

Common information

  • alertname: NodeDown
  • cluster: wmcs
  • instance: cloudvirt1063:9100
  • job: node
  • prometheus: ops
  • severity: page
  • site: eqiad
  • source: prometheus
  • team: wmcs

Firing alerts


Event Timeline

fnegri changed the task status from Open to In Progress (Sep 21 2024, 9:56 AM).
fnegri claimed this task.
fnegri triaged this task as High priority.
fnegri added a project: Cloud-VPS.
fnegri subscribed.

The server stopped responding today at 9:00 UTC (according to Grafana).

The same server failed a few months ago: T368093: reapply thermal paste to processors in cloudvirt1063

I'm setting it to maintenance and evacuating its VMs.

fnegri renamed this task from NodeDown to NodeDown cloudvirt1063 (Sep 21 2024, 9:56 AM).

Mentioned in SAL (#wikimedia-cloud) [2024-09-21T09:59:14Z] <dhinus> openstack aggregate add host maintenance cloudvirt1063 (T375223)

Mentioned in SAL (#wikimedia-cloud) [2024-09-21T09:59:46Z] <dhinus> openstack aggregate remove host ceph cloudvirt1063 (T375223)
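Side note: to confirm the aggregate changes logged above took effect, something like the following can be run (a sketch added here for reference, not taken from the session):

root@cloudcontrol1005:~# openstack aggregate show maintenance -c hosts   # should now include cloudvirt1063
root@cloudcontrol1005:~# openstack aggregate show ceph -c hosts          # should no longer include cloudvirt1063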

root@cloudcontrol1005:~# openstack server list --host cloudvirt1063 --all-projects
+--------------------------------------+---------------------------+--------+------------------------------------------------------+----------------------------------------------+---------------------------------+
| ID                                   | Name                      | Status | Networks                                             | Image                                        | Flavor                          |
+--------------------------------------+---------------------------+--------+------------------------------------------------------+----------------------------------------------+---------------------------------+
| f9015696-9010-4a1c-a7cf-9bc8e8098df4 | tools-k8s-worker-nfs-66   | ACTIVE | lan-flat-cloudinstances2b=172.16.5.255               | debian-12.0-bookworm                         | g4.cores8.ram16.disk20.ephem140 |
| 5899163e-bbfb-474e-8a9a-2dae5006de22 | deployment-server-phi     | ACTIVE | lan-flat-cloudinstances2b=172.16.4.159               | debian-11.0-bullseye                         | g4.cores1.ram2.disk20           |
| abaa359b-3468-4b56-be7c-24dc88252c9e | test                      | ACTIVE | lan-flat-cloudinstances2b=172.16.0.41                | debian-12.0-bookworm                         | g4.cores1.ram1.disk20           |
| 3585384e-e85f-41da-a014-448c59d76586 | canary1063-2              | ACTIVE | lan-flat-cloudinstances2b=172.16.6.136               | debian-12.0-bookworm (deprecated 2024-07-03) | g4.cores1.ram1.disk20           |
| 3192a8d0-0fda-4d24-a1b7-2e2dc20fbe0e | deployment-puppetserver-1 | ACTIVE | lan-flat-cloudinstances2b=172.16.1.160               | debian-12.0-bookworm (deprecated 2024-04-10) | g4.cores4.ram8.disk20           |
| 1b0a01e4-e8cb-4a68-ad6c-306b9869db05 | wikiwho01                 | ACTIVE | lan-flat-cloudinstances2b=172.16.6.48                | debian-11.0-bullseye (deprecated 2023-06-08) | g4.cores24.ram122.disk20        |
| 20f22f9d-5734-49b9-8a19-9908e588c92f | deployment-ms-be08        | ACTIVE | lan-flat-cloudinstances2b=172.16.6.12                | debian-11.0-bullseye (deprecated 2023-01-12) | g4.cores8.ram16.disk20          |
| b615abe0-9cc2-4cee-a224-c1f0b90e7957 | maps-test-2               | ACTIVE | lan-flat-cloudinstances2b=172.16.6.207               | trove-guest-bobcat-ubuntu-jammy              | g4.cores1.ram2.disk20           |
| 15700b39-5773-4fec-9349-47f16bc457bd | bastion-eqiad1-03         | ACTIVE | lan-flat-cloudinstances2b=172.16.3.145, 185.15.56.87 | debian-11.0-bullseye (deprecated 2022-05-18) | g4.cores1.ram2.disk20           |
+--------------------------------------+---------------------------+--------+------------------------------------------------------+----------------------------------------------+---------------------------------+
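If only the instance UUIDs are needed (e.g. for scripting the evacuation and restart steps below), the same query can be made machine-readable; a minimal sketch, not part of the session above:

root@cloudcontrol1005:~# openstack server list --host cloudvirt1063 --all-projects -f value -c ID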

Mentioned in SAL (#wikimedia-cloud) [2024-09-21T10:17:53Z] <dhinus> nova host-evacuate cloudvirt1063 (T375223)

root@cloudcontrol1005:~# nova --os-username novaadmin --os-project-name admin --os-auth-url "https://openstack.eqiad1.wikimediacloud.org:25357/v3" --os-password XXXXXX --os-user-domain-id default host-evacuate cloudvirt1063
nova CLI is deprecated and will be removed in a future release

+--------------------------------------+-------------------+---------------+
| Server UUID                          | Evacuate Accepted | Error Message |
+--------------------------------------+-------------------+---------------+
| 3585384e-e85f-41da-a014-448c59d76586 | True              |               |
| abaa359b-3468-4b56-be7c-24dc88252c9e | True              |               |
| 5899163e-bbfb-474e-8a9a-2dae5006de22 | True              |               |
| f9015696-9010-4a1c-a7cf-9bc8e8098df4 | True              |               |
| 15700b39-5773-4fec-9349-47f16bc457bd | True              |               |
| b615abe0-9cc2-4cee-a224-c1f0b90e7957 | True              |               |
| 20f22f9d-5734-49b9-8a19-9908e588c92f | True              |               |
| 1b0a01e4-e8cb-4a68-ad6c-306b9869db05 | True              |               |
| 3192a8d0-0fda-4d24-a1b7-2e2dc20fbe0e | True              |               |
+--------------------------------------+-------------------+---------------+
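To double-check where each VM landed after the evacuation, a per-server query like the following should work (a sketch, not taken from the session; OS-EXT-SRV-ATTR:host is only visible with admin credentials, and newer openstackclient releases may label the field differently):

root@cloudcontrol1005:~# openstack server show f9015696-9010-4a1c-a7cf-9bc8e8098df4 -c status -c OS-EXT-SRV-ATTR:host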

nova host-evacuate cloudvirt1063 moved all the above VMs to other cloudvirts, but they are now in status=SHUTOFF.

I restarted them manually with openstack server start <id> for all the IDs listed above, excluding 3585384e-e85f-41da-a014-448c59d76586, which I deleted instead because it was canary1063-2.
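For the record, the restarts could also have been scripted; a rough sketch, assuming the nine UUIDs from the evacuation table are saved to a (hypothetical) file shutoff-ids.txt, and skipping the deleted canary:

# shutoff-ids.txt: one UUID per line, copied from the host-evacuate output above
grep -v 3585384e-e85f-41da-a014-448c59d76586 shutoff-ids.txt | while read -r id; do
  openstack server start "$id"   # bring each evacuated VM back to ACTIVE
done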

I restarted the server from the mgmt interface, then SSHed in and checked the syslog from the time of the crash. It's not very helpful, but it's similar to the log entry from the previous crash (reported in T368093):

2024-09-21T08:59:52.989834+00:00 cloudvirt1063 neutron-openvswitch-agent: 2024-09-21 08:59:52.989 4181396 INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [None req-79e927a3-e763-4481-9307-07accf9b86e3 - - - - - -] Agent rpc_loop - iteration:129175 completed. Processed ports statistics: {'regular': {'added': 0, 'updated': 0, 'removed': 0}}. Elapsed:0.003
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@ [... long run of NUL bytes (^@) left in the syslog by the crash, truncated ...]

The server went down unexpectedly tonight at ~1am UTC (we got a page):

Sep 24 01:25:22 cloudvirt1063 systemd-logind[1392]: Power key pressed short.
Sep 24 01:25:22 cloudvirt1063 systemd-logind[1392]: Powering off...
Sep 24 01:25:22 cloudvirt1063 systemd-logind[1392]: System is powering down.
Sep 24 01:25:22 cloudvirt1063 systemd[1]: Removed slice system-modprobe.slice - Slice /system/modprobe.
Sep 24 01:25:22 cloudvirt1063 systemd[1]: Stopped target graphical.target - Graphical Interface.
Sep 24 01:25:22 cloudvirt1063 systemd[1]: Stopped target multi-user.target - Multi-User System.
Sep 24 01:25:22 cloudvirt1063 systemd[1]: Stopped target getty.target - Login Prompts.
Sep 24 01:25:22 cloudvirt1063 systemd[1]: Stopped target machines.target - Containers.
Sep 24 01:25:22 cloudvirt1063 systemd[1]: Stopped target timers.target - Timer Units.

There's no error in the logs before that, so it might be a different cause. There are a couple of errors afterwards, when libvirtd and openvswitch are being stopped (though those are probably expected given the shutdown):

Sep 24 01:25:23 cloudvirt1063 neutron-openvswitch-agent[2037]: 2024-09-24 01:25:23.321 2037 INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [-] SIGTERM received, capping RPC timeout by 10 seconds.
Sep 24 01:25:23 cloudvirt1063 systemd[1]: Stopping nova-compute.service - OpenStack Nova Compute (nova-compute)...
Sep 24 01:25:23 cloudvirt1063 neutron-openvswitch-agent[2037]: 2024-09-24 01:25:23.322 2037 ERROR neutron.agent.common.async_process [-] Error received from [ovsdb-client monitor tcp:127.0.0.1:6640 Interface name,ofport,external_ids --format=json]: 2024-09-24T01:25:23Z|00001|fatal_signal|WARN|terminating with signal 15 (Terminated)
Sep 24 01:25:23 cloudvirt1063 systemd[1]: Stopping ovs-record-hostname.service - Open vSwitch Record Hostname...
Sep 24 01:25:23 cloudvirt1063 neutron-openvswitch-agent[2037]: 2024-09-24 01:25:23.323 2037 ERROR neutron.agent.common.async_process [-] Error received from [ovsdb-client monitor tcp:127.0.0.1:6640 Interface name,ofport,external_ids --format=json]: None
...
Sep 24 01:25:24 cloudvirt1063 libvirtd[2035]: End of file while reading data: Input/output error
...
dcaro renamed this task from NodeDown cloudvirt1063 to 2024-09-21 NodeDown cloudvirt1063 (Sep 24 2024, 8:07 AM).

Mentioned in SAL (#wikimedia-operations) [2024-09-24T08:36:11Z] <fnegri@cumin1002> START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on cloudvirt1063.eqiad.wmnet with reason: cloudvirt1063 needs maintenance T375223

Mentioned in SAL (#wikimedia-operations) [2024-09-24T08:36:25Z] <fnegri@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on cloudvirt1063.eqiad.wmnet with reason: cloudvirt1063 needs maintenance T375223

Sep 24 01:25:22 cloudvirt1063 systemd-logind[1392]: Power key pressed short.

Was it maybe dcops folks powering it down to check it?

@dcaro Sorry about the page :( We need to handle alerts better to avoid getting paged about servers we already know are in maintenance. For now I've set a 7-day silence with the downtime cookbook:

fnegri@cumin1002:~$ sudo cookbook sre.hosts.downtime --days 7 -r "cloudvirt1063 needs maintenance T375223" 'cloudvirt1063.eqiad.wmnet'
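If the maintenance takes longer than the initial silence, the same cookbook can simply be re-run with a longer window; a sketch (the 14-day downtimes in the SAL entries below were presumably run like this, though the exact invocation isn't copied here):

fnegri@cumin1002:~$ sudo cookbook sre.hosts.downtime --days 14 -r "cloudvirt1063 needs maintenance T375223" 'cloudvirt1063.eqiad.wmnet'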

Mentioned in SAL (#wikimedia-operations) [2024-10-01T09:18:47Z] <fnegri@cumin1002> START - Cookbook sre.hosts.downtime for 14 days, 0:00:00 on cloudvirt1063.eqiad.wmnet with reason: cloudvirt1063 needs maintenance T375223

Mentioned in SAL (#wikimedia-operations) [2024-10-01T09:19:00Z] <fnegri@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on cloudvirt1063.eqiad.wmnet with reason: cloudvirt1063 needs maintenance T375223

Mentioned in SAL (#wikimedia-operations) [2024-10-14T16:50:48Z] <fnegri@cumin1002> START - Cookbook sre.hosts.downtime for 14 days, 0:00:00 on cloudvirt1063.eqiad.wmnet with reason: cloudvirt1063 needs maintenance T375223

Mentioned in SAL (#wikimedia-operations) [2024-10-14T16:51:28Z] <fnegri@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on cloudvirt1063.eqiad.wmnet with reason: cloudvirt1063 needs maintenance T375223

fnegri changed the task status from In Progress to Stalled (Oct 22 2024, 5:11 PM).

Mentioned in SAL (#wikimedia-operations) [2024-10-28T17:03:44Z] <fnegri@cumin1002> START - Cookbook sre.hosts.downtime for 14 days, 0:00:00 on cloudvirt1063.eqiad.wmnet with reason: cloudvirt1063 needs maintenance T375223

Mentioned in SAL (#wikimedia-operations) [2024-10-28T17:04:50Z] <fnegri@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on cloudvirt1063.eqiad.wmnet with reason: cloudvirt1063 needs maintenance T375223

fnegri changed the task status from Stalled to In Progress.Nov 5 2024, 5:48 PM
fnegri changed the task status from In Progress to Stalled.

We are waiting for Dell to replace the mainboard; see the subtask T375372: hw troubleshooting: server failure for cloudvirt1063.eqiad.wmnet.

fnegri changed the task status from Stalled to In Progress (Nov 7 2024, 11:48 AM).

The mainboard was replaced. I'm going to reimage the host before putting it back into service.

Cookbook cookbooks.sre.hosts.reimage was started by fnegri@cumin1002 for host cloudvirt1063.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by fnegri@cumin1002 for host cloudvirt1063.eqiad.wmnet with OS bookworm completed:

  • cloudvirt1063 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202411071701_fnegri_950326_cloudvirt1063.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status failed -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

The host is reimaged and repooled!
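For completeness, the repooling is essentially the reverse of the aggregate moves from Sep 21; a sketch, assuming nothing else about the host's aggregates changed in the meantime:

root@cloudcontrol1005:~# openstack aggregate remove host maintenance cloudvirt1063
root@cloudcontrol1005:~# openstack aggregate add host ceph cloudvirt1063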

We noticed a kernel message that is not present in other cloudvirts:

root@cloudvirt1063:~# journalctl -k -p err
Nov 07 17:18:37 cloudvirt1063 kernel: x86/cpu: SGX disabled by BIOS.
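A quick way to cross-check this (a sketch added here; whether the kernel also drops the sgx flag from /proc/cpuinfo when the BIOS leaves SGX disabled is an assumption, and cloudvirt10XX stands for any other, unaffected cloudvirt):

root@cloudvirt1063:~# grep -qw sgx /proc/cpuinfo && echo "sgx flag present" || echo "sgx flag not present"
root@cloudvirt10XX:~# journalctl -k -p err | grep -i sgx   # for comparison: should print nothing on an unaffected host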

I've created T379351: kernel message: SGX disabled by BIOS to discuss what to do about the kernel message.

This task is now completed.