
cloudvirt1051 crashed
Closed, Resolved (Public)

Description

Common information

A major outage that requires you to either restore the server or manually evacuate the VMs running on it.

  • alertname: NodeDown
  • cluster: wmcs
  • instance: cloudvirt1051:9100
  • job: node
  • prometheus: ops
  • severity: page
  • site: eqiad
  • source: prometheus
  • team: wmcs


Event Timeline

taavi triaged this task as Unbreak Now! priority.
taavi added a project: Cloud-VPS.

affected VMs:

taavi@cloudcontrol1005 ~ $ os server list --all --host cloudvirt1051
+--------------------------------------+--------------------------+---------+------------------------------------------------------+----------------------------------------------+-------------------------------------+
| ID | Name | Status | Networks | Image | Flavor |
+--------------------------------------+--------------------------+---------+------------------------------------------------------+----------------------------------------------+-------------------------------------+
| 0f6b34aa-cef3-4c06-977d-2dc868dcf16d | canary1051-4 | ACTIVE | lan-flat-cloudinstances2b=172.16.3.4 | debian-11.0-bullseye (deprecated 2023-01-12) | g3.cores1.ram1.disk20 |
| bfad7fbd-53db-4604-aa38-19ffa3e3da02 | harbordb | ACTIVE | lan-flat-cloudinstances2b=172.16.5.95 | trove-master-guest-ubuntu-bionic | g3.cores2.ram4.disk20 |
| e6bd0bc0-c168-48cf-bc25-374ba0c720bb | tools-prometheus-7 | ACTIVE | lan-flat-cloudinstances2b=172.16.1.224 | debian-11.0-bullseye (deprecated 2023-01-12) | g3.cores8.ram36.disk20 |
| a3e945dc-3548-47fa-8ce3-bf1426ff3b15 | traffic-cpupload | ACTIVE | lan-flat-cloudinstances2b=172.16.1.247 | debian-10.0-buster | g3.cores1.ram2.disk20 |
| 7aee032b-e065-4100-b7d8-20dd4db55705 | tools-sgeexec-10-18 | ACTIVE | lan-flat-cloudinstances2b=172.16.4.193 | debian-10.0-buster | g3.cores4.ram8.disk20.swap8.ephem20 |
| 6f1e5679-5b01-4ecd-acc6-27c700b5273c | puppet1 | ACTIVE | lan-flat-cloudinstances2b=172.16.7.118 | debian-10.0-buster | g3.cores1.ram2.disk20 |
| 7da824f3-c9f3-4460-b9b2-2a894268a7ca | kitools | ACTIVE | lan-flat-cloudinstances2b=172.16.0.116 | debian-11.0-bullseye (deprecated 2022-05-18) | g3.cores4.ram8.disk20 |
| 5203247c-7191-4bde-88d4-7dea51a31920 | utrs-database | ACTIVE | lan-flat-cloudinstances2b=172.16.6.23 | debian-11.0-bullseye (deprecated 2022-05-18) | g3.cores1.ram2.disk20 |
| a8c98594-5b5f-46ac-9bb2-438b6d8484c4 | hashtags-prod1 | ACTIVE | lan-flat-cloudinstances2b=172.16.2.235 | debian-11.0-bullseye (deprecated 2022-05-18) | g3.cores2.ram4.disk20 |
| 9f2d85f2-9599-4631-8299-89d4c4722124 | pcc-worker1002 | ACTIVE | lan-flat-cloudinstances2b=172.16.5.16 | debian-10.0-buster | g3.cores4.ram8.disk20 |
| 1c3e4b8a-9076-4c8c-b2e6-51606c0b1fb8 | tools-k8s-haproxy-4 | ACTIVE | lan-flat-cloudinstances2b=172.16.4.136 | debian-10.0-buster (deprecated 2021-07-30) | g3.cores2.ram4.disk20 |
| 0b5efef1-d5d1-402a-882a-7a27b6ced0a3 | mailman03 | ACTIVE | lan-flat-cloudinstances2b=172.16.4.88, 185.15.56.43 | debian-10.0-buster (deprecated 2021-07-30) | g3.cores1.ram2.disk20 |
| 57107284-ba21-4464-abef-caca3fc88001 | tools-mail-03 | ACTIVE | lan-flat-cloudinstances2b=172.16.0.126, 185.15.56.63 | debian-10.0-buster (deprecated 2021-07-30) | g3.cores1.ram2.disk20 |
| 8bb7461b-2cb1-4a23-9405-183955a3fb4e | toolsbeta-sgegrid-shadow | ACTIVE | lan-flat-cloudinstances2b=172.16.6.109 | debian-10.0-buster (deprecated 2021-03-24) | g3.cores1.ram2.disk20 |
| 2fd430a8-9357-4bf6-ae9f-b31a53a57050 | jitsi04 | ACTIVE | lan-flat-cloudinstances2b=172.16.3.251, 185.15.56.72 | debian-10.0-buster (deprecated 2021-02-22) | g2.cores16.ram16.disk80 |
| 4fc636c2-6af8-4ce7-a9a9-6c23a73cbd73 | cloudinfra-acme-chief-01 | ACTIVE | lan-flat-cloudinstances2b=172.16.2.91 | debian-10.0-buster (deprecated 2021-02-22) | g2.cores1.ram2.disk20 |
| 7a45fd55-93d7-47dc-9c06-74e665f0bf2b | tools-k8s-worker-66 | ACTIVE | lan-flat-cloudinstances2b=172.16.1.117 | debian-10.0-buster (deprecated 2020-10-16) | g2.cores8.ram16.disk160 |
| e97b1532-b9d8-4081-a563-3611b002443f | mailman-db | ACTIVE | lan-flat-cloudinstances2b=172.16.1.66 | debian-10.0-buster (deprecated 2020-10-16) | g2.cores4.ram8.disk80 |
| dbcb2067-4219-45ae-a828-7f49ce4be9fd | toolsbeta-acme-chief-01 | ACTIVE | lan-flat-cloudinstances2b=172.16.1.165 | debian-10.0-buster (deprecated 2020-10-16) | g2.cores1.ram2.disk20 |
| 49826598-7ce3-40d0-ac23-a38d9499e51a | meet-auth | ACTIVE | lan-flat-cloudinstances2b=172.16.0.141 | debian-10.0-buster (deprecated 2020-10-16) | g2.cores1.ram2.disk20 |
| 4f193824-85d8-4369-8be3-c8b96abbd71d | tools-k8s-worker-52 | ACTIVE | lan-flat-cloudinstances2b=172.16.1.96 | debian-10.0-buster (deprecated 2020-10-16) | g2.cores4.ram8.disk80 |
| e56906c1-2547-44ea-bd55-17958b560159 | tool | ACTIVE | lan-flat-cloudinstances2b=172.16.1.60 | debian-10.0-buster (deprecated 2020-10-16) | g2.cores8.ram16.disk160 |
| 24a93a3d-cd07-4479-992d-16b1e07d8b56 | pontoon-log-01 | ACTIVE | lan-flat-cloudinstances2b=172.16.0.180 | debian-10.0-buster (deprecated 2020-10-16) | g2.cores1.ram2.disk20 |
| b07a92bd-7ec5-412a-ac23-16bd758150d4 | wsexport-prod01 | ACTIVE | lan-flat-cloudinstances2b=172.16.1.17 | debian-10.0-buster (deprecated 2020-10-16) | g2.cores4.ram8.disk80 |
| 5f96add9-ccc8-4308-955d-ec1ce4b32cc8 | cloud-puppetmaster-03 | ACTIVE | lan-flat-cloudinstances2b=172.16.0.38, 185.15.56.64 | debian-10.0-buster (deprecated 2019-12-15) | g2.cores8.ram16.disk160 |
| 47903a4e-ace1-461a-a342-575c852fb0e0 | rel2 | SHUTOFF | lan-flat-cloudinstances2b=172.16.2.80 | debian-11.0-bullseye (deprecated 2022-05-18) | g2.cores2.ram4.disk40 |
+--------------------------------------+--------------------------+---------+------------------------------------------------------+----------------------------------------------+-------------------------------------+

Mentioned in SAL (#wikimedia-cloud-feed) [2023-10-17T15:28:23Z] <taavi@cloudcumin1001> START - Cookbook wmcs.openstack.cloudvirt.drain (T349109)

Mentioned in SAL (#wikimedia-cloud-feed) [2023-10-17T15:29:03Z] <taavi@cloudcumin1001> END (FAIL) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=99) (T349109)

mysql:root@localhost [nova_eqiad1]> update instances set host = 'cloudvirt1058' where uuid = '1c3e4b8a-9076-4c8c-b2e6-51606c0b1fb8';

taavi@cloudcontrol1006 ~ $ sudo OS_PROJECT_ID=tools wmcs-openstack server reboot tools-k8s-haproxy-4 --hard

mysql:root@localhost [nova_eqiad1]> update instances set host = 'cloudvirt1058' where host = 'cloudvirt1051' and deleted = 0;
Query OK, 25 rows affected (0.007 sec)
Rows matched: 25  Changed: 25  Warnings: 0

mysql:root@localhost [neutron]> update ml2_port_bindings set host = 'cloudvirt1058' where host = 'cloudvirt1051';
Query OK, 24 rows affected (0.004 sec)
Rows matched: 24  Changed: 24  Warnings: 0

mysql:root@localhost [neutron]> update ml2_port_binding_levels set host = 'cloudvirt1058' where host = 'cloudvirt1051';
Query OK, 24 rows affected (0.003 sec)
Rows matched: 24  Changed: 24  Warnings: 0

select group_concat(CONCAT("sudo OS_PROJECT_ID=", project_id, " wmcs-openstack server reboot ", hostname, " --hard") SEPARATOR "\n") from instances where host = 'cloudvirt1058' and deleted = 0;
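
The `group_concat` query above just expands each surviving row into one `wmcs-openstack server reboot … --hard` line. The same expansion can be sketched in plain shell; the helper name and the sample project/instance pair below are illustrative, not taken from the query output:

```shell
# reboot_cmd PROJECT INSTANCE: print the hard-reboot command for one
# evacuated VM, mirroring the group_concat query above.
# The function name is hypothetical; the command shape matches the transcript.
reboot_cmd() {
  printf 'sudo OS_PROJECT_ID=%s wmcs-openstack server reboot %s --hard\n' "$1" "$2"
}

# Example: regenerate the command run earlier for tools-k8s-haproxy-4.
reboot_cmd tools tools-k8s-haproxy-4
```

Printing the commands first (instead of piping them straight into a shell) leaves room to review the list before hard-rebooting anything.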
mysql:root@localhost [cinder]> select nova_eqiad1.instances.uuid as instance_uuid,
    ->        volume_attachment.volume_id, volumes.status,
    ->        volume_attachment.attach_status, volume_attachment.mountpoint,
    ->        volumes.display_name from volume_attachment
    ->        inner join nova_eqiad1.instances on volume_attachment.instance_uuid=nova_eqiad1.instances.uuid
    ->        inner join volumes on volumes.id = volume_attachment.volume_id
    ->        where nova_eqiad1.instances.host = 'cloudvirt1058' and volume_attachment.attach_status = 'attached';
+--------------------------------------+--------------------------------------+--------+---------------+------------+--------------------------------------------+
| instance_uuid                        | volume_id                            | status | attach_status | mountpoint | display_name                               |
+--------------------------------------+--------------------------------------+--------+---------------+------------+--------------------------------------------+
| bfad7fbd-53db-4604-aa38-19ffa3e3da02 | 3db99b7f-7a70-4671-af6a-a56fb4810a39 | in-use | attached      | /dev/sdb   | trove-7ac3c20a-cdde-4fec-b87f-3f3a2fd4ff5e |
| b07a92bd-7ec5-412a-ac23-16bd758150d4 | 554b519b-ca92-4fe2-a70c-290137a72a2a | in-use | attached      | /dev/vdb   | ws-export                                  |
| a8c98594-5b5f-46ac-9bb2-438b6d8484c4 | e524ac1c-65a9-443d-aea9-e5cb00900bdf | in-use | attached      | /dev/sdb   | backups                                    |
| e6bd0bc0-c168-48cf-bc25-374ba0c720bb | e58db70d-7c45-441b-bd89-5104c8ea02d8 | in-use | attached      | /dev/sdb   | prometheus-b                               |
+--------------------------------------+--------------------------------------+--------+---------------+------------+--------------------------------------------+
4 rows in set (0.016 sec)

cloudvirt1051 moved from the ceph aggregate to maintenance; cloudvirt1058 moved from spare to ceph.
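
The aggregate shuffle above maps onto the standard `openstack aggregate remove host` / `add host` subcommands. A minimal sketch, which prints the commands for review rather than running them (the helper name is made up; the aggregate names come from the comment above):

```shell
# move_host HOST FROM_AGGR TO_AGGR: print the two commands that move HOST
# between host aggregates. Dry run only: review the output, then paste it
# on a cloudcontrol node. The helper name is hypothetical.
move_host() {
  printf 'sudo wmcs-openstack aggregate remove host %s %s\n' "$2" "$1"
  printf 'sudo wmcs-openstack aggregate add host %s %s\n' "$3" "$1"
}

move_host cloudvirt1051 ceph maintenance  # crashed host out of scheduling
move_host cloudvirt1058 spare ceph        # replacement into the ceph aggregate
```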

taavi lowered the priority of this task from Unbreak Now! to High. Oct 17 2023, 4:02 PM

From the web console logs, it just says it was turned off:

2023-10-17 15:41:50  PSU0800  Power Supply 1: Status = 0x1, IOUT = 0x0, VOUT = 0x0, TEMP = 0x0, FAN = 0x0, INPUT = 0x0.
--------
2023-10-17 16:14:23  SYS1003  System CPU Resetting.

Log Sequence Number:
227
Detailed Description:
System is performing a CPU reset because of system power off, power on or a warm reset like CTRL-ALT-DEL.
Recommended Action:
No response action is required.
--------
2023-10-17 16:14:23  SYS1001  System is turning off.

Log Sequence Number:
226
Detailed Description:
System is turning off.
Recommended Action:
No response action is required.
--------
2023-06-29 13:43:25  SYS336  An existing hash value is updated because some system configuration items are changed.

The health report says everything is ok :/

taavi renamed this task from NodeDown to cloudvirt1051 crashed. Oct 18 2023, 7:53 AM