
cloudvirt1053: InterfaceSpeedError: brq7425e328-56
Closed, Resolved, Public

Description

We had this alert firing today:

FIRING: InterfaceSpeedError: brq7425e328-56 on cloudvirt1053:9100 has the wrong speed: 1.25e+06

Upon checking:

aborrero@cloudvirt1053:~ $ sudo ethtool brq7425e328-56
Settings for brq7425e328-56:
	Supported ports: [  ]
	Supported link modes:   Not reported
	Supported pause frame use: No
	Supports auto-negotiation: No
	Supported FEC modes: Not reported
	Advertised link modes:  Not reported
	Advertised pause frame use: No
	Advertised auto-negotiation: No
	Advertised FEC modes: Not reported
	Speed: 10Mb/s
	Duplex: Unknown! (255)
	Auto-negotiation: off
	Port: Other
	PHYAD: 0
	Transceiver: internal
	Link detected: yes

This is the bridge device used by virtual machines. I wonder whether the VMs attached to it are having their network speed limited.
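As a quick sanity check, the alert value and the ethtool output appear to describe the same speed. The sketch below assumes the alert value is in bytes per second (as with node_exporter's `node_network_speed_bytes` metric, which is an assumption here since the alert rule is not shown in this task); the helper name is hypothetical.

```python
# Assumption: the alert value (1.25e+06) is a link speed in bytes per second.
# The function name below is illustrative, not part of any existing tooling.

def bytes_per_sec_to_mbit(speed_bytes: float) -> float:
    """Convert a link speed in bytes/s to megabits/s (1 Mb = 10**6 bits)."""
    return speed_bytes * 8 / 1_000_000


alert_value = 1.25e+06  # value taken from the alert text
print(bytes_per_sec_to_mbit(alert_value))  # prints 10.0
```

Under that assumption, 1.25e+06 bytes/s is 10 Mb/s, which matches the `Speed: 10Mb/s` line reported by ethtool.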

Potentially affected VMs:

aborrero@cloudcontrol1005:~ 3s $ sudo wmcs-openstack server list --host cloudvirt1053 --all-projects -c ID -c Name -c Image
+--------------------------------------+----------------------+-----------------------------------+
| ID                                   | Name                 | Image                             |
+--------------------------------------+----------------------+-----------------------------------+
| 00a7c0e5-caaa-4dbc-bd8a-a3ebf13c312c | canary1053-1         | debian-12.0-bookworm              |
| c738d3bb-8bd4-4807-8651-78698923385e | thistle              | trove-antelope-guest-ubuntu-focal |
| 337b88c3-9009-402d-a7d5-182b1b027fbb | copypatrol-dev-db-01 | trove-antelope-guest-ubuntu-focal |
| 0126dc3f-02c9-4b07-8550-de103724fdc3 | ifis                 | trove-antelope-guest-ubuntu-focal |
| 2055f363-fc49-4c8c-b2d2-9908756faf5d | dumps-db2            | trove-master-guest-ubuntu-focal   |
| 8668ceb6-aa00-4745-bef0-19d2511a13cf | terraform            | trove-master-guest-ubuntu-focal   |
| bfad7fbd-53db-4604-aa38-19ffa3e3da02 | harbordb             | trove-master-guest-ubuntu-bionic  |
| 3c53fc19-f19f-4c6a-8b23-ab8edb149994 | libup-db02           | trove-master-guest-ubuntu-bionic  |
| 4b932658-2b4d-4157-b62e-1e73faaed8aa | quarry-db-02         | trove-antelope-guest-ubuntu-focal |
| 871ab13f-51df-4bc8-917f-0828ac98b3c1 | osmit-pgsql          | trove-master-guest-ubuntu-bionic  |
+--------------------------------------+----------------------+-----------------------------------+

Event Timeline

Mentioned in SAL (#wikimedia-cloud-feed) [2024-06-21T08:28:23Z] <aborrero@cloudcumin1001> START - Cookbook wmcs.openstack.cloudvirt.drain on host 'cloudvirt1053.eqiad.wmnet' (T368129)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-06-21T08:31:44Z] <aborrero@cloudcumin1001> END (PASS) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=0) on host 'cloudvirt1053.eqiad.wmnet' (T368129)

Host rebooted by aborrero@cumin1002 with reason: network interface speed

I assume it was some kind of misconfiguration. The server is now up and running after the reimage.

aborrero claimed this task.

In the reimage cookbook I pasted the wrong ticket ID, see:

Server has been pooled into the ceph and network-ovs aggregates.

cc @Andrew @taavi