
openstack galera no recent writes 2025-03-04, suspected network hardware problem
Closed, Resolved (Public)

Description

We got this alert today:

image.png (444×523 px, 62 KB)

Checking with:

aborrero@cloudcontrol1005:~ $ sudo mysql -u root
MariaDB [(none)]> SHOW STATUS LIKE "wsrep_last_committed";
+----------------------+------------+
| Variable_name        | Value      |
+----------------------+------------+
| wsrep_last_committed | 1070411385 |
+----------------------+------------+
1 row in set (0.001 sec)

Seems only cloudcontrol1005 is a bit behind:

cloudcontrol1005 | wsrep_last_committed | 1070411385 |
cloudcontrol1006 | wsrep_last_committed | 1070411666 |
cloudcontrol1007 | wsrep_last_committed | 1070411666 |
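
For reference, the comparison above can be sketched as a small script. This is only an illustration, not the tooling used during the incident: the function name `find_lagging_nodes` and the `host | variable | value |` row format are assumptions modelled on the pasted output. It reports how far each node's `wsrep_last_committed` trails the highest value seen.

```python
def find_lagging_nodes(rows):
    """Return {host: lag} for nodes whose wsrep_last_committed trails the maximum.

    Hypothetical helper: expects rows shaped like
    "host | wsrep_last_committed | 12345 |", as pasted in this task.
    """
    seqnos = {}
    for row in rows:
        # Split on the pipe separators and drop empty trailing fields.
        fields = [f.strip() for f in row.split("|") if f.strip()]
        if len(fields) != 3 or fields[1] != "wsrep_last_committed":
            continue  # skip anything that is not a status row
        seqnos[fields[0]] = int(fields[2])
    if not seqnos:
        return {}
    newest = max(seqnos.values())
    # Keep only the nodes that are behind the most advanced one.
    return {host: newest - seq for host, seq in seqnos.items() if seq < newest}


rows = [
    "cloudcontrol1005 | wsrep_last_committed | 1070411385 |",
    "cloudcontrol1006 | wsrep_last_committed | 1070411666 |",
    "cloudcontrol1007 | wsrep_last_committed | 1070411666 |",
]
print(find_lagging_nodes(rows))  # {'cloudcontrol1005': 281}
```

Note that on a busy cluster, values sampled at slightly different moments can differ by a small amount, so a gap on its own is only a hint and should be read together with the alert.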

Event Timeline

Mentioned in SAL (#wikimedia-cloud) [2025-03-04T08:52:04Z] <arturo> T387828 depooled galera on cloudcontrol1005

Host rebooted by aborrero@cumin1002 with reason: galera problem

aborrero changed the task status from Open to In Progress. Mar 4 2025, 8:59 AM
aborrero triaged this task as High priority.
aborrero moved this task from Backlog to Doing on the User-aborrero board.

server won't shut down for reboot, so we are force-rebooting it

when force-rebooted, the server did not have the correct network configuration

the switch port the server is connected to is somehow down:

image.png (167×743 px, 16 KB)

Not sure whether the problem is in the switch port, the cable, or something similar.

aborrero renamed this task from openstack galera no recent writes 2025-03-04 to openstack galera no recent writes 2025-03-04, suspected network hardware problem. Mar 4 2025, 9:48 AM
aborrero added subscribers: VRiley-WMF, Jclark-ctr.

Hey @VRiley-WMF and/or @Jclark-ctr, could you please check on-site whether there is a loose cable in either the switch port or the server NIC for cloudcontrol1005?

Server: https://netbox.wikimedia.org/dcim/devices/2613/
Switch port: https://netbox.wikimedia.org/dcim/interfaces/30016/

thanks!

aborrero changed the task status from In Progress to Open. Mar 4 2025, 11:20 AM

Change #1124421 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[cloud/wmcs-cookbooks@main] inventory: place cloudcontrol1005 last on the inventory

https://gerrit.wikimedia.org/r/1124421

Change #1124421 merged by Arturo Borrero Gonzalez:

[cloud/wmcs-cookbooks@main] inventory: place cloudcontrol1005 last on the inventory

https://gerrit.wikimedia.org/r/1124421

Looks like the SFP failed. Swapped it out and it looks like it's communicating again. Please check?

Confirmed that this unit came back online