Possible hardware issues on wikikube-worker2332.codfw.wmnet
Closed, Resolved (Public)

Description

Unusually high rate of mesh-related MediaWiki errors on this node since ~15:50 UTC. Also reported in T419746: JobQueueError rates elevated since roughly 15:45 UTC 2026-03-11.

Transient packet loss:

15:46:30 <+icinga-wm> PROBLEM - Host wikikube-worker2332 is DOWN: PING CRITICAL - Packet loss = 100%
15:47:58 <+icinga-wm> RECOVERY - Host wikikube-worker2332 is UP: PING OK - Packet loss = 0%, RTA = 30.38 ms

BGP session flap:

$ sudo calicoctl node status
Calico process is running.

IPv4 BGP status
+--------------+---------------+-------+----------+-------------+
| PEER ADDRESS |   PEER TYPE   | STATE |  SINCE   |    INFO     |
+--------------+---------------+-------+----------+-------------+
| 10.192.56.1  | node specific | up    | 15:47:19 | Established |
+--------------+---------------+-------+----------+-------------+

IPv6 BGP status
+-------------------+---------------+-------+----------+-------------+
|   PEER ADDRESS    |   PEER TYPE   | STATE |  SINCE   |    INFO     |
+-------------------+---------------+-------+----------+-------------+
| 2620:0:860:12b::1 | node specific | up    | 15:47:19 | Established |
+-------------------+---------------+-------+----------+-------------+

Event Timeline

Cookbook cookbooks.sre.k8s.pool-depool-node started by swfrench@cumin2002 depool for host wikikube-worker2332.codfw.wmnet completed:

  • wikikube-worker2332.codfw.wmnet (PASS)
    • Host wikikube-worker2332.codfw.wmnet depooled from wikikube-codfw

Node is depooled.

In any case, per the system logs, the host restarted at 15:47:08.

SEL events:

$ sudo ipmi-sel
ID   | Date        | Time     | Name                                             | Type              | Event
1    | Jan-15-2026 | 18:21:09 | Sensor #255                                      | OEM Reserved      | Event Offset = 00h
2    | Jan-15-2026 | 18:22:37 | Sensor #255                                      | OEM Reserved      | Event Offset = 03h ; OEM Event Data2 code = 01h
3    | Jan-16-2026 | 15:46:57 | System Chassis Chassis Intru                     | Physical Security | General Chassis Intrusion
4    | Jan-16-2026 | 15:47:06 | Power Supply 88 PS1 Status                       | Power Supply      | Presence detected
5    | Jan-16-2026 | 15:47:06 | Power Supply 87 PS2 Status                       | Power Supply      | Presence detected
6    | Mar-09-2026 | 16:35:29 | Power Supply 88 PS1 Status                       | Power Supply      | Power Supply Failure detected
7    | Mar-09-2026 | 16:35:29 | Power Supply 88 PS1 Status                       | Power Supply      | Power Supply input lost (AC/DC)
8    | Mar-09-2026 | 16:37:31 | Sensor #255                                      | OEM Reserved      | Event Offset = 04h ; OEM Event Data2 code = 01h
9    | Mar-09-2026 | 16:37:58 | Sensor #255                                      | OEM Reserved      | Event Offset = 03h ; OEM Event Data2 code = 01h
10   | Mar-11-2026 | 15:45:42 | Sensor #255                                      | OEM Reserved      | Event Offset = 03h ; OEM Event Data2 code = 01h
11   | Mar-11-2026 | 15:45:48 | Sensor #255                                      | OEM Reserved      | Event Offset = 00h
12   | Mar-11-2026 | 15:46:42 | System Chassis Chassis Intru                     | Physical Security | General Chassis Intrusion
13   | Mar-11-2026 | 15:46:44 | Power Supply 87 PS2 Status                       | Power Supply      | Presence detected
14   | Mar-11-2026 | 15:46:44 | Power Supply 88 PS1 Status                       | Power Supply      | Presence detected

A PS1 failure / input loss on the 9th, followed by a chassis intrusion event today, immediately preceding the restart? (Not sure whether the intrusion sensor can report false positives.)
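The correlation between SEL events and the restart time can be checked mechanically. The following is a minimal sketch, assuming the pipe-delimited `ipmi-sel` layout shown above and assuming the SEL clock and the system log clock agree (both taken as UTC here). The helper names `parse_sel` and `events_near` and the 5-minute window are illustrative choices, not part of any existing tooling.

```python
from datetime import datetime, timedelta


def parse_sel(text):
    """Parse pipe-delimited `ipmi-sel` output into a list of event dicts."""
    events = []
    for line in text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        # Skip the header row and anything that is not a 6-column event row.
        if len(fields) != 6 or not fields[0].isdigit():
            continue
        when = datetime.strptime(f"{fields[1]} {fields[2]}", "%b-%d-%Y %H:%M:%S")
        events.append(
            {"id": int(fields[0]), "time": when, "name": fields[3],
             "type": fields[4], "event": fields[5]}
        )
    return events


def events_near(events, moment, window=timedelta(minutes=5)):
    """Return events whose timestamp is within `window` of `moment`."""
    return [e for e in events if abs(e["time"] - moment) <= window]


# A subset of the SEL rows pasted above (column padding trimmed).
SAMPLE = """\
ID | Date | Time | Name | Type | Event
6 | Mar-09-2026 | 16:35:29 | Power Supply 88 PS1 Status | Power Supply | Power Supply Failure detected
7 | Mar-09-2026 | 16:35:29 | Power Supply 88 PS1 Status | Power Supply | Power Supply input lost (AC/DC)
10 | Mar-11-2026 | 15:45:42 | Sensor #255 | OEM Reserved | Event Offset = 03h ; OEM Event Data2 code = 01h
12 | Mar-11-2026 | 15:46:42 | System Chassis Chassis Intru | Physical Security | General Chassis Intrusion
13 | Mar-11-2026 | 15:46:44 | Power Supply 87 PS2 Status | Power Supply | Presence detected
"""

# Restart time from the system logs, assumed to be on 2026-03-11.
RESTART = datetime(2026, 3, 11, 15, 47, 8)
near_restart = events_near(parse_sel(SAMPLE), RESTART)
```

With this sample, `near_restart` holds only the Mar-11 rows, which matches the observation that the intrusion and PSU presence events immediately precede the restart.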

Also, I'm not seeing evidence of persistent errors / drops on ens1f0np0, so I suppose the networking-related "symptoms" may not be physical networking at all.
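One way to confirm the absence of persistent errors or drops is to sample the kernel's per-interface counters twice and compare. This is a rough sketch, assuming the standard Linux sysfs layout under `/sys/class/net/<iface>/statistics/`; the function names and the choice of these four counters are illustrative and not necessarily what was checked here.

```python
from pathlib import Path

# Standard Linux per-interface counters exposed via sysfs.
COUNTERS = ("rx_errors", "rx_dropped", "tx_errors", "tx_dropped")


def read_nic_counters(iface, sysfs_root="/sys/class/net"):
    """Read error/drop counters for `iface` from sysfs into a dict."""
    stats_dir = Path(sysfs_root) / iface / "statistics"
    return {name: int((stats_dir / name).read_text().strip()) for name in COUNTERS}


def counter_deltas(before, after):
    """Return how much each counter grew between two samples."""
    return {name: after[name] - before[name] for name in before}
```

For example, calling `read_nic_counters("ens1f0np0")` twice a few minutes apart and passing both results to `counter_deltas` should yield all zeros if nothing is being dropped at the interface level.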

Scott_French renamed this task from Possible networking issues on wikikube-worker2332.codfw.wmnet to Possible hardware issues on wikikube-worker2332.codfw.wmnet. Mar 11 2026, 7:09 PM
JMeybohm subscribed.

Hey ops-codfw, can you think of any work that could explain the chassis intrusion/power supply messages?

there was a loose power cable earlier this week. it might have powered off during that time. power cables have been secured since. T419462

the chassis intrusion was probably from install. it was back in january when we got these. that one just pops up sometimes when we turn them on.

@Jhancock.wm - Ah, thanks for highlighting that! So, it sounds like the host may have lost power briefly when the power cables were reseated in T419462#11698379? (host restarted at 15:47)

yes, that's the most likely cause.

Great, thanks @Jhancock.wm. Also good to know that the intrusion event seems to be something about the chassis of this particular machine ... I'm wondering if that event will be reported whenever the machine power cycles, if there's something persistently ajar.

@JMeybohm - It looks like this host booted quickly enough that pods weren't migrated away (spot checking some of the pods that emitted mesh-related errors). Do you think it's possible that they were started "too early" (for some definition thereof) and something racy happened with their networking? Basically, I'm thinking there's nothing wrong with the host per se, and we can probably repool it.

In theory they can start before calico has fully started and established the BGP sessions, yes. I would lean towards repooling and revisiting this if it behaves strangely again.

Cookbook cookbooks.sre.k8s.pool-depool-node started by swfrench@cumin2002 pool for host wikikube-worker2332.codfw.wmnet completed:

  • wikikube-worker2332.codfw.wmnet (PASS)
    • Host wikikube-worker2332.codfw.wmnet pooled in wikikube-codfw

Thanks, @JMeybohm. I wonder if there's some race where a pod can come up before there are learned routes (since no BGP session established) and that condition is "sticky" (i.e., persists after session establishment). In any case, I've gone ahead and repooled the node, and will keep an eye on errors over the next hour or so.
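To make the suspected race concrete, here is a small sketch that parses the `calicoctl node status` tables pasted earlier and flags a pod whose start time precedes BGP session establishment. It assumes the table layout shown above and that the SINCE column and the pod start time fall on the same day; the pod start times used in the usage note are hypothetical, and `parse_bgp_peers` / `started_before_bgp` are illustrative names.

```python
from datetime import time


def parse_bgp_peers(text):
    """Extract peer rows from `calicoctl node status` table output."""
    peers = []
    for line in text.splitlines():
        if not line.startswith("|"):
            continue  # skip borders, headings, and blank lines
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cells) != 5 or cells[0] == "PEER ADDRESS":
            continue  # skip the column header row
        peers.append(
            {"peer": cells[0], "state": cells[2],
             "since": time.fromisoformat(cells[3]), "info": cells[4]}
        )
    return peers


def started_before_bgp(pod_start, peers):
    """True if any session is not Established, or came up after the pod started."""
    return any(
        p["info"] != "Established" or pod_start < p["since"] for p in peers
    )


# Peer rows copied from the output in the task description.
SAMPLE = """\
IPv4 BGP status
+--------------+---------------+-------+----------+-------------+
| PEER ADDRESS |   PEER TYPE   | STATE |  SINCE   |    INFO     |
+--------------+---------------+-------+----------+-------------+
| 10.192.56.1  | node specific | up    | 15:47:19 | Established |
+--------------+---------------+-------+----------+-------------+

IPv6 BGP status
| 2620:0:860:12b::1 | node specific | up    | 15:47:19 | Established |
"""

peers = parse_bgp_peers(SAMPLE)
```

A pod that started at a hypothetical 15:47:12 (after the 15:47:08 restart but before the sessions came up at 15:47:19) would be flagged by `started_before_bgp(time(15, 47, 12), peers)`, while one started at 15:48:00 would not. This only identifies candidates for the race; it does not show that the condition is "sticky".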

We're now > 90m after the host was repooled, and error rates from mediawiki pods scheduled thereon remain normal.

I think I'm ready to attribute this to some kind of race between pod startup and calico networking. We could try to reproduce it by power-cycling, but I'm not sure it's worth the effort.