sessionstore1005 doesn't boot
Closed, Resolved (Public)

Description

I attempted to reimage sessionstore1005.eqiad.wmnet, and it doesn't seem to have ever come back up after the reboot.

There is a critical error logged on the DRAC dashboard: The System Configuration Check operation resulted in the following issue: Comm Error: Backplane 0.

NOTE: While it is down, we have no redundancy for sessionstore in the eqiad DC (sessionstore is critical for logged-in users), so this is urgent, I'm afraid.

Event Timeline

Eevans triaged this task as Unbreak Now! priority. Jun 30 2025, 4:55 PM
RobH added subscribers: Jhancock.wm, Jclark-ctr, RobH.

@Jclark-ctr & @Jhancock.wm: Please note this was pinged in IRC as well, if either of you are on-site today/next, please address this issue.

It's a backplane communication error. Someone on-site needs to reseat the cables to the backplane; there's a 90% chance that fixes it.

Should this be @VRiley-WMF instead?

VRiley-WMF changed the task status from Open to In Progress. Jul 1 2025, 7:05 PM

@Eevans Taking a look at this now

Pulling sessionstore1005 down now to reseat cables.

Powering up sessionstore1005 now. It looks like a BIOS update was pushed to it at some point, and it's installing that update now.

@Eevans It looks like the unit is back up. Can you please confirm?

I don't know what to say about the BIOS update (that's weird); that wasn't me. Startup was also interrupted by a "Lifecycle Controller Setup Wizard" that I've never seen before (it was graphical, and required using the virtual console to navigate). But I'm past that now and into the Debian installer, so I think we're OK?

Was it just a re-seating of cables?

So, yes. I did reseat the cables, and that seems to have helped it get to this point.

VRiley-WMF changed the task status from In Progress to Open. Jul 1 2025, 8:04 PM

@Eevans is it okay to close this now?

Yes, let's. Thanks for your help!

@VRiley-WMF Was a Dell ticket opened for this server? We have two other servers experiencing the same issue, and it has now reoccurred: T383051, T397851, T397829.
@Eevans

@wiki_willy tagging you also for visibility. @Jhancock.wm @VRiley-WMF we should be opening tickets with Dell for this error. Here is a quick list of servers on which we have seen the same issue:
T398225 sessionstore1005 Jun 30 2025 Dell PowerEdge R450
T397851 wikikube-worker1243 Jun 25 2025 Dell PowerEdge R450 - ConfigC
T397829 wikikube-worker1069 Jun 25 2025 Dell PowerEdge R450 - ConfigC
T396852 db2212 Jun 13 2025 Dell PowerEdge R650xs - ConfigE
T391639 cirrussearch2091 Apr 10 2025 Dell PowerEdge R450 - ConfigD
T383051 wikikube-worker1243 Jan 6 2025 Dell PowerEdge R450 - ConfigC
T356919 cloudelastic1008 Feb 7 2024 Dell PowerEdge R450
T383339 wikikube-worker2192 Jan 9 2025 Dell PowerEdge R450 - ConfigC
T381878 wikikube-worker1081 Dec 10 2024 Dell PowerEdge R450 - ConfigC
T381789 wikikube-worker1073 Dec 9 2024 Dell PowerEdge R450 - ConfigC
T381770 wikikube-worker1069 Dec 9 2024 Dell PowerEdge R450 - ConfigC
T381676 wikikube-worker1057 Dec 6 2024 Dell PowerEdge R450 - ConfigC
T382420 wikikube-worker2190 Dec 18 2024 Dell PowerEdge R450 - ConfigC
T374258 wikikube-worker2095 Sep 6 2024 Dell PowerEdge R450 - ConfigC
T374019 wikikube-worker2087 Sep 4 2024 Dell PowerEdge R450 - ConfigC
T355830 elastic2094 Jan 24 2024 Dell PowerEdge R450
T374054 pay-lb2002 Sep 4 2024 Dell PowerEdge R450
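
In case it helps when writing up the Dell ticket, here is a rough Python sketch that tallies the list above by chassis model and picks out hosts that appear more than once. The task IDs, hostnames and models are copied from the list; the function names (`tasks_by_host`, `repeat_hosts`, `model_counts`) are made up for illustration, and dates and Config suffixes are left out to keep it short.

```python
from collections import Counter, defaultdict

# (task, host, model) copied from the list above.
# Dates and the "ConfigX" suffixes are omitted for brevity.
INCIDENTS = [
    ("T398225", "sessionstore1005", "R450"),
    ("T397851", "wikikube-worker1243", "R450"),
    ("T397829", "wikikube-worker1069", "R450"),
    ("T396852", "db2212", "R650xs"),
    ("T391639", "cirrussearch2091", "R450"),
    ("T383051", "wikikube-worker1243", "R450"),
    ("T356919", "cloudelastic1008", "R450"),
    ("T383339", "wikikube-worker2192", "R450"),
    ("T381878", "wikikube-worker1081", "R450"),
    ("T381789", "wikikube-worker1073", "R450"),
    ("T381770", "wikikube-worker1069", "R450"),
    ("T381676", "wikikube-worker1057", "R450"),
    ("T382420", "wikikube-worker2190", "R450"),
    ("T374258", "wikikube-worker2095", "R450"),
    ("T374019", "wikikube-worker2087", "R450"),
    ("T355830", "elastic2094", "R450"),
    ("T374054", "pay-lb2002", "R450"),
]


def tasks_by_host(incidents):
    """Map each hostname to the list of task IDs filed against it."""
    grouped = defaultdict(list)
    for task, host, _model in incidents:
        grouped[host].append(task)
    return dict(grouped)


def repeat_hosts(incidents):
    """Return only the hosts that have more than one task in the list."""
    return {h: t for h, t in tasks_by_host(incidents).items() if len(t) > 1}


def model_counts(incidents):
    """Count incidents per chassis model."""
    return dict(Counter(model for _task, _host, model in incidents))


if __name__ == "__main__":
    print("Incidents per model:", model_counts(INCIDENTS))
    for host, tasks in repeat_hosts(INCIDENTS).items():
        print(f"{host} has recurred: {', '.join(tasks)}")
```

By this count, 16 of the 17 tasks are on R450s, and wikikube-worker1243 and wikikube-worker1069 each show up twice.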