
kubernetes2010 down
Closed, Resolved (Public)

Description

kubernetes2010 has been down, or at least flaky, since 2023-09-24 ~10:40Z

I can't reach it via SSH, ipam or mgmt.
The switch port seems up:

jayme@asw-b-codfw> show interfaces descriptions | match ge-6/0/21     
ge-6/0/21       up    up   kubernetes2010

But the switch does not receive anything from the host:

jayme@asw-b-codfw> show ethernet-switching table interface ge-6/0/21 

MAC database for interface ge-6/0/21

MAC database for interface ge-6/0/21.0
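
The check above comes down to "the port is up, but no MAC address has been learned on it". A minimal sketch of that test, assuming the switch output has been pasted into a string; the helper name `learned_macs` and the sample MAC in the usage note are hypothetical and not part of any tooling mentioned in this task:

```python
import re

# Matches a colon-separated MAC address such as 00:11:22:aa:bb:cc.
MAC_RE = re.compile(r"\b(?:[0-9a-fA-F]{2}:){5}[0-9a-fA-F]{2}\b")


def learned_macs(table_output: str) -> list[str]:
    """Return every MAC address found in pasted switch output (hypothetical helper)."""
    return MAC_RE.findall(table_output)


# Output copied from the task description: headers only, no entries.
EMPTY_TABLE = """\
MAC database for interface ge-6/0/21

MAC database for interface ge-6/0/21.0
"""

if __name__ == "__main__":
    macs = learned_macs(EMPTY_TABLE)
    if macs:
        print(f"learned: {macs}")
    else:
        print("no MACs learned on this port")
```

An empty result for a port that reports up/up suggests the link is electrically alive while the host itself sends no frames, which matches a machine that never gets past power-on.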


Event Timeline

Mentioned in SAL (#wikimedia-operations) [2023-09-25T08:27:44Z] <jayme> draining kubernetes2010.codfw.wmnet - T347267

JMeybohm updated the task description.

Icinga downtime and Alertmanager silence (ID=10429df6-7d65-43c3-8f79-49fdea88c7ca) set by jayme@cumin2002 for 1 day, 0:00:00 on 1 host(s) and their services with reason: host is down

kubernetes2010.codfw.wmnet

Mentioned in SAL (#wikimedia-operations) [2023-09-25T08:43:44Z] <jayme> jayme@cumin1001 conftool action : set/pooled=no; selector: name=kubernetes2010.* - T347267

Hey DC-Ops, could you please check on kubernetes2010

Change 960621 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/puppet@production] scap::dsh: temporarily exclude kubernetes2010

https://gerrit.wikimedia.org/r/960621

Change 960621 merged by Giuseppe Lavagetto:

[operations/puppet@production] scap::dsh: temporarily exclude kubernetes2010

https://gerrit.wikimedia.org/r/960621

Server is not getting to POST. Starting troubleshooting.

@JMeybohm looks like the system board has died. The server powers on, but even with a minimum hardware configuration it will not actually boot up. The iDRAC is also inaccessible.

This server is out of warranty. I have one spare board from a decommissioned server that might work. Do you want to proceed with that? I am aware that the recent Kubernetes expansion was partly meant to allow decommissioning some of the older machines. I'm not sure whether this is one of them or whether it's too early in the process for that. Please let me know what you decide and I'll proceed.

Thanks! We did not plan to decom immediately, so it would really help us if you could replace the board and we could run the server for a bit longer.

Got it replaced. I updated the asset tag, iDRAC IP, and BIOS/iDRAC firmware, and adjusted some BIOS settings. The iDRAC and network addresses are pinging, and there are no alerts that I can see. Please let me know if anything needs adjusting on my end.

Mentioned in SAL (#wikimedia-operations) [2023-09-25T16:51:53Z] <jayme> uncordon kubernetes2010.codfw.wmnet - T347267

Nice, thanks for handling this so quickly! Nothing more to do from your end.

Change 960003 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] Revert "scap::dsh: temporarily exclude kubernetes2010"

https://gerrit.wikimedia.org/r/960003

Change 960003 merged by JMeybohm:

[operations/puppet@production] Revert "scap::dsh: temporarily exclude kubernetes2010"

https://gerrit.wikimedia.org/r/960003

Hi @Jhancock.wm just a heads up, I rebooted kubernetes2010 to change the CPU power management BIOS setting that was set to BIOS control instead of OS control, which meant we couldn't change the CPU frequency governor. Cheers :)
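
For anyone verifying this after a board swap, here is a minimal sketch of how the governor could be checked from the OS, assuming the standard Linux cpufreq sysfs layout. The helper `read_governor` and its `sysfs_root` parameter are hypothetical (the parameter exists only to make the function testable). When the firmware keeps control of power management, the `cpufreq` directory is typically missing, so the function returns `None` rather than raising:

```python
from pathlib import Path
from typing import Optional


def read_governor(cpu: int = 0, sysfs_root: str = "/sys") -> Optional[str]:
    """Return the active cpufreq governor for one CPU, or None if unavailable.

    Hypothetical helper: reads
    <sysfs_root>/devices/system/cpu/cpu<N>/cpufreq/scaling_governor.
    """
    path = (
        Path(sysfs_root)
        / "devices/system/cpu"
        / f"cpu{cpu}"
        / "cpufreq/scaling_governor"
    )
    try:
        return path.read_text().strip()
    except OSError:
        # Missing file or directory: no cpufreq interface is exposed to the OS.
        return None


if __name__ == "__main__":
    governor = read_governor()
    if governor is None:
        print("no cpufreq interface; power management may be under BIOS control")
    else:
        print(f"cpu0 governor: {governor}")
```

A `None` result is only a hint that the BIOS setting should be checked; a missing cpufreq driver would produce the same result.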