IR: P43154
asw-b2-codfw crashed over the weekend and needs RMA’ing during the week. Console is dead. No user impact.
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | Joe | T327041 Scap deploy failed to depool codfw servers
Open | | Papaul | T327001 asw-b2-codfw down
The following hosts are down:
cp2031
ms-be2046
elastic2041
kafka-logging2002
mc2043
thanos-fe2002
elastic2063
cp2032
elastic2046
elastic2057
lvs2008
elastic2077
elastic2078
mc2042
ms-fe2010
ms-be2041
ml-cache2002
elastic2042
Mentioned in SAL (#wikimedia-operations) [2023-01-14T09:19:00Z] <Emperor> depool ms-fe2010 T327001
Mentioned in SAL (#wikimedia-operations) [2023-01-14T09:19:56Z] <Emperor> depool thanos-fe2002 T327001
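For reference, a depool like the ones above is usually a one-liner; this is only a sketch of the typical conftool workflow (exact tags and wrappers vary per cluster and may differ from what was actually run):

# Sketch: depool a host from its LVS services, then verify the state.
# Assumes the conftool 'depool' wrapper is installed on the host itself.
sudo depool                                               # run on ms-fe2010 / thanos-fe2002
# or, from a conftool/cumin host, by node name:
sudo confctl select 'name=ms-fe2010.codfw.wmnet' set/pooled=no
sudo confctl select 'name=ms-fe2010.codfw.wmnet' get      # confirm pooled=no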
From Netbox, b2 was previously the virtual-chassis master I think; the master is now b7.
filippo@asw-b-codfw> show virtual-chassis

Preprovisioned Virtual Chassis Fabric
Fabric ID: 5ddb.095b.79f3
Fabric Mode: Mixed
                                                Mstr           Mixed Route Neighbor List
Member ID  Status   Serial No    Model          prio  Role      Mode  Mode  ID  Interface
1 (FPC 1)  Prsnt    PE3714180129 ex4300-48t     0     Linecard   Y    F     7   vcp-255/1/0
2 (FPC 2)  NotPrsnt TA3713500157 qfx5100-48s-6q
3 (FPC 3)  Prsnt    PE3714180115 ex4300-48t     0     Linecard   Y    F     7   vcp-255/1/0
4 (FPC 4)  Prsnt    TA3717260326 qfx5100-48s-6q 0     Linecard   Y    F     7   vcp-255/0/52
5 (FPC 5)  Prsnt    PE3713320083 ex4300-48t     0     Linecard   Y    F     7   vcp-255/1/2
6 (FPC 6)  Prsnt    PE3713320002 ex4300-48t     0     Linecard   Y    F     7   vcp-255/1/2
7 (FPC 7)  Prsnt    TA3713500207 qfx5100-48s-6q 129   Master*    Y    F     8   vcp-255/0/43
                                                                            8   vcp-255/0/44
                                                                            6   vcp-255/0/49
                                                                            5   vcp-255/0/51
                                                                            1   vcp-255/0/53
                                                                            4   vcp-255/0/48
                                                                            3   vcp-255/0/50
8 (FPC 8)  Prsnt    PE3714030350 ex4300-48t     0     Linecard   Y    F     7   vcp-255/2/2
                                                                            7   vcp-255/2/3
{master:7}
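A couple of read-only Junos commands from the current master can confirm the missing member's state; this is only a sketch run over SSH (the switch alias is assumed), not a record of what was done:

# Sketch: confirm member 2's state from the surviving master (member 7).
ssh asw-b-codfw "show virtual-chassis vc-port"   # VC links towards member 2 should show down
ssh asw-b-codfw "show chassis alarms"            # any hardware alarms raised by the fabric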
Mentioned in SAL (#wikimedia-operations) [2023-01-14T09:46:06Z] <godog> issue 'request system reboot member 2' - T327001
The above was a test/hail mary due to the lack of other options (e.g. switched PDUs to power-cycle the device); as expected, the command didn't have any practical effect.
Thanks for the task, quite an eventful week for switches :)
Indeed the switch is dead; the console doesn't respond either.
Everything that can be done remotely is done, next step is to replace it with a spare switch and RMA it.
Monday is a US holiday, so that won't happen before Tuesday unless there is an emergency.
P43154 has a draft IR if needed; it collates times and actionables to save SRE time.
Lowering priority to deal with this on Monday/Tuesday; updated the description. Thanks everyone for the response.
@ayounsi
The spare switch is in place. I am using https://netbox-next.wikimedia.org/dcim/devices/3423/. Let me know if you want to set it up now or wait until next week.
The link connecting lvs2009 to row B is also on that switch, so the impact is larger than expected.
Specifically, the following servers can't be reached by lvs2009 (excluding the mw servers, which we're handling in T327041):
$ grep -F 'WARN: ' /var/log/pybal.log | grep -v mw2 | grep -v parse | awk '{print $9}' | sort | uniq -c | sort -n
...
   2291 thumbor2003.codfw.wmnet
   2292 thumbor2004.codfw.wmnet
   2294 wdqs2005.codfw.wmnet
   2295 logstash2024.codfw.wmnet
   2295 logstash2025.codfw.wmnet
   2295 ores2003.codfw.wmnet
   2295 wcqs2001.codfw.wmnet
   2296 ores2004.codfw.wmnet
   2393 maps2006.codfw.wmnet
   2481 prometheus2005.codfw.wmnet
   3467 kubernetes2009.codfw.wmnet
   3467 kubernetes2020.codfw.wmnet
   3470 kubernetes2006.codfw.wmnet
   3473 kubernetes2010.codfw.wmnet
   4594 ms-fe2010.codfw.wmnet
   4595 restbase2013.codfw.wmnet
   4599 restbase2014.codfw.wmnet
   4599 restbase2024.codfw.wmnet
   4600 restbase2021.codfw.wmnet
   4601 restbase2019.codfw.wmnet
   6886 elastic2079.codfw.wmnet
   6892 elastic2063.codfw.wmnet
   6893 elastic2042.codfw.wmnet
   6894 wdqs2007.codfw.wmnet
   6896 elastic2057.codfw.wmnet
   6896 elastic2058.codfw.wmnet
   6896 elastic2070.codfw.wmnet
   6897 elastic2043.codfw.wmnet
   6897 elastic2064.codfw.wmnet
   6897 elastic2080.codfw.wmnet
   6901 thanos-fe2002.codfw.wmnet
   6902 elastic2077.codfw.wmnet
   6902 elastic2078.codfw.wmnet
   6903 elastic2044.codfw.wmnet
   6904 elastic2041.codfw.wmnet
Thankfully, only a handful of them are actually down.
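To separate hosts that are hard down from hosts that are merely unreachable from lvs2009, a quick reachability loop from another vantage point (e.g. a cumin host) is enough; this is only a sketch, and the sample host list is an assumption:

# Sketch: ping a few of the pybal-flagged hosts from somewhere other than lvs2009.
for h in thumbor2003 wdqs2005 restbase2013 elastic2041; do
    host="${h}.codfw.wmnet"
    if ping -c1 -W2 "$host" >/dev/null 2>&1; then
        echo "up   $host"
    else
        echo "DOWN $host"
    fi
done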
Using the safe restart script on any of these will result in something like:
restbase2013:~$ sudo restart-envoyproxy
2023-01-16 07:58:47,162 [INFO] Depooling currently pooled services
2023-01-16 07:58:57,254 [WARNING] Timed out checking http://lvs2009:9090/pools/restbase-https_7443/restbase2013.codfw.wmnet
2023-01-16 07:58:57,262 [INFO] Waiting 3 seconds before restarting the service...
2023-01-16 07:59:00,265 [INFO] Restarting the service
2023-01-16 07:59:01,410 [INFO] Repooling previously pooled services
2023-01-16 07:59:11,502 [WARNING] Timed out checking http://lvs2009:9090/pools/restbase-https_7443/restbase2013.codfw.wmnet
which is a long time to wait, but at least the restart works as expected.
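The timeouts come from the script's check against pybal's HTTP pools endpoint on lvs2009 (the URL is visible in the log above); whether that endpoint is reachable from the host can be checked directly, as a sketch:

# Sketch: query pybal's pools API on lvs2009 directly (URL taken from the log above).
curl -sv --max-time 5 http://lvs2009:9090/pools/restbase-https_7443/restbase2013.codfw.wmnet
# A timeout here confirms the host simply can't reach lvs2009 across the broken row B link,
# rather than anything being wrong with the restart on restbase2013 itself.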
I would leave handling of each service to the respective service owners. The most worrisome situations seem to be thumbor, where we're at half capacity right now (cc @hnowlan), and to a lesser extent elasticsearch (cc @Gehel).
Mentioned in SAL (#wikimedia-operations) [2023-01-17T14:52:14Z] <urandom> disabling Cassandra hinted-handoff for codfw -- T327001
Mentioned in SAL (#wikimedia-operations) [2023-01-17T14:56:27Z] <urandom> truncating hints for Cassandra nodes in codfw row b -- T327001
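For context, disabling hinted handoff and truncating stored hints are standard Cassandra nodetool operations; the commands below are a sketch of what the SAL entries likely correspond to (the exact invocation may be wrapped in cookbooks or cumin), not a record of what was run:

# Sketch: standard nodetool operations matching the SAL entries above.
nodetool disablehandoff        # stop queuing hints for the unreachable row B nodes
nodetool truncatehints         # drop hints already stored for those nodes
# ...and once the replacement switch is in service:
nodetool enablehandoff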
Netbox up to date with the new switch information
https://netbox.wikimedia.org/dcim/devices/1883/
Mentioned in SAL (#wikimedia-operations) [2023-01-18T15:31:46Z] <urandom> re-enabling Cassandra hinted-handoff for codfw -- T327001
This is the return address:
Seagrove C/O Celestica
Killam Industrial Park
13701 N Lamar Dr.
Laredo, TX 78045 USA
Project: CLS HUB Laredo, TX
Attn: Juniper Returns
Must show RMA# R200442866
Weight: 27 lbs
Dimensions: 29 x 22 x 7