asw-b2-codfw down
Open, High, Public

Description

IR: P43154

asw-b2-codfw crashed over the weekend and needs RMA’ing during the week. Console is dead. No user impact.

Event Timeline

RhinosF1 triaged this task as Unbreak Now! priority. Sat, Jan 14, 8:43 AM

The following hosts are down:
cp2031
ms-be2046
elastic2041
kafka-logging2002
mc2043
thanos-fe2002
elastic2063
cp2032
elastic2046
elastic2057
lvs2008
elastic2077
elastic2078
mc2042
ms-fe2010
ms-be2041
ml-cache2002
elastic2042

taavi renamed this task from asw-b-codfw virtual chassis crash to asw-b2-codfw down. Sat, Jan 14, 8:57 AM

From Netbox, b2 was previously the master, I think; the master is now b7:

filippo@asw-b-codfw> show virtual-chassis 

Preprovisioned Virtual Chassis Fabric
Fabric ID: 5ddb.095b.79f3
Fabric Mode: Mixed
                                                Mstr           Mixed Route Neighbor List
Member ID  Status   Serial No    Model          prio  Role      Mode  Mode ID  Interface
1 (FPC 1)  Prsnt    PE3714180129 ex4300-48t       0   Linecard     Y  F    7  vcp-255/1/0
2 (FPC 2)  NotPrsnt TA3713500157 qfx5100-48s-6q
3 (FPC 3)  Prsnt    PE3714180115 ex4300-48t       0   Linecard     Y  F    7  vcp-255/1/0
4 (FPC 4)  Prsnt    TA3717260326 qfx5100-48s-6q   0   Linecard     Y  F    7  vcp-255/0/52
5 (FPC 5)  Prsnt    PE3713320083 ex4300-48t       0   Linecard     Y  F    7  vcp-255/1/2
6 (FPC 6)  Prsnt    PE3713320002 ex4300-48t       0   Linecard     Y  F    7  vcp-255/1/2
7 (FPC 7)  Prsnt    TA3713500207 qfx5100-48s-6q 129   Master*      Y  F    8  vcp-255/0/43
                                                                           8  vcp-255/0/44
                                                                           6  vcp-255/0/49
                                                                           5  vcp-255/0/51
                                                                           1  vcp-255/0/53
                                                                           4  vcp-255/0/48
                                                                           3  vcp-255/0/50
8 (FPC 8)  Prsnt    PE3714030350 ex4300-48t       0   Linecard     Y  F    7  vcp-255/2/2
                                                                           7  vcp-255/2/3

{master:7}

Mentioned in SAL (#wikimedia-operations) [2023-01-14T09:46:06Z] <godog> issue 'request system reboot member 2' - T327001

The above was a test/hail mary given the lack of other options (i.e. no switched PDUs to remotely power-cycle the device); as expected, the command didn't have any practical effect.
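For the record, the attempt amounts to the following from the VC master's CLI (a sketch of the sequence; the dead member never acknowledged, so there was no meaningful output):

filippo@asw-b-codfw> request system reboot member 2
filippo@asw-b-codfw> show virtual-chassis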

Thanks for the task, quite an eventful week for switches :)

Indeed the switch is dead; the console doesn't respond either.

Everything that can be done remotely is done; the next step is to replace it with a spare switch and RMA it.

Monday is a US holiday, so that won't happen before Tuesday unless there is an emergency.

P43154 has a draft IR if needed; it collates timestamps and actionables to save SRE time.

RhinosF1 lowered the priority of this task from Unbreak Now! to High. Sat, Jan 14, 11:18 AM
RhinosF1 added a project: ops-codfw.
RhinosF1 updated the task description.

Lowering priority to deal with this on Monday/Tuesday; I also updated the description. Thanks everyone for the response.

@ayounsi
The spare switch is in place. I am using https://netbox-next.wikimedia.org/dcim/devices/3423/. Let me know if you want to set it up now or wait until next week.

I requested an RMA with Juniper; the case number is 2023-0115-620495.

There is also the link connecting lvs2009 to row B connected to that switch. So the impact is larger than expected.

Specifically, the following servers can't be reached by lvs2009 (excluding the mw servers, which we're handling in T327041):

$ grep -F 'WARN: ' /var/log/pybal.log | grep -v mw2 | grep -v parse | awk '{print $9}' | sort | uniq -c | sort -n
...
   2291 thumbor2003.codfw.wmnet
   2292 thumbor2004.codfw.wmnet
   2294 wdqs2005.codfw.wmnet
   2295 logstash2024.codfw.wmnet
   2295 logstash2025.codfw.wmnet
   2295 ores2003.codfw.wmnet
   2295 wcqs2001.codfw.wmnet
   2296 ores2004.codfw.wmnet
   2393 maps2006.codfw.wmnet
   2481 prometheus2005.codfw.wmnet
   3467 kubernetes2009.codfw.wmnet
   3467 kubernetes2020.codfw.wmnet
   3470 kubernetes2006.codfw.wmnet
   3473 kubernetes2010.codfw.wmnet
   4594 ms-fe2010.codfw.wmnet
   4595 restbase2013.codfw.wmnet
   4599 restbase2014.codfw.wmnet
   4599 restbase2024.codfw.wmnet
   4600 restbase2021.codfw.wmnet
   4601 restbase2019.codfw.wmnet
   6886 elastic2079.codfw.wmnet
   6892 elastic2063.codfw.wmnet
   6893 elastic2042.codfw.wmnet
   6894 wdqs2007.codfw.wmnet
   6896 elastic2057.codfw.wmnet
   6896 elastic2058.codfw.wmnet
   6896 elastic2070.codfw.wmnet
   6897 elastic2043.codfw.wmnet
   6897 elastic2064.codfw.wmnet
   6897 elastic2080.codfw.wmnet
   6901 thanos-fe2002.codfw.wmnet
   6902 elastic2077.codfw.wmnet
   6902 elastic2078.codfw.wmnet
   6903 elastic2044.codfw.wmnet
   6904 elastic2041.codfw.wmnet

Thankfully, only a handful of them are actually down.
Using the safe restart script on any of these will result in something like:

restbase2013:~$ sudo restart-envoyproxy 
2023-01-16 07:58:47,162 [INFO] Depooling currently pooled services
2023-01-16 07:58:57,254 [WARNING] Timed out checking http://lvs2009:9090/pools/restbase-https_7443/restbase2013.codfw.wmnet
2023-01-16 07:58:57,262 [INFO] Waiting 3 seconds before restarting the service...
2023-01-16 07:59:00,265 [INFO] Restarting the service
2023-01-16 07:59:01,410 [INFO] Repooling previously pooled services
2023-01-16 07:59:11,502 [WARNING] Timed out checking http://lvs2009:9090/pools/restbase-https_7443/restbase2013.codfw.wmnet

which is a long time to wait, but at least the restart works as expected.
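For reference, the pool state the script polls can also be queried by hand against PyBal's instrumentation endpoint (the same URL as in the log above); from a host that can still reach lvs2009 this should return the member's pooled state, while from the affected row it times out just like the script. A sketch:

$ curl http://lvs2009:9090/pools/restbase-https_7443/restbase2013.codfw.wmnet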

I would leave handling of each service to the respective service owners. The most worrisome situations seem to be thumbor, where we're at half capacity right now (cc @hnowlan), and to a lesser extent elasticsearch (cc @Gehel).

Mentioned in SAL (#wikimedia-operations) [2023-01-17T14:52:14Z] <urandom> disabling Cassandra hinted-handoff for codfw -- T327001

Mentioned in SAL (#wikimedia-operations) [2023-01-17T14:56:27Z] <urandom> truncating hints for Cassandra nodes in codfw row b -- T327001
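For reference, these hint operations map onto standard nodetool commands, roughly as below (a sketch; targeting the right Cassandra nodes, e.g. via cumin, is left out):

$ nodetool disablehandoff   # stop queueing hints for unreachable nodes
$ nodetool truncatehints    # drop hints already queued locally
$ nodetool enablehandoff    # re-enable once row B is healthy again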

Mentioned in SAL (#wikimedia-operations) [2023-01-18T15:31:46Z] <urandom> re-enabling Cassandra hinted-handoff for codfw -- T327001

@Eevans do you still need to re-enable Cassandra hints in codfw row b?

{{done}}

This is the return address:

Seagrove C/O Celestica
Killam Industrial Park
13701 N Lamar Dr.
Laredo, TX 78045 USA
Project: CLS HUB Laredo, TX
Attn: Juniper Returns
Must show RMA# R200442866

Weight: 27 lbs
Dimensions: 29x22x7