
codfw row B recable and add QFX
Closed, Resolved, Public

Description

The recabling should not cause any service interruption (a similar recabling caused a few seconds of downtime in eqiad, but none in codfw row C).
All servers in row B are listed on https://netbox.wikimedia.org/dcim/devices/?q=&rack_group_id=12&role=server&status=1

The rack B4 switch replacement will cause up to 30min downtime for the following servers:
https://netbox.wikimedia.org/dcim/devices/?q=&rack_group_id=12&rack_id=54&role=server&status=1
13*mw, 10*wtp, db2096, sessionstore2001
cc
@Eevans for sessionstore2001 (T209389)
@elukey for mw and wtp and "to ensure we removed any mcrouter proxies from the config"
@jcrespo for db2096

Planned for Wednesday, December 12th, at 4pm UTC (10am Dallas time), with a 3h window.
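
The UTC/Dallas conversion above can be double-checked with a small sketch. The year 2018 is an assumption taken from the SAL timestamps in the timeline, and the 3h figure is read here as the length of the window.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

# Assumption: year 2018, taken from the SAL entries in this task's timeline.
start_utc = datetime(2018, 12, 12, 16, 0, tzinfo=timezone.utc)

# Dallas uses the America/Chicago zone (CST in December, UTC-6).
start_dallas = start_utc.astimezone(ZoneInfo("America/Chicago"))

# Assumption: "3h" is the expected duration of the maintenance window.
end_utc = start_utc + timedelta(hours=3)

print(start_dallas.strftime("%Y-%m-%d %H:%M %Z"))
print(end_utc.strftime("%Y-%m-%d %H:%M %Z"))
```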

1/ preparations

  • Rack QFX [papaul]
  • Connect console [papaul]
  • Connect USB drive containing Junos 14.1X53-D43.7 (present in install2002:/home/ayounsi/jinstall-qfx-5-14.1X53-D43.7-domestic-signed.tgz) [papaul]
  • Pre-populate SFP-Ts [papaul]
ge-4/0/0 
ge-4/0/1
ge-4/0/17
ge-4/0/18
ge-4/0/19
ge-4/0/20
ge-4/0/21
ge-4/0/22
ge-4/0/23
ge-4/0/24
ge-4/0/25
ge-4/0/26
ge-4/0/27
ge-4/0/28
ge-4/0/29
ge-4/0/30
ge-4/0/31
ge-4/0/32
ge-4/0/33
ge-4/0/34
ge-4/0/35
ge-4/0/36
ge-4/0/37
ge-4/0/38
ge-4/0/39
  • Upgrade and configure VCP on QFX [arzhel]
request system software add jinstall-qfx-5-14.1X53-D43.7-domestic-signed.tgz force-host...
request virtual-chassis mode fabric mixed local
request virtual-chassis vc-port set pic-slot 0 port 52 local
request virtual-chassis vc-port set pic-slot 0 port 53 local
request system zeroize
  • Get QFX serial# (netbox)
  • Pre run (but don't connect) VC links [papaul]
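
The SFP-T port list in the preparations above follows a simple pattern (ports 0 and 1, then 17 through 39 on fpc4). A minimal sketch that regenerates it, for example to diff against what was physically populated; the function name and defaults are illustrative only:

```python
def sfp_t_ports(fpc=4, pic=0):
    """Return the ge- interface names listed for SFP-T pre-population."""
    # Ports 0 and 1, then the contiguous range 17..39 (inclusive).
    port_numbers = [0, 1] + list(range(17, 40))
    return [f"ge-{fpc}/{pic}/{n}" for n in port_numbers]

ports = sfp_t_ports()
print(len(ports))
print("\n".join(ports))
```

The list has 25 entries, which equals the number of servers named in the description (13 mw + 10 wtp + db2096 + sessionstore2001).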

2/ recabling

[Image: Virtual Chassis Fabric-codfw 10G + recable.png]

To be on the safe side:

  • Depool site in DNS [arzhel]
  • Redirect eqsin/ulsfo caches to eqiad [arzhel]
  • Insert uplink module to B8 (hot-insertable) [papaul]
  • Enable all VC ports (except uplinks) on spines [arzhel]
  • Enable VC ports on fpc8 uplink module [arzhel]
  • Add: [papaul]

Links are 40G unless 10G is specified
fpc2-fpc4
fpc5-fpc7

  • Confirm working [arzhel]
  • Remove: [papaul]

fpc3:1/2-fpc4:1/0
fpc3:1/0-fpc1:1/1
fpc5:1/1-fpc6:1/0
fpc1:1/0-fpc8:1/0
fpc6:1/1-fpc8:1/1

  • Add: [papaul]

fpc1-fpc7
fpc2-fpc6
fpc2-fpc8 (2*10G)

  • Confirm working [arzhel]
  • Remove fpc8:1/2-fpc7:0/50 [papaul]
  • Add fpc8-fpc7 (with 2*10G) [papaul]
  • Add fpc3-fpc7 [papaul]
  • Clean up unused VC ports [arzhel]
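
The add/remove sequence above can be sanity-checked with a small bookkeeping sketch. It is an assumption-laden model: the starting set contains only the links the plan explicitly removes (so they must exist beforehand), links that stay in place throughout are not listed in this task and are not modeled, and link speed (40G vs 2*10G) is ignored.

```python
def link(a, b):
    """Unordered link between two VC members (fpc numbers)."""
    return frozenset((a, b))

# Only the links the plan removes; untouched links are unknown here.
initial = {link(3, 4), link(3, 1), link(5, 6), link(1, 8), link(6, 8), link(8, 7)}

# The recabling steps, in the order given in the checklist.
steps = [
    ("add", [(2, 4), (5, 7)]),
    ("remove", [(3, 4), (3, 1), (5, 6), (1, 8), (6, 8)]),
    ("add", [(1, 7), (2, 6), (2, 8)]),
    ("remove", [(8, 7)]),
    ("add", [(8, 7), (3, 7)]),
]

def apply_steps(links, steps):
    """Apply add/remove steps, failing on a removal of an unknown link."""
    links = set(links)
    for action, pairs in steps:
        for a, b in pairs:
            l = link(a, b)
            if action == "add":
                links.add(l)
            elif l in links:
                links.remove(l)
            else:
                raise ValueError(f"cannot remove missing link fpc{a}-fpc{b}")
    return links

final = sorted(tuple(sorted(l)) for l in apply_steps(initial, steps))
for a, b in final:
    print(f"fpc{a}-fpc{b}")
```

In this model every remaining link ends on fpc2 or fpc7, which is consistent with the "spines" wording in the checklist.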

3/ FPC4 replacement

  • Downtime hosts in Icinga [arzhel]
  • Shutdown EX [arzhel]
  • Reconfigure VCP with QFX serial# [arzhel]

set virtual-chassis member 4 serial-number XXXX

  • Power on QFX [papaul]
  • Connect console [papaul]
  • Move VC cables from EX to QFX (ports 52/53) [papaul]
  • Move servers' uplinks from EX to QFX [papaul]
  • Verify monitoring is happy [arzhel]
  • Repool site [arzhel]
  • Update Netbox (rename old/new b4 switches, serial connection, etc.) [papaul]

https://www.juniper.net/documentation/en_US/junos/topics/task/configuration/vcf-removing.html
https://www.juniper.net/documentation/en_US/junos/topics/task/configuration/vcf-adding-device.html
https://www.juniper.net/documentation/en_US/release-independent/junos/topics/reference/specifications/uplink-module-ex4300.html

Event Timeline

ayounsi created this task.

No mcrouter codfw proxies present in B4, all good.

The switch is connected to port 48 on scs-a1-codfw

The part's ETA is today, so this work is rescheduled to tomorrow (Dec. 12th), same time.

For the connection fpc2-fpc8 on fpc2 I will be using xe-2/0/41 and xe-2/0/42
For the connection fpc7-fpc8 on fpc7 I will be using xe-2/0/43 and xe-2/0/44

Change 479230 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/dns@master] Disable codfw for row B recabling

https://gerrit.wikimedia.org/r/479230

Change 479230 merged by Ayounsi:
[operations/dns@master] Disable codfw for row B recabling

https://gerrit.wikimedia.org/r/479230

Mentioned in SAL (#wikimedia-operations) [2018-12-12T15:54:31Z] <XioNoX> Depool codfw for row B recabling - T210456

Change 479233 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/puppet@production] Redirect eqsin/ulsfo caches to eqiad

https://gerrit.wikimedia.org/r/479233

Change 479233 merged by Ayounsi:
[operations/puppet@production] Redirect eqsin/ulsfo caches to eqiad

https://gerrit.wikimedia.org/r/479233

Mentioned in SAL (#wikimedia-operations) [2018-12-12T16:01:47Z] <XioNoX> Redirect eqsin/ulsfo caches to eqiad - T210456

Mentioned in SAL (#wikimedia-operations) [2018-12-12T17:54:56Z] <XioNoX> shutting down asw-b4-codfw - T210456

Change 479262 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/dns@master] Revert "Disable codfw for row B recabling"

https://gerrit.wikimedia.org/r/479262

Change 479263 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/puppet@production] Revert "Redirect eqsin/ulsfo caches to eqiad"

https://gerrit.wikimedia.org/r/479263

Change 479263 merged by Ayounsi:
[operations/puppet@production] Revert "Redirect eqsin/ulsfo caches to eqiad"

https://gerrit.wikimedia.org/r/479263

Mentioned in SAL (#wikimedia-operations) [2018-12-12T19:28:17Z] <XioNoX> revert redirecting eqsin/ulsfo caches to eqiad - T210456

Change 479262 merged by Ayounsi:
[operations/dns@master] Revert "Disable codfw for row B recabling"

https://gerrit.wikimedia.org/r/479262

Papaul updated the task description.

Complete