
codfw row A recable and add QFX
Closed, Resolved (Public)

Description

The recabling should not cause any service interruption (a similar recabling caused a few seconds of downtime in eqiad, but none in codfw row C).
All servers in row A are listed on https://netbox.wikimedia.org/dcim/devices/?q=&rack_group_id=11&role=server&status=1

The rack A4 switch replacement will cause up to 30min downtime for the following servers:
https://netbox.wikimedia.org/dcim/devices/?q=&rack_id=46&role=server&status=1
ores2001, 2*ganeti, 15*mw
cc @akosiaris to know what specific actions need to be taken for Ores and Ganeti
@elukey "to ensure we removed any mcrouter proxies from the config"

Looking at doing it Tuesday December 18th - 4pm UTC - 10am Dallas time - 3h

1/ preparations

  • Rack QFX [papaul]
  • Connect console [papaul]
  • Connect USB drive containing Junos 14.1X53-D43.7 (present in install2002:/home/ayounsi/jinstall-qfx-5-14.1X53-D43.7-domestic-signed.tgz) [papaul]
  • Pre-populate SFP-Ts [papaul]
ge-4/0/0 
ge-4/0/1 
ge-4/0/2 
ge-4/0/3 
ge-4/0/4 
ge-4/0/5 
ge-4/0/6 
ge-4/0/7 
ge-4/0/8 
ge-4/0/9 
ge-4/0/10
ge-4/0/11
ge-4/0/12
ge-4/0/13
ge-4/0/14
ge-4/0/15
ge-4/0/16
ge-4/0/17
  • Upgrade and configure VCP on QFX [arzhel]
request system software add jinstall-qfx-5-14.1X53-D43.7-domestic-signed.tgz force-host...
request virtual-chassis mode fabric mixed local
request virtual-chassis vc-port set pic-slot 0 port 52 local
request virtual-chassis vc-port set pic-slot 0 port 53 local
request system zeroize
  • Get QFX serial# (In Netbox)
  • Pre run (but don't connect) VC links [papaul]
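
The SFP-T port list and the two VC-port commands above follow a regular pattern, so they can be generated rather than typed by hand. A minimal Python sketch, assuming the helper names `sfp_t_ports` and `vc_port_commands` (both hypothetical, not part of any Juniper tooling); the command text is copied from the steps above and nothing here talks to a switch:

```python
def sfp_t_ports(fpc: int = 4, pic: int = 0, count: int = 18) -> list[str]:
    """Return interface names ge-<fpc>/<pic>/0 .. ge-<fpc>/<pic>/<count-1>."""
    return [f"ge-{fpc}/{pic}/{port}" for port in range(count)]


def vc_port_commands(ports=(52, 53), pic_slot: int = 0) -> list[str]:
    """Render the 'request virtual-chassis vc-port set' lines from the plan."""
    return [
        f"request virtual-chassis vc-port set pic-slot {pic_slot} port {port} local"
        for port in ports
    ]


if __name__ == "__main__":
    # Prints the 18 SFP-T ports, then the two VC-port commands.
    print("\n".join(sfp_t_ports()))
    print("\n".join(vc_port_commands()))
```

The output is only meant to be pasted into a checklist or console session after review.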

2/ recabling

To be on the safe side:

  • Depool site in DNS [arzhel]
  • Redirect eqsin/ulsfo caches to eqiad [arzhel]
  • Redirect authdns2001 to authdns1001 [arzhel]
  • Insert uplink module to A8 (hot-insertable) [papaul]
  • Enable all VC ports (except uplinks) on spines [arzhel]
  • Enable VC ports on fpc8 uplink module [arzhel]
  • Add: [papaul]

Links are 40G unless 10G is specified
fpc2-fpc4
fpc5-fpc7

  • Confirm working [arzhel]
  • Remove: [papaul]

fpc3:1/2-fpc4:1/0
fpc3:1/0-fpc1:1/1
fpc5:1/1-fpc6:1/0
fpc1:1/0-fpc8:1/0
fpc6:1/1-fpc8:1/1

  • Add: [papaul]

fpc1-fpc7
fpc2-fpc6
fpc2-fpc8 (2*10G)

  • Confirm working [arzhel]
  • Remove fpc8:1/2-fpc7:0/50 [papaul]
  • Add fpc8-fpc7 (with 2*10G) [papaul]
  • Add fpc3-fpc7 [papaul]
  • Clean up unused VC ports [arzhel]
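
The add/remove ordering above matters because the virtual chassis should stay connected after every step. The following is a rough Python sketch of how that could be checked offline: `parse_link` reads the link notation used in this task (for example `fpc3:1/2-fpc4:1/0`), and `is_connected` tests whether a set of links still joins all members. Both function names are hypothetical, and the three-member ring in the demo is a toy example, not the codfw row A topology, which this task does not list in full.

```python
import re
from collections import defaultdict

# fpcN, optionally followed by :pic/port, on each side of the dash.
LINK_RE = re.compile(r"^(fpc\d+)(?::(\d+/\d+))?-(fpc\d+)(?::(\d+/\d+))?")


def parse_link(text: str):
    """Parse 'fpc3:1/2-fpc4:1/0' or 'fpc2-fpc8 (2*10G)' into
    ((member, port), (member, port)); port is None when not given."""
    m = LINK_RE.match(text.strip())
    if m is None:
        raise ValueError(f"not a link: {text!r}")
    a, a_port, b, b_port = m.groups()
    return (a, a_port), (b, b_port)


def is_connected(members, links) -> bool:
    """True if every member can reach every other member over `links`."""
    members = set(members)
    neighbours = defaultdict(set)
    for a, b in links:
        neighbours[a].add(b)
        neighbours[b].add(a)
        members.update((a, b))
    if not members:
        return True
    start = next(iter(members))
    seen, stack = {start}, [start]
    while stack:  # depth-first walk from an arbitrary member
        for nxt in neighbours[stack.pop()] - seen:
            seen.add(nxt)
            stack.append(nxt)
    return seen == members


if __name__ == "__main__":
    print(parse_link("fpc3:1/2-fpc4:1/0"))
    # Toy three-member ring (hypothetical, NOT the real row A cabling).
    ring = {("a", "b"), ("b", "c"), ("c", "a")}
    print(is_connected("abc", ring))                 # ring intact
    print(is_connected("abc", ring - {("a", "b")}))  # one link removed
    print(is_connected("abc", {("a", "b")}))         # member c isolated
```

Replaying each "Add" and "Remove" step against the real link inventory and calling `is_connected` after each one would flag an ordering that isolates a member before any cable is pulled.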

3/ FPC4 replacement

  • Downtime hosts in Icinga [arzhel]
  • Shutdown EX [arzhel]
  • Reconfigure VCP with QFX serial# [arzhel]

set virtual-chassis member 4 serial-number XXXX

  • Power on QFX [papaul]
  • Connect console [papaul]
  • Move VC cables from EX to QFX (ports 52/53) [papaul]
  • Move servers' uplinks from EX to QFX [papaul]
  • Verify monitoring is happy [arzhel]
  • Repool site [arzhel]
  • Remove ns1 redirect [arzhel]
  • Redirect eqsin/ulsfo back to codfw [arzhel]
  • Update Netbox (rename old/new a4 switches, serial connection, etc) [papaul]

https://www.juniper.net/documentation/en_US/junos/topics/task/configuration/vcf-removing.html
https://www.juniper.net/documentation/en_US/junos/topics/task/configuration/vcf-adding-device.html
https://www.juniper.net/documentation/en_US/release-independent/junos/topics/reference/specifications/uplink-module-ex4300.html

Event Timeline

ayounsi triaged this task as Medium priority. Nov 26 2018, 8:42 PM
ayounsi created this task.
ayounsi reassigned this task from ayounsi to Papaul. Nov 26 2018, 9:51 PM

ores2001, 2*ganeti, 15*mw
cc @akosiaris to know what specific actions need to be taken for Ores and Ganeti

For ores2001, nothing is really required aside from some downtime in Icinga. The software will keep chugging along happily on the other 8 nodes. The same goes for the mws (which are not receiving any end-user traffic anyway).

For ganeti, I'll have to empty the nodes of live VMs on that day. It's easy: ping me the previous day (Dec 4th) and I'll do it.

Papaul updated the task description. Dec 3 2018, 3:35 PM
Papaul updated the task description. Dec 3 2018, 3:40 PM
ayounsi updated the task description. Dec 3 2018, 7:12 PM

Parts keep getting delayed; the new shipping date is expected to be this Friday, so rescheduling the work for next Wednesday.

ayounsi updated the task description. Dec 4 2018, 9:11 PM

No mcrouter proxies on A4, all good.

The parts' ETA is today. For DCops' convenience, rescheduling this one to next Wednesday, and row B to Dec. 12th, same time.

ayounsi updated the task description. Dec 11 2018, 4:04 PM

fpc2-fpc8 connection xe-2/0/41 and xe-2/0/42
fpc7-fpc8 connection xe-7/0/43 and xe-7/0/44

Papaul updated the task description. Dec 17 2018, 5:33 PM
ayounsi updated the task description. Dec 17 2018, 6:59 PM

Some more shuffling around to facilitate DCops' work (as row A has more machines): scheduling row A tomorrow and row D Wednesday.

Mentioned in SAL (#wikimedia-operations) [2018-12-18T15:30:25Z] <akosiaris> empty ganeti2005, ganeti2006 for T210447

Change 480514 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/dns@master] Depool codfw for row A recabling

https://gerrit.wikimedia.org/r/480514

Change 480514 merged by Ayounsi:
[operations/dns@master] Depool codfw for row A recabling

https://gerrit.wikimedia.org/r/480514

Mentioned in SAL (#wikimedia-operations) [2018-12-18T15:44:12Z] <XioNoX> depool codfw for T210447

Change 480518 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/puppet@production] Redirect eqsin/ulsfo caches to eqiad

https://gerrit.wikimedia.org/r/480518

Change 480518 merged by Ayounsi:
[operations/puppet@production] Redirect eqsin/ulsfo caches to eqiad

https://gerrit.wikimedia.org/r/480518

Mentioned in SAL (#wikimedia-operations) [2018-12-18T15:47:10Z] <XioNoX> redirect eqsin/ulsfo caches to eqiad for T210447

Mentioned in SAL (#wikimedia-operations) [2018-12-18T15:53:23Z] <XioNoX> redirect ns1 to authdns1001 for T210447

ayounsi updated the task description. Dec 18 2018, 3:56 PM
ayounsi updated the task description. Dec 18 2018, 4:07 PM

Mentioned in SAL (#wikimedia-operations) [2018-12-18T16:07:29Z] <XioNoX> starting codfw row A recabling - T210447

Mentioned in SAL (#wikimedia-operations) [2018-12-18T17:58:00Z] <XioNoX> shutdown fpc4 for replacement - T210447

ayounsi updated the task description. Dec 18 2018, 6:31 PM

Mentioned in SAL (#wikimedia-operations) [2018-12-18T18:40:59Z] <XioNoX> Revert "Redirect eqsin/ulsfo caches to eqiad" - T210447

Mentioned in SAL (#wikimedia-operations) [2018-12-18T18:42:17Z] <XioNoX> repool codfw - T210447

Papaul reassigned this task from Papaul to ayounsi. Dec 18 2018, 6:49 PM
Papaul updated the task description.
ayounsi updated the task description. Dec 18 2018, 7:00 PM
ayounsi added a comment. Edited Dec 18 2018, 7:03 PM

This was completed 30 min ahead of schedule despite 2 issues:

  • A massive spike of multicast traffic, most likely due to the recabling, flooded the entire infrastructure and caused various issues; to be investigated in T212273
  • First 2, then 3 of the 4 ports of the uplink module didn't work; re-seating the module and then bouncing the VC ports solved the issue
ayounsi closed this task as Resolved. Dec 18 2018, 7:03 PM

Mentioned in SAL (#wikimedia-operations) [2018-12-18T19:06:57Z] <XioNoX> redirect ns1 back to authdns2001 - T210447

Mentioned in SAL (#wikimedia-operations) [2018-12-19T08:43:44Z] <akosiaris> rebalance row_A ganeti01.svc.codfw.wmnet nodegroup after recabling T210447
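
For a post-mortem, the SAL timestamps in the timeline above can be turned into elapsed times between milestones. A small Python sketch, assuming the labels in `EVENTS` as shorthand for the corresponding SAL entries (the timestamps are copied from this task; the helper names are hypothetical). It only measures the gaps between log entries and says nothing about actual host downtime:

```python
from datetime import datetime, timezone

# (label, SAL timestamp) pairs copied from the event timeline of this task.
EVENTS = [
    ("depool codfw", "2018-12-18T15:44:12Z"),
    ("start recabling", "2018-12-18T16:07:29Z"),
    ("shutdown fpc4", "2018-12-18T17:58:00Z"),
    ("repool codfw", "2018-12-18T18:42:17Z"),
]


def parse_sal(ts: str) -> datetime:
    """Parse a SAL timestamp such as '2018-12-18T16:07:29Z' as UTC."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def intervals(events):
    """Return (from_label, to_label, timedelta) for each consecutive pair."""
    parsed = [(label, parse_sal(ts)) for label, ts in events]
    return [(a, b, tb - ta) for (a, ta), (b, tb) in zip(parsed, parsed[1:])]


if __name__ == "__main__":
    for start, end, delta in intervals(EVENTS):
        print(f"{start} -> {end}: {delta}")
```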