
codfw row D recable and add QFX
Closed, Resolved (Public)

Description

The recabling should not cause any service interruption (a similar recabling in eqiad caused a few seconds of downtime, but none for codfw row D).
All servers in row D are listed on https://netbox.wikimedia.org/dcim/devices/?q=&rack_group_id=14&role=server&status=1

The rack D4 switch replacement will cause up to 30 min of downtime for the following servers:
https://netbox.wikimedia.org/dcim/devices/?q=&rack_group_id=14&rack_id=70&role=server&status=1
mc2033, 10*mw, wdqs2006, sessionstore2003, ores2008
cc
@Eevans for sessionstore2003 (T209389)
@elukey for mc2033 and mw and "to ensure we removed any mcrouter proxies from the config"
@Gehel for wdqs2006
@akosiaris for ores2008

Tentatively planned for Wednesday, December 19th, 4pm UTC (10am Dallas time), 3h window
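
A quick sanity check of the window, as a minimal sketch. It assumes Dallas is on CST (UTC-6) in December and uses a fixed offset rather than a time zone database; the variable names are illustrative only.

```python
from datetime import datetime, timedelta, timezone

# Assumption: Dallas is on CST (UTC-6) on this date; a fixed offset avoids
# depending on an installed time zone database.
CST = timezone(timedelta(hours=-6), "CST")

# Window from the task: December 19th 2018, 4pm UTC, 3 hours long.
start_utc = datetime(2018, 12, 19, 16, 0, tzinfo=timezone.utc)
end_utc = start_utc + timedelta(hours=3)

start_local = start_utc.astimezone(CST)
end_local = end_utc.astimezone(CST)

print(start_utc.strftime("%A %Y-%m-%d %H:%M UTC"))
print(start_local.strftime("%H:%M %Z"), "to", end_local.strftime("%H:%M %Z"))
```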

1/ preparations

  • Rack QFX [papaul]
  • Connect console [papaul]
  • Connect USB drive containing Junos 14.1X53-D43.7 (present in install2002:/home/ayounsi/jinstall-qfx-5-14.1X53-D43.7-domestic-signed.tgz) [papaul]
  • Pre-populate SFP-Ts [papaul]
ge-4/0/0 
ge-4/0/1 
ge-4/0/2 
ge-4/0/3 
ge-4/0/4 
ge-4/0/5 
ge-4/0/6 
ge-4/0/7 
ge-4/0/8 
ge-4/0/9 
ge-4/0/10
ge-4/0/11
ge-4/0/12
  • Upgrade and configure VCP on QFX [arzhel]
request system software add jinstall-qfx-5-14.1X53-D43.7-domestic-signed.tgz force-host...
request virtual-chassis mode fabric mixed local
request virtual-chassis vc-port set pic-slot 0 port 52 local
request virtual-chassis vc-port set pic-slot 0 port 53 local
request system zeroize
  • Get QFX serial#
  • Pre run (but don't connect) VC links [papaul]
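
The SFP-T port list and the VC-port commands above follow a regular pattern, so they could be generated rather than typed. A minimal sketch: the member number, port range and command text are taken from the list above, while the function names are hypothetical helpers, not part of any existing tooling.

```python
MEMBER = 4                   # the switch being replaced is FPC 4
SFP_T_PORTS = range(0, 13)   # ports 0 to 12 inclusive, as listed above
VC_PORTS = (52, 53)          # VC ports used on the QFX


def sfp_t_interfaces(member=MEMBER, ports=SFP_T_PORTS):
    """Return the ge- interface names that get an SFP-T."""
    return [f"ge-{member}/0/{port}" for port in ports]


def vc_port_commands(ports=VC_PORTS, pic_slot=0):
    """Return the 'request virtual-chassis vc-port set' lines for the given ports."""
    return [
        f"request virtual-chassis vc-port set pic-slot {pic_slot} port {port} local"
        for port in ports
    ]


if __name__ == "__main__":
    print("\n".join(sfp_t_interfaces()))
    print("\n".join(vc_port_commands()))
```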

2/ recabling

[Figure: Virtual Chassis Fabric-codfw 10G + recable.png]

To be on the safe side:

  • Depool site in DNS [arzhel]
  • Redirect eqsin/ulsfo caches to eqiad [arzhel]
  • Downtime VC ports Icinga alert [arzhel]
  • Insert uplink module to A8 (hot-insertable) [papaul]
  • Enable all VC ports (except uplinks) on spines [arzhel]
  • Enable VC ports on fpc8 uplink module [arzhel]
  • Add: [papaul]

Links are 40G unless 10G is specified
fpc2-fpc4
fpc5-fpc7

  • Confirm working [arzhel]
  • Remove: [papaul]

fpc3:1/2-fpc4:1/0
fpc3:1/0-fpc1:1/1
fpc5:1/1-fpc6:1/0
fpc1:1/0-fpc8:1/0
fpc6:1/1-fpc8:1/1

  • Add: [papaul]

fpc1-fpc7
fpc2-fpc6
fpc2-fpc8 (2*10G)

  • Confirm working [arzhel]
  • Remove fpc8:1/2-fpc7:0/50 [papaul]
  • Add fpc8-fpc7 (with 2*10G) [papaul]
  • Add fpc3-fpc7 [papaul]
  • cleanup unused VC ports [arzhel]
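
To double-check the cable counts per switch before walking the row, the add/remove lists above can be tallied per FPC member. A rough sketch: the link strings are copied from the steps above, the helper names are hypothetical, and a "2*10G" link is counted as one logical link even though it is two cables.

```python
import re
from collections import Counter

# Links added, in the order they appear in the steps above.
ADDED = [
    "fpc2-fpc4", "fpc5-fpc7",
    "fpc1-fpc7", "fpc2-fpc6", "fpc2-fpc8",  # fpc2-fpc8 is 2*10G
    "fpc8-fpc7",                            # 2*10G
    "fpc3-fpc7",
]

# Links removed, with their port suffixes as written above.
REMOVED = [
    "fpc3:1/2-fpc4:1/0", "fpc3:1/0-fpc1:1/1", "fpc5:1/1-fpc6:1/0",
    "fpc1:1/0-fpc8:1/0", "fpc6:1/1-fpc8:1/1", "fpc8:1/2-fpc7:0/50",
]


def members(link):
    """Return the sorted pair of FPC member numbers named in a link string."""
    found = [int(n) for n in re.findall(r"fpc(\d+)", link)]
    if len(found) != 2:
        raise ValueError(f"expected two fpc members in {link!r}")
    return tuple(sorted(found))


def links_per_member(links):
    """Count how many of the given links touch each FPC member."""
    counts = Counter()
    for link in links:
        for member in members(link):
            counts[member] += 1
    return counts


added = links_per_member(ADDED)
removed = links_per_member(REMOVED)

if __name__ == "__main__":
    for member in sorted(set(added) | set(removed)):
        print(f"fpc{member}: +{added[member]} -{removed[member]}")
```

Most of the added links land on fpc2 and fpc7, which matches the steps treating them as the spines.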

3/ FPC4 replacement

  • Downtime hosts in Icinga [arzhel]
  • Shutdown EX [arzhel]
  • Reconfigure VCP with QFX serial# [arzhel]

set virtual-chassis member 4 serial-number XXXX

  • Power on QFX [papaul]
  • Connect console [papaul]
  • Move VC cables from EX to QFX (ports 52/53) [papaul]
  • Move servers' uplinks from EX to QFX [papaul]
  • verify monitoring is happy [arzhel]
  • Repool site [arzhel]
  • Update Netbox (rename old/new a4 switches, serial connection, etc) [papaul]
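
For the "Reconfigure VCP with QFX serial#" step, a small guard against pasting the XXXX placeholder could look like this. A minimal sketch: the command text is the one from the step above, the helper name is hypothetical, and the alphanumeric-only check is an assumption about what a serial number looks like.

```python
import re


def vc_member_command(member, serial):
    """Build the 'set virtual-chassis member' line, refusing placeholder serials."""
    serial = serial.strip()
    if not serial or set(serial.upper()) == {"X"}:
        raise ValueError("serial number is still a placeholder")
    # Assumption: serial numbers are plain alphanumeric strings.
    if not re.fullmatch(r"[A-Za-z0-9]+", serial):
        raise ValueError(f"unexpected characters in serial {serial!r}")
    return f"set virtual-chassis member {member} serial-number {serial}"


if __name__ == "__main__":
    # "EXAMPLE0001" is a made-up serial for illustration only.
    print(vc_member_command(4, "EXAMPLE0001"))
```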

https://www.juniper.net/documentation/en_US/junos/topics/task/configuration/vcf-removing.html
https://www.juniper.net/documentation/en_US/junos/topics/task/configuration/vcf-adding-device.html
https://www.juniper.net/documentation/en_US/release-independent/junos/topics/reference/specifications/uplink-module-ex4300.html

Event Timeline

ayounsi triaged this task as Medium priority. (Nov 26 2018, 11:34 PM)
ayounsi created this task.

@akosiaris for ores2008

Just schedule downtime in icinga and do whatever actions are required. The service will happily keep chugging along on the other 8 hosts in eqiad.

@Gehel for wdqs2006

Depooling and downtime in Icinga should be good enough. There should be no user traffic on this server and the updater will catch up on lag once connectivity is restored.

Need to check with Joe but I'd do the following:

  • replace mw2287 in mcrouter codfw proxy config with another one in DX (with X!=4)
  • before the maintenance, remove mc2033 from the mcrouter config, and roll restart all mcrouters
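
The "remove mc2033 from the mcrouter config" step amounts to filtering one host out of the server pools. A rough sketch, assuming the generic mcrouter JSON layout (a "pools" map whose entries carry a "servers" list of host:port strings); the actual change went through puppet (see the patches below), and the pool name and the other host names here are made up for illustration.

```python
import copy


def drop_server(config, hostname):
    """Return a copy of an mcrouter-style config with `hostname` removed from every pool."""
    new_config = copy.deepcopy(config)
    for pool in new_config.get("pools", {}).values():
        pool["servers"] = [
            server for server in pool.get("servers", [])
            if server.split(":")[0] != hostname
        ]
    return new_config


# Hypothetical config: only mc2033 comes from the task, the rest is illustrative.
example = {
    "pools": {
        "codfw": {
            "servers": ["mc2033:11211", "mc-other-1:11211", "mc-other-2:11211"],
        },
    },
    "route": "PoolRoute|codfw",
}

updated = drop_server(example, "mc2033")

if __name__ == "__main__":
    print(updated["pools"]["codfw"]["servers"])
```

The function returns a copy, so the original config stays available for the revert after the maintenance.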

Change 477472 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] mcrouter: replace codfw proxy before maintenance

https://gerrit.wikimedia.org/r/477472

Change 477473 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] mcrouter: temporary remove mc2033 to ease network maintenance

https://gerrit.wikimedia.org/r/477473

Change 477472 merged by Effie Mouzeli:
[operations/puppet@production] mcrouter: replace codfw proxy before maintenance

https://gerrit.wikimedia.org/r/477472

Mentioned in SAL (#wikimedia-operations) [2018-12-12T16:34:34Z] <jijiki> Merged 477472 "mcrouter: replace codfw proxy before maintenance", eqiad mcrouters are picking up the change - T210467

Change 477473 merged by Effie Mouzeli:
[operations/puppet@production] mcrouter: temporary remove mc2033 to ease network maintenance

https://gerrit.wikimedia.org/r/477473

connected to scs-c1-codfw on port 48

Some more shuffling around to facilitate DCops work (as row A has more machines): scheduling row A tomorrow and row D Wednesday.

Change 480776 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/dns@master] Depool codfw for row D recabling

https://gerrit.wikimedia.org/r/480776

fpc2-fpc8 xe-2/0/41 and xe-2/0/42
fpc7-fpc8 xe-7/0/43 and xe-7/0/44

Change 480776 merged by Ayounsi:
[operations/dns@master] Depool codfw for row D recabling

https://gerrit.wikimedia.org/r/480776

Change 480782 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/puppet@production] Redirect eqsin/ulsfo caches to eqiad

https://gerrit.wikimedia.org/r/480782

Change 480782 merged by Ayounsi:
[operations/puppet@production] Redirect eqsin/ulsfo caches to eqiad

https://gerrit.wikimedia.org/r/480782

Mentioned in SAL (#wikimedia-operations) [2018-12-19T16:04:23Z] <XioNoX> Redirect eqsin/ulsfo caches to eqiad - T210467

Mentioned in SAL (#wikimedia-operations) [2018-12-19T16:52:40Z] <XioNoX> codfw row D maintenance finished without issues - T210467

Papaul updated the task description.
Papaul subscribed.

Mentioned in SAL (#wikimedia-operations) [2018-12-19T19:33:41Z] <XioNoX> Revert "Redirect eqsin/ulsfo caches to eqiad" - T210467

This was completed in under 1h with no issues whatsoever.