codfw row D recable and add QFX
Closed, Resolved (Public)

Description

The recabling should not cause any service interruption (although a similar recabling caused a few seconds of downtime in eqiad, it caused none for codfw row D).
All servers in row D are listed on https://netbox.wikimedia.org/dcim/devices/?q=&rack_group_id=14&role=server&status=1

The rack D4 switch replacement will cause up to 30 min of downtime for the following servers:
https://netbox.wikimedia.org/dcim/devices/?q=&rack_group_id=14&rack_id=70&role=server&status=1
mc2033, 10*mw, wdqs2006, sessionstore2003, ores2008
cc
@Eevans for sessionstore2003 (T209389)
@elukey for mc2033 and mw and "to ensure we removed any mcrouter proxies from the config"
@Gehel for wdqs2006
@akosiaris for ores2008

Looking at doing it on Wednesday, December 19th, at 4pm UTC (10am Dallas time), planned duration 3h.
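
The proposed slot can be sanity-checked with a short conversion. This is only a minimal sketch: it assumes Dallas is on Central Standard Time (UTC-6) in December, and the variable names are illustrative rather than taken from any tooling used for this task.

```python
from datetime import datetime, timedelta, timezone

# Assumption: Dallas observes Central Standard Time (UTC-6) in December.
CST = timezone(timedelta(hours=-6), "CST")

# Proposed start from the task description: Wednesday December 19th, 4pm UTC.
start_utc = datetime(2018, 12, 19, 16, 0, tzinfo=timezone.utc)

# Planned duration from the task description: 3 hours.
end_utc = start_utc + timedelta(hours=3)

# Same instant expressed in Dallas local time.
start_dallas = start_utc.astimezone(CST)

print(start_utc.strftime("%A %H:%M %Z"))
print(start_dallas.strftime("%A %H:%M %Z"))
print(end_utc.strftime("%H:%M %Z"))
```

A fixed offset is used instead of a timezone database lookup so the sketch has no external dependency; for dates near a daylight-saving change a real timezone database would be the safer choice.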

1/ preparations

  • Rack QFX [papaul]
  • Connect console [papaul]
  • Connect USB drive containing Junos 14.1X53-D43.7 (present in install2002:/home/ayounsi/jinstall-qfx-5-14.1X53-D43.7-domestic-signed.tgz) [papaul]
  • Pre-populate SFP-Ts [papaul]
ge-4/0/0 
ge-4/0/1 
ge-4/0/2 
ge-4/0/3 
ge-4/0/4 
ge-4/0/5 
ge-4/0/6 
ge-4/0/7 
ge-4/0/8 
ge-4/0/9 
ge-4/0/10
ge-4/0/11
ge-4/0/12
  • Upgrade and configure VCP on QFX [arzhel]
request system software add jinstall-qfx-5-14.1X53-D43.7-domestic-signed.tgz force-host...
request virtual-chassis mode fabric mixed local
request virtual-chassis vc-port set pic-slot 0 port 52 local
request virtual-chassis vc-port set pic-slot 0 port 53 local
request system zeroize
  • Get QFX serial#
  • Pre run (but don't connect) VC links [papaul]
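
The SFP-T port list in the preparation steps is a contiguous range on member 4, so it can be generated rather than typed by hand. A small sketch follows; the helper name `sfp_t_ports` is hypothetical, while the `ge-4/0/N` naming and the 0 to 12 range are taken from the list above.

```python
def sfp_t_ports(fpc: int = 4, pic: int = 0, first: int = 0, last: int = 12) -> list[str]:
    """Return interface names ge-<fpc>/<pic>/<port> for an inclusive port range."""
    return [f"ge-{fpc}/{pic}/{port}" for port in range(first, last + 1)]


ports = sfp_t_ports()

# One name per line, in the same form as the checklist.
print("\n".join(ports))
print(f"{len(ports)} SFP-Ts to pre-populate")
```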

2/ recabling

To be on the safe side:

  • Depool site in DNS [arzhel]
  • Redirect eqsin/ulsfo caches to eqiad [arzhel]
  • Downtime VC ports Icinga alert [arzhel]
  • Insert uplink module to A8 (hot-insertable) [papaul]
  • Enable all VC ports (except uplinks) on spines [arzhel]
  • Enable VC ports on fpc8 uplink module [arzhel]
  • Add: [papaul]

Links are 40G unless 10G is specified
fpc2-fpc4
fpc5-fpc7

  • Confirm working [arzhel]
  • Remove: [papaul]

fpc3:1/2-fpc4:1/0
fpc3:1/0-fpc1:1/1
fpc5:1/1-fpc6:1/0
fpc1:1/0-fpc8:1/0
fpc6:1/1-fpc8:1/1

  • Add: [papaul]

fpc1-fpc7
fpc2-fpc6
fpc2-fpc8 (2*10G)

  • Confirm working [arzhel]
  • Remove fpc8:1/2-fpc7:0/50 [papaul]
  • Add fpc8-fpc7 (with 2*10G) [papaul]
  • Add fpc3-fpc7 [papaul]
  • Clean up unused VC ports [arzhel]
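
The link notation used in the recabling steps (`fpc3:1/2-fpc4:1/0`, `fpc2-fpc8 (2*10G)`) is regular enough to parse, which can help when cross-checking the Add and Remove lists. The sketch below is an illustration only: the `Endpoint` and `Link` types and the `parse_link` function are hypothetical names, and treating an unannotated link as a single cable is an assumption. The 40G default comes from the note "Links are 40G unless 10G is specified".

```python
import re
from typing import NamedTuple, Optional


class Endpoint(NamedTuple):
    fpc: int
    pic: Optional[int]   # None when the plan names only the member
    port: Optional[int]


class Link(NamedTuple):
    a: Endpoint
    b: Endpoint
    cables: int          # assumption: 1 unless the plan says e.g. "2*10G"
    speed_gbps: int      # 40 unless 10G is specified


_ENDPOINT = r"fpc(\d+)(?::(\d+)/(\d+))?"
_LINK = re.compile(
    rf"^{_ENDPOINT}-{_ENDPOINT}(?:\s*\((?:with\s+)?(\d+)\*(\d+)G\))?$"
)


def _endpoint(fpc: str, pic: Optional[str], port: Optional[str]) -> Endpoint:
    return Endpoint(
        int(fpc),
        int(pic) if pic is not None else None,
        int(port) if port is not None else None,
    )


def parse_link(text: str) -> Link:
    """Parse one link entry written in the plan's notation."""
    match = _LINK.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognised link notation: {text!r}")
    groups = match.groups()
    cables = int(groups[6]) if groups[6] is not None else 1
    speed = int(groups[7]) if groups[7] is not None else 40
    return Link(_endpoint(*groups[0:3]), _endpoint(*groups[3:6]), cables, speed)


for entry in ["fpc3:1/2-fpc4:1/0", "fpc2-fpc8 (2*10G)", "fpc8-fpc7 (with 2*10G)"]:
    print(parse_link(entry))
```

Keeping `pic` and `port` optional lets the same parser handle both the port-level entries in the Remove lists and the member-level entries in the Add lists.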

3/ FPC4 replacement

  • Downtime hosts in Icinga [arzhel]
  • Shutdown EX [arzhel]
  • Reconfigure VCP with QFX serial# [arzhel]

set virtual-chassis member 4 serial-number XXXX

  • Power on QFX [papaul]
  • Connect console [papaul]
  • Move VC cables from EX to QFX (ports 52/53) [papaul]
  • Move servers' uplinks from EX to QFX [papaul]
  • Verify monitoring is happy [arzhel]
  • Repool site [arzhel]
  • Update Netbox (rename old/new d4 switches, serial connection, etc) [papaul]

https://www.juniper.net/documentation/en_US/junos/topics/task/configuration/vcf-removing.html
https://www.juniper.net/documentation/en_US/junos/topics/task/configuration/vcf-adding-device.html
https://www.juniper.net/documentation/en_US/release-independent/junos/topics/reference/specifications/uplink-module-ex4300.html

Event Timeline

ayounsi triaged this task as Medium priority. Nov 26 2018, 11:34 PM
ayounsi created this task.
Restricted Application added a subscriber: Aklapper. Nov 26 2018, 11:34 PM

@akosiaris for ores2008

Just schedule downtime in icinga and do whatever actions are required. The service will happily keep chugging along on the other 8 hosts in eqiad.

@Gehel for wdqs2006

Depooling and downtime in Icinga should be good enough. There should be no user traffic on this server, and the updater will catch up on lag once connectivity is restored.

elukey added a comment. Dec 4 2018, 7:29 AM

Need to check with Joe but I'd do the following:

  • replace mw2287 in mcrouter codfw proxy config with another one in DX (with X!=4)
  • before the maintenance, remove mc2033 from the mcrouter config, and roll restart all mcrouters
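
The two steps above amount to a filter over host locations. The following is a rough sketch under stated assumptions: the inventory and pool are hypothetical, only `mw2287` and `mc2033` are names from this task, the other entries are placeholders rather than real servers, and placing `mw2287` in rack D4 is inferred from the plan to replace it. The function name `replacement_candidates` is illustrative and is not part of mcrouter or Puppet.

```python
# Hypothetical inventory: host -> rack. Placeholder names are not real servers.
INVENTORY = {
    "mw2287": "D4",            # current proxy, in the rack under maintenance (inferred)
    "mw-placeholder-1": "D4",
    "mw-placeholder-2": "D3",
    "mw-placeholder-3": "A5",
    "mw-placeholder-4": "D6",
}


def replacement_candidates(inventory: dict[str, str], row: str = "D",
                           excluded_rack: int = 4) -> list[str]:
    """Hosts in `row` whose rack number differs from `excluded_rack` (DX with X != 4)."""
    result = []
    for host, rack in inventory.items():
        rack_row, rack_number = rack[0], int(rack[1:])
        if rack_row == row and rack_number != excluded_rack:
            result.append(host)
    return sorted(result)


candidates = replacement_candidates(INVENTORY)
print(candidates)

# Second step: drop mc2033 from a (hypothetical) memcached pool for the maintenance.
pool = ["mc2033", "mc-placeholder-1", "mc-placeholder-2"]
pool_during_maintenance = [host for host in pool if host != "mc2033"]
print(pool_during_maintenance)
```

In practice the change is made in the Puppet-managed mcrouter configuration and followed by a rolling restart, as the comment describes; the sketch only illustrates the selection rule.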

Change 477472 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] mcrouter: replace codfw proxy before maintenance

https://gerrit.wikimedia.org/r/477472

Change 477473 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] mcrouter: temporary remove mc2033 to ease network maintenance

https://gerrit.wikimedia.org/r/477473

jijiki added a subscriber: jijiki. Dec 5 2018, 5:01 PM

Change 477472 merged by Effie Mouzeli:
[operations/puppet@production] mcrouter: replace codfw proxy before maintenance

https://gerrit.wikimedia.org/r/477472

Mentioned in SAL (#wikimedia-operations) [2018-12-12T16:34:34Z] <jijiki> Merged 477472 "mcrouter: replace codfw proxy before maintenance", eqiad mcrouters are picking up the change - T210467

jijiki moved this task from Backlog/Radar to In Progress on the User-jijiki board. Dec 12 2018, 6:36 PM

Change 477473 merged by Effie Mouzeli:
[operations/puppet@production] mcrouter: temporary remove mc2033 to ease network maintenance

https://gerrit.wikimedia.org/r/477473

connected to scs-c1-codfw on port 48

Papaul updated the task description. Dec 17 2018, 6:28 PM
ayounsi updated the task description. Dec 17 2018, 7:00 PM

Some more shuffling around to facilitate DCops work (as row A has more machines): row A is now scheduled for tomorrow and row D for Wednesday.

Papaul updated the task description. Dec 17 2018, 7:23 PM
ayounsi updated the task description. Dec 17 2018, 7:38 PM

Change 480776 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/dns@master] Depool codfw for row D recabling

https://gerrit.wikimedia.org/r/480776

fpc2-fpc8 xe-2/0/41 and xe-2/0/42
fpc7-fpc8 xe-7/0/43 and xe-7/0/44

Change 480776 merged by Ayounsi:
[operations/dns@master] Depool codfw for row D recabling

https://gerrit.wikimedia.org/r/480776

Mentioned in SAL (#wikimedia-operations) [2018-12-19T15:48:40Z] <XioNoX> depool codfw - T210467

Papaul updated the task description. Dec 19 2018, 3:49 PM

Change 480782 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/puppet@production] Redirect eqsin/ulsfo caches to eqiad

https://gerrit.wikimedia.org/r/480782

Change 480782 merged by Ayounsi:
[operations/puppet@production] Redirect eqsin/ulsfo caches to eqiad

https://gerrit.wikimedia.org/r/480782

Mentioned in SAL (#wikimedia-operations) [2018-12-19T16:04:23Z] <XioNoX> Redirect eqsin/ulsfo caches to eqiad - T210467

ayounsi updated the task description. Dec 19 2018, 4:16 PM

Mentioned in SAL (#wikimedia-operations) [2018-12-19T16:33:48Z] <XioNoX> shutdown asw-d4-codfw - T210467

ayounsi updated the task description. Dec 19 2018, 4:38 PM

Mentioned in SAL (#wikimedia-operations) [2018-12-19T16:52:40Z] <XioNoX> codfw row D maintenance finished without issues - T210467

ayounsi updated the task description. Dec 19 2018, 4:53 PM
Papaul reassigned this task from Papaul to ayounsi. Dec 19 2018, 6:05 PM
Papaul updated the task description.
Papaul added a subscriber: Papaul.

Mentioned in SAL (#wikimedia-operations) [2018-12-19T19:32:11Z] <XioNoX> repool codfw - T210467

Mentioned in SAL (#wikimedia-operations) [2018-12-19T19:33:41Z] <XioNoX> Revert "Redirect eqsin/ulsfo caches to eqiad" - T210467

ayounsi closed this task as Resolved. Dec 19 2018, 7:35 PM

This was completed in under 1h with no issues whatsoever.
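
For reference, the intervals between the SAL entries logged on this task can be computed directly from their timestamps. This is a small sketch: the timestamps are copied from the timeline above, while the event labels and variable names are illustrative. Treating the gap between the "shutdown asw-d4-codfw" and "maintenance finished" entries as an upper bound on the rack D4 outage is an assumption, since the log does not record when each server regained connectivity.

```python
from datetime import datetime

SAL_FORMAT = "%Y-%m-%dT%H:%M:%SZ"

# Timestamps copied from the SAL entries in the timeline (all UTC).
events = {
    "depool codfw": "2018-12-19T15:48:40Z",
    "shutdown asw-d4-codfw": "2018-12-19T16:33:48Z",
    "maintenance finished": "2018-12-19T16:52:40Z",
    "repool codfw": "2018-12-19T19:32:11Z",
}
parsed = {name: datetime.strptime(ts, SAL_FORMAT) for name, ts in events.items()}

# Time between switch shutdown and the "finished" entry.
switch_swap = parsed["maintenance finished"] - parsed["shutdown asw-d4-codfw"]

# Time codfw stayed depooled in DNS.
depooled = parsed["repool codfw"] - parsed["depool codfw"]

print(switch_swap)  # 0:18:52
print(depooled)     # 3:43:31
```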