
codfw row C recable and add QFX
Closed, Resolved · Public

Description

The recabling should not cause any service interruption, although a similar recabling in eqiad did cause a few seconds of downtime, so the site should be depooled to be on the safe side.
All servers in row C are listed on https://netbox.wikimedia.org/dcim/devices/?tenant=0&q=&sort=name&rack_group_id=13&role=server

The rack C4 switch replacement will cause up to 30min downtime for the following servers:
https://netbox.wikimedia.org/dcim/devices/?rack_id=62&rack_group_id=13&role=server&status=1&tenant=0&q=&sort=name
1 mc, 11 wtp, 27 mw
CC @Joe and @elukey to confirm that's okay.

Looking at doing it Wednesday 7th - 4pm UTC - 10am Dallas time

1/preparations

  • Rack QFX [papaul]
  • Connect console [papaul]
  • Connect USB drive containing Junos 14.1X53-D43.7 (present in install2002:/home/ayounsi/jinstall-qfx-5-14.1X53-D43.7-domestic-signed.tgz) [papaul]
  • Pre-populate SFP-Ts [papaul]
ge-4/0/1 
ge-4/0/2 
ge-4/0/3 
ge-4/0/4 
ge-4/0/5 
ge-4/0/6 
ge-4/0/7 
ge-4/0/8 
ge-4/0/9 
ge-4/0/10
ge-4/0/11
ge-4/0/12
ge-4/0/13
ge-4/0/14
ge-4/0/15
ge-4/0/16
ge-4/0/17
ge-4/0/18
ge-4/0/19
ge-4/0/20
ge-4/0/21
ge-4/0/22
ge-4/0/23
ge-4/0/24
ge-4/0/25
ge-4/0/26
ge-4/0/27
ge-4/0/28
ge-4/0/29
ge-4/0/30
ge-4/0/31
ge-4/0/32
ge-4/0/33
ge-4/0/34
ge-4/0/35
ge-4/0/36
ge-4/0/37
ge-4/0/38
ge-4/0/39
  • Upgrade and configure VCP on QFX [arzhel]
request system software add jinstall-qfx-5-14.1X53-D43.7-domestic-signed.tgz force-host...
request virtual-chassis mode fabric mixed local
request system zeroize
  • Get QFX serial# [arzhel] (see the verification sketch after this list)
  • Pre-run VC links [papaul]
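
As a verification aid for the upgrade, mode change and serial-number steps above, the following operational commands can be run from the console once the switch comes back up after the zeroize (a sketch only; output formats vary by platform):
show version
show virtual-chassis mode
show chassis hardware | match Chassis
show version should report 14.1X53-D43.7, show virtual-chassis mode should confirm the fabric/mixed mode, and the Chassis line of show chassis hardware gives the serial number needed later for the FPC4 swap.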

2/ recabling

[Attached diagram: Virtual Chassis Fabric-codfw 10G + recable(2).png]

Ignore FPC8 as C8 is the frack (fundraising) rack

  • Depool site in DNS
  • Redirect eqsin/ulsfo caches to eqiad
  • Enable all VC ports (except uplinks) on spines [arzhel] (see the vc-port sketch after this list)
  • Shutdown fpc8 [arzhel]
  • Remove fpc8 from config [arzhel]
  • Remove fpc8 VC links [papaul]
  • Add: [papaul]

fpc2-fpc4
fpc5-fpc7

  • Confirm working [arzhel]
  • Remove: [papaul]

fpc3-fpc4
fpc3-fpc1
fpc5-fpc6

  • Add: [papaul]

fpc1-fpc7
fpc3-fpc7
fpc2-fpc6

  • Confirm working [arzhel]
  • cleanup unused VC ports [arzhel]
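
For reference, the switch-side steps above ("Enable all VC ports", "Shutdown fpc8", "Remove fpc8 from config", "Confirm working", "cleanup unused VC ports") map to commands roughly like the following. This is a sketch only: the pic-slot, port and member numbers are placeholders and depend on which physical ports the VC cables actually land on.

Enable or clean up a VC port, then confirm members and links are healthy:
request virtual-chassis vc-port set pic-slot 0 port 48 member 2
request virtual-chassis vc-port delete pic-slot 0 port 48 member 2
show virtual-chassis
show virtual-chassis vc-port

Shut down fpc8 and remove it from the configuration (assuming member 8 corresponds to FPC8 and is declared under the virtual-chassis stanza; the delete is done in configuration mode):
request system power-off member 8
delete virtual-chassis member 8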

3/ FPC4 replacement

  • Downtime hosts in Icinga [arzhel]
  • Shutdown EX [arzhel]
  • Reconfigure VCP with QFX serial# [arzhel]

set virtual-chassis member 4 serial-number XXXX

  • Power on QFX [papaul]
  • Enable VC ports on QFX
request virtual-chassis vc-port set pic-slot 0 port 52 local
request virtual-chassis vc-port set pic-slot 0 port 53 local
  • Move VC cables from EX to QFX [papaul]
  • Move servers' uplinks from EX to QFX [papaul] (sanity checks sketched below)
  • Repool site [arzhel]
  • Update Netbox [papaul]
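
Once the QFX has joined and the uplinks are moved, a quick sanity check could look like this (a sketch; it assumes the replacement keeps FPC ID 4):
show virtual-chassis
show virtual-chassis vc-port
show interfaces terse | match ge-4/0/
Member 4 should show as present with the QFX serial number configured above, the two vc-ports (0/52 and 0/53) should be up, and the moved server uplinks should come back as ge-4/0/* interfaces once recabled.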

https://www.juniper.net/documentation/en_US/junos/topics/task/configuration/vcf-removing.html
https://www.juniper.net/documentation/en_US/junos/topics/task/configuration/vcf-adding-device.html

Event Timeline

ayounsi triaged this task as Medium priority. Oct 29 2018, 10:25 PM
ayounsi created this task.
Restricted Application added a subscriber: Aklapper.

Here is the full list of hosts in that row. No outages expected, but brief (5s) connectivity interruption for some racks is possible.
CCing services owners, to know if it's an acceptable risk and if it can be mitigated by depooling services.
I realize it's also a short heads-up (this Thursday), so I'm fine with rescheduling it too.
@Joe @elukey @ema @BBlack @jcrespo @Andrew @akosiaris @Gehel @fgiunchedi

conf2002
cp2013
cp2014
cp2015
cp2016
cp2017
cp2018
db2033
db2036
db2037
db2038
db2041
db2043
db2044
db2046
db2047       
db2049
db2050
db2070
db2073
db2077
db2080
db2087
db2090
db2095
db8083
dbstore2002
deploy2001
dns2001
elastic2013
elastic2014
elastic2015
elastic2016
elastic2017
elastic2018
elastic2031
elastic2032
elastic2033
es2012
es2015
graphite2002
kafka2003
kubernetes2003
labtestcontrol2003
labtestservices2002
lvs2001:ens1f0
lvs2002:ens1f0
lvs2003:ens1f0
lvs2004:ens1f0
lvs2005:ens1f0
lvs2006:ens1f0
lvs2009:enp59s0f0
maps2003
mc2017
mc2018
mc2027
mc2028
mc2029
mc2030
mc2031
ms-be2015
ms-be2020
ms-be2021
ms-be2034
ms-be2035
ms-be2036
ms-be2042
ms-fe2007
mw2150
mw2151
mw2152
mw2153
mw2154
mw2155       
mw2156
mw2157
mw2158
mw2159
mw2160
mw2161
mw2162
mw2163
mw2164
mw2165
mw2166
mw2167
mw2168
mw2169
mw2170
mw2171
mw2172
mw2173
mw2174
mw2175
mw2176
mw2177
mw2178
mw2179
mw2180
mw2181
mw2182
mw2183
mw2184
mw2185
mw2186
mw2187
mw2188
mw2189
mw2190
mw2191
mw2192
mw2193
mw2194
mw2195
mw2196
mw2197
mw2198
mw2199
mw2200
mw2201
mw2202
mw2203
mw2204
mw2205
mw2206
mw2207
mw2208
mw2209
mw2210
mw2211
mw2212
mw2213
mw2214
mwlog2001
ores2005
ores2006
oresrdb2002
pc2005       
phab2001
rdb2005
restbase2003
restbase2004
restbase2008
restbase2011
scb2001
scb2006
tegmen
thumbor2001
thumbor2002
wdqs2001
wtp2011
wtp2012
wtp2013
wtp2014
wtp2015
wtp2016      
wtp2017
wtp2018
wtp2019
wtp2020

Thursday

No problem on my side; a short network outage is not a huge issue for the codfw dbs, but I cannot guarantee they will not page, and I won't be around to attend to it; someone else will have to.

CCing services owners, to know if it's an acceptable risk and if it can be mitigated by depooling services.

Short interruptions are ok with me

About the C4 switch replacement: there are 4 mw hosts in codfw that are acting as proxies for mcrouter to replicate keys from eqiad to codfw:

elukey@mw1347:~$ cat /etc/mcrouter/config.json | jq '.pools.codfw'
{
  "servers": [
    "10.192.0.61:11214:ascii:ssl",
    "10.192.16.54:11214:ascii:ssl",
    "10.192.32.102:11214:ascii:ssl",
    "10.192.48.109:11214:ascii:ssl"
  ]
}

Among these, mw2214 will be impacted, so it is probably wise to move that functionality to another node first. For the same reason I'd prefer mc2029 to be moved with high priority; having it down might cause some replication impact (not the end of the world, but still).

Change 470752 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] mcrouter: switch codfw proxy mw2214 with mw2163

https://gerrit.wikimedia.org/r/470752

There is a problem with the schedule I am afraid: Nov 1st is a holiday for most of the Europeans, plus I am a bit concerned about DBA presence since @Banyek and Manuel are on holiday tomorrow, and Jaime will not be available.

Can we re-schedule @ayounsi ?

Here is the full list of hosts in that row. No outages expected, but brief (5s) connectivity interruption for some racks is possible.
CCing services owners, to know if it's an acceptable risk and if it can be mitigated by depooling services.

Regarding things I manage/know about:

conf2002

This will most likely break etcd replication, as our master DC is codfw at the moment. It will page, and someone will need to restart etcdmirror on the relevant host in eqiad.

mc2017
mc2018
mc2027
mc2028
mc2029
mc2030
mc2031

After the end of the outage, we will need to verify that redis session replication restarts as expected

ms-be2015
ms-be2020
ms-be2021
ms-be2034
ms-be2035
ms-be2036
ms-be2042
ms-fe2007

We will most likely need to run swiftrepl after the outage to catch up on missing originals. Should we fail over traffic to eqiad only until that's done? @godog any opinion?

mw2150
mw2151
mw2152
mw2153
mw2154
mw2155
mw2156
mw2157
mw2158
mw2159
mw2160
mw2161
mw2162
mw2163
mw2164
mw2165
mw2166
mw2167
mw2168
mw2169
mw2170
mw2171
mw2172
mw2173
mw2174
mw2175
mw2176
mw2177
mw2178
mw2179
mw2180
mw2181
mw2182
mw2183
mw2184
mw2185
mw2186
mw2187
mw2188
mw2189
mw2190
mw2191
mw2192
mw2193
mw2194
mw2195
mw2196
mw2197
mw2198
mw2199
mw2200
mw2201
mw2202
mw2203
mw2204
mw2205
mw2206
mw2207
mw2208
mw2209
mw2210
mw2211
mw2212
mw2213
mw2214

We will need to ensure we have removed any mcrouter proxies from the config for the maintenance; Elukey is looking into it. We will also probably get paged, because pybal will not be able to depool that many servers from its configuration.

restbase2003
restbase2004
restbase2008
restbase2011

Losing a full row can be an issue for Cassandra; please check with @Eevans before proceeding.

wtp2011
wtp2012
wtp2013
wtp2014
wtp2015
wtp2016
wtp2017
wtp2018
wtp2019
wtp2020

Again, a pybal page can be expected.

I have to stress that I will not be around on Thursday, and neither will most European SREs. I'm not sure it's a good idea to proceed unless some of us agree to work on Thursday and move the bank holiday.

The heads-up was a bit late indeed, and I'd advise postponing the maintenance.

To be clear: I think we should do the maintenance without depooling anything and see what happens when we lose a row, even in an inactive datacenter. But we should do that (call it an extreme chaos engineering drill) when most people are not on vacation.

We will most likely need to run swiftrepl after the outage to catch up on missing originals. Should we fail over traffic to eqiad only until that's done? @godog any opinion?

No we won't; brief interruptions are well tolerated by swift, especially by the backends. For the frontends, in theory we should depool, but a brief plug/unplug event is fine too IMHO.

[ ... ]

restbase2003
restbase2004
restbase2008
restbase2011

Losing a full row can be an issue for Cassandra; please check with @Eevans before proceeding.

I'd expect a few errors as in-flight queries fail and requests are routed around the unreachable nodes, plus a 1/3 reduction in capacity (probably the more painful outcome, since we're IO-bound ATM). However, since this datacenter is only processing async updates, I don't expect there will be any (visible) issues.

I'll be around at this time; feel free to ping me on IRC if you want someone to keep an eye on it.

Thanks for the feedback; let's reschedule it for next Wednesday 7th - 4pm UTC - 10am Dallas time.

Note I didn't ask for a delay; neither Manuel (vacation) nor I (training) will be around on that new day either. Balasz will be, however.

Change 472173 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/dns@master] Depool codfw for codfw row C maintenance

https://gerrit.wikimedia.org/r/472173

Change 472177 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/puppet@production] Redirect eqsin/ulsfo caches to eqiad

https://gerrit.wikimedia.org/r/472177

Change 472173 merged by Ayounsi:
[operations/dns@master] Depool codfw for codfw row C maintenance

https://gerrit.wikimedia.org/r/472173

Mentioned in SAL (#wikimedia-operations) [2018-11-07T15:57:06Z] <XioNoX> depool codfw for row C maintenance - T208272

Change 472177 merged by Ayounsi:
[operations/puppet@production] Redirect eqsin/ulsfo caches to eqiad

https://gerrit.wikimedia.org/r/472177

Mentioned in SAL (#wikimedia-operations) [2018-11-07T15:58:41Z] <XioNoX> Redirect eqsin/ulsfo caches to eqiad - T208272

Mentioned in SAL (#wikimedia-operations) [2018-11-07T16:20:55Z] <XioNoX> Enable all VC ports (except uplinks) on spines - T208272

Mentioned in SAL (#wikimedia-operations) [2018-11-07T16:35:45Z] <XioNoX> shutdown asw-c-codfw FPC8 - T208272

Mentioned in SAL (#wikimedia-operations) [2018-11-07T16:37:58Z] <XioNoX> remove asw-c-codfw FPC8 from config - T208272

Change 470752 merged by Giuseppe Lavagetto:
[operations/puppet@production] mcrouter: switch codfw proxy mw2214 with mw2163

https://gerrit.wikimedia.org/r/470752

Mentioned in SAL (#wikimedia-operations) [2018-11-07T19:45:02Z] <XioNoX> asw-c-codfw maintenance finished successfully - T208272

Mentioned in SAL (#wikimedia-operations) [2018-11-07T19:48:19Z] <XioNoX> Revert "Redirect eqsin/ulsfo caches to eqiad" - T208272

This has been completed successfully.

Everything went as expected, nothing other than C4 went offline.

Maintenance took 45min longer than the planned window, mostly because my instructions were not detailed enough or were incorrect; this has been fixed in the task description (for future similar work).