
Rack and setup new eqiad row D switch stack (EX4300/QFX5100)
Closed, Resolved, Public

Description

This task is going to track the racking and setup of the 2nd generation switch stack for eqiad's row D, aka asw2-d-eqiad.

Steps needed, per meeting with @Cmjohnson / @mark:

  • Rack 6xEX4300 + 2xQFX5100 to row D - QFX (10G) should go in racks 2 and 7 [Chris]
  • Connect those to mgmt/serial [Chris]
  • Check via serial that switches work, ports are configured as down [Faidon]
  • Interconnect those with each other according to the Google Doc diagram (spine/leaf etc.) [Chris]
  • Stack the switches, upgrade JunOS, do the initial switch configuration (including DNS entries) [Faidon] (see the config sketch after this list)
  • Add to monitoring tools (LibreNMS, rancid/Icinga etc.)
  • Connect asw2-d with asw-d with 4x10G [Chris]
  • Move a few (e.g. D 1) servers from asw-d to asw2-d [Chris]
  • Test that IGMP multicast snooping works (and IPv6 still works) - T133387 [Faidon/Alex] (tested, doesn't work)
  • Move cr1 router uplinks from asw-d to asw2-d [Chris/Faidon]
  • Move cr2 router uplinks from asw-d to asw2-d [Chris/Faidon]
  • Physically relocate all of D 8 servers to D 7 :( - Connect to new stack [Chris]
  • Physically relocate all of D 2 servers to D 8 or connect all D 2 servers with 1G copper SFP, connect to new stack [Chris]
  • Move D 1, D 3, D 5 servers from asw-d to asw2-d [Chris]
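
For reference, the stacking/initial-configuration step above is essentially a mixed-mode Virtual Chassis preprovisioning exercise. A minimal sketch, assuming the two QFX5100s (racks 2 and 7) act as routing engines and the EX4300s as line cards; the serial numbers are placeholders, not the real asw2-d-eqiad values. request virtual-chassis mode mixed reboot has to be run on each member first (and reboots it):

request virtual-chassis mode mixed reboot

set virtual-chassis preprovisioned
set virtual-chassis member 1 role line-card serial-number <EX4300-serial>
set virtual-chassis member 2 role routing-engine serial-number <QFX5100-serial>
set virtual-chassis member 7 role routing-engine serial-number <QFX5100-serial>
set virtual-chassis member 8 role line-card serial-number <EX4300-serial>
commit

Members 3-6 would be preprovisioned as line cards the same way as 1 and 8.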

Event Timeline

The following things are now pending from @Cmjohnson:

  • The link between D 2 and D 5 does not seem to be working: it is seen as down from the D 5 side but not from the D 2 side, which also seems to notice when D 5 gets rebooted. Something weird is going on there that needs some extra physical debugging (I've tried rebooting a bunch of times). Since we have no other link right now, D 5 has not joined the stack at all, so this absolutely needs to be our next step. (Note that I upgraded asw-d5 to the same JunOS manually, so they all run the same JunOS now.)
  • We'll need to connect asw2-d with cr1-eqiad (and with asw-d-eqiad), 4x10G each. Let's sync up on IRC for that one; a LAG config sketch is below this list.
  • A lot of cross-stack links are missing, due to missing cables, cf. T149726.
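
For context, each of those 4x10G bundles is just an LACP aggregated-ethernet interface on the Junos side. A minimal sketch of the asw2-d end of the cr1 bundle, assuming ae1 towards cr1-eqiad (as used later in this task) and placeholder member ports on the two QFX spines; the device-count and port numbers are illustrative only:

set chassis aggregated-devices ethernet device-count 8
set interfaces xe-2/0/40 ether-options 802.3ad ae1
set interfaces xe-2/0/41 ether-options 802.3ad ae1
set interfaces xe-7/0/40 ether-options 802.3ad ae1
set interfaces xe-7/0/41 ether-options 802.3ad ae1
set interfaces ae1 aggregated-ether-options lacp active
set interfaces ae1 description "Core: cr1-eqiad:ae4"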

The 2<->5 link was due to a faulty cable. Chris has replaced that and the stack is fully formed now, albeit with not much redundancy (still waiting on cables).

We'll do cr1<->asw2 and asw<->asw2 links next.
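
To confirm the stack really is fully formed (and to see which redundancy links are still missing), the usual Virtual Chassis checks are:

show virtual-chassis status
show virtual-chassis vc-port

The first should list all eight members as Prsnt with the expected roles; the second shows per-member VC port status, so the missing cross-stack cables from T149726 show up there as absent or down ports.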

asw<->asw2 links are done, 4x10G on racks D 2 and D 7 (2 each).

The 4x10G links from cr1-eqiad:ae4 to asw-d-eqiad:ae1 have been moved over to asw2-d-eqiad:ae1, also on D 2 and D 7 (spines).
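
A quick sanity check of the moved uplinks from the asw2-d side (a sketch, nothing assumed beyond the ae1 bundle named above):

show interfaces ae1 terse
show lacp interfaces ae1
show lldp neighbors

ae1 should be up with all four member links collecting/distributing in the LACP output, and LLDP (if enabled on both ends) should show cr1-eqiad on the member ports.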

mc1033-mc1036 (new but inactive memcached servers) have also been moved over to asw2-d-eqiad as test hosts.

Change 320401 had a related patch set uploaded (by Jcrespo):
Depool db servers on row D except es1019

https://gerrit.wikimedia.org/r/320401

Change 320401 merged by jenkins-bot:
Depool db servers on row D except es1019

https://gerrit.wikimedia.org/r/320401

asw2-d-eqiad has been confirmed to be affected by T133387: enabling IGMP snooping on the QFXes breaks IPv6.

Since this affects us across DCs and is a relatively important issue, and since asw2-d-eqiad isn't in production yet, I would rather open a Juniper case and use asw2-d-eqiad as the guinea pig for whatever Juniper recommends as our course of action (such as service-disrupting JunOS upgrades, for example).

Thus, an asw2-d-eqiad deployment is currently blocked on T133387, which in turn is blocked on T147518 :(

Received the new cables and finished the row redundancy cabling, except for the D1-to-D8 link, which will need fiber.

D1 to D8 was patched with fiber QSFP+s (et-1/1/0 <-> et-8/1/0). The no-name optics we bought in T149726 appear as QSFP+-40G-CU3M in show chassis hardware but other than that everything looks good.
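
Assuming those two 40G ports are used as Virtual Chassis ports like the rest of the ring (the et-1/1/0 <-> et-8/1/0 naming suggests member 1 pic 1 port 0 and member 8 pic 1 port 0), converting and verifying them would look roughly like:

request virtual-chassis vc-port set pic-slot 1 port 0 member 1
request virtual-chassis vc-port set pic-slot 1 port 0 member 8
show virtual-chassis vc-port

If the ports already came up as VCPs by default, only the verification step is needed.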

Talked to Chris and Brandon, we're going to aim for doing the work on Wednesday April 26.

FYI @Andrew: labservices1001 will be caught up in this, as it lives in D3. Previously we had some issues when that host was offline where labservices1002 was not standing in as expected, IIRC. The outage is expected to be brief :) but I created T163402 to sort it out before the 26th, because you never know.

Change 349164 had a related patch set uploaded (by Alexandros Kosiaris):
[operations/puppet@production] puppetmaster: Depool puppetmaster1002

https://gerrit.wikimedia.org/r/349164

Change 349164 merged by Alexandros Kosiaris:
[operations/puppet@production] puppetmaster: Depool puppetmaster1002

https://gerrit.wikimedia.org/r/349164

Change 349419 had a related patch set uploaded (by Alexandros Kosiaris):
[operations/puppet@production] puppetmaster: Re-depool puppetmaster1002

https://gerrit.wikimedia.org/r/349419

Change 349419 merged by Alexandros Kosiaris:
[operations/puppet@production] puppetmaster: Re-depool puppetmaster1002

https://gerrit.wikimedia.org/r/349419

@ayounsi, to follow up on our previous IRC conversation:

My understanding is that restbase1009, 1014, 1015, and 1018 will all go down briefly as the patch cables are moved from one switch to another. We do have a production service running in eqiad during the switchover (change-propagation) that processes updates and stores to Cassandra. It is pretty fault tolerant, though: we have replicas distributed across machines in rows A, B, and D, and these four machines are all in row D, leaving two replicas online. Additionally, anything that fails would be retried afterwards.

That said, it'd be great if we could get a bit of advance notice in order to silence the Icinga notifications, and a heads-up after, so that we can ensure everything is still in order. Would this be possible?

> That said, it'd be great if we could get a bit of advance notice in order to silence the Icinga notifications, and a heads-up after, so that we can ensure everything is still in order. Would this be possible?

Sure, I will give you a heads-up ~1h before starting, if that works for you.

From the feedback I collected, here is what I believe the maintenance will look like. Please let me know if something is wrong or needs to be clarified/added.
Days before

  • Switchover from einsteinium to tegmen T163324
  • Depool puppetmaster1002 (DONE/akosiaris)
  • Switchover from oresrdb1001 to oresrdb1002 T163326
  • Ban all elasticsearch nodes of row D (glederrey)
  • Fail etcd over to codfw (DONE glavagetto+rcoccioli/akosiaris)
  • Ensure we can survive a loss of labservices1001 T163402
  • Move kafka1020 to row B T163002
  • Databases maintenance (DONE T162681)

1h before maintenance (13:30 UTC)

  • Ping Eevans (see above)
  • Ping glederrey for elasticsearch and logstash coordination
  • Disable the elasticsearch check https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=search.svc.eqiad.wmnet&service=ElasticSearch+health+check+for+shards
  • Downtime servers in racks D2 and D8
  • Ping elukey to drain traffic from hadoop nodes

30min before maintenance (14:00 UTC)

  • Ensure CR2 doesn't become VRRP master:
cr2-eqiad# show | compare 
[edit interfaces ae4 unit 1004 family inet address 208.80.155.99/27 vrrp-group 4]
+        priority 70;
[edit interfaces ae4 unit 1004 family inet6 address 2620:0:861:4:fe00::2/64 vrrp-inet6-group 4]
+        priority 70;
[edit interfaces ae4 unit 1020 family inet address 10.64.48.3/22 vrrp-group 20]
+        priority 70;
[edit interfaces ae4 unit 1020 family inet6 address 2620:0:861:107:fe00::3/64 vrrp-inet6-group 20]
+        priority 70;
[edit interfaces ae4 unit 1023 family inet address 10.64.53.3/24 vrrp-group 23]
+        priority 70;
[edit interfaces ae4 unit 1023 family inet6 address 2620:0:861:108:fe00::2/64 vrrp-inet6-group 23]
+        priority 70;
  • Disable interface between cr2 and asw-d-eqiad:ae2
cr2-eqiad# set interfaces ae4 disable
  • Verify connectivity is still there (ping from host in row-d to outside host)
  • Verify traffic goes (cr2 <->) cr1 <-> asw2 <-> asw (and the other way around)
  • Fiber re-cabling

Either move fiber:
from asw-d:xe-1/1/2 to asw2-d:xe-2/0/42
from asw-d:xe-6/0/31 to asw2-d:xe-2/0/43
from asw-d:xe-7/0/31 to asw2-d:xe-7/0/42
from asw-d:xe-8/0/31 to asw2-d:xe-7/0/43

Or run new fibers:
between cr2-eqiad:xe-3/0/3 and asw2-d:xe-2/0/42
between cr2-eqiad:xe-3/1/3 and asw2-d:xe-2/0/43
between cr2-eqiad:xe-4/0/3 and asw2-d:xe-7/0/42
between cr2-eqiad:xe-4/3/3 and asw2-d:xe-7/0/43

  • Write down the cable IDs
  • Verify individual links are up/up

  • Enable interface between cr2-eqiad and asw2:

cr2-eqiad# delete interfaces ae4 disable
  • Verify the LACP interface is up.
  • Verify some traffic is flowing through ae4 (example verification commands are at the end of this section).
  • Remove lower VRRP priority on cr2-eqiad:
[edit interfaces ae4 unit 1004 family inet address 208.80.155.99/27 vrrp-group 4]
-        priority 70;
[edit interfaces ae4 unit 1004 family inet6 address 2620:0:861:4:fe00::2/64 vrrp-inet6-group 4]
-        priority 70;
[edit interfaces ae4 unit 1020 family inet address 10.64.48.3/22 vrrp-group 20]
-        priority 70;
[edit interfaces ae4 unit 1020 family inet6 address 2620:0:861:107:fe00::3/64 vrrp-inet6-group 20]
-        priority 70;
[edit interfaces ae4 unit 1023 family inet address 10.64.53.3/24 vrrp-group 23]
-        priority 70;
[edit interfaces ae4 unit 1023 family inet6 address 2620:0:861:108:fe00::2/64 vrrp-inet6-group 23]
-        priority 70;
  • Rename interfaces:

On cr2-eqiad:

set interfaces xe-3/0/3 description "Core: asw2-d-eqiad:xe-2/0/42 {#XXX} [10Gbps DF]"
set interfaces xe-3/1/3 description "Core: asw2-d-eqiad:xe-2/0/43 {#XXX} [10Gbps DF]"
set interfaces xe-4/0/3 description "Core: asw2-d-eqiad:xe-7/0/42 {#XXX} [10Gbps DF]"
set interfaces xe-4/3/3 description "Core: asw2-d-eqiad:xe-7/0/43 {#XXX} [10Gbps DF]"
set interfaces ae4 description "Core: asw2-d-eqiad:ae2"

On asw2-d:

set interfaces xe-2/0/42 description "Core: cr2-eqiad:xe-3/0/3 {#XXX} [10Gbps DF]"
set interfaces xe-2/0/43 description "Core: cr2-eqiad:xe-3/1/3 {#XXX} [10Gbps DF]"
set interfaces xe-7/0/42 description "Core: cr2-eqiad:xe-4/0/3 {#XXX} [10Gbps DF]"
set interfaces xe-7/0/43 description "Core: cr2-eqiad:xe-4/3/3 {#XXX} [10Gbps DF]"
set interfaces ae2 description "Core: cr2-eqiad:ae4"
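
Example verification commands backing the "verify" steps above (a sketch; group and interface names are the ones from the diff/plan above):

show vrrp summary
show lacp interfaces ae4
monitor interface ae4

While the lowered priority is in place, cr1-eqiad should be VRRP master for the row D groups (4, 20, 23); once ae4 is re-enabled on cr2-eqiad, all four member links should show as collecting/distributing in the LACP output, and monitor interface ae4 gives a live view of traffic on the bundle.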

During maintenance

  • Move servers from rack D8 to D7
  • Move servers from rack D2 to D8
  • Move server uplinks from asw to asw2 in racks D1, D3-D6
  • Continuously monitor servers for unexpected outages

After maintenance

  • Ping Eevans (see above)
  • Repool puppetmaster1002 (akosiaris)
  • Unban elasticsearch nodes of row D (glederrey)
  • Elasticsearch: reindex that time period if we see any lost writes (glederrey)
  • Re-enable all Icinga checks
  • Verify no servers are left on asw (see the sketch after this list)
  • Disable interface between asw2 and asw
  • Power down asw
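
For the last three items, a rough sketch of the checks and the shutdown command (assuming ae0 is the asw2-side bundle towards asw, as noted further down in this task):

On asw-d-eqiad:
show interfaces descriptions
show interfaces terse | match "ge-.*up.*up"
show ethernet-switching table

On asw2-d-eqiad, once nothing but multicast is left on the link:
set interfaces ae0 disable
commit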

@ayounsi I would like to get an early start on this, no later than 09:30 EST. Will that be possible? Thanks

@Cmjohnson unfortunately there is another maintenance window scheduled to end at 10:00 EST (14:00 UTC); doing the maintenance after that time ensures there is no overlap.

> Days before
> Move kafka1020 to row B T163002

Note about this move: I will not be available today to shut down the node since I'll be on vacation, but @Ottomata should be able to assist if needed. If the move is planned for tomorrow (the 26th) before the row D work, feel free to ping me.

For the record, @akosiaris and I switched etcd client traffic to codfw to allow relocating conf1003 with ample time.

> For the record, @akosiaris and I switched etcd client traffic to codfw to allow relocating conf1003 with ample time.

Adding a note: conf1003 also runs Zookeeper for Kafka and Hadoop (Analytics). It shouldn't be a huge problem to have it down for extended maintenance, but let's keep it in mind :)

The databases affected by the move are now off and can be moved anytime:

es1019
db1094
db1093
db1092
db1091

We're pretty sure that the only Labs thing affected by this is instance creation. I've disabled instance creation for now, with https://gerrit.wikimedia.org/r/#/c/350414/ for Horizon and a live hack in OSM on silver.

Mentioned in SAL (#wikimedia-operations) [2017-04-26T13:54:15Z] <gehel> downtime "ElasticSearch health check for shards" checks for logstash and elasticsearch eqiad - T148506

Mentioned in SAL (#wikimedia-operations) [2017-04-26T13:56:19Z] <godog> downtime and poweroff ms-be 21 26 27 37 38 39 before switch relocation - T148506

Mentioned in SAL (#wikimedia-operations) [2017-04-26T14:04:07Z] <XioNoX> "cr2-eqiad# set interfaces ae4 disable" done, (1 ping loss) - T148506

Mentioned in SAL (#wikimedia-operations) [2017-04-26T15:12:50Z] <XioNoX> switch ports for rack D7 and D8 configured - T148506

@ayounsi,
from asw-d:xe-1/1/2 to asw2-d:xe-2/0/42 done
from asw-d:xe-6/0/31 to asw2-d:xe-2/0/43 new fiber cable #4009
from asw-d:xe-7/0/31 to asw2-d:xe-7/0/42 done
from asw-d:xe-8/0/31 to asw2-d:xe-7/0/43 done

Mentioned in SAL (#wikimedia-operations) [2017-04-26T15:33:44Z] <XioNoX> "cr2-eqiad# delete interfaces ae4 disable" done, confirmed links and LACP are up - T148506

Mentioned in SAL (#wikimedia-traffic) [2017-04-26T15:43:21Z] <XioNoX> VRRP priority removed, interfaces cr2/asw2 renamed - T148506

I currently can't SSH into any of the following hosts: cp1071, cp1072, cp1073 and cp1074. Presumably this is due to today's maintenance. Note that the hosts' networking configuration seems fine (e.g. I can ping them without issues and Icinga is not complaining about them).

This was due to T133387.
Hosts with an igmp-snooping membership don't receive IPv6 RAs and thus don't have a default route for v6; IPv4 was working fine.
Disabling igmp-snooping on asw2-d solved the issue.
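
For the record, a minimal sketch of what that disabling amounts to in configuration mode, assuming the whole protocols igmp-snooping stanza is removed rather than individual VLANs:

asw2-d-eqiad# delete protocols igmp-snooping
asw2-d-eqiad# commit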

Another issue was:

  1. inactive: interfaces interface-range vlan-analytics1-d-eqiad

Activating the range solved the unreachable-hosts issue.
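
For reference, re-enabling an inactive: statement like that is a one-liner in configuration mode, roughly:

asw2-d-eqiad# activate interfaces interface-range vlan-analytics1-d-eqiad
asw2-d-eqiad# commit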

Because of a miscommunication, the move of server uplinks from asw-d to asw2-d didn't happen today.

We are rescheduling it for 16:00 UTC tomorrow (the 27th) and expect it to last 1h max.
Racks D7 and D8 are all set and will not be impacted.

EDIT: the plan is to start with rack D1, lowest switch port, and work our way up the port numbers, one server at a time. Then move on to D3, D4, D5 (D6 is empty).

For reference, here are the racks' contents:
D1: https://racktables.wikimedia.org/index.php?page=rack&rack_id=2103
D3: https://racktables.wikimedia.org/index.php?page=rack&rack_id=2105
D4: https://racktables.wikimedia.org/index.php?page=rack&rack_id=2106
D5: https://racktables.wikimedia.org/index.php?page=rack&rack_id=2107

Please let me know if there is any issue with this new time.

Mentioned in SAL (#wikimedia-operations) [2017-04-27T16:32:16Z] <gehel> unbanning elasticsearch servers in eqiad row D - elastic10(17|18|19|20) - T148506

Mentioned in SAL (#wikimedia-operations) [2017-04-27T16:53:06Z] <gehel> unbanning all elasticsearch servers in eqiad row D - T148506

All servers have been moved; confirmed no more interfaces are up on asw and no traffic (other than multicast) remains on the asw-asw2 link.
Monitoring is good.
Disabled ae0 on asw2 (towards asw).

Will let it sit for a few days before unracking/powering off asw.

No more scheduled downtime? Can T162681 be closed?

Great! I'll start undoing some of the preparatory work, that is:

  • repool puppetmaster1002
  • switchover oresrdb.svc.eqiad.wmnet from oresrdb1002 => oresrdb1001

Change 351167 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/puppet@production] Remove asw-d-eqiad from monitoring - T148506

https://gerrit.wikimedia.org/r/351167

Change 351167 merged by Ayounsi:
[operations/puppet@production] Remove asw-d-eqiad from monitoring - T148506

https://gerrit.wikimedia.org/r/351167

Mentioned in SAL (#wikimedia-operations) [2017-05-01T18:19:06Z] <mutante> manually removed asw-d-eqiad remnants from /etc/icinga/puppet_hosts.cfg to fix icinga config after gerrit:351167 / T148506. fixes Icinga config error. then puppet adds it back

@Cmjohnson you're free to decommission/unrack asw-d-eqiad.

RobH closed subtask Unknown Object (Task) as Resolved. Jun 12 2017, 7:54 PM
RobH closed subtask Unknown Object (Task) as Resolved. Jun 12 2017, 8:02 PM
RobH closed subtask Unknown Object (Task) as Resolved.

Resolving this task; I created a subtask for the decom portion.