
Rack and setup new eqiad row D switch stack (EX4300/QFX5100)
Closed, Resolved, Public

Description

This task is going to track the racking and setup of the 2nd generation switch stack for eqiad's row D, aka asw2-d-eqiad.

Steps needed, per meeting with @Cmjohnson / @mark:

  • Rack 6xEX4300 + 2xQFX5100 to row D - QFX (10G) should go in racks 2 and 7 [Chris]
  • Connect those to mgmt/serial [Chris]
  • Check via serial that switches work, ports are configured as down [Faidon]
  • Interconnect those with each other according to the Google Doc diagram (spine/leaf etc.) [Chris]
  • Stack the switches, upgrade JunOS, do the initial switch configuration (including DNS entries) [Faidon] (see the config sketch after this list)
  • Add to monitoring tools (LibreNMS, rancid/Icinga etc.)
  • Connect asw2-d with asw-d with 4x10G [Chris]
  • Move a few (e.g. D 1) servers from asw-d to asw2-d [Chris]
  • Test that IGMP multicast snooping works (and IPv6 still works) - T133387 [Faidon/Alex] (tested, doesn't work)
  • Move cr1 router uplinks from asw-d to asw2-d [Chris/Faidon]
  • Move cr2 router uplinks from asw-d to asw2-d [Chris/Faidon]
  • Physically relocate all of D 8 servers to D 7 :( - Connect to new stack [Chris]
  • Physically relocate all of D 2 servers to D 8 or connect all D 2 servers with 1G copper SFP, connect to new stack [Chris]
  • Move D 1, D 3, D 5 servers from asw-d to asw2-d [Chris]
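
For reference, the stacking/initial-configuration step above is essentially a mixed-mode Virtual Chassis preprovisioning exercise. A minimal sketch, assuming the two QFX5100s (racks 2 and 7) act as routing engines and the EX4300s as line cards; the serial numbers are placeholders, not the real asw2-d-eqiad values. request virtual-chassis mode mixed reboot has to be run on each member first (and reboots it):

request virtual-chassis mode mixed reboot

set virtual-chassis preprovisioned
set virtual-chassis member 1 role line-card serial-number <EX4300-serial>
set virtual-chassis member 2 role routing-engine serial-number <QFX5100-serial>
set virtual-chassis member 7 role routing-engine serial-number <QFX5100-serial>
set virtual-chassis member 8 role line-card serial-number <EX4300-serial>
commit

Members 3-6 would be preprovisioned as line cards the same way as 1 and 8.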

Event Timeline

The following things are now pending from @Cmjohnson:

  • The link between D 2 and D 5 does not seem to be working: it is seen as down from the D 5 side but not from the D 2 side, which also seems to notice when D 5 gets rebooted. Something weird is going on there that needs some extra physical debugging (I've tried rebooting a bunch of times). Since we have no other link right now, D 5 has not joined the stack at all, so this absolutely needs to be our next step. (Note that I upgraded asw-d5 to the same JunOS manually, so they all run the same JunOS now.)
  • We'll need to connect asw2-d with cr1-eqiad (and with asw-d-eqiad), 4x10G each. Let's sync up on IRC for that one; a LAG config sketch is below this list.
  • A lot of cross-stack links are missing, due to missing cables, cf. T149726.
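
For context, each of those 4x10G bundles is just an LACP aggregated-ethernet interface on the Junos side. A minimal sketch of the asw2-d end of the cr1 bundle, assuming ae1 towards cr1-eqiad (as used later in this task) and placeholder member ports on the two QFX spines; the device-count and port numbers are illustrative only:

set chassis aggregated-devices ethernet device-count 8
set interfaces xe-2/0/40 ether-options 802.3ad ae1
set interfaces xe-2/0/41 ether-options 802.3ad ae1
set interfaces xe-7/0/40 ether-options 802.3ad ae1
set interfaces xe-7/0/41 ether-options 802.3ad ae1
set interfaces ae1 aggregated-ether-options lacp active
set interfaces ae1 description "Core: cr1-eqiad:ae4"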

The 2<->5 link was due to a faulty cable. Chris has replaced that and the stack is fully formed now, albeit with not much redundancy (still waiting on cables).

We'll do cr1<->asw2 and asw<->asw2 links next.
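
To confirm the stack really is fully formed (and to see which redundancy links are still missing), the usual Virtual Chassis checks are:

show virtual-chassis status
show virtual-chassis vc-port

The first should list all eight members as Prsnt with the expected roles; the second shows per-member VC port status, so the missing cross-stack cables from T149726 show up there as absent or down ports.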

asw<->asw2 links are done, 4x10G on racks D 2 and D 7 (2 each).

The 4x10G links from cr1-eqiad:ae4 to asw-d-eqiad:ae1 have been moved over to asw2-d-eqiad:ae1, also on D 2 and D 7 (spines).
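
A quick sanity check of the moved uplinks from the asw2-d side (a sketch, nothing assumed beyond the ae1 bundle named above):

show interfaces ae1 terse
show lacp interfaces ae1
show lldp neighbors

ae1 should be up with all four member links collecting/distributing in the LACP output, and LLDP (if enabled on both ends) should show cr1-eqiad on the member ports.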

mc1033-mc1036 (new but inactive memcached servers) have also been moved over to asw2-d-eqiad as test hosts.

Change 320401 had a related patch set uploaded (by Jcrespo):
Depool db servers on row D except es1019

https://gerrit.wikimedia.org/r/320401

Change 320401 merged by jenkins-bot:
Depool db servers on row D except es1019

https://gerrit.wikimedia.org/r/320401

asw2-d-eqiad has been confirmed to be affected by T133387: enabling IGMP snooping on the QFXes breaks IPv6.

Since this affects us across DCs and is a relatively important issue, and since asw2-d-eqiad isn't in production yet, I would rather open a Juniper case and use asw2-d-eqiad as the guinea pig for whatever Juniper recommends as our course of action (such as service-disrupting JunOS upgrades, for example).

Thus, an asw2-d-eqiad deployment is currently blocked on T133387, which in turn is blocked on T147518 :(

Received the new cables and finished the row redundancy cabling, except for the D1-to-D8 link, which will need fiber.

D1 to D8 was patched with fiber QSFP+s (et-1/1/0 <-> et-8/1/0). The no-name optics we bought in T149726 appear as QSFP+-40G-CU3M in show chassis hardware but other than that everything looks good.
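
Assuming those two 40G ports are used as Virtual Chassis ports like the rest of the ring (the et-1/1/0 <-> et-8/1/0 naming suggests member 1 pic 1 port 0 and member 8 pic 1 port 0), converting and verifying them would look roughly like:

request virtual-chassis vc-port set pic-slot 1 port 0 member 1
request virtual-chassis vc-port set pic-slot 1 port 0 member 8
show virtual-chassis vc-port

If the ports already came up as VCPs by default, only the verification step is needed.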

Talked to Chris and Brandon, we're going to aim for doing the work on Wednesday April 26.

FYI @Andrew: labservices1001 will be caught up in this, as it lives in D3. Previously we had some issues when that host was offline where labservices1002 was not standing in as expected, IIRC. The outage is expected to be brief :) but I created T163402 to sort it out before the 26th, because you never know.

Change 349164 had a related patch set uploaded (by Alexandros Kosiaris):
[operations/puppet@production] puppetmaster: Depool puppetmaster1002

https://gerrit.wikimedia.org/r/349164

Change 349164 merged by Alexandros Kosiaris:
[operations/puppet@production] puppetmaster: Depool puppetmaster1002

https://gerrit.wikimedia.org/r/349164

Change 349419 had a related patch set uploaded (by Alexandros Kosiaris):
[operations/puppet@production] puppetmaster: Re-depool puppetmaster1002

https://gerrit.wikimedia.org/r/349419

Change 349419 merged by Alexandros Kosiaris:
[operations/puppet@production] puppetmaster: Re-depool puppetmaster1002

https://gerrit.wikimedia.org/r/349419

@ayounsi, to follow up on our previous IRC conversation:

My understanding is that restbase1009, 1014, 1015, and 1018 will all go down briefly as the patch cables are moved from one switch to another. We do have a production service running in eqiad during the switchover (change-propagation) that processes updates and stores to Cassandra. It is pretty fault tolerant, though: we have replicas distributed across machines in rows A, B, and D, and these four machines are all in row D, leaving two replicas online. Additionally, anything that fails would be retried afterwards.

That said, it'd be great if we could get a bit of advance notice in order to silence the Icinga notifications, and a heads-up after, so that we can ensure everything is still in order. Would this be possible?

> That said, it'd be great if we could get a bit of advance notice in order to silence the Icinga notifications, and a heads-up after, so that we can ensure everything is still in order. Would this be possible?

Sure, I will give you a heads-up ~1h before starting, if that works for you.

From the feedback I collected, here is what I believe the maintenance will look like. Please let me know if something is wrong or needs to be clarified/added.
Days before

  • Switchover from einsteinium to tegmen T163324
  • Depool puppetmaster1002 (DONE/akosiaris)
  • Switchover from oresrdb1001 to oresrdb1002 T163326
  • Ban all elasticsearch nodes of row D (glederrey)
  • Fail etcd over to codfw (DONE glavagetto+rcoccioli/akosiaris)
  • Ensure we can survive a loss of labservices1001 T163402
  • Move kafka1020 to row B T163002
  • Databases maintenance (DONE T162681)

1h before maintenance (13:30 UTC)

  • Ping Eevans (see above)
  • Ping glederrey for elasticsearch and logstash coordination
  • Disable the elasticsearch check https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=search.svc.eqiad.wmnet&service=ElasticSearch+health+check+for+shards
  • Downtime servers in racks D2 and D8
  • Ping elukey to drain traffic from hadoop nodes

30min before maintenance (14:00 UTC)

  • Ensure CR2 doesn't become VRRP master:
cr2-eqiad# show | compare 
[edit interfaces ae4 unit 1004 family inet address 208.80.155.99/27 vrrp-group 4]
+        priority 70;
[edit interfaces ae4 unit 1004 family inet6 address 2620:0:861:4:fe00::2/64 vrrp-inet6-group 4]
+        priority 70;
[edit interfaces ae4 unit 1020 family inet address 10.64.48.3/22 vrrp-group 20]
+        priority 70;
[edit interfaces ae4 unit 1020 family inet6 address 2620:0:861:107:fe00::3/64 vrrp-inet6-group 20]
+        priority 70;
[edit interfaces ae4 unit 1023 family inet address 10.64.53.3/24 vrrp-group 23]
+        priority 70;
[edit interfaces ae4 unit 1023 family inet6 address 2620:0:861:108:fe00::2/64 vrrp-inet6-group 23]
+        priority 70;
  • Disable interface between cr2 and asw-d-eqiad:ae2
cr2-eqiad# set interfaces ae4 disable
  • Verify connectivity is still there (ping from host in row-d to outside host)
  • Verify traffic goes (cr2 <->) cr1 <-> asw2 <-> asw (and the other way around)
  • Fiber re-cabling

Either move fiber:
from asw-d:xe-1/1/2 to asw2-d:xe-2/0/42
from asw-d:xe-6/0/31 to asw2-d:xe-2/0/43
from asw-d:xe-7/0/31 to asw2-d:xe-7/0/42
from asw-d:xe-8/0/31 to asw2-d:xe-7/0/43

Or run new fibers:
between cr2-eqiad:xe-3/0/3 and asw2-d:xe-2/0/42
between cr2-eqiad:xe-3/1/3 and asw2-d:xe-2/0/43
between cr2-eqiad:xe-4/0/3 and asw2-d:xe-7/0/42
between cr2-eqiad:xe-4/3/3 and asw2-d:xe-7/0/43

  • Write down the cable IDs
  • Verify individual links are up/up

  • Enable interface between cr2-eqiad and asw2:

cr2-eqiad# delete interfaces ae4 disable
  • Verify the LACP interface is up.
  • Verify some traffic is flowing through ae4 (example verification commands are at the end of this section).
  • Remove lower VRRP priority on cr2-eqiad:
[edit interfaces ae4 unit 1004 family inet address 208.80.155.99/27 vrrp-group 4]
-        priority 70;
[edit interfaces ae4 unit 1004 family inet6 address 2620:0:861:4:fe00::2/64 vrrp-inet6-group 4]
-        priority 70;
[edit interfaces ae4 unit 1020 family inet address 10.64.48.3/22 vrrp-group 20]
-        priority 70;
[edit interfaces ae4 unit 1020 family inet6 address 2620:0:861:107:fe00::3/64 vrrp-inet6-group 20]
-        priority 70;
[edit interfaces ae4 unit 1023 family inet address 10.64.53.3/24 vrrp-group 23]
-        priority 70;
[edit interfaces ae4 unit 1023 family inet6 address 2620:0:861:108:fe00::2/64 vrrp-inet6-group 23]
-        priority 70;
  • Rename interfaces:

On cr2-eqiad:

set interfaces xe-3/0/3 description "Core: asw2-d-eqiad:xe-2/0/42 {#XXX} [10Gbps DF]"
set interfaces xe-3/1/3 description "Core: asw2-d-eqiad:xe-2/0/43 {#XXX} [10Gbps DF]"
set interfaces xe-4/0/3 description "Core: asw2-d-eqiad:xe-7/0/42 {#XXX} [10Gbps DF]"
set interfaces xe-4/3/3 description "Core: asw2-d-eqiad:xe-7/0/43 {#XXX} [10Gbps DF]"
set interfaces ae4 description "Core: asw2-d-eqiad:ae2"

On asw2-d:

set interfaces xe-2/0/42 description "Core: cr2-eqiad:xe-3/0/3 {#XXX} [10Gbps DF]"
set interfaces xe-2/0/43 description "Core: cr2-eqiad:xe-3/1/3 {#XXX} [10Gbps DF]"
set interfaces xe-7/0/42 description "Core: cr2-eqiad:xe-4/0/3 {#XXX} [10Gbps DF]"
set interfaces xe-7/0/43 description "Core: cr2-eqiad:xe-4/3/3 {#XXX} [10Gbps DF]"
set interfaces ae2 description "Core: cr2-eqiad:ae4"
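
Example verification commands backing the "verify" steps above (a sketch; group and interface names are the ones from the diff/plan above):

show vrrp summary
show lacp interfaces ae4
monitor interface ae4

While the lowered priority is in place, cr1-eqiad should be VRRP master for the row D groups (4, 20, 23); once ae4 is re-enabled on cr2-eqiad, all four member links should show as collecting/distributing in the LACP output, and monitor interface ae4 gives a live view of traffic on the bundle.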

During maintenance

  • Move servers from rack D8 to D7
  • Move servers from rack D2 to D8
  • Move server uplinks from asw to asw2 in racks D1, D3-D6
  • Continuously monitor servers for unexpected outages

After maintenance

  • Ping Eevans (see above)
  • Repool puppetmaster1002 (akosiaris)
  • Unban elasticsearch nodes of row D (glederrey)
  • Elasticsearch: reindex that time period if we see any lost writes (glederrey)
  • Re-enable all Icinga checks
  • Verify no servers are left on asw (see the sketch after this list)
  • Disable interface between asw2 and asw
  • Power down asw
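
For the last three items, a rough sketch of the checks and the shutdown command (assuming ae0 is the asw2-side bundle towards asw, as noted further down in this task):

On asw-d-eqiad:
show interfaces descriptions
show interfaces terse | match "ge-.*up.*up"
show ethernet-switching table

On asw2-d-eqiad, once nothing but multicast is left on the link:
set interfaces ae0 disable
commit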

@ayounsi I would like to get an early start on this, no later than 09:30 EST. Will that be possible? Thanks

@Cmjohnson unfortunately there is another maintenance window scheduled to end at 10:00 EST (14:00 UTC); doing the maintenance after that time ensures there is no overlap.

> Days before
> Move kafka1020 to row B T163002

Note about this move: I will not be available today to shut down the node since I'll be on vacation, but @Ottomata should be able to assist if needed. If the move is planned for tomorrow (the 26th) before the row D work, feel free to ping me.

For the record, @akosiaris and I switched etcd client traffic to codfw to allow relocating conf1003 with ample time.

> For the record, @akosiaris and I switched etcd client traffic to codfw to allow relocating conf1003 with ample time.

Adding a note: conf1003 also runs Zookeeper for Kafka and Hadoop (Analytics). It shouldn't be a huge problem to have it down for extended maintenance, but let's keep it in mind :)

The databases affected by the move are now off and can be moved anytime:

es1019
db1094
db1093
db1092
db1091

We're pretty sure that the only Labs thing affected by this is instance creation. I've disabled instance creation for now, with https://gerrit.wikimedia.org/r/#/c/350414/ for Horizon and a live hack in OSM on silver.

Mentioned in SAL (#wikimedia-operations) [2017-04-26T13:54:15Z] <gehel> downtime "ElasticSearch health check for shards" checks for logstash and elasticsearch eqiad - T148506

Mentioned in SAL (#wikimedia-operations) [2017-04-26T13:56:19Z] <godog> downtime and poweroff ms-be 21 26 27 37 38 39 before switch relocation - T148506

Mentioned in SAL (#wikimedia-operations) [2017-04-26T14:04:07Z] <XioNoX> "cr2-eqiad# set interfaces ae4 disable" done, (1 ping loss) - T148506

Mentioned in SAL (#wikimedia-operations) [2017-04-26T15:12:50Z] <XioNoX> switch ports for rack D7 and D8 configured - T148506

@ayounsi,
from asw-d:xe-1/1/2 to asw2-d:xe-2/0/42 done
from asw-d:xe-6/0/31 to asw2-d:xe-2/0/43 new fiber cable #4009
from asw-d:xe-7/0/31 to asw2-d:xe-7/0/42 done
from asw-d:xe-8/0/31 to asw2-d:xe-7/0/43 done

Mentioned in SAL (#wikimedia-operations) [2017-04-26T15:33:44Z] <XioNoX> "cr2-eqiad# delete interfaces ae4 disable" done, confirmed links and LACP are up - T148506

Mentioned in SAL (#wikimedia-traffic) [2017-04-26T15:43:21Z] <XioNoX> VRRP priority removed, interfaces cr2/asw2 renamed - T148506

I currently can't SSH into any of the following hosts: cp1071, cp1072, cp1073 and cp1074. Presumably this is due to today's maintenance. Note that the hosts' networking configuration seems fine (e.g. I can ping them without issues and Icinga is not complaining about them).

This was due to T133387.
Hosts with an igmp-snooping membership don't receive IPv6 RAs and thus don't have a default route for v6; IPv4 was working fine.
Disabling igmp-snooping on asw2-d solved the issue.
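
For the record, a minimal sketch of what that disabling amounts to in configuration mode, assuming the whole protocols igmp-snooping stanza is removed rather than individual VLANs:

asw2-d-eqiad# delete protocols igmp-snooping
asw2-d-eqiad# commit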

Another issue was:

  1. inactive: interfaces interface-range vlan-analytics1-d-eqiad

Activating the range solved the unreachable-hosts issue.
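
For reference, re-enabling an inactive: statement like that is a one-liner in configuration mode, roughly:

asw2-d-eqiad# activate interfaces interface-range vlan-analytics1-d-eqiad
asw2-d-eqiad# commit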

Because of a miscommunication, the move of server uplinks from asw-d to asw2-d didn't happen today.

We are rescheduling it for 16:00 UTC tomorrow (the 27th) and expect it to last 1h max.
Racks D7 and D8 are all set and will not be impacted.

EDIT: the plan is to start with rack D1, lowest switch port, and work our way up the port numbers, one server at a time. Then move on to D3, D4, D5 (D6 is empty).

For reference, here are the racks' contents:
D1: https://racktables.wikimedia.org/index.php?page=rack&rack_id=2103
D3: https://racktables.wikimedia.org/index.php?page=rack&rack_id=2105
D4: https://racktables.wikimedia.org/index.php?page=rack&rack_id=2106
D5: https://racktables.wikimedia.org/index.php?page=rack&rack_id=2107

Please let me know if there is any issue with this new time.

Mentioned in SAL (#wikimedia-operations) [2017-04-27T16:32:16Z] <gehel> unbanning elasticsearch servers in eqiad row D - elastic10(17|18|19|20) - T148506

Mentioned in SAL (#wikimedia-operations) [2017-04-27T16:53:06Z] <gehel> unbanning all elasticsearch servers in eqiad row D - T148506

All servers have been moved; confirmed no more interfaces are up on asw and no traffic (other than multicast) remains on the asw-asw2 link.
Monitoring is good.
Disabled ae0 on asw2 (towards asw).

Will let it sit for a few days before unracking/powering off asw.

No more scheduled downtime? Can T162681 be closed?

Great! I'll start undoing some of the preparatory work, that is:

  • repool puppetmaster1002
  • switchover oresrdb.svc.eqiad.wmnet from oresrdb1002 => oresrdb1001

Change 351167 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/puppet@production] Remove asw-d-eqiad from monitoring - T148506

https://gerrit.wikimedia.org/r/351167

Change 351167 merged by Ayounsi:
[operations/puppet@production] Remove asw-d-eqiad from monitoring - T148506

https://gerrit.wikimedia.org/r/351167

Mentioned in SAL (#wikimedia-operations) [2017-05-01T18:19:06Z] <mutante> manually removed asw-d-eqiad remnants from /etc/icinga/puppet_hosts.cfg to fix icinga config after gerrit:351167 / T148506. fixes Icinga config error. then puppet adds it back

@Cmjohnson you're free to decommission/unrack asw-d-eqiad.

RobH closed subtask Unknown Object (Task) as Resolved. Jun 12 2017, 7:54 PM
RobH closed subtask Unknown Object (Task) as Resolved. Jun 12 2017, 8:02 PM
RobH closed subtask Unknown Object (Task) as Resolved.

Resolving this task; I created a subtask for the decom portion.