
(Need By: TBD) rack/setup/install (2) new 10G switches
Closed, Resolved (Public)

Description

This task will track the racking, setup, and OS installation of (2) new 10G switches, purchased via T271338.

Hostname / Racking / Installation Details

Rob isn't aware of the installation details for these switches; escalated to Willy.

Event Timeline

RobH added a parent task: Unknown Object (Task).

Hi @ayounsi - when we budgeted these last year, I think it was for general expansion of 10G switches. Do you have any specific racks you want to put these in (like WMCS or something)? If not, maybe we can add them to the new racks we'll be purchasing in FY21-22? Thanks, Willy

The original plan was to retrofit two existing 1G racks to 10G as a quick fix for the existing contention. The downside is that migrating a rack from 1G to 10G means downtime for that rack while we replace the ToR switch.

Regardless of the expansion, we should increase the 10G capacity of the current rows (so services can still have row diversity).
The next step is to figure out which two racks are the best match, probably depending on usage. Is that something you could advise on?

Hi @ayounsi - let me check with Chris and John to confirm, but I'm thinking we should target the racks with the most rack space (so we can phase out any leftover 1G servers). I'll get you more specific racks by the end of the week. Thanks, Willy

Hi @ayounsi - just to follow up on this, we should probably wait a bit longer before deciding which racks to convert to 10G (until John and Chris have wrapped up all the current hw installs, in ~1 month). Ideally, we'd want to install these switches in racks where there aren't too many servers, and we'll have a better idea which racks those will be once the Q3 installs are out of the way. Timing-wise, it might actually line up with when the switches arrive. Thanks, Willy

After chatting with @wiki_willy, the best use for those switches is to go in racks C8 and D5 as cloudsw2 switches, to add capacity to the existing cloudsw switches.

Exact cabling TBD, but most likely redundant 40G links to the cloudsw1 in the same rack.

WMCS network-L1(1).png (456×761 px, 59 KB)

See the two cloudsw2 on the right of the diagram.

If you have any spare 40G DACs feel free to use them, length at DCops discretion.
Otherwise let me know if I should open a procurement task for 5 of those DACs (4 + 1 spare)
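
For reference, the DAC count quoted above can be reproduced with a short sketch. It assumes that "redundant" means two 40G links per new switch, which is not yet confirmed at this point in the task since the exact cabling is TBD. The function name is purely illustrative.

```python
# Assumption: each new cloudsw2 gets two redundant 40G links to the
# cloudsw1 in its own rack (the exact cabling is still TBD in this task).
NEW_SWITCH_RACKS = ["C8", "D5"]
LINKS_PER_SWITCH = 2  # assumed: "redundant" taken to mean two links
SPARES = 1


def dacs_needed(racks, links_per_switch, spares):
    """Return the number of 40G DACs to procure, spares included."""
    return len(racks) * links_per_switch + spares


print(dacs_needed(NEW_SWITCH_RACKS, LINKS_PER_SWITCH, SPARES))
```

Under that assumption the result matches the "4 + 1 spare" figure in the comment above.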

@Cmjohnson @wiki_willy would it be possible to prioritize this (or at least 1 of the 2) so it is done within the next 2 weeks?
We would like to test a fix for T284592 before rolling it to eqiad rows during the DC switchover, starting on June 28th (for a few weeks)

@ayounsi I should be able to rack them on Friday but not cable them. We are out the week of 5 July, and as of now I am on vacation the following week. Maybe @Jclark-ctr can do the cabling if you need it before 13 July.

Thanks @Cmjohnson, it'd be great if we can get the ball rolling.

It'd help a lot to get them online early in the week starting July 13th. If @Jclark-ctr can help there, it would be appreciated :)

We want to test the impact of changing the buffer partition on these when they first go in, before doing any prod. switches. The haste is because we need to make that change in eqiad prod. rows before the DC switchover is rolled back.

RobH mentioned this in Unknown Object (Task).Jul 12 2021, 4:40 PM
RobH mentioned this in Unknown Object (Task).Jul 13 2021, 4:32 PM
RobH added a subtask: Unknown Object (Task).Jul 13 2021, 4:38 PM
wiki_willy added a subscriber: Cmjohnson.

It looks like Chris is going to be out for a while. @Jclark-ctr - can you prioritize this one when you're back next week? Rob has ordered the DAC cables, so they should be arriving soon. Thanks, Willy

Basic plan for bringing this online should be:

  1. Add device to Netbox as cloudsw2-c8-eqiad, I guess below existing cloudsw1?
  2. Allocate SCS and MGMT SW ports in Netbox
  3. Allocate MGMT IP in Netbox
  4. Add Netbox connections on new switch ports et-0/0/48 and et-0/0/49, to existing cloudsw1-c8-eqiad et-0/0/50 and et-0/0/51.
    1. (See ayounsi's diagram above) [should this be DC-ops or Netops?]
  5. Rack device on site
  6. Connect SCS port to OpenGear
  7. Netops to verify console connectivity, basic health checks
  8. Cable MGMT port
  9. Netops to configure management port via serial and validate IP connectivity.
  10. Cable QSFP interfaces mentioned in step 4.
  11. Netops will apply some temporary config to test the effect of changing "class-of-service buffer" settings while traffic is running.
  12. Netops to remove temporary config and leave all but MGMT port shut down.
  13. Netops to revisit task and reconfigure as required for WMCS usage.
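
The cabling in step 4 can be written down as data and sanity-checked before anything is entered in Netbox. This is only a sketch: the pairing of et-0/0/48 with et-0/0/50 and et-0/0/49 with et-0/0/51 is an assumption (the plan lists the ports on each side but not which connects to which), and the helper names are hypothetical.

```python
from collections import Counter

# Planned QSFP links from step 4. The 48->50 / 49->51 pairing is an
# assumption; the plan only names the two ports on each side.
PLANNED_LINKS = [
    (("cloudsw2-c8-eqiad", "et-0/0/48"), ("cloudsw1-c8-eqiad", "et-0/0/50")),
    (("cloudsw2-c8-eqiad", "et-0/0/49"), ("cloudsw1-c8-eqiad", "et-0/0/51")),
]


def duplicated_endpoints(links):
    """Return (device, interface) endpoints used by more than one link."""
    counts = Counter(end for link in links for end in link)
    return sorted(end for end, n in counts.items() if n > 1)


def links_between(links, dev_a, dev_b):
    """Count links connecting dev_a and dev_b, in either direction."""
    return sum(1 for (x, _), (y, _) in links if {x, y} == {dev_a, dev_b})


print(duplicated_endpoints(PLANNED_LINKS))
print(links_between(PLANNED_LINKS, "cloudsw1-c8-eqiad", "cloudsw2-c8-eqiad"))
```

An empty list from `duplicated_endpoints` means no port is allocated twice, and `links_between` confirms the two redundant links the plan calls for.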

Switches racked and partially cabled. Waiting on the 40G DAC cables: the vendor did not ship them because they were waiting on confirmation of tax status.

https://phabricator.wikimedia.org/T286575 updated

@cmooney @ayounsi DAC cables arrived today. Finished the install with the DACs and updated Netbox with the console port connections to the SCS. Please let me know if anything else is needed.

Jclark-ctr closed subtask Unknown Object (Task) as Resolved.Jul 21 2021, 10:56 PM

Thanks @Jclark-ctr

I've set up port 39 on scs-c1-eqiad now, standard port config same as the other Juniper gear. But I get nothing back on the console when I try to connect:

# pmshell

1:       ps1-c1-eqiad       2:    ps1-c2-eqiad   3:   ps1-c3-eqiad   4:   ps1-c4-eqiad
5:       ps1-c5-eqiad       6:    ps1-c6-eqiad   7:   ps1-c7-eqiad   8:   ps1-c8-eqiad
9:       asw-c1-eqiad       10:   asw2-c2-eqiad  11:  asw2-c3-eqiad  12:  asw2-c4-eqiad
13:      asw2-c5-eqiad      14:   asw2-c6-eqiad  15:  asw2-c7-eqiad  16:  asw2-c8-eqiad
17:      ps1-d1-eqiad       18:   ps1-d2-eqiad   19:  ps1-d3-eqiad   20:  ps1-d4-eqiad
21:      ps1-d5-eqiad       22:   ps1-d6-eqiad   23:  ps1-d7-eqiad   24:  ps1-d8-eqiad
25:      pfw3a-frack        26:   pfw3b-frack    27:  fasw-c1a       28:  fasw-1b
29:      asw2-d1-eqiad      30:   asw2-d2-eqiad  31:  asw2-d3-eqiad  32:  asw2-d4-eqiad
33:      asw2-d5-eqiad      34:   asw2-d6-eqiad  35:  asw2-d7-eqiad  36:  asw2-d8-eqiad
39:      cloudsw2-c8-eqiad  41:   atlas-eqiad    48:  cr2-eqsin
                               

Connect to port > 39

<-- nothing -->

Is the switch powered up? If so, we might need to double-check the wiring from it to the OpenGear. Thanks.
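
As a side note, the port menu printed by pmshell is regular enough to parse, which makes it easy to confirm that port 39 carries the expected label. The sketch below uses two rows copied from the output above; `parse_ports` is a hypothetical helper, not part of any OpenGear tooling. A matching label only confirms the console server configuration, not that the switch is powered or that the cable is good.

```python
import re

# Two rows copied from the pmshell menu shown above.
MENU = """\
33:      asw2-d5-eqiad      34:   asw2-d6-eqiad  35:  asw2-d7-eqiad  36:  asw2-d8-eqiad
39:      cloudsw2-c8-eqiad  41:   atlas-eqiad    48:  cr2-eqsin
"""


def parse_ports(menu):
    """Map port number -> label for every 'N: label' pair in the menu text."""
    return {int(num): label for num, label in re.findall(r"(\d+):\s+(\S+)", menu)}


ports = parse_ports(MENU)
print(ports.get(39))
```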

Will follow up on Monday to double-check the ports.

Replaced the ends on the console cable; the clip did not want to lock into the switch.

Still nothing on console.

Can you also make sure mgmt is connected?

@ayounsi @cmooney Used a different crimper and tested with a cable tester; the management cable shows good now.

Mentioned in SAL (#wikimedia-operations) [2021-08-05T11:47:17Z] <XioNoX> prepare cloudsw1-c8-eqiad for cloudsw2-c8 - T277340

Thanks, I got the initial configuration done.

Left to do for C8:

  • Document cable details (mgmt + DACs) in Netbox - @Jclark-ctr
  • Upgrade Junos - Netops
  • Put specific config in Homer - Netops
  • Add to monitoring - Netops

Change 710506 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/homer/public@master] Allow mgmt to reach apt.wo

https://gerrit.wikimedia.org/r/710506

Change 710506 merged by jenkins-bot:

[operations/homer/public@master] Allow mgmt to reach apt.wo

https://gerrit.wikimedia.org/r/710506

Change 710534 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/homer/public@master] Add cloudsw2-c8-eqiad to Homer

https://gerrit.wikimedia.org/r/710534

Change 710534 merged by jenkins-bot:

[operations/homer/public@master] Add cloudsw2-c8-eqiad to Homer

https://gerrit.wikimedia.org/r/710534

Change 710575 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/puppet@production] Add cloudsw2-c8-eqiad to monitoring

https://gerrit.wikimedia.org/r/710575

Change 710575 merged by Ayounsi:

[operations/puppet@production] Add cloudsw2-c8-eqiad to monitoring

https://gerrit.wikimedia.org/r/710575

Last thing to do is enable the interfaces on the cloudsw1-c8 side and it will be ready to receive servers.

Mentioned in SAL (#wikimedia-operations) [2021-08-09T05:56:16Z] <XioNoX> enable cloudsw1-c8 interfaces toward cloudsw2-c8 - T277340

cloudsw2-c8 is ready to receive servers.

@Jclark-ctr please let us know when cloudsw2-d5 is ready for Netops, and @cmooney will take care of configuring it.

Change 712365 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Add cloudsw2-d5-eqiad to Homer

https://gerrit.wikimedia.org/r/712365

Change 712365 merged by jenkins-bot:

[operations/homer/public@master] Add cloudsw2-d5-eqiad to Homer

https://gerrit.wikimedia.org/r/712365

Change 712930 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/puppet@production] Add cloudsw2-d5-eqiad to monitoring

https://gerrit.wikimedia.org/r/712930

Change 712930 merged by Cathal Mooney:

[operations/puppet@production] Add cloudsw2-d5-eqiad to monitoring

https://gerrit.wikimedia.org/r/712930

Mentioned in SAL (#wikimedia-operations) [2021-08-13T11:36:10Z] <topranks> cloudsw1-d5-eqiad - configuring new 2x40G trunk to cloudsw2-d5-eqiad with homer (T277340)

cloudsw2-d5-eqiad is now configured and ready for server connections.

I believe this task can now be closed?

There's one alert firing for that switch. Is that expected? https://alerts.wikimedia.org/?q=instance%3Dcloudsw2-d5-eqiad.mgmt.eqiad.wmnet

At the time of writing:

alertname: Storage over 90%
scope: global
summary: Storage over 90%
title: Alert for device cloudsw2-d5-eqiad.mgmt.eqiad.wmnet - Storage over 90%
3 hours ago
instance: cloudsw2-d5-eqiad.mgmt.eqiad.wmnet
source: librenms
team: noc
@cluster: wikimedia.org
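
The alert link above carries its filter in a URL-encoded `q` parameter. A minimal sketch, using only the Python standard library, shows how to decode it and see which label and value the dashboard is filtering on. `alert_filter` is an illustrative name, and the sketch assumes the filter is a single `label=value` pair, as it is in this URL.

```python
from urllib.parse import parse_qs, urlsplit

ALERT_URL = "https://alerts.wikimedia.org/?q=instance%3Dcloudsw2-d5-eqiad.mgmt.eqiad.wmnet"


def alert_filter(url):
    """Return the decoded 'q' filter as a (label, value) pair.

    Assumes a single 'label=value' filter, as in the URL above.
    """
    query = parse_qs(urlsplit(url).query)
    label, _, value = query["q"][0].partition("=")
    return label, value


print(alert_filter(ALERT_URL))
```

The decoded value should match the `instance` field in the alert details.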

@dcaro thanks. It's nothing to worry about; the other one (cloudsw2-c8-eqiad) is showing the same. I'll touch base with @ayounsi next week and see what the best way to deal with this is. I had a quick look in LibreNMS, but the alert thresholds seem to be globally defined, so I don't want to make any change just yet.

I bumped the threshold. All good now.