cr3-eqsin to production
Closed, Resolved · Public

Description

  • Schedule maintenance, e.g. 16:00 UTC (low-traffic time, but can be done during normal hours too), on Monday 22nd or Tuesday 23rd.

  • Depool eqsin
  • Downtime cr1-eqsin
  • Copy cr1-eqsin IX to cr3-eqsin
  • Move VRRP master to cr2
  • Power cr1-eqsin down
  • If needed unrack cr1-eqsin
  • Connect core links:
    • Move cable 1115 from cr1-eqsin:xe-1/0/0 to cr3-eqsin:xe-0/1/5 - link should come up
    • Move cable 1116 from cr1-eqsin:xe-2/0/0 to cr3-eqsin:et-0/0/0
    • Move cable 1116 from cr2-eqsin:xe-0/1/6 to cr2-eqsin:et-0/0/0 - link should stay down until cr2 is re-configured
    • Move cable 1121 from cr1-eqsin:xe-0/1/0 to cr3-eqsin:xe-0/1/0 - link should come up
    • Move cable 1076 from cr1-eqsin:xe-2/0/3 to cr3-eqsin:et-0/0/1
    • Move cable 1076 from asw-0603-eqsin:xe-0/0/21 to asw-0603-eqsin:et-??? - link should stay down until asw1-eqsin is re-configured
    • Remove cable 1084 between cr1-eqsin:xe-1/1/0 and asw-0604-eqsin:xe-1/0/21
  • Reconfigure cr2-cr3 links on the cr2 side
  • Reconfigure asw1-eqsin for new et- interface
  • Check for connectivity with switch vlans
  • Re-configure cr2 with cr3-eqsin neighbor IP (BGP confed)
  • Check that cr3 is reachable on its loopback IP
  • Check that we're using the Transport link
  • Add cr3-eqsin to monitoring (and remove cr1) (Puppet)
  • Re-configure servers BGP sessions with new cr3 IP (anycast/LVS) (Puppet)
  • Re-configure all routers with cr3-eqsin public IP (BGP confed)
  • Connect transit/peering links
    • Move cable 1120 from cr1-eqsin:xe-0/0/0 to cr3-eqsin:xe-0/1/1 - link should come up
    • Move cable 1016 from cr1-eqsin:xe-2/0/1 to cr3-eqsin:xe-0/1/3 - link should come up
    • Move cable ??? from cr1-eqsin:xe-2/0/2 to cr3-eqsin:xe-0/1/4 - link should come up
  • Check for properly established BGP
  • Check that there are no discrepancies in Homer (CR)
  • Check that all ports are populated with optics
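The cable moves above can be sanity-checked before the window. What follows is a minimal sketch, assuming the moves are transcribed by hand into a list; the `CableMove` type and the `find_port_conflicts` helper are hypothetical names made up for illustration and are not part of Homer or Netbox. It flags any destination port that more than one cable is assigned to. Moves whose destination port is still unknown (`et-???`) are left out rather than guessed, and the unknown cable ID is kept as `???`.

```python
from collections import defaultdict
from typing import NamedTuple


class CableMove(NamedTuple):
    cable: str  # cable label as written in the task ("???" where unknown)
    src: str    # port the cable is unplugged from
    dst: str    # port the cable is plugged into


# Transcribed from the cr3-eqsin steps in the description.
MOVES = [
    CableMove("1115", "cr1-eqsin:xe-1/0/0", "cr3-eqsin:xe-0/1/5"),
    CableMove("1116", "cr1-eqsin:xe-2/0/0", "cr3-eqsin:et-0/0/0"),
    CableMove("1116", "cr2-eqsin:xe-0/1/6", "cr2-eqsin:et-0/0/0"),
    CableMove("1121", "cr1-eqsin:xe-0/1/0", "cr3-eqsin:xe-0/1/0"),
    CableMove("1076", "cr1-eqsin:xe-2/0/3", "cr3-eqsin:et-0/0/1"),
    CableMove("1120", "cr1-eqsin:xe-0/0/0", "cr3-eqsin:xe-0/1/1"),
    CableMove("1016", "cr1-eqsin:xe-2/0/1", "cr3-eqsin:xe-0/1/3"),
    CableMove("???", "cr1-eqsin:xe-2/0/2", "cr3-eqsin:xe-0/1/4"),
]


def find_port_conflicts(moves):
    """Return {port: [cables]} for destination ports claimed by more than one cable."""
    by_dst = defaultdict(list)
    for move in moves:
        by_dst[move.dst].append(move.cable)
    return {port: cables for port, cables in by_dst.items() if len(cables) > 1}


conflicts = find_port_conflicts(MOVES)
print(conflicts or "no destination port is used twice")
```

The same list could be extended with the cr2-eqsin recabling steps to check both phases together.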

DONE for CR3

  • Downtime cr2-eqsin
  • Fail VRRP master to cr3-eqsin
  • Recable cr2-eqsin<->asw
  • Move cable 1118 from cr1-eqsin:xe-0/1/4 to cr3-eqsin:et-0/0/1
  • Move cable 1118 from asw-0604-eqsin:xe-1/0/20 to asw-0604-eqsin:et-??? - link should stay down until re-configured
  • Remove cable 1117 between cr2-eqsin:xe-0/1/3 and asw-0603-eqsin:xe-0/0/20
  • Reconfigure cr2-eqsin<->asw (both sides)
  • Check for connectivity

Done for onsite work

  • Upgrade cr2-eqsin
  • Check for all green
  • Repool site
  • Update DNS with new neighbor hostnames (and remove cr1) (DNS)
  • Update LibreNMS (remove cr1, bills, etc)
  • Update Netbox (cables)

Event Timeline

ayounsi created this task.
Restricted Application added a subscriber: Aklapper.

Change 606419 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/homer/public@master] Replace cr1-eqsin with cr3-eqsin

https://gerrit.wikimedia.org/r/606419

Change 606423 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/dns@master] Remove cr1-eqsin loopback & rename relevant links (cr1->cr3)

https://gerrit.wikimedia.org/r/606423

Change 606425 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/puppet@production] cr1-eqsin -> cr3-eqsin

https://gerrit.wikimedia.org/r/606425

It isn't clear to me if this is for Jin (DreamICC) or for Equinix remote hands. If it is Jin, and I'll be supervising them, I'd prefer we never do work in eqsin on Monday, as Monday AM there is Sunday PM for me, and I'd rather not lose my weekend to this (having to be at the keyboard at 6 PM on Sunday ruins the day for doing anything fun).

If we are having Equinix remote hands do the work, the same timeframe applies if you would like me to stay late and monitor the work as it progresses. If @ayounsi will be coordinating instead, then I have no preference.

  • Will I be coordinating this? (If so, Tuesday-Thursday is best.)
  • Do we want Jin to do this (they likely need more notice than this) or Equinix remote hands? I prefer Jin as they follow directions better.

It's a tradeoff between DC traffic, work quality and working hours for both remote hands and me.

  • eqsin peak traffic is between 5am UTC and 3pm UTC
  • My only requirement is to start it no later than 4pm UTC, as a 2h maintenance means finishing at 8pm my time
  • 4pm UTC means midnight Singapore time, which is terribly late for Jin, but might be ok for Equinix if they have 24/7 staff onsite
  • I think we all agree that Jin would do better than Equinix remote hands (especially as it's not a single task we need to do but several, with the small risk of having to roll back if something goes wrong)
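The timing constraints above can be checked with a few lines of Python. This is only a sketch: the calendar date is an arbitrary illustration, the `in_peak` helper is a hypothetical name, and Singapore is modelled as a fixed UTC+8 offset (it has no DST). The peak window and the 2h duration are taken from the points above.

```python
from datetime import datetime, time, timedelta, timezone

SGT = timezone(timedelta(hours=8), "SGT")  # Singapore: UTC+8, no DST

PEAK_START_UTC = time(5, 0)   # "5am UTC"
PEAK_END_UTC = time(15, 0)    # "3pm UTC"


def in_peak(moment_utc: datetime) -> bool:
    """True if a UTC datetime falls inside the eqsin peak-traffic window."""
    return PEAK_START_UTC <= moment_utc.time() < PEAK_END_UTC


# Illustrative date only; what matters is the 16:00 UTC start.
start = datetime(2020, 7, 1, 16, 0, tzinfo=timezone.utc)
end = start + timedelta(hours=2)  # the "2h maintenance"

print("start in Singapore:", start.astimezone(SGT).strftime("%Y-%m-%d %H:%M %Z"))
print("end in Singapore:  ", end.astimezone(SGT).strftime("%Y-%m-%d %H:%M %Z"))
print("overlaps peak:", in_peak(start) or in_peak(end))
```

A 16:00 UTC start lands at midnight in Singapore and stays clear of the peak window, which is the tradeoff described above.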

@RobH: I'm fine coordinating it if outside of your working hours.

I've created a google doc, since Jin doesn't use phabricator, outlining all the steps above:

https://docs.google.com/document/d/1s2_ALpvDT9xTGihYE8BIXo41dSR9T11sZNUFs1mMrFM/edit?usp=sharing

I'll be emailing Jin shortly and cc'ing willy and arzhel on the email.

RobH mentioned this in Unknown Object (Task). Jun 18 2020, 5:22 PM

Change 608882 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/dns@master] Depool eqsin for cr3-eqsin setup

https://gerrit.wikimedia.org/r/c/operations/dns/+/608882

Change 608882 merged by Ayounsi:
[operations/dns@master] Depool eqsin for cr3-eqsin setup

https://gerrit.wikimedia.org/r/c/operations/dns/+/608882

Mentioned in SAL (#wikimedia-operations) [2020-07-01T15:00:35Z] <XioNoX> depool eqsin for routers work - T255766

Mentioned in SAL (#wikimedia-operations) [2020-07-01T15:03:09Z] <XioNoX> move vrrp master to cr2-eqsin - T255766

Mentioned in SAL (#wikimedia-operations) [2020-07-01T15:09:48Z] <XioNoX> bump eqsin-codfw ospf link cost - T255766

Mentioned in SAL (#wikimedia-operations) [2020-07-01T15:13:02Z] <XioNoX> disable cr1-eqsin transit/peering BGP - T255766

Mentioned in SAL (#wikimedia-operations) [2020-07-01T15:15:23Z] <XioNoX> disable BGP to pybal on cr1-eqsin - T255766

Mentioned in SAL (#wikimedia-operations) [2020-07-01T15:16:58Z] <XioNoX> re0.cr1-eqsin> request system power-off both-routing-engines - T255766

Change 606425 merged by Ayounsi:
[operations/puppet@production] cr1-eqsin -> cr3-eqsin

https://gerrit.wikimedia.org/r/c/operations/puppet/+/606425

Mentioned in SAL (#wikimedia-operations) [2020-07-01T16:00:13Z] <XioNoX> updating eqsin LVS BGP neighbors IPs - T255766

Change 606423 merged by Ayounsi:
[operations/dns@master] Remove cr1-eqsin loopback & rename relevant links (cr1->cr3)

https://gerrit.wikimedia.org/r/c/operations/dns/+/606423

Replacement went smoothly! Last step is to update Netbox.

Change 606419 merged by Ayounsi:
[operations/homer/public@master] Replace cr1-eqsin with cr3-eqsin

https://gerrit.wikimedia.org/r/c/operations/homer/public/+/606419

ayounsi claimed this task.

Netbox updated.