Upgrade core routers to Junos 21+
Closed, Resolved, Public

Description

The Junos 21 branch is now among the Junos recommended versions for MX routers; see https://kb.juniper.net/InfoCenter/index?page=content&id=KB21476&smlogin=true

Upgrading our MXs has several advantages:

  • Testing router redundancy in a planned window
  • Keeping a tight Junos version spread (we currently have 12, 14, 15, 17, 18, 20)
  • Leveraging feature improvements (e.g. DNS in mgmt-junos, see T269340)
  • Fixing low-risk security issues

The process is documented at https://wikitech.wikimedia.org/wiki/Juniper_router_upgrade

Device                  Scheduled for                Status
cr3-ulsfo               2022-09-06 - 08:00 UTC - 2h  T295690#8213789
cr4-ulsfo               2022-09-06 - 08:00 UTC - 2h  T295690#8213789
cr3-eqsin               2022-09-07 - 08:00 UTC - 2h  T295690#8217086
cr2-eqsin               2022-09-07 - 08:00 UTC - 2h  T295690#8217086
cr3-knams               2022-09-08 - 08:00 UTC - 2h  T295690#8220460
cr2-esams               2022-09-08 - 08:00 UTC - 2h  T295690#8220460
cr3-esams               2022-09-08 - 08:00 UTC - 2h  T295690#8220460
cr3-esams (attempt #2)  2022-09-12 - 08:00 UTC - 3h  T295690#8229042
cr1-codfw               2022-09-13 - 08:00 UTC - 3h  T295690#8232607
cr2-codfw               2022-09-13 - 08:00 UTC - 3h  T295690#8232607
cr2-eqdfw               2022-09-14 - 08:00 UTC - 1h  T295690#8236185
cr1-eqiad               2022-09-29 - 08:00 UTC - 3h  T295690#8272177
cr2-eqiad               2022-09-29 - 08:00 UTC - 3h  T295690#8272177
cr2-eqord               2022-09-29 - 08:00 UTC - 3h  T295690#8272177
  • Then remove unnecessary images from apt1001:/srv/junos/

Event Timeline

Change 830085 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/dns@master] Depool ulsfo for routers ugprades

https://gerrit.wikimedia.org/r/830085

Change 830085 merged by Ayounsi:

[operations/dns@master] Depool ulsfo for routers ugprades

https://gerrit.wikimedia.org/r/830085

Mentioned in SAL (#wikimedia-operations) [2022-09-06T07:52:36Z] <XioNoX> depool ulsfo for routers upgrade - T295690

Icinga downtime and Alertmanager silence (ID=7eb8120c-f8b6-4c79-8deb-b18a305a2353) set by ayounsi@cumin1001 for 2:00:00 on 1 host(s) and their services with reason: router upgrade

cr3-ulsfo.wikimedia.org

Mentioned in SAL (#wikimedia-operations) [2022-09-06T08:42:10Z] <XioNoX> restart cr3-ulsfo for software upgrade - T295690

Mentioned in SAL (#wikimedia-operations) [2022-09-06T10:26:38Z] <XioNoX> put cr3-ulsfo back in service - T295690

Mentioned in SAL (#wikimedia-operations) [2022-09-06T10:42:34Z] <XioNoX> drain traffic from cr4-ulsfo - T295690

Mentioned in SAL (#wikimedia-operations) [2022-09-06T11:06:31Z] <XioNoX> restart cr4-ulsfo for software upgrade - T295690

Mentioned in SAL (#wikimedia-operations) [2022-09-06T11:17:58Z] <XioNoX> put cr4-ulsfo back in service - T295690

This has been quite eventful.

Keep in mind that these upgrades need the no-validate knob; more details are in the doc.

We first went for Junos 21.4R2-Sx, which is the most recent Junos recommended version, but there is an annoying bug preventing the FPC from coming online.
We then downgraded to Junos 21.2R2-Sx. There the FPC worked fine, but there are 2 issues:

  • VRRP adjacencies to (at least) older Junos versions don't establish when MD5 keys are used, causing a split brain (both routers are master). Solved by removing the MD5 key, as it's only present on v4 and doesn't bring much benefit
  • On cr3-ulsfo some BFD sessions are still down (but the BGP sessions above them are fine)
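
To make the split-brain condition concrete: a VRRP group is healthy when exactly one router considers itself master. The Python sketch below applies that check to a set of per-router states. The function name `vrrp_split_brain`, the input shape and the state strings are illustrative assumptions, not the output of any Junos command or of existing tooling.

```python
def vrrp_split_brain(states: dict[str, str]) -> bool:
    """Return True when more than one router in a VRRP group claims mastership.

    `states` maps a router name to the VRRP state it reports for one group,
    e.g. {"cr3-ulsfo": "master", "cr4-ulsfo": "backup"} (hypothetical format).
    """
    masters = [router for router, state in states.items() if state.lower() == "master"]
    return len(masters) > 1
```

With the failure described above, both routers would map to "master" and the function would return True; after removing the MD5 key one of them should fall back to "backup".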

Change 830156 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/homer/public@master] Add variable to disable VRRP auth

https://gerrit.wikimedia.org/r/830156

Change 830156 merged by Ayounsi:

[operations/homer/public@master] Add variable to disable VRRP auth

https://gerrit.wikimedia.org/r/830156

Change 830499 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/dns@master] Depool eqsin for core router upgrades.

https://gerrit.wikimedia.org/r/830499

Change 830499 merged by Cathal Mooney:

[operations/dns@master] Depool eqsin for core router upgrades.

https://gerrit.wikimedia.org/r/830499

Mentioned in SAL (#wikimedia-operations) [2022-09-07T07:46:16Z] <topranks> Depool eqsin from user traffic in advance of core router upgrades - T295690

Change 830563 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Disbale VRRP auth in eqsin

https://gerrit.wikimedia.org/r/830563

Change 830563 merged by jenkins-bot:

[operations/homer/public@master] Disbale VRRP auth in eqsin

https://gerrit.wikimedia.org/r/830563

Icinga downtime and Alertmanager silence (ID=7af287ca-21ab-4f9d-adb3-478641fdd465) set by cmooney@cumin1001 for 2:00:00 on 1 host(s) and their services with reason: router upgrade

cr2-eqsin

Icinga downtime and Alertmanager silence (ID=826e80d5-55a6-4bb6-ab1c-e094eba7f6cd) set by cmooney@cumin1001 for 1:00:00 on 1 host(s) and their services with reason: router upgrade

cr3-eqsin

Change 830575 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/dns@master] Revert "Depool eqsin for core router upgrades."

https://gerrit.wikimedia.org/r/830575

Change 830575 merged by Cathal Mooney:

[operations/dns@master] Revert "Depool eqsin for core router upgrades."

https://gerrit.wikimedia.org/r/830575

Mentioned in SAL (#wikimedia-operations) [2022-09-07T09:48:09Z] <topranks> Re-pooling eqsin for user traffic after successful core router upgrades - T295690

Upgrade completed ok for cr2-eqsin and cr3-eqsin.

Went straight to 21.2R3-S2.9 based on experience in ulsfo, all went ok. Used no-validate when adding image to device.

No issues encountered, BFD adjacencies formed ok after reboot on both.

Change 830729 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/dns@master] Depool esams for routers upgrades

https://gerrit.wikimedia.org/r/830729

Change 830730 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/homer/public@master] Disable VRRP auth for esams

https://gerrit.wikimedia.org/r/830730

Change 830729 merged by Ayounsi:

[operations/dns@master] Depool esams for routers upgrades

https://gerrit.wikimedia.org/r/830729

Mentioned in SAL (#wikimedia-operations) [2022-09-08T07:41:06Z] <XioNoX> depool esams for routers upgrade - T295690

Mentioned in SAL (#wikimedia-operations) [2022-09-08T08:07:28Z] <XioNoX> drain draffic from cr3-esams - T295690

Icinga downtime and Alertmanager silence (ID=3b336fa4-f522-4b10-abdb-d6be83f6a04a) set by ayounsi@cumin2002 for 2:00:00 on 3 host(s) and their services with reason: router upgrade

cr3-esams,cr3-esams IPv6,re0.cr3-esams.mgmt

Mentioned in SAL (#wikimedia-operations) [2022-09-08T08:44:27Z] <XioNoX> reverting cr3-esams changes (JTAC will be needed for a firmware upgrade), and moving on to cr2-esams - T295690

Icinga downtime and Alertmanager silence (ID=e0d9eb2b-5520-4f80-912e-3627c94e9982) set by ayounsi@cumin2002 for 2:00:00 on 3 host(s) and their services with reason: router upgrade

cr2-esams,cr2-esams IPv6,re0.cr2-esams.mgmt

Icinga downtime and Alertmanager silence (ID=ff1db65d-a6ee-4e20-ae07-837bbe264b2f) set by ayounsi@cumin2002 for 2:00:00 on 2 host(s) and their services with reason: router upgrade

cr3-knams,cr3-knams IPv6

Mentioned in SAL (#wikimedia-operations) [2022-09-08T09:35:48Z] <XioNoX> drain draffic from cr3-knams - T295690

Change 830730 merged by Ayounsi:

[operations/homer/public@master] Disable VRRP auth for esams

https://gerrit.wikimedia.org/r/830730

cr2-esams and cr3-knams got upgraded as expected.
cr3-esams failed as it requires a firmware upgrade, and only JTAC can provide us with the firmware. We will follow up with them.

Also, "request system storage cleanup re1" has been added to the doc for multi-RE devices.

Mentioned in SAL (#wikimedia-operations) [2022-09-08T10:07:42Z] <XioNoX> re-pool esams after routers upgrade - T295690

The firmware provided by Juniper seems to be accepted by cr3-esams:

cmooney@re0.cr3-esams> show system firmware | match "^Part|version|i40"
Part             Type           Tag Current   Available Status
                                    version   version
Routing Engine 0 RE i40e-NVM    7   4.26      6.01      OK
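
As a rough illustration of how the output above could be checked from a script, the Python sketch below lists components whose current version differs from the available one. The function name `pending_firmware` is hypothetical, and the parsing rule (a data row ends with tag, current version, available version, status) is an assumption inferred only from the single row pasted here, not from any documented output format.

```python
SAMPLE = """\
Part             Type           Tag Current   Available Status
                                    version   version
Routing Engine 0 RE i40e-NVM    7   4.26      6.01      OK
"""


def pending_firmware(output: str) -> list[tuple[str, str, str]]:
    """Return (component, current, available) for rows where the versions differ."""
    pending = []
    for line in output.splitlines():
        fields = line.split()
        # Assumed layout: a data row ends with tag, current, available, status.
        # Header lines and the CLI prompt line have no numeric tag, so skip them.
        if len(fields) < 5 or not fields[-4].isdigit():
            continue
        _tag, current, available, _status = fields[-4:]
        component = " ".join(fields[:-4])
        if current != available:
            pending.append((component, current, available))
    return pending


if __name__ == "__main__":
    for component, current, available in pending_firmware(SAMPLE):
        print(f"{component}: {current} -> {available}")
```

Rows with an empty column (for example no available version) would shift the fields and are not handled by this sketch.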

I've scheduled this for 08:00 UTC on Monday to try again.

Change 831479 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/dns@master] Depool esams for cr3-esams core router upgrade.

https://gerrit.wikimedia.org/r/831479

Change 831479 merged by Cathal Mooney:

[operations/dns@master] Depool esams for cr3-esams core router upgrade.

https://gerrit.wikimedia.org/r/831479

Mentioned in SAL (#wikimedia-operations) [2022-09-12T08:00:38Z] <topranks> de-pooliong esams in advance of upgrade to core router cr3-esams T295690

Icinga downtime and Alertmanager silence (ID=1e573369-5fdd-4621-8ae7-786b5a67de04) set by cmooney@cumin1001 for 2:00:00 on 1 host(s) and their services with reason: router upgrade

cr3-esams

Icinga downtime and Alertmanager silence (ID=57f0ae1d-0fa1-4b98-9454-bea638ac3971) set by cmooney@cumin1001 for 2:00:00 on 3 host(s) and their services with reason: router upgrade

cr3-esams,cr3-esams IPv6,re0.cr3-esams.mgmt

Icinga downtime and Alertmanager silence (ID=39465e0b-b93d-45ba-b1d8-0c49dacc39fb) set by cmooney@cumin1001 for 2:00:00 on 3 host(s) and their services with reason: router upgrade

cr3-esams,cr3-esams IPv6,re0.cr3-esams.mgmt

Change 831506 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/dns@master] Repool esams after cr3-esams core router upgrade.

https://gerrit.wikimedia.org/r/831506

Change 831506 merged by Cathal Mooney:

[operations/dns@master] Repool esams after cr3-esams core router upgrade.

https://gerrit.wikimedia.org/r/831506

Mentioned in SAL (#wikimedia-operations) [2022-09-12T10:55:10Z] <topranks> re-pooliong esams after successful upgrade of core router cr3-esams T295690

Upgrade of cr3-esams went well earlier. Firmware upgrade works as per docs. I will put up more info on that later for our own reference.

cmooney updated the task description.

Change 831800 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/dns@master] Depool codfw prior to core router upgrades.

https://gerrit.wikimedia.org/r/831800

Change 831800 merged by Cathal Mooney:

[operations/dns@master] Depool codfw prior to core router upgrades.

https://gerrit.wikimedia.org/r/831800

Icinga downtime and Alertmanager silence (ID=927fadc1-f5b2-478f-95ce-98bfc47881a9) set by cmooney@cumin1001 for 2:00:00 on 3 host(s) and their services with reason: router upgrade

cr1-codfw,cr1-codfw IPv6,re0.cr1-codfw.mgmt

Change 831840 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Disable VRRP auth between CRs in codfw

https://gerrit.wikimedia.org/r/831840

Change 831840 merged by jenkins-bot:

[operations/homer/public@master] Disable VRRP auth between CRs in codfw

https://gerrit.wikimedia.org/r/831840

Change 831889 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/dns@master] Re-pool codfw after upgrading core routers on site

https://gerrit.wikimedia.org/r/831889

Change 831889 merged by Cathal Mooney:

[operations/dns@master] Re-pool codfw after upgrading core routers on site

https://gerrit.wikimedia.org/r/831889

cr1-codfw and cr2-codfw were successfully upgraded today. It took a while with the firmware upgrades too; I've added some notes here on what is probably the best way to approach upgrading both the firmware and Junos with the fewest switchovers.

Given the time I'll re-schedule cr2-eqdfw for tomorrow morning (Wed 14 Sept) instead.

cr2-eqdfw upgrade completed successfully today.

Icinga downtime and Alertmanager silence (ID=1ea26f52-695b-41ae-a3b4-28808d44161a) set by ayounsi@cumin1001 for 4:00:00 on 3 host(s) and their services with reason: router upgrade

cr1-eqiad,cr1-eqiad IPv6,re0.cr1-eqiad.mgmt

Mentioned in SAL (#wikimedia-operations) [2022-09-29T07:57:58Z] <XioNoX> drain traffic away from cr1-eqiad - T295690

Mentioned in SAL (#wikimedia-operations) [2022-09-29T08:15:51Z] <XioNoX> first cr1-eqiad RE switchover (for NVM firmware) - T295690

Mentioned in SAL (#wikimedia-operations) [2022-09-29T08:43:35Z] <XioNoX> second cr1-eqiad RE switchover - T295690

Change 836727 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/homer/public@master] Fully remove VRRP auth

https://gerrit.wikimedia.org/r/836727

Change 836727 merged by jenkins-bot:

[operations/homer/public@master] Fully remove VRRP auth

https://gerrit.wikimedia.org/r/836727

Icinga downtime and Alertmanager silence (ID=acbab0ff-4998-42b3-b0ad-a6be933dfff6) set by ayounsi@cumin1001 for 4:00:00 on 3 host(s) and their services with reason: router upgrade

cr2-eqiad,cr2-eqiad IPv6,re0.cr2-eqiad.mgmt

Mentioned in SAL (#wikimedia-operations) [2022-09-29T09:42:38Z] <XioNoX> first cr2-eqiad RE switchover - T295690

Mentioned in SAL (#wikimedia-operations) [2022-09-29T10:07:44Z] <XioNoX> second (and longest) cr2-eqiad RE switchover - T295690

Icinga downtime and Alertmanager silence (ID=01f0d013-5101-4278-93a6-1ea49f9dea28) set by ayounsi@cumin1001 for 1:00:00 on 2 host(s) and their services with reason: router upgrade

cr2-eqord,cr2-eqord IPv6

Mentioned in SAL (#wikimedia-operations) [2022-09-29T11:06:21Z] <XioNoX> restart cr2-eqord for upgrade - T295690

eqiad and eqord went extremely well.

Thanks @cmooney for the firmware instructions

ayounsi updated the task description.

I went through the useful https://apps.juniper.net/feature-explorer/select-software.html?typ=1&swName=Junos%20OS&rel=21.2R3&sid=1211&platform=MX204&pid=11310204
all the way down to "Features Introduced in Release - Junos OS 18.1R1".

Some interesting ones: