Upgrade core routers to Junos 23.4R2
Closed, Resolved (Public)

Description

23.4R2 is now the recommended JunOS version for MX204 and MX480. To keep up to date, and to address some medium-level security advisories, we should upgrade our estate to this version.

Upgrades should follow the standard process.

| Device | Scheduled for | Status |
| ----- | ----- | ----- |
| cr1-codfw | Tues May 20 2025 | Done 23.4R2.13 |
| cr2-codfw | Tues May 20 2025 | Done 23.4R2.13 |
| cr1-drmrs | Tues Mar 18 2025 | Done 23.4R2.13 |
| cr2-drmrs | Mon Mar 31 2025 | Done 23.4R2.13 |
| cr2-eqdfw | Mon Jun 17 2024 | Done |
| cr1-eqiad | Nov 14th at 10am CT | Done 23.4R2 |
| cr2-eqiad | Nov 14th at 11am CT | Done 23.4R2 |
| cr2-eqord | April 3 2025 | Done 23.4R2 |
| cr2-eqsin | October 9 2024 | Done |
| cr3-eqsin | May 13 2025 | Done |
| cr1-esams | May 14 2025 | Done |
| cr2-esams | May 14 2025 | Done |
| cr3-ulsfo | September 30 2024 | Done: 23.4R2 |
| cr4-ulsfo | April 1 2025 | Done: 23.4R2 |
| cr1-magru | January 27 2025 | Done: 23.4R2 |
| cr2-magru | January 27 2025 | Done: 23.4R2 |
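
The release is written in a few different forms on this task (`23.4R2`, `23.4R2.13`, and the image name ending in `23.4R2-S3.9.tgz`). A minimal sketch of how those strings could be split into comparable parts is below. The field names (`service`, `spin`) and the regular expression are assumptions made for illustration, not something taken from Juniper tooling or from this task.

```python
import re
from typing import NamedTuple, Optional


class JunosVersion(NamedTuple):
    major: int               # e.g. 23
    minor: int               # e.g. 4
    release_type: str        # the letter after the minor number, e.g. "R"
    release: int             # the number after that letter, e.g. 2
    service: Optional[int]   # N from an optional "-SN" suffix (assumed name)
    spin: Optional[int]      # optional trailing ".N" (assumed name)


# Assumed pattern covering only the forms seen on this task.
_VERSION = re.compile(
    r"(?P<major>\d+)\.(?P<minor>\d+)"
    r"(?P<type>[A-Z])(?P<release>\d+)"
    r"(?:-S(?P<service>\d+))?"
    r"(?:\.(?P<spin>\d+))?"
)


def _to_int(value: Optional[str]) -> Optional[int]:
    return int(value) if value is not None else None


def parse_junos_version(text: str) -> Optional[JunosVersion]:
    """Return the first Junos-style version found in text, or None."""
    match = _VERSION.search(text)
    if match is None:
        return None
    return JunosVersion(
        major=int(match["major"]),
        minor=int(match["minor"]),
        release_type=match["type"],
        release=int(match["release"]),
        service=_to_int(match["service"]),
        spin=_to_int(match["spin"]),
    )
```

A status cell that contains only "Done" yields `None`, which makes it easy to list the devices whose installed version was not recorded in the table.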

Event Timeline

Thanks @akosiaris @Joe, we can hold back on codfw for now and work on eqiad. When we switch back to eqiad we can schedule the upgrade for codfw.

Makes sense to me. Thanks for accommodating us. For what it's worth, we are moving back to eqiad on Wednesday, 19 March 2025.

codfw will be the primary during that set of dates, it should NOT be depooled.

Agreed. It should also be possible for us to do the core router upgrade without depooling the site, but it makes the steps a little trickier and adds some risk.

So overall I agree it's best to wait until we switch back to eqiad. I don't think the urgency of the router upgrades requires us to do it before then.

Upgrades should follow the standard process

The standard process docs are outdated I fear.

Depool site (optional)
(optional) if codfw, drain mw traffic: sudo cookbook sre.mediawiki.route-traffic primary

codfw will be the primary during that set of dates, it should NOT be depooled.

Wiki updated.

Makes sense to me. Thanks for accommodating us. For what it's worth, we are moving back to eqiad on Wednesday, 19 March 2025.

Is there a tracking task, or is it too soon? :) We could then add it as a sub-task.

It's indeed not an emergency so we can easily wait for the next switchover.

@Papaul I added an 8th point in the "Cleanup" section of https://wikitech.wikimedia.org/wiki/Juniper_router_upgrade#Cleanup
I noticed that some interfaces were missing from LibreNMS billing feature, and I suspect it's because they got their internal ID changed with the latest upgrades. Something to keep an eye on.
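
One way to keep an eye on this is to snapshot the interface identifiers before the upgrade and compare them afterwards. The sketch below assumes the "internal ID" is something like an SNMP ifIndex keyed by interface name. That is a guess, since the comment above only says "internal ID", and the interface names and numbers in the usage note are made up for illustration.

```python
from typing import Dict, List, NamedTuple, Tuple


class InterfaceIdDiff(NamedTuple):
    changed: Dict[str, Tuple[int, int]]  # name -> (id before, id after)
    missing: List[str]                   # present before, absent after
    added: List[str]                     # absent before, present after


def diff_interface_ids(before: Dict[str, int], after: Dict[str, int]) -> InterfaceIdDiff:
    """Compare two name -> ID snapshots taken before and after an upgrade."""
    common = before.keys() & after.keys()
    changed = {
        name: (before[name], after[name])
        for name in sorted(common)
        if before[name] != after[name]
    }
    missing = sorted(before.keys() - after.keys())
    added = sorted(after.keys() - before.keys())
    return InterfaceIdDiff(changed=changed, missing=missing, added=added)
```

With hypothetical snapshots such as `{"xe-0/1/0": 540}` before and `{"xe-0/1/0": 577}` after, `changed` would contain `{"xe-0/1/0": (540, 577)}`, which is the kind of entry that would need re-adding to billing.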

Mentioned in SAL (#wikimedia-operations) [2024-11-14T15:36:05Z] <sukhe@cumin1002> START - Cookbook sre.dns.admin DNS admin: depool site eqiad [reason: junos upgrade, T364092]

Mentioned in SAL (#wikimedia-operations) [2024-11-14T15:36:21Z] <sukhe@cumin1002> END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: depool site eqiad [reason: junos upgrade, T364092]

Mentioned in SAL (#wikimedia-operations) [2024-11-14T19:37:25Z] <sukhe@cumin1002> START - Cookbook sre.dns.admin DNS admin: pool site eqiad [reason: junos upgrade done, T364092]

Mentioned in SAL (#wikimedia-operations) [2024-11-14T19:37:29Z] <sukhe@cumin1002> END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: pool site eqiad [reason: junos upgrade done, T364092]
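
For reference, the length of a depool window can be worked out from the SAL entries themselves. This is a rough sketch that assumes the entry format shown above (a bracketed UTC timestamp and the `DNS admin: depool site` / `DNS admin: pool site` phrases); the helper names `sal_time` and `depool_window` are hypothetical.

```python
import re
from datetime import datetime, timedelta
from typing import Iterable

# Matches the bracketed UTC timestamp used in the SAL entries on this task.
SAL_TIMESTAMP = re.compile(r"\[(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})Z\]")


def sal_time(line: str) -> datetime:
    """Extract the timestamp from one SAL entry."""
    match = SAL_TIMESTAMP.search(line)
    if match is None:
        raise ValueError(f"no SAL timestamp in: {line!r}")
    return datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S")


def depool_window(lines: Iterable[str], site: str) -> timedelta:
    """Time from the first depool START to the last pool END (PASS) for a site."""
    start = None
    end = None
    for line in lines:
        if f"DNS admin: depool site {site} " in line and "START" in line:
            if start is None:
                start = sal_time(line)
        elif f"DNS admin: pool site {site} " in line and "END (PASS)" in line:
            end = sal_time(line)
    if start is None or end is None:
        raise ValueError(f"incomplete depool/pool pair for {site}")
    return end - start


EQIAD_LINES = [
    "[2024-11-14T15:36:05Z] <sukhe@cumin1002> START - Cookbook sre.dns.admin DNS admin: depool site eqiad [reason: junos upgrade, T364092]",
    "[2024-11-14T15:36:21Z] <sukhe@cumin1002> END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: depool site eqiad [reason: junos upgrade, T364092]",
    "[2024-11-14T19:37:25Z] <sukhe@cumin1002> START - Cookbook sre.dns.admin DNS admin: pool site eqiad [reason: junos upgrade done, T364092]",
    "[2024-11-14T19:37:29Z] <sukhe@cumin1002> END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: pool site eqiad [reason: junos upgrade done, T364092]",
]

print(depool_window(EQIAD_LINES, "eqiad"))  # prints 4:01:24
```

The same function can be pointed at the eqsin or esams entries further down by changing the `site` argument.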

cmooney updated the task description. (Show Details)

Mentioned in SAL (#wikimedia-operations) [2025-03-18T18:13:26Z] <topranks> reboot cr1-drmrs to update JunOS (router is drained of traffic) T364092

Mentioned in SAL (#wikimedia-operations) [2025-03-31T12:11:39Z] <topranks> set "graceful sender" option on cr2-drmrs to drain for JunOS upgrade T364092

Mentioned in SAL (#wikimedia-operations) [2025-03-31T12:52:01Z] <topranks> reboot cr2-drmrs to upgrade JunOS T364092

Mentioned in SAL (#wikimedia-operations) [2025-04-01T11:16:12Z] <topranks> reboot cr4-ulsfo to upgrade JunOS T364092

Mentioned in SAL (#wikimedia-operations) [2025-04-03T10:50:00Z] <topranks> drain transport circuits to eqord (Chicago network pop) to prep for Junos upgrade cr2-eqord T364092

Mentioned in SAL (#wikimedia-operations) [2025-04-03T11:06:55Z] <topranks> prepend AS paths announced to codfw/eqiad from eqord to prep for JunOS upgrade T364092

Mentioned in SAL (#wikimedia-operations) [2025-04-03T11:31:32Z] <topranks> disable EBGP sessions to internet peers on cr2-eqord to prep for JunOS upgrade T364092

Mentioned in SAL (#wikimedia-operations) [2025-04-03T11:33:17Z] <topranks> reboot cr2-eqord to complete JunOS upgrade T364092

ayounsi updated the task description. (Show Details)

Mentioned in SAL (#wikimedia-operations) [2025-05-13T12:31:23Z] <ayounsi@cumin1002> START - Cookbook sre.dns.admin DNS admin: depool site eqsin [reason: cr3-eqsin upgrade, T364092]

Mentioned in SAL (#wikimedia-operations) [2025-05-13T12:31:28Z] <ayounsi@cumin1002> END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: depool site eqsin [reason: cr3-eqsin upgrade, T364092]

Icinga downtime and Alertmanager silence (ID=a82fd52b-494d-4956-9f75-7cd844fe0007) set by ayounsi@cumin1002 for 2:00:00 on 1 host(s) and their services with reason: upgrade

cr3-eqsin

Mentioned in SAL (#wikimedia-operations) [2025-05-13T12:40:49Z] <XioNoX> cr3-eqsin# set protocols bgp graceful-shutdown sender - T364092

Mentioned in SAL (#wikimedia-operations) [2025-05-13T12:53:39Z] <XioNoX> cr3-eqsin - lower vrrp priority - T364092

Mentioned in SAL (#wikimedia-operations) [2025-05-13T12:57:40Z] <XioNoX> cr3-eqsin - shutdown transit/peering BGP sessions - T364092

Mentioned in SAL (#wikimedia-operations) [2025-05-13T13:00:30Z] <XioNoX> cr3-eqsin> request vmhost software add /var/tmp/junos-vmhost-install-mx-x86-64-23.4R2-S3.9.tgz - T364092

Mentioned in SAL (#wikimedia-operations) [2025-05-13T13:15:08Z] <XioNoX> cr3-eqsin> request vmhost reboot - T364092

Mentioned in SAL (#wikimedia-operations) [2025-05-13T13:47:06Z] <ayounsi@cumin1002> START - Cookbook sre.dns.admin DNS admin: pool site eqsin [reason: cr3-eqsin upgrade finished, T364092]

Mentioned in SAL (#wikimedia-operations) [2025-05-13T13:47:09Z] <ayounsi@cumin1002> END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: pool site eqsin [reason: cr3-eqsin upgrade finished, T364092]

Mentioned in SAL (#wikimedia-operations) [2025-05-14T07:36:23Z] <ayounsi@cumin1002> START - Cookbook sre.dns.admin DNS admin: depool site esams [reason: esams routers upgrade, T364092]

Mentioned in SAL (#wikimedia-operations) [2025-05-14T07:36:31Z] <ayounsi@cumin1002> END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: depool site esams [reason: esams routers upgrade, T364092]

Icinga downtime and Alertmanager silence (ID=239f1d24-394b-4cd2-b80b-211b30b54a1a) set by ayounsi@cumin1002 for 1:00:00 on 3 host(s) and their services with reason: cr2-esams upgrade

cr2-esams,cr2-esams IPv6,cr2-esams.mgmt

Mentioned in SAL (#wikimedia-operations) [2025-05-14T07:43:25Z] <XioNoX> cr2-esams# set protocols bgp graceful-shutdown sender - T364092

Mentioned in SAL (#wikimedia-operations) [2025-05-14T07:58:23Z] <XioNoX> cr2-esams - disable transit/IX BGP sessions - T364092

Mentioned in SAL (#wikimedia-operations) [2025-05-14T07:59:10Z] <XioNoX> cr2-esams> request vmhost software add /var/tmp/junos-vmhost-install-mx-x86-64-23.4R2-S3.9.tgz - T364092

Mentioned in SAL (#wikimedia-operations) [2025-05-14T08:13:42Z] <XioNoX> cr2-esams> request vmhost reboot - T364092

Mentioned in SAL (#wikimedia-operations) [2025-05-14T08:39:26Z] <XioNoX> cr1-esams# set protocols bgp graceful-shutdown sender - T364092

Icinga downtime and Alertmanager silence (ID=0ccf059a-76d1-46d7-9ee7-b67d79c235aa) set by ayounsi@cumin1002 for 1:00:00 on 1 host(s) and their services with reason: cr1-esams upgrade

re0.cr1-esams.mgmt

Icinga downtime and Alertmanager silence (ID=ed684b09-6354-460a-9fbf-3df20fbe3f21) set by ayounsi@cumin1002 for 1:00:00 on 2 host(s) and their services with reason: cr1-esams upgrade

cr1-esams,cr1-esams IPv6

Mentioned in SAL (#wikimedia-operations) [2025-05-14T08:44:19Z] <XioNoX> cr1-esams - disable transit/IX BGP sessions - T364092

Mentioned in SAL (#wikimedia-operations) [2025-05-14T08:46:15Z] <XioNoX> cr1-esams - Install image on backup RE - T364092

Mentioned in SAL (#wikimedia-operations) [2025-05-14T08:58:27Z] <XioNoX> cr1-esams request vmhost reboot re1 - T364092

Mentioned in SAL (#wikimedia-operations) [2025-05-14T09:05:53Z] <XioNoX> cr1-esams> request chassis routing-engine master switch - T364092

Mentioned in SAL (#wikimedia-operations) [2025-05-14T09:21:40Z] <XioNoX> re1.cr1-esams> request vmhost reboot re0 - T364092

Mentioned in SAL (#wikimedia-operations) [2025-05-14T09:28:33Z] <XioNoX> cr1-esams> request chassis routing-engine master switch - T364092

Mentioned in SAL (#wikimedia-operations) [2025-05-14T09:49:56Z] <ayounsi@cumin1002> START - Cookbook sre.dns.admin DNS admin: pool site esams [reason: esams routers upgrade finished, T364092]

Mentioned in SAL (#wikimedia-operations) [2025-05-14T09:50:00Z] <ayounsi@cumin1002> END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: pool site esams [reason: esams routers upgrade finished, T364092]
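
The cr1-esams entries above show the dual routing-engine order: reboot the backup RE into the new image, switch mastership, then repeat for the other RE. A toy model of that ordering is sketched below. The `Chassis` class and its method names are hypothetical and only track state; nothing here talks to a router, and the image install that precedes each reboot is not modelled.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Chassis:
    """Toy state for a router with two routing engines, re0 and re1."""
    master: str = "re0"
    upgraded: Set[str] = field(default_factory=set)

    @property
    def backup(self) -> str:
        return "re1" if self.master == "re0" else "re0"

    def reboot_into_new_image(self, re_name: str) -> None:
        # Only the backup RE is rebooted, so the master keeps forwarding state.
        if re_name == self.master:
            raise RuntimeError(f"refusing to reboot master {re_name}")
        self.upgraded.add(re_name)

    def switch_master(self) -> None:
        # Stands in for the mastership switch logged in the entries above.
        self.master = self.backup


def dual_re_upgrade(chassis: Chassis) -> List[str]:
    """Upgrade both REs, always rebooting whichever one is currently backup."""
    steps = []
    for _ in range(2):
        target = chassis.backup
        chassis.reboot_into_new_image(target)
        steps.append(f"reboot {target}")
        chassis.switch_master()
        steps.append(f"switch master to {chassis.master}")
    return steps


print(dual_re_upgrade(Chassis()))
# prints ['reboot re1', 'switch master to re1', 'reboot re0', 'switch master to re0']
```

The printed order matches the logged sequence for cr1-esams (reboot re1, master switch, reboot re0, master switch), and the guard in `reboot_into_new_image` captures the one rule the sequence depends on: never reboot the RE that currently holds mastership.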

Change #1148281 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/alerts@master] BFDdown: don't deploy in codfw

https://gerrit.wikimedia.org/r/1148281

Change #1148281 merged by jenkins-bot:

[operations/alerts@master] BFDdown: don't deploy in codfw

https://gerrit.wikimedia.org/r/1148281

Icinga downtime and Alertmanager silence (ID=8c92db5f-18b6-481b-8642-01c1d92b5cb0) set by cmooney@cumin1003 for 2:00:00 on 10 host(s) and their services with reason: upgrade cr1-codfw JunOS

cr1-codfw,cr1-codfw IPv6,cr2-eqdfw,cr1-eqiad,cr3-eqsin,cr4-ulsfo,pfw1-codfw,re0.cr1-codfw.mgmt,ssw1-a1-codfw.mgmt,ssw1-d1-codfw.mgmt

Mentioned in SAL (#wikimedia-operations) [2025-05-20T11:41:27Z] <topranks> drain transport circuits landing on cr1-codfw of traffic before router upgrade (T364092)

Mentioned in SAL (#wikimedia-operations) [2025-05-20T11:49:03Z] <topranks> apply bgp "graceful shutdown" community on cr1-codfw ahead of JunOS upgrade (T364092)

Mentioned in SAL (#wikimedia-operations) [2025-05-20T12:05:57Z] <topranks> disable routing-engine sync / graceful-switchover on cr1-codfw ahead of JunOS upgrade on RE1 T364092

Mentioned in SAL (#wikimedia-operations) [2025-05-20T12:23:23Z] <topranks> rebooting backup routing-engine RE1 on cr1-codfw to install JunOS upgrade (T364092)

Mentioned in SAL (#wikimedia-operations) [2025-05-20T12:32:13Z] <topranks> switching active routing-engine to RE1 on cr1-codfw (this will cause protocol adjacencies to flap) (T364092)

Icinga downtime and Alertmanager silence (ID=f40f3f46-731d-46ef-9db5-647d735907d6) set by cmooney@cumin1003 for 3:00:00 on 1 host(s) and their services with reason: upgrade cr1-codfw JunOS

cr2-codfw

Mentioned in SAL (#wikimedia-operations) [2025-05-20T12:57:10Z] <topranks> rebooting backup routing-engine RE0 on cr1-codfw to install JunOS upgrade (T364092)

Mentioned in SAL (#wikimedia-operations) [2025-05-20T13:05:00Z] <topranks> switching active routing-engine to RE0 on cr1-codfw (this will cause protocol adjacencies to flap) (T364092)

Mentioned in SAL (#wikimedia-operations) [2025-05-20T13:15:02Z] <topranks> re-enable graceful switchover on cr1-codfw (T364092)

Mentioned in SAL (#wikimedia-operations) [2025-05-20T13:29:37Z] <topranks> drain transport circuits landing on cr2-codfw of traffic before router upgrade (T364092)

Mentioned in SAL (#wikimedia-operations) [2025-05-20T13:40:46Z] <topranks> disabling bgp groups on cr2-codfw ahead of upgrade/line-card install (T364092)

Mentioned in SAL (#wikimedia-operations) [2025-05-20T13:56:34Z] <topranks> rebooting backup routing-engine RE1 on cr2-codfw to install JunOS upgrade (T364092)

Mentioned in SAL (#wikimedia-operations) [2025-05-20T14:03:44Z] <topranks> switching active routing-engine to RE1 on cr2-codfw (this will cause protocol adjacencies to flap) (T364092)

Mentioned in SAL (#wikimedia-operations) [2025-05-20T14:39:58Z] <topranks> switching active routing-engine to RE0 on cr2-codfw (this will cause protocol adjacencies to flap) (T364092)

cmooney updated the task description. (Show Details)

Mentioned in SAL (#wikimedia-operations) [2025-05-21T09:03:01Z] <XioNoX> cr2-eqdfw# set protocols bgp graceful-shutdown sender - T364092

Icinga downtime and Alertmanager silence (ID=048b70e3-25f1-4871-b6c8-5ea7b074de1e) set by ayounsi@cumin1002 for 2:00:00 on 2 host(s) and their services with reason: router upgrade

cr2-eqdfw,cr2-eqdfw IPv6

Mentioned in SAL (#wikimedia-operations) [2025-05-21T09:13:26Z] <XioNoX> cr2-eqdfw - shutdown transit/ix BGP sessions - T364092

Mentioned in SAL (#wikimedia-operations) [2025-05-21T09:22:24Z] <XioNoX> cr2-eqdfw> request vmhost reboot - T364092

ayounsi updated the task description. (Show Details)

All done! Thank you all.