
Upgrade core routers to Junos 23.4R2
Open, Medium, Public

Description

23.4R2 is now the recommended Junos version for MX204 and MX480. To keep up to date, and to address some medium-level security advisories, we should upgrade our estate to this version.

Upgrades should follow the standard process. The new routers in magru are already on the target release so we only need to do the other sites.

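| Device | Scheduled for | Status |
| --- | --- | --- |
| cr1-codfw | | |
| cr2-codfw | | |
| cr1-drmrs | | |
| cr2-drmrs | | |
| cr2-eqdfw | Mon Jun 17 2024 | Done |
| cr1-eqiad | Nov 14th at 10am CT | Done 23.4R2 |
| cr2-eqiad | Nov 14th at 11am CT | Done 23.4R2 |
| cr2-eqord | | |
| cr2-eqsin | October 9 2024 | Done |
| cr3-eqsin | | |
| cr1-esams | | |
| cr2-esams | | |
| cr3-ulsfo | September 30 2024 | Done: 23.4R2 |
| cr4-ulsfo | | |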

Event Timeline

cmooney triaged this task as Medium priority. May 3 2024, 9:49 AM
cmooney created this task.

Both Junos 22.2R3-Sx and Junos 22.4R3 are latest recommended. fyi, I went with 22.4R3 in magru.

Mentioned in SAL (#wikimedia-operations) [2024-06-17T15:49:59Z] <topranks> rebooting cr2-eqdfw to upgrade JunOS T364092

> Both Junos 22.2R3-Sx and Junos 22.4R3 are latest recommended. fyi, I went with 22.4R3 in magru.

Doh, I went with 22.2R3 on cr2-eqdfw; I should have checked this previously. We can use 22.4R3 on the rest and revisit eqdfw if we think it's worth it.
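The mix-up above comes down to comparing release strings by eye. Below is a minimal sketch of doing that comparison in code, assuming releases take the simple form seen in this task (22.2R3, 22.4R3, 23.4R2, optionally with a numeric -S service suffix). `parse_release` and `needs_upgrade` are hypothetical helper names, not part of any Juniper or Wikimedia tooling, and fuller release strings would need a wider pattern.

```python
import re
from typing import NamedTuple


class JunosRelease(NamedTuple):
    """Release string split into comparable fields (illustrative only)."""
    major: int
    minor: int
    kind: str      # release type letter, e.g. "R"
    build: int
    service: int   # 0 when there is no -S suffix


# Assumption: only the simple forms that appear in this task are handled.
_PATTERN = re.compile(r"^(\d+)\.(\d+)([A-Z])(\d+)(?:-S(\d+))?$")


def parse_release(text: str) -> JunosRelease:
    match = _PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognised release string: {text!r}")
    major, minor, kind, build, service = match.groups()
    return JunosRelease(int(major), int(minor), kind, int(build), int(service or 0))


def needs_upgrade(current: str, target: str) -> bool:
    """True when `current` sorts before `target` field by field."""
    return parse_release(current) < parse_release(target)


if __name__ == "__main__":
    for running in ("22.2R3", "22.4R3", "23.4R2"):
        print(running, needs_upgrade(running, "23.4R2"))
```

A malformed string such as 23.4R.2 raises `ValueError` rather than being silently accepted, which is one way to catch typos in a status table.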

cmooney renamed this task from Upgrade core routers to Junos 22.2R3 to Upgrade core routers to Junos 22.4R3. Jun 17 2024, 4:56 PM
cmooney added a parent task: Restricted Task. Jul 10 2024, 11:37 AM

There has been a spike of CPU usage on cr1-eqiad (with no impact), not sure if just a coincidence.

ayounsi renamed this task from Upgrade core routers to Junos 22.4R3 to Upgrade core routers to Junos 23.4R2. Oct 1 2024, 7:36 AM

A few more reasons to upgrade in {T376986}.

There will be some maintenance in magru sometime next week and the site will be depooled. We can take advantage of this maintenance window to upgrade the router there from 22.4R3 to 23.4R2.

Papaul updated the task description.
Papaul updated the task description.

cr1-eqiad is slated for Nov 13, but note that T376737 is also scheduled for that period (Nov 13, 8 CT), and it might be tricky for both magru and eqiad to be depooled at once. Can we please move this to the next day perhaps, or even better, give a buffer of two days?
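The clash described above is a gap check between planned depool dates. The following is a rough sketch of that check; `depool_conflicts` and its `min_gap_days` parameter are hypothetical names introduced only for illustration, and reading "a buffer of two days" as a minimum gap of two calendar days is an assumption.

```python
from datetime import date
from itertools import combinations


def depool_conflicts(plan: dict[str, date], min_gap_days: int = 1) -> list[tuple[str, str]]:
    """Return pairs of sites whose planned depool dates are closer than min_gap_days."""
    conflicts = []
    # Sort by site name so the reported pairs come out in a stable order.
    for (site_a, day_a), (site_b, day_b) in combinations(sorted(plan.items()), 2):
        if abs((day_a - day_b).days) < min_gap_days:
            conflicts.append((site_a, site_b))
    return conflicts


if __name__ == "__main__":
    # Dates taken from this thread: magru maintenance on Nov 13, eqiad first slated for the same day.
    original = {"magru": date(2024, 11, 13), "eqiad": date(2024, 11, 13)}
    moved = {"magru": date(2024, 11, 13), "eqiad": date(2024, 11, 14)}
    print(depool_conflicts(original))               # same-day clash
    print(depool_conflicts(moved))                  # next day, no clash with a 1-day gap
    print(depool_conflicts(moved, min_gap_days=2))  # still flagged with a 2-day buffer
```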

I see there is a maintenance planned for codfw now, and that the plan is to depool the datacenter. Does this mean we're doing a datacenter switchover?

Because otherwise, we can't really depool codfw from being in the path of a request. We can of course just do the upgrade, and then do a switchover if we lose redundancy, but we should be aware of the risk.

@ssingh thanks, I forgot about the 13th. I updated the dates.

> Upgrades should follow the standard process

The standard process docs are outdated I fear.

> Depool site (optional)
> (optional) if codfw, drain mw traffic: sudo cookbook sre.mediawiki.route-traffic primary

codfw will be the primary during that set of dates, it should NOT be depooled.

> I see there is a maintenance planned for codfw now, and that the plan is to depool the datacenter. Does this mean we're doing a datacenter switchover?
>
> Because otherwise, we can't really depool codfw from being in the path of a request. We can of course just do the upgrade, and then do a switchover if we lose redundancy, but we should be aware of the risk.

@Joe thank you for letting us know.

Thanks @akosiaris @Joe, we can hold back on codfw for now and work on eqiad. When we switch back to eqiad we can schedule the upgrade for codfw.

> Thanks @akosiaris @Joe, we can hold back on codfw for now and work on eqiad. When we switch back to eqiad we can schedule the upgrade for codfw.

Makes sense to me. Thanks for accommodating us. For what it's worth, we are moving back to eqiad on Wednesday, 19 March 2025.

> codfw will be the primary during that set of dates, it should NOT be depooled.

Agreed. It should also be possible for us to do the core router upgrade without depooling the site, but it makes the steps a little trickier and adds some risk.

So overall I agree it's best to wait until we switch back to eqiad; I don't think the urgency of the router upgrades requires us to do it before then.

> > Upgrades should follow the standard process
>
> The standard process docs are outdated I fear.
>
> > Depool site (optional)
> > (optional) if codfw, drain mw traffic: sudo cookbook sre.mediawiki.route-traffic primary
>
> codfw will be the primary during that set of dates, it should NOT be depooled.

Wiki updated.

> Makes sense to me. Thanks for accommodating us. For what it's worth, we are moving back to eqiad on Wednesday, 19 March 2025.

Is there a tracking task, or is it too soon? :) We could then add it as a sub-task.

It's indeed not an emergency so we can easily wait for the next switchover.

@Papaul I added an 8th point in the "Cleanup" section of https://wikitech.wikimedia.org/wiki/Juniper_router_upgrade#Cleanup
I noticed that some interfaces were missing from LibreNMS billing feature, and I suspect it's because they got their internal ID changed with the latest upgrades. Something to keep an eye on.

Mentioned in SAL (#wikimedia-operations) [2024-11-14T15:36:05Z] <sukhe@cumin1002> START - Cookbook sre.dns.admin DNS admin: depool site eqiad [reason: junos upgrade, T364092]

Mentioned in SAL (#wikimedia-operations) [2024-11-14T15:36:21Z] <sukhe@cumin1002> END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: depool site eqiad [reason: junos upgrade, T364092]

Mentioned in SAL (#wikimedia-operations) [2024-11-14T19:37:25Z] <sukhe@cumin1002> START - Cookbook sre.dns.admin DNS admin: pool site eqiad [reason: junos upgrade done, T364092]

Mentioned in SAL (#wikimedia-operations) [2024-11-14T19:37:29Z] <sukhe@cumin1002> END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: pool site eqiad [reason: junos upgrade done, T364092]
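The four SAL entries above bracket the time eqiad spent depooled for the upgrade. As a rough illustration, the sketch below pulls the timestamps out of those lines and computes the window from the depool START to the pool END. The regular expression is written only against the line format shown here, and `depool_window` is a hypothetical helper, not an existing SAL or cookbook API.

```python
import re
from datetime import datetime, timedelta, timezone

# The four log lines exactly as they appear in this task.
SAL_LINES = [
    "Mentioned in SAL (#wikimedia-operations) [2024-11-14T15:36:05Z] <sukhe@cumin1002> START - Cookbook sre.dns.admin DNS admin: depool site eqiad [reason: junos upgrade, T364092]",
    "Mentioned in SAL (#wikimedia-operations) [2024-11-14T15:36:21Z] <sukhe@cumin1002> END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: depool site eqiad [reason: junos upgrade, T364092]",
    "Mentioned in SAL (#wikimedia-operations) [2024-11-14T19:37:25Z] <sukhe@cumin1002> START - Cookbook sre.dns.admin DNS admin: pool site eqiad [reason: junos upgrade done, T364092]",
    "Mentioned in SAL (#wikimedia-operations) [2024-11-14T19:37:29Z] <sukhe@cumin1002> END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: pool site eqiad [reason: junos upgrade done, T364092]",
]

# Assumption: every relevant line matches the shape of the entries above.
_ENTRY = re.compile(
    r"\[(?P<ts>[^\]]+)\].*?(?P<phase>START|END \(PASS\)) - Cookbook sre\.dns\.admin"
    r".*?DNS admin: (?P<action>depool|pool) site (?P<site>\w+)"
)


def depool_window(lines: list[str], site: str) -> timedelta:
    """Time between the first depool START and the last pool END for a site."""
    start = end = None
    for line in lines:
        m = _ENTRY.search(line)
        if m is None or m["site"] != site:
            continue
        ts = datetime.strptime(m["ts"], "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
        if m["action"] == "depool" and m["phase"] == "START" and start is None:
            start = ts
        elif m["action"] == "pool" and m["phase"].startswith("END"):
            end = ts
    if start is None or end is None:
        raise ValueError(f"no complete depool/pool pair found for {site}")
    return end - start


if __name__ == "__main__":
    print(depool_window(SAL_LINES, "eqiad"))  # 4:01:24
```

This measures how long the site was marked depooled in DNS, not how long each router was actually rebooting.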