Upgrade End Of Support Junos
Open, Medium, Public

Description

From https://support.juniper.net/support/eol/software/junos/ I noticed that we're running a few Junos releases that are out of support (or soon will be).

Management routers, we're good.
Payment firewalls, we're good.
frack switches, we're good.
Core sites switches, we're good.

Management switches tracked in T390814: Upgrade management switches to Junos 21.4

Core routers tracked in T364092: Upgrade core routers to Junos 23.4R2

Cloud switches:

  • cloudsw1-e4-eqiad
  • cloudsw1-f4-eqiad

POP switches:

  • asw1-b12-drmrs
  • asw1-b13-drmrs

Event Timeline

Restricted Application added a subscriber: Aklapper.
ayounsi renamed this task from Upgrade Junos 20 switches to Upgrade End Of Support Junos. Apr 2 2025, 7:51 AM
ayounsi updated the task description.
ayounsi triaged this task as Medium priority. Apr 7 2025, 3:01 PM

@Papaul would you be OK to take care of that?

@Vgutierrez @ssingh could that be a good opportunity to see how drmrs handles the loss of a switch/rack?

With the site depooled, and while one ToR switch is upgrading, maybe we could see if the other rack could handle all the traffic properly?

Thanks @ayounsi. @Papaul, the request relates specifically to drmrs. Cloud services may need more scheduling coordination with the WMCS team, so you can leave that to me for now. Hopefully, with their newly installed large Ceph hosts, they will be better able to withstand the loss of a rack.

@Vgutierrez @ssingh could that be a good opportunity to see how drmrs handles the loss of a switch/rack?

With the site depooled, and while one ToR switch is upgrading, maybe we could see if the other rack could handle all the traffic properly?

AFAIK our depool policy won't allow that for the CDN services: we would lose 50% (4 per cluster) of the caching nodes, and liberica would keep sending traffic to ceil(66.6% of 8) = 6 of them.
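
The arithmetic behind that figure can be sketched as follows. This is only an illustration: the cluster size of 8 is inferred from "50% (4 per cluster)", and the function name and the reading of the 66.6% threshold as a floor on the number of nodes kept in rotation are assumptions, not liberica's actual code.

```python
import math


def min_nodes_kept_pooled(cluster_size: int, threshold: float) -> int:
    """Smallest node count the depool policy keeps in rotation (assumed semantics)."""
    return math.ceil(cluster_size * threshold)


cluster_size = 8    # inferred: losing 4 nodes is 50% of a cluster
threshold = 0.666   # the 66.6% figure quoted in the comment
lost_with_rack = 4  # caching nodes behind the switch being upgraded

kept = min_nodes_kept_pooled(cluster_size, threshold)
healthy = cluster_size - lost_with_rack

# Nodes that would still be sent traffic even though they are unreachable.
unreachable_but_pooled = kept - healthy
print(kept, healthy, unreachable_but_pooled)
```

Under these assumptions, 6 nodes stay in rotation while only 4 are reachable, which is why the rack-loss test needs a depool policy change first.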

Hi Netops folks. Thanks for suggesting the idea of testing drmrs. Since this requires some changes on our end as well (adjusting the depool policy somehow) and for the purposes of planning, when are you planning to work on this?

There is no particular rush, let's say before the end of 2025?

OK thanks for sharing, we will triage and plan accordingly!

@ssingh @Vgutierrez hello, just checking in to see if you have a day and time in mind for this in drmrs.
Thanks

Hi @Papaul. What day and time do you have in mind? There is some work required on Traffic's end for the stress testing at least, since our current depool policy can't be tuned per-site, and I think we may not be able to get to it this quarter. So I think we can do a simple depool, and for that there is really no preference from us.

@ssingh thanks for the update. I am planning on doing it before Thanksgiving; any day during the week of November 17th works for me. Let me know if that works for you and I can get back to you with the exact day and time.

That works for us Papaul, the week of Nov 17.

@ssingh @Vgutierrez planning on doing this on Nov 19th at 10:00 AM CT. Thank you

Thanks @Papaul, that works for us.

@ayounsi @cmooney on the other QFX5120-48Y in magru we are running version 22.2R3.S3.18, but right now the recommended version for that model is 23.4R2-S5. Do you want me to go with 23.4R2-S5 or stick to 22.2R3-S7?

23.4 is the release we want to be on. We should use Juniper's recommended version within that release, so 23.4R2-S5.

Let's open a different task for magru. drmrs is more urgent as they're end of support (and older). magru is to be done when we have time (lower priority).

edit: misread the question. Target version should be 23.4.
If you meant to upgrade magru to 23.4, then let's do it in a different task.

Depooling at 15:30 UTC

% ssh cumin1003.eqiad.wmnet
$ tmux
$ sudo cookbook sre.dns.admin -t T390813 depool drmrs

Pre-check

$ sudo cookbook sre.dns.admin show
==> CURRENT STATE:
text-addrs: pooled at all sites
text-next: pooled at all sites
upload-addrs: pooled at all sites
ncredir-addrs: pooled at all sites
<==
show action called; outputting current admin_state above. No changes were made.

Mentioned in SAL (#wikimedia-operations) [2025-11-19T15:32:28Z] <slyngshede@cumin1003> START - Cookbook sre.dns.admin DNS admin: depool site drmrs [reason: no reason specified, T390813]

Mentioned in SAL (#wikimedia-operations) [2025-11-19T15:32:34Z] <slyngshede@cumin1003> END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: depool site drmrs [reason: no reason specified, T390813]

$ sudo cookbook sre.dns.admin -t T390813 depool drmrs
==> CURRENT STATE:
text-addrs: pooled at all sites
text-next: pooled at all sites
upload-addrs: pooled at all sites
ncredir-addrs: pooled at all sites
<==
Acquired lock for key /spicerack/locks/cookbooks/sre.dns.admin: {'concurrency': 1, 'created': '2025-11-19 15:32:28.364070', 'owner': 'slyngshede@cumin1003 [2930071]', 'ttl': 60}
START - Cookbook sre.dns.admin DNS admin: depool site drmrs [reason: no reason specified, T390813]
==> You are now about to: depool site drmrs
Type "go" to proceed or "abort" to interrupt the execution
> go
User input is: "go"
Setting pooled=no for tags: {'name': 'drmrs'}
==> APPLIED STATE:
text-addrs: depooled in drmrs
text-next: depooled in drmrs
upload-addrs: depooled in drmrs
ncredir-addrs: depooled in drmrs
<==
Released lock for key /spicerack/locks/cookbooks/sre.dns.admin: {'concurrency': 1, 'created': '2025-11-19 15:32:28.364070', 'owner': 'slyngshede@cumin1003 [2930071]', 'ttl': 60}
END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: depool site drmrs [reason: no reason specified, T390813]
slyngshede@cumin1003:~$ 
$ sudo cookbook sre.dns.admin show
==> CURRENT STATE:
text-addrs: depooled in drmrs
text-next: depooled in drmrs
upload-addrs: depooled in drmrs
ncredir-addrs: depooled in drmrs
<==
show action called; outputting current admin_state above. No changes were made.
slyngshede@cumin1003:~$ 
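
The before/after comparison of the `show` output above could also be checked programmatically. The following is a minimal sketch that parses only the text format visible in this log; `parse_admin_state` and `site_is_depooled` are hypothetical helpers, not part of the `sre.dns.admin` cookbook, and the handling of any output variant not shown here (for example several sites depooled at once) is an assumption.

```python
def parse_admin_state(output: str) -> dict[str, str]:
    """Map each service name to its status text from the `show` output."""
    state = {}
    in_block = False
    for raw in output.splitlines():
        line = raw.strip()
        if line.startswith("==>"):
            in_block = True   # e.g. "==> CURRENT STATE:"
        elif line.startswith("<=="):
            in_block = False
        elif in_block and ": " in line:
            service, status = line.split(": ", 1)
            state[service] = status
    return state


def site_is_depooled(state: dict[str, str], site: str) -> bool:
    """True only if every listed service reports being depooled in `site`."""
    return bool(state) and all(
        status.startswith("depooled") and site in status.split()
        for status in state.values()
    )


# Sample text copied from the log above.
BEFORE = """==> CURRENT STATE:
text-addrs: pooled at all sites
text-next: pooled at all sites
upload-addrs: pooled at all sites
ncredir-addrs: pooled at all sites
<==
"""
AFTER = BEFORE.replace("pooled at all sites", "depooled in drmrs")

print(site_is_depooled(parse_admin_state(BEFORE), "drmrs"))
print(site_is_depooled(parse_admin_state(AFTER), "drmrs"))
```

The helper deliberately returns False for empty input, so a failed or truncated command output is not mistaken for a successful depool.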

Icinga downtime and Alertmanager silence (ID=03158056-58f0-40e9-8ef7-4dd2bc33743a) set by pt1979@cumin2002 for 1:00:00 on 5 host(s) and their services with reason: router upgrade

asw1-b[12-13]-drmrs,cr[1-2]-drmrs,mr1-drmrs

Icinga downtime and Alertmanager silence (ID=1b28f61a-0c57-409c-a53a-429cb2d44ddb) set by pt1979@cumin2002 for 1:00:00 on 8 host(s) and their services with reason: router upgrade

asw1-b[12-13]-drmrs IPv6,asw1-b[12-13]-drmrs.mgmt,cr[1-2]-drmrs IPv6,cr[1-2]-drmrs.mgmt

Both switches in drmrs are now running Junos 23.4R2-S5.8. @cmooney, I am sending the task to you since you wanted to do the cloud switches.