Upgrade POPs asw to Junos 21
Closed, Resolved · Public

Description

Similar to T295691, T295690, T316529.

The Junos 21 branch is now among Juniper's recommended versions for our access switches; see https://kb.juniper.net/InfoCenter/index?page=content&id=KB21476&smlogin=true

Upgrading them brings several advantages:

  • Keeping a tight Junos version spread (we currently have 14, 17, 18, 20)
  • Leveraging feature improvements (e.g. DNS in mgmt-junos, see T269340)
  • Fixing low-risk security issues

Those switches are L2 only and not exposed to the Internet, so the upgrade is less urgent. They are in virtual chassis (VCs), so all members have to go down at the same time (we're not attempting ISSU or similar).

| Device     | Scheduled for        | Status                 |
| asw2-ulsfo | Jan 24th - 12:30 UTC | Upgraded from 14 to 21 |
| asw1-eqsin | Jan 11th - 10:00 UTC | Upgraded from 14 to 20 |
| asw2-esams | Jan 12th - 10:00 UTC | Upgraded from 18 to 21 |

Event Timeline

ayounsi created this task.
Restricted Application added a subscriber: Aklapper.

Change 877091 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/dns@master] Depool ulsfo for network maintenance

https://gerrit.wikimedia.org/r/877091

Change 877091 merged by Ayounsi:

[operations/dns@master] Depool ulsfo for network maintenance

https://gerrit.wikimedia.org/r/877091

Mentioned in SAL (#wikimedia-operations) [2023-01-09T08:56:43Z] <XioNoX> depool ulsfo for network maintenance - T316532

I didn't proceed with the upgrade as there were errors. I opened JTAC case 2023-0109-616616 with:

Hi,
I'm trying to upgrade a virtual-chassis made of two EX4600, from 14.1X53-D45.3 to 21.4R3-S2.4.
The first error I got was:

ayounsi@asw2-ulsfo> request system software add /var/tmp/jinstall-host-ex-4600-21.4R3-S2.4-signed.tgz 

Checking pending install on fpc2

Checking pending install on fpc1
Pushing bundle /var/tmp/jinstall-host-ex-4600-21.4R3-S2.4-signed.tgz to fpc2

Validating on fpc2

Validating on fpc1
Done with validate of </var/tmp/jinstall-host-ex-4600-21.4R3-S2.4-signed.tgz> on VC members

fpc2:
Verified jinstall-vjunos-21.4R3-S2.4.tgz signed by PackageProductionSHA1RSA_2022
Adding vjunos...
Saving contents of boot area prior to installation
tar: ./cdboot: Cannot open: Input/output error
tar: ./loader.help: Cannot open: Input/output error
tar: ./pxeboot: Cannot open: Input/output error
tar: ./kgzldr.o: Cannot open: Input/output error
tar: Error exit delayed from previous errors

WARNING:     This package will load JUNOS 21.4R3-S2.4 software.
WARNING:     It will save JUNOS configuration files, and SSH keys
WARNING:     (if configured), but erase all other files and information
WARNING:     stored on this machine.  It will attempt to preserve dumps
WARNING:     and log files, but this can not be guaranteed.  This is the
WARNING:     pre-installation stage and all the software is loaded when
WARNING:     you reboot the system.

POST-INSTALL...
Saving the config files ...
NOTICE: uncommitted changes have been saved in /var/db/config/juniper.conf.pre-install
Pushing installation package to host...
rcmd: connection timeout
download_file /var/tmp/preinstall/vjunos-install.sh failed, ret=1

WARNING:     A REBOOT IS REQUIRED TO LOAD THIS SOFTWARE CORRECTLY. Use the
WARNING:     'request system reboot' command when software installation is
WARNING:     complete. To abort the installation, do not reboot your system,
WARNING:     instead use the 'request system software delete jinstall'
WARNING:     command as soon as this operation completes.

ERROR: jinstall-vjunos fails post-install
ERROR: jinstall-vjunos-21.4R3-S2.4-signed fails post-install

After investigation I thought that it could be related to this error:
https://supportportal.juniper.net/s/article/Unable-to-upgrade-QFX5100-due-to-jinstall-vjunos-fails-post-install-error?language=en_US

So I temporarily removed these configuration options:

[edit system]
-   internet-options {
-       tcp-drop-synfin-set;
-       no-tcp-reset drop-all-tcp;
-   }

And ran request system configuration rescue save

I then tried another upgrade, which failed with a different error:

ayounsi@asw2-ulsfo> request system software add /var/tmp/jinstall-host-ex-4600-21.4R3-S2.4-signed.tgz no-validate 
 
 Checking pending install on fpc2
 
 Checking pending install on fpc1
 Pushing bundle /var/tmp/jinstall-host-ex-4600-21.4R3-S2.4-signed.tgz to fpc2
 
 fpc2:
 WARNING: The /tmp filesystem that is created by the next stage of
 WARNING: the installer does not have sufficient space. This package
 WARNING: requires 1260492k free (734k for configuration
 WARNING: files and 1259758k for the new software), but there is
 WARNING: only 1048576k available.
 
 
 WARNING: This installation attempt will be aborted.
 WARNING: If you wish to force the installation despite these warnings
 WARNING: you may use the 'force' option on the command line.

I tried with no-validate no-copy unlink force but I am getting the same error.
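
The arithmetic behind the installer warning above can be checked quickly. This is a minimal sketch that uses only the figures printed in the warning; the function name `tmp_shortfall_k` is hypothetical and nothing here talks to the switch.

```python
# Figures copied from the installer warning above, in the "k" unit it prints.
CONFIG_K = 734          # space needed for configuration files
SOFTWARE_K = 1259758    # space needed for the new software
AVAILABLE_K = 1048576   # space the installer says the next-stage /tmp will have


def tmp_shortfall_k(config_k: int, software_k: int, available_k: int) -> int:
    """Return how many "k" are missing for the install (0 if the package fits)."""
    required_k = config_k + software_k
    return max(0, required_k - available_k)


if __name__ == "__main__":
    required = CONFIG_K + SOFTWARE_K
    missing = tmp_shortfall_k(CONFIG_K, SOFTWARE_K, AVAILABLE_K)
    print(f"required: {required}k, available: {AVAILABLE_K}k, missing: {missing}k")
```

With these figures the required total is 1260492k, matching the warning, and the shortfall is 211916k.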

Finally found that the issue was the files in fpc2:/var/tmp/preinstall/, so I deleted them and re-ran:

ayounsi@asw2-ulsfo> request system software add /var/tmp/jinstall-host-ex-4600-21.4R3-S2.4-signed.tgz no-copy                             

Checking pending install on fpc2

Checking pending install on fpc1
Pushing bundle /var/tmp/jinstall-host-ex-4600-21.4R3-S2.4-signed.tgz to fpc2

Validating on fpc2

Validating on fpc1
Done with validate of </var/tmp/jinstall-host-ex-4600-21.4R3-S2.4-signed.tgz> on VC members

fpc2:
Verified jinstall-vjunos-21.4R3-S2.4.tgz signed by PackageProductionSHA1RSA_2022
Adding vjunos...
Saving contents of boot area prior to installation
tar: ./cdboot: Cannot open: Input/output error
tar: ./loader.help: Cannot open: Input/output error
tar: ./pxeboot: Cannot open: Input/output error
tar: ./kgzldr.o: Cannot open: Input/output error
tar: Error exit delayed from previous errors

WARNING:     This package will load JUNOS 21.4R3-S2.4 software.
WARNING:     It will save JUNOS configuration files, and SSH keys
WARNING:     (if configured), but erase all other files and information
WARNING:     stored on this machine.  It will attempt to preserve dumps
WARNING:     and log files, but this can not be guaranteed.  This is the
WARNING:     pre-installation stage and all the software is loaded when
WARNING:     you reboot the system.

POST-INSTALL...
Saving the config files ...
NOTICE: uncommitted changes have been saved in /var/db/config/juniper.conf.pre-install
Pushing installation package to host...
rcp: /var/run/./vjunos-install.sh: Read-only file system
download_file /var/tmp/preinstall/vjunos-install.sh failed, ret=1

WARNING:     A REBOOT IS REQUIRED TO LOAD THIS SOFTWARE CORRECTLY. Use the
WARNING:     'request system reboot' command when software installation is
WARNING:     complete. To abort the installation, do not reboot your system,
WARNING:     instead use the 'request system software delete jinstall'
WARNING:     command as soon as this operation completes.

ERROR: jinstall-vjunos fails post-install
ERROR: jinstall-vjunos-21.4R3-S2.4-signed fails post-install

Note that the rcp error is now "rcp: /var/run/./vjunos-install.sh: Read-only file system", while it was previously "rcmd: connection timeout".
So I have several questions:
First, and the most important, how do I proceed with my virtual-chassis upgrade?
Second, are the "Cannot open: Input/output error" messages a problem?

Thank you!

JTAC replied with (I cherry-picked the useful info):

As JTAC we would suggest that you perform a step upgrade in your case and jump on an intermediate release first and then to the target version to avoid any unwanted file corruption. Please follow the upgrade path below:

14.1X53-D45.3 >> 20.3R2 >> Targeted version: 21.4R3
Also since EX4600 is host based, kindly use the below command while upgrading your VC:
request system software add <junos-image> no-validate force-host
You can check the below Kb as well for VC upgrade:
https://supportportal.juniper.net/s/article/EX-How-to-upgrade-the-software-version-for-an-EX-Series-Switch-Virtual-Chassis-environment?language=en_US
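
As a rough illustration of the stepped path JTAC suggests, the sketch below only builds the ordered list of CLI commands as strings; it does not connect to any device. The 21.4 image path is the one used earlier in this task, while the 20.3R2 path is a placeholder because the task does not name that file. The helper names `install_command` and `upgrade_plan` are hypothetical, and the reboot command is the one later logged in SAL for eqsin.

```python
from typing import List, Tuple

# Upgrade path suggested by JTAC: 14.1X53-D45.3 >> 20.3R2 >> 21.4R3.
# Assumption: the 20.3R2 image path is a placeholder, not a real file name.
UPGRADE_STEPS: List[Tuple[str, str]] = [
    ("20.3R2", "/var/tmp/<junos-20.3R2-image>"),
    ("21.4R3", "/var/tmp/jinstall-host-ex-4600-21.4R3-S2.4-signed.tgz"),
]


def install_command(image_path: str) -> str:
    """Build the install command in the form JTAC gave for the host-based EX4600."""
    return f"request system software add {image_path} no-validate force-host"


def upgrade_plan(steps: List[Tuple[str, str]]) -> List[str]:
    """Return the commands in order: install, then reboot all VC members, per step."""
    plan: List[str] = []
    for _version, image_path in steps:
        plan.append(install_command(image_path))
        plan.append("request system reboot all-members")
    return plan


if __name__ == "__main__":
    for command in upgrade_plan(UPGRADE_STEPS):
        print(command)
```

Each step implies a full VC reboot, so the two-step path means two outages unless both are done in the same maintenance window.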

I'll try it again tomorrow.

Change 877221 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/dns@master] Depool ulsfo for network maintenance

https://gerrit.wikimedia.org/r/877221

Change 877221 merged by Ayounsi:

[operations/dns@master] Depool ulsfo for network maintenance

https://gerrit.wikimedia.org/r/877221

Mentioned in SAL (#wikimedia-operations) [2023-01-10T07:28:54Z] <XioNoX> depool ulsfo for network maintenance - T316532

Same issue: rcp: /var/run/./vjunos-install.sh: Read-only file system, followed by mount: /dev/ad0s1a : Resource temporarily unavailable. /dev/ad0s1a is mounted on / (on both fpcs):

Filesystem              Size       Used      Avail  Capacity   Mounted on
/dev/ad0s1a            1003M       776M       146M       84%  /

This also fits the Input/output errors seen during the install.

I'm wondering if a reboot could help... Hopefully JTAC will find a solution that doesn't require an onsite factory reset of the device (install using the jloader).

Note that removing

[edit system]
-   internet-options {
-       tcp-drop-synfin-set;
-       no-tcp-reset drop-all-tcp;
-   }

is needed, otherwise rcmd: connection timeout happens.
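
The error signatures seen so far can be summarized as a small lookup. This is a sketch of the working hypotheses in this task only, not an official Juniper error catalogue; the names `HINTS` and `triage` are hypothetical.

```python
from typing import Dict, List

FS_HINT = "suspected file system problem on the VC member"

# Signature -> what it pointed to in this investigation (hypotheses, not facts
# from vendor documentation).
HINTS: Dict[str, str] = {
    "rcmd: connection timeout":
        "system internet-options (tcp-drop-synfin-set, no-tcp-reset) need to be removed temporarily",
    "does not have sufficient space":
        "leftover files in /var/tmp/preinstall/ on the VC member",
    "Read-only file system": FS_HINT,
    "Input/output error": FS_HINT,
}


def triage(log_text: str) -> List[str]:
    """Return the de-duplicated hints whose signature appears in the install log."""
    found: List[str] = []
    for line in log_text.splitlines():
        for signature, hint in HINTS.items():
            if signature in line and hint not in found:
                found.append(hint)
    return found


if __name__ == "__main__":
    sample = (
        "tar: ./cdboot: Cannot open: Input/output error\n"
        "rcp: /var/run/./vjunos-install.sh: Read-only file system\n"
    )
    for hint in triage(sample):
        print(hint)
```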

Mentioned in SAL (#wikimedia-operations) [2023-01-10T09:25:48Z] <XioNoX> repool ulsfo (maintenance cancelled) - T316532

The pre-upgrade went fine on asw1-eqsin, so I guess the ulsfo issue is corrupted storage.

The last step for eqsin is a reboot, so I'll keep tomorrow's upgrade as scheduled: go to 20.3, then 21.4.

From JTAC:

This message “Read-only file system” suggest file system issues. I found one case with same behavior and the upgrade had to do it with a format install to reset the FS. This is the procedure used https://supportportal.juniper.net/s/article/EX-QFX-Procedure-to-format-install-QFX5K-device-using-a-USB?language=en_US (applies for EX4600 too).

Basically is:
Create a bootable USB using the desired Junos version. Rufus can be used to do this.
Connect the USB
Reboot the device and hypervisor.
Please let me know any questions you may have.

As this is quite inconvenient, I asked if there was a less intrusive way. Their reply:

There is another option if this is a file system. We can format it using one of the packages already saved in the device and then do the upgrade. But please allow me to test the jump you are trying to perform to make sure is not an incompatibility due to a very software jump.

Change 878854 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/dns@master] depool eqsin for network maintenance

https://gerrit.wikimedia.org/r/878854

Change 878854 merged by Ayounsi:

[operations/dns@master] depool eqsin for network maintenance

https://gerrit.wikimedia.org/r/878854

Mentioned in SAL (#wikimedia-operations) [2023-01-11T10:02:26Z] <XioNoX> asw1-eqsin> request system reboot all-members - T316532

fpc0 came back up fine, but fpc1 not so much... It's not fully booting and is stuck at a BusyBox-like shell. The root password works, so the config is still alive somewhere.
That means those 13 servers (and the matching VMs) are down until it's fixed: https://netbox.wikimedia.org/dcim/devices/?q=&rack_id=78&status=active&role_id=1
I'm keeping the site depooled for now and opened a critical JTAC case, 2023-0111-618257.

 (hd0,0): Filesystem type is ext2fs, partition type 0x83
root='/dev/ram0'
rootfs=''
fstype='ramfs'
initdebug=''
NEWROOT='/sysroot'
Unpacking root filesystem....
873426 blocks
Mounted root filesystem on /dev/ram0
Switching to newStarting udev: [  OK  ]
Found required disks: /dev/sda /dev/sdb
  Reading all physical volumes.  This may take a while...
  Found volume group "vg0_vjunos" using metadata type lvm2
  3 logical volume(s) in volume group "vg0_vjunos" now active



EX Last reboot reason: Software Reset
EX System booted from Boot1, UUID="5473d68e-b872-4d92-ba52-24216c2a1c95".
Setting next boot device to SSD1
Generating fstab ....
		Welcome to CentOS 
Starting udev: [  OK  ]
Setting hostname localhost:  [  OK  ]
Setting up Logical Volume Management:   3 logical volume(s) in volume group "vg0_vjunos" now active
[  OK  ]
Checking filesystems
Checking all file systems.
[/sbin/fsck.ext4 (1) -- /var] fsck.ext4 -a /dev/mapper/vg0_vjunos-lv_var 
LINUX_VAR: clean, 253/690880 files, 86450/2760704 blocks
[/sbin/fsck.ext4 (1) -- /junos] fsck.ext4 -a /dev/mapper/vg0_vjunos-lv_junos 
JUNOS: clean, 30/786432 files, 683671/3145728 blocks
[/sbin/fsck.ext4 (1) -- /recovery] fsck.ext4 -a /dev/mapper/vg0_vjunos-lv_junos_recovery 
JUNOS_RECOVERY: clean, 16/262144 files, 607939/1048576 blocks
[/sbin/fsck.ext4 (1) -- /boot] fsck.ext4 -a /dev/sdb1 
/dev/sdb1: clean, 36/62592 files, 49664/250000 blocks
[  OK  ]
Mounting local filesystems:  [  OK  ]
Enabling /etc/fstab swaps:  [  OK  ]
mke2fs 1.41.12 (17-May-2010)
Filesystem label=
OS type: Linux
Block size=1024 (log=0)
Fragment size=1024 (log=0)
Stride=0 blocks, Stripe width=0 blocks
168 inodes, 1344 blocks
67 blocks (4.99%) reserved for the super user
First data block=1
Maximum filesystem blocks=1572864
1 block group
8192 blocks per group, 8192 fragments per group
168 inodes per group

Writing inode tables: done                            

Filesystem too small for a journal
Writing superblocks and filesystem accounting information: done

This filesystem will be automatically checked every 27 mounts or
180 days, whichever comes first.  Use tune2fs -c or -i to override.
/var: 10949574656 bytes were trimmed
/junos: 10084581376 bytes were trimmed
/recovery: 1804845056 bytes were trimmed
/boot: 820572160 bytes were trimmed
chmod: cannot access `/sys/bus/pci/devices/0000:07:00.0/config': No such file or directory
chmod: cannot access `/sys/bus/pci/devices/0000:07:00.0/resource*': No such file or directory
chmod: cannot access `/sys/bus/pci/devices/0000:08:00.0/config': No such file or directory
chmod: cannot access `/sys/bus/pci/devices/0000:08:00.0/resource*': No such file or directory
Starting bandwidth shaping: [  OK  ]
haveged: haveged starting up
Starting kdump:[FAILED]
Starting sshd: /usr/sbin/sshd: /usr/lib64/libcrypto.so.10: no version information available (required by /usr/sbin/sshd)
/usr/sbin/sshd: /usr/lib64/libcrypto.so.10: no version information available (required by /usr/sbin/sshd)
OpenSSL version mismatch. Built against 1000105f, you have 10000003
[FAILED]
Starting system logger: [  OK  ]
Starting xinetd: [  OK  ]
Starting sntpc: [  OK  ]
Starting libvirtd daemon: libvirtd: /usr/lib64/libcrypto.so.10: no version information available (required by /usr/lib64/libssh2.so.1)
[  OK  ]
Starting vehostd: /etc/init.d/vehostd: line 36: /etc/init.d/jgcov-env-vars.sh: No such file or directory
vehostd: /usr/lib64/libcrypto.so.10: no version information available (required by /usr/lib64/libssh2.so.1)
vehostd: invalid option -- 'r'
[  OK  ]
Starting crond: [  OK  ]
Starting watchdog: [  OK  ]
Entering non-interactive startup

localhost login:

We tried to boot on the Recovery Junos (both 14 and 20) but the same error happened.

Next step is onsite "format install" https://supportportal.juniper.net/s/article/EX-QFX-Procedure-to-format-install-QFX5K-device-using-a-USB?language=en_US

ayounsi mentioned this in Unknown Object (Task). Jan 11 2023, 1:10 PM

Rob remotely power-cycled the switch using the PDUs and it came back up.

Change 879268 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/dns@master] Depool esams for network maintenance

https://gerrit.wikimedia.org/r/879268

On ulsfo, from JTAC:

I had no issues in the lab going from 14.1X53-D54.1 (It was the only available in the lab) to 19.1. (the closest version available on the website). That means the issue you have is a file system one, affecting your device only. Since this is a file system issue, my recommendation is to have someone on site to perform the upgrade, since we will need to reboot the host (hypervisor) and we don’t know how much this FS issue could affect during the reboot .

So we might be able to get away with a remote reboot, but given the filesystem state the switch might not come back up afterwards, so we need someone onsite, or ready to go onsite ASAP.

Change 879268 merged by Ayounsi:

[operations/dns@master] Depool esams for network maintenance

https://gerrit.wikimedia.org/r/879268

Mentioned in SAL (#wikimedia-operations) [2023-01-12T08:50:38Z] <XioNoX> depool esams for network maintenance - T316532

Icinga downtime and Alertmanager silence (ID=43642849-a893-44f6-961e-0bb82f3a9b4e) set by ayounsi@cumin1001 for 2:00:00 on 36 host(s) and their services with reason: nework maintenance

bast[3004-3005].wikimedia.org,cp[3050-3065].esams.wmnet,dns[3001-3002].wikimedia.org,doh[3001-3002].wikimedia.org,durum[3001-3002].esams.wmnet,ganeti[3001-3003].esams.wmnet,install3001.wikimedia.org,lvs[3005-3007].esams.wmnet,ncredir[3001-3002].esams.wmnet,netflow3002.esams.wmnet,ping3002.esams.wmnet,prometheus3001.esams.wmnet
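
The host expression above uses bracketed numeric ranges. As a minimal sketch of how such an expression expands, the snippet below is written from scratch rather than using Cumin's own parser; `expand_hosts` is a hypothetical helper and only handles the simple `name[a-b].domain` form seen here (no comma lists inside brackets, no nested ranges).

```python
import re
from typing import List

# One optional [start-end] range per item, as in the expression above.
_RANGE = re.compile(
    r"^(?P<prefix>[^\[\]]*)\[(?P<start>\d+)-(?P<end>\d+)\](?P<suffix>[^\[\]]*)$"
)

ESAMS_HOSTS = (
    "bast[3004-3005].wikimedia.org,cp[3050-3065].esams.wmnet,"
    "dns[3001-3002].wikimedia.org,doh[3001-3002].wikimedia.org,"
    "durum[3001-3002].esams.wmnet,ganeti[3001-3003].esams.wmnet,"
    "install3001.wikimedia.org,lvs[3005-3007].esams.wmnet,"
    "ncredir[3001-3002].esams.wmnet,netflow3002.esams.wmnet,"
    "ping3002.esams.wmnet,prometheus3001.esams.wmnet"
)


def expand_hosts(expression: str) -> List[str]:
    """Expand a comma-separated host expression into individual hostnames."""
    hosts: List[str] = []
    for item in expression.split(","):
        item = item.strip()
        if not item:
            continue
        match = _RANGE.match(item)
        if match is None:
            hosts.append(item)  # plain hostname without a range
            continue
        width = len(match["start"])  # preserve zero padding, if any
        for number in range(int(match["start"]), int(match["end"]) + 1):
            hosts.append(f"{match['prefix']}{number:0{width}d}{match['suffix']}")
    return hosts


if __name__ == "__main__":
    print(len(expand_hosts(ESAMS_HOSTS)))  # 36, matching the downtime message
```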

Mentioned in SAL (#wikimedia-operations) [2023-01-12T09:46:28Z] <XioNoX> redirect ns2 to authdns1001 - T316532

Mentioned in SAL (#wikimedia-operations) [2023-01-12T10:01:12Z] <XioNoX> reboot asw2-esams for upgrade - T316532

10 min of downtime, everything went smoothly.

Mentioned in SAL (#wikimedia-operations) [2023-01-12T10:24:58Z] <XioNoX> rollback redirect ns2 to authdns1001 - T316532

Re-scheduling ulsfo for Jan 16th at 12:00 UTC

Change 883106 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/dns@master] Depool ulsfo for network maintenance

https://gerrit.wikimedia.org/r/883106

Change 883106 merged by Ayounsi:

[operations/dns@master] Depool ulsfo for network maintenance

https://gerrit.wikimedia.org/r/883106

Mentioned in SAL (#wikimedia-operations) [2023-01-24T10:49:43Z] <XioNoX> depool ulsfo for network maintenance - T316532

Icinga downtime and Alertmanager silence (ID=795679e1-6c07-4196-8280-0cef7454587d) set by ayounsi@cumin1001 for 2:00:00 on 36 host(s) and their services with reason: nework maintenance

bast[4003-4004].wikimedia.org,cp[4037-4052].ulsfo.wmnet,dns[4003-4004].wikimedia.org,doh[4001-4002].wikimedia.org,durum[4001-4002].ulsfo.wmnet,ganeti[4005-4008].ulsfo.wmnet,install4001.wikimedia.org,lvs[4008-4010].ulsfo.wmnet,ncredir[4001-4002].ulsfo.wmnet,netflow4002.ulsfo.wmnet,prometheus4001.ulsfo.wmnet

ayounsi claimed this task.

All done!

fpc2 didn't like the first "blank" reboot and required a power cycle using the PDU.
After that the upgrade went as expected.

eqsin is staying on 20, as the 20 to 21 jump doesn't justify the engineering time and downtime.

Change 883497 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/homer/public@master] Disable Telemetry on eqsin switches

https://gerrit.wikimedia.org/r/883497

Change 883497 merged by Ayounsi:

[operations/homer/public@master] Disable Telemetry on eqsin switches

https://gerrit.wikimedia.org/r/883497