|Resolved||zeljkofilipin||T284730 selenium-daily-beta-CirrusSearch fails with `Request failed due to invalid argument: invalid argument: unrecognized capability: chromeOptions`|
|Resolved||Krinkle||T284696 Update Fresh from Node.js 10 LTS to Node.js 12 LTS|
|Resolved||hashar||T292729 TAR_ENTRY_ERROR ENOSPC: no space left on device|
|Resolved||hashar||T252071 Move all Wikimedia CI (WMCS integration project) instances from stretch to buster/bullseye|
|Resolved||hashar||T284774 Provide one or more Qemu agents in CI that use a newer version than 2.x|
|Resolved||None||T284507 Request increased quota for 'integration' Cloud VPS project|
|Resolved||Bstorm||T277078 Support Cinder for CI docker workers|
- Mentioned In
  - T252071: Move all Wikimedia CI (WMCS integration project) instances from stretch to buster/bullseye
  - T290615: integration-agent-qemu-1001 in project integration has corrupted disk / partition
  - T290372: New instances with ephemeral disk on Bullseye should start with ephemeral disk unused
- Mentioned Here
  - T290372: New instances with ephemeral disk on Bullseye should start with ephemeral disk unused
  - T250808: Decide how to run a test involving docker inside WMF CI
I created this with @LarsWirzenius as part of T250808, and we documented the provisioning steps at https://www.mediawiki.org/wiki/Continuous_integration/Qemu.
However, what we did not document is how the Debian base image for Qemu itself was made. This is something Lars made and uploaded for us, but I'm not sure what considerations and configurations went into that.
The next hurdle:
Notice: /Stage[main]/Labs_lvm/Exec[create-volume-group]/returns: /usr/local/sbin/make-instance-vg: lvm is not active on this host; unable to create a volume.
Error: '/usr/local/sbin/make-instance-vg '/dev/sda'' returned 1 instead of one of
Error: /Stage[main]/Labs_lvm/Exec[create-volume-group]/returns: change from 'notrun' to ['0'] failed: '/usr/local/sbin/make-instance-vg '/dev/sda'' returned 1 instead of one of
Info: Class[Labs_lvm]: Unscheduling all events on Class[Labs_lvm]
Notice: /Stage[main]/Profile::Labs::Lvm::Srv/Labs_lvm::Volume[second-local-disk]/Exec[available-space-second-local-disk]/returns: Traceback (most recent call last):
Notice: /Stage[main]/Profile::Labs::Lvm::Srv/Labs_lvm::Volume[second-local-disk]/Exec[available-space-second-local-disk]/returns: File "/usr/local/sbin/pv-free", line 17, in <module>
Notice: /Stage[main]/Profile::Labs::Lvm::Srv/Labs_lvm::Volume[second-local-disk]/Exec[available-space-second-local-disk]/returns: assert pvfree.endswith("G")
The switch from lvm to the new "ephemeral" (cinder-but-not-really-cinder) didn't quite work out because for some reason the space was already mounted at /mnt by default, which isn't meant to happen on new instances. But, that's a bug filed at T290372.
There was also a bug that even after unmounting this, cinderutils still wasn't able to discover the information it needed from the Puppet "facts".
For both issues, @Bstorm worked her magic to make this work for the qemu-1002 instance specifically.
Change 717732 abandoned by Krinkle:
[operations/puppet@production] ci: Fix profile::ci to be compatible with new empheral storage
I've un-picked this from integration-puppet-master-02 in favour of now-merged https://gerrit.wikimedia.org/r/719376
OK. qemu-1003 is now up in the same shape as qemu-1001 and qemu-1002 were, although with a smaller ephemeral disk (40G instead of 60G) but we were only using 18G of it so that should be fine.
integration-agent-qemu-1003 is ready but:
- we do not have the base Qemu image available. It was created by @LarsWirzenius with a script we do not have
- there are manual steps required https://www.mediawiki.org/wiki/Continuous_integration/Qemu
I've dropped a bullseye virtual machine image in my home directory on integration-agent-qemu-1003 and updated the instructions at https://www.mediawiki.org/wiki/Continuous_integration/Qemu#Runbook to reflect how I created it.
I initially didn't have sudo rights on that machine, so I wasn't able to set up the remaining bits in /srv/vm-images. Looks like I do now, so I'll work on that.
Looks like I have nuked /srv on integration-agent-qemu-1003 and all the steps done by @dpifke in November have been lost. I will try to puppetize the runbook at https://www.mediawiki.org/wiki/Continuous_integration/Qemu
I have spent most of my day on that. My conclusions:
- reinventing the wheel
- our doc has way too many manual steps: https://www.mediawiki.org/wiki/Continuous_integration/Qemu
- my port to shell replays the same quirks to boot the VM, install grub, etc., which is fragile: https://gerrit.wikimedia.org/r/c/operations/puppet/+/758514/1/modules/profile/files/ci/ci-create-qemu-image.sh
I think we are better off using the Debian nocloud images and customizing them with virt-customize. Example to generate the ssh server host key:
virt-customize -a debian-11-nocloud-amd64-20220121-894.qcow2 --run-command 'dpkg-reconfigure openssh-server'
There are a bunch of other subcommands to inject files and ssh keys or install packages, and that sounds more robust than our home-made steps.
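As a rough illustration of combining several of those subcommands in one pass, here is a minimal sketch. The key path, the package list and the choice of the root account are hypothetical placeholders, not what the Puppet change actually does. The script only prints the virt-customize command so the placeholders can be reviewed before running it.

```shell
#!/bin/sh
# Sketch only: IMAGE is the upstream image named in this thread; PUBKEY and
# PACKAGES are hypothetical placeholders to replace with real values.
set -eu

IMAGE=${IMAGE:-debian-11-nocloud-amd64-20220121-894.qcow2}
PUBKEY=${PUBKEY:-/path/to/ci-key.pub}
PACKAGES=${PACKAGES:-git,curl}

# Print a single virt-customize invocation that:
#   --install      installs a comma-separated list of packages
#   --ssh-inject   adds a public key to a user's authorized_keys
#   --run-command  regenerates the ssh server host keys
# Arguments: $1 image, $2 public key file, $3 package list.
customize_cmd() {
    echo virt-customize -a "$1" \
        --install "$3" \
        --ssh-inject "root:file:$2" \
        --run-command "'dpkg-reconfigure openssh-server'"
}

customize_cmd "$IMAGE" "$PUBKEY" "$PACKAGES"
```

Doing everything in one invocation means the guest appliance is only started once per image build.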
- image format
- the existing images on integration-qemu-1001 use the raw image format, which is not that modern
- our script in integration/config jjb/qemu-run.bash does a full copy of the image to avoid writing to the original image
Those can be addressed by using the qcow2 format, which lets one chain layers. We can keep the upstream image untouched, create a layer with our customizations, and have each CI job create a layer on top of that which is discarded at the end of the build. Layering is achieved using qemu-img create -b parent.qcow2 delta.qcow2. So tentatively we would have the following layers:
|Debian upstream image|
|The job layer|
The above would save us from having to maintain images from scratch, and it would speed up the jobs since they would no longer have to copy a 4G+ image.
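A minimal sketch of how such a chain could be built follows. The file names are hypothetical, and the script only prints the qemu-img commands. It passes -f qcow2 for the new overlay and -F qcow2 for the parent's format, since newer qemu-img releases warn or refuse to create an overlay when the backing format is not stated.

```shell
#!/bin/sh
# Sketch only: all image file names below are hypothetical.
set -eu

# Print the command that creates a copy-on-write overlay on top of a parent.
#   -b  backing (parent) image, which is never written to
#   -F  format of the parent image
#   -f  format of the new overlay
# Arguments: $1 parent image, $2 overlay to create.
layer_cmd() {
    echo qemu-img create -f qcow2 -b "$1" -F qcow2 "$2"
}

# Layer with our customizations, built once on top of the upstream image.
layer_cmd debian-upstream.qcow2 wmf-custom.qcow2

# Per-build layer, created by the job and deleted at the end of the build.
layer_cmd wmf-custom.qcow2 job.qcow2
```

Creating an overlay is near-instant because it starts out almost empty; only blocks the guest writes are stored in it.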
I have cherry-picked https://gerrit.wikimedia.org/r/c/operations/puppet/+/758514/ on integration-puppetmaster03, applied the new role::ci::agent::qemu to integration-agent-qemu03, and the images are created. Unfortunately the delta.qcow2 image refuses to boot and drops into the grub console, where I have to do something such as:
set prefix=(hd0,gpt3)/boot/grub
insmod normal
normal
And then it boots properly and I can ssh to it!
My investigation so far:
The debian image comes with:
<rescue> fdisk -l -o +UUID
Disk /dev/sda: 2 GiB, 2147483648 bytes, 4194304 sectors
Disk model: QEMU HARDDISK
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 5FD49DFB-D172-7F43-91D6-3A19BE736942

Device      Start     End Sectors  Size Type UUID
/dev/sda1  262144 4194270 3932127  1.9G Linu 84ED8658-3E53-CE47-BE88-7BF8A752B672
/dev/sda14   2048    8191    6144    3M BIOS D9197A0E-152A-B14F-83CB-3F6D2A0AAAAB
/dev/sda15   8192  262143  253952  124M EFI  282D06F9-8520-9A45-8832-8E65DD741EC0
So Linux is on /dev/sda1.
The script resizes that partition since it is too small:
$ qemu-img create -f qcow2 grown.qcow2 3G
$ virt-resize --align-first never --expand /dev/sda1 /srv/vm-images/debian-11-nocloud-amd64-20220121-894.qcow2 grown.qcow2
Summary of changes:
/dev/sda14: This partition will be left alone.
/dev/sda15: This partition will be left alone.
/dev/sda1: This partition will be resized from 1.9G to 2.9G. The filesystem ext4 on /dev/sda1 will be expanded using the ‘resize2fs’ method.
**********
[ 28.1] Setting up initial partition table on grown.qcow2
[ 54.1] Copying /dev/sda14
[ 54.6] Copying /dev/sda15
[ 55.5] Copying /dev/sda1
[ 83.2] Expanding /dev/sda1 (now /dev/sda3) using the ‘resize2fs’ method
If I understand it properly, the partitions are renumbered in the process (/dev/sda1 becomes /dev/sda3), which I guess causes Grub to be confused.
I went to check the grown image:
fdisk -l -o +UUID
Disk /dev/sda: 3 GiB, 3221225472 bytes, 6291456 sectors
Disk model: QEMU HARDDISK
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 5FD49DFB-D172-7F43-91D6-3A19BE736942

Device      Start     End Sectors  Size Type UUID
/dev/sda1    2048    8191    6144    3M BIOS D9197A0E-152A-B14F-83CB-3F6D2A0AAAAB
/dev/sda2    8192  262143  253952  124M EFI  282D06F9-8520-9A45-8832-8E65DD741EC0
/dev/sda3  262144 6288895 6026752  2.9G Linu 84ED8658-3E53-CE47-BE88-7BF8A752B672

Disk /dev/sdb: 4 GiB, 4294967296 bytes, 8388608 sectors
Disk model: QEMU HARDDISK
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
In /sysroot/boot/grub/grub.cfg, both images have their linux boot entries referring to an entirely different UUID, ec402599-5951-4f28-b3af-cd15b00cf1f7. For example:
linux /boot/vmlinuz-5.10.0-11-amd64 root=UUID=ec402599-5951-4f28-b3af-cd15b00cf1f7 ro console=tty0 console=ttyS0,115200 earlyprintk=ttyS0,115200 consoleblank=0
So I would imagine the idea is to have Grub look up the partition by UUID; since none of the partitions has a matching UUID, Grub would fall back to attempting to boot from /dev/sda1. That works on the upstream Debian image, which has the system on /dev/sda1, but it does not work after resizing since the system is now on /dev/sda3.
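The comparison above can be sketched as a small script. One caveat worth checking: the UUID column of fdisk lists GPT partition UUIDs, whereas root=UUID= on a kernel command line conventionally names the filesystem UUID (what blkid reports), so a mismatch between the two lists may be expected rather than a sign of breakage. The helper names below are hypothetical; the UUIDs and the kernel line are the ones quoted in this thread.

```shell
#!/bin/sh
# Sketch only: root_uuid and uuid_in_list are hypothetical helper names.
set -eu

# Extract the value of root=UUID=... from a kernel command line.
root_uuid() {
    printf '%s\n' "$1" | sed -n 's/.*root=UUID=\([0-9A-Fa-f-]*\).*/\1/p'
}

# Succeed if UUID $1 equals one of the remaining arguments, ignoring case.
uuid_in_list() {
    wanted=$(printf '%s' "$1" | tr 'A-F' 'a-f')
    shift
    for candidate in "$@"; do
        if [ "$(printf '%s' "$candidate" | tr 'A-F' 'a-f')" = "$wanted" ]; then
            return 0
        fi
    done
    return 1
}

cmdline='linux /boot/vmlinuz-5.10.0-11-amd64 root=UUID=ec402599-5951-4f28-b3af-cd15b00cf1f7 ro console=tty0'
uuid=$(root_uuid "$cmdline")

# GPT partition UUIDs as listed by fdisk for the grown image.
if uuid_in_list "$uuid" \
    D9197A0E-152A-B14F-83CB-3F6D2A0AAAAB \
    282D06F9-8520-9A45-8832-8E65DD741EC0 \
    84ED8658-3E53-CE47-BE88-7BF8A752B672; then
    echo "$uuid matches a GPT partition UUID"
else
    echo "$uuid matches none of the GPT partition UUIDs"
fi
```

Running blkid inside the guest (or virt-filesystems --uuid on the image, if available) would show whether ec402599-5951-4f28-b3af-cd15b00cf1f7 is simply the ext4 filesystem UUID.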
When booting into the grub rescue shell:
Welcome to GRUB!
error: unknown filesystem.
grub rescue> ls
(hd0) (hd0,gpt3) (hd0,gpt2) (hd0,gpt1) (fd0)
grub rescue> set
cmdpath=(hd0)
prefix=(hd0,gpt1)/boot/grub
root=hd0,gpt1
And it boots with:
grub rescue> set prefix=(hd0,gpt3)/boot/grub
insmod normal
normal
I have found the fix! Grub has to be reinstalled, so I have added:
virt-customize --run-command 'grub-install /dev/sda' -a delta.qcow2
I have wiped the qcow2 image and run puppet.
I have updated the job with https://gerrit.wikimedia.org/r/c/integration/config/+/759499 and triggered a build at https://integration.wikimedia.org/ci/job/fresh-test/218/console . The docker pull is very slow, potentially due to disk writes and the qcow layer being expanded.
I have rolled back the jobs, put 1003 offline and 1001 online.
To be continued later!
I am back from vacation. The wall I had hit was that the docker pull was extremely slow, due to either a network issue or disk I/O. I suspect the qcow2 image has to be resized on each write, which would slow it down. qemu-img can preallocate the whole image when it is created, which might speed it up.
The magic is to use -snapshot, which disables writing back to the image, and to set caching to unsafe so that all writes are kept in memory. I have also made qemu use 7 vCPUs rather than 1, which speeds up some operations :]
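A minimal sketch of what such an invocation could look like is below. The real options live in jjb/qemu-run.bash, which is not shown in this thread, so the image name and the exact -drive string here are assumptions. The script only prints the command line.

```shell
#!/bin/sh
# Sketch only: the image name and the -drive option string are assumptions,
# not a copy of what jjb/qemu-run.bash does.
set -eu

# Print a qemu command line for a throwaway CI guest.
#   -snapshot        guest writes go to a temporary overlay, not to the image
#   cache=unsafe     flush requests from the guest are ignored
#   -smp N           number of vCPUs
# Arguments: $1 image file, $2 vCPU count (defaults to 7).
qemu_cmdline() {
    image=$1
    vcpus=${2:-7}
    echo "qemu-system-x86_64 -snapshot -smp $vcpus -drive file=$image,format=qcow2,cache=unsafe"
}

qemu_cmdline delta.qcow2
```

As far as I know, -snapshot stores guest writes in a temporary file rather than strictly in RAM; with cache=unsafe those writes are never forced to disk, so in practice they mostly stay in the host page cache. That is only acceptable because the guest disk is discarded at the end of the build anyway.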