Details
Status | Assigned | Task |
---|---|---|
Resolved | zeljkofilipin | T284730 selenium-daily-beta-CirrusSearch fails with `Request failed due to invalid argument: invalid argument: unrecognized capability: chromeOptions` |
Resolved | Krinkle | T284696 Update Fresh from Node.js 10 LTS to Node.js 12 LTS |
Resolved | hashar | T292729 TAR_ENTRY_ERROR ENOSPC: no space left on device |
Resolved | hashar | T252071 Move all Wikimedia CI (WMCS integration project) instances from stretch to buster/bullseye |
Resolved | hashar | T284774 Provide one or more Qemu agents in CI that use a newer version than 2.x |
Resolved | None | T284507 Request increased quota for 'integration' Cloud VPS project |
Resolved | Bstorm | T277078 Support Cinder for CI docker workers |
Event Timeline
I created this with @LarsWirzenius as part of T250808, and we documented the provisioning steps at https://www.mediawiki.org/wiki/Continuous_integration/Qemu.
However, what we did not document is how the Debian base image for Qemu itself was made. This is something Lars made and uploaded for us, but I'm not sure what considerations and configurations went into it.
Mentioned in SAL (#wikimedia-releng) [2021-09-03T23:02:41Z] <Krinkle> Creating integration-agent-qemu-1002 (Debian 11 Bullseye, g3.cores8.ram24.disk20.ephemeral40.4xiops), ref T284774
Change 717687 had a related patch set uploaded (by Krinkle; author: Krinkle):
[operations/puppet@production] ci: Add 'bulleye' to docker lsbdistcodename hack
The next hurdle:
Notice: /Stage[main]/Labs_lvm/Exec[create-volume-group]/returns: /usr/local/sbin/make-instance-vg: lvm is not active on this host; unable to create a volume.
Error: '/usr/local/sbin/make-instance-vg '/dev/sda'' returned 1 instead of one of [0]
Error: /Stage[main]/Labs_lvm/Exec[create-volume-group]/returns: change from 'notrun' to ['0'] failed: '/usr/local/sbin/make-instance-vg '/dev/sda'' returned 1 instead of one of [0]
Info: Class[Labs_lvm]: Unscheduling all events on Class[Labs_lvm]
Notice: /Stage[main]/Profile::Labs::Lvm::Srv/Labs_lvm::Volume[second-local-disk]/Exec[available-space-second-local-disk]/returns: Traceback (most recent call last):
Notice: /Stage[main]/Profile::Labs::Lvm::Srv/Labs_lvm::Volume[second-local-disk]/Exec[available-space-second-local-disk]/returns: File "/usr/local/sbin/pv-free", line 17, in <module>
Notice: /Stage[main]/Profile::Labs::Lvm::Srv/Labs_lvm::Volume[second-local-disk]/Exec[available-space-second-local-disk]/returns: assert pvfree.endswith("G")
Change 717732 had a related patch set uploaded (by Krinkle; author: Krinkle):
[operations/puppet@production] ci: Fix profile::ci to be compatible with new empheral storage
The switch from lvm to the new "ephemeral" (cinder-but-not-really-cinder) didn't quite work out because for some reason the space was already mounted at /mnt by default, which isn't meant to happen on new instances. But, that's a bug filed at T290372.
There was also a bug that even after unmounting this, cinderutils still wasn't able to discover the information it needed from the Puppet "facts".
For both of these issues, @Bstorm worked her magic to make this work for the qemu-1002 instance specifically.
Signing back over to @dpifke. The integration-agent-qemu-1002 instance should be ready now with the same resources and provisioning as qemu-1001. (This does not yet include the qemu and guestfs packages, which may be installed ad hoc with sudo.)
Change 717732 abandoned by Krinkle:
[operations/puppet@production] ci: Fix profile::ci to be compatible with new empheral storage
Reason:
I've un-picked this from integration-puppet-master-02 in favour of now-merged https://gerrit.wikimedia.org/r/719376
Mentioned in SAL (#wikimedia-releng) [2021-09-17T18:08:02Z] <Krinkle> Re-recreating qemu-1002 as integration-agent-qemu-1003 (Debian 11 Bullseye, g3.cores8.ram24.disk20.ephemeral40.4xiops), ref T284774
OK. qemu-1003 is now up in the same shape as qemu-1001 and qemu-1002 were, although with a smaller ephemeral disk (40G instead of 60G) but we were only using 18G of it so that should be fine.
Change 717687 merged by Giuseppe Lavagetto:
[operations/puppet@production] ci: Add 'bullseye' to docker lsbdistcodename hack
integration-agent-qemu-1003 is ready but:
- we do not have the base Qemu image available. It was created by @LarsWirzenius with a script we do not have
- manual steps are required, see https://www.mediawiki.org/wiki/Continuous_integration/Qemu
I've dropped a bullseye virtual machine image in my home directory on integration-agent-qemu-1003 and updated the instructions at https://www.mediawiki.org/wiki/Continuous_integration/Qemu#Runbook to reflect how I created it.
I initially didn't have sudo rights on that machine, so I wasn't able to set up the remaining bits in /srv/vm-images. Looks like I do now, so I'll work on that.
Mentioned in SAL (#wikimedia-releng) [2022-01-31T10:49:02Z] <hashar> Added integration-agent-qemu-1003 with label Qemu # T284774
Looks like I have nuked /srv on integration-agent-qemu-1003 and all the steps done by @dpifke in November have been lost. I will try to puppetize the runbook at https://www.mediawiki.org/wiki/Continuous_integration/Qemu
Change 758514 had a related patch set uploaded (by Hashar; author: Hashar):
[operations/puppet@production] ci: Qemu image and snapshot creation
I have spent most of my day on that. My conclusions:
- reinventing the wheel
  - our doc has way too many manual steps: https://www.mediawiki.org/wiki/Continuous_integration/Qemu
  - my port to shell replays the same quirks to boot the VM, install grub, etc., which is fragile: https://gerrit.wikimedia.org/r/c/operations/puppet/+/758514/1/modules/profile/files/ci/ci-create-qemu-image.sh
I think we are better off using the Debian nocloud images and customizing them with virt-customize. Example to regenerate the SSH server host keys:
virt-customize -a debian-11-nocloud-amd64-20220121-894.qcow2 --run-command 'dpkg-reconfigure openssh-server'
There are a bunch of other options to inject files or SSH keys and to install packages, which sounds more robust than our home-made steps.
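As a rough sketch of that approach, several customizations can be combined in a single virt-customize call. The package list and the public key path below are placeholders for illustration, not what the puppet patch actually does. The script only prints the command unless RUN=1 is set.

```shell
#!/bin/sh
# Sketch only. The image name comes from the example above; the package
# list and key path are hypothetical placeholders.
IMAGE="${IMAGE:-debian-11-nocloud-amd64-20220121-894.qcow2}"
PACKAGES="${PACKAGES:-docker.io,git}"        # hypothetical package list
PUBKEY="${PUBKEY:-/srv/vm-images/ci.pub}"    # hypothetical public key path

# Print a virt-customize command line that installs packages, injects an
# SSH public key for root and regenerates the SSH host keys.
customize_cmd() {
    echo "virt-customize -a $IMAGE" \
        "--install $PACKAGES" \
        "--ssh-inject root:file:$PUBKEY" \
        "--run-command 'dpkg-reconfigure openssh-server'"
}

if [ "${RUN:-0}" = "1" ]; then
    eval "$(customize_cmd)"    # actually run it (needs libguestfs-tools)
else
    customize_cmd              # dry run: just show the command
fi
```

Running it with RUN=1 requires the libguestfs tools on the host; the default dry run is safe anywhere.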
- image format
  - the existing images on integration-qemu-1001 use the raw image format, which is not that modern
  - our script in integration/config, jjb/qemu-run.bash, does a full copy of the image to avoid writing to the original image
Both can be addressed by using the qcow2 format, which lets one chain layers. We can keep the upstream image untouched, create a layer with our customizations, and have each CI job create a layer on top of that which is discarded at the end of the build. Layering is achieved using qemu-img create -f qcow2 -b parent.qcow2 delta.qcow2. So tentatively we would have the following layers:
- Debian upstream image
- Our customizations
- The job layer
The above would save us from having to maintain images from scratch and speed up the jobs, since they would no longer have to copy a 4G+ image.
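The three-layer chain could be built roughly as follows. This is a dry-run sketch: it prints the qemu-img commands rather than running them, and job.qcow2 is a hypothetical name for the per-build layer. The -F option declares the format of the parent image, which recent qemu-img versions insist on when -b is used.

```shell
#!/bin/sh
# Sketch of the layer chain. delta.qcow2 holds our customizations;
# job.qcow2 is a hypothetical name for the throwaway per-build layer.
BASE="${BASE:-debian-11-nocloud-amd64-20220121-894.qcow2}"

# Print the command creating a copy-on-write overlay ($2) on top of a
# parent image ($1). Only the overlay is written to afterwards.
overlay_cmd() {
    parent="$1"
    child="$2"
    echo "qemu-img create -f qcow2 -b $parent -F qcow2 $child"
}

overlay_cmd "$BASE" delta.qcow2       # layer with our customizations
overlay_cmd delta.qcow2 job.qcow2     # layer discarded at end of build
echo "qemu-img info --backing-chain job.qcow2"   # inspect the whole chain
```

Since an overlay starts out nearly empty, creating job.qcow2 takes a fraction of the time a full copy of the image would.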
Change 759499 had a related patch set uploaded (by Hashar; author: Hashar):
[integration/config@master] jjb: adjust qemu-run.bash to use a qcow2 image
I have cherry-picked https://gerrit.wikimedia.org/r/c/operations/puppet/+/758514/ on integration-puppetmaster03, applied the new role::ci::agent::qemu to integration-agent-qemu-1003, and the images are created. Unfortunately the delta.qcow2 image refuses to boot from grub; in the grub console I have to do something such as:
set prefix=(hd0,gpt3)/boot/grub
insmod normal
normal
And then it boots properly and I can ssh to it!
My investigation so far:
The Debian image comes with:
<rescue> fdisk -l -o +UUID
Disk /dev/sda: 2 GiB, 2147483648 bytes, 4194304 sectors
Disk model: QEMU HARDDISK
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 5FD49DFB-D172-7F43-91D6-3A19BE736942

Device      Start     End Sectors  Size Type UUID
/dev/sda1  262144 4194270 3932127  1.9G Linu 84ED8658-3E53-CE47-BE88-7BF8A752B672
/dev/sda14   2048    8191    6144    3M BIOS D9197A0E-152A-B14F-83CB-3F6D2A0AAAAB
/dev/sda15   8192  262143  253952  124M EFI  282D06F9-8520-9A45-8832-8E65DD741EC0
So Linux is on /dev/sda1.
The script resizes that partition since it is too small:
$ qemu-img create -f qcow2 grown.qcow2 3G
$ virt-resize --align-first never --expand /dev/sda1 /srv/vm-images/debian-11-nocloud-amd64-20220121-894.qcow2 grown.qcow2
Summary of changes:
/dev/sda14: This partition will be left alone.
/dev/sda15: This partition will be left alone.
/dev/sda1: This partition will be resized from 1.9G to 2.9G. The filesystem ext4 on /dev/sda1 will be expanded using the ‘resize2fs’ method.
**********
[ 28.1] Setting up initial partition table on grown.qcow2
[ 54.1] Copying /dev/sda14
[ 54.6] Copying /dev/sda15
[ 55.5] Copying /dev/sda1
[ 83.2] Expanding /dev/sda1 (now /dev/sda3) using the ‘resize2fs’ method
If I understand it properly, the partitions are renumbered in the process:
Debian | Grown image |
---|---|
/dev/sda14 | /dev/sda1 |
/dev/sda15 | /dev/sda2 |
/dev/sda1 | /dev/sda3 |
Which I guess causes Grub to be confused.
I went to check the grown image:
fdisk -l -o +UUID
Disk /dev/sda: 3 GiB, 3221225472 bytes, 6291456 sectors
Disk model: QEMU HARDDISK
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 5FD49DFB-D172-7F43-91D6-3A19BE736942

Device      Start     End Sectors  Size Type UUID
/dev/sda1    2048    8191    6144    3M BIOS D9197A0E-152A-B14F-83CB-3F6D2A0AAAAB
/dev/sda2    8192  262143  253952  124M EFI  282D06F9-8520-9A45-8832-8E65DD741EC0
/dev/sda3  262144 6288895 6026752  2.9G Linu 84ED8658-3E53-CE47-BE88-7BF8A752B672

Disk /dev/sdb: 4 GiB, 4294967296 bytes, 8388608 sectors
Disk model: QEMU HARDDISK
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
In /sysroot/boot/grub/grub.cfg, both images have the linux boot entry referring to an entirely different UUID, ec402599-5951-4f28-b3af-cd15b00cf1f7, for example:
linux /boot/vmlinuz-5.10.0-11-amd64 root=UUID=ec402599-5951-4f28-b3af-cd15b00cf1f7 ro console=tty0 console=ttyS0,115200 earlyprintk=ttyS0,115200 consoleblank=0
So I would imagine the idea is to have Grub look up the root partition by UUID. None of the partitions has a matching UUID, so Grub would fall back to booting from /dev/sda1. That works on the upstream Debian image, which has the system on /dev/sda1, but not after resizing, since the system is now on /dev/sda3.
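One way to make that comparison less error-prone is to extract the root=UUID= value and check it against the listed UUIDs mechanically, as in the sketch below (function names are made up for illustration). A caveat on the hypothesis itself: the UUID column printed by fdisk holds GPT partition UUIDs, whereas root=UUID= refers to a filesystem UUID (the value blkid reports as UUID), so a mismatch between the two may simply be expected.

```shell
#!/bin/sh
# Sketch: helper names are hypothetical; the UUIDs are copied from the
# fdisk and grub.cfg output quoted above.

# Extract the root=UUID=... value from a kernel command line.
root_uuid() {
    printf '%s\n' "$1" | sed -n 's/.*root=UUID=\([0-9A-Fa-f-]*\).*/\1/p'
}

# Return 0 if the first argument equals any later argument, ignoring case.
uuid_in_list() {
    needle=$(printf '%s' "$1" | tr 'A-F' 'a-f')
    shift
    for candidate in "$@"; do
        [ "$(printf '%s' "$candidate" | tr 'A-F' 'a-f')" = "$needle" ] && return 0
    done
    return 1
}

line='linux /boot/vmlinuz-5.10.0-11-amd64 root=UUID=ec402599-5951-4f28-b3af-cd15b00cf1f7 ro console=tty0'
uuid=$(root_uuid "$line")
echo "root UUID in grub.cfg: $uuid"

if uuid_in_list "$uuid" \
    84ED8658-3E53-CE47-BE88-7BF8A752B672 \
    D9197A0E-152A-B14F-83CB-3F6D2A0AAAAB \
    282D06F9-8520-9A45-8832-8E65DD741EC0
then
    echo "matches a partition UUID"
else
    echo "no partition UUID matches"
fi
```

Comparing against the output of blkid inside the guest instead would show whether the filesystem on the root partition carries the UUID that grub.cfg expects.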
:-\
When booting into the grub rescue shell:
Welcome to GRUB!
error: unknown filesystem.
grub rescue> ls
(hd0) (hd0,gpt3) (hd0,gpt2) (hd0,gpt1) (fd0)
grub rescue> set
cmdpath=(hd0)
prefix=(hd0,gpt1)/boot/grub
root=hd0,gpt1
And it boots with:
grub rescue> set prefix=(hd0,gpt3)/boot/grub
grub rescue> insmod normal
grub rescue> normal
...
I have put integration-agent-qemu-1003 offline and 1001 back online. Will have to find out what is wrong with Grub but that will be for later :-\
I have found the fix! Grub has to be reinstalled, so I have added:
virt-customize --run-command 'grub-install /dev/sda' -a delta.qcow2
I have wiped the qcow2 image and ran puppet.
I have put the legacy instance qemu-1001 offline and put the new one, qemu-1003, online.
I have updated the job with https://gerrit.wikimedia.org/r/c/integration/config/+/759499 and triggered a build at https://integration.wikimedia.org/ci/job/fresh-test/218/console . The docker pull is very slow, potentially due to disk write and the qcow layer being expanded.
I have rolled back the jobs, put 1003 offline and 1001 online.
To be continued later!
I am back from vacation. The wall I hit was that the docker pull was extremely slow, due to either a network issue or disk I/O. I suspect the qcow2 image has to grow on each write, which would slow it down. qemu-img can preallocate the whole image when it is created, which might speed it up.
Change 762482 had a related patch set uploaded (by Hashar; author: Hashar):
[integration/config@master] qemu-run: use one line per qemu-system-x86_64 option
Change 762483 had a related patch set uploaded (by Hashar; author: Hashar):
[integration/config@master] qemu-run: avoid copying image and faster disk IO
Change 762484 had a related patch set uploaded (by Hashar; author: Hashar):
[integration/config@master] qemu-run: allocate more CPU to the VM
The magic is to use -snapshot, which disables writing back to the image, and to set caching to unsafe so that all writes are kept in memory. I have also made qemu use 7 vCPUs rather than 1, which speeds up some operations :]
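For illustration, the resulting invocation could look roughly like the sketch below, written with one option per line as in the first patch. It is not the actual content of jjb/qemu-run.bash; the image path is an assumption, and options the real script needs (memory, networking, display) are left out.

```shell
#!/bin/sh
# Sketch only, not the real qemu-run.bash. The image path is hypothetical.
IMAGE="${IMAGE:-/srv/vm-images/delta.qcow2}"

# Print the qemu command line, one option per line.
#   -snapshot     writes go to a temporary file, never back to the image
#   cache=unsafe  flush requests from the guest are ignored
#   -smp 7        give the guest 7 vCPUs
qemu_cmd() {
    cat <<EOF
qemu-system-x86_64 \\
    -snapshot \\
    -drive file=$IMAGE,format=qcow2,cache=unsafe \\
    -smp 7
EOF
}

qemu_cmd
```

Ignoring flushes is only reasonable here because the disk state is thrown away after every build; on a VM whose data matters it risks corruption if the host crashes.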
Change 762482 merged by jenkins-bot:
[integration/config@master] qemu-run: use one line per qemu-system-x86_64 option
Change 762483 merged by jenkins-bot:
[integration/config@master] qemu-run: avoid copying image and faster disk IO
Change 762484 merged by jenkins-bot:
[integration/config@master] qemu-run: allocate more CPU to the VM
Change 759499 merged by jenkins-bot:
[integration/config@master] jjb: adjust qemu-run.bash to use a qcow2 image
Change 763525 had a related patch set uploaded (by Hashar; author: Hashar):
[fresh@master] Switch tests from node10 to node12
Change 763525 merged by jenkins-bot:
[fresh@master] test: Switch integration target from fresh-node10 to fresh-node12
Mentioned in SAL (#wikimedia-releng) [2022-02-23T08:37:28Z] <hashar> Removing Stretch based integration-agent-qemu-1001 # T284774
Change 758514 merged by Jbond:
[operations/puppet@production] ci: Qemu image and snapshot creation
@jbond kindly reviewed the puppet patch and spotted a few more issues. It is fully deployed now.
I have updated https://www.mediawiki.org/wiki/Continuous_integration/Qemu
Solved!