Upgrade gitlab-docker-runner to latest debian version
Closed, Resolved | Public

Description

A while ago, via T321736, we put together a virtual machine with a mechanism to be able to build arbitrary artifacts via Dockerfiles.

The WMCS team has announced that we need to do the yearly claim of projects, and I just did. But I also noticed that the instance is running Bullseye and that has now been deprecated.

AFAIU, airflow-dags no longer uses this mechanism, but conda-analytics still does.

In this task we should:

  • double check whether we still rely on this mechanism, or whether we can move to Blubber. We do still rely on it for conda-analytics, and Blubber is not built for this kind of use case.
  • if we still need this mechanism, upgrade this GitLab runner to the latest LTS version and remove the old one
  • regardless, we should clean up projects that used this mechanism

Details

Related Changes in GitLab:
Title | Reference | Author | Source Branch | Dest Branch
Set DOCKER_API_VERSION on docker builds. | repos/data-engineering/workflow_utils!57 | xcollazo | fix-docker | main
Bump workflow_utils CI to pickup bullseye compatible steps. | repos/data-engineering/conda-analytics!61 | xcollazo | bump-workflow-utils | main
Dummy changelog to test new CI gitlab runner. | repos/data-engineering/conda-analytics!60 | xcollazo | test-new-ci | main

Event Timeline

Following T321736#8353296 we added a 60GB volume to keep all the docker images:

ssh -J xcollazo@bastion.wmcloud.org xcollazo@gitlab-docker-runner-v2.analytics.eqiad1.wikimedia.cloud

$ hostname -f
gitlab-docker-runner-v2.analytics.eqiad1.wikimedia.cloud

$ lsblk -f
NAME    FSTYPE FSVER LABEL UUID                                 FSAVAIL FSUSE% MOUNTPOINTS
sda                                                                            
├─sda1  ext4   1.0         e3414dec-a199-4bdd-a6c8-22c2d2c7364d   16.3G    12% /
├─sda14                                                                        
└─sda15 vfat   FAT16       8C7E-9039                             112.1M     9% /boot/efi
sdb                                        

sudo mkfs -t ext4 /dev/sdb
sudo mkdir -p /mnt/docker-scratch
sudo mount -t auto /dev/sdb /mnt/docker-scratch

xcollazo@gitlab-docker-runner-v2:~$ lsblk -f
NAME    FSTYPE FSVER LABEL UUID                                 FSAVAIL FSUSE% MOUNTPOINTS
sda                                                                            
├─sda1  ext4   1.0         e3414dec-a199-4bdd-a6c8-22c2d2c7364d   16.3G    12% /
├─sda14                                                                        
└─sda15 vfat   FAT16       8C7E-9039                             112.1M     9% /boot/efi
sdb     ext4   1.0         e9d0b85d-9385-43b2-99bd-b5393b0d8e51   55.7G     0% /mnt/docker-scratch
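
Since `mkfs` overwrites whatever is already on the target device, a small guard can be useful before formatting a newly attached volume. The following is only a sketch: the `needs_format` helper is hypothetical and not part of the steps above, and it assumes that an empty FSTYPE column in `lsblk` (as shown for sdb before formatting) means the device has no filesystem yet.

```shell
# Hypothetical helper: succeed only when the reported filesystem type is empty,
# i.e. the device looks unformatted and can be passed to mkfs.
needs_format() {
    [ -z "$1" ]
}

# On the VM, the filesystem type would come from lsblk, for example:
#   fstype="$(lsblk -no FSTYPE /dev/sdb)"
# Literal values are used here so the sketch runs anywhere.
for fstype in "" "ext4"; do
    if needs_format "$fstype"; then
        echo "fstype='$fstype': no filesystem found, mkfs can proceed"
    else
        echo "fstype='$fstype': already formatted, skipping mkfs"
    fi
done
```

With a guard like this, re-running the setup steps on the same VM would not wipe the docker images already stored on the volume.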

Even though applying the docker puppet class had failed before (T410083#11372512), applying it now (via configs in Horizon) was successful:

$ docker --version
Docker version 20.10.24+dfsg1, build 297e128

And we are using the 60GB volume:

$ sudo cat /etc/docker/daemon.json
{
  "data-root": "/mnt/docker-scratch/docker"
}

Let's install the GitLab runner package so the VM can connect to GitLab, as per T321736#8351628:

curl -L "https://packages.gitlab.com/install/repositories/runner/gitlab-runner/script.deb.sh" | sudo bash

sudo gitlab-runner register  --url https://gitlab.wikimedia.org  --token *************
Runtime platform                                    arch=amd64 os=linux pid=148356 revision=bda84871 version=18.5.0

Created missing unique system ID                    system_id=*********
Enter the GitLab instance URL (for example, https://gitlab.com/):
[https://gitlab.wikimedia.org]: 
Verifying runner... is valid                        correlation_id=***** runner=*******
Enter a name for the runner. This is stored only in the local config.toml file:
[gitlab-docker-runner-v2]: 
Enter an executor: docker-windows, docker-autoscaler, ssh, virtualbox, docker, docker+machine, kubernetes, instance, custom, shell, parallels:
docker
Enter the default Docker image (for example, ruby:3.3):
docker-registry.wikimedia.org/bookworm:latest
Runner registered successfully. Feel free to start it, but if it's running already the config should be automatically reloaded!
 
Configuration (with the authentication token) was saved in "/etc/gitlab-runner/config.toml"

Similar to T321736#8357399:

$ sudo cat /etc/gitlab-runner/config.toml

concurrent = 1
check_interval = 0
connection_max_age = "15m0s"
shutdown_timeout = 0

[session_server]
  session_timeout = 1800

[[runners]]
  name = "gitlab-docker-runner-v2"
  url = "https://gitlab.wikimedia.org"
  id = 1538
  token = "***********"
  token_obtained_at = 2025-11-14T17:18:49Z
  token_expires_at = 0001-01-01T00:00:00Z
  executor = "docker"
  [runners.cache]
    MaxUploadedArchiveSize = 0
    [runners.cache.s3]
    [runners.cache.gcs]
    [runners.cache.azure]
  [runners.docker]
    tls_verify = false
    image = "docker-registry.wikimedia.org/bookworm:latest"
    privileged = false
    disable_entrypoint_overwrite = false
    oom_kill_disable = false
    disable_cache = false
    volumes = ["/cache"]
    shm_size = 0
    network_mtu = 0
    # instead of docker on docker, which creates "child" containers that require privileged mode,
    # here we mount the docker socket so that we can launch "sibling" docker containers
    # see here for why this is more sound: https://jpetazzo.github.io/2015/09/03/do-not-use-docker-in-docker-for-ci/
    volumes = ["/var/run/docker.sock:/var/run/docker.sock"]

Currently stuck on what seems to be a proxying issue:

Failure example: https://gitlab.wikimedia.org/repos/data-engineering/conda-analytics/-/jobs/678835

$ mkdir -p /usr/share/man/man1
$ apt update
WARNING: apt does not have a stable CLI interface. Use with caution in scripts.
Get:1 http://mirrors.wikimedia.org/debian bullseye InRelease [75.1 kB]
Get:2 http://apt.wikimedia.org/wikimedia bullseye-wikimedia InRelease [158 kB]
Get:3 http://security.debian.org/debian-security bullseye-security InRelease [27.2 kB]
Get:4 http://security.debian.org/debian-security bullseye-security/main amd64 Packages [422 kB]
Err:1 http://mirrors.wikimedia.org/debian bullseye InRelease
  Connection timed out [IP: 208.80.154.139 80]
Err:2 http://apt.wikimedia.org/wikimedia bullseye-wikimedia InRelease
  Connection timed out [IP: 208.80.154.10 80]
Get:5 http://mirrors.wikimedia.org/debian bullseye-updates InRelease [44.0 kB]
Err:5 http://mirrors.wikimedia.org/debian bullseye-updates InRelease
  Connection timed out [IP: 208.80.154.139 80]
Fetched 450 kB in 60s (7480 B/s)
Reading package lists...
Building dependency tree...
Reading state information...
All packages are up to date.
W: Failed to fetch http://mirrors.wikimedia.org/debian/dists/bullseye/InRelease  Connection timed out [IP: 208.80.154.139 80]
W: Failed to fetch http://mirrors.wikimedia.org/debian/dists/bullseye-updates/InRelease  Connection timed out [IP: 208.80.154.139 80]
W: Failed to fetch http://apt.wikimedia.org/wikimedia/dists/bullseye-wikimedia/InRelease  Connection timed out [IP: 208.80.154.10 80]
W: Some index files failed to download. They have been ignored, or old ones used instead.

The wikimedia debian mirrors seem to not be reachable for some reason.

Perhaps @BTullis has seen this before?

xcollazo renamed this task from Check what projects, if any, are still using gitlab-docker-runner to Upgrade gitlab-docker-runner to latest debian version. Fri, Nov 14, 8:08 PM
xcollazo changed the task status from Open to In Progress.
xcollazo triaged this task as Medium priority.

From the Cloud VPS instance itself we can reach the offending hosts:

$ hostname -f
gitlab-docker-runner-v2.analytics.eqiad1.wikimedia.cloud

$ ping 208.80.154.139
PING 208.80.154.139 (208.80.154.139) 56(84) bytes of data.
64 bytes from 208.80.154.139: icmp_seq=1 ttl=60 time=0.824 ms
64 bytes from 208.80.154.139: icmp_seq=2 ttl=60 time=0.424 ms
64 bytes from 208.80.154.139: icmp_seq=3 ttl=60 time=0.432 ms
64 bytes from 208.80.154.139: icmp_seq=4 ttl=60 time=0.492 ms
^C
--- 208.80.154.139 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3071ms
rtt min/avg/max/mdev = 0.424/0.543/0.824/0.164 ms

So it seems that the jobs, i.e. the docker containers themselves, can't reach them...

OK, I can reproduce this directly by manually creating a container from the docker image:

xcollazo@gitlab-docker-runner-v2:~$ sudo docker image list
REPOSITORY                                                          TAG              IMAGE ID       CREATED       SIZE
registry.gitlab.com/gitlab-org/gitlab-runner/gitlab-runner-helper   x86_64-v18.5.0   e3cc82a0845d   2 hours ago   94.4MB
docker-registry.wikimedia.org/bullseye                              20251019         807cec67eba7   3 weeks ago   80.7MB

xcollazo@gitlab-docker-runner-v2:~$ sudo docker run -it 807cec67eba7 bash
root@22cd011ab66a:/# ping 208.80.154.139
bash: ping: command not found

root@22cd011ab66a:/bin# apt update
Get:1 http://mirrors.wikimedia.org/debian bullseye InRelease [75.1 kB]
Get:2 http://apt.wikimedia.org/wikimedia bullseye-wikimedia InRelease [158 kB]
Get:3 http://security.debian.org/debian-security bullseye-security InRelease [27.2 kB]                                 
Get:4 http://security.debian.org/debian-security bullseye-security/main amd64 Packages [422 kB]                        
Err:1 http://mirrors.wikimedia.org/debian bullseye InRelease                                                           
  Connection timed out [IP: 208.80.154.139 80]
Get:5 http://mirrors.wikimedia.org/debian bullseye-updates InRelease [44.0 kB]                           
Err:2 http://apt.wikimedia.org/wikimedia bullseye-wikimedia InRelease                                    
  Connection timed out [IP: 208.80.154.10 80]
Ign:4 http://security.debian.org/debian-security bullseye-security/main amd64 Packages               
Get:4 http://security.debian.org/debian-security bullseye-security/main amd64 Packages [422 kB]                        
Err:5 http://mirrors.wikimedia.org/debian bullseye-updates InRelease                                                   
  Connection timed out [IP: 208.80.154.139 80]
Ign:4 http://security.debian.org/debian-security bullseye-security/main amd64 Packages
Get:4 http://security.debian.org/debian-security bullseye-security/main amd64 Packages [552 kB]
Ign:4 http://security.debian.org/debian-security bullseye-security/main amd64 Packages                                 
Get:4 http://security.debian.org/debian-security bullseye-security/main amd64 Packages [552 kB]
Fetched 513 kB in 1min 31s (5629 B/s)
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
All packages are up to date.
W: Failed to fetch http://mirrors.wikimedia.org/debian/dists/bullseye/InRelease  Connection timed out [IP: 208.80.154.139 80]
W: Failed to fetch http://mirrors.wikimedia.org/debian/dists/bullseye-updates/InRelease  Connection timed out [IP: 208.80.154.139 80]
W: Failed to fetch http://apt.wikimedia.org/wikimedia/dists/bullseye-wikimedia/InRelease  Connection timed out [IP: 208.80.154.10 80]
W: Some index files failed to download. They have been ignored, or old ones used instead.

OK, I think I found the root cause: an MTU mismatch between the host network and the bridge network that docker creates (ens3 has MTU 1450, docker0 has MTU 1500):

xcollazo@gitlab-docker-runner-v2:~$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host noprefixroute 
       valid_lft forever preferred_lft forever
2: ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc fq_codel state UP group default qlen 1000
    link/ether fa:16:3e:5c:86:83 brd ff:ff:ff:ff:ff:ff
    altname enp0s3
    inet 172.16.20.94/21 metric 100 brd 172.16.23.255 scope global dynamic ens3
       valid_lft 67821sec preferred_lft 67821sec
    inet6 2a02:ec80:a000:1::13f/128 scope global dynamic noprefixroute 
       valid_lft 73212sec preferred_lft 73212sec
    inet6 fe80::f816:3eff:fe5c:8683/64 scope link 
       valid_lft forever preferred_lft forever
54: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default 
    link/ether 02:42:d7:dd:02:ab brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
       valid_lft forever preferred_lft forever
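
A plausible reading of why the earlier ping succeeded while apt timed out: the default ping sends small packets (84 bytes in the output above) that fit under either MTU, whereas package downloads involve full-size packets sized for the 1500-byte bridge, which do not fit through the 1450-byte host interface. The sketch below shows one way to flag such a mismatch. The `check_bridge_mtu` helper is hypothetical, and the sysfs paths in the comment assume the interface names from the `ip a` output.

```shell
# Hypothetical helper: report whether a bridge MTU exceeds the host uplink MTU.
# Arguments: host MTU, bridge MTU (both in bytes).
check_bridge_mtu() {
    if [ "$2" -gt "$1" ]; then
        echo "mismatch: bridge MTU $2 > host MTU $1"
        return 1
    fi
    echo "ok: bridge MTU $2 <= host MTU $1"
}

# On the VM the values could be read from sysfs, for example:
#   check_bridge_mtu "$(cat /sys/class/net/ens3/mtu)" "$(cat /sys/class/net/docker0/mtu)"
# Values taken from the `ip a` output above:
check_bridge_mtu 1450 1500 || true
# The same check once the bridge uses the host MTU:
check_bridge_mtu 1450 1450
```

The non-zero return code on a mismatch makes the helper usable as a guard in a provisioning script.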

So it looks like this new hiera config should solve it:

docker::configuration::settings:
  data-root: /mnt/docker-scratch/docker  # make docker use attached volume
  mtu: 1450  # lower MTU needed for the "VXLAN/IPv6-dualstack" network
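
The 1450 value matches the MTU of ens3 on the host. One way to sanity-check an MTU value is a ping with fragmentation prohibited, sized to exactly fill one packet. The arithmetic is sketched below; the `max_icmp_payload` helper is hypothetical, and the sketch assumes IPv4 without header options (20-byte IP header plus 8-byte ICMP header).

```shell
# Hypothetical helper: largest ICMP echo payload that fits in one IPv4 packet
# of the given MTU (20-byte IPv4 header without options + 8-byte ICMP header).
max_icmp_payload() {
    echo $(( $1 - 28 ))
}

echo "MTU 1450 -> payload $(max_icmp_payload 1450)"
echo "MTU 1500 -> payload $(max_icmp_payload 1500)"

# On a host with iputils ping, a probe with fragmentation prohibited could look like:
#   ping -M do -c 3 -s "$(max_icmp_payload 1450)" 208.80.154.139
```

A payload of 1422 bytes should pass over a 1450-byte path, while 1472 bytes (sized for MTU 1500) should not.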

Let's restart things:

$ sudo run-puppet-agent
...
--- /etc/docker/daemon.json	2025-11-17 17:07:33.641404560 +0000
+++ /tmp/puppet-file20251117-267465-72rwnt	2025-11-17 17:19:45.530710864 +0000
@@ -1,3 +1,4 @@
 {
-  "data-root": "/mnt/docker-scratch/docker"
+  "data-root": "/mnt/docker-scratch/docker",
+  "mtu": 1450
 }
...

xcollazo@gitlab-docker-runner-v2:~$ sudo systemctl stop docker
Warning: Stopping docker.service, but it can still be activated by:
  docker.socket
xcollazo@gitlab-docker-runner-v2:~$ sudo ip link set dev docker0 down
xcollazo@gitlab-docker-runner-v2:~$ sudo ip link del docker0
xcollazo@gitlab-docker-runner-v2:~$ sudo rm -rf /var/lib/docker/network/files/
xcollazo@gitlab-docker-runner-v2:~$ sudo systemctl start docker
xcollazo@gitlab-docker-runner-v2:~$ ip addr show docker0

Now let's test:

xcollazo@gitlab-docker-runner-v2:~$ sudo docker run -it 807cec67eba7 bash
root@2aa75884dd64:/# apt update
Get:1 http://apt.wikimedia.org/wikimedia bullseye-wikimedia InRelease [158 kB]
Get:2 http://mirrors.wikimedia.org/debian bullseye InRelease [75.1 kB]                      
Get:3 http://mirrors.wikimedia.org/debian bullseye-updates InRelease [44.0 kB]                      
Get:4 http://security.debian.org/debian-security bullseye-security InRelease [27.2 kB]
Get:5 http://apt.wikimedia.org/wikimedia bullseye-wikimedia/main amd64 Packages [74.7 kB]
Get:6 http://mirrors.wikimedia.org/debian bullseye/main amd64 Packages [8066 kB]
Get:7 http://mirrors.wikimedia.org/debian bullseye-updates/main amd64 Packages [18.8 kB]
Get:8 http://security.debian.org/debian-security bullseye-security/main amd64 Packages [423 kB]
Fetched 8886 kB in 2s (5352 kB/s)                       
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
All packages are up to date.
root@2aa75884dd64:/#

We should be good now.

I inadvertently committed directly to main on conda-analytics to pick up the above. Here is the commit for completeness.

gitlab-docker-runner-v2 successfully built the latest conda-analytics artifact. 🎉

Deleted old gitlab-docker-runner VM instance, and deleted old gitlab-docker-runner-workspace volume to give back the resources to WMCS.

Deleted gitlab runner references both from airflow-dags and conda-analytics Gitlab projects.

I think we are done here for now. Bookworm should give us a 2-3 year runway.

Had to restart the VM, and lost the mounted volume: we were missing a mount definition for /mnt/docker-scratch in fstab. With the entry added:

$ cat /etc/fstab 
PARTUUID=60e1fb21-856d-4220-8d87-f9d6ffcda7be / ext4 rw,discard,errors=remount-ro,x-systemd.growfs 0 1
PARTUUID=f9abe075-7aa9-4f0c-bc89-11cddd2df78b /boot/efi vfat defaults 0 0
UUID=e9d0b85d-9385-43b2-99bd-b5393b0d8e51 /mnt/docker-scratch ext4 defaults 0 2
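
For reference, an entry of this shape can be generated rather than typed by hand. This is a minimal sketch, not the exact commands used here: the `fstab_entry` helper is hypothetical and simply reproduces the options and fsck pass number of the entry above.

```shell
# Hypothetical helper: print an fstab line that mounts a filesystem by UUID,
# using the same options and fsck pass number as the entry above.
# Arguments: filesystem UUID, mount point, filesystem type.
fstab_entry() {
    printf 'UUID=%s %s %s defaults 0 2\n' "$1" "$2" "$3"
}

# On the VM the UUID could be read with:
#   uuid="$(sudo blkid -s UUID -o value /dev/sdb)"
# Here the UUID from the lsblk output earlier in this task is used directly.
fstab_entry e9d0b85d-9385-43b2-99bd-b5393b0d8e51 /mnt/docker-scratch ext4

# After appending the line to /etc/fstab, `sudo mount -a` mounts everything
# listed there and is a quick way to catch a typo before the next reboot.
```

Mounting by UUID rather than by /dev/sdb avoids depending on the order in which the kernel names the disks.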

Now we're good.