Use dedicated volume for /var/lib/docker on Trusted Runners
Closed, ResolvedPublic

Description

While creating disk usage dashboards in T327435, it was noticed that /var/lib/docker is inside the root volume of gitlab-runner hosts (Trusted Runners).

The default partman config has a rather small root partition (~70GB) and a bigger /srv partition (~600GB).

We should create a dedicated partman config with a bigger /var/lib/docker to use the disk space more efficiently for Docker caching. This would also prevent the root partition from filling up with Docker data.

Hosts to reimage:

  • gitlab-runner1002
  • gitlab-runner1003
  • gitlab-runner1004
  • gitlab-runner2002
  • gitlab-runner2003
  • gitlab-runner2004

Steps after reimage is done:

  • increase profile::gitlab::runner::docker_gc_*_water_mark matching new disk size for Trusted runners
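As a rough illustration of matching the water marks to the disk size, the arithmetic could look like the sketch below. The function name, the percentage parameters and the GB units are assumptions made for this example; the actual hiera keys behind docker_gc_*_water_mark and their units are not spelled out in this task.

```python
def water_marks(volume_gb: int, low_pct: int, high_pct: int) -> dict:
    """Return low/high water marks in GB for a volume of the given size.

    The percentages are illustrative assumptions, not production values.
    """
    if not 0 < low_pct < high_pct <= 100:
        raise ValueError("expected 0 < low_pct < high_pct <= 100")
    return {
        "low_gb": volume_gb * low_pct // 100,
        "high_gb": volume_gb * high_pct // 100,
    }


if __name__ == "__main__":
    # Hypothetical example: a 500 GB volume, with cleanup starting at 80%
    # usage and stopping once usage is back down to 60%.
    print(water_marks(500, low_pct=60, high_pct=80))
```

Scaling from percentages rather than hard-coding sizes means the same values would still fit if the volume size changes again.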

Event Timeline

@eoghan and I will pick that task up. This should also be helpful to get used to cookbooks and the server lifecycle.

I'm going to prepare a new partman config and pair with @eoghan to run sre.hosts.reimage on those hosts.

Jelto triaged this task as Medium priority. Feb 7 2023, 3:02 PM
Jelto moved this task from Incoming to Backlog on the collaboration-services board.

Change 887330 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] install_server: add custom partman config for gitlab-runner

https://gerrit.wikimedia.org/r/887330

Change 887330 merged by Jelto:

[operations/puppet@production] install_server: add custom partman config for gitlab-runner

https://gerrit.wikimedia.org/r/887330

Cookbook cookbooks.sre.hosts.reimage was started by eoghan@cumin1001 for host gitlab-runner1002.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by eoghan@cumin1001 for host gitlab-runner1002.eqiad.wmnet with OS bullseye completed:

  • gitlab-runner1002 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202302081159_eoghan_3265348_gitlab-runner1002.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

@eoghan and I reimaged gitlab-runner1002. /var/lib/docker has a dedicated volume now with a bit under 500GB of space:

$ df -h
Filesystem                                 Size  Used Avail Use% Mounted on
udev                                        63G     0   63G   0% /dev
tmpfs                                       13G  1.7M   13G   1% /run
/dev/mapper/gitlab--runner1002--vg-root    183G  3.5G  170G   2% /
tmpfs                                       63G     0   63G   0% /dev/shm
tmpfs                                      5.0M     0  5.0M   0% /run/lock
/dev/mapper/gitlab--runner1002--vg-docker  458G  467M  434G   1% /var/lib/docker
tmpfs                                       13G     0   13G   0% /run/user/32265
tmpfs                                       13G     0   13G   0% /run/user/43207

The cookbook got stuck because of a race condition in the Puppet code. Docker tried to create a new network (/usr/bin/docker network create --driver='bridge' --subnet='172.20.0.0/16' 'gitlab-runner'), but ferm had already taken over the Docker iptables rules. We solved this by manually restarting Docker and re-running Puppet.

For the other Trusted Runners we want to correct the dependency settings in the Puppet code so that the docker network is created before ferm runs (or so that Docker creates no iptables rules), and so that the registration workflow is triggered only once Docker has configured the networks.
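The intended ordering can be modelled as a small dependency graph. The sketch below is only a Python illustration of that idea, not the Puppet change itself: Puppet resolves ordering through its own relationship metaparameters, and the step names here are hypothetical labels rather than real resource titles.

```python
from graphlib import TopologicalSorter

# Each key lists the steps that must finish before it may run.
# Step names are hypothetical labels, not Puppet resource titles.
DEPENDENCIES = {
    "docker_network": {"docker_service"},
    "ferm": {"docker_network"},
    "runner_registration": {"docker_network"},
}


def apply_order(dependencies: dict) -> list:
    """Return one valid order in which the steps could be applied."""
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    print(apply_order(DEPENDENCIES))
```

Any order the sorter returns places the docker network ahead of both ferm and the runner registration, which is the property the race condition violated.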

Change 887843 had a related patch set uploaded (by EoghanGaffney; author: EoghanGaffney):

[operations/puppet@production] Adds 'before' directive to docker::network in gitlab runner setup

https://gerrit.wikimedia.org/r/887843

Change 887843 merged by EoghanGaffney:

[operations/puppet@production] Adds 'before' directive to docker::network in gitlab runner setup

https://gerrit.wikimedia.org/r/887843

Cookbook cookbooks.sre.hosts.reimage was started by eoghan@cumin1001 for host gitlab-runner1003.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by eoghan@cumin1001 for host gitlab-runner1003.eqiad.wmnet with OS bullseye completed:

  • gitlab-runner1003 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202302091054_eoghan_3615511_gitlab-runner1003.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change 887983 had a related patch set uploaded (by EoghanGaffney; author: EoghanGaffney):

[operations/puppet@production] Try running docker before the base firewall rules are added

https://gerrit.wikimedia.org/r/887983

Change 887983 merged by EoghanGaffney:

[operations/puppet@production] Try running docker before the base firewall rules are added

https://gerrit.wikimedia.org/r/887983

Cookbook cookbooks.sre.hosts.reimage was started by eoghan@cumin1001 for host gitlab-runner1004.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by eoghan@cumin1001 for host gitlab-runner1004.eqiad.wmnet with OS bullseye completed:

  • gitlab-runner1004 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202302091602_eoghan_3696690_gitlab-runner1004.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change 888057 had a related patch set uploaded (by EoghanGaffney; author: EoghanGaffney):

[operations/puppet@production] Insert an empty DOCKER-ISOLATION-STAGE-1 chain into the ferm templates

https://gerrit.wikimedia.org/r/888057

Cookbook cookbooks.sre.hosts.reimage was started by eoghan@cumin2002 for host gitlab-runner2002.codfw.wmnet with OS bullseye

Change 888057 merged by EoghanGaffney:

[operations/puppet@production] Insert an empty DOCKER-ISOLATION-STAGE-1 chain into the ferm templates

https://gerrit.wikimedia.org/r/888057

Cookbook cookbooks.sre.hosts.reimage started by eoghan@cumin2002 for host gitlab-runner2002.codfw.wmnet with OS bullseye completed:

  • gitlab-runner2002 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202302101151_eoghan_1273554_gitlab-runner2002.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by eoghan@cumin2002 for host gitlab-runner2003.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by eoghan@cumin2002 for host gitlab-runner2003.codfw.wmnet with OS bullseye completed:

  • gitlab-runner2003 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202302101244_eoghan_1286174_gitlab-runner2003.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by eoghan@cumin2002 for host gitlab-runner2004.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by eoghan@cumin2002 for host gitlab-runner2004.codfw.wmnet with OS bullseye completed:

  • gitlab-runner2004 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202302101321_eoghan_1294189_gitlab-runner2004.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change 888234 had a related patch set uploaded (by EoghanGaffney; author: EoghanGaffney):

[operations/puppet@production] Set increased thresholds for docker image/volume garbage collection

https://gerrit.wikimedia.org/r/888234

Change 888234 merged by EoghanGaffney:

[operations/puppet@production] Set increased thresholds for docker image/volume garbage collection

https://gerrit.wikimedia.org/r/888234

I think this can be closed. All gitlab-runner hosts now use a dedicated /var/lib/docker volume, and I've updated the thresholds for docker-gc.