
upgrade gitlab-runners to bullseye
Closed, Resolved · Public

Description

This is a task to:

  • test whether the existing puppet classes for gitlab-runners "just work" on bullseye like they do on buster
  • import gitlab-runner package from external repo into Wikimedia apt repo
  • see what (packages) might be missing or what changes are needed to upgrade
  • eventually upgrade all the existing runners to bullseye
  • upgrade Trusted Runners to bullseye

break-out from ticket T297411#7567146

Event Timeline

For the integration project, the instances will be migrated to Bullseye using the flavor requested at T299704: g3.cores8.ram24.disk20.ephemeral60.4xiops

The important bits being:

  • enough disk space (60G fits the need for the integration project: 24G for docker, the rest as a scratch area that is volume-mounted in the container)
  • 4xiops to have faster disk IO

Mentioned in SAL (#wikimedia-cloud) [2022-03-02T22:22:27Z] <mutante> - creating gitlab-runner-1001 on bullseye - purely test for T297659

Change 767599 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] devtools: copy yaml key/values over from gitlab-runner project for test

https://gerrit.wikimedia.org/r/767599

Change 767599 merged by Dzahn:

[operations/puppet@production] devtools: copy yaml key/values over from gitlab-runner project for test

https://gerrit.wikimedia.org/r/767599

  • created a new instance gitlab-runner-1001 on bullseye and in devtools project
  • copied some hieradata over, incl. setting the 'fake private' value for the registration token (labs/private does not apply, even though the key is in common, due to the per-project Hiera structure in cloud), and uncommented the puppetmaster setting
  • applied profile::gitlab::runner on instance
  • first puppet run has some dependency issues but creates the docker cinder volume (no need to use the script mentioned on the docs page; just create and attach the volume in Horizon)
  • second puppet run is now mostly fine, except for the main issue:

E: Unable to locate package gitlab-runner

and one minor "dangling symbolic link" warning from the config class trying to create the config before the package is installed; this would resolve itself on the next run once the package is available.
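A quick way to confirm the root cause would be to check whether any configured apt source offers the package at all. This is a hypothetical diagnosis helper, not a command from the task:

```shell
# Hypothetical helper (assumption, not from the task): confirm whether
# "Unable to locate package gitlab-runner" means no apt source offers it.
check_runner_package() {
  # shows candidate versions per source, or nothing if no source has it
  apt-cache policy gitlab-runner
  # -s silences errors for files that don't exist
  grep -rs gitlab-runner /etc/apt/sources.list /etc/apt/sources.list.d/ \
    || echo "no apt source mentions gitlab-runner; a repo import is needed"
}
```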

Change 767604 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] aptrepo: import gitlab-runner package for bullseye

https://gerrit.wikimedia.org/r/767604

Dzahn changed the task status from Open to In Progress. Mar 2 2022, 11:25 PM
Dzahn updated the task description. (Show Details)

Change 767604 merged by Dzahn:

[operations/puppet@production] aptrepo: import gitlab-runner package for bullseye

https://gerrit.wikimedia.org/r/767604

Mentioned in SAL (#wikimedia-operations) [2022-04-04T23:51:38Z] <mutante> apt1001 - importing gitlab-runner package for bullseye via: 'sudo -E reprepro --noskipold --component thirdparty/gitlab-runner update bullseye-wikimedia' after gerrit:767604 (T297659)
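After an import like the one in the SAL entry above, the result could be checked on the apt host with reprepro's list subcommand. A hedged sketch; component and distribution names are taken from the SAL entry, wrapped in a function that is defined but not run:

```shell
# Sketch (assumption): verify the imported package on the apt host.
# -C restricts the query to the thirdparty/gitlab-runner component.
verify_import() {
  sudo -E reprepro -C thirdparty/gitlab-runner \
    list bullseye-wikimedia gitlab-runner
}
```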

The package has been imported to our repo for bullseye per above.

Running puppet on the test instance in Cloud VPS installed it successfully.

dzahn@gitlab-runner-1001:~$ dpkg -l | grep runner
ii  gitlab-runner                        14.9.1                         amd64        GitLab Runner
dzahn@gitlab-runner-1001:~$ lsb_release -c
Codename:	bullseye

Next step is to check on this issue:

Notice: /Stage[main]/Profile::Gitlab::Runner/Exec[gitlab-register-runner]/returns: Merging configuration from template file "/etc/gitlab-runner/config-template.toml" 
Notice: /Stage[main]/Profile::Gitlab::Runner/Exec[gitlab-register-runner]/returns: ERROR: Registering runner... forbidden (check registration token)  runner=private
Notice: /Stage[main]/Profile::Gitlab::Runner/Exec[gitlab-register-runner]/returns: PANIC: Failed to register the runner.
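One way to isolate this would be to register manually, outside puppet, and see whether the same "forbidden" error appears. A hedged sketch; the URL, token, and docker image are placeholders, not values from the task, and the function is only defined, not invoked:

```shell
# Sketch (assumptions: placeholder URL/token/image) to rule out puppet
# and test the registration token directly against the GitLab instance.
register_manually() {
  sudo gitlab-runner register \
    --non-interactive \
    --url "https://gitlab.wikimedia.org/" \
    --registration-token "REDACTED" \
    --executor docker \
    --docker-image "docker-registry.wikimedia.org/bullseye:latest"
  # "forbidden (check registration token)" here would point at a stale
  # or wrong token rather than a problem in the puppet profile.
}
```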

Mentioned in SAL (#wikimedia-cloud) [2022-04-07T23:52:09Z] <mutante> creating instance runner-1020 with bullseye for testing purposes, unfortunately not enough quota to just use the same as all other runners, could only do small flavor T297659

I made a new VM in the "gitlab-runners" project. runner-1020. I did this right in the gitlab-runners project and not in devtools project because that way I don't have to worry about leaking the registration token to another project and Hiera settings already exist.

It's running on bullseye and a runner is installed and active on it now, but it's only for demo purposes; I could not use the same flavor as the existing buster runners, so it's small and won't be able to run much.

It proves the puppet role and package work on bullseye, so we can upgrade.

I haven't actually registered the runner yet, and I skipped the cinder volume via Hiera.

Next would be to remove existing runners, one by one, replace them with new runners and register those, or ask for raised quota to be able to do a full spec one before shutting down an existing one.

https://docs.gitlab.com/runner/register/

I like the idea of putting the bullseye runner runner-1020 into the gitlab-runners project. That reduces overhead around the puppet and hiera configuration.

> I haven't actually registered the runner yet and skipped the cinder volume via hiera

The runner is registered automatically. Automatic registration can be disabled by setting profile::gitlab::runner::ensure to absent for this specific runner (see runner.pp). The runner is unregistered if the value is changed from default present to absent.
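On the CLI side, the unregister path that ensure => absent presumably triggers looks roughly like the following. This is a sketch under assumptions (the runner name is a placeholder, and the mapping to runner.pp is inferred, not confirmed by the task):

```shell
# Sketch (assumption): unregistering a runner by name, roughly what
# setting profile::gitlab::runner::ensure to absent would do.
unregister_runner() {
  sudo gitlab-runner unregister --name runner-1020
}
```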

So I can confirm runner-1020 shows up in the GitLab admin area. And it already executed 8 jobs mostly successfully (2 jobs failed due to build job errors, which should have nothing to do with the runner). I would like to keep runner-1020 running for the day to gather some more experience.

As you mentioned, the runner-1020 flavor is quite small and the disk is also slower. So we should replace it in the near future, because I can imagine that some jobs will get delayed or fail due to insufficient resources. So I think we can proceed with re-imaging a buster runner with bullseye and using the bigger flavor g3.cores8.ram24.disk20.ephemeral60.4xiops.

Thanks for the work!

Mentioned in SAL (#wikimedia-operations) [2022-04-08T18:38:49Z] <mutante> gitlab1001 - giving myself gitlab admin rights via rake console, to be able to connect/disconnect runners T297659

Mentioned in SAL (#wikimedia-cloud) [2022-04-08T20:59:03Z] <mutante> - pausing runner-1008 from accepting new jobs, hoping it will finish all existing jobs already queued and once that is down to 0 I can replace it with a new runner on bullseye (T297659)

Mentioned in SAL (#wikimedia-cloud) [2022-04-08T22:01:30Z] <mutante> - deleting instance runner-1008 in Horizon and also deleting it in gitlab admin UI about the same time T297659

Mentioned in SAL (#wikimedia-cloud) [2022-04-08T22:03:06Z] <mutante> - deleting instance runner-1020 and recreating it with the same name but flavor g3.cores8.ram24.disk20 T297659

Mentioned in SAL (#wikimedia-operations) [2022-04-08T22:09:14Z] <mutante> gitlab - deleted runner-1008 (to replace it with a bullseye instance), recreated runner-1020 with same flavor as existing runners T297659

> So I can confirm runner-1020 shows up in the GitLab admin area. And it already executed 8 jobs mostly successfully (2 jobs failed due to build job errors, which should have nothing to do with the runner). I would like to keep runner-1020 running for the day to gather some more experience.

I got myself the needed admin rights via rake console. runner-1020 had executed 27 jobs or so at this point.

> As you mentioned, runner-1020 flavor is quite small and the disk is also slower. So we should replace it in the near future because I can imagine that some jobs will get delayed or fail because of not enough resources. So I think we can proceed with re-image a buster runner with bullseye and use the bigger flavor g3.cores8.ram24.disk20.ephemeral60.4xiops.

I have deleted runner-1008 AND runner-1020, both at the GitLab UI and Horizon UI level, to replace them with a new bullseye runner with the same flavor as all the other runners.

Originally I wanted to just recreate 1020 with the other flavor, but I ran into the bug that you can't create an instance with the same name you used shortly before.

So now there is runner-1021, bullseye and g3.cores8.ram24.disk20.ephemeral60.4xiops and it's active and waiting for jobs.

Mentioned in SAL (#wikimedia-cloud) [2022-04-11T18:25:47Z] <mutante> pausing runner-1011 in gitlab UI from accepting new jobs, then deleting instance in Horizon UI to replace it with another bullseye instance T297659

Mentioned in SAL (#wikimedia-operations) [2022-04-11T18:26:20Z] <mutante> gitlab-runners: pausing runner-1011 in gitlab UI from accepting new jobs, then deleting instance in Horizon UI to replace it with another bullseye instance T297659

Mentioned in SAL (#wikimedia-operations) [2022-04-11T19:17:16Z] <mutante> runner-1022.gitlab-runners - rm -rf /var/lib/puppet/ssl ; run puppet; sign new request on gitlab-runners-puppetmaster-01.gitlab-runners (normal procedure needed when creating fresh instance in project with local puppetmaster) T297659
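The certificate procedure from the SAL entry above can be sketched as follows. This is a hedged outline: the hostname is illustrative, the /var/lib/puppet layout and `puppet cert sign` syntax are assumed from the SAL entry and the Puppet 5 era tooling, and the function is defined but not run:

```shell
# Sketch (assumptions: hostname and Puppet 5 layout) of the normal
# procedure for a fresh instance in a project with a local puppetmaster.
reissue_puppet_cert() {
  # on the new runner instance: drop the SSL state from the central CA
  sudo rm -rf /var/lib/puppet/ssl
  # first agent run fails but submits a new CSR to the local puppetmaster
  sudo puppet agent --test || true
  # on the project-local puppetmaster, sign the pending request, e.g.:
  #   sudo puppet cert sign runner-1022.gitlab-runners.eqiad1.wikimedia.cloud
  # back on the instance, a second run should now apply the catalog
  sudo puppet agent --test
}
```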

After fixing a minor race condition, you can now apply profile::gitlab::runner to a new instance and puppet applies everything on the first run without errors. Multiple agent runs are no longer needed.

2 more buster instances removed and replaced with new bullseye instances. Added tags to the runners list to make it clearer what is what. There are some test instance entries that we should clean up at some point.

current progress: 3 x bullseye, 7 x buster and watching how it goes until tomorrow.

Screenshot from 2022-04-11 13-38-00.png (264×1 px, 43 KB)

cc: @LSobanski for OKR progress; this is very close now.

Dzahn triaged this task as High priority. Apr 11 2022, 8:41 PM
Dzahn moved this task from Next up 🥌 to Doing 😎 on the serviceops board.
Dzahn added a subscriber: LSobanski.

Priority is High if it means "are you currently working on it" and Medium if it means "how important is it". I always disagree with Andre about which one it is, though :)

Mentioned in SAL (#wikimedia-operations) [2022-04-14T22:28:01Z] <mutante> gitlab - deleting runner-1018, runner-1019, creating runner-1029, runner-1030 T297659

All 10 gitlab-runner instances in the Cloud VPS project are now on bullseye.

The puppetmaster isn't, but I'm not sure if that is part of this task.

And then there are other runners, used for mwcli, that aren't operated by us and are not in this cloud project. I have talked to addshore about those.

@Jelto All the (non-protected) prod runners are upgraded. Now I was just wondering about the 2 protected runners. They are paused. Should I try upgrading those as well?

> @Jelto All the (non-protected) prod runners are upgraded. Now I was just wondering about the 2 protected runners. They are paused. Should I try upgrading those as well?

Great to hear that! Protected Runners are currently paused (T295481 needs some more work). Upgrading them has no end-user/CI impact, as they are not used.
So we could destroy one of the Trusted Runners, create a new bullseye ganeti VM, attach the second disk for /var/lib/docker and install the puppet role.

Let me know if I should take one or two of those. I added a checkbox for trusted runners in the task description.

Change 784741 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] DHCP: make protected gitlab-runners use bullseye installer

https://gerrit.wikimedia.org/r/784741

Change 784741 merged by Dzahn:

[operations/puppet@production] DHCP: make protected gitlab-runners use bullseye installer

https://gerrit.wikimedia.org/r/784741

Mentioned in SAL (#wikimedia-operations) [2022-04-20T20:36:07Z] <mutante> gitlab-runner2001 - mkdir /home/gitlab-runner (was: PANIC: mkdir /home/gitlab-runner: permission denied and other issues, trying if it's just the missing directory or more) T297659

Change 784765 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] gitlab: temp set gitlab-runner user to root for bootstrapping gitlab-runner2001

https://gerrit.wikimedia.org/r/784765

Change 784765 merged by Dzahn:

[operations/puppet@production] gitlab: temp set gitlab-runner user to root for bootstrapping gitlab-runner2001

https://gerrit.wikimedia.org/r/784765

Change 785189 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] gitlab: ensure home dir for runner_user exists when running as non-root

https://gerrit.wikimedia.org/r/785189

Change 785189 merged by Dzahn:

[operations/puppet@production] gitlab: ensure home dir for runner_user exists when running as non-root

https://gerrit.wikimedia.org/r/785189

Change 785191 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] gitlab_runner: use config_path variable when creating config file

https://gerrit.wikimedia.org/r/785191

Change 785191 merged by Dzahn:

[operations/puppet@production] gitlab_runner: ensure the full path to the config location exists

https://gerrit.wikimedia.org/r/785191

Mentioned in SAL (#wikimedia-operations) [2022-04-21T20:14:11Z] <mutante> reimaging gitlab-runner2001.codfw.wmnet one more time to confirm things work from scratch now T297659

Change 785198 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] gitlab::runner: ensure config dir is owned by non-privileged user

https://gerrit.wikimedia.org/r/785198

Change 785198 merged by Dzahn:

[operations/puppet@production] gitlab::runner: ensure config dir is owned by non-privileged user

https://gerrit.wikimedia.org/r/785198

Mentioned in SAL (#wikimedia-operations) [2022-04-21T21:42:35Z] <mutante> shutting down and reimaging gitlab-runner1001 T297659

Mentioned in SAL (#wikimedia-operations) [2022-04-21T22:00:37Z] <mutante> gitlab-runner2001 - installing apparmor ('apparmor' is the user utilities package and was NOT installed, libapparmor1 WAS installed), this caused bug https://www.mail-archive.com/debian-bugs-dist@lists.debian.org/msg1808456.html after upgrading gitlab-runner to bullseye because bullseye comes with libapparmor1 by default as opposed to before T297659
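The trap in the SAL entry above is that a naive package check matches the library but not the utilities. An illustration with a simulated dpkg line (the version string and output are illustrative, not taken from the hosts):

```shell
# Illustration (simulated dpkg output, not from the real hosts): on
# bullseye the library is installed by default, the userspace tools are not.
dpkg_output='ii  libapparmor1  2.13.6  amd64  changehat AppArmor library'

# A loose grep matches "libapparmor1" and looks reassuring...
echo "$dpkg_output" | grep -q apparmor && echo "apparmor-ish package found"

# ...but anchoring on the utilities package name shows it is missing:
echo "$dpkg_output" | grep -q '^ii  apparmor ' || echo "apparmor utils missing"
```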

Change 785228 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] gitlab::runner: if on buster, ensure apparmor package is installed

https://gerrit.wikimedia.org/r/785228

Change 785228 merged by Dzahn:

[operations/puppet@production] gitlab::runner: if on buster, ensure apparmor package is installed

https://gerrit.wikimedia.org/r/785228

Had some trouble with privileges for the non-privileged user and with apparmor (which is installed by default on bullseye, but without the userspace utils etc.; see above).

But now it's done and both protected runners are on bullseye with no puppet errors:

(2) gitlab-runner2001.codfw.wmnet,gitlab-runner1001.eqiad.wmnet                                         
----- OUTPUT of 'lsb_release -c' -----                                                                  
Codename:       bullseye
[gitlab-runner1001:~] $ ps aux | grep runner
gitlab-+   31814  0.5  0.2 749340 37020 ?        Ssl  22:37   0:01 /usr/bin/gitlab-runner run --working-directory /home/gitlab-runner --config /home/gitlab-runner/.gitlab-runner/config.toml --service gitlab-runner --user gitlab-runner

^ running as non-privileged user with config in the home dir of that user.

[gitlab-runner1001:~] $ sudo gitlab-runner list
Runtime platform                                    arch=amd64 os=linux pid=35058 revision=f188edd7 version=14.9.1
Listing configured runners                          ConfigFile=/etc/gitlab-runner/config.toml

[gitlab-runner2001:~] $ sudo gitlab-runner list
Runtime platform                                    arch=amd64 os=linux pid=55822 revision=f188edd7 version=14.9.1
Listing configured runners                          ConfigFile=/etc/gitlab-runner/config.toml

^ If you wonder why the list here shows the config file in /etc/: I was wondering that too, but look what happens if you run that command without sudo:

[gitlab-runner1001:~] $ gitlab-runner list
Runtime platform                                    arch=amd64 os=linux pid=35290 revision=f188edd7 version=14.9.1
Listing configured runners                          ConfigFile=/home/dzahn/.gitlab-runner/config.toml

And that file definitely does not exist in my personal home dir. So that output is either confusing, a bug, or meant to show the default location for the invoking user. What matters is the command line of the process that is actually running.
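The behavior above is consistent with gitlab-runner picking a default config path per invoking user (root gets /etc/gitlab-runner/config.toml, everyone else gets ~/.gitlab-runner/config.toml). Passing --config explicitly removes the ambiguity when inspecting the service's runners; a sketch, with the path assumed from the running command line shown earlier, defined but not run:

```shell
# Sketch (path assumed from the service's command line above): inspect
# the config the service actually uses, regardless of invoking user.
inspect_service_config() {
  sudo gitlab-runner list \
    --config /home/gitlab-runner/.gitlab-runner/config.toml
}
```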

One more patch for docker in general: https://gerrit.wikimedia.org/r/c/operations/puppet/+/785226