Setup GitLab Runner in trusted environment
Open, In Progress, High, Public

Description

This task is for tracking the setup of GitLab Runners in a trusted environment.

In T286958 we discussed the long term requirements for GitLab Runners. One class of Runners should run in production environments (eqiad, codfw) and execute jobs which handle sensitive credentials and produce artifacts running in production. See also https://wikitech.wikimedia.org/wiki/GitLab/Gitlab_Runner#Specific_GitLab_Runners.

I would like to reuse the existing puppet code for the Shared Runners in WMCS. I think we could start with VMs on Ganeti and later order dedicated machines and/or migrate to some Kubernetes platform.
The Runners must not be used by arbitrary jobs, only by certain projects and branches. So these runners will be set up as Specific Runners, probably executing only jobs for protected branches.

Event Timeline

Dzahn added a subscriber: Dzahn.

I would like to reuse the existing puppet code for the Shared Runners in WMCS.

Yes, this is great, all for this.

we could start with VMs on Ganeti and later order dedicated machines and/or migrate to some Kubernetes platform

That sounds good to me. Jelto, do you already have "hardware" requirements for the VMs, as in CPU/RAM/disk? I can take a look at Cloud VPS and compare, but would that even make sense, given that these might actually have much less to do than the ones in cloud?

The only things I could see as potentially different between these runners and those running in wmcs are:

  • Need to use the http proxy to reach the internet (although I'd prefer to avoid allowing that by default)
  • Need to poke specific holes in the outgoing firewall to the services we want to be able to reach

The runners in WMCS use the g3.cores8.ram24.disk20.ephemeral40.4xiops flavor (see docs and T293832#7450552).
I would like to start with the same machine size and provision one in codfw and one in eqiad. I think 8 cores and 24G of memory is quite beefy for a start, but it's also a good size to scale horizontally later. We also identified fast CI as more critical than avoiding overprovisioned Runners.

After talking with @brennen today: a while back we got budget to add some hardware to the Ganeti cluster. The idea was to migrate some of the CI workload there, especially the release pipeline. The procurement ticket was T214088. The setup of those CI/release-specific machines was T228926.

Maybe those machines can be used for gitlab runner (or most probably they have been reused for some other purpose since then).

@Jelto I checked capacity in eqiad (ganeti1009) and codfw (ganeti2019) and looked at the "MFree" column on all the nodes.

The docs say "In theory there should be sufficient disk/memory space on all nodes in the row that you are planning to use, otherwise you might get failures when creating the VM". But the "in theory" part leaves some room where it _might_ work, heh.

The lowest values for MFree are:

20.5G in codfw and 17.8G in eqiad.

So 16G and one VM in each main DC should be safe but already close to the limit. (cc: @Muehlenhoff @akosiaris)

Also checking with "gnt-instance list" the VMs with the most RAM are 16G. That is currently the maximum we have done there.

Based on this and you calling the ones with 24G "quite beefy" I would recommend we start with 16GB here. We can adjust the RAM with a short downtime if needed, maybe even lower than 16.
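
The sizing argument above can be written down as a small check. This is only a sketch of the arithmetic, not how Ganeti itself places instances: it takes the lowest MFree values quoted above (reading the codfw value as 20.5G) and asks whether a VM of a given size would still fit on the tightest node in each DC. The function names are made up for illustration.

```python
# Lowest free memory (in G) seen on any node per DC, as quoted in this comment.
# The codfw value is read as 20.5G, which is an assumption about the typo.
LOWEST_MFREE = {"codfw": 20.5, "eqiad": 17.8}


def headroom(requested_gb, lowest_free_gb):
    """Memory left on the tightest node after placing a VM of the given size."""
    return round(lowest_free_gb - requested_gb, 1)


def fits_everywhere(requested_gb, lowest_free=LOWEST_MFREE):
    """True if the VM would fit on the tightest node in every DC."""
    return all(headroom(requested_gb, free) >= 0 for free in lowest_free.values())


if __name__ == "__main__":
    for size in (16, 24):
        per_dc = {dc: headroom(size, free) for dc, free in LOWEST_MFREE.items()}
        print(f"{size}G fits everywhere: {fits_everywhere(size)}, headroom: {per_dc}")
```

With these numbers, 16G leaves 1.8G of headroom on the tightest eqiad node, while 24G would not fit on the tightest node in either DC, which matches the "safe but already close to the limit" reading.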

Yeah, let's start with 16 here.

Yup, +1 on the 16GB RAM. More than that and live migrations would be made slower and more prone to failure. Also, it's an untested config currently.

Change 740603 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] site and install_server: add gitlab-runner1001

https://gerrit.wikimedia.org/r/740603

Change 740670 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] gitlab: create role for prod gitlab-runners, adjust cumin alias

https://gerrit.wikimedia.org/r/740670

Change 740670 merged by Dzahn:

[operations/puppet@production] gitlab: create role for prod gitlab-runners, adjust cumin alias

https://gerrit.wikimedia.org/r/740670

Change 740691 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] site: use gitlab_runner role on gitlab-runner1001

https://gerrit.wikimedia.org/r/740691

Mentioned in SAL (#wikimedia-operations) [2021-11-24T22:38:32Z] <mutante> running decom cookbook on gitlab-runner1001.wikimedia.org VM which was in state "ADMIN_down" and not used yet. to make room to recreate it as gitlab-runner1001.eqiad.wmnet T295481

cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: gitlab-runner1001.wikimedia.org

  • gitlab-runner1001.wikimedia.org (WARN)
    • Host not found on Icinga, unable to downtime it
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster ganeti01.svc.eqiad.wmnet to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster ganeti01.svc.eqiad.wmnet to Netbox

How to check which row has the fewest VMs:

[ganeti1009:~] $ for row in A B C D; do echo "row ${row}: $(sudo gnt-instance list -o name -F "pnode.group == 'row_${row}'" | wc -l) VMs"; done 
row A: 34 VMs
row B: 27 VMs
row C: 33 VMs
row D: 25 VMs

picked row D.
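
For reference, the same selection can be done mechanically on the loop's output. This is a minimal sketch that only parses the four lines printed above; it does not talk to Ganeti, and the helper names are hypothetical.

```python
import re

# Output of the shell loop above, copied verbatim.
SAMPLE_OUTPUT = """\
row A: 34 VMs
row B: 27 VMs
row C: 33 VMs
row D: 25 VMs
"""


def parse_row_counts(text):
    """Turn 'row X: N VMs' lines into a {row: count} mapping."""
    counts = {}
    for match in re.finditer(r"^row (\w+): (\d+) VMs$", text, flags=re.MULTILINE):
        counts[match.group(1)] = int(match.group(2))
    return counts


def least_loaded_row(counts):
    """Return the row with the fewest VMs (first one wins on a tie)."""
    return min(counts, key=counts.get)


if __name__ == "__main__":
    print(least_loaded_row(parse_row_counts(SAMPLE_OUTPUT)))
```

Note that VM count is only a rough proxy for load; free memory and disk per node, as checked earlier in this task, matter as well.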

dzahn@cumin1001:~$ sudo cookbook sre.ganeti.makevm --vcpus 8 --memory 16 --disk 20 --network private eqiad_D gitlab-runner1001
Ready to create Ganeti VM gitlab-runner1001.eqiad.wmnet in the ganeti01.svc.eqiad.wmnet cluster on row D with 8 vCPUs, 16GB of RAM, 20GB of disk in the private network.

new MAC is: aa:00:00:99:ec:c5

new IPs are: 10.64.48.71 , 2620:0:861:107:10:64:48:71

Change 740603 merged by Dzahn:

[operations/puppet@production] site and install_server: add gitlab-runner1001

https://gerrit.wikimedia.org/r/740603

Mentioned in SAL (#wikimedia-operations) [2021-11-24T23:26:02Z] <mutante> ganeti - bringing up new VM - sudo gnt-instance start gitlab-runner1001.eqiad.wmnet ; ran puppet on install1003; installing OS T295481

Hi @Jelto, I removed the VM with public IP, then re-created it as gitlab-runner1001.eqiad.wmnet with private IP in row D (see above).

I amended and merged your change to add it to DHCP (with new MAC address) and "insetup" role in puppet. Then installed Debian on it.

The next step would now just be to change the role in site.pp https://gerrit.wikimedia.org/r/c/operations/puppet/+/740691

This uses the new role::gitlab_runner from https://gerrit.wikimedia.org/r/c/operations/puppet/+/740670 which includes the existing profile::gitlab::runner plus base::production and base::firewall so far.

I will wait and not merge this now: first, because you might want to do it and watch it yourself, and second, because it might cause alerts or log noise, and we have the holidays over here and are off work.

P.S. I set up the cumin aliases as follows:

gitlab-server: P{O:gitlab}
gitlab-runner: P{O:gitlab_runner}
gitlab: A:gitlab-server or A:gitlab-runner

Let me know if that is ok or you would prefer that just "gitlab" does not suddenly include the runners in addition to servers.
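
To make the question concrete, here is a toy model of how the three aliases relate. It is not cumin code: it just treats each alias as a set of hosts selected by puppet role and shows that "gitlab" becomes the union of the other two. The server hostname is a placeholder, not a real host.

```python
# Hosts by puppet role. The server hostname is a made-up placeholder;
# the runner hostname is the VM created in this task.
HOSTS_BY_ROLE = {
    "gitlab": {"gitlab-server.example"},
    "gitlab_runner": {"gitlab-runner1001.eqiad.wmnet"},
}


def hosts_with_role(role):
    """Stand-in for a P{O:<role>} query: all hosts that have the given role."""
    return set(HOSTS_BY_ROLE.get(role, set()))


def alias_gitlab_server():
    return hosts_with_role("gitlab")


def alias_gitlab_runner():
    return hosts_with_role("gitlab_runner")


def alias_gitlab():
    """'A:gitlab-server or A:gitlab-runner' is a set union of both aliases."""
    return alias_gitlab_server() | alias_gitlab_runner()


if __name__ == "__main__":
    print(sorted(alias_gitlab()))
```

So anything run against "gitlab" would now also hit the runners; targeting "gitlab-server" keeps the previous scope.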

Cheers

edit: OS install still in progress, had to repeat it

Mentioned in SAL (#wikimedia-operations) [2021-11-24T23:44:19Z] <mutante> puppetmaster1001:~] $ sudo puppet cert sign gitlab-runner1001.eqiad.wmnet | sudo install_console gitlab-runner1001.eqiad.wmnet (T295481)

[ganeti1009:~] $ sudo gnt-instance console gitlab-runner1001.eqiad.wmnet
/dev/vda1: clean, 34733/1248480 files, 382293/4992512 blocks

Debian GNU/Linux 10 gitlab-runner1001 ttyS0

gitlab-runner1001 login:
[puppetmaster1001:~] $ sudo puppet cert sign gitlab-runner1001.eqiad.wmnet

Signing Certificate Request for:
  "gitlab-runner1001.eqiad.wmnet" (SHA256) 4A:9C:D3:8D:D7:7D:0F:E1:CD:A2:57:79:5D:37:82:D4:7F:AC:54:DA:75:8E:4F:FC:17:9D:77:1F:DB:32:F3:2A
Notice: /Stage[main]/Nrpe/Package[nagios-nrpe-server]/ensure: created

Notice: Applied catalog in 293.21 seconds
root@gitlab-runner1001:~# 
..

pending in Icinga monitoring now

https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=gitlab-runner1001


root@gitlab-runner1001:~# gen_fingerprints
 +---------+---------+-----------------------------------------------------+
 | Cipher  | Algo    | Fingerprint                                         |
 +---------+---------+-----------------------------------------------------+
 | RSA     | SHA-256 | SHA256:HhFUG9tCzedGLi1twOAvCOSwZukT1xE25U1iHU+Nfzo  |
 +---------+---------+-----------------------------------------------------+
 | ECDSA   | SHA-256 | SHA256:rQ3b3/ZbaMw9NyBMn2M5By9g5416r+IMbX/yVOpn7yQ  |
 +---------+---------+-----------------------------------------------------+
 | ED25519 | SHA-256 | SHA256:TlKHVDUYIkegWXn4BIIdiNdrHnlNQxnpAOlP6V7gF+Y  |
 +---------+---------+-----------------------------------------------------+

 +---[RSA 2048]----+ +---[ECDSA 256]---+ +--[ED25519 256]--+
 |    . ..B+O*o..o | |                 | | . =++=BB=o+o    |
 |     * o O O*o+ .| |                 | |. + +=++Bo.  .   |
 |    * + o * oO.. | |           + o   | | . .oo Xo..      |
 |   + o . o oo B o| |         .+ = B  | |    * =.*.       |
 |    o   S . .= ..| |        S .o % +.| |   o B.+S.       |
 |     . . . .  E  | |         *. +o*+.| |    . ++E        |
 |        .      . | |        o.o+ .E+=| |     . o.        |
 |                 | |          +oo=++B| |      .          |
 |                 | |          .oo+BO*| |                 |
 +----[SHA256]-----+ +----[SHA256]-----+ +----[SHA256]-----+

Change 742458 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] profile::gitlab-runner add hieradata for protected GitLab Runners

https://gerrit.wikimedia.org/r/742458

Jelto changed the task status from Open to In Progress. Mon, Nov 29, 1:20 PM
Jelto triaged this task as High priority.

I just noticed that the runners in WMCS have a dedicated disk for /var/lib/docker. I added a similar 40G disk to the Ganeti VM gitlab-runner1001:

gitlab-runner1001:~$ lsblk 
NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
vda    254:0    0   20G  0 disk 
├─vda1 254:1    0   19G  0 part /
├─vda2 254:2    0    1K  0 part 
└─vda5 254:5    0  975M  0 part [SWAP]
vdb    254:16   0   40G  0 disk 
└─vdb1 254:17   0   40G  0 part /var/lib/docker

Change 742458 merged by Jelto:

[operations/puppet@production] profile::gitlab-runner add hieradata for protected GitLab Runners

https://gerrit.wikimedia.org/r/742458

Change 740691 merged by Jelto:

[operations/puppet@production] site: use gitlab_runner role on gitlab-runner1001

https://gerrit.wikimedia.org/r/740691

Change 742966 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] profile::gitlab_runner rename hiera file to gitlab_runner

https://gerrit.wikimedia.org/r/742966

Change 742966 merged by Jelto:

[operations/puppet@production] profile::gitlab_runner rename hiera file to gitlab_runner

https://gerrit.wikimedia.org/r/742966

Change 742986 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] profile::gitlab_runner unregister runner gitlab-runner1001

https://gerrit.wikimedia.org/r/742986

Change 742986 merged by Jelto:

[operations/puppet@production] profile::gitlab_runner unregister runner gitlab-runner1001

https://gerrit.wikimedia.org/r/742986

The deploy of the puppet role gitlab::runner to gitlab-runner1001 was successful. The runner showed up in the GitLab Runner menu. For now I disabled the runner again, so it is deregistered from GitLab until we figure out which group and/or projects we want to use it for.

@brennen and I discussed that a group runner (for example for the repos group) would mean quite broad access and could pose a security risk.

My proposal would be to create the Runners as specific runners in a placeholder project (maybe under releng/trusted-runners?). So by default only this placeholder project has access to the runner. For other projects we have to explicitly allow access to the specific runners. Furthermore the runner will run tagged jobs only and will be locked, to be sure that only reviewed changes and projects run jobs there.
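
The intended access rules can be sketched as a small decision function. This is a simplified model of the proposal, not GitLab's implementation: the class and field names are invented, the tag "trusted" and the project "repos/example" are placeholders, and "locked" is represented only by the fact that the list of enabled projects is fixed rather than something other projects can extend themselves.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Runner:
    enabled_projects: frozenset   # projects explicitly allowed to use the runner
    tags: frozenset               # tags the runner serves
    run_untagged: bool = False    # proposal: tagged jobs only
    protected_only: bool = True   # assumption: only jobs on protected refs


@dataclass(frozen=True)
class Job:
    project: str
    tags: frozenset = frozenset()
    ref_is_protected: bool = False


def can_run(runner, job):
    """Decide whether the runner would pick up the job under this model."""
    if job.project not in runner.enabled_projects:
        return False              # project was never given access
    if runner.protected_only and not job.ref_is_protected:
        return False              # unreviewed branches stay off the runner
    if not job.tags:
        return runner.run_untagged
    return job.tags <= runner.tags  # every job tag must be served by the runner


# Runner enabled only for the proposed placeholder project.
trusted = Runner(
    enabled_projects=frozenset({"releng/trusted-runners"}),
    tags=frozenset({"trusted"}),
)

if __name__ == "__main__":
    job = Job("releng/trusted-runners", frozenset({"trusted"}), ref_is_protected=True)
    print(can_run(trusted, job))
```

Under this model a job reaches the runner only if all three conditions hold at once: the project was explicitly enabled, the ref is protected, and the job carries a tag the runner serves.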