
Puppetise gitlab-ansible playbook
Closed, ResolvedPublic

Description

This task is to track progress converting the work from S&F on the gitlab-ansible playbook into puppet.

Currently the ansible playbook does three high-level things, and I have tried to map these to puppet modules.

Event Timeline


Currently we manage /etc/sysctl.d/ with purge => true. However, the gitlab omnibus package manages the following files:

Notice: /Stage[main]/Sysctl/File[/etc/sysctl.d/90-omnibus-gitlab-kernel.sem.conf]/ensure: removed (corrective)
Notice: /Stage[main]/Sysctl/File[/etc/sysctl.d/90-omnibus-gitlab-kernel.shmall.conf]/ensure: removed (corrective)
Notice: /Stage[main]/Sysctl/File[/etc/sysctl.d/90-omnibus-gitlab-kernel.shmmax.conf]/ensure: removed (corrective)
Notice: /Stage[main]/Sysctl/File[/etc/sysctl.d/90-omnibus-gitlab-net.core.somaxconn.conf]/ensure: removed (corrective)

We should puppetise these or change the purge properties for gitlab.

jbond triaged this task as Medium priority.
jbond added a project: Puppet.
jbond added a subscriber: Sergey.Trofimovsky.SF.

This is what we get from those files:

kernel.sem = 250 32000 32 262
kernel.shmall = 4194304
kernel.shmmax = 17179869184
net.core.somaxconn = 1024
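
For illustration, a minimal Puppet sketch of declaring these values so the sysctl purge keeps the files. The sysctl::parameters define and resource title are assumptions modelled on operations/puppet, not the actual code of the patch below:

# Hedged sketch: declare the omnibus-gitlab kernel settings in puppet so
# purging /etc/sysctl.d/ no longer removes them. Interface assumed; values
# taken from the files listed above.
sysctl::parameters { 'omnibus-gitlab':
    values => {
        'kernel.sem'         => '250 32000 32 262',
        'kernel.shmall'      => 4194304,
        'kernel.shmmax'      => 17179869184,
        'net.core.somaxconn' => 1024,
    },
}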

Change 692609 had a related patch set uploaded (by Jbond; author: John Bond):

[operations/puppet@production] gitlab: manage sysctl files

https://gerrit.wikimedia.org/r/692609

Change 692609 merged by Jbond:

[operations/puppet@production] gitlab: manage sysctl files

https://gerrit.wikimedia.org/r/692609

Change 692614 had a related patch set uploaded (by Jbond; author: John Bond):

[operations/puppet@production] gitlab: disable grafana, node-exporter, prometheus and alertmanager

https://gerrit.wikimedia.org/r/692614

Change 692614 merged by Jbond:

[operations/puppet@production] gitlab: disable grafana, node-exporter, prometheus and alertmanager

https://gerrit.wikimedia.org/r/692614

Reedy renamed this task from Puppitise gitlab-ansible playbook to Puppetise gitlab-ansible playbook. May 19 2021, 3:30 AM

So I think we're at a point where S&F feels pretty confident with the state of the Ansible playbook, and we'd like to try running it against gitlab1001 early this week. That is, unless anything here should block that.

Proposal after discussing with @wkandek: we go ahead with the playbook and have Jelto pick up puppetizing the remainder of it (with oversight) as an onboarding exercise after he starts.

@jbond - thoughts?

@brennen sounds fine to me. It's possible that when the playbook is run there may still be some puppetised bits getting in the way, so it may need a bit of massaging.

As to continued puppetizing, I'll touch base with @wkandek regarding that. However, I will likely continue with some of it, as I need to spin up my own instance in wmf-cloud so I can play with the API for adding/managing groups.

@brennen sounds fine to me. It's possible that when the playbook is run there may still be some puppetised bits getting in the way, so it may need a bit of massaging.

Cool, we'll give it a shot here shortly and see what we wind up with.

Change 712322 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gitlab::backup move backup cronjobs to puppet

https://gerrit.wikimedia.org/r/712322

Change 719041 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/gitlab-ansible@master] remove backup crontab managed by Ansible

https://gerrit.wikimedia.org/r/719041

Change 712322 merged by Jelto:

[operations/puppet@production] gitlab::backup move backup cronjobs to puppet

https://gerrit.wikimedia.org/r/712322

Change 719041 merged by Jelto:

[operations/gitlab-ansible@master] remove backup crontab managed by Ansible

https://gerrit.wikimedia.org/r/719041

Change 722370 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] modules::gitlab add missing fields from ansible gitlab.rb template

https://gerrit.wikimedia.org/r/722370

Change 722370 merged by Jelto:

[operations/puppet@production] modules::gitlab add missing fields from ansible gitlab.rb template

https://gerrit.wikimedia.org/r/722370

Change 724430 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] profile::gitlab start using gitlab module

https://gerrit.wikimedia.org/r/724430

The preparation of GitLab puppet code is mostly done. I would like to deploy https://gerrit.wikimedia.org/r/724430 to gitlab2001 while puppet is disabled on gitlab1001. The switch from ansible to the GitLab puppet module involved quite a bit of refactoring and I would like to test the change without impacting production. A test installation in WMCS looks promising.

My goal is to disable puppet for no more than 3 days on gitlab1001: one day to test the change on gitlab2001, one day to do a reimage and a fresh install on gitlab2001, and then another day to deploy on production GitLab or roll back. I would like to do the reimage to make sure we don't miss anything from the ansible playbook. Maybe @Arnoldokoth can support here when doing the reimage.

@brennen @thcipriani: tagging here for awareness and coordination. I would like to do this sometime next week.

I'd like to test https://gerrit.wikimedia.org/r/c/operations/puppet/+/724430 tomorrow on gitlab2001 (replica). For that, I will disable puppet on gitlab1001 (production GitLab). Let me know if this interferes with any deployments planned for production GitLab.

Mentioned in SAL (#wikimedia-operations) [2021-10-06T10:50:05Z] <jelto> disable puppet on gitlab1001 to test puppetized code on GitLab replica - T283076

Change 724430 merged by Jelto:

[operations/puppet@production] profile::gitlab start using gitlab module

https://gerrit.wikimedia.org/r/724430

Change 726888 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] profiles/hiera::gitlab fix ssl configuration

https://gerrit.wikimedia.org/r/726888

Change 726888 merged by Jelto:

[operations/puppet@production] profiles/hiera::gitlab fix ssl configuration

https://gerrit.wikimedia.org/r/726888

I rolled out the puppetised changes to gitlab2001 (gitlab-replica). Apart from a minor ssl fix, everything seems to work. However, I'm a bit limited in testing application-specific features due to some 2FA issues (https://phabricator.wikimedia.org/T292431). I'm part of the wmf-team-sre team, which enforces 2FA, but I can't set up 2FA because I have no GitLab password. Still, I'm quite confident that we can roll out the change on production GitLab soon (Thursday or Friday). Browsing repos, pulling and metrics are working fine.

@brennen if you have a user on gitlab-replica.wikimedia.org could you check some features which require login?

@Arnoldokoth I would like to test the change also on a fresh host which was not set up by ansible before. Could you do a reimage of gitlab2001 (the replica, not production GitLab) by the end of your workday? Thanks!

For extra safety, I've done some backups on the replica before applying the change:

# keep copies of the main config and the ssh-gitlab daemon config
cp /etc/gitlab/gitlab.rb /etc/gitlab/gitlab.rb.bak
cp -r /etc/ssh-gitlab /etc/ssh-gitlab-bak
# full data backup (skipping builds/artifacts/registry) plus a backup of /etc/gitlab
/usr/bin/gitlab-backup create CRON=1 STRATEGY=copy GZIP_RSYNCABLE=yes SKIP=builds,artifacts,registry GITLAB_BACKUP_MAX_CONCURRENCY=4 GITLAB_BACKUP_MAX_STORAGE_CONCURRENCY=1
/usr/bin/gitlab-ctl backup-etc

Mentioned in SAL (#wikimedia-operations) [2021-10-07T16:56:02Z] <arnoldokoth> down timing gitlab2001 for re-imaging (T283076)

Mentioned in SAL (#wikimedia-operations) [2021-10-07T18:07:10Z] <arnoldokoth> gitlab2001 re-image complete (T283076)

@Jelto

As agreed we reimaged gitlab2001 together in a call. We confirmed a few things:

  • Arnold can run wmf-auto-reimage with sudo, but
    • only when he uses the full path to the command
    • it asks for the mgmt password (which he doesn't have yet) and for the mgmt hostname
    • separate from the permission issue, this script is not usable for VMs, so it only applies to things like gerrit and phab but not gitlab

To reimage a VM we could have gone two routes: either decom the VM and makevm, both with cookbooks, or only set the VM to boot from PXE once, reinstall Debian, revoke and sign new puppet certs, and run puppet.

We did the latter: configured the VM to boot from the network, rebooted it, watched the Debian installer from the console, revoked the puppet cert on the master, signed the new puppet signing request, ran puppet on gitlab2001... waited a bit and ...

Everything worked, even after the first puppet run without errors or warnings :)

Icinga became all green again after refreshing things manually, and https://gitlab-replica.wikimedia.org was up, though of course empty; it needs a restore from backup next, as expected.

So yea.. went fine, ansible->puppet conversion worked great.

Also, we spoke about the !log command and Arnold used it, though we still need to sort out privileges so that he can schedule downtimes and run the gnt-instance commands himself. I ran those in a shared screen, so don't be confused by the logs.

Nice work, all!

if you have a user on gitlab-replica.wikimedia.org could you check some features which require login?

Apologies I missed this earlier; I'm on train this week and losing track of other stuff as a result.

I logged in with the "Brennen Bearnes" user and found that while I was authed fine, I kept getting redirected to the root page. I suspect that's just a settings issue, but I'm forgetting the specifics.

@brennen The VM has been reinstalled and puppet reinstalled gitlab, but the actual gitlab data still needs to be manually imported from backup; that is the next step. Part of the reason to do this is also to confirm that the restore works.

Change 728380 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] modules::gitlab::ssh explicitly add git user and enable login

https://gerrit.wikimedia.org/r/728380

I imported the latest data to gitlab2001 and everything looks fine except pulling over ssh. I prepared a patch (728380) for the missing git user, which is needed for pulling over ssh.

This should not be a problem on gitlab1001 (the git user is already created by ansible and we don't do a reimage on gitlab1001). So I would like to re-enable puppet on gitlab1001 soon. I was thinking about doing it today at around 15:30 UTC. Any concerns from your side @brennen, @jbond or @Dzahn ?
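
For context, a minimal sketch of what explicitly declaring the git user with a fixed uid/gid can look like with core Puppet resources. The value 915 is the uid/gid reported later in this task; home and shell are illustrative assumptions, and this is not the actual code of change 728380:

# Sketch only: pin the git user and group to a fixed uid/gid so reimages
# and restores keep file ownership consistent.
group { 'git':
    ensure => present,
    gid    => 915,
    system => true,
}
user { 'git':
    ensure  => present,
    uid     => 915,
    gid     => 915,
    home    => '/var/opt/gitlab',  # assumption, not taken from the module
    shell   => '/bin/sh',          # assumption
    system  => true,
    require => Group['git'],
}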

Mentioned in SAL (#wikimedia-operations) [2021-10-08T15:29:27Z] <jelto> enable puppet on gitlab1001 again for T283076

Puppet on gitlab1001 is enabled again and the puppet run was successful. Web interface works, pulling over ssh works and metrics look fine.

No concerns and nice work, Jelto. 👍

Login to gitlab.wikimedia.org currently seems to be broken for 2FA users (recurring prompt for the 2FA code after authentication with the IdP), and while I was able to log in with a test user ("Rando McRandomface"), I can't go anywhere without getting redirected to /.

This seems a lot like the behavior on the replica:

I logged in with the "Brennen Bearnes" user and found that while I was authed fine, I kept getting redirected to the root page. I suspect that's just a settings issue, but I'm forgetting the specifics.

Could be more than one thing going on, but that seems like the most likely culprit. Going to look at config & settings.

I think this is the culprit:

brennen@gitlab1001:~$ sudo grep session_duration /opt/gitlab/embedded/service/gitlab-rails/config/gitlab.yml                                                                                
    ## cas3-specific settings, specifically session_duration:
      session_duration: 1

Sure enough, manually setting this to 604800 and restarting fixes all the auth weirdness.

I'm pretty sure a value of 1 here was just mistakenly set in gitlab-ansible but never actually had any effect until my upstream changes for T288757 came out with 14.3.whatever. Puppet patch incoming.

Change 728618 had a related patch set uploaded (by Brennen Bearnes; author: Brennen Bearnes):

[operations/puppet@production] gitlab: set session duration to 604800 seconds

https://gerrit.wikimedia.org/r/728618

Change 728618 merged by Dzahn:

[operations/puppet@production] gitlab: set session duration to 604800 seconds

https://gerrit.wikimedia.org/r/728618

@brennen Given the configuration has been moved from Ansible to Puppet, may we archive the Gerrit repo? (https://gerrit.wikimedia.org/r/admin/repos/operations/gitlab-ansible ; just prefix the description with [ARCHIVED] and turn it read-only.)

I'm going to disable puppet on production GitLab (gitlab1001) soon for around two hours to test https://gerrit.wikimedia.org/r/c/operations/puppet/+/728380 on the GitLab replica. If everything looks fine I'm going to deploy on production GitLab too. I expect something around 5-10 minutes of downtime for production GitLab.

Mentioned in SAL (#wikimedia-operations) [2021-10-15T08:58:27Z] <jelto> jelto@gitlab1001:~$ sudo disable-puppet "disable puppet on gitlab1001 to test 728380 on GitLab replica - T283076"

Change 728380 merged by Jelto:

[operations/puppet@production] gitlab::ssh explicitly add git user with fixed id

https://gerrit.wikimedia.org/r/728380

GitLab on the replica looks fine and the uid/gid change was successful. I used the following steps:

sudo /usr/bin/gitlab-backup create CRON=1 STRATEGY=copy GZIP_RSYNCABLE=yes SKIP=builds,artifacts,registry GITLAB_BACKUP_MAX_CONCURRENCY=4 GITLAB_BACKUP_MAX_STORAGE_CONCURRENCY=1
sudo /usr/bin/gitlab-ctl backup-etc

sudo gitlab-ctl stop

sudo run-puppet-agent
sudo gitlab-ctl stop

sudo find /proc -uid 498
sudo find /proc -gid 498
sudo ps -u git

sudo find / -uid 498 -exec chown 915:915 {} +
sudo find / -gid 498 -exec chgrp 915 {} +

sudo gitlab-ctl start

However, my primary issue (the git user is unable to log in and pull over ssh due to a ! password entry instead of *, i.e. the account is locked) still exists.

Oct 15 09:29:36 gitlab2001 sshd[23093]: User git not allowed because account is locked

I'm trying to find a solution for that
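
Purely as an illustration of the locked-account mechanics (the actual fix is the systemd::sysuser change right below): a password field of * means "no usable password" without locking the account, so sshd accepts key-based logins again, whereas ! marks the account as locked.

# Sketch with core Puppet only; the merged solution is
# https://gerrit.wikimedia.org/r/731017. Setting the shadow password field
# to '*' unlocks the account while still not allowing password logins.
user { 'git':
    ensure   => present,
    password => '*',
}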

Change 731017 had a related patch set uploaded (by Jbond; author: John Bond):

[operations/puppet@production] systemd::sysuser: create option to allow users to login

https://gerrit.wikimedia.org/r/731017

Change 731017 merged by Jbond:

[operations/puppet@production] systemd::sysuser: create option to allow users to login

https://gerrit.wikimedia.org/r/731017

I re-enabled puppet on gitlab1001, and the uid/gid change and git user configuration were successful.

jelto@gitlab1001:~$ id git
uid=915(git) gid=915(git) groups=915(git)
jelto@gitlab2001:~$ id git
uid=915(git) gid=915(git) groups=915(git)

From my side, puppetisation is finished. If anything is missing, @brennen, feel free to mention it here. Otherwise this task can be closed.

For documentation purposes, I used the same commands as on the replica:

sudo /usr/bin/gitlab-backup create CRON=1 STRATEGY=copy GZIP_RSYNCABLE=yes SKIP=builds,artifacts,registry GITLAB_BACKUP_MAX_CONCURRENCY=4 GITLAB_BACKUP_MAX_STORAGE_CONCURRENCY=1
sudo /usr/bin/gitlab-ctl backup-etc

sudo gitlab-ctl stop

# make sure no process is running anymore
sudo find /proc -uid 498
sudo find /proc -gid 498
sudo ps -u git

sudo puppet agent --enable

sudo run-puppet-agent
sudo gitlab-ctl stop # puppet starts gitlab again

# make sure no process is running anymore
sudo find /proc -uid 498
sudo find /proc -gid 498
sudo ps -u git

sudo find / -uid 498 -exec chown 915:915 {} +
sudo find / -gid 498 -exec chgrp 915 {} +

sudo gitlab-ctl start

Thanks @Dzahn, @jbond and @brennen for all the help :)

Mentioned in SAL (#wikimedia-operations) [2021-10-15T17:05:17Z] <mutante> gitlab2001 - temp stopped puppet - debugging gitlab restore script with Arnold - T283076

Looks good from my end - looks like there's some ongoing work with restore scripts, but feel free to resolve once that's handled. Thanks!

That's right. The restore script works when started manually, but it does not work (and unfortunately breaks things) when systemd starts it.

We did a debugging session on this: added some code to write actions to a logfile, removed echos that output directly to the shell, watched it fail, etc.

Something goes wrong with properly stopping services before it attempts to restart them. Then that fails because the port is already (still) in use.

Even if you use gitlab-ctl stop, you can still see gitlab processes in the output of ps.

Also, if you use gitlab-ctl stop, something restarts things automatically (what exactly?) and then... everything is fixed again. :p

Stopping everything is currently not part of the script though.

The current status is that the timer/job is deactivated and removed on both hosts, while the script is installed and can be started manually.

This should still be fixed.

I identified at least two issues which prevent us from having a successful restore:

One is that automatic puppet agent runs start stopped GitLab services again:

Notice: /Stage[main]/Gitlab/Service[gitlab-ce]/ensure: ensure changed 'stopped' to 'running' (corrective)

So if the puppet agent runs while we are doing a restore, GitLab services are started again automatically.
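
For reference, this is the kind of declaration that produces the corrective restart (reconstructed from the Notice above; the exact module code may differ):

# With ensure => running, every puppet agent run during a restore window
# starts GitLab again, which is why puppet has to be disabled first.
service { 'gitlab-ce':
    ensure => running,
    enable => true,
}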

Additionally, we use the wrong GitLab configuration file (gitlab.rb) during and after the restore. The idea was to make a temporary copy of the configuration file (gitlab.rb.bak) before restoring, so that instance-specific configuration is not lost. However, gitlab.rb.bak also exists on production GitLab, so when we restore the config backup (extract the config backup tar file), we overwrite this temporary copy. The replica then tries to start with the production configuration, resulting in errors like using the wrong IP address:

listen tcp [...]:9229: bind: cannot assign requested address

So we have to make sure GitLab is not started by puppet agent runs during the restore.
And we have to make sure to use the correct gitlab.rb configuration file for the replica.

So we have to make sure GitLab is not started by puppet agent runs during the restore.

We can write our own /var/lib/puppet/state/agent_catalog_run.lock, but even better, I think, is to use disable-puppet.

It can be started as "disable-puppet anyothercommand ...".

Check out modules/base/files/puppet/disable-puppet

Change 734339 had a related patch set uploaded (by AOkoth; author: AOkoth):

[operations/puppet@production] gitlab: disable puppet and rename files

https://gerrit.wikimedia.org/r/734339

Change 734339 abandoned by AOkoth:

[operations/puppet@production] gitlab: disable puppet and rename files

Reason:

create separate changes as requested by reviewer

https://gerrit.wikimedia.org/r/734339

Change 734664 had a related patch set uploaded (by AOkoth; author: AOkoth):

[operations/puppet@production] gitlab: rename config & secrets backup file

https://gerrit.wikimedia.org/r/734664

Change 734664 merged by Dzahn:

[operations/puppet@production] gitlab: rename config & secrets backup file

https://gerrit.wikimedia.org/r/734664

I identified at least two issues which prevent us from having a successful restore:
So we have to make sure GitLab is not started by puppet agent runs during the restore.
And we have to make sure to use the correct gitlab.rb configuration file for the replica.

@Jelto Unfortunately we have spread the info a bit across 3 tickets I guess. So I wanted to point here: T285867#7466814 and here: T274463#7466622

I made a fix in the puppet code to ensure the timer only runs on the passive host (https://gerrit.wikimedia.org/r/c/operations/puppet/+/735437) and then re-enabled the timer after Arnold requested it. He then tested the restore script started by systemd, and it works :)

This was after the issues you mentioned were fixed by Arnold in https://gerrit.wikimedia.org/r/c/operations/puppet/+/734664 and https://gerrit.wikimedia.org/r/c/operations/puppet/+/734741

Great work, thank you @Arnoldokoth and @Dzahn. I updated the documentation in GitLab/Backup_and_Restore and GitLab/Replica.

I also archived operations/gitlab-ansible and left a note to use puppet instead.

I'm going to close this task. Feel free to reopen it if anything puppet-related comes up again.