
Migrate gitlab-test instance to puppet
Closed, Resolved (Public)

Description

The WMCS project gitlab-test hosts an instance called gitlab-ansible-test, which was used for pre-production testing of GitLab configuration changes. The instance was set up before the migration from Ansible to Puppet (see T283076). The project gitlab-test also contains a dedicated puppet host and a gitlab-puppet-test machine which was mostly used to test changes for the puppetisation of the Ansible code (see T283076). This instance is out of date and has no public IP (because of quota limits).

We have to decide whether a dedicated puppet host is needed and, if so, how we want to manage it.
The old instances and the puppet host should be replaced by a new instance which uses the same code as production GitLab (and, if needed, a fresh puppet host).

So I see the following steps here:

  • cleanup old instances
    • gitlab-ansible-test
    • gitlab-puppet-test-7
    • puppet-jelto-6
  • decide whether a dedicated puppet host is needed
  • create new gitlab-test instance (gitlab-prod-1001 in devtools project)
  • review and adapt Hiera data for the gitlab role for WMCS (754063) (done, but some follow-ups are in progress)
  • set up gitlab-test (now gitlab-prod-1001.devtools) using Puppet
  • solve issue with installation of gitlab role
    • gitlab-ce package installation fails because of postgres version (?)
    • additional firewall rules for floating IP
    • fix listen addresses for floating IP
    • fix failing certbot runs
  • migrate to wmcloud.org DNS zone
  • add gitlab.devtools.wmcloud.org (formerly gitlab.devtools.wmflabs.org) to CAS-SSO
  • apply gitlab-settings
  • document everything

[ ] automate creation of new ephemeral test instances (see T302976)

Event Timeline


per the gitlab IC meeting we had today, the plan is:

  • give Jelto / Arnold / Brennen access to the existing project "devtools" that has the Gerrit and Phabricator instances in it (todo for Daniel)
  • ensure we don't run into quota issues in that project when creating new gitlab instances (todo for Daniel)
  • create one new instance that we always keep "like production" and that is intended to exist permanently (todo for Daniel)
  • additional instances for other types of testing are expected to exist only temporarily. Individuals create them, use them and then remove them again (so the quota should allow for maybe 3 of those at a time?)
  • we can either use the local puppetmaster we already have in this project (no additional work) or the production puppetmaster. Using the local puppetmaster has the advantage that we can fully test puppet changes before merging them in Gerrit, but the disadvantage that there is an extra sync step between the "upstream" puppetmaster and the project puppetmaster. Puppet agents can be configured to use either of them (see the sketch after this list)
  • look at the existing floating IP and DNS setup in gitlab-test and copy it over to devtools (todo for Daniel)
  • entirely delete the gitlab-test project with the instance inside it. we don't need it, nobody seems to use it (maybe verify nobody logged in recently one last time)
  • apply prod puppet role on new gitlab instance in devtools, check for errors, fix them, until it works (todo for 'to be determined')
  • use this for testing and call it done
  • bonus: other people, for example Mukunda, who wants to integrate Phabricator with GitLab, get a single project where both instances can talk to each other without needing extra security / firewall rules etc.
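For reference, pointing an instance at the project puppetmaster is done via instance (or project) Hiera in Horizon; a minimal sketch, assuming the standard Cloud VPS puppetmaster key and the puppetmaster-1001 host named later in this task:

# Horizon > Puppet Configuration > Hiera (sketch; hostname assumed from this project)
puppetmaster: puppetmaster-1001.devtools.eqiad1.wikimedia.cloud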
Dzahn changed the task status from Open to In Progress. Jan 19 2022, 5:22 PM

After logging out and back in on Horizon you should see the new project in the drop-down list.

fyi, currently we have this situation:

Screenshot from 2022-01-19 09-26-49.png (365×505 px, 39 KB)

Let me go through the list to explain:

doc1002 - temp instance that was created to see if the role(doc) (doc.wikimedia.org) works on bullseye, but the "move doc to bullseye" project is stalled on the releng side. I already created the new ganeti instances for this quite some time ago. Should be deleted? We may need to check with hashar first.

doc - as above but on stretch. Should be deleted? We may need to check with hashar first.

deploy-1002 - This is permanent. It is our local deployment server. This is so that gerrit/phabricator changes can be deployed like they are deployed in production, from a deployment server. Btw the "-" between name and number is supposed to be the correct naming scheme in this project.

phabricator-stage-1002 - This is a phabricator instance that is meant to be used for testing "new stuff". The type that does not have to be "like production". Hence it's called "stage". Anyone can use it. Would point Mukunda to this one first.

puppetmaster-1001 - This is our local puppetmaster. it stays around permanently. we can use it for gitlab instances too.

phabricator-prod-1001 - This is a phabricator instance that is meant to always be "like production", so unlike "stage" this should not be changed manually so that others can trust it is like prod.

gerrit-prod-1001 - As above but for Gerrit. The "gerrit-stage" should also exist in theory but does not right now.

So following this pattern I would make a new gitlab-prod-1001.

Mentioned in SAL (#wikimedia-cloud) [2022-01-19T17:36:47Z] <mutante> - added brennen, aokoth and jelto as users and projectadmins (T297411)

Dzahn changed the task status from In Progress to Stalled. Jan 19 2022, 5:58 PM

And yes, we are at the quota limit. I think it's just "instance count" and nothing more specific about CPU/disk etc.

Part of the reason is that we also have the 2 doc instances in there that are just "kind of" related. (releng yes, but not really devtools).

I made T299561 to explain all that and ask for quota increase to unblock us here.

Mentioned in SAL (#wikimedia-cloud) [2022-01-21T21:57:23Z] <mutante> - deleted instances "doc" and "doc1002" to make room for gitlab instance T299561 - T297411

Mentioned in SAL (#wikimedia-cloud) [2022-01-21T22:11:59Z] <mutante> - created new instance gitlab-prod-1001 T297411

Dzahn changed the task status from Stalled to In Progress. Jan 21 2022, 10:18 PM
Dzahn removed Dzahn as the assignee of this task. (Edited) Jan 21 2022, 10:21 PM

current status here:

We now have the new instance gitlab-prod-1001. It has 2 CPUs and 4GB RAM.

When applying the prod puppet role (role::gitlab) via Horizon and running puppet the status is:

Function lookup() did not find a value for the name 'profile::gitlab::active_host'

So next thing needed here is to add values in Hiera for cloud.

This could be done via Horizon or the repo and I would greatly prefer if we use the repo.

We should add a section for gitlab to puppet/hieradata/cloud/eqiad1/devtools/common.yaml in the puppet repo, merge it, and continue here.
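A minimal sketch of the kind of Hiera section meant here, assuming the key name from the lookup error above and the instance name chosen earlier (the real values belong in hieradata/cloud/eqiad1/devtools/common.yaml):

# hieradata/cloud/eqiad1/devtools/common.yaml (sketch)
profile::gitlab::active_host: 'gitlab-prod-1001.devtools.eqiad1.wikimedia.cloud'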

For now I removed the role again so that we don't get pinged for failed puppet on the instance. Will follow-up next Gitlab IC meeting.

btw, you can see here in Phabricator when someone changes roles on cloud VPS instances. example:

https://phabricator.wikimedia.org/rCLIP864817a030ae844bcb37b34f2a5b7669e603b610

@Dzahn thanks a lot for setting up gitlab-prod-1001!

So next thing needed here is to add values in Hiera for cloud.

This could be done via Horizon or the repo and I would greatly prefer if we use the repo.

We should add a section for gitlab to puppet/hieradata/cloud/eqiad1/devtools/common.yaml in the puppet repo, merge it, and continue here.

I prepared https://gerrit.wikimedia.org/r/c/operations/puppet/+/754063 for that and amended the patch to use devtools instead of gitlab-test. This change adds the missing key profile::gitlab::active_host and various others.

We still need one additional floating IP in devtools (see TODO in change above). So I added this to the request in T299561.

Change 754063 merged by Dzahn:

[operations/puppet@production] gitlab: update cloud hiera, refactor naming

https://gerrit.wikimedia.org/r/754063

@Jelto Thank you!

I deployed your amended change. It looked good: compiled and deployed in production, confirmed noop.

Then re-applied the gitlab role on our cloud instance.

Next we will need: "did not find a value for the name 'profile::gitlab::service_ip_v4'" (edit: so, yes, that is exactly as expected; we are waiting for the floating IP). ACK

Also thanks for catching the floating IP part of our quota request!

Thanks to @aborrero our quota increase request is resolved.

We should now be able to have 2 additional temp. VMs at a time. So 2 of us could do a temp test at a time without touching the "prod" instance.

And we can now add the new floating IP to Hiera.

I clicked "allocate IP to project" and added description "gitlab-ssh" and we got assigned

185.15.56.79

Then I associated that IP with gitlab-prod-1001 172.16.2.73.
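The follow-up change below records the new floating IP in the cloud Hiera data; a minimal sketch, assuming the service_ip_v4 key from the earlier lookup error:

# hieradata/cloud/eqiad1/devtools/common.yaml (sketch)
profile::gitlab::service_ip_v4: '185.15.56.79'   # floating IP allocated above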

Change 758793 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] hiera/cloud/gitlab-test: add floating ip

https://gerrit.wikimedia.org/r/758793

Change 758793 merged by Dzahn:

[operations/puppet@production] hiera/cloud/gitlab-test: add floating ip

https://gerrit.wikimedia.org/r/758793

Tested applying the puppet role after the floating IP was added.

We will need "profile::gitlab::monitoring_whitelist" in Hiera next.

Change 758889 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] gitlab: add profile::gitlab::monitoring_whitelist in cloud Hiera

https://gerrit.wikimedia.org/r/758889

Change 758889 merged by Dzahn:

[operations/puppet@production] gitlab: add profile::gitlab::monitoring_whitelist in cloud Hiera

https://gerrit.wikimedia.org/r/758889

The merge above fixed: did not find a value for the name 'profile::gitlab::monitoring_whitelist'

The next issue is: parameter 'exporters' expects a Hash value, got Array
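The error is a type mismatch: the profile's exporters parameter expects a Hash, presumably keyed by exporter name, but got an Array. A hedged illustration of data that satisfies the type, using an assumed profile::gitlab::exporters Hiera key:

# sketch only: an empty Hash satisfies the Hash type,
# whereas an empty Array ([]) triggers the error quoted above
profile::gitlab::exporters: {}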

Change 758894 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] gitlab: parameter for exporters expects Hash but is array by default

https://gerrit.wikimedia.org/r/758894

Change 758894 merged by Dzahn:

[operations/puppet@production] gitlab: parameter for exporters expects Hash but is array by default

https://gerrit.wikimedia.org/r/758894

@LSobanski

The "role::gitlab" (except same thing as prod) class is now applied on gitlab-prod-1001.devtools in cloud VPS andDOES NOT FAIL anymore. as in "puppet agent finishes". This is a nice step to achieve for now.

Notice: Applied catalog in 475.70 seconds

That being said we are aware there are some more follow-ups we'll have to fix, among them:

"(Network is unreachable - connect(2) for "acmechief1001.eqiad.wmnet" port 8140)"

We can't get certs from acmechief, which lives in prod (it's not reachable from Cloud VPS).

This is not the first time Cloud VPS projects have run into this, though, so hopefully there are ways around it.

  • installed gitlab-ce package post-installation script subprocess returned error exit status 1
  • nginx initial setup race condition?
  • Checking if a newer PostgreSQL version is available and attempting automatic upgrade to it: NOT OK

Change 759299 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] gitlab: add parameter to allow usign either acmechief or certbot

https://gerrit.wikimedia.org/r/759299

Change 759299 merged by Dzahn:

[operations/puppet@production] gitlab: parameter to allow using either acmechief or certbot for certs

https://gerrit.wikimedia.org/r/759299

Mentioned in SAL (#wikimedia-operations) [2022-02-02T22:26:02Z] <mutante> gitlab - introducing parameter to fetch TLS certs either with acmechief or certbot (if in cloud). Boolean $use_acmechief = lookup('profile::gitlab::use_acmechief'), confirmed noop in prod on gitlab1001.wikimedia.org ( T297411)
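In the devtools cloud Hiera this toggle presumably gets switched off so the instance falls back to certbot; a minimal sketch, assuming the key name quoted in the SAL entry above:

# hieradata/cloud/eqiad1/devtools/common.yaml (sketch)
profile::gitlab::use_acmechief: false   # acmechief is unreachable from Cloud VPS, use certbot instead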

Alright! the TLS cert issue should be fixed with the above ^

at least in this sense:

Feb 02 22:36:55 gitlab-prod-1001 systemd[1]: certbot.service: Succeeded.

Also the nginx issues disappeared from the puppet run!

We are getting there, step by step.

But we still have an issue with the installation of the gitlab-ce package.

The following NEW packages will be installed:
  gitlab-ce
0 upgraded, 1 newly installed, 0 to remove and 0 not upgraded.
Need to get 0 B/964 MB of archives.
After this operation, 2638 MB of additional disk space will be used.
Selecting previously unselected package gitlab-ce.
(Reading database ... 58564 files and directories currently installed.)
Preparing to unpack .../gitlab-ce_14.4.5-ce.0_amd64.deb ...
Unpacking gitlab-ce (14.4.5-ce.0) ...
Setting up gitlab-ce (14.4.5-ce.0) ...
Checking PostgreSQL executables:Starting Chef Infra Client, version 15.17.4
resolving cookbooks for run list: ["gitlab::config", "postgresql::bin"]
Synchronizing Cookbooks:
  - gitlab (0.0.1)
  - postgresql (0.1.0)
  - package (0.1.0)
  - logrotate (0.1.0)
  - redis (0.1.0)
  - registry (0.1.0)
  - monitoring (0.1.0)
  - mattermost (0.1.0)
  - consul (0.1.0)
  - gitaly (0.1.0)
  - praefect (0.1.0)
  - gitlab-kas (0.1.0)
  - gitlab-pages (0.1.0)
  - letsencrypt (0.1.0)
  - nginx (0.1.0)
  - runit (5.1.3)
  - acme (4.1.3)
  - crond (0.1.0)
Installing Cookbook Gems:
Compiling Cookbooks...
Converging 4 resources
Recipe: postgresql::bin
  * ruby_block[check_postgresql_version] action run (skipped due to not_if)
  * ruby_block[check_postgresql_version_is_deprecated] action run (skipped due to not_if)
  * ruby_block[Link postgresql bin files to the correct version] action run (skipped due to only_if)
  * template[/opt/gitlab/etc/gitlab-psql-rc] action create (up to date)

Running handlers:
Running handlers complete
Chef Infra Client finished, 0/4 resources updated in 08 seconds
Checking PostgreSQL executables: OK
Checking if a newer PostgreSQL version is available and attempting automatic upgrade to it:Traceback (most recent call last):
	7: from /opt/gitlab/embedded/bin/omnibus-ctl:23:in `<main>'
	6: from /opt/gitlab/embedded/bin/omnibus-ctl:23:in `load'
	5: from /opt/gitlab/embedded/lib/ruby/gems/2.7.0/gems/omnibus-ctl-0.6.0/bin/omnibus-ctl:31:in `<top (required)>'
	4: from /opt/gitlab/embedded/lib/ruby/gems/2.7.0/gems/omnibus-ctl-0.6.0/lib/omnibus-ctl.rb:746:in `run'
	3: from /opt/gitlab/embedded/lib/ruby/gems/2.7.0/gems/omnibus-ctl-0.6.0/lib/omnibus-ctl.rb:204:in `block in add_command_under_category'
	2: from /opt/gitlab/embedded/service/omnibus-ctl/pg-upgrade.rb:134:in `block in load_file'
	1: from /opt/gitlab/embedded/service/omnibus-ctl/pg-upgrade.rb:134:in `new'
/opt/gitlab/embedded/service/omnibus-ctl/lib/gitlab_ctl/pg_upgrade.rb:28:in `initialize': undefined method `[]' for nil:NilClass (NoMethodError)
Checking if a newer PostgreSQL version is available and attempting automatic upgrade to it: NOT OK
Error ensuring PostgreSQL is updated. Please check the logs
dpkg: error processing package gitlab-ce (--configure):
 installed gitlab-ce package post-installation script subprocess returned error exit status 1
Errors were encountered while processing:
 gitlab-ce
E: Sub-process /usr/bin/dpkg returned an error code (1) (corrective)
Notice: /Stage[main]/Gitlab/Exec[Reconfigure GitLab]: Dependency Package[gitlab-ce] has failures: true
Warning: /Stage[main]/Gitlab/Exec[Reconfigure GitLab]: Skipping because of failed dependencies
Warning: /Stage[main]/Gitlab/Service[gitlab-ce]: Skipping because of failed dependencies

I am trying to manually remove and install the package next.

Dzahn updated the task description.

this is not puppet-related anymore. Manually installing gitlab-ce also does this:

Checking PostgreSQL executables: OK
Checking if a newer PostgreSQL version is available and attempting automatic upgrade to it:Traceback (most recent call last):
	7: from /opt/gitlab/embedded/bin/omnibus-ctl:23:in `<main>'
	6: from /opt/gitlab/embedded/bin/omnibus-ctl:23:in `load'
	5: from /opt/gitlab/embedded/lib/ruby/gems/2.7.0/gems/omnibus-ctl-0.6.0/bin/omnibus-ctl:31:in `<top (required)>'
	4: from /opt/gitlab/embedded/lib/ruby/gems/2.7.0/gems/omnibus-ctl-0.6.0/lib/omnibus-ctl.rb:746:in `run'
	3: from /opt/gitlab/embedded/lib/ruby/gems/2.7.0/gems/omnibus-ctl-0.6.0/lib/omnibus-ctl.rb:204:in `block in add_command_under_category'
	2: from /opt/gitlab/embedded/service/omnibus-ctl/pg-upgrade.rb:134:in `block in load_file'
	1: from /opt/gitlab/embedded/service/omnibus-ctl/pg-upgrade.rb:134:in `new'
/opt/gitlab/embedded/service/omnibus-ctl/lib/gitlab_ctl/pg_upgrade.rb:28:in `initialize': undefined method `[]' for nil:NilClass (NoMethodError)
Checking if a newer PostgreSQL version is available and attempting automatic upgrade to it: NOT OK
Error ensuring PostgreSQL is updated. Please check the logs
dpkg: error processing package gitlab-ce (--configure):
 installed gitlab-ce package post-installation script subprocess returned error exit status 1
Errors were encountered while processing:
 gitlab-ce
E: Sub-process /usr/bin/dpkg returned an error code (1)

Change 762495 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gitlab: rename test instance, use letsencrypt certs

https://gerrit.wikimedia.org/r/762495

Change 762495 merged by Jelto:

[operations/puppet@production] gitlab: rename test instance, use letsencrypt certs

https://gerrit.wikimedia.org/r/762495

Change 762803 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gitlab: add ferm rules and fix listen_addresses for test instance

https://gerrit.wikimedia.org/r/762803

Change 762803 merged by Jelto:

[operations/puppet@production] gitlab: add ferm rules and fix listen_addresses for test instance

https://gerrit.wikimedia.org/r/762803

Change 762823 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] hiera::role::common::idp add gitlab-replica to CAS-SSO

https://gerrit.wikimedia.org/r/762823

  • installed gitlab-ce package post-installation script subprocess returned error exit status 1
  • nginx initial setup race condition?
  • Checking if a newer PostgreSQL version is available and attempting automatic upgrade to it: NOT OK

Puppet runs on gitlab-prod-1001.devtools.eqiad1.wikimedia.cloud now look good and succeed. I think the failures were mostly due to firewall/ferm issues and a bad nginx configuration (wrong cert path and listen addresses). The issues should be fixed with the changes above. I had to implement some WMCS/labs-specific code because the floating IP behaves a little differently from production instances.

The GitLab test instance looks healthy now. The last missing parts are login using SSO (https://gerrit.wikimedia.org/r/762823) and adjusting the settings slightly (gitlab-settings). Then we can clean up the old Ansible-managed test instance and also update the docs.

Change 762823 merged by Jelto:

[operations/puppet@production] gitlab: move gitlab test instance to wmcloud.org

https://gerrit.wikimedia.org/r/762823

Migration of the new test instance to the wmcloud.org zone was successful. SSO login using the wmcloud IdP also works. I would consider the test instance at https://gitlab.devtools.wmcloud.org/explore usable now.

I'll try to update the docs accordingly and remove the old Ansible-managed test instance.

I've noticed that after some arbitrary time networking to the new instance stops working (no http/https/ssh possible). I'm trying to find out what's going on there. A soft reboot from the Horizon web interface fixes the issue.

Jelto lowered the priority of this task from Medium to Low. Feb 17 2022, 12:53 PM

I cleaned up the old gitlab-ansible-test instance together with its floating IP and disk. I also added some docs at https://wikitech.wikimedia.org/wiki/GitLab/Test_Instance.

In yesterday's meeting we discussed that we may need some more automation for ephemeral/temporary GitLab test instances, so that bigger changes can be tested without altering the production-like test instance. I started to document the steps needed for a new test instance here: https://wikitech.wikimedia.org/wiki/GitLab/Test_Instance#Setup_new_test_instances I see quite some room for more automation and optimization.

I'll change the priority to low. The new test instance was the most urgent part. Having temporary test instances is not needed currently but will become relevant in the near future.

SSH access to the test instance is not working because of the different networking behavior on WMCS/VPS. The public floating IP ("service IP") is NATed to the VM, so we cannot bind to this address directly.
I requested a second networking port in T302803 and hope we can map/NAT the floating IP to this second port to replicate the production configuration (with NGINX and the git SSH daemon listening on different addresses).

I also created a new network security group for GitLab, which allows SSH access from everywhere.

Change 767473 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gitlab: update sevice_ip and ferm_drange for wmcs

https://gerrit.wikimedia.org/r/767473

Change 767473 merged by Jelto:

[operations/puppet@production] gitlab: update sevice_ip and ferm_drange for wmcs

https://gerrit.wikimedia.org/r/767473

With the help of @Majavah the correct configuration of the private and public/floating IPs was found. HTTPS and cloning over SSH work now. Thanks again!

The test instance https://gitlab.devtools.wmcloud.org/ should be usable for testing now.

I will create a dedicated, low priority task for the automation of new ephemeral test instances, so we can close this task soon.

Change 767484 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gitlab: remove realm check, move listen_addresses to hiera

https://gerrit.wikimedia.org/r/767484

Change 767484 merged by Jelto:

[operations/puppet@production] gitlab: remove realm check, move listen_addresses to hiera

https://gerrit.wikimedia.org/r/767484

Jelto claimed this task.

I created a dedicated task to automate the test instance creation: T302976

I also changed the flavor of gitlab-prod-1001 to g3.cores4.ram8.disk20 because of some hard-to-reproduce networking issues (no http/ssh connection possible, reboot needed). I assume 4GB of memory was not enough and critical processes got killed. This is a bit hard to troubleshoot without metrics and monitoring in WMCS. I'll keep an eye on the test instance to check if it happens again.

I'll close this task. The test instance should be fully functional now: https://gitlab.devtools.wmcloud.org/

If you need anything more or find any bugs, feel free to re-open this task.

Change 777345 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gitlab: add backup and restore intervals to cloud hiera

https://gerrit.wikimedia.org/r/777345

Change 777345 merged by Jelto:

[operations/puppet@production] gitlab: add backup and restore intervals to cloud hiera

https://gerrit.wikimedia.org/r/777345

Mentioned in SAL (#wikimedia-cloud) [2022-04-18T19:07:16Z] <mutante> - gitlab-prod-1001 randomly stopped working. we got the "puppet failed" mails without having made changes and can't ssh to the instance anymore when trying to check out why. trying soft reboot via Horizon T297411

Mentioned in SAL (#wikimedia-cloud) [2022-04-18T19:08:19Z] <mutante> - gitlab-prod-1001 is indeed back after soft rebooting the instance. uptime 1 min T297411

running puppet after it came back showed:

Notice: /Stage[main]/Profile::Gitlab/Interface::Alias[gitlab service IP]/Interface::Ip[gitlab service IP ipv6]/Exec[ip addr add ::/128 preferred_lft 0 dev eth0]/returns: executed successfully (corrective)

puppet runs on the test instance gitlab-prod-1001 fail with

Error: /File[/var/lib/puppet/facts.d]: Failed to generate additional resources using 'eval_generate': SSL_connect returned=1 errno=0 state=error: certificate verify failed (self signed certificate in certificate chain): [self signed certificate in certificate chain for /CN=Puppet CA: puppet]
Error: /File[/var/lib/puppet/facts.d]: Could not evaluate: Could not retrieve file metadata for puppet:///pluginfacts: SSL_connect returned=1 errno=0 state=error: certificate verify failed (self signed certificate in certificate chain): [self signed certificate in certificate chain for /CN=Puppet CA: puppet]

I'm reopening this task until we find a fix for that.

Does this only affect this instance, or maybe all users who have a local puppetmaster in their VPS project? It seems like we haven't touched anything and it was working before; the error makes me think something changed somewhere upstream, or alternatively someone tried to switch between the local project puppetmaster and the regular global puppetmaster.

Change 835082 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gitlab: fix ssh listen address for gitlab test instance

https://gerrit.wikimedia.org/r/835082

Change 835082 merged by Jelto:

[operations/puppet@production] gitlab: fix ssh listen address for gitlab test instance

https://gerrit.wikimedia.org/r/835082

Change 835089 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gitlab: set ssh listen address for gitlab test instance

https://gerrit.wikimedia.org/r/835089

Change 835089 merged by Jelto:

[operations/puppet@production] gitlab: set ssh listen address for gitlab test instance

https://gerrit.wikimedia.org/r/835089

gitlab-prod-1001 had failing puppet runs (root@ mail):

Sep 24 08:12:47 gitlab-prod-1001 systemd[1]: Starting OpenBSD Secure Shell server (GitLab endpoint)...
Sep 24 08:12:47 gitlab-prod-1001 sshd[16673]: error: Bind to port 22 on 172.16.7.146 failed: Address already in use.
Sep 24 08:12:47 gitlab-prod-1001 sshd[16673]: fatal: Cannot bind any address.

The admin sshd was listening on 0.0.0.0, which interfered with GitLab's ssh daemon.

I fixed it by setting profile::ssh::server::listen_addresses explicitly for the admin sshd. See changes above.
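A hedged sketch of the kind of Hiera override meant here; the key name is the one mentioned above, while the address is a placeholder taken from earlier in this task (the real value is whatever the instance's primary address is):

# instance Hiera for gitlab-prod-1001 (sketch; placeholder address)
# pin the admin sshd to the primary instance address so GitLab's own sshd
# can bind its service address (172.16.7.146 in the log above)
profile::ssh::server::listen_addresses:
  - 172.16.2.73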

@Dzahn @brennen I'm closing this task. From my understanding the test instance is functional and the puppet error should be fixed. Feel free to re-open in case anything is missing.