
Migrate gitlab-test instance to bullseye
Open, High, Public

Description

GitLab test instance gitlab-prod-1001.devtools.eqiad1.wikimedia.cloud is running buster. It should be upgraded to bullseye to make it consistent with production.

Things to consider:

Event Timeline

Restricted Application added a subscriber: Aklapper.

I wanted to click and create the instance gitlab-prod-1002 for this.

But we are out of quota in the Cloud VPS project again (due to puppetdb, deployment server, vrts, etc.).

We need to make a quota increase request -> https://wikitech.wikimedia.org/wiki/Help:Cloud_VPS_Instances#Increase_quotas_for_projects

(Unless we want to delete the 1001 instance first, but I would recommend against this. Standard procedure is to set up new versions in parallel, and it's often useful that way.)

I noticed we were only at "9/10" instances quota-wise, so I jumped on that and created gitlab-prod-1002 just now. That filled up the quota, but we got away with it without another increase request.

instance created, puppetmaster changed via Hiera:

-server = puppetmaster.cloudinfra.wmflabs.org
-ca_server = puppetmaster.cloudinfra.wmflabs.org
+server = puppetmaster-1001.devtools.eqiad1.wikimedia.cloud
+ca_server = puppetmaster-1001.devtools.eqiad1.wikimedia.cloud

and FAIL..

Warning: Unable to fetch my node definition, but the agent run will continue:
Warning: SSL_connect returned=1 errno=0 state=error: certificate verify failed (self signed certificate in certificate chain): [self signed certificate in certificate chain for /CN=Puppet CA: puppetmaster-1001.devtools.eqiad.wmflabs]
Info: Retrieving pluginfacts
Error: /File[/var/lib/puppet/facts.d]: Failed to generate additional resources using 'eval_generate': SSL_connect returned=1 errno=0 state=error: certificate verify failed (self signed certificate in certificate chain): [self signed certificate in certificate chain for /CN=Puppet CA: puppetmaster-1001.devtools.eqiad.wmflabs]
Error: /File[/var/lib/puppet/facts.d]: Could not evaluate: Could not retrieve file metadata for puppet:///pluginfacts: SSL_connect returned=1 errno=0 state=error: certificate verify failed (self signed certificate in certificate chain): [self signed certificate in certificate chain for /CN=Puppet CA: puppetmaster-1001.devtools.eqiad.wmflabs]
Info: Retrieving plugin
Error: /File[/var/lib/puppet/lib]: Failed to generate additional resources using 'eval_generate': SSL_connect returned=1 errno=0 state=error: certificate verify failed (self signed certificate in certificate chain): [self signed certificate in certificate chain for /CN=Puppet CA: puppetmaster-1001.devtools.eqiad.wmflabs]
Error: /File[/var/lib/puppet/lib]: Could not evaluate: Could not retrieve file metadata for puppet:///plugins: SSL_connect returned=1 errno=0 state=error: certificate verify failed (self signed certificate in certificate chain): [self signed certificate in certificate chain for /CN=Puppet CA: puppetmaster-1001.devtools.eqiad.wmflabs]
Info: Loading facts
Error: Could not retrieve catalog from remote server: SSL_connect returned=1 errno=0 state=error: certificate verify failed (self signed certificate in certificate chain): [self signed certificate in certificate chain for /CN=Puppet CA: puppetmaster-1001.devtools.eqiad.wmflabs]
Warning: Not using cache on failed catalog
Error: Could not retrieve catalog; skipping run
Error: Could not send report: SSL_connect returned=1 errno=0 state=error: certificate verify failed (self signed certificate in certificate chain): [self signed certificate in certificate chain for /CN=Puppet CA: puppetmaster-1001.devtools.eqiad.wmflabs]

This is fixed by rm -rf /var/lib/puppet/ssl and running puppet again to create a new certificate request.

This has been coming up more or less every time we switch masters, for years: T187042#3963751 et al.
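For anyone repeating the fix above, here is a minimal sketch of a more cautious variant: it renames the agent's SSL directory instead of deleting it, so the old certificates can still be inspected afterwards. The function name `set_aside_ssl_dir` and the backup naming scheme are made up for illustration; only the path /var/lib/puppet/ssl comes from this task.

```python
import time
from pathlib import Path
from typing import Optional


def set_aside_ssl_dir(ssl_dir: Path) -> Optional[Path]:
    """Rename ssl_dir to a timestamped backup next to it.

    Returns the backup path, or None if ssl_dir does not exist.
    Hypothetical helper: a non-destructive stand-in for `rm -rf`.
    """
    if not ssl_dir.exists():
        return None
    stamp = time.strftime("%Y%m%d%H%M%S")
    backup = ssl_dir.with_name(f"{ssl_dir.name}.bak-{stamp}")
    ssl_dir.rename(backup)
    return backup


# Usage on the instance (as root), followed by another Puppet agent run
# so that a new certificate request is generated:
#   set_aside_ssl_dir(Path("/var/lib/puppet/ssl"))
```

The backup can be deleted once the agent has a signed certificate from the new master.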

next issue:

Setting up gitlab-ce (15.5.7-ce.0) ...
It looks like there was a problem with public attributes; run gitlab-ctl reconfigure manually to fix.
dpkg: error processing package gitlab-ce (--configure):
 installed gitlab-ce package post-installation script subprocess returned error exit status 1
Errors were encountered while processing:
 gitlab-ce
E: Sub-process /usr/bin/dpkg returned an error code (1) (corrective)
Notice: /Stage[main]/Gitlab/Exec[Reconfigure GitLab]: Dependency Package[gitlab-ce] has failures: true
Warning: /Stage[main]/Gitlab/Exec[Reconfigure GitLab]: Skipping because of failed dependencies
Warning: /Stage[main]/Gitlab/Service[gitlab-ce]: Skipping because of failed dependencies

and after that:

Error: Systemd start for ssh-gitlab failed!
journalctl log for ssh-gitlab:
-- Journal begins at Wed 2022-05-18 13:34:37 UTC, ends at Wed 2023-01-11 23:50:20 UTC. --
Jan 11 23:45:54 gitlab-prod-1002 systemd[1]: Starting OpenBSD Secure Shell server (GitLab endpoint)...
Jan 11 23:45:54 gitlab-prod-1002 sshd[34988]: error: Bind to port 22 on 172.16.7.146 failed: Address already in use.
Jan 11 23:45:54 gitlab-prod-1002 sshd[34988]: fatal: Cannot bind any address.
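To tell "Address already in use" apart from an address that is simply not configured on the host, a small probe like the following may help. It is only a sketch: `probe_bind` is a hypothetical helper, it has to run as root to test port 22, and unlike sshd it does not set SO_REUSEADDR, so results can differ in edge cases.

```python
import errno
import socket


def probe_bind(ip: str, port: int) -> str:
    """Try to bind a TCP socket to ip:port and classify the outcome.

    Returns "free", "in use" (EADDRINUSE, the error sshd reported above)
    or "not assigned" (EADDRNOTAVAIL: the IP is not on any local interface).
    Other errors, such as missing privileges for ports < 1024, are re-raised.
    """
    family = socket.AF_INET6 if ":" in ip else socket.AF_INET
    with socket.socket(family, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((ip, port))
        except OSError as exc:
            if exc.errno == errno.EADDRINUSE:
                return "in use"
            if exc.errno == errno.EADDRNOTAVAIL:
                return "not assigned"
            raise
    return "free"


# Example, using the address from the journal output above:
#   probe_bind("172.16.7.146", 22)
```

An "in use" result means some other listener already holds that address and port; finding out which one is a separate step.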

running gitlab-ctl reconfigure manually results in:

Running handlers:
[2023-01-11T23:56:28+00:00] ERROR: Running exception handlers
There was an error running gitlab-ctl reconfigure:

directory[/etc/letsencrypt/live/gitlab.devtools.wmcloud.org] (letsencrypt::enable line 19) had an error: Chef::Exceptions::EnclosingDirectoryDoesNotExist: Parent directory /etc/letsencrypt/live does not exist, cannot create /etc/letsencrypt/live/gitlab.devtools.wmcloud.org

As in the past, this looks like a general issue with Let's Encrypt / acme_chief in cloud.

Error: Cannot create /etc/gitlab/config_backup/latest; parent directory /etc/gitlab/config_backup does not exist
Error: /Stage[main]/Gitlab::Backup/File[/etc/gitlab/config_backup/latest]/ensure: change from 'absent' to 'directory' failed: Cannot create /etc/gitlab/config_backup/latest; parent directory /etc/gitlab/config_backup does not exist

Thanks for looking at the test instance!
We had the same issue with gitlab-prod-1001 afair.

There is a wiki page about the initial configuration for a new test instance.
Regarding Let's Encrypt, it mentions:

"Run initial certbot command (see todo below, will be added to puppet):"

certbot certonly --standalone --preferred-challenges http -d <instance-name>.devtools.wmcloud.org

Note there is a task to automate this further: T302976.

Furthermore we had some help from @taavi regarding the public IP addresses assigned to GitLab for public http and SSH endpoints: T302803#7745265

Thank you for the docs!

I tried that. I created an account using our team email address.

Currently getting this though:

Account registered.
Requesting a certificate for gitlab-prod-1002.devtools.wmcloud.org
Performing the following challenges:
http-01 challenge for gitlab-prod-1002.devtools.wmcloud.org
Waiting for verification...
Challenge failed for domain gitlab-prod-1002.devtools.wmcloud.org
http-01 challenge for gitlab-prod-1002.devtools.wmcloud.org
Cleaning up challenges
Some challenges have failed.

IMPORTANT NOTES:
 - The following errors were reported by the server:

   Domain: gitlab-prod-1002.devtools.wmcloud.org
   Type:   dns
   Detail: DNS problem: NXDOMAIN looking up A for
   gitlab-prod-1002.devtools.wmcloud.org - check that a DNS record
   exists for this domain; DNS problem: NXDOMAIN looking up AAAA for
   gitlab-prod-1002.devtools.wmcloud.org - check that a DNS record
   exists for this domain

which at least kind of makes sense, since that wmcloud.org name is an external name.

How do you expect Let's Encrypt to perform the http-01 validation since gitlab-prod-1002 is not currently publicly accessible? Please be extra careful here that you don't use all of the (per-IP so shared for all of WMCS) LE API request quotas.

I don't expect anything, I just blindly follow docs to test if they work and report what happens... and I specifically left a comment that I think the error makes sense. Maybe you missed that.

Yes, floating IP / DNS records have to be set up and that makes sense.
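Given the shared Let's Encrypt quota mentioned above, one option is to check that the name resolves at all before running certbot. Below is a minimal sketch; `has_address_record` is a hypothetical helper, and it only reflects what the local resolver sees, which may differ from what Let's Encrypt sees from the outside.

```python
import socket


def has_address_record(hostname: str, lookup=socket.getaddrinfo) -> bool:
    """Return True if hostname resolves to at least one address.

    `lookup` is injectable so the check can be exercised without network
    access; by default it uses the system resolver. Port 80 is passed
    because the http-01 challenge is served over HTTP.
    """
    try:
        return len(lookup(hostname, 80)) > 0
    except socket.gaierror:
        # Covers NXDOMAIN and other resolution failures.
        return False


# Example: only go on to request a certificate if this returns True.
#   has_address_record("gitlab-prod-1002.devtools.wmcloud.org")
```

A successful lookup does not prove that port 80 is reachable from the internet; it only rules out the NXDOMAIN failure shown in the certbot output.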

185.15.56.117 is now the new floating IP associated with gitlab-prod-1002 (we got away with the existing quota).

The existing floating IP is not mapped to an instance under "mapped fixed address", but Wikitech says "link it with new vm", which was done; it is the same for gerrit-prod.

The new name gitlab-bullseye.devtools.wmcloud.org. points to the floating IP.

Added hiera keys in Horizon hiera:

profile::gitlab::cert_path: /etc/letsencrypt/live/gitlab-prod-1002.devtools.wmcloud.org/fullchain.pem
profile::gitlab::key_path: /etc/letsencrypt/live/gitlab-prod-1002.devtools.wmcloud.org/privkey.pem
profile::gitlab::passive_host: gitlab-prod-1002.devtools.wmcloud.org
profile::gitlab::service_ip_v4: 185.15.56.117
profile::gitlab::service_ip_v6: '::'
profile::gitlab::service_name: gitlab-bullseye.devtools.wmcloud.org

but stopping here for now

LSobanski triaged this task as High priority.

Thanks for creating the new IP!

I think we need some help from @taavi for the configuration of the floating IP. According to T302803#7745265 the floating IP is not mapped directly to the instance but to some secondary Neutron port? That way we are able to have two interfaces on the WMCS instance but use one single floating IP externally(?). @taavi are you able to re-create that setup for gitlab-prod-1002 in the devtools project as well (similar to gitlab-prod-1001)?

I think we need some help from @taavi

(or any other Cloud VPS admin :-)

for the configuration of the floating IP. According to T302803#7745265 the floating IP is not mapped directly to the instance but to some secondary Neutron port? That way we are able to have two interfaces on the WMCS instance but use one single floating IP externally(?).

It's a single interface with two IP addresses assigned to it. We can also do two interfaces, but I believe you're using the interface::alias define which works on a single interface.

@taavi are you able to re-create that setup for gitlab-prod-1002 in devtools project as well (similar to gitlab-prod-1001)

Sure! Is it fine to re-use the existing port/extra IP?

Puppet fails on the instance gitlab-prod-1002; from today's email:

Failed resources if any
* Service[ssh-gitlab]
Last run log
NOTICE: ensure changed 'stopped' to 'running' (corrective)
NOTICE: executed successfully (corrective)
NOTICE: ensure changed 'stopped' to 'running' (corrective)
ERR: Systemd start for ssh-gitlab failed!
journalctl log for ssh-gitlab:
-- Journal begins at Wed 2022-05-18 13:34:37 UTC, ends at Tue 2023-01-24 07:45:49 UTC. --
Jan 24 07:45:49 gitlab-prod-1002 systemd[1]: Starting OpenBSD Secure Shell server (GitLab endpoint)...
Jan 24 07:45:49 gitlab-prod-1002 sshd[240041]: error: Bind to port 22 on 185.15.56.117 failed: Address already in use.
Jan 24 07:45:49 gitlab-prod-1002 sshd[240041]: fatal: Cannot bind any address.
Jan 24 07:45:49 gitlab-prod-1002 systemd[1]: ssh-gitlab.service: Main process exited, code=exited, status=255/EXCEPTION
Jan 24 07:45:49 gitlab-prod-1002 systemd[1]: ssh-gitlab.service: Failed with result 'exit-code'.
Jan 24 07:45:49 gitlab-prod-1002 systemd[1]: Failed to start OpenBSD Secure Shell server (GitLab endpoint).

ERR: change from 'stopped' to 'running' failed: Systemd start for ssh-gitlab failed!
NOTICE: Applied catalog in 12.81 seconds

Yes, that's known; see the comments above about setting up the second IP for ssh to listen on. This is WIP.

@taavi are you able to re-create that setup for gitlab-prod-1002 in devtools project as well (similar to gitlab-prod-1001)

Sure! Is it fine to re-use the existing port/extra IP?

We already created a new IP, 185.15.56.117, for the service IP on this one. So if we can use that (and the same port, yes), then we don't have to touch the existing setup before this is ready.

It's not possible to map a floating IP directly as an additional IP that the VM can bind to. Instead it'll involve an extra internal IP that the VM can bind to, and the floating IP will be mapped to that. See T302803 where the current setup was done.

Sure! Is it fine to re-use the existing port/extra IP?

Yes, it's fine.

It's not possible to map a floating IP directly as an additional IP that the VM can bind to.

ACK, thanks. I will remove that new floating IP again. It can be ignored for now and you can re-use the existing one. Thanks!

Mentioned in SAL (#wikimedia-cloud) [2023-01-28T16:26:23Z] <taavi> adjust gitlab-prod-1002 network port settings to allow adding the secondary IP, requested in T318521

Ok, gitlab-prod-1002 can now also assign the IP address 172.16.7.146 (NOT the public/floating ip!) to its primary interface. You will need to remove it from gitlab-prod-1001's interface beforehand for it to work properly. The public/floating IP 185.15.56.79 remains mapped to the additional internal IP address.
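To make the distinction between the two kinds of address concrete, here is a small sketch using Python's standard `ipaddress` module. It assumes, as in this task, that addresses a VM can bind to are private (RFC 1918) and that floating IPs are public; `is_bindable_internal` is a hypothetical name, not part of any existing tooling.

```python
import ipaddress


def is_bindable_internal(ip: str) -> bool:
    """Return True if ip is a private address.

    Assumption for this sketch: private means "internal IP the VM can
    put on its interface", public means "floating IP mapped to it".
    """
    return ipaddress.ip_address(ip).is_private


# Addresses taken from this task:
for addr in ("172.16.7.146", "185.15.56.79", "185.15.56.117"):
    kind = "internal, bindable" if is_bindable_internal(addr) else "floating/public"
    print(f"{addr}: {kind}")
```

A check like this could guard Hiera values such as the address ssh-gitlab listens on, though where to hook it in is left open here.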