
Migrate gitlab-test instance to bullseye
Closed, Resolved · Public

Description

GitLab test instance gitlab-prod-1001.devtools.eqiad1.wikimedia.cloud is running buster. It should be upgraded to bullseye to make it similar to production.

Things to consider:

Event Timeline

Restricted Application added a subscriber: Aklapper.

I wanted to go ahead and create the instance gitlab-prod-1002 for this.

But we are out of quota in the Cloud VPS project again (due to puppetdb, the deployment server, VRTS, etc.).

We need to make a quota increase request -> https://wikitech.wikimedia.org/wiki/Help:Cloud_VPS_Instances#Increase_quotas_for_projects

(Unless we want to delete the 1001 instance first, but I would recommend against that. Standard procedure is to set up the new version in parallel, and it's often useful that way.)

I noticed we were now only at "9/10" instances quota-wise, so I jumped on that and created gitlab-prod-1002 just now. That filled up the quota, but we got away without another increase request.

instance created, puppetmaster changed via Hiera:

-server = puppetmaster.cloudinfra.wmflabs.org
-ca_server = puppetmaster.cloudinfra.wmflabs.org
+server = puppetmaster-1001.devtools.eqiad1.wikimedia.cloud
+ca_server = puppetmaster-1001.devtools.eqiad1.wikimedia.cloud

and FAIL..

Warning: Unable to fetch my node definition, but the agent run will continue:
Warning: SSL_connect returned=1 errno=0 state=error: certificate verify failed (self signed certificate in certificate chain): [self signed certificate in certificate chain for /CN=Puppet CA: puppetmaster-1001.devtools.eqiad.wmflabs]
Info: Retrieving pluginfacts
Error: /File[/var/lib/puppet/facts.d]: Failed to generate additional resources using 'eval_generate': SSL_connect returned=1 errno=0 state=error: certificate verify failed (self signed certificate in certificate chain): [self signed certificate in certificate chain for /CN=Puppet CA: puppetmaster-1001.devtools.eqiad.wmflabs]
Error: /File[/var/lib/puppet/facts.d]: Could not evaluate: Could not retrieve file metadata for puppet:///pluginfacts: SSL_connect returned=1 errno=0 state=error: certificate verify failed (self signed certificate in certificate chain): [self signed certificate in certificate chain for /CN=Puppet CA: puppetmaster-1001.devtools.eqiad.wmflabs]
Info: Retrieving plugin
Error: /File[/var/lib/puppet/lib]: Failed to generate additional resources using 'eval_generate': SSL_connect returned=1 errno=0 state=error: certificate verify failed (self signed certificate in certificate chain): [self signed certificate in certificate chain for /CN=Puppet CA: puppetmaster-1001.devtools.eqiad.wmflabs]
Error: /File[/var/lib/puppet/lib]: Could not evaluate: Could not retrieve file metadata for puppet:///plugins: SSL_connect returned=1 errno=0 state=error: certificate verify failed (self signed certificate in certificate chain): [self signed certificate in certificate chain for /CN=Puppet CA: puppetmaster-1001.devtools.eqiad.wmflabs]
Info: Loading facts
Error: Could not retrieve catalog from remote server: SSL_connect returned=1 errno=0 state=error: certificate verify failed (self signed certificate in certificate chain): [self signed certificate in certificate chain for /CN=Puppet CA: puppetmaster-1001.devtools.eqiad.wmflabs]
Warning: Not using cache on failed catalog
Error: Could not retrieve catalog; skipping run
Error: Could not send report: SSL_connect returned=1 errno=0 state=error: certificate verify failed (self signed certificate in certificate chain): [self signed certificate in certificate chain for /CN=Puppet CA: puppetmaster-1001.devtools.eqiad.wmflabs]

This is fixed by rm -rf /var/lib/puppet/ssl and running Puppet again to create a new certificate request.
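A minimal sketch of that fix, assuming it is run as root on the agent (the signing step on the puppetmaster afterwards depends on the Puppet version):

rm -rf /var/lib/puppet/ssl   # discard client SSL state issued by the old puppetmaster CA
puppet agent --test          # generate a new key and submit a fresh certificate request
# The new request still has to be signed on puppetmaster-1001 before later runs succeed.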

This keeps coming up every time we switch puppetmasters, more or less, and has done for years: T187042#3963751 et al.

next issue:

Setting up gitlab-ce (15.5.7-ce.0) ...
It looks like there was a problem with public attributes; run gitlab-ctl reconfigure manually to fix.
dpkg: error processing package gitlab-ce (--configure):
 installed gitlab-ce package post-installation script subprocess returned error exit status 1
Errors were encountered while processing:
 gitlab-ce
E: Sub-process /usr/bin/dpkg returned an error code (1) (corrective)
Notice: /Stage[main]/Gitlab/Exec[Reconfigure GitLab]: Dependency Package[gitlab-ce] has failures: true
Warning: /Stage[main]/Gitlab/Exec[Reconfigure GitLab]: Skipping because of failed dependencies
Warning: /Stage[main]/Gitlab/Service[gitlab-ce]: Skipping because of failed dependencies

and after that:

Error: Systemd start for ssh-gitlab failed!
journalctl log for ssh-gitlab:
-- Journal begins at Wed 2022-05-18 13:34:37 UTC, ends at Wed 2023-01-11 23:50:20 UTC. --
Jan 11 23:45:54 gitlab-prod-1002 systemd[1]: Starting OpenBSD Secure Shell server (GitLab endpoint)...
Jan 11 23:45:54 gitlab-prod-1002 sshd[34988]: error: Bind to port 22 on 172.16.7.146 failed: Address already in use.
Jan 11 23:45:54 gitlab-prod-1002 sshd[34988]: fatal: Cannot bind any address.
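The "Address already in use" error means another sshd already holds port 22 on that address; a quick way to check which process that is (a sketch, not part of the original log):

ss -tlnp | grep ':22 '   # list TCP listeners on port 22 and the processes holding them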

running gitlab-ctl reconfigure manually results in:

Running handlers:
[2023-01-11T23:56:28+00:00] ERROR: Running exception handlers
There was an error running gitlab-ctl reconfigure:

directory[/etc/letsencrypt/live/gitlab.devtools.wmcloud.org] (letsencrypt::enable line 19) had an error: Chef::Exceptions::EnclosingDirectoryDoesNotExist: Parent directory /etc/letsencrypt/live does not exist, cannot create /etc/letsencrypt/live/gitlab.devtools.wmcloud.org

This seems like a general issue with Let's Encrypt / acme_chief in Cloud VPS, as in the past.

Error: Cannot create /etc/gitlab/config_backup/latest; parent directory /etc/gitlab/config_backup does not exist
Error: /Stage[main]/Gitlab::Backup/File[/etc/gitlab/config_backup/latest]/ensure: change from 'absent' to 'directory' failed: Cannot create /etc/gitlab/config_backup/latest; parent directory /etc/gitlab/config_backup does not exist
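A possible manual workaround for the missing parent directory (an assumption, not necessarily what was done on this instance):

mkdir -p /etc/gitlab/config_backup   # create the parent so Puppet can manage config_backup/latest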

Thanks for looking at the test instance!
We had the same issue with gitlab-prod-1001, AFAIR.

There is a wiki page about the initial configuration for a new test instance.
Regarding Let's Encrypt it mentions:

"Run initial cerbot command (see todo below, will be added to puppet):"

certbot certonly --standalone --preferred-challenges http -d <instance-name>.devtools.wmcloud.org

Note there is a task to automate this further: T302976.

Furthermore, we had some help from @taavi regarding the public IP addresses assigned to GitLab for the public HTTP and SSH endpoints: T302803#7745265

Thank you for the docs!

I tried the certbot command from the wiki page and created an account using our team email address.

Currently getting this though:

Account registered.
Requesting a certificate for gitlab-prod-1002.devtools.wmcloud.org
Performing the following challenges:
http-01 challenge for gitlab-prod-1002.devtools.wmcloud.org
Waiting for verification...
Challenge failed for domain gitlab-prod-1002.devtools.wmcloud.org
http-01 challenge for gitlab-prod-1002.devtools.wmcloud.org
Cleaning up challenges
Some challenges have failed.

IMPORTANT NOTES:
 - The following errors were reported by the server:

   Domain: gitlab-prod-1002.devtools.wmcloud.org
   Type:   dns
   Detail: DNS problem: NXDOMAIN looking up A for
   gitlab-prod-1002.devtools.wmcloud.org - check that a DNS record
   exists for this domain; DNS problem: NXDOMAIN looking up AAAA for
   gitlab-prod-1002.devtools.wmcloud.org - check that a DNS record
   exists for this domain

which at least kind of makes sense, since that wmcloud.org name is the external name.

How do you expect Let's Encrypt to perform the http-01 validation, given that gitlab-prod-1002 is not currently publicly accessible? Please be extra careful here that you don't use up all of the (per-IP, so shared for all of WMCS) LE API request quota.

I don't expect anything; I just blindly follow the docs to test whether they work and report what happens... and I specifically left a comment saying that I think the error makes sense. Maybe you missed that.

Yes, the floating IP / DNS records have to be set up, and that makes sense.

185.15.56.117 is now the new floating IP associated with gitlab-prod-1002 (we got away with the existing quota).

The existing floating IP is not mapped to the instance as a "mapped fixed address", but Wikitech says to "link it with the new VM"; that was done and is the same as for gerrit-prod.

The new name gitlab-bullseye.devtools.wmcloud.org. points to the floating IP.
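A quick way to confirm the new name resolves to the floating IP (a sketch using the name and IP above):

dig +short A gitlab-bullseye.devtools.wmcloud.org   # expected to return 185.15.56.117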


Added Hiera keys in Horizon:

profile::gitlab::cert_path: /etc/letsencrypt/live/gitlab-prod-1002.devtools.wmcloud.org/fullchain.pem
profile::gitlab::key_path: /etc/letsencrypt/live/gitlab-prod-1002.devtools.wmcloud.org/privkey.pem
profile::gitlab::passive_host: gitlab-prod-1002.devtools.wmcloud.org
profile::gitlab::service_ip_v4: 185.15.56.117
profile::gitlab::service_ip_v6: '::'
profile::gitlab::service_name: gitlab-bullseye.devtools.wmcloud.org

but stopping here for now.

LSobanski triaged this task as High priority.

Thanks for creating the new IP!

I think we need some help from @taavi for the configuration of the floating IP. According to T302803#7745265 the floating IP is not mapped directly to the instance but to some secondary Neutron port? That way we are able to have two interfaces on the WMCS instance but use a single floating IP externally(?). @taavi, are you able to re-create that setup for gitlab-prod-1002 in the devtools project as well (similar to gitlab-prod-1001)?

I think we need some help from @taavi

(or any other Cloud VPS admin :-)

for the configuration of the floating IP. According to T302803#7745265 the floating IP is not mapped directly to the instance but to some secondary Neutron port? That way we are able to have two interfaces on the WMCS instance but use one single floating IP externally(?).

It's a single interface with two IP addresses assigned to it. We can also do two interfaces, but I believe you're using the interface::alias define which works on a single interface.
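For reference, the interface::alias approach boils down to roughly this on the host (a sketch; Puppet manages the address in practice, and the specific IP and interface name are the ones that end up being used later in this task):

ip addr add 172.16.7.146/32 dev ens3   # add the internal service IP as a second address on the primary interface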

@taavi are you able to re-create that setup for gitlab-prod-1002 in devtools project as well (similar to gitlab-prod-1001)

Sure! Is it fine to re-use the existing port/extra IP?

Puppet fails on the instance gitlab-prod-1002; from today's email:

Failed resources if any
* Service[ssh-gitlab]
Last run log
NOTICE: ensure changed 'stopped' to 'running' (corrective)
NOTICE: executed successfully (corrective)
NOTICE: ensure changed 'stopped' to 'running' (corrective)
ERR: Systemd start for ssh-gitlab failed!
journalctl log for ssh-gitlab:
-- Journal begins at Wed 2022-05-18 13:34:37 UTC, ends at Tue 2023-01-24 07:45:49 UTC. --
Jan 24 07:45:49 gitlab-prod-1002 systemd[1]: Starting OpenBSD Secure Shell server (GitLab endpoint)...
Jan 24 07:45:49 gitlab-prod-1002 sshd[240041]: error: Bind to port 22 on 185.15.56.117 failed: Address already in use.
Jan 24 07:45:49 gitlab-prod-1002 sshd[240041]: fatal: Cannot bind any address.
Jan 24 07:45:49 gitlab-prod-1002 systemd[1]: ssh-gitlab.service: Main process exited, code=exited, status=255/EXCEPTION
Jan 24 07:45:49 gitlab-prod-1002 systemd[1]: ssh-gitlab.service: Failed with result 'exit-code'.
Jan 24 07:45:49 gitlab-prod-1002 systemd[1]: Failed to start OpenBSD Secure Shell server (GitLab endpoint).

ERR: change from 'stopped' to 'running' failed: Systemd start for ssh-gitlab failed!
NOTICE: Applied catalog in 12.81 seconds

Yeah, that's known, per the comments above about setting up the second IP for SSH to listen on. This is WIP.

@taavi are you able to re-create that setup for gitlab-prod-1002 in devtools project as well (similar to gitlab-prod-1001)

Sure! Is it fine to re-use the existing port/extra IP?

We already created a new IP, 185.15.56.117, for the service IP on this one. So if we can use that (and the same port, yeah), then we don't have to touch the existing setup before this is ready.

It's not possible to map a floating IP directly as an additional IP that the VM can bind to. Instead it'll involve an extra internal IP that the VM can bind to, and the floating IP will be mapped to that. See T302803 where the current setup was done.

Sure! Is it fine to re-use the existing port/extra IP?

Yes, it's fine.

It's not possible to map a floating IP directly as an additional IP that the VM can bind to.

ACK, thanks. I will remove that new floating IP again; it can be ignored for now, and you can re-use the existing one.

Mentioned in SAL (#wikimedia-cloud) [2023-01-28T16:26:23Z] <taavi> adjust gitlab-prod-1002 network port settings to allow adding the secondary IP, requested in T318521

Ok, gitlab-prod-1002 can now also assign the IP address 172.16.7.146 (NOT the public/floating ip!) to its primary interface. You will need to remove it from gitlab-prod-1001's interface beforehand for it to work properly. The public/floating IP 185.15.56.79 remains mapped to the additional internal IP address.

Thanks @taavi. For the record, I cleaned up the following:

  • disassociated floating IP 185.15.56.117 from instance
  • released floating IP 185.15.56.117

Mentioned in SAL (#wikimedia-operations) [2023-01-31T18:26:37Z] <mutante> gitlab-prod-1001.devtools (cloud) - ip addr del 172.16.7.146/21 dev eth0 - T318521

Mentioned in SAL (#wikimedia-operations) [2023-01-31T18:44:36Z] <mutante> gitlab-prod-1001.devtools (cloud) - rebooted VM ; ip addr del 172.16.7.146/32 dev eth0 - T318521

Ok, gitlab-prod-1002 can now also assign the IP address 172.16.7.146 (NOT the public/floating ip!) to its primary interface. You will need to remove it from gitlab-prod-1001's interface beforehand for it to work properly.

There were both 172.16.7.146/21 and 172.16.7.146/32 on eth0. First I removed the former; this killed the SSH connection, even though the IP associated with the hostname is 172.16.2.73. Then I rebooted the machine via Horizon, got back on, and removed 172.16.7.146/32 from eth0.

Then I ran Puppet on the new machine (1002), and it added the IP there:

Notice: /Stage[main]/Profile::Gitlab/Interface::Alias[gitlab service IP]/Interface::Ip[gitlab service IP ipv6]/Exec[ip addr add ::/128 preferred_lft 0 dev ens3]/returns: executed successfully (corrective).

We now have

inet 172.16.7.146/32 scope global ens3

on the new VM.
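For reference, that line comes from inspecting the interface addresses; a sketch of the check:

ip -4 addr show dev ens3   # should list both the primary address and the 172.16.7.146/32 service IP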

Mentioned in SAL (#wikimedia-cloud) [2023-01-31T22:39:26Z] <mutante> remove role::gitlab from gitlab-prod-1001. to be replaced with gitlab-prod-1002. T318521

  • Removed Hiera keys/values from the "1002" VM.

gitlab-ssh and gitlab-https-public now bind to 172.16.7.146:

-&D_SERVICE(tcp, 22, (185.15.56.117));
+&D_SERVICE(tcp, 22, (172.16.7.146));

Those Puppet errors are gone... until it gets to "Exec[Reconfigure GitLab]", which still fails for some other reason(s).

Change 885445 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] gitlab/cloud: set sshd listen address for gitlab-prod-1002

https://gerrit.wikimedia.org/r/885445

Change 885445 merged by Dzahn:

[operations/puppet@production] gitlab/cloud: set sshd listen address for gitlab-prod-1002

https://gerrit.wikimedia.org/r/885445

Puppet fails on the instance gitlab-prod-1002; from today's email:

This is fixed now. The Puppet run does not fail any longer:

Notice: Applied catalog in 25.11 seconds

I removed the gitlab-bullseye.devtools.wmcloud.org DNS entry, which was pointing to 185.15.56.117 (and was confusing me quite a bit).

I re-ran the certbot command using the old/common gitlab.devtools.wmcloud.org address (which points to 185.15.56.79).

sudo certbot certonly --standalone --preferred-challenges http -d gitlab.devtools.wmcloud.org

Certbot is happy now, and so is nginx.
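If useful, the renewal path can be tested without touching the production rate limits (standard certbot behaviour, not specific to this setup):

sudo certbot renew --dry-run   # exercises renewal against the Let's Encrypt staging endpoint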

I removed the Hiera configuration on the new GitLab test instance that made it use the gitlab-prod.wmflabs.org webproxy address. Instead, the instance is reachable at the old/common address gitlab.devtools.wmcloud.org without a webproxy.

The new instance is working now under gitlab.devtools.wmcloud.org. Tomorrow I'll transfer a backup of the old test instance to the new one.

The GitLab test instance under gitlab.devtools.wmcloud.org works again, and the data from the buster instance was transferred to the new bullseye instance.

I used roughly the following steps:

  • I created a data and config backup on the old instance (see the sketch after this list)
  • I also moved the config backup to /mnt/gitlab-backup (which is a separately mounted volume)
  • I shut down gitlab-prod-1001
  • I removed the volume gitlab-prod-1001-gitlab-backup from gitlab-prod-1001 and added it to gitlab-prod-1002
  • I created an fstab entry for the volume and fixed the mount point to /srv/gitlab-backup (to align it with production)
  • I ran the gitlab-restore.sh script on gitlab-prod-1002
  • I restarted gitlab-runner.service on gitlab-runner-1002 and gitlab-runner-1003
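For context, the upstream GitLab Omnibus equivalents of the data and config backup steps are roughly the following (a sketch; the actual runs here used the WMF wrapper scripts such as gitlab-restore.sh):

sudo gitlab-backup create    # application data backup (repositories, database, uploads)
sudo gitlab-ctl backup-etc   # configuration backup, written under /etc/gitlab/config_backup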

I'm cleaning up some hiera keys, docs and buster dependencies before closing this task. I'll try to update the configuration steps mentioned in https://wikitech.wikimedia.org/wiki/GitLab/Test_Instance. But the networking setup in WMCS was quite confusing again so I guess some steps will be missing.

Change 888193 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gitlab: use /srv/gitlab-backup in WMCS

https://gerrit.wikimedia.org/r/888193

Change 888194 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] aptrepo: remove gitlab package for buster

https://gerrit.wikimedia.org/r/888194

Change 888194 merged by Jelto:

[operations/puppet@production] aptrepo: remove gitlab package for buster

https://gerrit.wikimedia.org/r/888194

Change 888193 merged by Jelto:

[operations/puppet@production] gitlab: use /srv/gitlab-backup in WMCS

https://gerrit.wikimedia.org/r/888193

I've done some cleanup in the puppet code (both hiera and removing buster dependencies) and I updated https://wikitech.wikimedia.org/wiki/GitLab/Test_Instance as much as possible. I also shut down gitlab-prod-1001.devtools.eqiad1.wikimedia.cloud.

Next week I'll delete the old buster instance gitlab-prod-1001.devtools.eqiad1.wikimedia.cloud and close this task.

I deleted gitlab-prod-1001.devtools.eqiad1.wikimedia.cloud.

I'm closing this task; the migration to bullseye is complete.