I ran the script in repos/releng/gitlab-trusted-runner/ manually:
gitlab1001 and gitlab2001 will be decommissioned soon in T307142, so on the GitLab side this should be resolved shortly.
Thu, Jun 23
Wed, Jun 22
Docker cache is cleaned every 24h on GitLab Runner nodes now, so failing jobs due to a full Docker volume should happen less frequently.
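A cleanup like this is typically a scheduled `docker system prune`. A minimal sketch of what such a job could look like; the schedule and exact flags here are assumptions, not the actual Runner-node configuration:

```
# Hypothetical crontab entry on a Runner node: once a day, remove
# stopped containers, unused networks, dangling images, and build
# cache older than 24 hours. --force skips the confirmation prompt.
0 4 * * * root /usr/bin/docker system prune --force --filter "until=24h"
```

The `--filter "until=24h"` keeps recently used layers so currently popular images are not re-pulled on every job.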
I can confirm the pipeline works again for me. Re-configuring the Cloud Runners was blocked because of a failing pipeline, but this works again and the Cloud Runners are using the new double-star config now.
Mon, Jun 20
Fri, Jun 10
Thu, Jun 9
Wed, Jun 8
We gathered some experience regarding failover when migrating GitLab to the new physical hosts in T307142.
puppet runs on the test instance gitlab-prod-1001 fail with
Jun 3 2022
After migrating to new hosts (T307142) we got a bacula alert about backups on gitlab1001 (the old production machine):
Jun 2 2022
Migration of production GitLab from gitlab1001 to gitlab1004 was successful. Downtime was around 65 minutes.
Jun 1 2022
May 31 2022
Checklist for the GitLab migration from gitlab1001 to gitlab1004:
Checklist for today's gitlab-replica migration from gitlab2001 to gitlab1003:
May 30 2022
Backup size decreased after cleanup of big projects. Thanks again to @brennen and @Dzahn for finding and coordinating this!
We are down from 50GB to 10GB for one backup. That also means disk pressure on the backup volume decreased a lot (see disk usage over time dashboard).
May 27 2022
May 23 2022
So it did succeed, good! Not sure about that internal API error though.
@Dzahn thanks for testing the partman config! I'm happy it worked first time!
May 20 2022
May 19 2022
I solved the installation/puppet issues with gitlab1003. The gitlab-ce package was installed and login using CAS/IDP worked. Synced backups for the backup-restore cycle were also present already.
May 17 2022
- Trusted Runner automation and access request
May 12 2022
Reopening, puma still fails to stop:
May 11 2022
Thanks for opening the task!
May 10 2022
@thcipriani I added some more open topics to the description. Can you take a look? I would like to know what is needed from your perspective until Cloud Runners can be available instance-wide.
After yesterday's incident, mw2412 got depooled again to restore the state before the incident (see SAL). I'm going to adjust this and pool mw2412 again. This host is ready for production, similar to the other hosts in mw241[2-9].
May 9 2022
mw241[2-9] were pooled in an incident this morning (accidental depool and repool of the codfw datacenter). I ran a scap pull on all machines to make sure they are up to date.
May 6 2022
I added more restrictive CPU and memory limits to the Cloud Runner configuration (0.1 CPU and 200Mi memory). I also set the timeout for jobs to 300s, which is the minimum.
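For the Kubernetes executor, limits like these usually live in the Runner's `config.toml` (or the equivalent Helm chart values). A sketch using the values above; the runner name is a placeholder and this is not the actual gitlab-cloud-runner config:

```toml
# Hypothetical GitLab Runner config.toml fragment (Kubernetes executor).
[[runners]]
  name     = "cloud-runner"       # placeholder name
  executor = "kubernetes"
  [runners.kubernetes]
    cpu_limit    = "0.1"          # CPU limit applied to each job pod
    memory_limit = "200Mi"        # memory limit applied to each job pod
```

The job timeout itself is set on the runner/project in GitLab rather than in this file.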
May 4 2022
That's related to T295481.
https://gitlab.wikimedia.org/repos/releng/gitlab-cloud-runner has CI for provisioning the managed Kubernetes cluster and setting up the Kubernetes Runner now. That's mostly done using Terraform and Helm. So we have working Cloud Runners with autoscaling (min 1 and max 2 nodes).
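On a managed provider, an autoscaling node pool like that can be declared in a few lines of Terraform. A hedged sketch assuming DigitalOcean's provider; the names, region, version, and node size are illustrative, not the repo's actual values:

```hcl
# Hypothetical Terraform sketch: managed Kubernetes cluster with an
# autoscaling node pool (min 1, max 2 nodes). All names/sizes are
# placeholders, not the real gitlab-cloud-runner configuration.
resource "digitalocean_kubernetes_cluster" "runner" {
  name    = "gitlab-cloud-runner"
  region  = "ams3"
  version = "1.22.8-do.1"

  node_pool {
    name       = "runner-pool"
    size       = "s-2vcpu-4gb"
    auto_scale = true
    min_nodes  = 1
    max_nodes  = 2
  }
}
```

The GitLab Runner itself would then be installed into this cluster via its Helm chart.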
May 3 2022
I would suggest treating the gitlab-runner hosts a little differently. For the Runner hosts we can basically apply the puppet role, add data to Hiera, and remove the old Ganeti VMs.
We have a k8s cluster on DigitalOcean that we're using to prove the viability of the ^ model. We talked it over with ServiceOps and WMCS and that's a good path for the time being if everything seems to work correctly. In the future, we'll continually evaluate whether a third-party cloud is the right place to run this.
May 2 2022
Apr 28 2022
Apr 25 2022
Apr 19 2022
This has been implemented in https://gerrit.wikimedia.org/r/732093, I'm closing this task.
We discussed in the last ITC meeting that a dedicated GitLab update and maintenance window is not needed now. The last downtimes for updates and maintenance lasted between 2 and 5 minutes and were announced some hours ahead. With the current usage of GitLab we agreed that this is not an issue. Also, a fixed window would slow down progress on infrastructure tasks around GitLab because we would have to wait for the next window.
Apr 8 2022
I have access now, thanks a lot!
I like the idea of putting the bullseye runner runner-1020 into the gitlab-runners project. That reduces overhead around the puppet and hiera configuration.
Apr 7 2022
Thanks for the quick help! However, I still have problems accessing some tasks. For example, I cannot access T304938, which is marked as security.
Apr 6 2022
Backup is present in bacula for the new folder structure:
Apr 5 2022
Apr 4 2022
Mar 30 2022
@Arnoldokoth and I updated production instance gitlab1001 and gitlab-runners successfully.
Mar 29 2022
@Arnoldokoth and I updated the test instance gitlab-prod-1001.devtools.eqiad1.wikimedia.cloud and the replica gitlab2001.wikimedia.org to gitlab-ce 14.9.1-ce.0.
Mar 28 2022
This will happen tomorrow/Tuesday due to scheduling conflicts.
Mar 25 2022
@Arnoldokoth and I will do the upgrade of GitLab + Runners on Monday after 4pm UTC.
Mar 24 2022
Mar 23 2022
I mirrored wmf-sre-laptop to GitLab and created a very basic proof-of-concept CI pipeline to build the Debian package on Trusted Runners. The current implementation has limitations and is not complete. I created T304491 to further discuss the whole topic of Debian package builds on GitLab CI, as this is a bit out of scope for this task.
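A proof of concept like this is usually a single job running the standard Debian build tooling. A hedged sketch of what such a `.gitlab-ci.yml` could look like; the image, the runner tag, and the job layout are assumptions, not the actual wmf-sre-laptop pipeline:

```yaml
# Hypothetical .gitlab-ci.yml sketch for building a Debian package.
# The "trusted" tag is an assumed label routing the job to Trusted Runners.
build-deb:
  image: debian:bullseye
  tags:
    - trusted
  script:
    - apt-get update
    - apt-get install -y build-essential devscripts debhelper
    - dpkg-buildpackage -us -uc -b   # unsigned binary-only build
    - mv ../*.deb .                  # artifacts must be inside the project dir
  artifacts:
    paths:
      - "*.deb"
```

A real pipeline would additionally install the package's own build dependencies (e.g. via `mk-build-deps`) before calling `dpkg-buildpackage`.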
Mar 15 2022
Mar 9 2022
Mar 3 2022
I created a dedicated task to automate the test instance creation: T302976
Mar 2 2022
With the help of @Majavah the correct configuration of private and public/floating IP was found. https and cloning over SSH works now. Thanks again!
The keepalived VIP configuration was not clear to me from looking at the Horizon interface.
Thanks a lot for setting up the additional port! I can confirm that the port is present in Horizon Interface.
Mar 1 2022
SSH access to the test instance is not working because of different networking behavior on WMCS/VPS. The public floating IP ("service IP") is NATed to the VM, so we cannot bind to this address directly.
I requested a second networking port in T302803 and hope we can map/NAT the floating IP to this second port to replicate the production configuration (with NGINX and git SSH daemon listening on a different address).
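The production split described above amounts to binding NGINX and the git SSH daemon to different local addresses. A rough sketch of the intended end state; the IPs are placeholders and this is not the current configuration:

```
# Hypothetical sketch of the two-address split (IPs are placeholders).

# /etc/gitlab/gitlab.rb -- omnibus NGINX bound to one local address:
nginx['listen_addresses'] = ['10.0.0.10']

# sshd config for the git SSH daemon, bound to the second address
# (the one the floating IP would be NATed to):
ListenAddress 10.0.0.11
Port 22
```

With that in place, HTTPS and git-over-SSH traffic arrive on separate addresses, mirroring the production setup.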