Fri, Sep 17
My current plan for cluster-wise migration and deploying services with helm3 is:
Wed, Sep 15
The warning was removed in the GitLab upgrade to 14.x (https://phabricator.wikimedia.org/T289802). The grep was removed in https://gerrit.wikimedia.org/r/719930, so I'm closing this issue.
Feel free to reopen it.
Tue, Sep 14
Thanks for looking this up!
@Arnoldokoth some thoughts on the wrong settings on the replica after restore:
Wed, Sep 8
Some additional RBAC requirements:
@dancy Thanks for finding this issue!
I updated gitlab-ce to 14.2.3-ce.0 on apt1001.
I updated gitlab-runner to 14.2.0 and gitlab-ce to 14.1.5-ce.0 on apt1001.
@Jelto I'll coordinate with you tomorrow, but if you want to go ahead with the upgrade on gitlab2001 before I'm online, feel free.
Mon, Sep 6
I started to collect some thoughts for the long-term GitLab Runner setups here: https://wikitech.wikimedia.org/wiki/GitLab/Gitlab_Runner#Future_Gitlab_Runner_setup_(T286958)
Fri, Sep 3
Tue, Aug 31
Mon, Aug 30
Sat, Aug 28
I tested the behavior of gitlab2001 during a rolling restart of puma workers:
Thu, Aug 26
There is already a ClusterRole named deploy that aggregates the view and pods/portForward permissions. So I would prefer using the names <service-name> and <service-name>-rw for the two users. If we use deploy in the name of the new user, it may be confused with the read-only user and the deploy ClusterRole.
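To make the naming concrete, here is a minimal RBAC sketch for a hypothetical service called example-service. The read-only user reuses the existing deploy ClusterRole; the ClusterRole bound to the -rw user (deploy-admin below) is only a placeholder for whichever role ends up granting write access:

```
# Hypothetical sketch of the proposed user naming for a service "example-service".
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: example-service
  namespace: example-service
subjects:
  - kind: User
    name: example-service          # read-only user, named after the service
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: deploy                     # existing ClusterRole (view + pods/portForward)
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: example-service-rw
  namespace: example-service
subjects:
  - kind: User
    name: example-service-rw       # read-write user
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: deploy-admin               # placeholder, not an existing ClusterRole
  apiGroup: rbac.authorization.k8s.io
```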
And some more input regarding RBAC and the replacement of Tiller service account:
Wed, Aug 25
Tue, Aug 24
The helm binary used by helmfile can be set with the --helm-binary command-line option or by setting helmBinary in helmfile.yaml.
It can be set globally (as in admin-ng), but it is also possible to change the helm binary depending on the environment.
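As a rough illustration (release name, chart, and environment layout are placeholders, not the actual deployment-charts layout), a helmfile.yaml using the global setting could look like this; passing --helm-binary on the command line would override it per invocation:

```
# Minimal helmfile.yaml sketch; "example-service" and the chart reference are placeholders.
# helmBinary applies to every release in this state file;
# `helmfile --helm-binary helm3 ...` overrides it on the command line.
helmBinary: helm3

environments:
  staging: {}
  production: {}

releases:
  - name: example-service
    chart: wmf-stable/example-service
    namespace: example-service
```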
Mon, Aug 23
gitlab1001 just had another rolling restart of puma workers:
I created a dedicated task for the reduced availability for the puma worker/exporter: https://phabricator.wikimedia.org/T289454
Sun, Aug 22
I like the idea of having a semi-formal update window for GitLab as well.
Aug 19 2021
I would like to either finish this task or add additional requirements. Currently we are collecting metrics for all GitLab components on gitlab1001 and gitlab2001. We have Grafana dashboards and basic alerts in Icinga.
Aug 18 2021
From my side everything is done. Thanks everyone. I'm going to close this ticket. Feel free to reopen it in case I missed anything.
Aug 17 2021
Aug 13 2021
Aug 12 2021
The puma workers get killed either when they hit the memory limit (puma['per_worker_max_memory_mb'] = 1024, ~1 GB by default) or automatically after 12 hours. I can't see any of the workers hitting the limit in the past; they are getting killed because their uptime exceeds 12 hours, and during the restart the service availability is reduced. My assumption is that some parts of the puma webserver/GitLab have memory leaks, which would explain the growing memory usage and why GitLab uses tools like puma_worker_killer. I'll try to debug and troubleshoot this problem on gitlab2001 a bit further. I would like to have a smooth restart of the puma workers without reduced availability.
Aug 11 2021
Should be fixed with https://gerrit.wikimedia.org/r/710676. No new error messages from root@gitlab1001.
Aug 6 2021
Aug 5 2021
I restored the backup of gitlab1001 to gitlab2001 using the restore instructions from S&F.
I enhanced the guide and moved it to wikitech: https://wikitech.wikimedia.org/wiki/GitLab/Backup_and_Restore#Restore
Aug 4 2021
Aug 2 2021
The GitLab rails service has reduced availability roughly every 24 hours (plus some offset). See this Grafana dashboard.
Jul 30 2021
Basic Icinga alerts for the public https and SSH endpoints of GitLab are in place now: https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=gitlab.wikimedia.org
Jul 27 2021
The scrape configuration for GitLab is in place and Prometheus collects metrics.
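For context, a Prometheus scrape job for the GitLab exporters might look roughly like the sketch below; the job name, port, and targets are illustrative assumptions and not the actual production configuration:

```
# Hypothetical scrape job for the GitLab exporters; job name, port and targets
# are illustrative only and do not reflect the production configuration.
scrape_configs:
  - job_name: gitlab
    metrics_path: /metrics
    static_configs:
      - targets:
          - gitlab1001.wikimedia.org:9168   # exporter port is an assumption
          - gitlab2001.wikimedia.org:9168
```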
Jul 23 2021
I modified and ran the install script against gitlab2001 using Ansible's --check (dry-run) flag. It looks good; the two errors that appeared are due to check mode usage.
Jul 21 2021
During the refresh of old mw app servers in eqiad we noticed that the thumbor machines thumbor1001 and thumbor1002 are renamed/reimaged mw hosts. As mentioned in T280203 and T233196, these machines are end of life and have to be refreshed.
Jul 15 2021
Jul 14 2021
Jul 13 2021
Jul 2 2021
Jul 1 2021
Jun 30 2021
Jun 28 2021
I merged and rolled out the additional parameter for intra-service dependencies in systemd::timer::job and the change for a weekly rebuild of production-images.
Jun 25 2021
Jun 22 2021
Jun 16 2021
@Dzahn I think https://gerrit.wikimedia.org/r/c/operations/puppet/+/697850 is not merged and deployed, so the fileset for GitLab doesn't exist.
Jun 15 2021
So we have two options for a rebuild:
Jun 14 2021
Jun 11 2021
I checked the current exporter configuration.
Host/node metrics are available via node_exporter. The node_exporter bundled with GitLab is disabled because we are using the stand-alone node_exporter. The host metrics are already available in Grafana, so host/node metrics should be fine.
Jun 10 2021
Jun 9 2021
@Dzahn and I checked the backups on gitlab1001.