Evaluate a high available GitLab architecture
Open, Medium, Public

Description

In T307142 GitLab was migrated to new physical machines. Currently we have a total of four machines. One is used for production GitLab and two for GitLab replicas in both DCs. We don't need two replicas for the current setup; this is mostly due to historic reasons (T296713). So we have two spare machines which could be used to improve the availability and reliability of GitLab. This task should define the future architecture and the usage of the two GitLab machines gitlab1003 (old replica) and gitlab2003 (insetup).

GitLab Omnibus for up to 1000 users

GitLab has multiple reference architectures for different use cases. Currently we use the reference architecture for up to 1000 users. This is a single-host setup called omnibus. All needed services run on a single machine and are managed by the GitLab Debian package, so maintenance and backup are quite easy. The downside is that GitLab needs to be restarted for every update and is not available during that time. Furthermore, switching between instances in case of an incident (failover) is a manual process and takes roughly 1 to 2 hours.

Side note: we are using much bigger machines than the suggested 8 vCPU and 7.2 GB memory. We are running nodes with 20 physical cores and 128GB of RAM.
See also resource utilization of GitLab nodes: https://grafana.wikimedia.org/d/R_1IvBZnz/gitlab-omnibus-overview?orgId=1&refresh=1m&var-node=gitlab1004
The only resource we are utilizing a lot is disk space. Sometimes memory is saturated while doing backups too.

GitLab HA for 1000+ users

The next bigger architecture is designed for 2000 users, consists of around 7 different nodes and doesn't offer HA. The 3000 user reference architecture offers HA but also needs dozens of nodes/machines. For smaller installations some node counts can be scaled down, but that still leaves more than 10 nodes.

Using remaining hosts for different purpose

At the moment maintenance downtimes for the current setup are quite short and happen roughly once a month. Furthermore, we scaled the existing omnibus setup vertically quite a lot and I'd estimate it can serve more than 1000 users. It also seems that GitLab HA requires a lot of different nodes, which also need maintenance and configuration. So it should be discussed whether there is a need for high availability now and in the future.

If we come to the conclusion that a single host setup (omnibus) and a replica is enough for now, we can also think about using the remaining two hosts for different purposes than HA in the GitLab realm. For example:

  • as dedicated backup nodes as discussed in T274463#8118962
  • as a GitLab mirror T291322
  • as additional Trusted Runners

Event Timeline

LSobanski triaged this task as Medium priority. Nov 28 2022, 4:20 PM
LSobanski moved this task from Incoming to Backlog on the collaboration-services board.

GitLab needs regular maintenance windows at least once a month, if not more often, and they usually last around 15 minutes.

This is problematic for the migration of mission critical gitops workflows like operations/puppet and operations/deployment-charts, where we're used to about 5-10 minutes of planned downtime every quarter.

I would probably consider this a blocker for the migration of said repositories.
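To put the two figures above side by side, here is a rough sketch of the yearly planned downtime. It assumes exactly 12 GitLab windows per year (the comment says "at least once a month", so this is a lower bound) and four quarterly windows of 5 to 10 minutes for the current workflows. The function names are made up for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # ignoring leap years


def yearly_downtime(minutes_per_window: float, windows_per_year: int) -> float:
    """Planned downtime per year in minutes."""
    return minutes_per_window * windows_per_year


def availability_percent(downtime_minutes: float) -> float:
    """Share of the year the service is up, counting planned downtime only."""
    return 100.0 * (1 - downtime_minutes / MINUTES_PER_YEAR)


# GitLab: ~15 minutes, assumed exactly once a month (lower bound).
gitlab_minutes = yearly_downtime(15, 12)

# Current workflows: 5 to 10 minutes once per quarter.
current_low = yearly_downtime(5, 4)
current_high = yearly_downtime(10, 4)

if __name__ == "__main__":
    print(f"GitLab:  {gitlab_minutes:.0f} min/year")
    print(f"Current: {current_low:.0f} to {current_high:.0f} min/year")
    print(f"Ratio:   {gitlab_minutes / current_high:.1f}x to "
          f"{gitlab_minutes / current_low:.1f}x more planned downtime")
```

Under these assumptions the gap is mostly about how often the downtime happens, not how long a single window lasts.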

I agree, this also makes scheduling of maintenance on the GitLab side quite complicated.

As pointed out in the initial task description the smallest reference architecture which offers true HA is the 3000 user architecture. However this architecture consists of 25 machines (only for a production environment without staging). I'm quite sure this is outside of what our current team can support and what's reasonable for such a service.

GitLab was quite stable in the past and we have tested plans to fail over GitLab to another DC within a few hours. So I'd like to shift the discussion more towards zero downtime upgrades. The goal here is to have enough nodes to be able to perform upgrades without disrupting users. Having a highly available postgres, for example, would be out of scope for this.

What I found in the docs is that we need at least a load balancer and two puma nodes for the HTTP frontend. For git repository storage we would also need a Gitaly cluster. Unfortunately the docs state a minimum of six nodes for a Gitaly cluster. All of the backend services could be placed on a single non-HA backend node. I'm still not 100% sure what happens if one of the backend services like postgres or redis gets updated. But that might only happen for major releases, and zero downtime upgrades are only possible for minor releases (see requirements).
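A quick tally of the node counts mentioned above, as a sketch only. The role names and the `redundant` flag are my own labels, and the counts are just the minimums quoted from the docs in this comment, not a verified reference architecture.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Role:
    name: str
    nodes: int
    redundant: bool  # True if more than one node can serve this role


# Minimums as read from the comment above (assumption: one load balancer).
TOPOLOGY = [
    Role("load balancer", 1, False),
    Role("puma (HTTP frontend)", 2, True),
    Role("gitaly cluster", 6, True),
    Role("backend services (postgres, redis, ...)", 1, False),
]


def total_nodes(topology: list[Role]) -> int:
    """Sum of all nodes across roles."""
    return sum(role.nodes for role in topology)


def single_points_of_failure(topology: list[Role]) -> list[str]:
    """Roles that would still cause downtime when their node is updated."""
    return [role.name for role in topology if not role.redundant]


if __name__ == "__main__":
    print("nodes needed:", total_nodes(TOPOLOGY))
    print("non-redundant roles:", single_points_of_failure(TOPOLOGY))
```

With these numbers the tally comes to 10 nodes, compared with the four machines we have today and the 25 of the 3000 user architecture.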

Distributing GitLab components across multiple nodes would increase the complexity of the GitLab infrastructure significantly compared to the single-node (omnibus) setup. The automation, cookbooks and backups would also need major refactoring, and I have concerns about adding that much complexity. So an alternative could be to have fixed weekly deployment windows in which updates can be deployed. This would mean we might not need the deployment window every week, and security updates might be delayed to the next window. I'm thinking of maybe 15 minutes late Monday and 15 minutes early Friday, for example (as there are not a lot of deployments on Friday).
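To get a feeling for how long a security update could wait under such a schedule, here is a small sketch. The concrete start times (Monday 17:00 UTC and Friday 08:00 UTC) are placeholders I picked for "late Monday" and "early Friday"; nothing has been decided yet.

```python
from datetime import datetime, timedelta, timezone

# (weekday, hour in UTC); Monday is 0, Friday is 4. Hours are placeholders.
WINDOWS = [(0, 17), (4, 8)]
WINDOW_LENGTH = timedelta(minutes=15)


def next_window(now: datetime) -> datetime:
    """Return the start of the next maintenance window strictly after `now`."""
    candidates = []
    for weekday, hour in WINDOWS:
        days_ahead = (weekday - now.weekday()) % 7
        start = (now + timedelta(days=days_ahead)).replace(
            hour=hour, minute=0, second=0, microsecond=0
        )
        if start <= now:
            # This week's window has already started or passed.
            start += timedelta(days=7)
        candidates.append(start)
    return min(candidates)


def delay_until_window(now: datetime) -> timedelta:
    """How long an update released at `now` waits for the next window."""
    return next_window(now) - now


if __name__ == "__main__":
    released = datetime(2023, 5, 9, 12, 0, tzinfo=timezone.utc)  # a Tuesday
    print("next window:", next_window(released).isoformat())
    print("delay:", delay_until_window(released))
```

With two windows per week the longest wait is the gap from Friday morning to Monday evening, which is something to keep in mind when deciding which security updates can wait.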

I'm happy to discuss this more next week.

We discussed this topic last week. A short summary:

We agreed that a full HA setup is not within what our team can build and support currently. However, we want to explore the smallest possible setup that enables zero downtime deployments. The documentation is not very clear about that, so research and tests are needed here. This knowledge will also help us scale GitLab if the single-host omnibus setup is no longer enough. So I'll use this task to further explore this topic.

To unblock critical gitops workflows we agreed that a fixed maintenance window in the deployment calendar makes sense now. So I'll re-open T336470. We aim for a GitLab maintenance window outside of typical train windows, most probably on Fridays. We also have to define which kinds of security updates we install immediately and which can wait until the next maintenance window. But that's to be discussed in T336470.

One important note (thanks @eoghan for pointing this out): GitLab HA is marked as a premium feature here. The 2000 users reference architecture and zero downtime upgrades are marked as "free". So we have to double check which features are premium and which are free.