
Eqiad: 2 VM request for GitLab
Closed, ResolvedPublic

Description

Cloud VPS Project Tested: gitlab-test
Site/Location:EQIAD
Number of systems: 2
Service: GitLab
Networking Requirements: accessible over http
Processor Requirements: 8
Memory: 12GB
Disks: 100GB (SSD preferred)
Other Requirements:

Event Timeline


Machines that are directly exposed to the Internet and managed manually are more of a challenge to security practices than internal machines, I would think.

regarding the request for 24GB of RAM:

This would make these the VMs with the most memory globally... more than _anything_ else. To give you an idea: all existing ganeti VMs are between 1 and 8 GB of RAM, the only exceptions being a puppetdb and a webperf machine with especially high needs, which have 16GB. Nothing has more than that.

Is it really needed? Do we know where all this RAM would be used?

Why does a testing service need to be in production? Stuff in production realm should have production-level stability, and not be used for testing. Can you use a cloud-provided VM instead?

I second this. Is this going to be a testing service? If so why is the request not for cloud VPS space?

It would make more sense, and Cloud VPS has much more relaxed requirements for running stuff: for instance, we won't need Debian packages and/or proper puppetization to run services in Cloud VPS, while that's a requirement in production.

This is not a testing service. We have the gitlab-test project in labs. This is our initial small production GitLab that folks can use.

Initially the contractors spec'd 12GB of RAM per instance, which is a bump up from the reference architecture's 8GB. I bumped up the specs to match the typical CI runners (albeit with larger disks). I'll update the request to 12GB if that's a more reasonable size; it's definitely enough to start.

Do these machines just have to talk to each other (on what port/protocol, btw?), or does it _really_ require that they are in wikimedia.org, directly exposed to the Internet and without any caching layer in front of them? The latter is really only done for special cases anymore, like monitoring services and Gerrit, because we don't want them to be dependent on a working caching layer.

I'll get specific ports for these.

In terms of caching I think we want a similar setup to gerrit as this is eventually meant to replace gerrit.

This is not a testing service. We have the gitlab-test project in labs. This is our initial small production GitLab that folks can use.
[...]

I assumed it's sort-of testing service, considering T274953 implies it will be manually maintained by people with full root rather than provisioned by puppet. As @Joe says, if it's going to be in production realm, it needs to be fully puppetized.

This will not be manually maintained.

Might want to update T274953 then...

The contractors will work outside of our puppet-based setup standards for the next 6 months, while we are hiring new SREs to take over and rework the implementation as an immediate follow-up step.

Risks: non-WMF employees with access to the production network, non-reproducible setup

Both of those sound very much like "manually maintained"

Made a bit of an update there, but the text there did/does match my understanding. We're working with contractors on the initial setup, and using puppet to set up the boxen is the plan; however, our operations/puppet repository has a not-insignificant learning curve, so the plan is to puppetize the setup outside of our operations/puppet repository.

Hi! I'm happy that a bit more clarity has been reached in the meanwhile - the way things will be done here is quite outside of our normal operating standards, so at the very least we will need to discuss this request at the SRE meeting on Monday.

Given that I foresee the questions that will be asked, it would be useful to get answers to the following in advance:

  • Is there a specific reason why this setup can't be done in Cloud VPS, where rules are more relaxed? We've got quite a few projects that ran that way until they were ready to productionize, including real usage by real users; think of the "spaces" project, for instance.
  • What are the request flows in/out of gitlab we expect? An architectural diagram of how this has been designed might help our understanding here a lot.

Made a bit of an update there, but the text there did/does match my understanding. We're working with contractors on the initial setup, and using puppet to set up the boxen is the plan; however, our operations/puppet repository has a not-insignificant learning curve, so the plan is to puppetize the setup outside of our operations/puppet repository.

I have serious reservations about this plan. Let me explain them.

Those VMs will still be running a puppet agent that will enforce changes that go to the base system via operations/puppet, and that's really non-negotiable for security and automation reasons.

So if you want to run a separate puppet codebase, you will need to do so using "puppet apply", with all the potential consequences there: two puppet runs from different codebases might very well conflict with each other.

Our puppet codebase has a steep learning curve, but it is not different than any other large puppet installation. Moreover, writing new modules from scratch does not require significant prior knowledge of our code. Depending on our +2s may slow things down, but this is the price of security, consistency, and code tidiness. If that's your concern, let’s work together towards finding the best solution for it. I'm uneasy with the idea that a completely unreviewed piece of infrastructure that exposes a public service would reside in our production network.

I have to admit I'm a bit perplexed: puppetizing a gitlab installation is painful, given that the standard installation uses chef (a puppet competitor!) for setup/updates, so I'm not really sure what the best way to do it is. I just feel that if we go down the described path, when these VMs are handed over to SRE for "proper" puppetization of an already-running setup, it will require several times the effort of working together from the get-go. At the very least, having a channel of communication between the engineers working on the project and us would be beneficial.

Please note that if this is intended to be a pilot installation, whose data will be ephemeral and eventually dismissed, my recommendation would still be to use Cloud VPS for it. That would nullify most of the concerns expressed above.

will need access to one another

That's the default for all our infra.

Networking Requirements: external IP

For that, a traffic flow and overall network diagram would be useful.
Some questions:
  • Will both servers have the same role? If so, how will the users be balanced?
  • Will you need a virtual IP shared between the two?
Unless there are technical limitations, the preferred way to expose services to the public is to have them on private IPs and front them with LVS. This is to protect the services and infra by leveraging our standard tools and workflows.

In addition to considerations that have been raised already, I would like to add some comments about security. Our puppet repository, and services that are set up and reviewed there, follow our internal security standards, ranging from permissions to networking access and software versions. Having said that, if we were to go down this road, we have to consider where WMF's responsibility and accountability would stand in case of a security breach originating from those machines. Given that privacy and security are things we take seriously, I think we should sit down, discuss, and address *every* concern.

Hi, thanks for this request.

I see a lot of people have commented already, but I have a number of questions as well. Some technical ones are inline, but I have a more general comment too:

I may be misunderstanding this completely, but the way it is phrased makes me think that, despite this not being of a testing nature, Cloud VPS is still a better place for this request. There is precedent anyway; see discourse, that is spaces.wmflabs.org. It will help avoid the learning-curve issues you refer to regarding operations/puppet, and create something that will be easier to productionize later on. It will also allow the contractors to be more flexible with regard to resource allocation, (re)creation/teardown of VMs, testing, etc.

This is not a testing service. We have the gitlab-test project in labs. This is our initial small production GitLab that folks can use.

If this is a production setup, I am assuming it will need a disaster recovery setup as well, and I think that should happen from day #1. Experience has shown that adding disaster recovery after the fact (see gerrit/phabricator) eventually costs way more in time and resources than doing it with that in mind from the start of the project.

Initially the contractors spec'd 12GB of RAM per instance, which is a bump up from the reference architecture's 8GB. I bumped up the specs to match the typical CI runners (albeit with larger disks). I'll update the request to 12GB if that's a more reasonable size; it's definitely enough to start.

Any idea why the contractors have increased the reference requirements by 50%? It feels like a lot of extra resources diverging from upstream. The reason I am asking is that:

  • 24GB is almost half a ganeti box, that severely impacts our planning
  • migrating 24GB of RAM between nodes (that's how we achieve high availability of VMs) is extremely difficult, consuming tons of bandwidth while having a high probability of getting stuck in an endless loop due to memory dirtying
  • we have no precedent for this. The most we have is 8GB and that’s already having issues.

Since memory is easy to add to a VM, may I suggest instead going with the upstream reference of 8GB and increasing it if needed according to usage?
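To put rough numbers on the migration concern above, here is a back-of-the-envelope sketch. The 10 Gb/s link speed is a hypothetical figure, not a measured one, and real live migration adds pre-copy iterations on top of this floor:

```python
# Rough arithmetic for why live-migrating a 24 GB VM is painful, assuming
# a hypothetical 10 Gb/s migration link (~1.25 GB/s of usable bandwidth).
def migration_floor_seconds(ram_gb: float, link_gbps: float = 10.0) -> float:
    """Lower bound: one full pass of RAM over the link, ignoring dirty pages."""
    bytes_total = ram_gb * 2**30
    bytes_per_sec = link_gbps / 8 * 10**9
    return bytes_total / bytes_per_sec

# One clean pass of 24 GB already takes ~21s; every pre-copy iteration over
# pages the guest dirties in that window adds more, which is how a migration
# can end up "stuck in an endless loop due to memory dirtying".
print(round(migration_floor_seconds(24)))  # 21
print(round(migration_floor_seconds(8)))   # 7
```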

Made a bit of an update there, but the text there did/does match my understanding. We're working with contractors on the initial setup, and using puppet to set up the boxen is the plan; however, our operations/puppet repository has a not-insignificant learning curve, so the plan is to puppetize the setup outside of our operations/puppet repository.

I've commented on T274953 but I'll summarize here for convenience. Puppetizing outside the operations/puppet repo is bound to be full of problems.

Whoa, catching up on scrollback overnight. My question is: is this the first anyone in SRE has heard about any of this?

I guess it's fair to say there's been some miscommunication and a lack of clarity about the specifics, with many in the team surprised by them - something we should definitely work on in the future. :) After meeting with RelEng and Wolfgang I've sent an email to the team with context and some more details.

Updated the task description for the network to be merely "accessible over http" instead of "external IP", after discussing with @mark.

The original thinking was that this is a Gerrit replacement; Gerrit isn't behind the caching layers, so this shouldn't be either. The specific reasoning for Gerrit is that operations/puppet needs to be available when a large amount of infra is not. As ops/puppet isn't going to be the first repo we're moving to GitLab, there's no specific requirement for direct access to these VM instances: removed from the request.

Quick reply to cover open questions heard through other avenues. If there are additional questions that must be answered before setting up these VMs please ask here and I'll do my best to answer.

  1. What to do with VLANs, LVS, TLS, caching?

    My understanding is that the most common setup is that the service is on an internal VLAN (no direct internet access, but able to access the whole production network), exposed to the Internet via LVS, with TLS/caching via varnish/ats. While it's far from my expertise, I believe the standard setup should be fine.
  2. What are the traffic flows?

    The answer to this may help to better inform the answer to question 1.

    Traffic flows:
    1. UI + Git over HTTPS to GitLab server
    2. SSH to GitLab server for contractors/folks working on administration
    3. SSH to GitLab server for git (this will need input from contractors for how to run this -- currently controlled via gitlab-shell that is used via an ssh authorized_keys file -- can be run on a different openssh from the system openssh if needs be. I think keeping in mind that this is a future need should be sufficient).
    4. Likely: HTTPS from GitLab server to webhook endpoints, for bot integrations and similar.
  3. Where will tests be run?

    In the very near term tests will still be run on WMCS. Network segmentation of runners from the GitLab instance is possible (see runner docs). Runners need HTTPS access to the instance to POST to the GitLab API using their tokens (see execution flow diagram), as well as to clone repositories and download artifacts.

    Following the migration of GitLab to k8s our audacious goal would be to migrate test runners off of the WMCS infrastructure and onto k8s.
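On flow 3 above (git over SSH): gitlab-shell typically manages this by writing one forced-command entry per uploaded key into the git user's authorized_keys file, so sshd hands every connection to gitlab-shell instead of a login shell. Roughly like the following fragment, where the key id, install path, and key material are illustrative:

```
command="/opt/gitlab/embedded/service/gitlab-shell/bin/gitlab-shell key-42",no-port-forwarding,no-X11-forwarding,no-agent-forwarding,no-pty ssh-ed25519 AAAAC3NzaC1... user@example.org
```

This is also why the note above says it can run on a different openssh than the system one if need be: the mechanism is a plain authorized_keys file, not an sshd patch.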

Just to make sure, this is a request for 2 VMs, but BOTH in eqiad, so a 1001,1002 situation, NOT a 1001,2001 setup where one is in eqiad and one is in codfw, right? And you are accepting that there won't be anything in codfw to fail over to?

Ready to create Ganeti VM gitlab1001.eqiad.wmnet in the ganeti01.svc.eqiad.wmnet cluster on row A with 8 vCPUs, 12GB of RAM, 100GB of disk in the private network.

That is correct, 2 VMs in eqiad. Accepting and acknowledging the lack of fail-over for this iteration. Thank you!

ACK, the first VM has been created in eqiad and next is doing the OS install, in progress.

Change 666430 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] DHCP: add MAC address for gitlab1001 VM

https://gerrit.wikimedia.org/r/666430

Change 666430 merged by Dzahn:
[operations/puppet@production] DHCP: add MAC address for gitlab1001 VM

https://gerrit.wikimedia.org/r/666430

Change 666456 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] DHCP: add MAC address for gitlab1002

https://gerrit.wikimedia.org/r/666456

Change 666456 merged by Dzahn:
[operations/puppet@production] DHCP: add MAC address for gitlab1002

https://gerrit.wikimedia.org/r/666456

Change 666463 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] install_server: add partman recipe line for gitlab

https://gerrit.wikimedia.org/r/666463

Change 666463 merged by Dzahn:
[operations/puppet@production] install_server: add partman recipe line for gitlab

https://gerrit.wikimedia.org/r/666463

Change 666472 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] site: add gitlab VMs with placeholder role

https://gerrit.wikimedia.org/r/666472

Change 666472 merged by Dzahn:
[operations/puppet@production] site: add gitlab VMs with placeholder role

https://gerrit.wikimedia.org/r/666472

Change 666477 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] site: fix typo in regex for gitlab nodes

https://gerrit.wikimedia.org/r/666477

Change 666477 merged by Dzahn:
[operations/puppet@production] site: fix typo in regex for gitlab nodes

https://gerrit.wikimedia.org/r/666477

VMs have been created with 8 vCPUs, 12 GB RAM, 100 GB disk, as requested. Both in eqiad: gitlab1001, gitlab1002.

An empty puppet role has been applied, which adds the requested gitlab-roots admin group.

Filling that group with members will have to happen separately.

Dzahn claimed this task.

gitlab1001.eqiad.wmnet has address 10.64.0.93
gitlab1001.eqiad.wmnet has IPv6 address 2620:0:861:101:10:64:0:93

gitlab1002.eqiad.wmnet has address 10.64.0.111
gitlab1002.eqiad.wmnet has IPv6 address 2620:0:861:101:10:64:0:111

Dzahn removed Dzahn as the assignee of this task. Feb 23 2021, 10:45 PM
Dzahn removed a project: Patch-For-Review.

Would be better to have a (Wiki/Google) doc to discuss those details, but here we go in the meantime.

My understanding is that the most common setup is that the service is on an internal VLAN (no direct internet access, but able to access the whole production network), exposed to the Internet via LVS, with TLS/caching via varnish/ats. While it's far from my expertise, I believe the standard setup should be fine.

That's correct, with the HTTP proxies for outbound Internet access. So if external access is needed, please make sure GitLab supports it.

Traffic flows:
   1. UI + Git over HTTPS to GitLab server
   2. SSH to GitLab server for contractors/folks working on administration
   3. SSH to GitLab server for git (this will need input from contractors for how to run this -- currently controlled via gitlab-shell that is used via an ssh authorized_keys file -- can be run on a different openssh from the system openssh if needs be. I think keeping in mind that this is a future need should be sufficient).
   4. Likely: HTTPS from GitLab server to [[https://docs.gitlab.com/ce/user/project/integrations/webhooks.html | webhook endpoints]], for bot integrations and similar.

1/2/3 are fine from a network point of view using LVS.
For (4), see my comment about HTTP proxies if those endpoints are outside the prod infra (WMCS is outside prod, for example).

In the very near term tests will still be run on WMCS. Network segmentation of runners from the GitLab instance is possible ([[https://docs.gitlab.com/runner/security/#network-segmentation | see runner docs]]). Runners need HTTPS access to the instance to POST to the GitLab API using their tokens (see [[https://docs.gitlab.com/runner/#runner-execution-flow | execution flow diagram]]), as well as to clone repositories and download artifacts.

FYI for prod<->WMCS traffic flows (still a draft): https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/EnhancementProposals/Production_Cloud_services_relationship
But in short, having the runners reach GitLab on its LVS endpoint is fine.
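The token-authenticated poll a runner makes against that LVS endpoint looks roughly like this. The host and token are placeholders, the `/api/v4/jobs/request` path is the job-request endpoint from the upstream runner docs, and the request is only constructed here, never sent:

```python
import json
import urllib.request

# Build (but do not send) the POST a runner issues to fetch a job:
# the runner authenticates by including its registration token in the body.
def build_job_request(base_url: str, runner_token: str) -> urllib.request.Request:
    body = json.dumps({"token": runner_token}).encode()
    return urllib.request.Request(
        url=f"{base_url}/api/v4/jobs/request",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_job_request("https://gitlab.example.org", "RUNNER_TOKEN")
print(req.full_url)      # https://gitlab.example.org/api/v4/jobs/request
print(req.get_method())  # POST
```

This is the flow the network segmentation note relies on: only outbound HTTPS from the runner to GitLab is required, never a connection from GitLab back into the runner network.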

Following the migration of GitLab to k8s our audacious goal would be to migrate test runners off of the WMCS infrastructure and onto k8s.

At that point we should also look at separating GitLab (and ci/cd in general) into its own vlan. Might be worth doing during this test phase (but would require re-imaging the servers), or when the k8s/final plans become more concrete.

debt edited projects, added GitLab; removed GitLab (Initialization).

Please recreate the 2 VMs in the VLAN that allows for direct external IP addresses.

After speaking to Brandon, it is clear that we should treat gitlab the same way as gerrit and keep it independent from the caching layer, as it might be needed to fix the caching layer.

Mentioned in SAL (#wikimedia-operations) [2021-03-03T16:26:09Z] <mutante> deleting gitlab VMs - we have to start over and decom old VMs, then create new VMs with public IPs (T274459)

Change 668056 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] Revert "site: add gitlab VMs with placeholder role"

https://gerrit.wikimedia.org/r/668056

cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: gitlab1002.eqiad.wmnet

  • gitlab1002.eqiad.wmnet (PASS)
    • Downtimed host on Icinga
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster ganeti01.svc.eqiad.wmnet to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster ganeti01.svc.eqiad.wmnet to Netbox
  • COMMON_STEPS (FAIL)
    • Failed to run the sre.dns.netbox cookbook: Cumin execution failed (exit_code=2)

ERROR: some step on some host failed, check the bolded items above

cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: gitlab1001.eqiad.wmnet

  • gitlab1001.eqiad.wmnet (PASS)
    • Downtimed host on Icinga
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster ganeti01.svc.eqiad.wmnet to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster ganeti01.svc.eqiad.wmnet to Netbox
  • COMMON_STEPS (FAIL)
    • Failed to run the sre.dns.netbox cookbook: Cumin execution failed (exit_code=2)

ERROR: some step on some host failed, check the bolded items above

Change 668056 merged by Dzahn:
[operations/puppet@production] Revert "site: add gitlab VMs with placeholder role"

https://gerrit.wikimedia.org/r/668056

dzahn@cumin1001:~$ sudo cookbook sre.ganeti.makevm eqiad_A --vcpus 8 --memory 12 --disk 100 --network public gitlab1001.wikimedia.org
Ready to create Ganeti VM gitlab1001.wikimedia.org in the ganeti01.svc.eqiad.wmnet cluster on row A with 8 vCPUs, 12GB of RAM, 100GB of disk in the public network.
>>> Is this correct?
Type "go" to proceed or "abort" to interrupt the execution
> go
START - Cookbook sre.ganeti.makevm for new host gitlab1001.wikimedia.org
Allocated IPv4 208.80.154.6/26
Set DNS name of IP 208.80.154.6/26 to gitlab1001.wikimedia.org
Allocated IPv6 2620:0:861:1:208:80:154:6/64 with DNS name gitlab1001.wikimedia.org
Generating the DNS records from Netbox data. It will take a couple of minutes.
Dzahn raised the priority of this task from Medium to High. Mar 3 2021, 7:41 PM
Dzahn moved this task from Completed to Initialization on the GitLab board.
Dzahn edited projects, added GitLab (Initialization); removed GitLab.

I deleted the existing VMs, reverted firewall changes, DHCP, site.pp entries. Now recreating gitlab1001.wikimedia.org with public IP.

What about the second VM though? Should both really be in the public network? From what I heard, the second one is supposed to run a database. That sounds more like it should stay private, as long as both hosts can talk to each other?

Change 668183 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] site: add gitlab1001.wikimedia.org

https://gerrit.wikimedia.org/r/668183

Change 668198 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] DHCP: add gitlab1001.wikimedia.org

https://gerrit.wikimedia.org/r/668198

Change 668198 merged by Dzahn:
[operations/puppet@production] DHCP: add gitlab1001.wikimedia.org

https://gerrit.wikimedia.org/r/668198

Change 668183 merged by Dzahn:
[operations/puppet@production] site: add gitlab1001.wikimedia.org

https://gerrit.wikimedia.org/r/668183

Mentioned in SAL (#wikimedia-operations) [2021-03-03T21:58:32Z] <mutante> puppetmaster1001 - signing puppet cert for gitlab1001.wikmedia.org (T274459)

@wkandek @thcipriani A new VM gitlab1001.wikimedia.org in the public network has been created while gitlab1001.eqiad.wmnet has been deleted. Still using the same puppet role name so shell access is moving automatically with it.

Is the second VM needed already at this point, and should it be in the private or public network? If we are copying the gerrit setup, there is only a single machine per DC so far (and Gerrit used the prod mysql db until it eventually no longer needed it).

Dzahn lowered the priority of this task from High to Medium. Mar 5 2021, 9:36 PM
Dzahn changed the task status from Open to Stalled. Mar 8 2021, 6:34 PM

Setting this to stalled because 1 VM has been created and whether the second one is still needed is TBD for now.

Do you want to keep this open? Or simply close and reopen if/once you want a second VM?

Let's close and reopen if a second server becomes necessary.