
Reduce usage of public IPv4 addresses on GitLab hosts
Open, Medium, Public

Description

GitLab hosts use two public IPv4 addresses per host. With two hosts in codfw and two in eqiad, 8 IPv4 addresses would be needed. Two of those hosts are still being set up, and the old VMs gitlab1001 and gitlab2001 will be decommissioned soon, which will release some IPs. But long term that is not a scalable solution. During the setup of the new hosts in T307142, some discussions about IPv4 address usage also happened with Infrastructure Foundations.

This task is to track and discuss measures to reduce the number of used IPv4 addresses. There are two IPv4 addresses configured per GitLab host currently:

Primary Interface

The primary interface is used to connect to the host over SSH for management purposes. This interface is configured with a public IP.

Proposal:

  • Make the primary interface private (for example by moving gitlab1001.wikimedia.org to gitlab1001.eqiad.wmnet). SSH access is still possible with the standard bastion/jumphost configuration. That would halve IPv4 usage.
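
For illustration, with the standard bastion setup SSH access would keep working via a jumphost; a minimal sketch (bastion hostname and username are only examples):

# one-off: jump through a bastion to reach the now-private host
ssh -J bast1003.wikimedia.org gitlab1001.eqiad.wmnet

# or set it up once in ~/.ssh/config
Host *.eqiad.wmnet
    User jdoe
    ProxyJump bast1003.wikimedia.org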

Second/service Interface

The second interface is used to serve http/https traffic for gitlab.wikimedia.org. Furthermore, a second SSH daemon listens on that address to keep it separate from the management SSH daemon. These services are used directly by end users and need some kind of public endpoint.
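
As a rough sketch of that separation (addresses are illustrative and the actual puppetization may differ), the two daemons are simply bound to different addresses, both on port 22:

# /etc/ssh/sshd_config (management daemon), bound to the primary interface address
ListenAddress 198.51.100.4

# second sshd instance for git, with its own config file, bound to the service address
ListenAddress 198.51.100.5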

Possible options:

  • We can think about load balancing http/https and ssh traffic for the GitLab hosts and using a private second address. Some research is needed to determine whether the existing load-balancing infrastructure can handle SSH as well.
  • GitLab replicas (non-production instances) could drop the public address while they are replicas. However, this would significantly decrease the usefulness of those replicas for tests and failovers.

Next steps:

I would like to experiment with private primary interfaces on the replicas. If that works fine, we can migrate production GitLab to a private primary interface as well.

Event Timeline

Jelto triaged this task as Medium priority. Jun 9 2022, 11:39 AM

moving gitlab1001.wikimedia.org to gitlab1001.eqiad.wmnet

This is possible but would require reaching out to dcops to physically connect it to a different network and then a complete reimage. Basically like an install from scratch.

Some research is needed to determine whether the existing load-balancing infrastructure can handle SSH as well.

This would be recreating the setup we have for git-ssh.wikimedia.org on Phabricator, which is just that: an SSH daemon behind LVS for the general public to push git to.
It is exactly what we are trying to get rid of in another place, and recreating it here brings all the complexity that comes with it: LVS/pybal alerts when rebooting, harder migration to other hosts when/if needed, etc.

First and foremost though, the reason why gitlab has all public IPs is because we were trying to emulate the gerrit setup. And gerrit has public IPs and is not behind LVS because we wanted it that way. We wanted to be able to still use Gerrit and merge changes even if the caching layer is down for some reason. For the same reason icinga has a public IP. Certain services were not supposed to rely on loadbalancers.

So... if this is supposed to replace Gerrit completely in the long run, then we would first have to talk about that and make a conscious decision across SRE that this is not a valid concern anymore.

Meanwhile I am wondering: are we _actually_ running out of IPs, or is this more of a theoretical problem? The same standards should apply to alert1001.wikimedia.org, cloudcontrol2001-dev.wikimedia.org, archiva1002 et al.

Thanks for opening this task!

Going that way would have multiple benefits:

  • reducing our public IP usage, which will become more and more needed as we grow and become unable to procure more public IPs, so it's better to look at it early in a project
  • reducing our dependency on the public vlan (related to the above point), which facilitates a per-rack subnets design (having a public subnet in all the racks would be wasteful)
  • better security (fewer hosts directly exposed to the Internet)
  • easier migration and scaling (the public IP can stay the same and only redirect to different backend nodes), not tied to per-vlan host IPs. That contradicts Daniel's point, but maybe we're looking at different aspects of it?
  • standardizing our infra by having a single public entry point to our services, which might be a bit more difficult at the initial setup, but then benefits from our existing tooling, monitoring, team, incident response, etc.

That's why it's important to weigh the pros and cons.

Your point about being able to merge a change if the LVS is down is indeed worth discussing, and it goes along with the question of how to make a change when Gerrit/GitLab is down (for example directly on the Puppetmaster) or how to tunnel around the LVS.

On the testing/implementation:
As Daniel mentioned, moving a host from the public vlan to a private one is best done with a re-image.

It's also not easily possible (or recommended) for a server to have both public and private IPs on its NICs. The way LVS works is that the host only has a private IP on its NIC, and the public (virtual) IP on its loopback.
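
As an illustration of that pattern (addresses are made up), an LVS-backed realserver ends up configured roughly like this:

# the NIC carries only the host's private address
ip addr add 10.64.32.20/22 dev eno1

# the public service VIP lives on the loopback, so the host accepts traffic for it
# without announcing the address on the wire
ip addr add 198.51.100.224/32 dev lo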

Another takeaway from the previous comment is that the LVS layer can handle SSH fine.

The same standards should apply to alert1001.wikimedia.org, cloudcontrol2001-dev.wikimedia.org, archiva1002 et al.

Indeed, we're progressively looking at all of them, especially when they're getting refreshed. There are cases where the public vlan is still the best option, which is not an issue, and it is valuable to document why.

First and foremost though, the reason why gitlab has all public IPs is because we were trying to emulate the gerrit setup. And gerrit has public IPs and is not behind LVS because we wanted it that way. We wanted to be able to still use Gerrit and merge changes even if the caching layer is down for some reason.

That's a good point and I like the idea of discussing this with SRE before we agree on a solution here. And thanks for bringing up git-ssh.wikimedia.org.

[...]

First of all thanks for the additional context!

It's also not easily possible (or recommended) for a server to have both public and private IPs on its NICs. The way LVS works is that the host only has a private IP on its NIC, and the public (virtual) IP on its loopback.

Can you explain this a little bit more? One machine can only be in either the private or public VLAN?
Mixing private and public addresses (private on primary and one additional public address for secondary) would be a reasonable intermediate step for the GitLab hosts. Migrating both interfaces to private IPs and configuring LVS sounds more complex. I don't think we have resources to implement and test this for all four new hosts in the near future.

It's also not easily possible (or recommended) for a server to have both public and private IPs on its NICs. The way LVS works is that the host only has a private IP on its NIC, and the public (virtual) IP on its loopback.

Can you explain this a little bit more? One machine can only be in either the private or public VLAN?

While having multiple vlans on a host is possible, only special servers are configured that way (some WMCS, LVS, Ganeti).
We try to stay away from such a config, as it brings a lot of edge cases in terms of server configuration (puppetization, routing), provisioning (interface selection, IP allocation, naming, scripts) and, to a lesser extent, switch configuration.

Mixing private and public addresses (private on primary and one additional public address for secondary) would be a reasonable intermediate step for the GitLab hosts. Migrating both interfaces to private IPs and configuring LVS sounds more complex. I don't think we have resources to implement and test this for all four new hosts in the near future.

I understand, and I agree that would have been a good way forward if it were common in our infra.

What do you think of keeping those hosts in the public vlan for the time being, but getting them set up with an LVS VIP as well?
So they would have 3 public IPs: interface IP, secondary IP, and LVS VIP; the first two dedicated to each host (as it's set up right now), and the third one shared.

This is more "wasteful" with IPs in the short to medium term, but it allows the service to be fronted by LVS without impacting the current way of doing things (and is thus much less time consuming).
Later on, if it's working as expected, we can progressively re-image them as private hosts, keeping the LVS VIP.

One small problem that we have... the additional IPs are puppetized in a way that assigns them as /32 on the host:

inet 208.80.154.15/32 scope global eno1
    valid_lft forever preferred_lft forever

This means that when the Netbox import-from-PuppetDB script is run, it creates a /32 VIP IP:
https://netbox.wikimedia.org/ipam/ip-addresses/10983/
in addition to the existing /26 one:
https://netbox.wikimedia.org/ipam/ip-addresses/10940/

While we could change the import script to check if the same IP already exists without a prefix (or with the subnet prefix), that would just make it fail and not help reconcile Netbox with reality.
I think that in this case we should instead make the puppetization assign it with the subnet prefix.
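
Roughly, the difference on the host would be the same address with a different prefix length:

# today: the extra IP is configured with a /32 prefix, so the Netbox import sees a /32
ip addr add 208.80.154.15/32 dev eno1

# suggested: configure it with the subnet's prefix so it matches the existing /26 object
ip addr add 208.80.154.15/26 dev eno1
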
@ayounsi thoughts?

My understanding is that it comes down to where we want to implement workarounds to make the current setup work, as using interface IPs as VIPs (or having multiple IPs on interfaces) is overall not our best practice.
It also needs to take into consideration how the service failover works. For example, if the IP moves between hosts, Netbox will diverge from reality when the VIP changes host if we use 208.80.154.15/26; in that case, using the /32 but not assigning it to an interface might be more suitable.

@Volans , @ayounsi , @cmooney , @BBlack and I had a chat about this topic during the SRE summit.

We talked about multiple options which would reduce IPv4 usage of GitLab nodes:

  • put GitLab behind pybal/lvs loadbalancer
  • disable git over ssh and use https only
  • use non-default SSH port
  • use IPv4 anycast address range
  • use IPv6 only

Put GitLab behind loadbalancer

GitLab could be hosted behind LVS/pybal. That would remove the need for any additional public IPv4 address. It was also confirmed that SSH (for git over ssh) can be load-balanced with the existing infrastructure.
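
To make the SSH part concrete: to LVS, git-over-ssh is just another TCP service, so a director could balance it alongside https. A minimal sketch with plain ipvsadm (in production pybal manages the IPVS state; all addresses are illustrative):

# https service on the VIP, with one GitLab backend using direct routing
ipvsadm -A -t 198.51.100.10:443 -s wrr
ipvsadm -a -t 198.51.100.10:443 -r 10.64.32.20 -g

# ssh handled the same way
ipvsadm -A -t 198.51.100.10:22 -s wrr
ipvsadm -a -t 198.51.100.10:22 -r 10.64.32.20 -g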

This option would add a dependency and mean GitLab is only usable if the loadbalancer is working. A major incident involving the loadbalancer could be more difficult to handle because deploying fixes and changes from GitLab would not work (at least not without additional tunnels).

We agreed this option would solve the IP address usage issue but comes at quite a big cost. We would introduce additional dependencies and complexity (especially regarding failover between instances and during big incidents). Furthermore, it was noted that GitLab doesn't need load balancing at the moment, as there is only one active host.

Disable git over ssh

GitLab offers git over SSH and https to push and pull code changes. This is similar to Gerrit. Disabling git over SSH would halve the required IPv4 addresses.

As this is also a product decision, I'd be interested what folks from RelEng think about that? (@brennen, @dancy, @thcipriani, @dduvall?)

Use non-default SSH port

It would also be possible to move the SSH daemon for git to a non-default port, similar to Gerrit. That would require explicitly adding the port when configuring the git remote, or setting up a custom SSH config.
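
For users that would look roughly like one of the following (the port number is hypothetical, borrowed from Gerrit's 29418, and the project path is just an example):

# either spell the port out in the remote URL...
git clone ssh://git@gitlab.wikimedia.org:29418/repos/example/project.git

# ...or hide it once per machine in ~/.ssh/config
Host gitlab.wikimedia.org
    Port 29418
    User git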

As this is also a product decision, I'd be interested what folks from RelEng think about that? (@brennen, @dancy, @thcipriani, @dduvall?)

For this and the previous option I have mixed feelings. My understanding is that GitLab should make it easier to contribute code for everyone. If we start cutting features or using non-standard configurations, that makes it harder to use GitLab.

Use IPv4 anycast

We also talked about assigning an IPv4 anycast address to the active GitLab host and using that for the public https/ssh endpoint. That address could be "assigned" (i.e. routed from all DCs) to any GitLab host. This also supports failover between data centers. With this approach, we could reduce the IP usage to one for the active host.

This option also introduces dependencies. Changes to the anycast setup will be more complicated if GitLab is using anycast too. During the conversation it was not clear whether such a major refactoring of the anycast setup is likely or not. Furthermore, GitLab replicas will not be usable anymore with only a single anycast address, so maybe another anycast address for the replica is needed to keep the replica similar to the production instance.

I hope my notes cover all discussed options. Feel free to add anything if I forgot something.

Thanks for the detailed write up as always @Jelto 🎉

For this and the previous option I have mixed feelings. My understanding is that GitLab should make it easier to contribute code for everyone. If we start cutting features or using non-standard configurations, that makes it harder to use GitLab.

+1 this is my hesitation with either removing ssh or changing it to a non-standard port.

I'd like to avoid both of those options from a product perspective.

Disabling SSH

Disabling git over ssh means users will need to use tokens and passwords (with workarounds like storing a token in plaintext in ~/.netrc, for example). Plus it limits GitLab's usability in environments where public keys are a requirement.
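
For example, the ~/.netrc workaround mentioned above leaves the token readable on disk (all values are placeholders):

machine gitlab.wikimedia.org
login jdoe
password glpat-xxxxxxxxxxxxxxxxxxxx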

Non standard port

The non-default ssh port in gerrit is a hassle.

After many years of using it, I still don't remember the port number (guess: 29418? yes! but I was like 60% sure without checking).

While this is suboptimal for new gerrit users, at least gerrit's upstream documentation also refers to this strange port. This would be different for GitLab, meaning much of the upstream documentation would be slightly wrong—bad usability :(

Seconding what @thcipriani said, I'm strongly against disabling git over ssh. Using HTTP only requires plaintext passwords to be stored on users' systems, which is a big downgrade in security. And changing to a non-standard ssh port works against the goal of providing an easy user experience.