
Reduce usage of public IPv4 addresses on GitLab hosts
Open, Medium, Public

Description

GitLab hosts use two public IPv4 addresses per host. With two hosts in codfw and two in eqiad, 8 IPv4 addresses would be needed. Two of those hosts are still in setup, and the old VMs gitlab1001 and gitlab2001 will be decommissioned soon, which will release some IPs. But long term this is not a scalable approach. During setup of the new hosts in T307142, some discussions about IPv4 address usage also happened with Infrastructure Foundations.

This task is to track and discuss measures to reduce the number of public IPv4 addresses in use. There are currently two IPv4 addresses configured per GitLab host:

Primary Interface

The primary interface is used to connect to the host over SSH for management purposes. This interface is configured with a public IP.

Proposal:

  • Make the primary interface private (for example, by moving gitlab1001.wikimedia.org to gitlab1001.eqiad.wmnet). SSH access remains possible with the standard bastion/jumphost configuration. That would halve IPv4 usage.
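To illustrate the bastion path, a minimal client-side sketch (the bastion hostname and user are hypothetical placeholders, not the real production names):

```
# ~/.ssh/config -- reach the now-private primary interface via a jumphost
Host gitlab1001.eqiad.wmnet
    ProxyJump bastion.example.wikimedia.org   # placeholder bastion name
    User myuser                               # placeholder shell user
```

With this in place, `ssh gitlab1001.eqiad.wmnet` transparently hops through the bastion, so day-to-day management access does not change.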

Second/service Interface

The second interface is used to serve http/https traffic for gitlab.wikimedia.org. Furthermore, a second SSH daemon listens on that address to separate it from the management SSH daemon. These services are used directly by end users and need some kind of public endpoint.
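As a sketch of that split-daemon setup (the file path and address are hypothetical placeholders), the user-facing daemon is bound to the service IP only:

```
# sshd_config for the user-facing GitLab daemon (sketch)
Port 22
ListenAddress 208.80.154.15   # bind to the service IP only; the management
                              # daemon listens on the primary IP instead
PidFile /run/sshd-gitlab.pid  # separate PID file from the management sshd
```

Such a file would be passed to a second daemon instance, e.g. `/usr/sbin/sshd -f /etc/ssh/sshd_config_gitlab`, leaving the stock sshd untouched for management access.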

Possible options:

  • We could load-balance http/https and SSH traffic for the GitLab hosts and use a private second address. Some research is needed into whether the existing load-balancing infrastructure can handle SSH as well.
  • GitLab replicas (non-production instances) could drop the public address while they are replicas. However, this would significantly reduce the usefulness of those replicas for tests and failovers.
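On the first option: at the IPVS layer, SSH is just TCP, so balancing it looks the same as any other TCP service. A hedged sketch with `ipvsadm` (the VIP and backend addresses are made up):

```
# Add a virtual TCP service on port 22 (the VIP), weighted round-robin
ipvsadm -A -t 208.80.154.224:22 -s wrr
# Attach a backend (real server) using direct routing (-g)
ipvsadm -a -t 208.80.154.224:22 -r 10.64.16.42:22 -g -w 100
```

Whether PyBal's health checks and the rest of the production LVS tooling handle a non-HTTP service cleanly is the part that needs the research.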

Next steps:

I would like to experiment with private primary interfaces on the replicas. If that works well, we can migrate production GitLab to a private primary interface as well.

Event Timeline

Jelto triaged this task as Medium priority. Jun 9 2022, 11:39 AM

moving gitlab1001.wikimedia.org to gitlab1001.eqiad.wmnet

This is possible but would require reaching out to dcops to physically connect it to a different network and then a complete reimage. Basically like an install from scratch.

Some research is needed into whether the existing load-balancing infrastructure can handle SSH as well.

This would be recreating the setup we have for git-ssh.wikimedia.org on Phabricator, which is just that: an SSH service behind LVS for the general public to push git to.
It is what we are trying to get rid of elsewhere, and recreating it here brings all the complexity that comes with it: LVS/PyBal alerts when rebooting, harder migration to other hosts when/if needed, etc.

First and foremost though, the reason why gitlab has all public IPs is because we were trying to emulate the gerrit setup. And gerrit has public IPs and is not behind LVS because we wanted it that way. We wanted to be able to still use Gerrit and merge changes even if the caching layer is down for some reason. For the same reason icinga has a public IP. Certain services were not supposed to rely on loadbalancers.

So, if this is supposed to replace Gerrit completely in the long run, then we would first have to discuss that and make a conscious decision across SRE that this is no longer a valid concern.

Meanwhile I am wondering: are we _actually_ running out of IPs, or is this more of a theoretical problem? The same standards should apply to alert1001.wikimedia.org, cloudcontrol2001-dev.wikimedia.org, archiva1002 et al.

Thanks for opening this task!

Going that way would have multiple benefits:

  • reducing our public IP usage, which will become more and more important as we grow and become unable to procure more public IPs, so it is better to look at it early in a project
  • reducing our dependency on the public vlan; related to the above point, this facilitates a per-rack subnet design (having a public subnet in all the racks would be wasteful)
  • better security (fewer hosts directly exposed to the Internet)
  • easier migration and scaling (the public IP can stay the same and only redirect to different backend nodes), not tied to per-vlan host IPs. That contradicts Daniel's point, but maybe we're looking at different aspects of it?
  • standardizing our infra by having a single public entry point to our services, which might be a bit more difficult at the initial setup, but then benefits from our existing tooling, monitoring, team, incident response, etc.

That's why it's important to weigh the pros and cons.

Your point about being able to merge a change if the LVS is down is indeed worth discussing, and it goes along with the question of how to make a change when Gerrit/GitLab is down (for example, directly on the Puppetmaster) or how to tunnel around the LVS.

On the testing/implementation:
As Daniel mentioned, moving a host from the public vlan to a private one is best done with a re-image.

It's also not easily possible (or recommended) for a server to have both public and private IPs on its NICs. The way LVS works, the host only has a private IP on its NIC and the public (virtual) IP on its loopback.
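A sketch of that pattern on an LVS-fronted real server (all addresses hypothetical):

```
# Private host IP on the physical NIC
ip addr add 10.64.16.42/22 dev eno1
# Public service VIP on the loopback, as a /32, so the host accepts
# traffic for the VIP without owning it on the wire
ip addr add 208.80.154.224/32 dev lo
# Suppress ARP replies for the VIP (standard LVS direct-routing setup)
sysctl -w net.ipv4.conf.all.arp_ignore=1
sysctl -w net.ipv4.conf.all.arp_announce=2
```

The load balancer forwards packets addressed to the VIP to the backend, which accepts them because the VIP sits on its loopback; replies go straight back to the client.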

Another takeaway from the previous comment is that the LVS layer can handle SSH fine.

The same standards should apply to alert1001.wikimedia.org, cloudcontrol2001-dev.wikimedia.org, archiva1002 et al.

Indeed, we're progressively looking at all of them, especially when they're getting refreshed. There are cases where the public vlan is still the best option; that is not an issue, and it is valuable to document why.

First and foremost though, the reason why gitlab has all public IPs is because we were trying to emulate the gerrit setup. And gerrit has public IPs and is not behind LVS because we wanted it that way. We wanted to be able to still use Gerrit and merge changes even if the caching layer is down for some reason.

That's a good point and I like the idea of discussing this with SRE before we agree on a solution here. And thanks for bringing up git-ssh.wikimedia.org.

[...]

First of all thanks for the additional context!

It's also not easily possible (or recommended) for a server to have both public and private IPs on its NICs. The way LVS works, the host only has a private IP on its NIC and the public (virtual) IP on its loopback.

Can you explain this a little bit more? Can one machine only be in either the private or the public VLAN?
Mixing private and public addresses (private on the primary interface and one additional public address on the secondary) would be a reasonable intermediate step for the GitLab hosts. Migrating both interfaces to private IPs and configuring LVS sounds more complex, and I don't think we have the resources to implement and test this for all four new hosts in the near future.

It's also not easily possible (or recommended) for a server to have both public and private IPs on its NICs. The way LVS works, the host only has a private IP on its NIC and the public (virtual) IP on its loopback.

Can you explain this a little bit more? Can one machine only be in either the private or the public VLAN?

While having multiple vlans on a host is possible, only special servers are configured that way (some WMCS, LVS, and Ganeti hosts).
We try to stay away from such a config; it brings a lot of edge cases in terms of server configuration (puppetization, routing), provisioning (interface selection, IP allocation, naming, scripts), and, to a lesser extent, switch configuration.

Mixing private and public addresses (private on the primary interface and one additional public address on the secondary) would be a reasonable intermediate step for the GitLab hosts. Migrating both interfaces to private IPs and configuring LVS sounds more complex, and I don't think we have the resources to implement and test this for all four new hosts in the near future.

I understand, and I agree that this would have been a good way forward if it were common in our infra.

What do you think of keeping those hosts in the public vlan for the time being, but setting them up with an LVS VIP as well?
They would then have 3 public IPs: the interface IP, the secondary IP, and the LVS VIP; the first two dedicated to each host (as it's set up right now), and the third one shared.

This is more "wasteful" with IPs in the short to medium term, but it allows the service to be fronted by LVS without impacting the current way of doing things (and is thus much less time consuming).
Later on, if it's working as expected, we can progressively re-image the hosts as private hosts, keeping the LVS VIP.

One small problem that we have: the additional IPs are puppetized in a way that assigns them as a /32 on the host:

inet 208.80.154.15/32 scope global eno1
    valid_lft forever preferred_lft forever

This means that when the import-from-PuppetDB Netbox script is run, it creates a /32 VIP IP:
https://netbox.wikimedia.org/ipam/ip-addresses/10983/
in addition to the existing /26 one:
https://netbox.wikimedia.org/ipam/ip-addresses/10940/

While we could change the import script to check whether the same IP already exists without a prefix (or with the subnet prefix), that would just make it fail and would not help reconcile Netbox with reality.
I think that in this case we should instead make the puppetization assign the IP with the subnet prefix.
@ayounsi thoughts?
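For illustration, the difference on the host would look like this (commands are a sketch, using the addresses from the snippet above):

```
# Today: puppet adds the secondary service IP as a /32 on the NIC
ip addr add 208.80.154.15/32 dev eno1
# Proposed: assign it with the subnet prefix, matching the /26 Netbox knows about
ip addr add 208.80.154.15/26 dev eno1
```

With the /26 form, the address the PuppetDB import sees on the host matches the existing Netbox entry instead of creating a duplicate /32 record.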

My understanding is that it comes down to where we want to implement workarounds to make the current setup work, as using interface IPs as VIPs (or having multiple IPs on interfaces) is overall not our best practice.
It also needs to take into consideration how the service failover works: for example, if the IP moves between hosts and we use 208.80.154.15/26, Netbox will diverge from reality when the VIP changes host. In that case, using the /32 but not assigning it to an interface might be more suitable.