
SVC DNS zonefiles and source of truth
Open, Medium, Public

Description

I'm opening this task as a follow-up to a chat that happened on IRC in #wikimedia-serviceops on Nov. 24th, to try to find the best solution for the source of truth of the .svc.$dc.wmnet DNS records. The goal is to be flexible enough to support the various current use cases without repeating the information across multiple data sources.

The current options were discussed, without any clear winner so far.

Data requirements

  • Netbox: allocated IPs for IPAM purposes, better if decorated with their DNS name
  • DNS: IPs <-> DNS name mapping (both direct and reverse), ideally auto-generated from some datasource, plus some CNAMEs
  • Puppet: IPs <-> DNS mapping with additional metadata that is needed to define the service in Puppet

Functional requirements

  • Aim to have a single source of truth from where it's easy to check existing records and add new records
  • Enforce the use of the same last octet in both the eqiad and codfw subnets for a given service IP
  • Support the one-DC-only use case, where the IP is allocated in the DC that is primary for that service, and the IP with the same last octet in the other DC is just reserved, with no DNS record created for it.
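The same-last-octet rule above is mechanical enough that the automation could validate it. A minimal sketch, assuming illustrative SVC subnets (the real ones live in Netbox):

```python
import ipaddress

# Illustrative SVC subnets per DC; the authoritative values are in Netbox.
SVC_SUBNETS = {
    "eqiad": ipaddress.ip_network("10.2.2.0/24"),
    "codfw": ipaddress.ip_network("10.2.1.0/24"),
}

def same_last_octet(ip_eqiad: str, ip_codfw: str) -> bool:
    """True if both service IPs sit in their DC's SVC subnet
    and share the same last octet."""
    a = ipaddress.ip_address(ip_eqiad)
    b = ipaddress.ip_address(ip_codfw)
    if a not in SVC_SUBNETS["eqiad"] or b not in SVC_SUBNETS["codfw"]:
        return False
    return a.packed[-1] == b.packed[-1]
```

For the one-DC-only case, the check would run against the reserved (but unpublished) IP in the secondary DC rather than against a DNS record.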

Current use cases (as of Sep. 2021)

  • Normal IP <-> DNS name 1:1 mappings for most services. Those could be automatically managed by Netbox with the current automation.
  • Special IP <-> DNS name many:1 mappings for some k8s-related services (notably staging), where round-robin DNS is used as a load-balancing system. Those cases aren't supported in Netbox unless we develop a custom plugin that holds the additional data needed for this use case.
  • CNAMEs:
    • swift -> ms-fe (@fgiunchedi ?)
    • kubestagemaster -> kubestagemaster1001 / kubestagemaster2001 (@akosiaris ?)
    • staging -> kubestage1001 / kubestage2001
    • termbox-test -> staging
    • prometheus -> prometheus3001 / prometheus4001 / prometheus5001 (@fgiunchedi ?)
  • SVC records that point to host IPs outside of the SVC subnets. Are those really required or just tech debt that should be fixed?
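In zonefile terms, the many:1 round-robin case is just the same name carrying several A records, and a non-default TTL is a property of the individual RR. A hypothetical fragment (names, IPs, and TTLs are illustrative, not the real records):

```
; hypothetical round-robin SVC entry: one name, several A RRs
kubestage   1H  IN  A  10.2.2.50
kubestage   1H  IN  A  10.2.2.51
; a per-record TTL override (like the 5M oresrdb case noted below)
; attaches to the individual RR -- exactly the kind of metadata the
; current Netbox model has no field for
oresrdb     5M  IN  A  10.2.2.60
```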

DNS Records with non-standard TTL:

  • prometheus CNAMEs: 5M TTL (@fgiunchedi ?)
  • We have just one for oresrdb that has a 5M TTL instead of the default 1H, and that's not currently supported by the Netbox automation for lack of a proper place to store that information. This record too points to a host IP and is not using a service IP.

Event Timeline

Volans triaged this task as Medium priority. Dec 14 2020, 9:57 AM
Volans created this task.

CNAMEs: we only have one right now for swift that points to ms-fe.svc.$dc.wmnet. @fgiunchedi are both records needed?

Yes, although swift.svc can be an A too if that makes any difference

  • DNS Records with non-standard TTL. We have just one for oresrdb that has a 5M TTL instead of the default 1H, and that's not currently supported by the Netbox automation for lack of a proper place to store that information. This record too points to a host IP and is not using a service IP.

We can get rid of the non-standard TTL for that one. In fact, we can get rid of the entire RR. It was put there to facilitate oresrdb maintenance, but due to various difficulties that never materialized well. We can point to the server directly instead.

  • SVC records that point to host IPs outside of the SVC subnets. Are those really required or just tech debt that should be fixed?

e.g. ganeti01.svc records? Yes, those are needed. It's a floating IP between the various nodes. It can't be an LVS IP and it can't be the IP of a node.

If some snowflakes (that can't be changed) prevent us from managing the bulk of records with Netbox we could move them to a different "namespace".
For example we could say: .svc. are for internal LVS services, so ganeti01.svc should be renamed ganeti01.infra. That way we don't have to find a solution that fits everything.

Change 655426 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] ores: Switch from oresrdb.svc to host names

https://gerrit.wikimedia.org/r/655426

  • DNS Records with non-standard TTL. We have just one for oresrdb that has a 5M TTL instead of the default 1H, and that's not currently supported by the Netbox automation for lack of a proper place to store that information. This record too points to a host IP and is not using a service IP.

We can get rid of the non-standard TTL for that one. In fact, we can get rid of the entire RR. It was put there to facilitate oresrdb maintenance, but due to various difficulties that never materialized well. We can point to the server directly instead.

  • The patch suggested above deprecates that DNS RR, so we should be able to remove it and not need to cater to this.
  • As far as CNAMEs go, can we support them? We've been experimenting a bit with moving staging RR A records to a single CNAME pointing to one of the hosts. We won't have as nice test coverage as expected, but due to the point below that part might also be in a state of flux.
  • I'll add a use case, btw, that we might experiment with this quarter. Now that we have a newer calico version, we can experiment with https://docs.projectcalico.org/networking/advertise-service-ips. That means a whole new set of service IPs will probably show up and be used throughout the fleet. Given it's probably going to be more automated than the current manual pybal/LVS setup, I expect them to proliferate faster than LVS service IPs (and if all goes according to plan, a lot of internal LVS service IPs might be moved to this scheme in the coming quarters). On the plus side, we can at least address this by adding a new set of zones (e.g. k8s.svc.(eqiad|codfw).wmnet) and delegating them to CoreDNS, which a) already runs on kubernetes and b) is already populating DNS records for those IPs for use in the cluster. Staging in codfw will be our first testbed for that.
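The delegation idea above would, in the parent zone, look roughly like this (the CoreDNS server names are placeholders, not real hosts):

```
; in the svc.codfw.wmnet zone: delegate k8s.svc.codfw.wmnet to CoreDNS
k8s   1H  IN  NS  coredns2001.codfw.wmnet.
k8s   1H  IN  NS  coredns2002.codfw.wmnet.
; CoreDNS, already authoritative for these IPs inside the cluster,
; then answers for them without the records ever touching Netbox
```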

If some snowflakes (that can't be changed) prevent us from managing the bulk of records with Netbox we could move them to a different "namespace".
For example we could say: .svc. are for internal LVS services, so ganeti01.svc should be renamed ganeti01.infra. That way we don't have to find a solution that fits everything.

That could work, but I think it's a single use case, which is probably why we never went that way. While that IP is a floating one, it is guaranteed to always be owned by one node (alongside the regular IP it has). If we could somehow treat it like a node when generating the zones it would also solve the problem (I think).

Change 655443 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/dns@master] Remove oresrdb.svc RRs

https://gerrit.wikimedia.org/r/655443

If we could somehow treat it like a node when generating the zones it would also solve the problem (I think).

I'm not sure I understood what you have in mind. We are already managing virtual IPs that are not attached to any device in Netbox; for example, the ganeti ones are: https://netbox.wikimedia.org/ipam/ip-addresses/?q=ganeti01 . The difference in those cases is that the IPs are not in the same .svc. zones, so right now the direct records for eqiad/codfw are manually set in the wmnet zonefile while the reverse PTRs are auto-generated by Netbox as part of the regular reverse zonefiles.
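The reverse side of that auto-generation is mechanical: a PTR name can be derived from the IP alone, which is why Netbox can emit the reverse zonefiles without knowing anything service-specific. A minimal sketch (the RR layout is illustrative, not the exact output of the real automation):

```python
import ipaddress

def ptr_record(ip: str, fqdn: str) -> str:
    """Return a PTR RR line mapping the given IP back to its forward name."""
    rname = ipaddress.ip_address(ip).reverse_pointer
    return f"{rname}. 1H IN PTR {fqdn}."

# e.g. for the ganeti floating IP shown elsewhere in this task
print(ptr_record("10.64.32.173", "ganeti01.svc.eqiad.wmnet"))
# → 173.32.64.10.in-addr.arpa. 1H IN PTR ganeti01.svc.eqiad.wmnet.
```

The forward (A) records are the hard part, since they need the service-level metadata that this task is about finding a home for.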

Change 655426 merged by Alexandros Kosiaris:
[operations/puppet@production] ores: Switch from oresrdb.svc to host names

https://gerrit.wikimedia.org/r/655426

Change 655443 merged by Alexandros Kosiaris:
[operations/dns@master] Remove oresrdb.svc RRs

https://gerrit.wikimedia.org/r/655443

  • DNS Records with non-standard TTL. We have just one for oresrdb that has a 5M TTL instead of the default 1H, and that's not currently supported by the Netbox automation for lack of a proper place to store that information. This record too points to a host IP and is not using a service IP.

We can get rid of the non-standard TTL for that one. In fact, we can get rid of the entire RR. It was put there to facilitate oresrdb maintenance, but due to various difficulties that never materialized well. We can point to the server directly instead.

  • Patch suggested above moves with deprecating that DNS RR, so we should be able to remove and not need to cater to this.

Done. This use case no longer exists.

If we could somehow treat it like a node when generating the zones it would also solve the problem (I think).

I'm not sure I understood what you have in mind.

That makes two of us :-). I was thinking of abusing the system a bit and creating a regular node entry (something akin to, say, https://netbox.wikimedia.org/ipam/ip-addresses/4599/) but now that I look at it again, I can see the drawbacks of this.

We are already managing virtual IPs that are not attached to any device in Netbox; for example, the ganeti ones are: https://netbox.wikimedia.org/ipam/ip-addresses/?q=ganeti01 . The difference in those cases is that the IPs are not in the same .svc. zones, so right now the direct records for eqiad/codfw are manually set in the wmnet zonefile while the reverse PTRs are auto-generated by Netbox as part of the regular reverse zonefiles.

I am not sure I follow either. Why does it make a difference if the IPs are in the same .svc. zone or not? Since reverse RRs are being generated OK, what is the issue with the forward ones specifically?

It has been a while since we discussed this but the problem still stands, and I think we need to make some progress here. What would be the best way forward? I can set up a meeting with all the people involved if that might help to outline next steps.

It has been a while since we discussed this but the problem still stands, and I think we need to make some progress here.

+1

What would be the best way forward? I can set up a meeting with all the people involved if that might help to outline next steps.

I think we first need to recap a bit where we are and what is still a problem. I think some of the use cases have been fixed (e.g. oresrdb no longer exists). Hopefully that will make it easier to see what the next steps are.

I think we first need to recap a bit where we are and what is still a problem. I think some of the use cases have been fixed (e.g. oresrdb no longer exists). Hopefully that will make it easier to see what the next steps are.

@akosiaris I've updated the Current use cases (as of Sep. 2021) section of the task description with the current status. Although a couple of use cases have disappeared, we have more exceptions in the existing categories than before.

ayounsi added subscribers: aborrero, MoritzMuehlenhoff.

I pinged a few people on the task description to find out if those records are needed (and to document their use case if so).

Thanks for chasing that info down!

So for

CNAMEs:
swift -> ms-fe
kubestagemaster -> kubestagemaster1001 / kubestagemaster2001
staging -> kubestage1001 / kubestage2001
termbox-test -> staging
prometheus -> prometheus3001 / prometheus4001 / prometheus5001

We could investigate following the helm-charts way. Back when we introduced helm-charts, I went ahead and allowed having a service without LVS (but with discovery, although that is not a requirement).

As far as

SVC records that point to host IPs outside of the SVC subnets. Are those really required or just tech debt that should be fixed?
ganeti01 (@MoritzMuehlenhoff ?)
ganeti-test01 (@MoritzMuehlenhoff ?)

goes, I can tell you that these are not going away. They are really required by the software. It's a floating IP that moves between nodes and is owned by the master one each time. We should support that use case; we can't remove it.

As far as

SVC records that point to host IPs outside of the SVC subnets. Are those really required or just tech debt that should be fixed?
ganeti01 (@MoritzMuehlenhoff ?)
ganeti-test01 (@MoritzMuehlenhoff ?)

goes, I can tell you that these are not going away. They are really required by the software. It's a floating IP that moves between nodes and is owned by the master one each time. We should support that use case; we can't remove it.

Sure, but I think there are various options. We're currently using a DNS name $name.svc.$dc.wmnet but an IP from the $name.$dc.wmnet subnets.
We could either use an IP from the SVC subnet for example, or move that to some other dedicated small subnet, and so on.

  • SVC records that point to host IPs outside of the SVC subnets. Are those really required or just tech debt that should be fixed?

This one is in use by labstore1005/labstore1006. It is a VIP that can be assigned to one of the two at a given time.

It is 100% technical debt that should be fixed. But also a major project. Good news is that we're currently working on it already: T291405: [NFS] Reduce or eliminate bare-metal NFS servers
So maybe the right thing to do is to leave it as is, and perhaps next Q we can drop it entirely?

I have no opinion on how that data should be stored wrt. netbox, puppet, etc. Whatever fits with you folks.

As far as

SVC records that point to host IPs outside of the SVC subnets. Are those really required or just tech debt that should be fixed?
ganeti01 (@MoritzMuehlenhoff ?)
ganeti-test01 (@MoritzMuehlenhoff ?)

goes, I can tell you that these are not going away. They are really required by the software. It's a floating IP that moves between nodes and is owned by the master one each time. We should support that use case; we can't remove it.

Sure, but I think there are various options. We're currently using a DNS name $name.svc.$dc.wmnet but an IP from the $name.$dc.wmnet subnets.
We could either use an IP from the SVC subnet for example, or move that to some other dedicated small subnet, and so on.

Currently, an IP from the SVC subnet would be a weird mess to get working. Routers have the entire /24 as discard in their aggregate route configuration, so traffic for it is blackholed unless a more specific route (like a /32) is announced from somewhere (e.g. pybal) or configured on the router as a static route (which would need a destination, so hardcoding the master(s) in the router config).
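The routing constraint described here boils down to: any address inside the discard aggregate is unreachable unless a more-specific route exists for it. A hedged sketch of that rule (the aggregate subnet is illustrative):

```python
import ipaddress

# Illustrative: the SVC /24 that the routers discard as an aggregate.
SVC_AGGREGATE = ipaddress.ip_network("10.2.1.0/24")

def reachable(ip: str, announced: set[str]) -> bool:
    """An IP inside the discard aggregate only routes if a more-specific
    /32 is announced (e.g. by pybal) or statically configured; anything
    outside the aggregate follows normal routing."""
    addr = ipaddress.ip_address(ip)
    if addr not in SVC_AGGREGATE:
        return True  # not subject to the blackhole
    return f"{ip}/32" in announced
```

This is why a Ganeti master VIP in the SVC subnet would need something actively announcing it, unlike today's secondary-IP arrangement on the node's own subnet.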

Currently, the IP needs to be on the same subnet as the nodes that are masters/master candidates. The reason is that the IP is a secondary IP on the same interface as the node's primary one, i.e.:

akosiaris@ganeti1009:~$ ip -4 addr ls dev private
4: private: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    inet 10.64.32.97/22 brd 10.64.35.255 scope global private
       valid_lft forever preferred_lft forever
    inet 10.64.32.173/32 scope global private:0
       valid_lft forever preferred_lft forever

That being said, it is possible to use a different interface by passing --master-netdev to gnt-cluster {init|modify}. That makes the idea of the dedicated small subnet approach (which would not have the issues the SVC subnet has) plausible. It's a big change across multiple systems though (switches, core routers, and the ganeti clusters themselves).

About the Ganeti records: as I understand it, the issue is that they have .svc. in their FQDN while not using an SVC IP.

I agree that changing the IP to something in the SVC subnet or a dedicated subnet is more trouble than we need.

Is there a downside of renaming them to something like ganeti01.eqiad.wmnet, ganeti01-vip.eqiad.wmnet, ganeti01.infra.eqiad.wmnet ?

About the Ganeti records: as I understand it, the issue is that they have .svc. in their FQDN while not using an SVC IP.

I agree that changing the IP to something in the SVC subnet or a dedicated subnet is more trouble than we need.

Is there a downside of renaming them to something like ganeti01.eqiad.wmnet, ganeti01-vip.eqiad.wmnet, ganeti01.infra.eqiad.wmnet ?

A small one is that we will need to update any callers to it (e.g. netbox IIRC) and our muscle memories, but other than that, no.

As far as Prometheus goes, the svc CNAME in the PoPs is there for symmetry with eqiad/codfw (where prometheus is LVS'd), so prometheus is always prometheus.svc.$site.wmnet.

re: the non-standard TTL for prometheus: likely an artifact of a migration, and it can be set to the standard value

re: swift vs ms-fe: back when we had a single swift cluster it made more sense to alias one to the other. However, with more swift clusters nowadays (thanos, and soon moss), the "swift" name should eventually be deprecated I think (i.e. when everything but "media storage" is really off the swift.svc cluster)

About the Ganeti records: as I understand it, the issue is that they have .svc. in their FQDN while not using an SVC IP.

I agree that changing the IP to something in the SVC subnet or a dedicated subnet is more trouble than we need.

Is there a downside of renaming them to something like ganeti01.eqiad.wmnet, ganeti01-vip.eqiad.wmnet, ganeti01.infra.eqiad.wmnet ?

A small one is that we will need to update any callers to it (e.g. netbox IIRC) and our muscle memories, but other than that, no.

Btw, when I mentioned in T270071#6736069 "If we could somehow treat it like a node when generating the zones it would also solve the problem (I think)", that's more or less what I was referring to.

@Volans do you find any of the above proposals helpful ?

@akosiaris my bad, I didn't understand you meant that there ;) Yeah, I think that if we could just rename the ganeti DNS record to be outside of the svc namespace and back into the $dc.wmnet zonefile, that should just work. There are surely a bunch of places where we need to update it, but I guess we could keep both names for a short period to make sure we've converted everything before removing the svc name.

@akosiaris my bad, I didn't understand you meant that there ;) Yeah, I think that if we could just rename the ganeti DNS record to be outside of the svc namespace and back into the $dc.wmnet zonefile, that should just work. There are surely a bunch of places where we need to update it, but I guess we could keep both names for a short period to make sure we've converted everything before removing the svc name.

Yup, +1