
service::catalog entries and dnsdisc for Kubernetes services under Ingress
Closed, Resolved (Public)

Description

Current state: One LVS + dnsdisc for a service on Kubernetes

Each service running on Kubernetes reserves a TCP port on which every Kubernetes Node listens. We then configure a service::catalog entry, including LVS and dnsdisc, for each of those services.
This gives us the benefits of being able to pool/depool services quickly and independently, as well as central configuration for monitoring, all using techniques that are well known to other SREs.

On the downside, adding a dedicated LVS involves quite a bit of manual labor, which we would like to reduce, especially for low-traffic services and services not yet in full production mode.

New state: One "meta" LVS (Kubernetes Ingress) + dedicated dnsdisc for a service on Kubernetes

On each Kubernetes Node (well, on most of them) we now run envoy, which binds to tcp/30443 and has a dedicated LVS and dnsdisc setup, k8s-ingress-wikikube (like all the other services described above). This envoy does TLS termination and can fan out to multiple "workload/real services" (running on Kubernetes) without the need for them to have a dedicated LVS setup.

We would like to keep adding those services to the service::catalog and add dnsdisc for them using the same LVS VIPs as k8s-ingress-wikikube, so that we are still able to pool/depool them individually and benefit from standard monitoring setups.

Unfortunately that creates a relationship between the dnsdisc record for the services and the dnsdisc record for k8s-ingress-wikikube. Services may be pooled/depooled individually, but depooling k8s-ingress-wikikube in one DC means all services depending on it will have to be depooled there as well.
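
To make the dependency concrete, here is a minimal sketch of how the effective state of a service under Ingress would follow from the two records. The function name and the dictionary layout are made up for illustration; they do not reflect how conftool actually stores or evaluates state.

```python
def effective_pooled(service_pooled, ingress_pooled):
    """Return, per DC, whether traffic can actually reach the service.

    Both arguments map a DC name to a boolean pooled state. A service
    under Ingress is only reachable in a DC if its own record and the
    k8s-ingress-wikikube record are both pooled there.
    """
    return {
        dc: pooled and ingress_pooled.get(dc, False)
        for dc, pooled in service_pooled.items()
    }


# Hypothetical states: miscweb is pooled everywhere, Ingress is depooled in codfw.
miscweb = {"eqiad": True, "codfw": True}
ingress = {"eqiad": True, "codfw": False}
print(effective_pooled(miscweb, ingress))
```

With these example states the service remains reachable in eqiad only, even though its own record is pooled in both DCs.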

The guinea pig for this is miscweb, where I've already removed the LVS stanza from service::catalog in:

The second patch was needed because I obviously messed up the port in the first patch, but also because monitor.pp does not create icinga host config for services without an lvs: stanza in service::catalog. The latter is going away now that the new probes structure is paging (T291946) and monitoring: is slowly being phased out, so I guess that may be ignored.

The open question now is how to account for the relationship between dnsdisc records.

Possible easy way forward

After another chat with @Joe today we agreed on going with what we think is the less work-intensive way forward for now. This means accepting the loss of the ability to "easily" depool services under Ingress (via conftool) but sticking with an individual service::catalog entry as well as a CNAME (pointing to k8s-ingress-wikikube.discovery.wmnet in the default state).

We have two options there:

  1. We could use CNAMEs in the usual form (like SERVICE.discovery.wmnet), which has potential benefits as existing tooling is or might be tailored towards that. The downside is that those CNAMEs might easily be confused with dnsdisc records, leading people to look in the wrong direction.
  2. Use a different DNS domain (we already have SERVICE.k8s-staging.discovery.wmnet) for those CNAMEs so they are not confused with "real" dnsdisc. This could be more work, as existing tooling might need to be adapted and we would need to refactor names again if we later decide to implement some relationship between dnsdisc records.
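
The naming difference between the two options can be sketched as follows. This is only an illustration: the helper is hypothetical, and the k8s-ingress subdomain used for option 2 is a placeholder, since no name has been chosen.

```python
DISCOVERY_DOMAIN = "discovery.wmnet"
# Default CNAME target as described above.
INGRESS_TARGET = "k8s-ingress-wikikube.discovery.wmnet"


def cname_record(service, subdomain=None):
    """Return (name, target) for the CNAME of a service under Ingress.

    Without a subdomain the name has the usual dnsdisc form (option 1).
    With a subdomain the record lives in a separate domain (option 2).
    """
    labels = [service]
    if subdomain:
        labels.append(subdomain)
    labels.append(DISCOVERY_DOMAIN)
    return ".".join(labels), INGRESS_TARGET


# Option 1: indistinguishable from a "real" dnsdisc name.
print(cname_record("miscweb"))
# Option 2: "k8s-ingress" is a placeholder subdomain, not an agreed name.
print(cname_record("miscweb", "k8s-ingress"))
```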

Further things to consider regarding service::catalog entries for services under ingress:

  • The monitoring: stanza can't be added as having that without lvs: breaks icinga. Can potentially be ignored (T291946), see above.
  • It's currently not clear (to me) if the absence of the discovery stanza has any implications besides not having dnsdisc

Event Timeline

JMeybohm renamed this task from service:.catalog entries and dnsdisc for Kubernetes sevrices under Ingress to service:.catalog entries and dnsdisc for Kubernetes services under Ingress.Apr 11 2022, 8:29 AM
JMeybohm renamed this task from service:.catalog entries and dnsdisc for Kubernetes services under Ingress to service::catalog entries and dnsdisc for Kubernetes services under Ingress.Apr 11 2022, 1:23 PM

Change 780651 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Add datahub-gms to the service catalog

https://gerrit.wikimedia.org/r/780651

Change 780658 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/dns@master] Add a CNAME reference for datahub-gms.discovery.wmnet

https://gerrit.wikimedia.org/r/780658

  • The monitoring: stanza can't be added as having that without lvs: breaks icinga. Can potentially be ignored (T291946), see above.

I am not sure this is true. I see helm-charts not having an lvs: stanza and still having monitoring and icinga having those services just fine. See

https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=chartmuseum1001&service=helm-charts+eqiad+port+443%2Ftcp+-+helm-charts+IPv4

and

https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=chartmuseum2001&service=helm-charts+codfw+port+443%2Ftcp+-+helm-charts+IPv4

configuration is at https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/hieradata/common/service.yaml#3242

The same holds true for the releases service (couple of lines below in service.yaml). In icinga we have

https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=releases2002&service=releases+codfw+port+443%2Ftcp+-+MediaWiki-+Parsoid-+MobileApps+and+other+Wikimedia+realease+files+-https%3A%2F%2Freleases.wikimedia.org-+IPv4

and https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=releases1002&service=releases+eqiad+port+443%2Ftcp+-+MediaWiki-+Parsoid-+MobileApps+and+other+Wikimedia+realease+files+-https%3A%2F%2Freleases.wikimedia.org-+IPv4

This means accepting the loss of the ability to "easily" depool services under Ingress (via conftool) but sticking with an individual service::catalog entry as well as a CNAME (pointing to k8s-ingress-wikikube.discovery.wmnet in the default state).

Do I understand correctly that this means we won't be creating the relationship mentioned in the task description?

The monitoring: stanza can't be added as having that without lvs: breaks icinga. Can potentially be ignored (T291946), see above.

I am not sure this is true. I see helm-charts not having an lvs: stanza and still having monitoring and icinga having those services just fine.

I'm happy to add a monitoring stanza to the new datahub-gms service CR as a verification for this, if that helps.

  • The monitoring: stanza can't be added as having that without lvs: breaks icinga. Can potentially be ignored (T291946), see above.

I am not sure this is true. I see helm-charts not having an lvs: stanza and still having monitoring and icinga having those services just fine.

I think this is a lucky coincidence, as for both services the host config is created via a monitoring::service resource (in modules/profile/manifests/chartmuseum.pp and modules/profile/manifests/releases/common.pp). As there are no "real hosts" in the case of ingress services, monitoring::service resources usually don't exist.

This means accepting the loss of the ability to "easily" depool services under Ingress (via conftool) but sticking with an individual service::catalog entry as well as a CNAME (pointing to k8s-ingress-wikikube.discovery.wmnet in the default state).

Do I understand correctly that this means we won't be creating the relationship mentioned in the task description?

Yes. At least not in the first step. Implementing that would need quite some time and we're currently short on that. So we would sacrifice the "easy individual depooling" functionality for now and create kind of a tight relationship with a CNAME. Of course we can still depool services individually via changes to the dns repo (pointing their discovery name to k8s-ingress-wikikube.svc.eqiad.wmnet or k8s-ingress-wikikube.svc.codfw.wmnet directly).
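
As a rough sketch of that manual approach, the CNAME target for a service could be picked like this. The function is hypothetical and only mirrors the record names mentioned above; the actual change would be a hand-edited patch to the dns repo.

```python
DEFAULT_TARGET = "k8s-ingress-wikikube.discovery.wmnet"
SVC_TARGETS = {
    "eqiad": "k8s-ingress-wikikube.svc.eqiad.wmnet",
    "codfw": "k8s-ingress-wikikube.svc.codfw.wmnet",
}


def cname_target(serving_dc=None):
    """Return the CNAME target for a service's discovery name.

    With no DC given, the service follows k8s-ingress-wikikube (default
    state). Passing a DC pins the service to that DC's svc record, which
    effectively depools it everywhere else.
    """
    if serving_dc is None:
        return DEFAULT_TARGET
    return SVC_TARGETS[serving_dc]


# Default state: follow the Ingress discovery record.
print(cname_target())
# Depool from eqiad by pointing the name at codfw directly.
print(cname_target("codfw"))
```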

Change 780658 merged by Btullis:

[operations/dns@master] Add a CNAME reference for datahub-gms.discovery.wmnet

https://gerrit.wikimedia.org/r/780658

Change 786322 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/dns@master] Update miscweb relates records for use with k8s ingress

https://gerrit.wikimedia.org/r/786322

Change 786323 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] Remove miscweb discovery resources

https://gerrit.wikimedia.org/r/786323

Change 786322 merged by JMeybohm:

[operations/dns@master] Update miscweb relates records for use with k8s ingress

https://gerrit.wikimedia.org/r/786322

Change 786323 merged by JMeybohm:

[operations/puppet@production] Remove miscweb discovery resources

https://gerrit.wikimedia.org/r/786323

Change 786977 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] trafficserver: change miscweb backend back to miscweb.discovery.wmnet

https://gerrit.wikimedia.org/r/786977

Change 786977 merged by JMeybohm:

[operations/puppet@production] trafficserver: change miscweb backend back to miscweb.discovery.wmnet

https://gerrit.wikimedia.org/r/786977

Change 787747 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/dns@master] Add -ro and -rw discovery records for k8s-ingress-wikikube

https://gerrit.wikimedia.org/r/787747

Change 787748 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] Add -ro and -rw discovery records for k8s-ingress-wikikube

https://gerrit.wikimedia.org/r/787748

Change 787748 merged by JMeybohm:

[operations/puppet@production] Add -ro and -rw discovery records for k8s-ingress-wikikube

https://gerrit.wikimedia.org/r/787748

Change 787750 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] Add desired state for k8s-ingress-wikikube -ro and -rw discovery records

https://gerrit.wikimedia.org/r/787750

Change 787747 merged by JMeybohm:

[operations/dns@master] Add -ro and -rw discovery records for k8s-ingress-wikikube

https://gerrit.wikimedia.org/r/787747

Change 787750 merged by JMeybohm:

[operations/puppet@production] Add desired state for k8s-ingress-wikikube -ro and -rw discovery records

https://gerrit.wikimedia.org/r/787750

Change 787752 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] Add -ro and -rw discovery records for k8s-ingress-wikikube

https://gerrit.wikimedia.org/r/787752

Change 787752 merged by JMeybohm:

[operations/puppet@production] Add -ro and -rw discovery records for k8s-ingress-wikikube

https://gerrit.wikimedia.org/r/787752

Change 787753 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/dns@master] Switch miscweb and datahub-gms to new discovery records

https://gerrit.wikimedia.org/r/787753

Change 787753 merged by JMeybohm:

[operations/dns@master] Switch miscweb and datahub-gms to new discovery records

https://gerrit.wikimedia.org/r/787753

Change 787756 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/dns@master] Remove k8s-ingress-wikikube.discovery.wmnet

https://gerrit.wikimedia.org/r/787756

Change 787757 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] Remove k8s-ingress-wikikube.discovery.wmnet

https://gerrit.wikimedia.org/r/787757

Change 787759 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] trafficserver: Switch datahub to new k8s-ingress-wikikube discovery

https://gerrit.wikimedia.org/r/787759

Change 787759 merged by JMeybohm:

[operations/puppet@production] trafficserver: Switch datahub to new k8s-ingress-wikikube discovery

https://gerrit.wikimedia.org/r/787759

Change 787756 merged by JMeybohm:

[operations/dns@master] Remove k8s-ingress-wikikube.discovery.wmnet

https://gerrit.wikimedia.org/r/787756

Change 787757 merged by JMeybohm:

[operations/puppet@production] Remove k8s-ingress-wikikube.discovery.wmnet

https://gerrit.wikimedia.org/r/787757

Change 780651 merged by Btullis:

[operations/puppet@production] Add DataHub GMS and frontend services to the service catalog

https://gerrit.wikimedia.org/r/780651