
Provide a convenient way to connect to services in kubernetes staging clusters
Closed, Resolved (Public)

Description

We currently rely on a rr-dns entry to connect to services running in Kubernetes clusters.

While this works in general, it requires manual configuration in DNS and differs from how traffic reaches clusters in production.

With upcoming ingress, we should add an LVS service k8s-ingress-staging pointing to the istio-ingressgateway on staging kubernetes nodes (as we would do it in production).

That LVS service should be active/passive (eqiad being active by default) to retain the functionality of the "hidden" staging cluster used by SREs to test infrastructure changes, updates, etc.

LVS is now set up and the staging clusters' ingressgateway can be reached via k8s-ingress-staging.discovery.wmnet.
miscweb, being the only service currently enabled, can be accessed with proper SNI:

curl -I --resolve "miscweb.staging.discovery.wmnet:30443:$(dig +short k8s-ingress-staging.discovery.wmnet)" 'https://miscweb.staging.discovery.wmnet:30443'
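The command above can be wrapped in a small helper so that the host, port and ingress address are not repeated by hand. This is only a sketch: the function names resolve_arg and probe_staging are made up for illustration, and it assumes dig and curl are available and that the last line printed by dig +short is the address to use.

```shell
#!/bin/sh
# Hypothetical helper (not part of the original task): build the value for
# curl's --resolve option, which has the form HOST:PORT:ADDRESS.
resolve_arg() {
    host="$1"; port="$2"; addr="$3"
    printf '%s:%s:%s\n' "$host" "$port" "$addr"
}

# Hypothetical wrapper: send a HEAD request to a staging service through the
# ingress VIP, keeping the service hostname for SNI and the Host header.
probe_staging() {
    host="$1"
    port="${2:-30443}"   # 30443 is the port used in the example above
    # Keep only the last line of dig's output (assumption: that is the address).
    addr="$(dig +short k8s-ingress-staging.discovery.wmnet | tail -n 1)"
    if [ -z "$addr" ]; then
        echo "could not resolve k8s-ingress-staging.discovery.wmnet" >&2
        return 1
    fi
    curl -I --resolve "$(resolve_arg "$host" "$port" "$addr")" "https://${host}:${port}"
}
```

With these defined, probe_staging miscweb.staging.discovery.wmnet should be equivalent to the curl invocation above.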

As a next step, I would like to add a wildcard record to DNS (like *.k8s-staging.discovery.wmnet; the name is just a proposal and what I currently use on the ingress side, though that is easy to change) pointing to whatever k8s-ingress-staging.discovery.wmnet currently points to. This would make setting up new services in staging clusters completely self-service (after the initial k8s user/namespace creation by SRE).
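Once such a wildcard record exists, one way to check that a name under it follows the discovery record could look like the sketch below. The function names resolve_last and wildcard_matches_ingress are hypothetical, and the comparison assumes that the last line of dig +short output is the address for both names.

```shell
#!/bin/sh
# Hypothetical check (not part of the original task).

# Print the last line of "dig +short" output for a name (assumed to be the address).
resolve_last() {
    dig +short "$1" | tail -n 1
}

# Succeed only if the given name resolves to the same address as the
# k8s-ingress-staging discovery record.
wildcard_matches_ingress() {
    name="$1"
    want="$(resolve_last k8s-ingress-staging.discovery.wmnet)"
    got="$(resolve_last "$name")"
    [ -n "$want" ] && [ "$want" = "$got" ]
}
```

For example, wildcard_matches_ingress miscweb.k8s-staging.discovery.wmnet && echo ok would print ok only when both names resolve to the same address.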

All of this applies only to staging, of course. Production services would still need an entry in services.yaml for monitoring, individual depooling, etc.

Event Timeline

Change 759253 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/dns@master] Add k8s-ingress-staging LVS VIPs

https://gerrit.wikimedia.org/r/759253

Change 759259 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] Add k8s-ingress-staging to conftool-data

https://gerrit.wikimedia.org/r/759259

Change 759260 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] Add LVS service k8s-ingress-staging

https://gerrit.wikimedia.org/r/759260

Change 759470 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] Create a new wikimedia_cluster: kubernetes-staging

https://gerrit.wikimedia.org/r/759470

Change 759470 merged by JMeybohm:

[operations/puppet@production] Create a new wikimedia_cluster: kubernetes-staging

https://gerrit.wikimedia.org/r/759470

Change 759253 merged by JMeybohm:

[operations/dns@master] Add k8s-ingress-staging LVS VIPs

https://gerrit.wikimedia.org/r/759253

Change 759259 merged by JMeybohm:

[operations/puppet@production] Add kubernetes-staging to conftool-data

https://gerrit.wikimedia.org/r/759259

Change 759260 merged by JMeybohm:

[operations/puppet@production] Add LVS service k8s-ingress-staging

https://gerrit.wikimedia.org/r/759260

Change 761357 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] Move k8s-ingress-staging to state: lvs_setup

https://gerrit.wikimedia.org/r/761357

Change 761357 merged by JMeybohm:

[operations/puppet@production] Move k8s-ingress-staging to state: lvs_setup

https://gerrit.wikimedia.org/r/761357

Mentioned in SAL (#wikimedia-operations) [2022-02-09T15:30:59Z] <jayme> restarting pybal on lvs2010,lvs1020 - T300740

Mentioned in SAL (#wikimedia-operations) [2022-02-09T15:56:16Z] <jayme> restarting pybal on lvs1015,lvs2009 - T300740

Mentioned in SAL (#wikimedia-operations) [2022-02-09T15:57:00Z] <jayme> ran sudo rm /var/run/confd-template/.k8s-ingress-staging*.err on puppetmaster2001 - T300740

Change 761380 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] Move k8s-ingress-staging to state: monitoring_setup

https://gerrit.wikimedia.org/r/761380

Change 761387 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/dns@master] Add discovery record k8s-ingress-staging

https://gerrit.wikimedia.org/r/761387

Change 761380 merged by JMeybohm:

[operations/puppet@production] Move k8s-ingress-staging to state: monitoring_setup

https://gerrit.wikimedia.org/r/761380

Change 761392 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] Move k8s-ingress-staging to state: poduction

https://gerrit.wikimedia.org/r/761392

Change 761393 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] Add k8s-ingress-staging to disc_desired_state.py

https://gerrit.wikimedia.org/r/761393

Change 761392 merged by JMeybohm:

[operations/puppet@production] Move k8s-ingress-staging to state: poduction

https://gerrit.wikimedia.org/r/761392

Change 761393 merged by JMeybohm:

[operations/puppet@production] Add k8s-ingress-staging to disc_desired_state.py

https://gerrit.wikimedia.org/r/761393

Change 761387 merged by JMeybohm:

[operations/dns@master] Add discovery record k8s-ingress-staging

https://gerrit.wikimedia.org/r/761387

Mentioned in SAL (#wikimedia-operations) [2022-02-09T17:07:26Z] <jayme> ran sudo rm /var/run/confd-template/.k8s-ingress-staging*.err on puppetmaster1001 - T300740

Change 761590 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] Add tcp-notls probe to k8s-ingress-staging

https://gerrit.wikimedia.org/r/761590

Change 761590 merged by JMeybohm:

[operations/puppet@production] Add tcp-notls probe to k8s-ingress-staging

https://gerrit.wikimedia.org/r/761590

I would generally agree that having all services with name *.staging.$dc.wmnet resolve to the same rr-record is what you want.

I'm not 100% sure what the best technical solution for this is; I would assume that, if we have an LVS endpoint, we can just write an A record for the wildcard?

Yes, but only for the DC-specific names. Ideally I'd like to have a wildcard pointing to the discovery record so people don't have to care about or know which DC is the active one.

Change 763717 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/dns@master] Add *.k8s-staging.discovery.wmnet

https://gerrit.wikimedia.org/r/763717

Change 765273 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] Add k8s-ingress-wikikube to conftool-data

https://gerrit.wikimedia.org/r/765273

Change 765273 merged by JMeybohm:

[operations/puppet@production] Add k8s-ingress-wikikube to conftool-data

https://gerrit.wikimedia.org/r/765273

Change 763717 merged by JMeybohm:

[operations/dns@master] Add *.k8s-staging.discovery.wmnet

https://gerrit.wikimedia.org/r/763717

Change 776162 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] Use *.k8s-staging.discovery.wmnet for staging certificates

https://gerrit.wikimedia.org/r/776162

Change 776163 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] Use *.k8s-staging.discovery.wmnet for staging Ingress

https://gerrit.wikimedia.org/r/776163

Change 776162 merged by jenkins-bot:

[operations/deployment-charts@master] Use *.k8s-staging.discovery.wmnet for staging certificates

https://gerrit.wikimedia.org/r/776162

Change 776163 merged by jenkins-bot:

[operations/deployment-charts@master] Use *.k8s-staging.discovery.wmnet for staging Ingress

https://gerrit.wikimedia.org/r/776163

Something like curl -I https://miscweb.k8s-staging.discovery.wmnet:30443 now works by default.
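Since every staging service now sits under the same wildcard, its URL can be derived from the service name alone. A minimal sketch, assuming the port stays at 30443 as in the example above; the function name staging_url is made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper (not part of the original task): build the staging URL
# for a service name under the *.k8s-staging.discovery.wmnet wildcard.
staging_url() {
    service="$1"
    port="${2:-30443}"   # default taken from the curl example above
    printf 'https://%s.k8s-staging.discovery.wmnet:%s\n' "$service" "$port"
}
```

Usage would then be curl -I "$(staging_url miscweb)".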