
Implement POC for istio ingress
Closed, Resolved, Public

Description

This is the follow-up task for T287007.

Already merged changes regarding this:

Those changes, together with a WIP one, made it possible to install Istio on staging-codfw, but there are still some open questions and things to fix:

  • Decide how we want the kube-apiserver to reach webhooks running inside of the cluster, see: T290967
  • Figure out how to deal with the internal CA that Istio manages. By default it is used to secure communication with istiod as well as to establish trust between the Ingress-Gateway and services.
    • We can leave that alone, as it is fully managed by Istio itself and only used for istiod<->istio-ingressgateway communication in our setup
  • Make Ingress-Gateway trust Puppet-CA (e.g. tls-proxy) certificates 730591
  • Make prometheus scrape istiod and Ingress-Gateway
  • Decide on how we want to run the Ingress-Gateway and ultimately how we want PyBal to healthcheck it/the k8s nodes. See "On running the Istio-Ingressgateway"
    • We will run the ingressgateway as a DaemonSet with an externalTrafficPolicy: Local service in front
  • Provision a default ingress gateway for staging clusters (serving staging.svc.<DC>.discovery.wmnet). Nothing we can easily do without changing the HTTP routing compared to production.
    • Create an active/passive LVS for staging and make it accessible: T300740
  • Implement something to provision k8s Secret objects (in istio-system namespace) for service certificates (currently generated via cergen) T294560
  • Bunch of docs and training session for SRE (https://wikitech.wikimedia.org/wiki/Kubernetes/Ingress)
  • Deploy all the things to wikikube clusters

I'm keeping some additional, unordered notes at https://wikitech.wikimedia.org/wiki/User:JMeybohm/Kubernetes/Ingress

On running the Istio-Ingressgateway

Regardless of the way we'll be deploying the ingressgateway, connections to it will happen via LVS -> NodePort. See what ML did

We can use PyBal's IdleConnection monitor, since the Ingressgateway HTTP health endpoint is exposed on a dedicated port and PyBal can only do ProxyFetch on the service port (not a different one).

We could potentially patch PyBal to allow a different port (maybe per ProxyFetch URLs) as well [1]
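
The port limitation can be illustrated with a small sketch. This is not PyBal's code or configuration: build_check_url and its monitor_port parameter are hypothetical names, the host name is made up, and the /healthz/ready path is an assumption based on Istio's default readiness endpoint. The two port numbers are the NodePorts that appear elsewhere in this task (30443 for the service, 30021 for the status port).

```python
from typing import Optional
from urllib.parse import urlunsplit


def build_check_url(host: str, service_port: int, path: str,
                    monitor_port: Optional[int] = None,
                    scheme: str = "http") -> str:
    """Return the URL a health check would fetch.

    Hypothetical helper: without monitor_port the check is tied to the
    service port; with it, the check targets a dedicated health port.
    """
    port = service_port if monitor_port is None else monitor_port
    return urlunsplit((scheme, f"{host}:{port}", path, "", ""))


# Host name and path are illustrative assumptions, not real infrastructure.
same_port = build_check_url("node.example", 30443, "/healthz/ready")
other_port = build_check_url("node.example", 30443, "/healthz/ready",
                             monitor_port=30021)
print(same_port)
print(other_port)
```

With only the first form available, the check can never reach a health endpoint that listens on its own port, which is why a plain connection check on the service port is the fallback.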

Autoscaling

By default Istio configures the Ingressgateway deployment (and control-plane) with autoscaling enabled (HPA) on targetAverageUtilization.

Pro

  • Run only as many ingressgateways as we need (potentially)

Con

  • Potential extra network hop from one Node to another (the one running an ingressgateway Pod)
  • PyBal can't differentiate between Ingressgateway down and Node down. If no ingressgateway is available, the NodePort won't accept connections and PyBal would see all Nodes as down (not sure if that's actually a problem).
  • We have no experience with HPA
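
For readers without HPA experience, the core scaling rule documented by Kubernetes is desiredReplicas = ceil(currentReplicas * currentMetricValue / targetMetricValue). The sketch below only illustrates that formula; desired_replicas is a hypothetical name, and the tolerance band, min/max replica bounds and stabilization behaviour of the real controller are deliberately left out.

```python
import math


def desired_replicas(current_replicas: int, current_utilization: float,
                     target_utilization: float) -> int:
    """Core HPA rule: ceil(current * currentMetric / targetMetric).

    Simplified on purpose: no tolerance band, no min/max bounds and no
    stabilization window, all of which the real controller applies.
    """
    if target_utilization <= 0:
        raise ValueError("target_utilization must be positive")
    return math.ceil(current_replicas * current_utilization / target_utilization)


# Two gateways at 90% average CPU with a 60% target scale up by one.
print(desired_replicas(2, 90, 60))
# Four gateways at 30% with the same target scale down by two.
print(desired_replicas(4, 30, 60))
```

The utilization numbers here are made up for illustration and are not measurements from any cluster.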

Daemonset

In this scenario we run the Ingressgateway as a DaemonSet (i.e. on each Node) and set its Service to externalTrafficPolicy=Local (this ensures a connection to a Node's NodePort will be answered by the Ingressgateway Pod on the same node).

Pro

  • No extra network hop between LVS and the Ingressgateway Pod
  • Health checking an Ingressgateway is actually health checking a Node (in contrast to some Ingressgateway potentially running on a different node)

Con

  • Waste of resources, as we run one Ingressgateway per node (we would need to figure out how much that is; more gateways will also add more load to the control plane)
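
The health checking argument for the DaemonSet variant can be shown with a toy model of NodePort behaviour. This is a simplified sketch, not kube-proxy logic: nodes_accepting and the node names are hypothetical, and readiness is reduced to a Pod count per node.

```python
from typing import Dict, Set


def nodes_accepting(pods_per_node: Dict[str, int], policy: str) -> Set[str]:
    """Return the nodes whose NodePort would answer a connection.

    Simplified model: with "Cluster" any node forwards to any ready Pod;
    with "Local" a node only answers if it hosts a ready Pod itself.
    """
    if policy == "Local":
        return {node for node, pods in pods_per_node.items() if pods > 0}
    if policy == "Cluster":
        any_pod = any(pods > 0 for pods in pods_per_node.values())
        return set(pods_per_node) if any_pod else set()
    raise ValueError(f"unknown policy: {policy}")


# DaemonSet: one gateway Pod per node, except node-c where the Pod is down.
cluster = {"node-a": 1, "node-b": 1, "node-c": 0}
print(sorted(nodes_accepting(cluster, "Local")))
print(sorted(nodes_accepting(cluster, "Cluster")))
```

In this model only the Local policy makes node-c fail its check, so the load balancer's view of a node matches the state of the gateway on that node.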

[1] https://github.com/wikimedia/PyBal/blob/b331a4a4cd62b2ec519b07a69a3cc8dd7b6711d5/pybal/monitors/proxyfetch.py#L131

Details

Project                       Branch      Lines +/-
operations/puppet             production  +1 -1
operations/puppet             production  +1 -1
operations/dns                master      +16 -6
operations/puppet             production  +1 -8
operations/dns                master      +2 -4
operations/puppet             production  +2 -19
operations/puppet             production  +1 -1
operations/deployment-charts  master      +82 -33
operations/deployment-charts  master      +20 -17
operations/deployment-charts  master      +4 -3
operations/deployment-charts  master      +29 -5
operations/puppet             production  +1 -1
operations/puppet             production  +1 -1
operations/deployment-charts  master      +6 -0
operations/deployment-charts  master      +13 -3
operations/puppet             production  +2 -0
operations/dns                master      +3 -1
operations/puppet             production  +1 -1
operations/puppet             production  +1 -1
operations/puppet             production  +1 -1
operations/puppet             production  +52 -0
operations/deployment-charts  master      +5 -3
operations/deployment-charts  master      +18 -0
operations/dns                master      +6 -2
operations/puppet             production  +8 -0
labs/private                  master      +16 -0
labs/private                  master      +3 -0
operations/deployment-charts  master      +0 -10
operations/deployment-charts  master      +1 -0
operations/deployment-charts  master      +0 -1
operations/deployment-charts  master      +8 -12
operations/deployment-charts  master      +27 -1
operations/deployment-charts  master      +44 -5
operations/deployment-charts  master      +1 -0
operations/deployment-charts  master      +8 -0
operations/deployment-charts  master      +10 -0
operations/deployment-charts  master      +7 -12
operations/deployment-charts  master      +12 -0
operations/deployment-charts  master      +14 -0
operations/deployment-charts  master      +315 -3
operations/deployment-charts  master      +4 -4
operations/deployment-charts  master      +4 -3
operations/deployment-charts  master      +3 -0
operations/deployment-charts  master      +143 -5


Event Timeline


Change 757898 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] Allow deploy users to create ingress and certificate objects

https://gerrit.wikimedia.org/r/757898

Change 757934 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] _ingress_helpers: HTTPRoute does not require a destination

https://gerrit.wikimedia.org/r/757934

Change 757935 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] Add ingress support to miscweb chart

https://gerrit.wikimedia.org/r/757935

Change 757936 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] miscweb: Remove repeating settings and enable ingress

https://gerrit.wikimedia.org/r/757936

Change 757898 merged by jenkins-bot:

[operations/deployment-charts@master] Allow deploy users to create ingress and certificate objects

https://gerrit.wikimedia.org/r/757898

Change 757934 merged by jenkins-bot:

[operations/deployment-charts@master] _ingress_helpers: HTTPRoute does not require a destination

https://gerrit.wikimedia.org/r/757934

Change 758880 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] Create certificates for different FQDN's in staging

https://gerrit.wikimedia.org/r/758880

Change 758881 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] Deploy ingress components to staging-eqiad

https://gerrit.wikimedia.org/r/758881

Change 758880 merged by jenkins-bot:

[operations/deployment-charts@master] Create certificates for different FQDN's in staging

https://gerrit.wikimedia.org/r/758880

Change 758881 merged by jenkins-bot:

[operations/deployment-charts@master] Deploy ingress components to staging-eqiad

https://gerrit.wikimedia.org/r/758881

Change 759726 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] Enable nodePort 30021 for ingressgateway status

https://gerrit.wikimedia.org/r/759726

Change 759727 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] Add ingress.staging switch

https://gerrit.wikimedia.org/r/759727

Change 759749 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/debs/pybal@master] Allow to configure a different port for ProxyFetch monitor

https://gerrit.wikimedia.org/r/759749

Change 759726 merged by jenkins-bot:

[operations/deployment-charts@master] Enable nodePort 30021 for ingressgateway status

https://gerrit.wikimedia.org/r/759726

Change 759727 merged by jenkins-bot:

[operations/deployment-charts@master] Add ingress.staging switch

https://gerrit.wikimedia.org/r/759727

Change 757935 merged by jenkins-bot:

[operations/deployment-charts@master] Add ingress support to miscweb chart

https://gerrit.wikimedia.org/r/757935

Change 757936 merged by jenkins-bot:

[operations/deployment-charts@master] miscweb: Remove repeating settings and enable ingress

https://gerrit.wikimedia.org/r/757936

Change 763700 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] Revert "Enable nodePort 30021 for ingressgateway status"

https://gerrit.wikimedia.org/r/763700

Change 763701 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] Increase istiod replicas to 2

https://gerrit.wikimedia.org/r/763701

Change 763700 merged by jenkins-bot:

[operations/deployment-charts@master] Revert "Enable nodePort 30021 for ingressgateway status"

https://gerrit.wikimedia.org/r/763700

Change 763701 merged by jenkins-bot:

[operations/deployment-charts@master] Increase istiod replicas to 2

https://gerrit.wikimedia.org/r/763701

Change 763705 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] Drop unused ports from istio-ingressgateway service definition

https://gerrit.wikimedia.org/r/763705

Change 763705 merged by jenkins-bot:

[operations/deployment-charts@master] Drop unused ports from istio-ingressgateway service definition

https://gerrit.wikimedia.org/r/763705

Change 764719 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[labs/private@master] Add a dedicated profile for k8s_wikikube

https://gerrit.wikimedia.org/r/764719

Change 764718 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] Add a dedicated profile for k8s_wikikube

https://gerrit.wikimedia.org/r/764718

Change 764719 merged by JMeybohm:

[labs/private@master] Add a dedicated profile for k8s_wikikube

https://gerrit.wikimedia.org/r/764719

Change 764722 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[labs/private@master] Add credentiald for cfssl-issuter to deployment_server_secrets

https://gerrit.wikimedia.org/r/764722

Change 764722 merged by JMeybohm:

[labs/private@master] Add credentiald for cfssl-issuter to deployment_server_secrets

https://gerrit.wikimedia.org/r/764722

Change 764723 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] Enable ingress and cert-manager in wikikube clusters

https://gerrit.wikimedia.org/r/764723

Change 764718 merged by JMeybohm:

[operations/puppet@production] Add a dedicated profile for k8s_wikikube

https://gerrit.wikimedia.org/r/764718

Change 764728 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/dns@master] Add k8s-inress-wikikube LVS VIPs

https://gerrit.wikimedia.org/r/764728

Change 764733 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] Add LVS servie k8s-ingress-wikikube

https://gerrit.wikimedia.org/r/764733

Change 764734 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] Move k8s-ingress-wikikube to state: lvs_setup

https://gerrit.wikimedia.org/r/764734

Change 764735 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] Move k8s-ingress-wikikube to state: monitoring_setup

https://gerrit.wikimedia.org/r/764735

Change 764736 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] Move k8s-ingress-wikikube to state: production

https://gerrit.wikimedia.org/r/764736

Change 764738 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/dns@master] Add k8s-ingress-wikikube discovery record

https://gerrit.wikimedia.org/r/764738

Change 764739 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] Add k8s-ingress-wikikube to disc_desired_state.py

https://gerrit.wikimedia.org/r/764739

Change 764749 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] miscweb: Enable ingress for all clusters

https://gerrit.wikimedia.org/r/764749

Change 764728 merged by JMeybohm:

[operations/dns@master] Add k8s-ingress-wikikube LVS VIPs

https://gerrit.wikimedia.org/r/764728

Change 764723 merged by jenkins-bot:

[operations/deployment-charts@master] Enable ingress and cert-manager in wikikube clusters

https://gerrit.wikimedia.org/r/764723

Change 764749 merged by jenkins-bot:

[operations/deployment-charts@master] miscweb: Enable ingress for all clusters

https://gerrit.wikimedia.org/r/764749

Change 764733 merged by JMeybohm:

[operations/puppet@production] Add LVS servie k8s-ingress-wikikube

https://gerrit.wikimedia.org/r/764733

Change 764734 merged by JMeybohm:

[operations/puppet@production] Move k8s-ingress-wikikube to state: lvs_setup

https://gerrit.wikimedia.org/r/764734

Mentioned in SAL (#wikimedia-operations) [2022-02-23T14:08:21Z] <jayme> restarting pybal on lvs1020,lvs2010 - T290966

Mentioned in SAL (#wikimedia-operations) [2022-02-23T14:12:45Z] <jayme> restarting pybal on lvs1019,lvs2009 - T290966

Change 764735 merged by JMeybohm:

[operations/puppet@production] Move k8s-ingress-wikikube to state: monitoring_setup

https://gerrit.wikimedia.org/r/764735

Change 764736 merged by JMeybohm:

[operations/puppet@production] Move k8s-ingress-wikikube to state: production

https://gerrit.wikimedia.org/r/764736

Change 764738 merged by JMeybohm:

[operations/dns@master] Add k8s-ingress-wikikube discovery record

https://gerrit.wikimedia.org/r/764738

Change 764739 merged by JMeybohm:

[operations/puppet@production] Add k8s-ingress-wikikube to disc_desired_state.py

https://gerrit.wikimedia.org/r/764739

Change 765502 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] Add tlsExtraSANs config to namespaces

https://gerrit.wikimedia.org/r/765502

Change 765502 merged by jenkins-bot:

[operations/deployment-charts@master] Add tlsExtraSANs config to namespaces

https://gerrit.wikimedia.org/r/765502

Change 765564 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] Add static-bugzilla.wikimedia.org gatewayHost to miscweb

https://gerrit.wikimedia.org/r/765564

Change 765564 merged by jenkins-bot:

[operations/deployment-charts@master] Add static-bugzilla.wikimedia.org gatewayHost to miscweb

https://gerrit.wikimedia.org/r/765564

Change 765572 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] trafficserver: change miscweb backend to k8s-ingress-wikikube

https://gerrit.wikimedia.org/r/765572

Change 765572 merged by Dzahn:

[operations/puppet@production] trafficserver: change miscweb backend to k8s-ingress-wikikube

https://gerrit.wikimedia.org/r/765572

Mentioned in SAL (#wikimedia-operations) [2022-02-24T22:06:05Z] <mutante> static-bugzilla.wikimedia.org - kubernetes - deployed gerrit:765572 - first prod service behind a k8s ingress (T290966)

Deployed @JMeybohm's change https://gerrit.wikimedia.org/r/c/operations/puppet/+/765572

Now miscweb (https://static-bugzilla.wikimedia.org/) is behind the Istio ingress.

I can see fresh traffic here:

Screenshot from 2022-02-24 13-59-38.png (442×1 px, 37 KB)

and logs here:

Screenshot from 2022-02-24 13-59-48.png (898×1 px, 287 KB)

after searching by the "authority" field: authority=static-bugzilla.wikimedia.org:30443

I got the links to the dashboards from https://wikitech.wikimedia.org/wiki/Kubernetes/Ingress

Change 767078 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] Make k8s-ingress-wikikube page

https://gerrit.wikimedia.org/r/767078

Change 767078 merged by JMeybohm:

[operations/puppet@production] Make k8s-ingress-wikikube page

https://gerrit.wikimedia.org/r/767078

Change 770504 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] Remove LVS for miscweb

https://gerrit.wikimedia.org/r/770504

Change 770506 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/dns@master] Move miscweb from it's own LVS VIP to k8s-ingress-wikikube

https://gerrit.wikimedia.org/r/770506

Change 770556 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] Prevent allocation of nodePorts when ingress is used

https://gerrit.wikimedia.org/r/770556

Change 773191 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] Update miscweb to latest scaffold

https://gerrit.wikimedia.org/r/773191

Change 770556 merged by jenkins-bot:

[operations/deployment-charts@master] Switch service type to ClusterIP in case Ingress is enabled

https://gerrit.wikimedia.org/r/770556

Change 773191 merged by JMeybohm:

[operations/deployment-charts@master] Update miscweb to latest scaffold

https://gerrit.wikimedia.org/r/773191

Change 773255 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] Allow multiple tlsHostnames

https://gerrit.wikimedia.org/r/773255

Change 773255 merged by jenkins-bot:

[operations/deployment-charts@master] Allow multiple tlsHostnames

https://gerrit.wikimedia.org/r/773255

Change 773805 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] Allow to specify additional gatewayHosts without overriding the default

https://gerrit.wikimedia.org/r/773805

Change 773805 merged by jenkins-bot:

[operations/deployment-charts@master] Allow to specify additional gateway hosts without overriding the default

https://gerrit.wikimedia.org/r/773805

Change 774916 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] Move miscweb back to state monitoring_setup

https://gerrit.wikimedia.org/r/774916

Change 774917 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] Move miscweb back to state production

https://gerrit.wikimedia.org/r/774917

Change 774916 merged by JMeybohm:

[operations/puppet@production] Move miscweb back to state monitoring_setup

https://gerrit.wikimedia.org/r/774916

Change 770504 merged by JMeybohm:

[operations/puppet@production] Remove LVS for miscweb

https://gerrit.wikimedia.org/r/770504

Change 770506 merged by JMeybohm:

[operations/dns@master] Move miscweb from it's own LVS VIP to k8s-ingress-wikikube

https://gerrit.wikimedia.org/r/770506

Change 775319 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] Remove monitoring from kubernetes miscweb for now

https://gerrit.wikimedia.org/r/775319

Change 775319 merged by JMeybohm:

[operations/puppet@production] Remove monitoring from kubernetes miscweb for now

https://gerrit.wikimedia.org/r/775319

Change 786322 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/dns@master] Update miscweb relates records for use with k8s ingress

https://gerrit.wikimedia.org/r/786322

Change 786322 merged by JMeybohm:

[operations/dns@master] Update miscweb relates records for use with k8s ingress

https://gerrit.wikimedia.org/r/786322

Change 774917 merged by JMeybohm:

[operations/puppet@production] Move miscweb back to state production

https://gerrit.wikimedia.org/r/774917

Change 786977 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] trafficserver: change miscweb backend back to miscweb.discovery.wmnet

https://gerrit.wikimedia.org/r/786977

Change 786977 merged by JMeybohm:

[operations/puppet@production] trafficserver: change miscweb backend back to miscweb.discovery.wmnet

https://gerrit.wikimedia.org/r/786977

This is done with miscweb being the first full Ingress service and datahub following up.
Docs at https://wikitech.wikimedia.org/wiki/Kubernetes/Ingress