Support TLS for service-to-service communication in k8s staging
Closed, Resolved, Public

Description

Currently we cannot use TLS for service-to-service communication in the k8s staging environment, because all the certificates are issued for production hostnames. E.g. to call eventgate-analytics in staging I need to send a request to staging.svc.eqiad.wmnet on the right port, but the certificate is issued for eventgate-analytics.discovery.wmnet.

It is not critical to have encrypted connections in staging, but it would be convenient to be able to test TLS there. Plus, supporting it would make staging more consistent with production.
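
To illustrate the mismatch described above, here is a minimal Python sketch of the hostname check a TLS client performs against a certificate's DNS names. It is a simplification of the real matching rules (it only handles exact names and a single leading wildcard label), and the helper name `hostname_matches` is hypothetical. The second list holds the names a certificate would need to carry to be valid for the staging endpoints.

```python
from typing import Iterable


def hostname_matches(hostname: str, dns_names: Iterable[str]) -> bool:
    """Return True if hostname is covered by any of the given DNS names.

    Simplified check: names are compared label by label, and a leading
    "*" label in a certificate name matches exactly one hostname label.
    """
    host_labels = hostname.lower().rstrip(".").split(".")
    for name in dns_names:
        name_labels = name.lower().rstrip(".").split(".")
        if len(name_labels) != len(host_labels):
            continue
        first_ok = name_labels[0] in ("*", host_labels[0])
        if first_ok and name_labels[1:] == host_labels[1:]:
            return True
    return False


# Name the production certificate is issued for (from the description).
production_names = ["eventgate-analytics.discovery.wmnet"]
# Names a certificate would need in order to cover the staging endpoints.
staging_names = ["staging.svc.eqiad.wmnet", "staging.svc.codfw.wmnet"]

# The staging endpoint is not covered by the production certificate ...
print(hostname_matches("staging.svc.eqiad.wmnet", production_names))
# ... but it is covered by a certificate carrying the staging names.
print(hostname_matches("staging.svc.eqiad.wmnet", staging_names))
```

A client with hostname verification enabled rejects the first case, which is why TLS calls to staging fail with the production certificates.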

Event Timeline

I think we could provide an easy way to use TLS by creating a certificate for staging.svc.{eqiad,codfw}.wmnet and distributing it as the cert/key pair for all services via /etc/helmfile-defaults/private/$service/staging.yaml, where we currently add the actual certs for the production services.

Joe triaged this task as High priority. Sep 29 2020, 3:39 PM
JMeybohm subscribed.

As this is all in the private repo, I'll try to describe here what I plan to do:

  1. Add to private/modules/secret/secrets/certificates/certificate.manifests.d/kube_services.certs.yaml:
default-staging-certificate.wmnet:
  authority: puppet_ca
  expiry: null
  alt_names: ["staging.svc.eqiad.wmnet", "staging.svc.codfw.wmnet"]
  key:
    algorithm: ec
  1. Run cergen, commit, etc. (https://wikitech.wikimedia.org/wiki/Enable_TLS_for_Kubernetes_deployments#Create_and_place_certificates)
  1. Do something like the following to hieradata/role/common/deployment_server.yaml:
--- deployment_server.yaml.orig 2020-10-01 09:54:11.630710735 +0000
+++ deployment_server.yaml      2020-10-01 09:54:36.954662985 +0000
@@ -1,11 +1,17 @@
 profile::kubernetes::deployment_server_secrets::services:
+  default-staging-certificate:
+    tls: &default-staging-certificate
+      certs:
+        key: "secret(certificates/default-staging-certificate.wmnet/default-staging-certificate.wmnet.key.private.pem)"
+        cert: "secret(certificates/default-staging-certificate.wmnet/default-staging-certificate.wmnet.crt.pem)"
+
   blubberoid:
     staging:
+      tls: *default-staging-certificate
+    eqiad:
       tls: &blubberoid_certs
         certs:
           key: "secret(certificates/blubberoid.discovery.wmnet/blubberoid.discovery.wmnet.key.private.pem)"
           cert: "secret(certificates/blubberoid.discovery.wmnet/blubberoid.discovery.wmnet.crt.pem)"
-    eqiad:
-      tls: *blubberoid_certs
     codfw:
       tls: *blubberoid_certs
  1. Test and adopt for all other services.

There's a simpler way: have Puppet special-case staging instead of changing all the occurrences. But pick the approach you prefer.

Sounds good (and more generic). Let me try to understand it better (I'm still at an early Puppet level, you know :-) ).

I would create a hiera key like

profile::kubernetes::deployment_server_secrets::defaults:
  staging:
    tls:
      certs:
        key: "secret(certificates/default-staging-certificate.wmnet/default-staging-certificate.wmnet.key.private.pem)"
        cert: "secret(certificates/default-staging-certificate.wmnet/default-staging-certificate.wmnet.crt.pem)"
# More environments/defaults possible
# eqiad:
#   ...

And then patch modules/profile/manifests/kubernetes/deployment_server/helmfile.pp to deep_merge that with raw_data somewhere here: https://github.com/wikimedia/puppet/blob/52e2256938f66a2c05b902bab5bc5b4c4a1b7fd0/modules/profile/manifests/kubernetes/deployment_server/helmfile.pp#L186
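
The merge proposed above can be sketched in Python as follows. This is only an illustration of deep-merge semantics, not the Puppet code: the function and variable names, the placeholder values, and the service name `example-service` are hypothetical, and it assumes that per-service data should win over the per-environment defaults when both define the same key.

```python
import copy


def deep_merge(base: dict, override: dict) -> dict:
    """Recursively merge two dicts; on conflicts the value from override wins."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical per-environment defaults, shaped like the proposed hiera key.
defaults = {
    "staging": {
        "tls": {"certs": {"key": "default-staging-key", "cert": "default-staging-cert"}},
    },
}

# Hypothetical per-service data (placeholder values, not real secrets).
services = {
    "zotero": {"staging": {"some_secret": "value"}},
    "example-service": {"staging": {"tls": {"certs": {"cert": "own-cert"}}}},
}


def secrets_for(service: str, environment: str) -> dict:
    """Merge the environment defaults with the service's own data."""
    env_defaults = defaults.get(environment, {})
    service_data = services.get(service, {}).get(environment, {})
    return deep_merge(env_defaults, service_data)


# zotero keeps its own secret and inherits the default staging TLS pair.
print(secrets_for("zotero", "staging"))
# example-service overrides only the cert; the key still comes from the defaults.
print(secrets_for("example-service", "staging"))
```

With this ordering, a service only needs to define what differs from the defaults, and environments without a defaults entry are left unchanged.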

Yes, that's the general idea. I can see advantages to either approach.

Change 631720 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/puppet@production] deployment_server::helmfile: Allow default secrets per environment

https://gerrit.wikimedia.org/r/631720

Change 631724 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[labs/private@master] Add dummy default secrets

https://gerrit.wikimedia.org/r/631724

Change 631724 merged by JMeybohm:
[labs/private@master] Add dummy default secrets

https://gerrit.wikimedia.org/r/631724

I could use a pair of eyes on https://gerrit.wikimedia.org/r/q/bug:T260917
The PCC full diff (https://puppet-compiler.wmflabs.org/compiler1002/25621/) lacks defaultsecret: notdefault for staging zotero. What am I missing here?

It also lacks the file for staging/eqiad zotero completely. Are you sure it's all defined correctly in the labs/private data?

"eqiad zotero" not being there is fine as there is no secret for that combination defined in labs/private.

I changed the values a bit to also include a new (non-default) zotero secret for staging. That one also does not show up in the full diff, whereas it is shown in the change catalog (including the overridden defaults): https://puppet-compiler.wmflabs.org/compiler1003/25719/

labs/private changes are:

The catalog though, does have it. See https://puppet-compiler.wmflabs.org/compiler1002/25621/deploy1001.eqiad.wmnet/change.deploy1001.eqiad.wmnet.pson where I see the following

{
  "type": "File",
  "title": "/srv/deployment-charts/helmfile.d/services/staging/zotero/private/secrets.yaml",
  "tags": [
    "file",
    "class",
    "profile::kubernetes::deployment_server::helmfile",
    "profile",
    "kubernetes",
    "deployment_server",
    "helmfile",
    "profile::kubernetes::deployment_server",
    "role::deployment_server",
    "role"
  ],
  "file": "/srv/jenkins-workspace/puppet-compiler/25621/change/src/modules/profile/manifests/kubernetes/deployment_server/helmfile.pp",
  "line": 195,
  "exported": false,
  "parameters": {
    "owner": "mwdeploy",
    "group": "wikidev",
    "mode": "0640",
    "content": "defaultsecret: notdefault\nloooldata: a\n\n",
    "require": [
      "Git::Clone[operations/deployment-charts]",
      "File[/srv/deployment-charts/helmfile.d/services/staging/zotero/private]"
    ]
  }
},
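
A check like the one above can also be scripted. The following Python sketch looks up a File resource by title in a compiled catalog and returns its content parameter. It assumes the catalog JSON keeps its resources in a top-level "resources" list, which may not hold for every catalog format; the function name is hypothetical and the embedded catalog is an abridged copy of the resource quoted above.

```python
import json
from typing import Optional


def find_file_content(catalog: dict, path: str) -> Optional[str]:
    """Return the content parameter of the File resource titled path, if any."""
    # Assumption: resources live in a top-level "resources" list.
    for resource in catalog.get("resources", []):
        if resource.get("type") == "File" and resource.get("title") == path:
            return resource.get("parameters", {}).get("content")
    return None


# Abridged stand-in for the downloaded catalog; a raw string keeps the
# JSON "\n" escapes intact so that json.loads can decode them.
catalog_json = r"""
{
  "resources": [
    {
      "type": "File",
      "title": "/srv/deployment-charts/helmfile.d/services/staging/zotero/private/secrets.yaml",
      "parameters": {
        "owner": "mwdeploy",
        "group": "wikidev",
        "mode": "0640",
        "content": "defaultsecret: notdefault\nloooldata: a\n\n"
      }
    }
  ]
}
"""

catalog = json.loads(catalog_json)
secrets_path = "/srv/deployment-charts/helmfile.d/services/staging/zotero/private/secrets.yaml"
print(find_file_content(catalog, secrets_path))
```

Comparing this output for the production and change catalogs is one way to confirm a change independently of the rendered diff.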

What's peculiar is that https://puppet-compiler.wmflabs.org/compiler1002/25621/deploy1001.eqiad.wmnet/change.deploy1001.eqiad.wmnet.pson doesn't have the test: data entry. Perhaps that's an earlier PCC though?

Yeah, that's a PCC from before the second change to labs/private. https://puppet-compiler.wmflabs.org/compiler1003/25719/ is the current one (which contains test: data in change catalog).

OK, that explains it.

Leaning towards a PCC diff bug, then, to explain the lack of defaultsecret: notdefault.

Mentioned in SAL (#wikimedia-operations) [2020-10-14T14:12:44Z] <jayme> disable-puppet on deploy1001 to test a change in hemlfile puppet on deploy2001 only - T260917

Change 631720 merged by JMeybohm:
[operations/puppet@production] deployment_server::helmfile: Allow default secrets per environment

https://gerrit.wikimedia.org/r/631720

I tested the change on deploy2001 and it behaves as expected (i.e. different from what PCC suggests). This means PCC not showing a diff here is definitely a bug.

Mentioned in SAL (#wikimedia-operations) [2020-10-14T15:24:10Z] <jayme> enabled and ran puppet on deploy1001 - T260917

Deployed all services with the new certificate in staging. Should be fine now @Pchelolo - let me know if you run into any issues.