Page MenuHomePhabricator

Sessionstore's discovery TLS cert will expire before end of May 2024
Open, HighPublic

Description

Hi folks!

The TLS cert for sessionstore.discovery.wmnet will expire before end of May 2024:

$ openssl s_client -connect sessionstore.discovery.wmnet:8081 | openssl x509 -text
[..]
        Validity
            Not Before: May 29 17:38:58 2019 GMT
            Not After : May 28 17:38:58 2024 GMT
[..]

We should replace it to avoid any outages, or move the service to PKI/Mesh in K8s.

If we want to just regenerate the Puppet TLS cert: https://wikitech.wikimedia.org/wiki/Cergen#Update_a_certificate

Event Timeline

This certificate doesn't show up anywhere in certificate.manifests.d for cergen, though?

JMeybohm triaged this task as High priority.
JMeybohm subscribed.

This certificate doesn't show up anywhere in certificate.manifests.d for cergen, though?

This is probably because I've deleted them during T300033: Use cert-manager for service-proxy certificate creation thinking that I had migrated everything. Ofc. sessionstore/kask is different and does not use the service mesh/the service proxy for tls termination so I did not catch it

The cert is here:

elukey@puppetmaster1001:/srv/private$ find -name *session* | grep -v cassandra
./modules/secret/secrets/ssl/sessionstore.discovery.wmnet.key

@Eevans IIUC kask terminates TLS by itself for session store, is it right? Would it be a problem to move to the mesh k8s module, namely to use the envoy sidecar that terminates TLS and proxies the request (in this case, kask would listen to a plaintext localhost:port combination). I am asking since we could move to PKI directly if kask's chart used the mesh module.

@akosiaris maybe you recall if there was a deliberate decision not to use service mesh for kask/session store?

If we're not using the mesh, kask should support reloading certs from disk on changes. With that we could just replace the cergen certs with auto-renewing ones like we do in the mesh.

@akosiaris maybe you recall if there was a deliberate decision not to use service mesh for kask/session store?

Yes it was a deliberate decision. There is some back story but the TL;DR is

  • Kask predates the service mesh, needed TLS and so implemented itself
  • Sessionstore is pretty critical and could benefit from not having the extra latency the service mesh possibly adds
  • A change to sessionstore was somewhat high risk and altering the behavior of a well functioning component didn't seem like it had any kind of significant payoff back then.

If we're not using the mesh, kask should support reloading certs from disk on changes. With that we could just replace the cergen certs with auto-renewing ones like we do in the mesh.

Sure, but I don't know if it actually supports reloading certs from disk, adding @Eevans.

@akosiaris maybe you recall if there was a deliberate decision not to use service mesh for kask/session store?

Yes it was a deliberate decision. There is some back story but the TL;DR is

  • Kask predates the service mesh, needed TLS and so implemented itself
  • Sessionstore is pretty critical and could benefit from not having the extra latency the service mesh possibly adds
  • A change to sessionstore was somewhat high risk and altering the behavior of a well functioning component didn't seem like it had any kind of significant payoff back then.

If we're not using the mesh, kask should support reloading certs from disk on changes. With that we could just replace the cergen certs with auto-renewing ones like we do in the mesh.

Sure, but I don't know if it actually supports reloading certs from disk, adding @Eevans.

It does not, it just uses the Golang http module's standard server initialization. Adding this would be pretty easy though.

I'd vote to add the mesh support/configuration at this point, it seems less risky and error prone than allowing kask to reload TLS certs. The only concern would be the extra latency involved, but in theory it shouldn't be heavy (it add an extra hop/tcp conn to localhost, we can measure the impact in staging and decide).

I tend to agree, also for sake of alignment of sessionstore with the rest of our services. Unfortunately this feels like the more involved change (A change to sessionstore was somewhat high risk ...) - but I think it's also true that it should not add much to latency.

Steps that I see:

  • Renewing the existing cergen cert to give us breathing room just in case. We're looking at less than 2 weeks of headroom for a major change to one of our most critical services
  • Add toggleable mesh support
  • Get a baseline for performance in staging by running siege for a few hours
  • Deploy mesh-enabled version to staging, compare performance before and after
  • Roll forward if everything looks okay

Only concern around testing in staging is having reasonably representative infrastructure, given that it's both using cassandra-dev, and in codfw from pods in eqiad. That said, we're just looking for a baseline and not identical numbers.

Also worth noting that mediawiki already calls sessionstore via it's envoy sidecar, so we do have telemetry data from prod and we should be able to see the impact there pretty quickly as well: https://grafana-rw.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s?orgId=1&var-datasource=thanos&var-site=eqiad&var-prometheus=k8s&var-app=mediawiki&var-kubernetes_namespace=All&var-destination=sessionstore

Steps that I see:

  • Renewing the existing cergen cert to give us breathing room just in case. We're looking at less than 2 weeks of headroom for a major change to one of our most critical services

[ ... ]

This certificate doesn't show up anywhere in certificate.manifests.d for cergen, though?

This is probably because I've deleted them during T300033: Use cert-manager for service-proxy certificate creation thinking that I had migrated everything. Ofc. sessionstore/kask is different and does not use the service mesh/the service proxy for tls termination so I did not catch it

The cert is here:

elukey@puppetmaster1001:/srv/private$ find -name *session* | grep -v cassandra
./modules/secret/secrets/ssl/sessionstore.discovery.wmnet.key

That's just a key; I don't see anything but key files there...

@Eevans IIUC that is the private key, it gets rendered in /etc/helmfile-defaults/private/main_services/sessionstore on deploy1002 and eventually it is deployed on the session store pods. The public key is stored in deployment-charts, in the sessionstore's values.yaml. I suspect that cergen wasn't used for this use case, or something more manual was chosen because of K8s.

Steps that I see:

  • Renewing the existing cergen cert to give us breathing room just in case. We're looking at less than 2 weeks of headroom for a major change to one of our most critical services
  • Add toggleable mesh support
  • Get a baseline for performance in staging by running siege for a few hours
  • Deploy mesh-enabled version to staging, compare performance before and after
  • Roll forward if everything looks okay

Only concern around testing in staging is having reasonably representative infrastructure, given that it's both using cassandra-dev, and in codfw from pods in eqiad. That said, we're just looking for a baseline and not identical numbers.

+1! Not sure how to renew the existing cert since IIUC cergen wasn't used, but we can try to check in old tasks to see what it was done.

Steps that I see:

  • Renewing the existing cergen cert to give us breathing room just in case. We're looking at less than 2 weeks of headroom for a major change to one of our most critical services
  • Add toggleable mesh support
  • Get a baseline for performance in staging by running siege for a few hours
  • Deploy mesh-enabled version to staging, compare performance before and after
  • Roll forward if everything looks okay

Only concern around testing in staging is having reasonably representative infrastructure, given that it's both using cassandra-dev, and in codfw from pods in eqiad. That said, we're just looking for a baseline and not identical numbers.

+1! Not sure how to renew the existing cert since IIUC cergen wasn't used, but we can try to check in old tasks to see what it was done.

I'm out next week, and the cert is about to expire, so I'll try to reverse engineer this. If anyone has first-hand knowledge of how this was done, please let me know!

+1! Not sure how to renew the existing cert since IIUC cergen wasn't used, but we can try to check in old tasks to see what it was done.

I'm out next week, and the cert is about to expire, so I'll try to reverse engineer this. If anyone has first-hand knowledge of how this was done, please let me know!

Looking at git log in the private there's a reference to https://phabricator.wikimedia.org/T220401, so it seems Alex added it back then.

Steps that I see:

  • Renewing the existing cergen cert to give us breathing room just in case. We're looking at less than 2 weeks of headroom for a major change to one of our most critical services
  • Add toggleable mesh support
  • Get a baseline for performance in staging by running siege for a few hours
  • Deploy mesh-enabled version to staging, compare performance before and after
  • Roll forward if everything looks okay

Only concern around testing in staging is having reasonably representative infrastructure, given that it's both using cassandra-dev, and in codfw from pods in eqiad. That said, we're just looking for a baseline and not identical numbers.

+1! Not sure how to renew the existing cert since IIUC cergen wasn't used, but we can try to check in old tasks to see what it was done.

I'm out next week, and the cert is about to expire, so I'll try to reverse engineer this. If anyone has first-hand knowledge of how this was done, please let me know!

@akosiaris pointed out that he generated the certs with utils/create_ecdsa_cert - I've been testing this out and it will hopefully work similarly.

Change #1034952 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/puppet@production] utils: remove pem copying step from ecdsa_cert

https://gerrit.wikimedia.org/r/1034952

Mentioned in SAL (#wikimedia-operations) [2024-05-22T14:57:01Z] <hnowlan> running puppet cert revoke sessionstore.discovery.wmnet T363996

Change #1034972 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] sessionstore: use new cert to go with new key in staging

https://gerrit.wikimedia.org/r/1034972

Change #1034972 merged by jenkins-bot:

[operations/deployment-charts@master] sessionstore: use new cert to go with new key in staging

https://gerrit.wikimedia.org/r/1034972

Change #1035010 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] kask: checksum tls certs

https://gerrit.wikimedia.org/r/1035010

Current situation - I have refreshed the .key file on the puppet master using a modified version of the create_ecdsa_cert script, and I have pushed the new key to the staging k8s secrets for sessionstore only. I've also updated the cert file in the helmfile configuration for sessionstore, but it hasn't picked it up because that part of the config wasn't checksummed to recreate the pods (fixed in this change). Tomorrow I will try to roll to codfw and, if successful, eqiad.

Change #1035010 merged by jenkins-bot:

[operations/deployment-charts@master] kask: checksum tls certs

https://gerrit.wikimedia.org/r/1035010

Change #1035357 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] sessionstore: update certs in advance of expiry

https://gerrit.wikimedia.org/r/1035357

Change #1035357 merged by jenkins-bot:

[operations/deployment-charts@master] sessionstore: update certs in advance of expiry

https://gerrit.wikimedia.org/r/1035357

certs updated in all DCs, alerts resolved. I sincerely hope we will have the mesh migration resolved so we can avoid having to update echostore's certificates in October, but in case something prevents that and for reference the process was:

  • puppet cert revoke sessionstore.discovery.wmnet
  • In the puppet repo on your local checkout ./utils/create_ecdsa_cert sessionstore.discovery.wmnet sessionstore.svc.eqiad.wmnet sessionstore.svc.codfw.wmnet
  • On the puppetmaster, put the contents of /var/lib/puppet/server/ssl/ca/signed/sessionstore.discovery.wmnet.pem into certs.kask.cert in helmfile.d
  • Add the contents of the new private key from ./modules/secret/secrets/ssl/sessionstore.discovery.wmnet.key to hieradata/role/common/deployment_server/kubernetes.yaml
  • Validate the files and make sure everything looks okay using openssl ec/openssl x509, then git commit your changes in private
  • Follow the Helm rollout process as normal, keeping an eye on the sessionstore graphs and the session loss graphs

We currently have no documentation for echostore so the impact of a broken rollout is unclear, but it's just another kask instance should it come to the point that we need to follow the above process. That said, worryingly it doesn't seem like a cert even exists in puppet for echostore.discovery.wmnet? We can just roll ahead and generate one regardless though

certs updated in all DCs, alerts resolved. I sincerely hope we will have the mesh migration resolved so we can avoid having to update echostore's certificates in October, but in case something prevents that and for reference the process was:

  • puppet cert revoke sessionstore.discovery.wmnet
  • In the puppet repo on your local checkout ./utils/create_ecdsa_cert sessionstore.discovery.wmnet sessionstore.svc.eqiad.wmnet sessionstore.svc.codfw.wmnet
  • On the puppetmaster, put the contents of /var/lib/puppet/server/ssl/ca/signed/sessionstore.discovery.wmnet.pem into certs.kask.cert in helmfile.d
  • Add the contents of the new private key from ./modules/secret/secrets/ssl/sessionstore.discovery.wmnet.key to hieradata/role/common/deployment_server/kubernetes.yaml
  • Validate the files and make sure everything looks okay using openssl ec/openssl x509, then git commit your changes in private
  • Follow the Helm rollout process as normal, keeping an eye on the sessionstore graphs and the session loss graphs

Thanks for taking care of this @hnowlan , (and for documenting the steps)!

We currently have no documentation for echostore so the impact of a broken rollout is unclear, but it's just another kask instance should it come to the point that we need to follow the above process. That said, worryingly it doesn't seem like a cert even exists in puppet for echostore.discovery.wmnet? We can just roll ahead and generate one regardless though

I just created https://wikitech.wikimedia.org/wiki/EchoStore. It's not great; Mostly a copy-paste of https://wikitech.wikimedia.org/wiki/SessionStorage, but maybe better than nothing? The TL;DR is that it's a cache of seen-times. I think the impact of a broken rollout is quite low (relative to sessionstore, anyway). I believe it mainly controls the little badge that shows the number of unread notifications, and a failure to lookup might show that you had more than you should?

Change #1034952 merged by Hnowlan:

[operations/puppet@production] utils: remove pem copying step from ecdsa_cert

https://gerrit.wikimedia.org/r/1034952

Change #1039247 had a related patch set uploaded (by CDanis; author: Hnowlan):

[operations/deployment-charts@master] kask: add mesh configuration

https://gerrit.wikimedia.org/r/1039247

Change #1039247 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] kask: add mesh configuration

https://gerrit.wikimedia.org/r/1039247

Change #1039247 merged by jenkins-bot:

[operations/deployment-charts@master] kask: add mesh configuration

https://gerrit.wikimedia.org/r/1039247

Change #1047052 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] kask: don't allocate service port twice when using mesh

https://gerrit.wikimedia.org/r/1047052

Change #1047052 merged by jenkins-bot:

[operations/deployment-charts@master] kask: don't allocate service port twice when using mesh

https://gerrit.wikimedia.org/r/1047052

Change #1048013 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] sessionstore: temporarily disable mesh on staging

https://gerrit.wikimedia.org/r/1048013