Sessionstore's discovery TLS cert will expire before end of May 2024
Open, HighPublic
Actions

Assigned To

Authored By

	elukey
	May 2 2024, 11:49 AM

Description

Hi folks!

The TLS cert for sessionstore.discovery.wmnet will expire before end of May 2024:

$ openssl s_client -connect sessionstore.discovery.wmnet:8081 | openssl x509 -text
[..]
        Validity
            Not Before: May 29 17:38:58 2019 GMT
            Not After : May 28 17:38:58 2024 GMT
[..]

We should replace it to avoid any outages, or move the service to PKI/Mesh in K8s.

If we want to just regenerate the Puppet TLS cert: https://wikitech.wikimedia.org/wiki/Cergen#Update_a_certificate

Details

Subject	Repo	Branch	Lines +/-
sessionstore: temporarily disable mesh on staging	operations/deployment-charts	master	+0 -13
kask: don't allocate service port twice when using mesh	operations/deployment-charts	master	+3 -1
kask: add mesh configuration	operations/deployment-charts	master	+1 K -15
utils: remove pem copying step from ecdsa_cert	operations/puppet	production	+0 -4
sessionstore: update certs in advance of expiry	operations/deployment-charts	master	+23 -54
kask: checksum tls certs	operations/deployment-charts	master	+3 -1
sessionstore: use new cert to go with new key in staging	operations/deployment-charts	master	+31 -0

Customize query in gerrit

Related Objects

Mentioned Here: T220401: Introduce kask session storage service to kubernetes
T300033: Use cert-manager for service-proxy certificate creation

Event Timeline

elukey created this task.May 2 2024, 11:49 AM

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptMay 2 2024, 11:49 AM

This certificate doesn't show up anywhere in certificate.manifests.d for cergen, though?

elukey updated the task description. (Show Details)May 2 2024, 1:25 PM

In T363996#9763646, @MoritzMuehlenhoff wrote:

This certificate doesn't show up anywhere in certificate.manifests.d for cergen, though?

This is probably because I've deleted them during T300033: Use cert-manager for service-proxy certificate creation thinking that I had migrated everything. Ofc. sessionstore/kask is different and does not use the service mesh/the service proxy for tls termination so I did not catch it

The cert is here:

elukey@puppetmaster1001:/srv/private$ find -name *session* | grep -v cassandra
./modules/secret/secrets/ssl/sessionstore.discovery.wmnet.key

@Eevans IIUC kask terminates TLS by itself for session store, is it right? Would it be a problem to move to the mesh k8s module, namely to use the envoy sidecar that terminates TLS and proxies the request (in this case, kask would listen to a plaintext localhost:port combination). I am asking since we could move to PKI directly if kask's chart used the mesh module.

@akosiaris maybe you recall if there was a deliberate decision not to use service mesh for kask/session store?

If we're not using the mesh, kask should support reloading certs from disk on changes. With that we could just replace the cergen certs with auto-renewing ones like we do in the mesh.

In T363996#9774208, @JMeybohm wrote:

@akosiaris maybe you recall if there was a deliberate decision not to use service mesh for kask/session store?

Yes it was a deliberate decision. There is some back story but the TL;DR is

Kask predates the service mesh, needed TLS and so implemented itself
Sessionstore is pretty critical and could benefit from not having the extra latency the service mesh possibly adds
A change to sessionstore was somewhat high risk and altering the behavior of a well functioning component didn't seem like it had any kind of significant payoff back then.

If we're not using the mesh, kask should support reloading certs from disk on changes. With that we could just replace the cergen certs with auto-renewing ones like we do in the mesh.

Sure, but I don't know if it actually supports reloading certs from disk, adding @Eevans.

In T363996#9785916, @akosiaris wrote:

In T363996#9774208, @JMeybohm wrote:

@akosiaris maybe you recall if there was a deliberate decision not to use service mesh for kask/session store?

Yes it was a deliberate decision. There is some back story but the TL;DR is

Kask predates the service mesh, needed TLS and so implemented itself

Sessionstore is pretty critical and could benefit from not having the extra latency the service mesh possibly adds

A change to sessionstore was somewhat high risk and altering the behavior of a well functioning component didn't seem like it had any kind of significant payoff back then.

If we're not using the mesh, kask should support reloading certs from disk on changes. With that we could just replace the cergen certs with auto-renewing ones like we do in the mesh.

Sure, but I don't know if it actually supports reloading certs from disk, adding @Eevans.

It does not, it just uses the Golang http module's standard server initialization. Adding this would be pretty easy though.

I'd vote to add the mesh support/configuration at this point, it seems less risky and error prone than allowing kask to reload TLS certs. The only concern would be the extra latency involved, but in theory it shouldn't be heavy (it add an extra hop/tcp conn to localhost, we can measure the impact in staging and decide).

elukey updated the task description. (Show Details)May 10 2024, 4:22 PM

I tend to agree, also for sake of alignment of sessionstore with the rest of our services. Unfortunately this feels like the more involved change (A change to sessionstore was somewhat high risk ...) - but I think it's also true that it should not add much to latency.

Scott_French subscribed.May 13 2024, 4:02 PM

jijiki reassigned this task from JMeybohm to hnowlan.May 14 2024, 8:43 AM

Steps that I see:

Renewing the existing cergen cert to give us breathing room just in case. We're looking at less than 2 weeks of headroom for a major change to one of our most critical services
Add toggleable mesh support
Get a baseline for performance in staging by running siege for a few hours
Deploy mesh-enabled version to staging, compare performance before and after
Roll forward if everything looks okay

Only concern around testing in staging is having reasonably representative infrastructure, given that it's both using cassandra-dev, and in codfw from pods in eqiad. That said, we're just looking for a baseline and not identical numbers.

Also worth noting that mediawiki already calls sessionstore via it's envoy sidecar, so we do have telemetry data from prod and we should be able to see the impact there pretty quickly as well: https://grafana-rw.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s?orgId=1&var-datasource=thanos&var-site=eqiad&var-prometheus=k8s&var-app=mediawiki&var-kubernetes_namespace=All&var-destination=sessionstore

In T363996#9794048, @hnowlan wrote:

Steps that I see:

Renewing the existing cergen cert to give us breathing room just in case. We're looking at less than 2 weeks of headroom for a major change to one of our most critical services

[ ... ]

In T363996#9763941, @JMeybohm wrote:

In T363996#9763646, @MoritzMuehlenhoff wrote:

This certificate doesn't show up anywhere in certificate.manifests.d for cergen, though?

This is probably because I've deleted them during T300033: Use cert-manager for service-proxy certificate creation thinking that I had migrated everything. Ofc. sessionstore/kask is different and does not use the service mesh/the service proxy for tls termination so I did not catch it

In T363996#9764067, @elukey wrote:

The cert is here:

elukey@puppetmaster1001:/srv/private$ find -name *session* | grep -v cassandra
./modules/secret/secrets/ssl/sessionstore.discovery.wmnet.key

That's just a key; I don't see anything but key files there...

@Eevans IIUC that is the private key, it gets rendered in /etc/helmfile-defaults/private/main_services/sessionstore on deploy1002 and eventually it is deployed on the session store pods. The public key is stored in deployment-charts, in the sessionstore's values.yaml. I suspect that cergen wasn't used for this use case, or something more manual was chosen because of K8s.

In T363996#9794048, @hnowlan wrote:

Steps that I see:

Renewing the existing cergen cert to give us breathing room just in case. We're looking at less than 2 weeks of headroom for a major change to one of our most critical services

Add toggleable mesh support

Get a baseline for performance in staging by running siege for a few hours

Deploy mesh-enabled version to staging, compare performance before and after

Roll forward if everything looks okay

Only concern around testing in staging is having reasonably representative infrastructure, given that it's both using cassandra-dev, and in codfw from pods in eqiad. That said, we're just looking for a baseline and not identical numbers.

+1! Not sure how to renew the existing cert since IIUC cergen wasn't used, but we can try to check in old tasks to see what it was done.

In T363996#9816587, @elukey wrote:

In T363996#9794048, @hnowlan wrote:

Steps that I see:

Renewing the existing cergen cert to give us breathing room just in case. We're looking at less than 2 weeks of headroom for a major change to one of our most critical services

Add toggleable mesh support

Get a baseline for performance in staging by running siege for a few hours

Deploy mesh-enabled version to staging, compare performance before and after

Roll forward if everything looks okay

Only concern around testing in staging is having reasonably representative infrastructure, given that it's both using cassandra-dev, and in codfw from pods in eqiad. That said, we're just looking for a baseline and not identical numbers.

+1! Not sure how to renew the existing cert since IIUC cergen wasn't used, but we can try to check in old tasks to see what it was done.

I'm out next week, and the cert is about to expire, so I'll try to reverse engineer this. If anyone has first-hand knowledge of how this was done, please let me know!

In T363996#9821312, @Eevans wrote:

+1! Not sure how to renew the existing cert since IIUC cergen wasn't used, but we can try to check in old tasks to see what it was done.

I'm out next week, and the cert is about to expire, so I'll try to reverse engineer this. If anyone has first-hand knowledge of how this was done, please let me know!

Looking at git log in the private there's a reference to https://phabricator.wikimedia.org/T220401, so it seems Alex added it back then.

In T363996#9821312, @Eevans wrote:

In T363996#9816587, @elukey wrote:

In T363996#9794048, @hnowlan wrote:

Steps that I see:

Renewing the existing cergen cert to give us breathing room just in case. We're looking at less than 2 weeks of headroom for a major change to one of our most critical services

Add toggleable mesh support

Get a baseline for performance in staging by running siege for a few hours

Deploy mesh-enabled version to staging, compare performance before and after

Roll forward if everything looks okay

Only concern around testing in staging is having reasonably representative infrastructure, given that it's both using cassandra-dev, and in codfw from pods in eqiad. That said, we're just looking for a baseline and not identical numbers.

+1! Not sure how to renew the existing cert since IIUC cergen wasn't used, but we can try to check in old tasks to see what it was done.

I'm out next week, and the cert is about to expire, so I'll try to reverse engineer this. If anyone has first-hand knowledge of how this was done, please let me know!

@akosiaris pointed out that he generated the certs with utils/create_ecdsa_cert - I've been testing this out and it will hopefully work similarly.

Change #1034952 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/puppet@production] utils: remove pem copying step from ecdsa_cert

https://gerrit.wikimedia.org/r/1034952

gerritbot added a project: Patch-For-Review.Wed, May 22, 2:55 PM

Mentioned in SAL (#wikimedia-operations) [2024-05-22T14:57:01Z] <hnowlan> running puppet cert revoke sessionstore.discovery.wmnet T363996

Change #1034972 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] sessionstore: use new cert to go with new key in staging

https://gerrit.wikimedia.org/r/1034972

Change #1034972 merged by jenkins-bot:

[operations/deployment-charts@master] sessionstore: use new cert to go with new key in staging

https://gerrit.wikimedia.org/r/1034972

Change #1035010 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] kask: checksum tls certs

https://gerrit.wikimedia.org/r/1035010

Current situation - I have refreshed the .key file on the puppet master using a modified version of the create_ecdsa_cert script, and I have pushed the new key to the staging k8s secrets for sessionstore only. I've also updated the cert file in the helmfile configuration for sessionstore, but it hasn't picked it up because that part of the config wasn't checksummed to recreate the pods (fixed in this change). Tomorrow I will try to roll to codfw and, if successful, eqiad.

Change #1035010 merged by jenkins-bot:

[operations/deployment-charts@master] kask: checksum tls certs

https://gerrit.wikimedia.org/r/1035010

Change #1035357 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] sessionstore: update certs in advance of expiry

https://gerrit.wikimedia.org/r/1035357

Change #1035357 merged by jenkins-bot:

[operations/deployment-charts@master] sessionstore: update certs in advance of expiry

https://gerrit.wikimedia.org/r/1035357

certs updated in all DCs, alerts resolved. I sincerely hope we will have the mesh migration resolved so we can avoid having to update echostore's certificates in October, but in case something prevents that and for reference the process was:

puppet cert revoke sessionstore.discovery.wmnet
In the puppet repo on your local checkout ./utils/create_ecdsa_cert sessionstore.discovery.wmnet sessionstore.svc.eqiad.wmnet sessionstore.svc.codfw.wmnet
On the puppetmaster, put the contents of /var/lib/puppet/server/ssl/ca/signed/sessionstore.discovery.wmnet.pem into certs.kask.cert in helmfile.d
Add the contents of the new private key from ./modules/secret/secrets/ssl/sessionstore.discovery.wmnet.key to hieradata/role/common/deployment_server/kubernetes.yaml
Validate the files and make sure everything looks okay using openssl ec/openssl x509, then git commit your changes in private
Follow the Helm rollout process as normal, keeping an eye on the sessionstore graphs and the session loss graphs

We currently have no documentation for echostore so the impact of a broken rollout is unclear, but it's just another kask instance should it come to the point that we need to follow the above process. That said, worryingly it doesn't seem like a cert even exists in puppet for echostore.discovery.wmnet? We can just roll ahead and generate one regardless though

In T363996#9826387, @hnowlan wrote:

certs updated in all DCs, alerts resolved. I sincerely hope we will have the mesh migration resolved so we can avoid having to update echostore's certificates in October, but in case something prevents that and for reference the process was:

puppet cert revoke sessionstore.discovery.wmnet

In the puppet repo on your local checkout ./utils/create_ecdsa_cert sessionstore.discovery.wmnet sessionstore.svc.eqiad.wmnet sessionstore.svc.codfw.wmnet

On the puppetmaster, put the contents of /var/lib/puppet/server/ssl/ca/signed/sessionstore.discovery.wmnet.pem into certs.kask.cert in helmfile.d

Add the contents of the new private key from ./modules/secret/secrets/ssl/sessionstore.discovery.wmnet.key to hieradata/role/common/deployment_server/kubernetes.yaml

Validate the files and make sure everything looks okay using openssl ec/openssl x509, then git commit your changes in private

Follow the Helm rollout process as normal, keeping an eye on the sessionstore graphs and the session loss graphs

Thanks for taking care of this @hnowlan , (and for documenting the steps)!

We currently have no documentation for echostore so the impact of a broken rollout is unclear, but it's just another kask instance should it come to the point that we need to follow the above process. That said, worryingly it doesn't seem like a cert even exists in puppet for echostore.discovery.wmnet? We can just roll ahead and generate one regardless though

I just created https://wikitech.wikimedia.org/wiki/EchoStore. It's not great; Mostly a copy-paste of https://wikitech.wikimedia.org/wiki/SessionStorage, but maybe better than nothing? The TL;DR is that it's a cache of seen-times. I think the impact of a broken rollout is quite low (relative to sessionstore, anyway). I believe it mainly controls the little badge that shows the number of unread notifications, and a failure to lookup might show that you had more than you should?