OpenSearch on K8s: Monitor and rotate TLS certificates
Closed, Resolved (Public)

Description

Creating this ticket based on this Slack conversation.

@brouberol noticed the opensearch-test certificate had expired, causing a downstream service (opensearch-ipoid) to fail.

Creating this ticket to:

  • Add alerts for TLS certificate expiration
  • Figure out a way to rotate certificates for OpenSearch on K8s clusters

Event Timeline

Change #1217597 had a related patch set uploaded (by Bking; author: Bking):

[operations/deployment-charts@master] opensearch-on-k8s: enable certificate hot reloads

https://gerrit.wikimedia.org/r/1217597

Change #1217597 merged by Bking:

[operations/deployment-charts@master] opensearch-on-k8s: enable certificate hot reloads

https://gerrit.wikimedia.org/r/1217597

Change #1217599 had a related patch set uploaded (by Bking; author: Bking):

[operations/deployment-charts@master] opensearch-on-k8s: Enable Reload Certificates API

https://gerrit.wikimedia.org/r/1217599

Change #1217599 merged by jenkins-bot:

[operations/deployment-charts@master] opensearch-on-k8s: Enable Reload Certificates API

https://gerrit.wikimedia.org/r/1217599

Change #1217606 had a related patch set uploaded (by Bking; author: Bking):

[operations/deployment-charts@master] opensearch-on-k8s: Increment chart version

https://gerrit.wikimedia.org/r/1217606

Change #1217606 merged by Bking:

[operations/deployment-charts@master] opensearch-on-k8s: Increment chart version

https://gerrit.wikimedia.org/r/1217606

For monitoring, we could enable Blackbox autodiscovery, similar to how we already do Prometheus autodiscovery for metrics scrapes.
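To make the goal concrete: a blackbox TLS probe reports the certificate's expiry as a timestamp (the Blackbox exporter exposes it as `probe_ssl_earliest_cert_expiry`), and an alert compares the remaining time against a threshold. The following is only a minimal sketch of that comparison in Python, not the actual probe or alert rule. The function names and the 14-day default threshold are assumptions chosen for illustration.

```python
import ssl
import time
from typing import Optional

SECONDS_PER_DAY = 86400


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Return the number of days left before a certificate expires.

    `not_after` uses the format returned by ssl.SSLSocket.getpeercert(),
    for example "Jan  5 09:34:43 2018 GMT".
    """
    if now is None:
        now = time.time()
    # cert_time_to_seconds converts the notAfter string to a Unix timestamp.
    return (ssl.cert_time_to_seconds(not_after) - now) / SECONDS_PER_DAY


def is_almost_expired(
    not_after: str,
    threshold_days: float = 14.0,  # hypothetical threshold, not our real alert value
    now: Optional[float] = None,
) -> bool:
    """Return True when fewer than `threshold_days` remain."""
    return days_until_expiry(not_after, now) < threshold_days
```

In practice the `not_after` value would come from the `notAfter` field of the peer certificate on a live TLS connection; here it is passed in as a string so the logic can be checked without a network call.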

For reloading the certificates themselves, newer versions of OpenSearch have the hot reload certificate API. Unfortunately, our version of OpenSearch (2.7.0) does not support it, so we've fallen back to using the reload certificates API. I've deployed opensearch-test in eqiad, but have not had time to try the API call yet. I'll take a look tomorrow.
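For reference, a rough sketch of what the reload certificates API call could look like from Python. This assumes the endpoint path documented for the OpenSearch security plugin (`PUT _plugins/_security/api/ssl/{http|transport}/reloadcerts`), that the call is authenticated with an admin client certificate, and that the reload feature is enabled in the cluster config. The base URL and file paths are placeholders, and the helper names are made up for this example.

```python
import ssl
import urllib.request

# Assumption: the security plugin accepts these two certificate types.
VALID_CERT_TYPES = ("http", "transport")


def reload_certs_request(base_url: str, cert_type: str) -> urllib.request.Request:
    """Build (but do not send) the PUT request that asks a node to reload certs."""
    if cert_type not in VALID_CERT_TYPES:
        raise ValueError(f"cert_type must be one of {VALID_CERT_TYPES}")
    url = f"{base_url.rstrip('/')}/_plugins/_security/api/ssl/{cert_type}/reloadcerts"
    return urllib.request.Request(url, method="PUT")


def reload_certs(base_url: str, cert_type: str,
                 admin_cert: str, admin_key: str, ca_file: str) -> str:
    """Send the reload request using an admin client certificate (paths are placeholders)."""
    ctx = ssl.create_default_context(cafile=ca_file)
    ctx.load_cert_chain(certfile=admin_cert, keyfile=admin_key)
    req = reload_certs_request(base_url, cert_type)
    with urllib.request.urlopen(req, context=ctx) as resp:
        return resp.read().decode()
```

Building the request separately from sending it keeps the URL logic easy to check without a running cluster, for example `reload_certs_request("https://localhost:9200", "http")`.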

Re: Blackbox autodiscovery, I had a brief chat with @tappof (Observability team SRE) about this in observability's IRC room today, and he said his team will take a look and get back to us.

Re: reload certificates API

The API call is accepted, but fails on 2.7.0. Since the latest OpenSearch version is purportedly compatible with our version of the upstream OpenSearch helm chart, I've cut a new image based on the latest 2.x version, 2.19.4. I've published the new image and will be testing it out shortly.

Change #1218363 had a related patch set uploaded (by Bking; author: Bking):

[operations/deployment-charts@master] opensearch-on-k8s: Update opensearch image from 2.7.0 -> 2.19.4

https://gerrit.wikimedia.org/r/1218363

Change #1218363 merged by Bking:

[operations/deployment-charts@master] opensearch-on-k8s: Update opensearch image from 2.7.0 -> 2.19.4

https://gerrit.wikimedia.org/r/1218363

Change #1218834 had a related patch set uploaded (by Bking; author: Bking):

[operations/deployment-charts@master] opensearch-cluster: Replace reload certificates API call with hot reload setting

https://gerrit.wikimedia.org/r/1218834

Change #1220356 had a related patch set uploaded (by Bking; author: Bking):

[operations/deployment-charts@master] opensearch-cluster: remove broken setting

https://gerrit.wikimedia.org/r/1220356

Change #1220356 merged by Bking:

[operations/deployment-charts@master] opensearch-cluster: remove broken setting

https://gerrit.wikimedia.org/r/1220356

Per IRC discussion with @tappof, there is already a supported way to create blackbox checks for kubernetes-hosted services via Puppet's services.yaml. I'll try this approach first.

Change #1224827 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] Revert "Move opensearch-ipoid to production state"

https://gerrit.wikimedia.org/r/1224827

Change #1224827 merged by Ryan Kemper:

[operations/puppet@production] Revert "Move opensearch-ipoid to production state"

https://gerrit.wikimedia.org/r/1224827

Change #1224999 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] opensearch-ipoid: move service to "production" status.

https://gerrit.wikimedia.org/r/1224999

Change #1224999 merged by Bking:

[operations/puppet@production] opensearch-ipoid: move service to "production" status.

https://gerrit.wikimedia.org/r/1224999

Change #1225029 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] [DO NOT MERGE] opensearch-ipoid: Add a path and timeout to blackbox check

https://gerrit.wikimedia.org/r/1225029

Change #1225029 abandoned by Bking:

[operations/puppet@production] [DO NOT MERGE] opensearch-ipoid: Add a path and timeout to blackbox check

Reason:

Not needed, scrape config is already present

https://gerrit.wikimedia.org/r/1225029

We've merged the above patch, and we can confirm that a Prometheus scrape job was created. We've also been able to query the metrics from Grafana Explore, which means the scrapes are happening.

It's not a perfect solution, because the current integration creates the scrape config on the ops prometheus instance (as opposed to the dse-k8s prometheus instance). This means that mainline SRE will be alerted, whereas we'd prefer for my team (Data Platform SRE) to be alerted. That being said, there are pre-existing tickets (T303744 and T398073) on the topic of improved targeting of alerts in k8s, so I don't think we need to rehash that here. Closing...

bking reopened this task as In Progress. Tue, Jan 13, 2:32 PM
bking triaged this task as High priority.

Reopening so I can attempt to create targeted alerts from the blackbox probes that will notify DPE SRE.

Change #1226282 had a related patch set uploaded (by Bking; author: Bking):

[operations/alerts@master] WIP: Alert DPE SRE when probes fail in dse-k8s clusters

https://gerrit.wikimedia.org/r/1226282

Change #1226282 merged by Bking:

[operations/alerts@master] Alert DPE SRE when probes fail in dse-k8s clusters

https://gerrit.wikimedia.org/r/1226282

hnowlan subscribed.

Just a heads-up, I've filed T414662 to try to streamline these alerts in future and save you having to maintain these alerts by yourself.

Change #1227406 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] opensearch-ipoid: Add codfw to list of sites

https://gerrit.wikimedia.org/r/1227406

Change #1227406 merged by Bking:

[operations/puppet@production] opensearch-ipoid: Add codfw to list of sites

https://gerrit.wikimedia.org/r/1227406

Thanks @hnowlan, that is very much appreciated!

We've confirmed that the alerts are reaching the data-platform notifier:

FIRING: [2x] CertAlmostExpired: Certificate for service opensearch-ipoid:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#opensearch-ipoid:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired

We need to fix the alert verbiage to link to the correct runbook, but other than that we're good to go.

Change #1229203 had a related patch set uploaded (by Bking; author: Bking):

[operations/alerts@master] data-platform: Show affected DC on blackbox alerts

https://gerrit.wikimedia.org/r/1229203

Change #1229203 merged by Bking:

[operations/alerts@master] data-platform: Show affected DC on blackbox alerts

https://gerrit.wikimedia.org/r/1229203

We've fixed the above alerts, so the acceptance criteria are satisfied. Closing...