
Consider enabling Blackbox K8s autodiscovery on dse-k8s Prometheus instance
Closed, Resolved (Public)

Description

Per the parent ticket, we need a way to monitor the TLS certificate lifetime for our OpenSearch on K8s platform. Also, because OpenSearch terminates TLS directly instead of going through Envoy, we might not be able to get the same kind of HTTP metrics, error codes, etc. that we can typically get for k8s-hosted services.†

We should be able to do this via Prometheus' support for kubernetes_sd_config, which we already use to autodiscover scrape targets hosted in K8s. There wouldn't be too much difference, but we'd need to change a few things to target the Blackbox exporter instead of scraping metrics directly; see this article for examples.
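For illustration, here is a minimal sketch of what such a scrape job might look like. It follows the commonly used Blackbox relabeling pattern and is not our actual configuration. The job name, the `prometheus.io/probe` annotation convention, the `http_2xx` module and the exporter address are all assumptions.

```yaml
scrape_configs:
  - job_name: blackbox-k8s-services          # hypothetical job name
    metrics_path: /probe                     # Blackbox exporter probe endpoint
    params:
      module: [http_2xx]                     # assumes a module with this name exists in blackbox.yml
    kubernetes_sd_configs:
      - role: service                        # discover Kubernetes Service objects
    relabel_configs:
      # Keep only services that opt in via an annotation (assumed convention).
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_probe]
        action: keep
        regex: "true"
      # Pass the discovered service address to the exporter as ?target=...
      - source_labels: [__address__]
        target_label: __param_target
      # Send the scrape itself to the Blackbox exporter (hypothetical address).
      - target_label: __address__
        replacement: blackbox-exporter.example.svc:9115
      # Keep the probed endpoint, not the exporter, as the instance label.
      - source_labels: [__param_target]
        target_label: instance
```

With a job like this, Prometheus asks the exporter to probe each discovered service, and the exporter returns metrics such as `probe_success` and, for TLS endpoints, `probe_ssl_earliest_cert_expiry`.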

Creating this ticket to:

  • Discuss our options with Observability.
  • Implement a TLS and HTTP health monitoring solution based on their recommendations.
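As a rough sketch of what the TLS and HTTP health checks could alert on, assuming Blackbox exporter probes are in place: the rules below use the exporter's `probe_ssl_earliest_cert_expiry` and `probe_success` metrics. The group name, alert names, thresholds and severities are placeholders, not agreed values.

```yaml
groups:
  - name: opensearch-k8s-probes              # hypothetical group name
    rules:
      # Fire when the earliest-expiring certificate in the chain has < 14 days left.
      - alert: TLSCertificateExpiringSoon    # hypothetical alert name
        expr: probe_ssl_earliest_cert_expiry - time() < 14 * 24 * 3600
        for: 1h
        labels:
          severity: warning                  # placeholder severity
        annotations:
          summary: "TLS certificate for {{ $labels.instance }} expires in under 14 days"
      # Fire when the probe itself has been failing for several minutes.
      - alert: ProbeFailing                  # hypothetical alert name
        expr: probe_success == 0
        for: 5m
        labels:
          severity: warning                  # placeholder severity
        annotations:
          summary: "Blackbox probe for {{ $labels.instance }} is failing"
```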

† to be determined, the Envoy Telemetry (k8s) dashboard wasn't working for me when I wrote this.

Event Timeline

If you need to monitor endpoints and/or TLS certificates, it might be sufficient to add the probes key under the service in the service catalog:

service::catalog:
  apertium:
    description: Machine Translation service. apertium.discovery.wmnet
    ...
    page: false
    probes:
      - type: http
        path: /listPairs
    ...

Change #1224827 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] Revert "Move opensearch-ipoid to production state"

https://gerrit.wikimedia.org/r/1224827

Change #1224827 merged by Ryan Kemper:

[operations/puppet@production] Revert "Move opensearch-ipoid to production state"

https://gerrit.wikimedia.org/r/1224827

Change #1224999 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] opensearch-ipoid: move service to "production" status.

https://gerrit.wikimedia.org/r/1224999

Change #1224999 merged by Bking:

[operations/puppet@production] opensearch-ipoid: move service to "production" status.

https://gerrit.wikimedia.org/r/1224999

Change #1225029 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] [DO NOT MERGE] opensearch-ipoid: Add a path and timeout to blackbox check

https://gerrit.wikimedia.org/r/1225029

Change #1225029 abandoned by Bking:

[operations/puppet@production] [DO NOT MERGE] opensearch-ipoid: Add a path and timeout to blackbox check

Reason:

Not needed, scrape config is already present

https://gerrit.wikimedia.org/r/1225029

Thanks for the suggestion, @tappof ! I can confirm that the current solution is working.

There's a small, non-urgent problem I wanted to mention: the current integration creates the scrape config on the ops Prometheus instance (as opposed to the dse-k8s Prometheus instance). This means that mainline SRE will be alerted, whereas we'd prefer that my team (Data Platform SRE) be alerted. That being said, there are pre-existing tickets (T303744 and T398073) on improving the targeting of alerts in k8s, so I don't think we need to rehash that here.

As such, I'll go ahead and close this out. Thanks again for your help!