[tools] Prometheus k8s cert expired
Closed, ResolvedPublic

Description

From alert:

TektonDown (17 minutes ago)
project: tools
summary: Tekton is down
instance: k8s.tools.eqiad1.wikimedia.cloud:6443
service: toolforge,build_service,tekton
source: prometheus
team: wmcs
@cluster: wmcloud.org
@receiver: metricsinfra_cloud-feed

at https://alerts.wikimedia.org/?q=team%3Dwmcs

Tekton seems up and running:

root@tools-k8s-control-5:~# kubectl get all -n tekton-pipelines
NAME                                               READY   STATUS    RESTARTS   AGE
pod/tekton-pipelines-controller-5c78ddd49b-z6pm2   1/1     Running   0          15d
pod/tekton-pipelines-webhook-5d899cc8c-kk9hf       1/1     Running   0          17d

NAME                                  TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                              AGE
service/tekton-pipelines-controller   ClusterIP   10.110.221.64   <none>        9090/TCP,8008/TCP,8080/TCP           64d
service/tekton-pipelines-webhook      ClusterIP   10.105.112.2    <none>        9090/TCP,8008/TCP,443/TCP,8080/TCP   64d

NAME                                          READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/tekton-pipelines-controller   1/1     1            1           64d
deployment.apps/tekton-pipelines-webhook      1/1     1            1           64d

NAME                                                     DESIRED   CURRENT   READY   AGE
replicaset.apps/tekton-pipelines-controller-5c78ddd49b   1         1         1       64d
replicaset.apps/tekton-pipelines-webhook-5d899cc8c       1         1         1       64d

NAME                                                           REFERENCE                             TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
horizontalpodautoscaler.autoscaling/tekton-pipelines-webhook   Deployment/tekton-pipelines-webhook   4%/100%   1         5         1          64d

Looking at the cert on the Prometheus machine, it has expired:

root@tools-prometheus-6:/srv/prometheus/tools# openssl x509 -in /etc/ssl/localcerts/toolforge-k8s-prometheus.crt -text
Certificate:
...
        Validity
            Not Before: Jun  2 11:55:07 2022 GMT
            Not After : Jun  2 11:55:07 2023 GMT
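
For future checks, the comparison done by eye above can be scripted. This is only a rough sketch, not something taken from the runbook: the helper names parse_not_after and is_expired are made up for illustration, and the "now" value is the task creation time with its timezone assumed to be UTC, which the task page does not state.

```python
from datetime import datetime, timezone


def parse_not_after(line: str) -> datetime:
    """Parse an openssl 'Not After : <date> GMT' line into an aware UTC datetime."""
    # Keep only the text after the first colon, e.g. "Jun  2 11:55:07 2023 GMT".
    stamp = line.split(":", 1)[1].strip()
    if stamp.endswith(" GMT"):
        stamp = stamp[: -len(" GMT")]
    # openssl pads single-digit days with an extra space; collapse it.
    stamp = " ".join(stamp.split())
    parsed = datetime.strptime(stamp, "%b %d %H:%M:%S %Y")
    return parsed.replace(tzinfo=timezone.utc)


def is_expired(not_after: datetime, now: datetime) -> bool:
    """Return True once 'now' has reached or passed the Not After time."""
    return now >= not_after


# The Not After line from the openssl output above.
not_after = parse_not_after("            Not After : Jun  2 11:55:07 2023 GMT")

# Assumption: task creation time (Jun 2 2023, 12:23 PM) treated as UTC.
task_created = datetime(2023, 6, 2, 12, 23, tzinfo=timezone.utc)

print(is_expired(not_after, task_created))
```

In real use, the second argument would be datetime.now(timezone.utc) rather than a fixed timestamp.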

Event Timeline

dcaro triaged this task as High priority. (Jun 2 2023, 12:23 PM)
dcaro created this task.

Change 926484 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] toolforge: refresh toolforge-k8s-prometheus certificate

https://gerrit.wikimedia.org/r/926484

Change 926484 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] toolforge: refresh toolforge-k8s-prometheus certificate

https://gerrit.wikimedia.org/r/926484

That worked :) I also updated the runbook and such, so I'll close this.

dcaro moved this task from To refine to Done on the User-dcaro board.