I've checked the logs from September (https://logstash.wikimedia.org/goto/37581ef39fe3ed2251e9cf0e13d12445) where this has happened 132 times so far, usually in batches of around 13 messages coming from a single pod. Cross-referencing with k8s events showed that this "sometimes" happens during the startup of a new mediawiki pod. I would assume there is some kind of race condition at play.
Thu, Sep 21
I've added two dashboards:
Wed, Sep 20
I'm going to resolve this one as we no longer use it
Linking to T265979: Alert on unapplied changes in deployment-charts repo as this is somewhat similar but not identical
Containers should have the wmf-certificates package installed, which contains the Puppet CA as well.
This has been resolved with the move to PKI in T307943: Update Kubernetes clusters to v1.23
Tue, Sep 19
Removed all the certs with [puppet-private] (23d9433a) and ran puppet on all masters without issue. Wikitech has been updated as well to remove all mentions of cergen.
Mon, Sep 18
Fri, Sep 15
The updated etcd-mirror package has been rolled out, resolving this again
SRE was paged due to EtcdReplicationDown. Turns out the etcdmirror web interface does not work with python3 on bullseye
Thu, Sep 14
conf2 nodes are on bullseye now and the metrics look better, as expected
This is done and clients (confd/pybal) are back on the cluster.
Wed, Sep 13
Mon, Sep 11
Fri, Sep 8
Thu, Sep 7
Wed, Sep 6
Tue, Sep 5
I put together a small Go tool to validate some or all tokens against provided certificates (one or many). I did not see any other way of checking which token is signed by which key, and we need to make sure all tokens are signed by a PKI key before we remove the cergen cert from the validation list. For anybody interested, the public certs of all clusters (cergen and PKI) can be found at deploy1002:/home/jayme/kube-apiserver-sa/certs/ (a compiled version of the below is at deploy1002:/home/jayme/kube-apiserver-sa/k8s-jwt-validator)
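The core check such a tool needs can be sketched with only the Go standard library: split the RS256 JWT, hash the signed portion, and try each certificate's RSA public key until one verifies. Function and variable names below are mine for illustration, not taken from the actual k8s-jwt-validator source.

```go
package main

import (
	"crypto"
	"crypto/rsa"
	"crypto/sha256"
	"crypto/x509"
	"encoding/base64"
	"encoding/pem"
	"errors"
	"fmt"
	"strings"
)

// verifyJWT checks an RS256-signed token against a list of PEM-encoded
// certificates and returns the index of the certificate whose public key
// verifies the signature, or an error if none matches.
func verifyJWT(token string, certPEMs [][]byte) (int, error) {
	parts := strings.Split(token, ".")
	if len(parts) != 3 {
		return -1, errors.New("token does not have three segments")
	}
	// RS256 signs the base64url(header).base64url(payload) string.
	signed := []byte(parts[0] + "." + parts[1])
	sig, err := base64.RawURLEncoding.DecodeString(parts[2])
	if err != nil {
		return -1, fmt.Errorf("decoding signature: %w", err)
	}
	digest := sha256.Sum256(signed)
	for i, certPEM := range certPEMs {
		block, _ := pem.Decode(certPEM)
		if block == nil {
			continue // not PEM, skip this candidate
		}
		cert, err := x509.ParseCertificate(block.Bytes)
		if err != nil {
			continue
		}
		pub, ok := cert.PublicKey.(*rsa.PublicKey)
		if !ok {
			continue // only RSA keys are relevant for RS256
		}
		if rsa.VerifyPKCS1v15(pub, crypto.SHA256, digest[:], sig) == nil {
			return i, nil
		}
	}
	return -1, errors.New("token not signed by any provided certificate")
}
```

Run against every service-account token: any token that only verifies with the cergen cert still needs to be reissued before that cert is dropped from the validation list.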
Mon, Sep 4
Thanks to @jbond's refactor this is now resolved (again).
Fri, Sep 1
Thu, Aug 31
Did the LVS dance; curl -4 --resolve jaeger-query.discovery.wmnet:30443:$(dig +short k8s-ingress-aux.svc.eqiad.wmnet) https://jaeger-query.discovery.wmnet:30443 now works as expected and probes come through.
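The same trick curl does with --resolve can be reproduced in Go by overriding the transport's dialer: HTTP and TLS still see the discovery hostname (so SNI and cert validation behave normally) while the TCP connection goes to the ingress address. Names and timeouts here are illustrative, not taken from any probe we run.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"net/http"
	"time"
)

// probe issues an HTTPS GET for https://host:port/ but forces the TCP
// connection to addr:port, mimicking curl's --resolve host:port:addr.
func probe(host string, port int, addr string) (int, error) {
	transport := &http.Transport{
		// Ignore the address derived from the URL and dial addr instead;
		// TLS verification still happens against host's certificate.
		DialContext: func(ctx context.Context, network, _ string) (net.Conn, error) {
			d := net.Dialer{Timeout: 5 * time.Second}
			return d.DialContext(ctx, network, fmt.Sprintf("%s:%d", addr, port))
		},
	}
	client := &http.Client{Transport: transport, Timeout: 10 * time.Second}
	resp, err := client.Get(fmt.Sprintf("https://%s:%d/", host, port))
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	return resp.StatusCode, nil
}
```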
Wed, Aug 30
Before we can complete T343302: otel collector is configured to send traces to jaeger, we need to get the jaeger collector TCP ports (4317 for gRPC and 4318 for HTTP) exposed on the production network.
Mon, Aug 28
No. I think the code does something different when called from MediaWiki rather than from curl. The issue is clearly that, with the curl command, the orchestrator does not call back to mw-api.
Thanks @elukey for stepping in!