
move k8s opentelemetry-collector from services to admin_ng
Closed, Resolved, Public

Event Timeline

Change #1034978 had a related patch set uploaded (by CDanis; author: CDanis):

[operations/deployment-charts@master] Move opentelemetry-collector to admin_ng

https://gerrit.wikimedia.org/r/1034978

Mentioned in SAL (#wikimedia-operations) [2024-05-23T18:48:19Z] <cdanis> T365626 helmfile destroy'd all opentelemetry-collector releases

Change #1034978 merged by jenkins-bot:

[operations/deployment-charts@master] Move opentelemetry-collector to admin_ng

https://gerrit.wikimedia.org/r/1034978

helmfile apply went smoothly, but unfortunately it broke trace collection. Only in retrospect did I realize that the move also changes the collector's DNS name, which many other charts have vendored under its full old name: main-opentelemetry-collector.opentelemetry-collector.svc.cluster.local
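For context, an in-cluster Service name has the form <service>.<namespace>.svc.<cluster-domain>, so changing either the Service name or the namespace changes the name every client has to dial. A minimal sketch, assuming the common default cluster domain cluster.local; the post-move Service name below is hypothetical and only illustrates that any change yields a different FQDN:

```python
# Sketch of how Kubernetes forms the DNS name of a Service:
#   <service>.<namespace>.svc.<cluster-domain>
# "cluster.local" is the usual default cluster domain (an assumption here).


def service_fqdn(service: str, namespace: str, cluster_domain: str = "cluster.local") -> str:
    """Return the fully qualified in-cluster DNS name of a Service."""
    return f"{service}.{namespace}.svc.{cluster_domain}"


# The name vendored into other charts.
OLD = service_fqdn("main-opentelemetry-collector", "opentelemetry-collector")

# HYPOTHETICAL post-move Service name (not taken from the task), used only to
# show that a renamed Service no longer answers at the vendored name.
NEW = service_fqdn("opentelemetry-collector", "opentelemetry-collector")

if __name__ == "__main__":
    print(OLD)
    print(NEW)
    print(OLD == NEW)
```

Any chart still pointing at OLD silently loses its tracing endpoint once the Service behind that name is gone, which matches what happened here.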

I considered a workaround such as an ExternalName Service, but I found no existing examples of one in deployment-charts, and then realized that TLS usually doesn't work with DNS aliasing approaches like this.
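The TLS problem, roughly: an ExternalName Service is served as a DNS CNAME, which changes where the client connects but not the hostname it verifies. The client still checks the server certificate against the name it dialed (the old, vendored name), so the connection fails unless the certificate also lists that name. A simplified sketch, assuming the collector's certificate carries only its new name as a DNS SAN; the SAN list and the new name are hypothetical, and the matcher only approximates RFC 6125 rules (exact match or a single leftmost wildcard label):

```python
# Sketch: why DNS aliasing (CNAME / ExternalName) tends to break TLS.
# Hostname verification uses the name the client DIALED, not the CNAME target.


def dns_name_matches(hostname: str, pattern: str) -> bool:
    """Simplified SAN match: exact, or '*.' covering exactly one leftmost label."""
    hostname = hostname.lower().rstrip(".")
    pattern = pattern.lower().rstrip(".")
    if pattern.startswith("*."):
        head, _, rest = hostname.partition(".")
        return bool(head) and rest == pattern[2:]
    return hostname == pattern


def cert_valid_for(hostname: str, san_dns_names: list[str]) -> bool:
    """True if any DNS SAN on the certificate matches the dialed hostname."""
    return any(dns_name_matches(hostname, san) for san in san_dns_names)


# The old name clients keep dialing; an ExternalName Service would CNAME it
# to the relocated collector.
DIALED = "main-opentelemetry-collector.opentelemetry-collector.svc.cluster.local"

# HYPOTHETICAL: SANs on the relocated collector's certificate (new name only).
CERT_SANS = ["opentelemetry-collector.opentelemetry-collector.svc.cluster.local"]

if __name__ == "__main__":
    # The CNAME routes the connection fine, but verification still fails.
    print(cert_valid_for(DIALED, CERT_SANS))
```

Making this work would mean also issuing the certificate for the alias name, at which point the alias buys little over just keeping the old name.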

@RLazarus suggested overriding the key where the name is defined, mesh.tracing.host, in one of the cluster-level values.yaml files, but neither of us is sure how acceptable it is to introduce diffs of that magnitude.
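The override works because Helm layers values files: nested maps are merged key by key and later files win, so a single cluster-level mesh.tracing.host would take precedence over every chart's vendored default. That is also why the diff is large: every release rendered with that file would change. A minimal sketch of that merge behaviour; apart from mesh.tracing.host, the keys and the replacement host name below are hypothetical:

```python
from copy import deepcopy
from typing import Any


def deep_merge(base: dict[str, Any], override: dict[str, Any]) -> dict[str, Any]:
    """Merge override into a copy of base in the style of Helm values layering:
    nested maps merge key by key; scalars and lists are replaced outright."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# HYPOTHETICAL chart defaults vendoring the old collector name
# ("enabled" keys are illustrative only).
chart_defaults = {
    "mesh": {
        "enabled": True,
        "tracing": {
            "enabled": False,
            "host": "main-opentelemetry-collector.opentelemetry-collector.svc.cluster.local",
        },
    }
}

# HYPOTHETICAL cluster-level override; the new host name is illustrative only.
cluster_values = {
    "mesh": {"tracing": {"host": "opentelemetry-collector.opentelemetry-collector.svc.cluster.local"}}
}

effective = deep_merge(chart_defaults, cluster_values)

if __name__ == "__main__":
    print(effective["mesh"]["tracing"])
```

Sibling keys survive the merge and only the host changes, but it changes for every release that reads the cluster-level file.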

I'm considering options. I might simply rename the release in the new admin_ng helmfile to main-opentelemetry-collector and decide we can live with the weird name for now.
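Why renaming the release restores the name: assuming the chart derives its Service name from the fullname helper that `helm create` scaffolds, the result is <release>-<chart> unless the release name already contains the chart name, in which case the release name is used as-is. A release called main gives main-opentelemetry-collector, and so does a release called main-opentelemetry-collector. A minimal Python port of that helper (an assumption about this chart, not verified against deployment-charts; the release names other than main-opentelemetry-collector are illustrative):

```python
# Python port of the default `<chart>.fullname` template from `helm create`:
#   fullnameOverride if set; else release name if it contains the chart name;
#   else "<release>-<chart>". Truncated to 63 chars, one trailing "-" trimmed.


def helm_fullname(release_name: str, chart_name: str,
                  name_override: str = "", fullname_override: str = "") -> str:
    """Compute the resource base name the default Helm scaffold would produce."""
    if fullname_override:
        return fullname_override[:63].removesuffix("-")
    name = name_override or chart_name
    if name in release_name:  # sprig: contains $name .Release.Name
        return release_name[:63].removesuffix("-")
    return f"{release_name}-{name}"[:63].removesuffix("-")


CHART = "opentelemetry-collector"

if __name__ == "__main__":
    print(helm_fullname("main", CHART))                          # illustrative old-style release
    print(helm_fullname("opentelemetry-collector", CHART))       # illustrative release name
    print(helm_fullname("main-opentelemetry-collector", CHART))  # the renamed release
```

Under that assumption, the rename keeps the Service name (and, with the namespace unchanged, the FQDN) identical to what the other charts already vendor, at the cost of the slightly odd release name.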

Change #1035559 had a related patch set uploaded (by CDanis; author: CDanis):

[operations/deployment-charts@master] Rename admin_ng otelcol to include 'main' prefix

https://gerrit.wikimedia.org/r/1035559

Change #1035559 merged by jenkins-bot:

[operations/deployment-charts@master] Rename admin_ng otelcol to include 'main' prefix

https://gerrit.wikimedia.org/r/1035559

CDanis claimed this task.

Traces are flowing again in eqiad.