Page MenuHomePhabricator

Proper service names in trace data
Open, Needs TriagePublic

Description

Currently, the only service Jaeger shows on trace.wikimedia.org is OTLPResourceNoServiceName.

This is because the OTLP exporter extension packaged with our version of Envoy actually doesn't export a service name at all -- but it's required by the OTLP spec, and so a placeholder value is filled in. (Note that despite a lot of digging, I haven't actually found where this happens yet.)

This omission was corrected with Envoy pull request #22472. That makes the service name easily configurable by name on each tracer stanza, matching how other tracer implementations work in Envoy.

The merge commit of such, in August 2022, became part of v1.24.0 -- but not part of any earlier version.

We currently run v1.23.12, and so don't have this fix. A service-wide Envoy upgrade is a so-called "heavy lift"; it involves at the very least a redeployment of every service we run.

Fortunately the OTel Collector has various data rewriting abilities -- many thanks @Clement_Goubert for the original suggestion to run that as our collector everywhere.

Proposal

Use the OpenTelemetry Collector's transformprocessor to rewrite the service.name Resource according to the following steps:

  • On k8s
    1. If a valid service.name is already set, use that and do nothing else.
      • Ensures forwards compatibility.
    2. If a span's upstream_cluster.name is set to something other than local_service, use that as the service name.
      • Allows easy overriding in existing infrastructure (see below re: mesh).
    3. Otherwise, take the piece of the span's node_id value before the first period, and use that as the service name (on k8s this is the pod name, example: mw-debug.eqiad.pinkunicorn-5bbd65ff7c-ws289)
      • Provides a sensible default without redeploying anything -- node_id is set automatically already.
    4. Otherwise, use our own "unknown" value.
  • On bare-metal
    1. If a valid service.name is already set, use that and do nothing else.
      • Ensures forwards compatibility.
    2. If a span's upstream_cluster.name is set to something other than something matching ^local_(port|path)_.*, use that as the service name.
      • Ensures forwards compatibility.
    3. Otherwise, let's add an optional hiera to the role for a service name, use that if present, and if not, use our own "unknown" value.

Additionally, for k8s, I propose a new minor version of the mesh module that:

  • allows specifying a service name for tracing as part of its configuration, which if set, will override the local_service cluster name
  • and where that name will default to {{ .Release.Namespace }} if not set

Alternatives considered

Other OTel processors ๐Ÿšซ

In this case, since we need to rewrite service.name, which is defined as a Resource in the spec, we can't use either the attributesprocessor nor the spanprocessor. Since the source of the data we wish to write into the service.name Resource is in the span attributes, and not already in the Resources section, we can't use the resourceprocessor either.

Upgrade Envoy ๐Ÿ‘Ž

Too much work and too much risk for this quarter.

However, the implementation described above allows for a graceful migration to defining the service name directly in Envoy when we do upgrade in the future.

Details

Related Changes in Gerrit:
SubjectRepoBranchLines +/-
operations/deployment-chartsmaster+4 -2
operations/deployment-chartsmaster+4 -3
operations/deployment-chartsmaster+22 -6
operations/deployment-chartsmaster+1 K -0
operations/deployment-chartsmaster+2 -2
operations/deployment-chartsmaster+12 -0
operations/deployment-chartsmaster+2 -3
operations/deployment-chartsmaster+6 -0
operations/deployment-chartsmaster+1 -1
operations/deployment-chartsmaster+3 -12
operations/deployment-chartsmaster+36 -1
operations/deployment-chartsmaster+16 -7
operations/deployment-chartsmaster+1 -1
operations/deployment-chartsmaster+19 -10
operations/deployment-chartsmaster+1 -1
operations/deployment-chartsmaster+1 -1
operations/deployment-chartsmaster+5 -1
operations/deployment-chartsmaster+2 -0
operations/deployment-chartsmaster+37 -0
Show related patches Customize query in gerrit

Event Timeline

Restricted Application added a subscriber: Aklapper. ยท View Herald TranscriptApr 24 2024, 8:03 PM

Thanks for the write-up!
What is not very clear to me is what part of the work would need to be done anyways (in case we'd have a envoy version >= 1.24). The reason I'm asking this is that envoy 1.23 is EOL since a year or so, so we need to look at an upgrade anyways.

Regarding the name of local_service in k8s I think {{ .Release.Namespace }} won't be enough. There might be multiple releases per namespace (of different things even). So the base.name.release might be a better choice here.

Thanks for the write-up!
What is not very clear to me is what part of the work would need to be done anyways (in case we'd have a envoy version >= 1.24). The reason I'm asking this is that envoy 1.23 is EOL since a year or so, so we need to look at an upgrade anyways.

The parts of the work that would definitely need to be done anyway:

  • modify the mesh module to allow specifying a service name for tracing purposes
  • plumb that value through to Envoy's tracer configuration, also use a default if not present

The parts of the work that we'd probably want to do anyway:

  • parsing out some service information from node_id so we don't have to upgrade everything and add new configuration stanzas to everything to get close-to-sensible names

The bulk of the work described originally is some mostly-straightforward configuration in a DSL called OTTL and I have most of it already implemented in a local docker-compose proof of concept. The mesh module changes I think are pretty straightforward. In either case, it's all more straightforward than an Envoy upgrade and a full redeploy of services.

Regarding the name of local_service in k8s I think {{ .Release.Namespace }} won't be enough. There might be multiple releases per namespace (of different things even). So the base.name.release might be a better choice here.

Fair enough, I hadn't taken a comprehensive look at charts and different deployments. Happy to go with base.name.release, although we might want to revisit later.

BTW in case it was not clear, my intentions here are basically:

  • deploy something ASAP (like next week) that everyone is reasonably happy with for the interim
  • don't do anything to get in the way of the badly-needed Envoy upgrade
  • don't break anything else

Ok, understood. The only thing I'm really worrying about is that metrics change/get less intuitive with this. For example in here it's pretty clear what the filter means (selecting "local_service"). I think we will loose clarity here if local_service changes to mw-web.eqiad.main. Maybe adding local as suffix/prefix would help here (and you could strip that out again in OTTL?

Plan LGTM (with the caveat that Janis mentioned regarding using base.name.release. Happy to review patches!

I 've had a look around and there are the following usages of local_service:

The last one is mostly irrelevant I think. It's copied in the image so that it's functional, but we don't run anywhere with that config

The grafana one, is anyway a dynamic value and not hardcoded. Whatever value we override too should generally make sense and not be confusing. We kinda trained our brains already around local_service, but I think we can re-train them, especially if the new value is more meaningful.

The first one, given it's an alert can probably be worked around with a patch merged before changing local_service to be something like envoy_cluster_name=~"OUR_NAME_VALUE|local_service" instead of envoy_cluster_name="local_service"

Bare-metal wise, I wouldn't fret much. Right now, the mesh is included as a profile on the following puppet roles

  • role::mediawiki::common - while some of it might stay for a while longer, it's arguable that it won't see enough traffic soon to be worth to trace anything
  • role::deployment_server - This shouldn't see much user related traffic. httpbb passes through this IIRC, so it might be helpful. However, I 'd argue that for this to be helpful and not confusing during debugging, we 'll need to sample at 100%.
  • role::restbase::production - And of course the elephant in the room. Ideally, we 'd be already rid of RESTBase by now. Unfortunately that hasn't happened yet, but we aren't very far away for it. There are arguments both for and against. My take is that it's not the end of the world if we wait a bit and see what happens with the sunsetting.

Ok, understood. The only thing I'm really worrying about is that metrics change/get less intuitive with this. For example in here it's pretty clear what the filter means (selecting "local_service"). I think we will loose clarity here if local_service changes to mw-web.eqiad.main. Maybe adding local as suffix/prefix would help here (and you could strip that out again in OTTL?

OK fair enough. How about we add LOCAL_ as a prefix so it also sorts to the top in grafana?

Ok, understood. The only thing I'm really worrying about is that metrics change/get less intuitive with this. For example in here it's pretty clear what the filter means (selecting "local_service"). I think we will loose clarity here if local_service changes to mw-web.eqiad.main. Maybe adding local as suffix/prefix would help here (and you could strip that out again in OTTL?

OK fair enough. How about we add LOCAL_ as a prefix so it also sorts to the top in grafana?

SGTM!

Change #1030221 had a related patch set uploaded (by CDanis; author: CDanis):

[operations/deployment-charts@master] Service mesh: rename local_service cluster

https://gerrit.wikimedia.org/r/1030221

Change #1030290 had a related patch set uploaded (by CDanis; author: CDanis):

[operations/deployment-charts@master] Parse OTel service names from what's available.

https://gerrit.wikimedia.org/r/1030290

Change #1030290 merged by jenkins-bot:

[operations/deployment-charts@master] Parse OTel service names from what's available.

https://gerrit.wikimedia.org/r/1030290

Change #1031051 had a related patch set uploaded (by CDanis; author: CDanis):

[operations/deployment-charts@master] otelcol: re-add mistakenly removed default processors

https://gerrit.wikimedia.org/r/1031051

Change #1031051 merged by jenkins-bot:

[operations/deployment-charts@master] otelcol: re-add mistakenly removed default processors

https://gerrit.wikimedia.org/r/1031051

Change #1031542 had a related patch set uploaded (by CDanis; author: CDanis):

[operations/deployment-charts@master] otelcol: tweak rollout params

https://gerrit.wikimedia.org/r/1031542

The good news is that this mostly works, and that I also have a patch pending to fix the issues I had with my first attempt at rolling it out.

The bad news is that there's some sort of issue that smells like a race condition or similar nondeterministic data corruption. (Although I suppose the good part of the bad news is that it's happening very often and so isn't too hard to reproduce.)

Here's an example of how this manifests:

image.png (441ร—834 px, 46 KB)

There, the top-level span from a mw-debug pod answering enwiki Special:Version on its local_service cluster is shown as service echotore.

I'll dig into this. To start with I intend to both try to reproduce it locally, and I will probably also look at upgrading to the latest stable version of otelcol and seeing what happens in prod.

Change #1031551 had a related patch set uploaded (by CDanis; author: CDanis):

[operations/deployment-charts@master] otelcol: do service name transform first of all

https://gerrit.wikimedia.org/r/1031551

Change #1031542 merged by jenkins-bot:

[operations/deployment-charts@master] otelcol: tweak rollout params

https://gerrit.wikimedia.org/r/1031542

Change #1031551 merged by jenkins-bot:

[operations/deployment-charts@master] otelcol: do service name transform first of all

https://gerrit.wikimedia.org/r/1031551

Change #1031554 had a related patch set uploaded (by CDanis; author: CDanis):

[operations/deployment-charts@master] otelcol: bump version to v0.100.0-1

https://gerrit.wikimedia.org/r/1031554

Change #1031554 merged by jenkins-bot:

[operations/deployment-charts@master] otelcol: bump version to v0.100.0-1

https://gerrit.wikimedia.org/r/1031554

Still seeing the bad data in new traces.

05:08:49	<@claime>	I think this is the same or a closely related issue https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/32080#issuecomment-2057446307
05:09:30	<@claime>	They're talking about logs, but that last comment seems to point to a larger point about how ottl treats scopes

That sure does look relevant, thanks @Clement_Goubert . I left a comment and will probably also raise the issue on the CNCF Slack channel #otel-collector.

Change #1031952 had a related patch set uploaded (by CDanis; author: CDanis):

[operations/deployment-charts@master] otelcol: attempt to fix service name confusion

https://gerrit.wikimedia.org/r/1031952

Change #1031952 merged by jenkins-bot:

[operations/deployment-charts@master] otelcol: attempt to fix service name confusion

https://gerrit.wikimedia.org/r/1031952

Change #1031956 had a related patch set uploaded (by CDanis; author: CDanis):

[operations/deployment-charts@master] otelcol: reference the changed transformprocessor name

https://gerrit.wikimedia.org/r/1031956

Change #1031956 merged by jenkins-bot:

[operations/deployment-charts@master] otelcol: reference the changed transformprocessor name

https://gerrit.wikimedia.org/r/1031956

At a glance, this seems to work.

image.png (1ร—1 px, 186 KB)

Of course this isn't sufficient to say that IT'S ABSOLUTELY FIXED, but, I was seeing wrong service names approximately 40% of the time before, and this test run is 0.

Change #1030221 merged by jenkins-bot:

[operations/deployment-charts@master] Service mesh: rename local_service cluster

https://gerrit.wikimedia.org/r/1030221

During work on T320563 we learned that we had made a very brittle assumption about naming in this task.

  • node_id is just whatever a gethostname() call returns
  • which is just the pod name
  • the pod's name derives from the Deployment name
  • the Mediawiki chart overrides its naming of Deployments in a way that other charts don't

And thus we're now generating service names like citoid-production-56fc847f4-2bsmm.

So, instead of doing further silly things with regular expressions, let's do something we should really do anyway: deploy OTel's Kubernetes Attributes Processor in our collectors. This uses the k8s API to discover all pods running on its local machine, and then augments received traces with lots of k8s metadata, matching on pod IP addresses or other data.

This requires a k8s serviceAccount / clusterRole to work, which we can't easily do in the current setup, so we'll also do T365626: move k8s opentelemetry-collector from services to admin_ng as a prereq.

I tested out a simple enabling of the k8s attributes processor in the values in the chart. The diffs all looked quite reasonable, but of course it doesn't work:

E0523 20:39:57.672901       1 reflector.go:147] k8s.io/client-go@v0.29.3/tools/cache/reflector.go:229: Failed to watch *v1.Pod: failed to list *v1.Pod: Get "https://10.64.72.1:443/api/v1/pods?fieldSelector=spec.nodeName%3Dmw1440.eqiad.wmnet&limit=500&resourceVersion=0": dial tcp 10.64.72.1:443: i/o timeout

That IP shown is the ClusterIP of the kubernetes API itself.

Of course the upstream chart only wants to speak in terms of the v1 k8s api for network policies, not calico.

I tried adding egress rules that hardcoded the cluster IPs and it didn't seem to fix it; the connect timeouts continued.

Change #1035756 had a related patch set uploaded (by CDanis; author: CDanis):

[operations/deployment-charts@master] WIP

https://gerrit.wikimedia.org/r/1035756

Change #1035756 merged by jenkins-bot:

[operations/deployment-charts@master] otelcol: deploy k8sattributes processor

https://gerrit.wikimedia.org/r/1035756

Change #1035811 had a related patch set uploaded (by CDanis; author: CDanis):

[operations/deployment-charts@master] otelcol: use k8s attributes for service names

https://gerrit.wikimedia.org/r/1035811

Change #1035811 merged by jenkins-bot:

[operations/deployment-charts@master] otelcol: use k8s attributes for service names

https://gerrit.wikimedia.org/r/1035811

Change #1036708 had a related patch set uploaded (by CDanis; author: CDanis):

[operations/deployment-charts@master] otelcol: add three new k8s ctrl IPs

https://gerrit.wikimedia.org/r/1036708

Change #1036707 had a related patch set uploaded (by CDanis; author: CDanis):

[operations/deployment-charts@master] temporarily disable otelcol @ eqiad

https://gerrit.wikimedia.org/r/1036707

Change #1036707 merged by jenkins-bot:

[operations/deployment-charts@master] temporarily disable otelcol @ eqiad

https://gerrit.wikimedia.org/r/1036707

Change #1037083 had a related patch set uploaded (by CDanis; author: CDanis):

[operations/deployment-charts@master] otelcol: disable logs & metrics pipelines

https://gerrit.wikimedia.org/r/1037083

Change #1036708 merged by jenkins-bot:

[operations/deployment-charts@master] otelcol: add three new k8s ctrl IPs

https://gerrit.wikimedia.org/r/1036708

Change #1037083 merged by jenkins-bot:

[operations/deployment-charts@master] otelcol: disable logs & metrics pipelines

https://gerrit.wikimedia.org/r/1037083

Change #1037113 had a related patch set uploaded (by CDanis; author: CDanis):

[operations/deployment-charts@master] otelcol: limit collected k8s data

https://gerrit.wikimedia.org/r/1037113

Change #1037113 merged by jenkins-bot:

[operations/deployment-charts@master] otelcol: limit collected k8s data

https://gerrit.wikimedia.org/r/1037113

Change #1037548 had a related patch set uploaded (by CDanis; author: CDanis):

[operations/deployment-charts@master] re-enable otelcol in eqiad

https://gerrit.wikimedia.org/r/1037548

Change #1037548 merged by jenkins-bot:

[operations/deployment-charts@master] re-enable otelcol in eqiad

https://gerrit.wikimedia.org/r/1037548

CDanis claimed this task.

If we need to do a version of this for bare-metal hosts, we will, but for now let's not.

CDanis reopened this task as Open.EditedJul 2 2024, 2:23 PM

Regarding the name of local_service in k8s I think {{ .Release.Namespace }} won't be enough. There might be multiple releases per namespace (of different things even). So the base.name.release might be a better choice here.

Fair enough, I hadn't taken a comprehensive look at charts and different deployments. Happy to go with base.name.release, although we might want to revisit later.

Unfortunately I think we've discovered that base.name.release is not a very good name for the tracing use case.

  • Most base.name.releases end with -main or -production. We currently don't support tracing in staging, so this is superfluous -- and even if we did later, IMO would be better handled with -staging suffixed names and no-suffix names for production. (Or a separate deployment of Jaeger altogether.)
  • Differentiating on main/production vs canary at the service name level in traces is probably not useful. IMO we'd be better off including a canary=True attribute on spans, or using the k8sattributesprocessor to also extract and attach the release name. This lets you filter for canary traffic but also makes it easier to do comparisons of similar traffic for the service between canary and not.
  • Very often the namespace includes vital information that you don't get with base.name.release (which is the chart/app name + release name). In fact the norm is deploying logically-different services differentiated only by namespace, with similar or identical app names and release names:
    • mw-api-ext, mw-web, mw-api-int, mw-debug et al -- versus just mediawiki-main and mediawiki-canary (and the in-joke mediawiki-pinkunicorn)
    • shellbox, shellbox-video, shellbox-constraints, et al -- versus just shellbox-main
    • eventgate-analytics, eventgate-analytics-external, eventgate-main et al -- versus just eventgate-production & -canary
    • image-suggestion, data-gateway, device-analytics et al -- versus just aqs-http-gateway-main
  • That being said, it's rare but sometimes we do need to differentiate based on the app and/or release name within one namespace:
    • wikifunctions has app=function-orchestrator release=main-orchestrator, app=function-evaluator release=javascript-evaluator, and app=function-evaluator release=python-evaluator.
    • miscweb has release names wikiworkshop, static-codereview, security-landing-page, et al
    • The use of the release name here to indicate different functionality (instead of different versions of the same logical application) seems to only happen in these two namespaces.

Here's a grafana explore query to see what our Envoys are currently reporting for app/namespace/release: https://grafana.wikimedia.org/goto/sv3SY9wSR?orgId=1

I think we should move back to the namespace as the default, and for the few services where this loses information (wikifunctions and miscweb), make this overridable in values.

Summarizing from a discussion in #wikimedia-tracing for posterity's sake.

  • No disagreement from anyone about the findings listed above.
  • The original intent of namespaces hasn't really been to carry signal, but work as tenant separation mechanism, controlling deployment access, kube API access controls and quota management. However, alongside it there was an effort to decouple services from teams owning them, providing services with a stable identifier across time. That identifier has indeed ended up carrying signal indeed.
  • rsyslog and prometheus/grafana already talk to the k8s API, scrape the namespace and make that data available to grafana/logstash. So there is precedent for using the namespace name
  • releases in most cases don't carry the same signal. We have things like -main vs -canary but not much more in most cases. There are exceptions like miscwebhowever

Given the above we are in a de-factor (and not de-jure) situation where indeed namespace names carry signal that is useful to OpenTelemetry collector, enough to warrant changing the default to use the namespace name, allowing to override it in specific cases with values.yaml.

Change #1051449 had a related patch set uploaded (by CDanis; author: CDanis):

[operations/deployment-charts@master] copy patch

https://gerrit.wikimedia.org/r/1051449

Change #1051450 had a related patch set uploaded (by CDanis; author: CDanis):

[operations/deployment-charts@master] mesh: use namespace for default service name

https://gerrit.wikimedia.org/r/1051450

Change #1051453 had a related patch set uploaded (by CDanis; author: CDanis):

[operations/deployment-charts@master] Bump mediawiki chart version & mesh version

https://gerrit.wikimedia.org/r/1051453

Change #1051449 merged by jenkins-bot:

[operations/deployment-charts@master] copy patch

https://gerrit.wikimedia.org/r/1051449

Change #1051450 merged by jenkins-bot:

[operations/deployment-charts@master] mesh: use namespace for default service name

https://gerrit.wikimedia.org/r/1051450

Change #1051453 merged by jenkins-bot:

[operations/deployment-charts@master] Bump mediawiki chart version & mesh version

https://gerrit.wikimedia.org/r/1051453

Change #1187036 had a related patch set uploaded (by CDanis; author: CDanis):

[operations/deployment-charts@master] otelcol: fix service name munging post-Envoy upgrade

https://gerrit.wikimedia.org/r/1187036

Change #1187036 merged by jenkins-bot:

[operations/deployment-charts@master] otelcol: fix service name munging post-Envoy upgrade

https://gerrit.wikimedia.org/r/1187036