
Automation to find / summarize "orphaned" traces
Closed, ResolvedPublic

Description

Not all services we run here propagate tracing headers from incoming requests to outgoing ones. A few of these we've manually identified and fixed, like in T371129.

As a deliverable for the MVP it'd be nice to have an idea of the extent of this, and to be able to file some tasks against the services where adding propagation is easiest and/or most important.

In the long run it'd be good to keep tabs on this so we can work towards increasing tracing "coverage".

One signal that likely indicates failure to propagate headers, using data we're already collecting:

  • Find traces where the root span attribute upstream_cluster.name does not match either LOCAL_.* or local_service
  • What this means is that the Envoy service mesh sidecar of some service received a request from its application, destined for another service, but without any tracing context attached. In theory, this should be either healthchecks (which we should filter) or uninstrumented user traffic
  • The application sending the traffic can be found in the process-level information of that span -- k8s.namespace.name, k8s.pod.name etc.
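The heuristic above can be sketched as a predicate over a Jaeger span document (the dict shape and helper names here are assumptions for illustration, not the actual implementation):

```python
def is_orphan_root(span: dict) -> bool:
    """True if `span` is a root span (no CHILD_OF reference) whose
    upstream_cluster.name tag is neither local_service nor LOCAL_*."""
    if any(r.get("refType") == "CHILD_OF" for r in span.get("references", [])):
        return False  # has a parent, so it's not a root span
    tags = {t["key"]: t["value"] for t in span.get("tags", [])}
    cluster = tags.get("upstream_cluster.name")
    if cluster is None:
        return False  # no Envoy upstream info on this span
    return cluster != "local_service" and not cluster.startswith("LOCAL_")

def sending_namespace(span: dict):
    """The application that sent the traffic, from process-level tags."""
    for t in span.get("process", {}).get("tags", []):
        if t["key"] == "k8s.namespace.name":
            return t["value"]
    return None
```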

Event Timeline

CDanis updated the task description.

I gave this a shot, and came up with the following elasticsearch query:

{
  "query": {
    "bool": {
      "must_not": [
        {
          "nested": {
            "path": "tags",
            "query": {
              "match": {
                "tags.value": "local_service"
              }
            }
          }
        },
        {
          "nested": {
            "path": "references",
            "query": {
              "match": {
                "references.refType": "CHILD_OF"
              }
            }
          }
        },
        {
          "nested": {
            "path": "tags",
            "query": {
              "prefix": {
                "tags.value": "LOCAL_"
              }
            }
          }
        }
      ]
    }
  }
}

What I've been doing is plugging in the query above into https://logstash.wikimedia.org/app/dev_tools#/console with the first line set as GET /jaeger-span-2024.08.13/_search (for example).

I haven't found a way to express exactly that upstream_cluster.name shouldn't start with LOCAL_ and upstream_cluster.name != local_service; instead the non-match applies to every tags.value, which is probably good enough in this case (?)
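One possibly more precise variant (an untested sketch) would be to scope each condition to the upstream_cluster.name key with a bool inside the nested clause, e.g. for the LOCAL_ prefix:

```json
{
  "nested": {
    "path": "tags",
    "query": {
      "bool": {
        "filter": [
          { "term":   { "tags.key":   "upstream_cluster.name" } },
          { "prefix": { "tags.value": "LOCAL_" } }
        ]
      }
    }
  }
}
```

Whether tags.key is indexed in a term-queryable way depends on the Jaeger index mapping, so this would need checking against the actual mapping first.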

Putting it all together as a command-line POC for Aug 13th, we get, for example:

logstash1023:~$ curl -s -XGET "http://localhost:9200/jaeger-span-2024.08.13/_search" -H 'Content-Type: application/json' -d'
{
  "from":0, "size": 10000, "query": {
    "bool": {
      "must_not": [
        {
          "nested": {
            "path": "tags",
            "query": {
              "match": {
                "tags.value": "local_service"
              }
            }
          }
        },
        {
          "nested": {
            "path": "references",
            "query": {
              "match": {
                "references.refType": "CHILD_OF"
              }
            }
          }
        },
        {
          "nested": {
            "path": "tags",
            "query": {
              "prefix": {
                "tags.value": "LOCAL_"
              }
            }
          }
        }
      ]
    }
  }
}' | jq '.hits.hits[]["_source"].process.tags[] | select(.key == "k8s.namespace.name") | .value' | sort -u

which yields

"citoid"
"cxserver"
"eventstreams"
"mobileapps"
"wikifeeds"

Nice! This is very cool.

It looks like the ES query returns the full document? So it should be possible to also output N example trace URLs for each service.

Do you have a preferred way of writing small Python scripts against ES/OS?

I haven't found a way to express exactly that upstream_cluster.name shouldn't start with LOCAL_ and upstream_cluster.name != local_service; instead the non-match applies to every tags.value, which is probably good enough in this case (?)

Yeah, I think that's probably good enough...?

I guess another option -- and one maybe worth considering for both ease of manual querying plus maybe also performance reasons? -- is to experiment with the tags-as-fields options available in Jaeger collector+query.
The upstream docs are pretty meager, but I found a Medium post (sigh) with more examples.

Also related to performance there were some interesting ideas in here: https://karlstoney.com/speeding-up-jaeger-on-elasticsearch/

Nice! This is very cool.

It looks like the ES query returns the full document? So it should be possible to also output N example trace URLs for each service.

Correct, the full document is returned, and +1 to returning sample trace URLs.

Do you have a preferred way of writing small Python scripts against ES/OS?

Good question, two options off the top of my head:

  1. run/deploy on the logstash backend hosts (i.e. no auth required)
  2. run/deploy on whichever host(s) have access to the logs-api.svc credentials we use for jaeger
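To sketch what such a script could look like under option 1 (running on a logstash backend host, so plain HTTP to localhost:9200; all function names here are made up, and the query is the one from above):

```python
"""Sketch: summarize orphan-trace namespaces with sample trace URLs.

Assumes the jaeger-span-YYYY.MM.DD index naming used in this task.
"""
import json
from collections import defaultdict
from urllib.request import Request, urlopen

ES_URL = "http://localhost:9200"
TRACE_URL = "https://trace.wikimedia.org/trace/{}"

def orphan_query(size=10000):
    """The must_not query from above: root spans whose tags match
    neither local_service nor the LOCAL_ prefix."""
    return {
        "from": 0,
        "size": size,
        "query": {"bool": {"must_not": [
            {"nested": {"path": "tags",
                        "query": {"match": {"tags.value": "local_service"}}}},
            {"nested": {"path": "references",
                        "query": {"match": {"references.refType": "CHILD_OF"}}}},
            {"nested": {"path": "tags",
                        "query": {"prefix": {"tags.value": "LOCAL_"}}}},
        ]}},
    }

def summarize(hits, samples=3):
    """Map each k8s.namespace.name to up to `samples` example trace URLs."""
    by_ns = defaultdict(list)
    for hit in hits:
        src = hit["_source"]
        ns = next((t["value"] for t in src["process"]["tags"]
                   if t["key"] == "k8s.namespace.name"), None)
        if ns is not None and len(by_ns[ns]) < samples:
            by_ns[ns].append(TRACE_URL.format(src["traceID"]))
    return dict(by_ns)

def fetch_orphans(day):
    """Query one day's span index, e.g. fetch_orphans("2024.08.13")."""
    req = Request(f"{ES_URL}/jaeger-span-{day}/_search",
                  data=json.dumps(orphan_query()).encode(),
                  headers={"Content-Type": "application/json"})
    with urlopen(req) as resp:
        return summarize(json.load(resp)["hits"]["hits"])
```

That would replace the curl + jq pipeline and also give us a few clickable example traces per service.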

I haven't found a way to express exactly that upstream_cluster.name shouldn't start with LOCAL_ and upstream_cluster.name != local_service; instead the non-match applies to every tags.value, which is probably good enough in this case (?)

Yeah, I think that's probably good enough...?

I guess another option -- and one maybe worth considering for both ease of manual querying plus maybe also performance reasons? -- is to experiment with the tags-as-fields options available in Jaeger collector+query.
The upstream docs are pretty meager, but I found a Medium post (sigh) with more examples.

Very interesting, yes definitely we should be experimenting with tags-as-fields. I skimmed the upstream post, and at this time it isn't clear to me what the transition looks like -- or, put another way, whether the Jaeger UI transparently supports reading from both "formats".

Also related to performance there were some interesting ideas in here: https://karlstoney.com/speeding-up-jaeger-on-elasticsearch/

I'll take a closer look, definitely sounds interesting.

Very interesting, yes definitely we should be experimenting with tags-as-fields. I skimmed the upstream post, and at this time it isn't clear to me what the transition looks like -- or, put another way, whether the Jaeger UI transparently supports reading from both "formats".

OK, so you have to Ctrl-F "compat" in the pull request to find it, but the author asserted that this is backwards-compatible (without explanation).

I took a look at the Jaeger repo and how this is all plumbed through from the configuration parsing into the ES clients. I mostly agree with the author. The one case that isn't backwards-compatible is removing a field from tags-as-fields. But adding fields is totally backwards-compatible.

And the removal case is a subtle breakage only -- viewing a serialized trace by ID always works, but searching by the removed field won't find traces serialized in the interval before it was removed.

(For the record, the 'viewing' code path is mostly handled by jaeger/plugin/storage/es/spanstore/dbmodel/to_domain.go, and the 'searching' code path is handled by jaeger/plugin/storage/es/spanstore/reader.go, most importantly SpanReader's buildTagQuery method)

ES has a default maximum of 1000 fields per index (index.mapping.total_fields.limit), after which point it will refuse writes that add any more fields, so that's one argument against enabling the flag that puts all tags as fields... but I could still be convinced. I guess you could just drop the index for that day and start over, and it would be fine... yeah. Yeah, actually, let's just do all tags as fields? I don't think there's any other downside.
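For reference, these are the upstream es.tags-as-fields options on jaeger-collector and jaeger-query (the include list below is illustrative, not a proposal):

```
# Index every tag as its own top-level ES field:
--es.tags-as-fields.all=true

# Or index only selected tags (list here is illustrative):
--es.tags-as-fields.include=upstream_cluster.name,k8s.namespace.name

# Dots in tag names are replaced in the stored field name (default "@"):
--es.tags-as-fields.dot-replacement=@
```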

Change #1065127 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/deployment-charts@master] jaeger: enable tags-as-fields for query and collector

https://gerrit.wikimedia.org/r/1065127

Change #1065127 merged by jenkins-bot:

[operations/deployment-charts@master] jaeger: enable tags-as-fields for query and collector

https://gerrit.wikimedia.org/r/1065127

Change #1070920 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] logging: add script to query for orphan traces

https://gerrit.wikimedia.org/r/1070920

Change #1070920 merged by Filippo Giunchedi:

[operations/puppet@production] logging: add script to query for orphan traces

https://gerrit.wikimedia.org/r/1070920

fgiunchedi closed this task as Resolved. Sep 12 2024, 9:03 AM
fgiunchedi claimed this task.

script is deployed!

logstash1023:~$ jaeger-find-traces
mobileapps 10 https://trace.wikimedia.org/trace/af6e2ffd163845f361b28d6991662a8e