our various Envoys are configured to report traces to local OpenTelemetry Collector
Open, Needs Triage, Public

Description

This task tracks configuring envoy to send traces to otel-collector (within k8s).

To reach the jaeger UI before it is available behind SSO:

ssh deploy1002.eqiad.wmnet -L16686:localhost:16686
kube-env jaeger aux-k8s-eqiad
kubectl -n jaeger port-forward svc/main-jaeger-query query

Navigate in your browser to https://localhost:16686. There will be HTTPS warnings; on Chrome you have to type "thisisunsafe" in the browser window to get past them.

Immediate TODO (could be done concurrently too)

Envoy

  • Identify which charts we're targeting
  • Figure out how to expose/inject the current node address into the envoy config (likely via an environment variable populated from spec.nodeName through the k8s downward API)
  • Add otelcol as a cluster in the configuration of the envoy(s) identified above (via flags in mesh module configuration*.tpl)
- name: opentelemetry_collector
  type: STRICT_DNS
  lb_policy: ROUND_ROBIN
  typed_extension_protocol_options:
    envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
      "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
      explicit_http_config:
        http2_protocol_options: {}
  load_assignment:
    cluster_name: opentelemetry_collector
    endpoints:
    - lb_endpoints:
      - endpoint:
          address:
            socket_address:
              address: node's address?
              port_value: 4317
  • Instruct envoy's HTTP connection manager to send traces via that cluster (via flags in mesh module configuration*.tpl)
tracing:
  provider:
    name: envoy.tracers.opentelemetry
    typed_config:
      "@type": type.googleapis.com/envoy.config.trace.v3.OpenTelemetryConfig
      grpc_service:
        envoy_grpc:
          cluster_name: opentelemetry_collector
        timeout: 0.250s
      service_name: XXX

otelcol

  • While T344253 and T343302 are in progress, we should consider enabling zpages extension on otelcol so we can inspect the received traces without requiring jaeger (not needed)

Details

Repo                          Branch  Lines +/-
operations/deployment-charts  master  +4 -1
operations/deployment-charts  master  +1 -1
operations/deployment-charts  master  +3 -0
operations/deployment-charts  master  +1 -1
operations/deployment-charts  master  +3 -0
operations/deployment-charts  master  +15 -0
operations/deployment-charts  master  +3 -0
operations/deployment-charts  master  +2 -0
operations/deployment-charts  master  +2 -0
operations/deployment-charts  master  +3 -0
operations/deployment-charts  master  +4 -8
operations/deployment-charts  master  +4 -0
operations/deployment-charts  master  +86 -16
operations/deployment-charts  master  +4 -2
operations/deployment-charts  master  +2 -0
operations/deployment-charts  master  +74 -11
operations/deployment-charts  master  +129 -3
operations/deployment-charts  master  +4 -6
operations/deployment-charts  master  +33 -0
operations/deployment-charts  master  +518 -0
operations/deployment-charts  master  +90 -0

Event Timeline

The good news: OpenTelemetry tracing support exists as of our currently-deployed version of Envoy (v1.23.10): https://www.envoyproxy.io/docs/envoy/v1.23.10/api-v3/config/trace/v3/opentelemetry.proto.html

As of the Envoy v1.24 docs there's even an example sandbox: https://www.envoyproxy.io/docs/envoy/v1.24.10/start/sandboxes/opentelemetry

Hopefully this is enough to get started?

Thank you @CDanis! I think that's indeed enough to get started, although unfortunately the sandbox instructions didn't work as-is locally. I'll debug a little more and see if I can get the sandbox to run for me.

The configurations were useful to me as-is though, and I've updated the task description with what I think could be the next steps (please feel free to add/remove/edit at will!)

Change 953268 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/deployment-charts@master] mesh: add KUBERNETES_NODE (spec.nodeName)

https://gerrit.wikimedia.org/r/953268

I was far too optimistic with this change. Envoy configuration can't expand environment variables, and we need the node name or address to be able to reach otel-col's port on the node.

The approach could still come in handy for software that does support env variables though (e.g. mediawiki?).

For envoy, @JMeybohm suggested looking into the file-based configuration discovery mechanism; this is upstream's quickstart on the matter: https://www.envoyproxy.io/docs/envoy/latest/start/quick-start/configuration-dynamic-filesystem

With this approach (to be investigated next) we'll write a yaml file with the grpc cluster definition at container startup, with the correct values expanded, then reference that file from envoy's configuration.
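
As a rough illustration of that idea (not what the mesh module ended up doing), a startup script could render the cluster definition from the task description into a file-based CDS resource. The function names and the target path are hypothetical, the KUBERNETES_NODE variable is borrowed from the abandoned patch above, and wrapping the cluster in a `resources:` list is an assumption based on the layout in upstream's quickstart.

```shell
#!/bin/sh
# Sketch only: function names and file paths are assumptions for illustration.

# render_otel_cds NODE_ADDRESS
# Prints a file-based CDS document with the node address filled in.
# The cluster body mirrors the one in the task description.
render_otel_cds() {
  node_address="$1"
  if [ -z "$node_address" ]; then
    echo "render_otel_cds: node address is required" >&2
    return 1
  fi
  cat <<EOF
resources:
- "@type": type.googleapis.com/envoy.config.cluster.v3.Cluster
  name: opentelemetry_collector
  type: STRICT_DNS
  lb_policy: ROUND_ROBIN
  typed_extension_protocol_options:
    envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
      "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
      explicit_http_config:
        http2_protocol_options: {}
  load_assignment:
    cluster_name: opentelemetry_collector
    endpoints:
    - lb_endpoints:
      - endpoint:
          address:
            socket_address:
              address: ${node_address}
              port_value: 4317
EOF
}

# write_otel_cds NODE_ADDRESS TARGET_PATH
# Renders to a temporary file, then renames it into place so that a
# half-written file is never visible at TARGET_PATH.
write_otel_cds() {
  target="$2"
  tmp="${target}.tmp"
  render_otel_cds "$1" > "$tmp" || { rm -f "$tmp"; return 1; }
  mv "$tmp" "$target"
}
```

A container entrypoint could then call something like `write_otel_cds "$KUBERNETES_NODE" /tmp/envoy-cds.yaml` before starting envoy, with envoy's bootstrap pointing `dynamic_resources.cds_config` at that file. Both the path and the exact bootstrap wiring would need checking against the deployed envoy version.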

Change 953268 abandoned by Filippo Giunchedi:

[operations/deployment-charts@master] mesh: add KUBERNETES_NODE (spec.nodeName)

Reason:

Not needed

https://gerrit.wikimedia.org/r/953268

Change 953576 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/deployment-charts@master] mesh: add tracing support

https://gerrit.wikimedia.org/r/953576

Change 953575 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/deployment-charts@master] mesh: new configuration version

https://gerrit.wikimedia.org/r/953575

Change 953575 merged by Filippo Giunchedi:

[operations/deployment-charts@master] mesh: new configuration version

https://gerrit.wikimedia.org/r/953575

Change 954210 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/deployment-charts@master] mesh: new networkpolicy version

https://gerrit.wikimedia.org/r/954210

Change 954210 merged by Filippo Giunchedi:

[operations/deployment-charts@master] mesh: new networkpolicy version

https://gerrit.wikimedia.org/r/954210

Change 955290 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/deployment-charts@master] otel-col: enable grpc-http

https://gerrit.wikimedia.org/r/955290

Change 955290 merged by Filippo Giunchedi:

[operations/deployment-charts@master] otel-col: enable grpc-http

https://gerrit.wikimedia.org/r/955290

Change 953576 merged by jenkins-bot:

[operations/deployment-charts@master] mesh: add tracing support

https://gerrit.wikimedia.org/r/953576

Change 955333 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/deployment-charts@master] cxserver: update mesh module

https://gerrit.wikimedia.org/r/955333

Change 955334 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/deployment-charts@master] cxserver: enable mesh tracing

https://gerrit.wikimedia.org/r/955334

fgiunchedi added a project: User-fgiunchedi.
fgiunchedi moved this task from Backlog to Doing on the User-fgiunchedi board.

Change 955333 merged by Filippo Giunchedi:

[operations/deployment-charts@master] cxserver: update mesh module

https://gerrit.wikimedia.org/r/955333

Change 955334 merged by Filippo Giunchedi:

[operations/deployment-charts@master] cxserver: enable mesh tracing

https://gerrit.wikimedia.org/r/955334

Change 955734 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/deployment-charts@master] cxserver: enable mesh tracing in staging only

https://gerrit.wikimedia.org/r/955734

Change 955734 merged by Filippo Giunchedi:

[operations/deployment-charts@master] cxserver: enable mesh tracing in staging only

https://gerrit.wikimedia.org/r/955734

Change 955894 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/deployment-charts@master] citoid: update mesh module

https://gerrit.wikimedia.org/r/955894

Change 955895 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/deployment-charts@master] citoid: enable mesh tracing in staging

https://gerrit.wikimedia.org/r/955895

Thanks to @JMeybohm and @akosiaris I was able to test a cxserver call in staging which resulted in envoy posting a trace:

curl -k -X 'POST' \
  'https://staging.svc.eqiad.wmnet:4002/v1/mt/en/es/Apertium' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "html": "here comes the rain again"
}'

The resulting trace:

2023-09-08-140014_2262x1828_scrot.png (1×2 px, 272 KB)

Of course there is still some way to go (e.g. changing `Service:OTLPResourceNoServiceName` to something meaningful), though the basic scaffolding works!

Change 955894 merged by Filippo Giunchedi:

[operations/deployment-charts@master] citoid: update mesh module

https://gerrit.wikimedia.org/r/955894

Change 955895 merged by Filippo Giunchedi:

[operations/deployment-charts@master] citoid: enable mesh tracing in staging

https://gerrit.wikimedia.org/r/955895

Mesh tracing for citoid is now also enabled in staging!

We currently have tracing enabled for cxserver and citoid in staging. As a first step, and to gain confidence, I'll enable tracing for those in production.

Change #1034047 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/deployment-charts@master] Enable tracing for citoid and cxserver in production

https://gerrit.wikimedia.org/r/1034047

Change #1034047 merged by Filippo Giunchedi:

[operations/deployment-charts@master] Enable tracing for citoid and cxserver in production

https://gerrit.wikimedia.org/r/1034047

Change #1043076 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/deployment-charts@master] eventstreams: enable mesh tracing

https://gerrit.wikimedia.org/r/1043076

Change #1043077 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/deployment-charts@master] page-analytics: enable mesh tracing

https://gerrit.wikimedia.org/r/1043077

Change #1043078 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/deployment-charts@master] wikifeeds: enable mesh tracing

https://gerrit.wikimedia.org/r/1043078

Change #1043085 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/deployment-charts@master] shellboxen: enable mesh tracing

https://gerrit.wikimedia.org/r/1043085

Change #1043089 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/deployment-charts@master] zotero: enable mesh tracing

https://gerrit.wikimedia.org/r/1043089

Change #1043090 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/deployment-charts@master] apertium: enable mesh tracing

https://gerrit.wikimedia.org/r/1043090

Change #1043107 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/deployment-charts@master] mobileapps: enable mesh tracing

https://gerrit.wikimedia.org/r/1043107

Change #1043076 merged by jenkins-bot:

[operations/deployment-charts@master] eventstreams: enable mesh tracing

https://gerrit.wikimedia.org/r/1043076

Change #1043090 merged by jenkins-bot:

[operations/deployment-charts@master] apertium: enable mesh tracing

https://gerrit.wikimedia.org/r/1043090

Change #1043089 merged by jenkins-bot:

[operations/deployment-charts@master] zotero: enable mesh tracing

https://gerrit.wikimedia.org/r/1043089

Change #1043077 merged by Filippo Giunchedi:

[operations/deployment-charts@master] page-analytics: enable mesh tracing

https://gerrit.wikimedia.org/r/1043077

Change #1043085 merged by Filippo Giunchedi:

[operations/deployment-charts@master] shellboxen: enable mesh tracing

https://gerrit.wikimedia.org/r/1043085

Change #1043078 merged by Filippo Giunchedi:

[operations/deployment-charts@master] wikifeeds: enable mesh tracing

https://gerrit.wikimedia.org/r/1043078

Change #1051699 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/deployment-charts@master] wikifeeds: lower tracing sample rate

https://gerrit.wikimedia.org/r/1051699

Change #1051699 merged by Filippo Giunchedi:

[operations/deployment-charts@master] wikifeeds: lower tracing sample rate

https://gerrit.wikimedia.org/r/1051699

Change #1043107 merged by Filippo Giunchedi:

[operations/deployment-charts@master] mobileapps: enable mesh tracing

https://gerrit.wikimedia.org/r/1043107

Change #1052912 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/deployment-charts@master] mobileapps: lower tracing sampling percentage

https://gerrit.wikimedia.org/r/1052912

Change #1052912 merged by Filippo Giunchedi:

[operations/deployment-charts@master] mobileapps: lower tracing sampling percentage

https://gerrit.wikimedia.org/r/1052912

Change #1062012 had a related patch set uploaded (by CDanis; author: CDanis):

[operations/deployment-charts@master] tracing: tweak samplerates for services

https://gerrit.wikimedia.org/r/1062012

Change #1062012 merged by jenkins-bot:

[operations/deployment-charts@master] tracing: tweak samplerates for services

https://gerrit.wikimedia.org/r/1062012