Our various Envoys are configured to report traces to the local OpenTelemetry Collector
Open, Needs Triage, Public

Description

This task tracks configuring envoy to send traces to otel-collector (within k8s).

To reach the jaeger UI before it is available behind SSO:

ssh deploy1002.eqiad.wmnet -L16686:localhost:16686
kube-env jaeger aux-k8s-eqiad
kubectl -n jaeger port-forward svc/main-jaeger-query query

In your browser, navigate to https://localhost:16686. There will be HTTPS warnings; on Chrome you have to type "thisisunsafe" in the browser window to get past them.

Immediate TODO (could be done concurrently too)

Envoy

  • Identify which charts we're targeting
  • Figure out how to expose/inject the current node address into the envoy config (likely via env variables, obtained from spec.nodeName through the k8s downward API)
  • Add otelcol to the clusters of the envoy(s) above (via flags in the mesh module's configuration*.tpl):
- name: opentelemetry_collector
  type: STRICT_DNS
  lb_policy: ROUND_ROBIN
  typed_extension_protocol_options:
    envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
      "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
      explicit_http_config:
        http2_protocol_options: {}
  load_assignment:
    cluster_name: opentelemetry_collector
    endpoints:
    - lb_endpoints:
      - endpoint:
          address:
            socket_address:
              address: NODE_ADDRESS  # placeholder; open question: the node's address?
              port_value: 4317
  • Instruct envoy's HTTP connection manager to send traces via said cluster (via flags in the mesh module's configuration*.tpl):
tracing:
  provider:
    name: envoy.tracers.opentelemetry
    typed_config:
      "@type": type.googleapis.com/envoy.config.trace.v3.OpenTelemetryConfig
      grpc_service:
        envoy_grpc:
          cluster_name: opentelemetry_collector
        timeout: 0.250s
      service_name: XXX
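
The node injection mentioned in the list above could be sketched as the pod spec fragment below. This is only an illustration of the k8s downward API: the container name `envoy` and the variable name `KUBERNETES_NODE_IP` are assumptions, and whether envoy can actually consume such variables is a separate question.

```yaml
# Sketch of a pod spec fragment; container and variable names are illustrative.
spec:
  containers:
    - name: envoy                    # hypothetical container name
      env:
        # Name of the node the pod is scheduled on (downward API).
        - name: KUBERNETES_NODE
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        # IP address of that node, as an alternative to the node name.
        - name: KUBERNETES_NODE_IP   # hypothetical variable name
          valueFrom:
            fieldRef:
              fieldPath: status.hostIP
```

`spec.nodeName` yields a name that still has to resolve from inside the pod, while `status.hostIP` yields an address directly. Which of the two suits the otelcol endpoint is left open here.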

otelcol

  • While T344253 and T343302 are in progress, we should consider enabling zpages extension on otelcol so we can inspect the received traces without requiring jaeger (not needed)

Event Timeline

The good news: OpenTelemetry tracing support exists as of our currently-deployed version of Envoy (v1.23.10): https://www.envoyproxy.io/docs/envoy/v1.23.10/api-v3/config/trace/v3/opentelemetry.proto.html

As of the Envoy v1.24 docs there's even an example sandbox: https://www.envoyproxy.io/docs/envoy/v1.24.10/start/sandboxes/opentelemetry

Hopefully this is enough to get started?

Thank you @CDanis! I think that's indeed enough to get started. Unfortunately the sandbox instructions didn't work as-is locally, so I'll debug a little more and see if I can get the sandbox to run for me.

The configurations were useful to me as-is though, and I've updated the task description with what I think could be the next steps (please feel free to add/remove/edit at will!)

Change 953268 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/deployment-charts@master] mesh: add KUBERNETES_NODE (spec.nodeName)

https://gerrit.wikimedia.org/r/953268

I was far too optimistic with this change. Envoy configuration can't expand environment variables, and we need the node name or address to be able to reach otel-col's port on the node.

The approach could still come in handy for software that does support env variables though (e.g. mediawiki?).

For envoy, @JMeybohm suggested looking into the file-based configuration discovery mechanism. This is upstream's quick start on the matter: https://www.envoyproxy.io/docs/envoy/latest/start/quick-start/configuration-dynamic-filesystem

With this approach (to be investigated next) we'll write a YAML file with the gRPC cluster definition at container startup, with the correct values expanded, and then reference it from envoy's configuration.
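
A minimal sketch of what the file-based approach could look like, loosely following the upstream quick start linked above. The file paths, the `node` values and the address 192.0.2.10 are placeholders, not values from our charts. The second document stands for the file a startup script would write once the node address is known.

```yaml
# Bootstrap fragment: load cluster definitions from a file (path is hypothetical).
node:
  id: example-id             # placeholder
  cluster: example-cluster   # placeholder
dynamic_resources:
  cds_config:
    path: /etc/envoy/dynamic/cds.yaml
---
# /etc/envoy/dynamic/cds.yaml (hypothetical path), written at container startup.
resources:
- "@type": type.googleapis.com/envoy.config.cluster.v3.Cluster
  name: opentelemetry_collector
  type: STRICT_DNS
  lb_policy: ROUND_ROBIN
  typed_extension_protocol_options:
    envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
      "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
      explicit_http_config:
        http2_protocol_options: {}
  load_assignment:
    cluster_name: opentelemetry_collector
    endpoints:
    - lb_endpoints:
      - endpoint:
          address:
            socket_address:
              address: 192.0.2.10   # placeholder for the expanded node address
              port_value: 4317
```

The tracing provider would keep referring to `opentelemetry_collector` by name, so only the cluster definition moves out of the static configuration. The upstream quick start notes that envoy reloads such a file when it is replaced by a move, not when it is edited in place.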

Change 953268 abandoned by Filippo Giunchedi:

[operations/deployment-charts@master] mesh: add KUBERNETES_NODE (spec.nodeName)

Reason:

Not needed

https://gerrit.wikimedia.org/r/953268

Change 953576 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/deployment-charts@master] mesh: add tracing support

https://gerrit.wikimedia.org/r/953576

Change 953575 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/deployment-charts@master] mesh: new configuration version

https://gerrit.wikimedia.org/r/953575

Change 953575 merged by Filippo Giunchedi:

[operations/deployment-charts@master] mesh: new configuration version

https://gerrit.wikimedia.org/r/953575

Change 954210 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/deployment-charts@master] mesh: new networkpolicy version

https://gerrit.wikimedia.org/r/954210

Change 954210 merged by Filippo Giunchedi:

[operations/deployment-charts@master] mesh: new networkpolicy version

https://gerrit.wikimedia.org/r/954210

Change 955290 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/deployment-charts@master] otel-col: enable grpc-http

https://gerrit.wikimedia.org/r/955290

Change 955290 merged by Filippo Giunchedi:

[operations/deployment-charts@master] otel-col: enable grpc-http

https://gerrit.wikimedia.org/r/955290

Change 953576 merged by jenkins-bot:

[operations/deployment-charts@master] mesh: add tracing support

https://gerrit.wikimedia.org/r/953576

Change 955333 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/deployment-charts@master] cxserver: update mesh module

https://gerrit.wikimedia.org/r/955333

Change 955334 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/deployment-charts@master] cxserver: enable mesh tracing

https://gerrit.wikimedia.org/r/955334

fgiunchedi added a project: User-fgiunchedi.
fgiunchedi moved this task from Backlog to Doing on the User-fgiunchedi board.

Change 955333 merged by Filippo Giunchedi:

[operations/deployment-charts@master] cxserver: update mesh module

https://gerrit.wikimedia.org/r/955333

Change 955334 merged by Filippo Giunchedi:

[operations/deployment-charts@master] cxserver: enable mesh tracing

https://gerrit.wikimedia.org/r/955334

Change 955734 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/deployment-charts@master] cxserver: enable mesh tracing in staging only

https://gerrit.wikimedia.org/r/955734

Change 955734 merged by Filippo Giunchedi:

[operations/deployment-charts@master] cxserver: enable mesh tracing in staging only

https://gerrit.wikimedia.org/r/955734

Change 955894 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/deployment-charts@master] citoid: update mesh module

https://gerrit.wikimedia.org/r/955894

Change 955895 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/deployment-charts@master] citoid: enable mesh tracing in staging

https://gerrit.wikimedia.org/r/955895

Thanks to @JMeybohm and @akosiaris I was able to test a cxserver call in staging which resulted in envoy posting a trace:

curl -k -X 'POST' \
  'https://staging.svc.eqiad.wmnet:4002/v1/mt/en/es/Apertium' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "html": "here comes the rain again"
}'

This is the resulting trace:

[Screenshot: 2023-09-08-140014_2262x1828_scrot.png]

Of course there is still some way to go (e.g. changing `Service:OTLPResourceNoServiceName` to something meaningful), though the basic scaffolding works!

Change 955894 merged by Filippo Giunchedi:

[operations/deployment-charts@master] citoid: update mesh module

https://gerrit.wikimedia.org/r/955894

Change 955895 merged by Filippo Giunchedi:

[operations/deployment-charts@master] citoid: enable mesh tracing in staging

https://gerrit.wikimedia.org/r/955895

Mesh tracing for citoid is now enabled in staging too!

We currently have tracing enabled for cxserver and citoid in staging. As a first step, and to gain confidence, I'll enable tracing for those in production.

Change #1034047 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/deployment-charts@master] Enable tracing for citoid and cxserver in production

https://gerrit.wikimedia.org/r/1034047

Change #1034047 merged by Filippo Giunchedi:

[operations/deployment-charts@master] Enable tracing for citoid and cxserver in production

https://gerrit.wikimedia.org/r/1034047

Change #1043076 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/deployment-charts@master] eventstreams: enable mesh tracing

https://gerrit.wikimedia.org/r/1043076

Change #1043077 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/deployment-charts@master] page-analytics: enable mesh tracing

https://gerrit.wikimedia.org/r/1043077

Change #1043078 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/deployment-charts@master] wikifeeds: enable mesh tracing

https://gerrit.wikimedia.org/r/1043078

Change #1043085 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/deployment-charts@master] shellboxen: enable mesh tracing

https://gerrit.wikimedia.org/r/1043085

Change #1043089 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/deployment-charts@master] zotero: enable mesh tracing

https://gerrit.wikimedia.org/r/1043089

Change #1043090 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/deployment-charts@master] apertium: enable mesh tracing

https://gerrit.wikimedia.org/r/1043090

Change #1043107 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/deployment-charts@master] mobileapps: enable mesh tracing

https://gerrit.wikimedia.org/r/1043107

Change #1043076 merged by jenkins-bot:

[operations/deployment-charts@master] eventstreams: enable mesh tracing

https://gerrit.wikimedia.org/r/1043076

Change #1043090 merged by jenkins-bot:

[operations/deployment-charts@master] apertium: enable mesh tracing

https://gerrit.wikimedia.org/r/1043090

Change #1043089 merged by jenkins-bot:

[operations/deployment-charts@master] zotero: enable mesh tracing

https://gerrit.wikimedia.org/r/1043089