
enable tracing on mwdebug hosts
Closed, Resolved · Public

Description

In T320565 and T320551 we packaged the OpenTelemetry Collector as a .deb and also prepared some puppetization, but didn't actually deploy it to any roles or profiles or hosts.

So let's:

  • modify mediawiki::canary_appserver (the role used for mwdebug hosts) to include profile::opentelemetry::collector (possibly optionally, based on hiera, if we don't want to deploy to all the canaries immediately, but I don't see much reason for that)
  • modify profile::tlsproxy::envoy to optionally report otel metadata to the local otel collector, configuring it with a tracing provider, similar to but different from the work done in T320563 on enabling it on the wikikube mesh (see also this patch)
  • modify profile::tlsproxy::envoy to have a configurable sampling fraction to initiate traces
    • set that sampling fraction to 1.0 on mwdebug*
  • modify profile::services_proxy::envoy to also enable tracing, possibly per listener.

Looking at the Envoy tracing configuration stanzas, it seems like we would want to set random_sampling: 1.0 on debug hosts and leave other values at their defaults.
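
For concreteness, a minimal sketch of such a stanza, assuming the OpenTelemetry tracer and a local collector cluster (the cluster and service names below are made up). Note that Envoy's random_sampling is a percentage from 0 to 100 (default 100), not a 0-1 fraction, so a literal 1.0 there would mean 1%:

# Inside the http_connection_manager filter config of the listener:
tracing:
  random_sampling:
    value: 100.0    # percent of requests to trace; 0-100, NOT a 0-1 fraction
  provider:
    name: envoy.tracers.opentelemetry
    typed_config:
      "@type": type.googleapis.com/envoy.config.trace.v3.OpenTelemetryConfig
      grpc_service:
        envoy_grpc:
          cluster_name: local_otel_collector   # made-up cluster pointing at
                                               # the on-host otel collector
        timeout: 0.250s
      service_name: mwdebug-tlsproxy           # made-up service name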

Event Timeline

Hey,

So, a couple of questions:

  • profile::opentelemetry::collector has two optional parameters, $otel_gateway_fqdn and $otel_gateway_otlp_port. Looking at the Puppet code, if we don't supply these, we won't be configuring an otlp exporter (the otlp receiver will be enabled regardless, and it's orthogonal to the exporter anyway). Are there plans to enable it in the near future?
  • profile::tlsproxy::envoy covers just the "local" service. That is, only the traffic destined for the services the mwdebug host is meant to serve, not the traffic originating FROM the local machine and destined for other services via the service mesh. I am assuming this is what we want for this task, but do we have plans to include the service mesh in the future?
  • random_sampling defaults to 100% per the tracing docs. While that value is fine for mwdebug hosts, we probably want a saner default for everything else. Any ideas on what that would be? My gut instinct says ~1%, but my guess is as good as any.

Change 983441 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/puppet@production] tlsproxy::envoy: Allow specifying a percentage to be traced

https://gerrit.wikimedia.org/r/983441

> Hey,
>
> So, a couple of questions:
>
>   • profile::opentelemetry::collector has two optional parameters, $otel_gateway_fqdn and $otel_gateway_otlp_port. Looking at the Puppet code, if we don't supply these, we won't be configuring an otlp exporter (the otlp receiver will be enabled regardless, and it's orthogonal to the exporter anyway). Are there plans to enable it in the near future?

Yeah, sorry, this was implied but not actually written out: the plan is to provide values for those, similar to what we have in deployment-charts.
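
For reference, a minimal sketch of the collector configuration shape this implies (the gateway endpoint below is a placeholder): the otlp receiver is always rendered, while the otlp exporter only appears once the two gateway parameters are set:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: localhost:4317
processors:
  batch: {}
exporters:
  logging: {}                    # always present
  otlp:                          # rendered only when $otel_gateway_fqdn and
                                 # $otel_gateway_otlp_port are supplied
    endpoint: otel-gateway.example.wmnet:4317    # placeholder fqdn:port
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [logging, otlp]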

>   • profile::tlsproxy::envoy covers just the "local" service. That is, only the traffic destined for the services the mwdebug host is meant to serve, not the traffic originating FROM the local machine and destined for other services via the service mesh. I am assuming this is what we want for this task, but do we have plans to include the service mesh in the future?

Yes please. Really we need two things:

  • the "local" service to both report trace data to otlp AND to decide when to initiate tracing on incoming requests
  • the service mesh to also report trace data to otlp

>   • random_sampling defaults to 100% per the tracing docs. While that value is fine for mwdebug hosts, we probably want a saner default for everything else. Any ideas on what that would be? My gut instinct says ~1%, but my guess is as good as any.

For this stage I was going to tune this based on logstash capacity, which is really our limiting factor right now. I was going to start at something like 0.1% and consider going up to as much as 1%. SGTU?
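
To illustrate what that rollout could look like in hiera (the key name and file paths below are hypothetical; the real parameter comes from the tlsproxy::envoy patch above), the percentage would be set per role with a host override for mwdebug*:

# hieradata/role/common/mediawiki/canary_appserver.yaml (hypothetical path)
profile::tlsproxy::envoy::tracing_sample_percentage: 0.1     # 0.1% of requests

# hieradata/hosts/mwdebug1001.yaml (hypothetical path)
profile::tlsproxy::envoy::tracing_sample_percentage: 100.0   # trace everything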

Change 983895 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/puppet@production] mediawiki canaries: Include opentelemetry::collector

https://gerrit.wikimedia.org/r/983895

Change 983895 merged by Alexandros Kosiaris:

[operations/puppet@production] mediawiki canaries: Include opentelemetry::collector

https://gerrit.wikimedia.org/r/983895

Change 983441 merged by Alexandros Kosiaris:

[operations/puppet@production] tlsproxy::envoy: Allow specifying a percentage to be traced

https://gerrit.wikimedia.org/r/983441

Change 984814 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/puppet@production] Switch canaries to 1% OpenTelemetry sampling

https://gerrit.wikimedia.org/r/984814

Change 984817 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/puppet@production] Provide OpenTelemetry Collector and Port values

https://gerrit.wikimedia.org/r/984817

>> Hey,
>>
>> So, a couple of questions:
>>
>>   • profile::opentelemetry::collector has two optional parameters, $otel_gateway_fqdn and $otel_gateway_otlp_port. Looking at the Puppet code, if we don't supply these, we won't be configuring an otlp exporter (the otlp receiver will be enabled regardless, and it's orthogonal to the exporter anyway). Are there plans to enable it in the near future?
>
> Yeah, sorry, this was implied but not actually written out: the plan is to provide values for those, similar to what we have in deployment-charts.

Double check me on this one? https://gerrit.wikimedia.org/r/c/operations/puppet/+/984817

>>   • profile::tlsproxy::envoy covers just the "local" service. That is, only the traffic destined for the services the mwdebug host is meant to serve, not the traffic originating FROM the local machine and destined for other services via the service mesh. I am assuming this is what we want for this task, but do we have plans to include the service mesh in the future?
>
> Yes please. Really we need two things:
>
>   • the "local" service to both report trace data to otlp AND to decide when to initiate tracing on incoming requests

This is done.

>   • the service mesh to also report trace data to otlp

Working on this one.

>>   • random_sampling defaults to 100% per the tracing docs. While that value is fine for mwdebug hosts, we probably want a saner default for everything else. Any ideas on what that would be? My gut instinct says ~1%, but my guess is as good as any.
>
> For this stage I was going to tune this based on logstash capacity, which is really our limiting factor right now. I was going to start at something like 0.1% and consider going up to as much as 1%. SGTU?

SGTM, I've posted https://gerrit.wikimedia.org/r/c/operations/puppet/+/984814 for that.

Change 984825 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/puppet@production] tlsproxy: Fix the definition of random_sampling

https://gerrit.wikimedia.org/r/984825

Change 984825 merged by Alexandros Kosiaris:

[operations/puppet@production] tlsproxy: Fix the definition of random_sampling

https://gerrit.wikimedia.org/r/984825

Change 985102 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/puppet@production] envoy: Make tracing configuration clearer

https://gerrit.wikimedia.org/r/985102

Change 985102 merged by Alexandros Kosiaris:

[operations/puppet@production] envoy: Make tracing configuration clearer

https://gerrit.wikimedia.org/r/985102

Change 984817 merged by Alexandros Kosiaris:

[operations/puppet@production] Provide OpenTelemetry Collector and Port values

https://gerrit.wikimedia.org/r/984817

I've configured the gateway too for mwdebug1001, and it apparently works, at least per the otel-collector's Prometheus metrics. After checking a couple of traces via the full UX (WikimediaDebug in a browser, selecting mwdebug1001), I switched to telemetrygen to generate some more data so that the Prometheus exporter has something to showcase. I even managed to get the collector killed by throwing what I assume was too much traffic at it.

curl http://localhost:8888/metrics
# HELP otelcol_exporter_enqueue_failed_log_records Number of log records failed to be added to the sending queue.
# TYPE otelcol_exporter_enqueue_failed_log_records counter
otelcol_exporter_enqueue_failed_log_records{exporter="logging",service_instance_id="8038ba01-4120-43f7-b2cf-91d4b4ad6285",service_name="otelcol-contrib",service_version="0.81.0"} 0
otelcol_exporter_enqueue_failed_log_records{exporter="otlp",service_instance_id="8038ba01-4120-43f7-b2cf-91d4b4ad6285",service_name="otelcol-contrib",service_version="0.81.0"} 0
# HELP otelcol_exporter_enqueue_failed_metric_points Number of metric points failed to be added to the sending queue.
# TYPE otelcol_exporter_enqueue_failed_metric_points counter
otelcol_exporter_enqueue_failed_metric_points{exporter="logging",service_instance_id="8038ba01-4120-43f7-b2cf-91d4b4ad6285",service_name="otelcol-contrib",service_version="0.81.0"} 0
otelcol_exporter_enqueue_failed_metric_points{exporter="otlp",service_instance_id="8038ba01-4120-43f7-b2cf-91d4b4ad6285",service_name="otelcol-contrib",service_version="0.81.0"} 0
# HELP otelcol_exporter_enqueue_failed_spans Number of spans failed to be added to the sending queue.
# TYPE otelcol_exporter_enqueue_failed_spans counter
otelcol_exporter_enqueue_failed_spans{exporter="logging",service_instance_id="8038ba01-4120-43f7-b2cf-91d4b4ad6285",service_name="otelcol-contrib",service_version="0.81.0"} 0
otelcol_exporter_enqueue_failed_spans{exporter="otlp",service_instance_id="8038ba01-4120-43f7-b2cf-91d4b4ad6285",service_name="otelcol-contrib",service_version="0.81.0"} 0
# HELP otelcol_exporter_queue_capacity Fixed capacity of the retry queue (in batches)
# TYPE otelcol_exporter_queue_capacity gauge
otelcol_exporter_queue_capacity{exporter="otlp",service_instance_id="8038ba01-4120-43f7-b2cf-91d4b4ad6285",service_name="otelcol-contrib",service_version="0.81.0"} 1000
# HELP otelcol_exporter_queue_size Current size of the retry queue (in batches)
# TYPE otelcol_exporter_queue_size gauge
otelcol_exporter_queue_size{exporter="otlp",service_instance_id="8038ba01-4120-43f7-b2cf-91d4b4ad6285",service_name="otelcol-contrib",service_version="0.81.0"} 120
# HELP otelcol_exporter_sent_spans Number of spans successfully sent to destination.
# TYPE otelcol_exporter_sent_spans counter
otelcol_exporter_sent_spans{exporter="logging",service_instance_id="8038ba01-4120-43f7-b2cf-91d4b4ad6285",service_name="otelcol-contrib",service_version="0.81.0"} 1.059326e+06
otelcol_exporter_sent_spans{exporter="otlp",service_instance_id="8038ba01-4120-43f7-b2cf-91d4b4ad6285",service_name="otelcol-contrib",service_version="0.81.0"} 2150
# HELP otelcol_processor_batch_batch_send_size Number of units in the batch
# TYPE otelcol_processor_batch_batch_send_size histogram
otelcol_processor_batch_batch_send_size_bucket{processor="batch",service_instance_id="8038ba01-4120-43f7-b2cf-91d4b4ad6285",service_name="otelcol-contrib",service_version="0.81.0",le="10"} 1
otelcol_processor_batch_batch_send_size_bucket{processor="batch",service_instance_id="8038ba01-4120-43f7-b2cf-91d4b4ad6285",service_name="otelcol-contrib",service_version="0.81.0",le="25"} 1
otelcol_processor_batch_batch_send_size_bucket{processor="batch",service_instance_id="8038ba01-4120-43f7-b2cf-91d4b4ad6285",service_name="otelcol-contrib",service_version="0.81.0",le="50"} 1
otelcol_processor_batch_batch_send_size_bucket{processor="batch",service_instance_id="8038ba01-4120-43f7-b2cf-91d4b4ad6285",service_name="otelcol-contrib",service_version="0.81.0",le="75"} 1
otelcol_processor_batch_batch_send_size_bucket{processor="batch",service_instance_id="8038ba01-4120-43f7-b2cf-91d4b4ad6285",service_name="otelcol-contrib",service_version="0.81.0",le="100"} 1
otelcol_processor_batch_batch_send_size_bucket{processor="batch",service_instance_id="8038ba01-4120-43f7-b2cf-91d4b4ad6285",service_name="otelcol-contrib",service_version="0.81.0",le="250"} 2
otelcol_processor_batch_batch_send_size_bucket{processor="batch",service_instance_id="8038ba01-4120-43f7-b2cf-91d4b4ad6285",service_name="otelcol-contrib",service_version="0.81.0",le="500"} 3
otelcol_processor_batch_batch_send_size_bucket{processor="batch",service_instance_id="8038ba01-4120-43f7-b2cf-91d4b4ad6285",service_name="otelcol-contrib",service_version="0.81.0",le="750"} 3
otelcol_processor_batch_batch_send_size_bucket{processor="batch",service_instance_id="8038ba01-4120-43f7-b2cf-91d4b4ad6285",service_name="otelcol-contrib",service_version="0.81.0",le="1000"} 3
otelcol_processor_batch_batch_send_size_bucket{processor="batch",service_instance_id="8038ba01-4120-43f7-b2cf-91d4b4ad6285",service_name="otelcol-contrib",service_version="0.81.0",le="2000"} 3
otelcol_processor_batch_batch_send_size_bucket{processor="batch",service_instance_id="8038ba01-4120-43f7-b2cf-91d4b4ad6285",service_name="otelcol-contrib",service_version="0.81.0",le="3000"} 4
otelcol_processor_batch_batch_send_size_bucket{processor="batch",service_instance_id="8038ba01-4120-43f7-b2cf-91d4b4ad6285",service_name="otelcol-contrib",service_version="0.81.0",le="4000"} 4
otelcol_processor_batch_batch_send_size_bucket{processor="batch",service_instance_id="8038ba01-4120-43f7-b2cf-91d4b4ad6285",service_name="otelcol-contrib",service_version="0.81.0",le="5000"} 4
otelcol_processor_batch_batch_send_size_bucket{processor="batch",service_instance_id="8038ba01-4120-43f7-b2cf-91d4b4ad6285",service_name="otelcol-contrib",service_version="0.81.0",le="6000"} 4
otelcol_processor_batch_batch_send_size_bucket{processor="batch",service_instance_id="8038ba01-4120-43f7-b2cf-91d4b4ad6285",service_name="otelcol-contrib",service_version="0.81.0",le="7000"} 4
otelcol_processor_batch_batch_send_size_bucket{processor="batch",service_instance_id="8038ba01-4120-43f7-b2cf-91d4b4ad6285",service_name="otelcol-contrib",service_version="0.81.0",le="8000"} 4
otelcol_processor_batch_batch_send_size_bucket{processor="batch",service_instance_id="8038ba01-4120-43f7-b2cf-91d4b4ad6285",service_name="otelcol-contrib",service_version="0.81.0",le="9000"} 133
otelcol_processor_batch_batch_send_size_bucket{processor="batch",service_instance_id="8038ba01-4120-43f7-b2cf-91d4b4ad6285",service_name="otelcol-contrib",service_version="0.81.0",le="10000"} 133
otelcol_processor_batch_batch_send_size_bucket{processor="batch",service_instance_id="8038ba01-4120-43f7-b2cf-91d4b4ad6285",service_name="otelcol-contrib",service_version="0.81.0",le="20000"} 133
otelcol_processor_batch_batch_send_size_bucket{processor="batch",service_instance_id="8038ba01-4120-43f7-b2cf-91d4b4ad6285",service_name="otelcol-contrib",service_version="0.81.0",le="30000"} 133
otelcol_processor_batch_batch_send_size_bucket{processor="batch",service_instance_id="8038ba01-4120-43f7-b2cf-91d4b4ad6285",service_name="otelcol-contrib",service_version="0.81.0",le="50000"} 133
otelcol_processor_batch_batch_send_size_bucket{processor="batch",service_instance_id="8038ba01-4120-43f7-b2cf-91d4b4ad6285",service_name="otelcol-contrib",service_version="0.81.0",le="100000"} 133
otelcol_processor_batch_batch_send_size_bucket{processor="batch",service_instance_id="8038ba01-4120-43f7-b2cf-91d4b4ad6285",service_name="otelcol-contrib",service_version="0.81.0",le="+Inf"} 133
otelcol_processor_batch_batch_send_size_sum{processor="batch",service_instance_id="8038ba01-4120-43f7-b2cf-91d4b4ad6285",service_name="otelcol-contrib",service_version="0.81.0"} 1.0593259999999998e+06
otelcol_processor_batch_batch_send_size_count{processor="batch",service_instance_id="8038ba01-4120-43f7-b2cf-91d4b4ad6285",service_name="otelcol-contrib",service_version="0.81.0"} 133
# HELP otelcol_processor_batch_batch_size_trigger_send Number of times the batch was sent due to a size trigger
# TYPE otelcol_processor_batch_batch_size_trigger_send counter
otelcol_processor_batch_batch_size_trigger_send{processor="batch",service_instance_id="8038ba01-4120-43f7-b2cf-91d4b4ad6285",service_name="otelcol-contrib",service_version="0.81.0"} 129
# HELP otelcol_processor_batch_timeout_trigger_send Number of times the batch was sent due to a timeout trigger
# TYPE otelcol_processor_batch_timeout_trigger_send counter
otelcol_processor_batch_timeout_trigger_send{processor="batch",service_instance_id="8038ba01-4120-43f7-b2cf-91d4b4ad6285",service_name="otelcol-contrib",service_version="0.81.0"} 4
# HELP otelcol_receiver_accepted_spans Number of spans successfully pushed into the pipeline.
# TYPE otelcol_receiver_accepted_spans counter
otelcol_receiver_accepted_spans{receiver="otlp",service_instance_id="8038ba01-4120-43f7-b2cf-91d4b4ad6285",service_name="otelcol-contrib",service_version="0.81.0",transport="grpc"} 1.059326e+06
# HELP otelcol_receiver_refused_spans Number of spans that could not be pushed into the pipeline.
# TYPE otelcol_receiver_refused_spans counter
otelcol_receiver_refused_spans{receiver="otlp",service_instance_id="8038ba01-4120-43f7-b2cf-91d4b4ad6285",service_name="otelcol-contrib",service_version="0.81.0",transport="grpc"} 0

I am not sure how to test on the Jaeger side; I'll leave that to people more experienced than me.

I think that, for now, I am done on this front. I'll look more into the services proxy side in a couple of weeks.

@CDanis, I've got the patch ready for enabling this across all canaries (including mwdebug hosts) at 0.1%; I'll let you gauge whether we are OK on the logstash front to go with that percentage.

Change 987954 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/puppet@production] services_proxy: Support tracing

https://gerrit.wikimedia.org/r/987954

Change 987954 merged by Alexandros Kosiaris:

[operations/puppet@production] services_proxy: Support tracing

https://gerrit.wikimedia.org/r/987954

Service mesh tracing is configured as well and apparently functioning OK on mwdebug1001.

I think I am done on this front. I've also got the patch ready to enable tracing across all canaries at 0.1%, but that's a separate matter from the mwdebug hosts.

Change 993098 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/deployment-charts@master] tracing: Add local_service/support random sampling

https://gerrit.wikimedia.org/r/993098

Change 993098 merged by jenkins-bot:

[operations/deployment-charts@master] tracing: Add local_service/support random sampling

https://gerrit.wikimedia.org/r/993098

Change 994193 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/deployment-charts@master] mw-debug: Enable tracing with 100% sampling

https://gerrit.wikimedia.org/r/994193

Change 994193 merged by jenkins-bot:

[operations/deployment-charts@master] mw-debug: Enable tracing with 100% sampling

https://gerrit.wikimedia.org/r/994193

akosiaris claimed this task.

I'll resolve; this is now done.

Change #984814 abandoned by Alexandros Kosiaris:

[operations/puppet@production] Switch canaries to 0.1% OpenTelemetry sampling

Reason:

No longer needed, https://gerrit.wikimedia.org/r/c/994193 did the trick.

https://gerrit.wikimedia.org/r/984814