Page MenuHomePhabricator

Proton metrics broken
Closed, ResolvedPublic

Description

I wanted to deploy a small change to the chromium-render helm chart today and stumbled upon a service dashboard which looks pretty broken: https://grafana-rw.wikimedia.org/d/llIEd7MMz/proton

This seems to correlate with the deployment at https://sal.toolforge.org/log/IpEeZHcBpU87LSFJhPIC which probably is https://gerrit.wikimedia.org/r/c/mediawiki/services/chromium-render/+/558213 (and that's why this is a subtask of T205870).

I'm not completely sure what the change does/should do, but I guess it broke quite a bunch of metrics. Or those are no longer exposed via statds but natively. Which would be nice, but in that case statsd should probably be disabled completely and we should start scraping the native metrics port.

Event Timeline

I think this is the patch that introduced the change from statsd metrics to native prometheus:
https://gerrit.wikimedia.org/r/c/mediawiki/services/chromium-render/+/558213

Change 673454 had a related patch set uploaded (by Jgiannelos; owner: Jgiannelos):
[operations/deployment-charts@master] Configure prometheus metrics for chromium-renderer

https://gerrit.wikimedia.org/r/673454

Change 673454 merged by jenkins-bot:
[operations/deployment-charts@master] Configure prometheus metrics for chromium-renderer

https://gerrit.wikimedia.org/r/673454

@Jgiannelos I'm moving this back to the to-do column because the changes didn't seem to have any effect.

Maybe that's relevant:
(from the logs)

endTiming() unsupported for metric type Gauge

Change 674941 had a related patch set uploaded (by Jgiannelos; author: Jgiannelos):
[mediawiki/services/chromium-render@master] Fix broken prometheus metrics

https://gerrit.wikimedia.org/r/674941

Change 674941 merged by jenkins-bot:
[mediawiki/services/chromium-render@master] Fix broken prometheus metrics

https://gerrit.wikimedia.org/r/674941

Metrics still not working. Things i discovered while debugging:

  • Patch that removes statsd exporter is already landed and shows up in deployment node
  • Manually cURLing the pod on prometheus port returns statsd exporter related stuff rather than native prometheus metrics
curl POD_IP:9102
...
<head><title>StatsD Exporter</title></head>
...
  • Manually describing the pod still shows that production-metrics-exporter container is deployed and running
  • Even though pods are freshly deployed (~last time helm was deployed) the k8s deployment object is old (older than last time helm was deployed)
  • Helmfile diff doesn't show anything to be deployed

I guess somehow the old metrics exporter is still running (without any statsd metrics to export) and prometheus cannot scrape the native prometheus metrics from proton service.
Can we somehow force a new deployment?

Metrics still not working. Things i discovered while debugging:

  • Patch that removes statsd exporter is already landed and shows up in deployment node
  • Manually cURLing the pod on prometheus port returns statsd exporter related stuff rather than native prometheus metrics
curl POD_IP:9102
...
<head><title>StatsD Exporter</title></head>
...
  • Manually describing the pod still shows that production-metrics-exporter container is deployed and running
  • Even though pods are freshly deployed (~last time helm was deployed) the k8s deployment object is old (older than last time helm was deployed)

That's expected, the deployment objects isn't being recreated on every deploy, it only gets updated, so the CreationTimestamp stays the same.

  • Helmfile diff doesn't show anything to be deployed

I guess somehow the old metrics exporter is still running (without any statsd metrics to export) and prometheus cannot scrape the native prometheus metrics from proton service.
Can we somehow force a new deployment?

The helm chart version wasn't bumped after 5ce71288eb656dc606700a5e75f782919bda47e8 was merged. Bump the version in Chart.yaml, merge it, and try to deploy again.

Change 676869 had a related patch set uploaded (by Jgiannelos; author: Jgiannelos):

[operations/deployment-charts@master] Bump helm chart version for chromium-render

https://gerrit.wikimedia.org/r/676869

Change 676869 merged by jenkins-bot:

[operations/deployment-charts@master] Bump helm chart version for chromium-render

https://gerrit.wikimedia.org/r/676869

It looks like native prometheus metrics are now exposed in the service. That said we may still need to adapt the grafana dashboard because the metrics names might have changed.

Change 677919 had a related patch set uploaded (by Jgiannelos; author: Jgiannelos):

[mediawiki/services/chromium-render@master] Migrate missing metrics to prometheus format

https://gerrit.wikimedia.org/r/677919

Change 677919 merged by jenkins-bot:

[mediawiki/services/chromium-render@master] Migrate missing metrics to prometheus format

https://gerrit.wikimedia.org/r/677919

Change 684896 had a related patch set uploaded (by Jgiannelos; author: Jgiannelos):

[mediawiki/services/chromium-render@master] Convert gauges to histograms for compatibility

https://gerrit.wikimedia.org/r/684896

Change 684896 merged by jenkins-bot:

[mediawiki/services/chromium-render@master] Convert gauges to histograms for compatibility

https://gerrit.wikimedia.org/r/684896

Chromium-render is now completely moved on native prometheus metrics using service runner.
There were many incompatibilities any many broken dashboards because of metrics/labels/queries that changed over time.

I tried to fix the dashboards but in order to not break the format I started from a copy.
Here is a version with working dashboards: https://grafana-rw.wikimedia.org/d/U4TuF-lMk/proton-tmp-jgiannelos

Maybe we can just review and then export and re-import to the original proton grafana dashboard

@Jgiannelos the board looks good for me. I think we should change the primary one with these changes.

@Jgiannelos the board looks good for me. I think we should change the primary one with these changes.

I just did so since the old one was entirely devoid of data by now. I also fixed a couple of minor issues like the queue depth but otherwise this looks pretty good. Thanks @Jgiannelos. I 'll boldly resolve this, feel free to reopen though.