
EventGate should use recent service-runner (^2.8.1) with Prometheus support
Closed, Resolved | Public

Description

Currently, EventGate uses service-runner ^2.7.7 and emits metrics via statsd. In production, eventgate-wikimedia's statsd metrics are translated to Prometheus via prometheus-statsd-exporter. We should do the following:

  • Update to a recent service-runner (^2.8.1) with Prometheus metrics support.
  • Configure eventgate-wikimedia to use Prometheus metrics.
  • Remove prometheus-statsd-exporter usage from eventgate helm chart.
  • Make sure Grafana dashboards continue to work, or make new ones if metrics have changed.
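
To make the difference between the two pipelines concrete, here is a minimal sketch (not EventGate or service-runner code) of how the same counter increment looks as a statsd line and as a sample in the Prometheus text exposition format. The metric and label names are hypothetical, chosen only for illustration.

```javascript
// Hypothetical metric and label names; nothing here is taken from EventGate.

// statsd: the name is a flat dotted string, so dimensions are baked into the name.
function statsdCounterLine(nameParts, value) {
    return `${nameParts.join('.')}:${value}|c`;
}

// Escape a label value as required by the Prometheus text format.
function escapeLabelValue(v) {
    return v.replace(/\\/g, '\\\\').replace(/"/g, '\\"').replace(/\n/g, '\\n');
}

// Prometheus: dimensions are explicit key="value" labels on a stable metric name.
function prometheusSampleLine(name, labels, value) {
    const labelStr = Object.entries(labels)
        .map(([k, v]) => `${k}="${escapeLabelValue(String(v))}"`)
        .join(',');
    return labelStr ? `${name}{${labelStr}} ${value}` : `${name} ${value}`;
}

console.log(statsdCounterLine(['example_service', 'events', 'produced', 'stream_a'], 1));
console.log(prometheusSampleLine('example_events_produced_total', { stream: 'stream_a' }, 1));
```

With statsd, an exporter has to parse the dotted name back into a metric name and labels using mapping rules. With native Prometheus metrics the service declares the labels itself, which is why dashboards may need updating if names or labels change.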

This will allow us to do T257237: eventgate-wikimedia should emit metrics about validation errors

Event Timeline

fdans triaged this task as Medium priority. Jan 28 2021, 5:38 PM

cc Michael: I know you were looking for some event platform tasks. This one would be really helpful!

Change 700945 had a related patch set uploaded (by Ottomata; author: Ottomata):

[eventgate-wikimedia@master] [WIP] Prometheus support with service-runner 2.8.3

https://gerrit.wikimedia.org/r/700945

Change 700945 merged by Ottomata:

[eventgate-wikimedia@master] Prometheus support with service-runner 2.8.3 metrics

https://gerrit.wikimedia.org/r/700945

Change 703463 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/deployment-charts@master] eventgate now uses prometheus directly instead of statsd bridge

https://gerrit.wikimedia.org/r/703463

Change 703463 merged by Ottomata:

[operations/deployment-charts@master] eventgate now uses prometheus directly instead of statsd bridge

https://gerrit.wikimedia.org/r/703463

Change 703477 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/deployment-charts@master] eventgate-analytics/values-staging.yaml - set num_workers: 0

https://gerrit.wikimedia.org/r/703477

Change 703477 merged by Ottomata:

[operations/deployment-charts@master] eventgate-analytics/values-staging.yaml - set num_workers: 0

https://gerrit.wikimedia.org/r/703477

Change 703484 had a related patch set uploaded (by Ottomata; author: Ottomata):

[eventgate-wikimedia@master] Manually normalize rdkafka prometheus labels

https://gerrit.wikimedia.org/r/703484

Change 703484 merged by Ottomata:

[eventgate-wikimedia@master] Manually normalize rdkafka prometheus labels

https://gerrit.wikimedia.org/r/703484

Change 703487 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/deployment-charts@master] Bump eventgate image version to get normalized prometheus metric labels

https://gerrit.wikimedia.org/r/703487

Change 703487 merged by Ottomata:

[operations/deployment-charts@master] Bump eventgate image version to get normalized prometheus metric labels

https://gerrit.wikimedia.org/r/703487

@Pchelolo, @colewhite Q: I'm close to getting this working, but I seem to be missing a service-runner metric, and I don't know where it came from before. service_runner_request_duration_seconds_count is mapped by prometheus-statsd-exporter, but after upgrading to service-runner 2.8.3 with Prometheus I don't see any equivalent HTTP request metrics. I'm also having trouble finding any service-runner code where that metric was previously emitted to statsd.

I'd like to keep this metric if possible. What am I missing? Thanks!
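
For context on the `_count` suffix: a Prometheus histogram named `X` is exposed as the series `X_bucket`, `X_sum`, and `X_count`, so a `..._duration_seconds_count` series normally implies a histogram or summary somewhere upstream. The following is a minimal sketch in plain JavaScript, not service-runner's implementation; the metric name and bucket boundaries are illustrative assumptions.

```javascript
// Minimal cumulative histogram, for illustration only (not service-runner code).
class Histogram {
    constructor(name, buckets) {
        this.name = name;
        this.buckets = [...buckets].sort((a, b) => a - b); // upper bounds, ascending
        this.counts = this.buckets.map(() => 0);           // cumulative count per bucket
        this.sum = 0;
        this.count = 0;
    }

    observe(seconds) {
        this.sum += seconds;
        this.count += 1;
        // Buckets are cumulative: an observation is counted in every bucket
        // whose upper bound is greater than or equal to the observed value.
        this.buckets.forEach((le, i) => {
            if (seconds <= le) this.counts[i] += 1;
        });
    }

    // Render in the Prometheus text exposition format.
    render() {
        const lines = this.buckets.map(
            (le, i) => `${this.name}_bucket{le="${le}"} ${this.counts[i]}`
        );
        lines.push(`${this.name}_bucket{le="+Inf"} ${this.count}`);
        lines.push(`${this.name}_sum ${this.sum}`);
        lines.push(`${this.name}_count ${this.count}`);
        return lines.join('\n');
    }
}

// Hypothetical metric name and bucket boundaries.
const h = new Histogram('example_request_duration_seconds', [0.1, 0.5, 1]);
[0.25, 0.5, 2].forEach((s) => h.observe(s));
console.log(h.render());
```

A request rate can then be derived from the `_count` series alone, which is why losing it affects dashboards even when other latency data is still available.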

Oh great, thanks Cole. What do we need to do to get that merged and released?

Never mind, Petr answered on the PR. Working on it.

Change 704350 had a related patch set uploaded (by Ottomata; author: Ottomata):

[eventgate-wikimedia@master] Prometheus metric label value fixes

https://gerrit.wikimedia.org/r/704350

Change 704350 merged by Ottomata:

[eventgate-wikimedia@master] Prometheus metric label value fixes

https://gerrit.wikimedia.org/r/704350

Change 704353 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/deployment-charts@master] eventgate-analytics - bump to 2021-07-13-151027-production

https://gerrit.wikimedia.org/r/704353

Change 704353 merged by Ottomata:

[operations/deployment-charts@master] eventgate-analytics - bump to 2021-07-13-151027-production

https://gerrit.wikimedia.org/r/704353

Change 704548 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] Add prometheus precomputed query for express_router_request_duration_seconds

https://gerrit.wikimedia.org/r/704548

Change 704548 merged by Ottomata:

[operations/puppet@production] Add prometheus precomputed query for express_router_request_duration_seconds

https://gerrit.wikimedia.org/r/704548

Change 704554 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] Fix express_router_request_duration_seconds precomputed promtheus query

https://gerrit.wikimedia.org/r/704554

Change 704554 merged by Ottomata:

[operations/puppet@production] Fix express_router_request_duration_seconds precomputed promtheus query

https://gerrit.wikimedia.org/r/704554

Change 704588 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/deployment-charts@master] eventgate-analytics - set num_workers: 0

https://gerrit.wikimedia.org/r/704588

@colewhite Ok, I need to fork node-rdkafka-prometheus to fix this. Tomorrow I'll make a @wikimedia/node-rdkafka-prometheus package and github repo and apply my patches. I'll ask you to review the PRs if you don't mind!

I'll then adapt eventgate-wikimedia to use @wikimedia/node-rdkafka-prometheus instead of the main one.

Change 704851 had a related patch set uploaded (by Ottomata; author: Ottomata):

[eventgate-wikimedia@master] Use @wikimedia/node-rdkafka-prometheus 1.1.1

https://gerrit.wikimedia.org/r/704851

Change 704851 merged by Ottomata:

[eventgate-wikimedia@master] Use @wikimedia/node-rdkafka-prometheus 1.1.1

https://gerrit.wikimedia.org/r/704851

Change 704588 abandoned by Ottomata:

[operations/deployment-charts@master] eventgate-analytics - set num_workers: 0

Reason:

https://gerrit.wikimedia.org/r/704588

Change 704853 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/deployment-charts@master] eventgate-analytics bump to 2021-07-15-185242-production with prom-client fixes

https://gerrit.wikimedia.org/r/704853

Change 704853 merged by Ottomata:

[operations/deployment-charts@master] eventgate-analytics bump to 2021-07-15-185242-production with prom-client fixes

https://gerrit.wikimedia.org/r/704853

^ Done and deployed to eventgate-analytics staging. Looks good.

Will deploy to production the week of July 26th.

Change 708289 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/deployment-charts@master] eventgate-analytics - Use canary releases for eqiad and codfw

https://gerrit.wikimedia.org/r/708289

Change 708289 merged by Ottomata:

[operations/deployment-charts@master] eventgate-analytics - Use canary releases for eqiad and codfw

https://gerrit.wikimedia.org/r/708289

Change 708292 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/deployment-charts@master] eventgate - use native prometheus in all services

https://gerrit.wikimedia.org/r/708292

Change 708292 merged by Ottomata:

[operations/deployment-charts@master] eventgate - use native prometheus in all services

https://gerrit.wikimedia.org/r/708292

All eventgate clusters deployed, woohoo! This is the new EventGate dashboard.

I'm going to wait a day; the final step will be to remove the old EventGate dashboard.