
eventgate-wikimedia should emit metrics about validation errors
Closed, Resolved · Public

Description

We already emit a validation error event to Kafka, but it'd be nice to also emit some metrics (via Prometheus) about validation errors per stream and schema URI. To do this, we need to update EventGate to use service-runner with Prometheus support, which I believe has not yet been released.

We should be able to do this by passing the service-runner metrics object into eventgate-wikimedia's makeMapToErrorEvent function and emitting the metric there.
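As a rough sketch of the idea (not the actual eventgate-wikimedia implementation; the function signature, metric name, and error event fields below are all assumptions):

```javascript
// Sketch only: wrap the error-event mapper so it also bumps a per-stream
// validation error counter. Assumes `metrics` is the service-runner metrics
// object and exposes a statsd-like increment() method.
function makeMapToErrorEvent(options, metrics) {
    return (error, event) => {
        const stream = (event && event.meta && event.meta.stream) || 'unknown';

        if (metrics) {
            // Ends up as e.g. eventgate.validation_error.<stream> in statsd,
            // which prometheus-statsd-exporter can map to a labeled counter.
            metrics.increment(`validation_error.${stream.replace(/\./g, '_')}`);
        }

        // Build the error event that gets produced to the error stream in Kafka
        // (field names here are illustrative, not the real error event schema).
        return {
            meta: { stream: 'eventgate.error.validation' },
            error: { message: error.message },
            failed_event: JSON.stringify(event)
        };
    };
}
```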

Event Timeline

Milimetric moved this task from Incoming to Event Platform on the Analytics board.

Hm, we might be able to do this incrementally. service-runner will let us configure multiple metrics clients. We can keep the existing statsd -> prometheus-statsd-exporter setup, and at the same time use service-runner's Prometheus support for this validation error metric. That way, we wouldn't have to re-implement node-rdkafka-statsd as node-rdkafka-prometheus, and it would allow us to incrementally move away from prometheus-statsd-exporter.
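I believe the service-runner config for multiple reporters would look something like this; the key names and ports here are from memory and may differ between service-runner versions:

```yaml
# Rough sketch of a service-runner config with two metrics reporters.
# Key names and ports are illustrative, not the deployed config.
metrics:
  # keep the existing statsd reporter -> prometheus-statsd-exporter
  - type: statsd
    host: localhost
    port: 9125
  # add service-runner's built-in prometheus endpoint alongside it
  - type: prometheus
    port: 9102
```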

However, we'd have to run service-runner's prometheus http server on a different port than the one that prometheus-statsd-exporter uses, and I'm not sure our k8s configuration would be happy having to scrape 2 different prometheus endpoints for one service.

I guess that's a Q for service-ops folks. @JMeybohm, @akosiaris:

Would it be difficult to scrape 2 different prometheus endpoints for one service in k8s? One for prometheus-statsd-exporter, and another for service-runner prometheus?

Or, I could probably get this working with the existing service-runner statsd + prometheus-statsd-exporter config, and just define a mapping from statsd -> prometheus like we do in the helm chart now.
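For illustration, that prometheus-statsd-exporter mapping could look roughly like this; the statsd-side metric name is hypothetical, while eventgate_validation_errors_total is the Prometheus name used later in this task:

```yaml
# Hypothetical statsd-exporter mapping: fold per-stream statsd counters into
# one Prometheus counter with a `stream` label.
mappings:
  - match: "eventgate.validation_error.*"
    name: "eventgate_validation_errors_total"
    labels:
      stream: "$1"
```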

Change 657902 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[eventgate-wikimedia@master] Emit error.validation counter metrics per stream

https://gerrit.wikimedia.org/r/657902

Change 657908 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/deployment-charts@master] [WIP] eventgate - Map from eventgate event and error statsd metrics to prometheus

https://gerrit.wikimedia.org/r/657908

Ottomata added a project: Analytics-Kanban.
Ottomata moved this task from Next Up to In Progress on the Analytics-Kanban board.
Ottomata moved this task from In Progress to In Code Review on the Analytics-Kanban board.

Change 657902 merged by Ottomata:
[eventgate-wikimedia@master] Emit metrics about events and validation errors

https://gerrit.wikimedia.org/r/657902

Change 658410 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/deployment-charts@master] eventgate-analytics-external - bump to 2021-01-25-183848-production

https://gerrit.wikimedia.org/r/658410

Change 658410 merged by Ottomata:
[operations/deployment-charts@master] eventgate-analytics-external - bump to 2021-01-25-183848-production

https://gerrit.wikimedia.org/r/658410

Change 657908 merged by Ottomata:
[operations/deployment-charts@master] eventgate - Map from eventgate event and error statsd metrics to prometheus

https://gerrit.wikimedia.org/r/657908

Change 658412 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/deployment-charts@master] eventgate-* - bump to 2021-01-25-183848-production

https://gerrit.wikimedia.org/r/658412

Change 658412 merged by Ottomata:
[operations/deployment-charts@master] eventgate-* - bump to 2021-01-25-183848-production

https://gerrit.wikimedia.org/r/658412

@fgiunchedi hello! Been reading some alert documentation stuff and I have some questions.

I want to add an alert to this panel that will include information about the validation error rate per stream going over some threshold. The last time I added an alert it was done via monitoring::check_prometheus in Puppet. I could probably get this to work there, but I'd like the alert to be a bit more dynamic, reporting hopefully WHICH streams have validation errors, not just that there are some.

I'll try to find you on IRC sometime this week for help! :)

> I guess that's a Q for service-ops folks. @JMeybohm, @akosiaris:
>
> Would it be difficult to scrape 2 different prometheus endpoints for one service in k8s? One for prometheus-statsd-exporter, and another for service-runner prometheus?

While possible, it's a bit tricky to do. Prometheus service discovery would need to be configured to look at a second (third, actually, as there is a metrics endpoint for envoy as well) set of annotations (https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/manifests/prometheus/k8s.pp#221).
As this seems like a temporary solution, I would prefer a way around that.

> @fgiunchedi hello! Been reading some alert documentation stuff and I have some questions.
>
> I want to add an alert to this panel that will include information about the validation error rate per stream going over some threshold. The last time I added an alert it was done via monitoring::check_prometheus in Puppet. I could probably get this to work there, but I'd like the alert to be a bit more dynamic, reporting hopefully WHICH streams have validation errors, not just that there are some.
>
> I'll try to find you on IRC sometime this week for help! :)

For sure! I'll add a bit of context below as well:

At the moment there are several options for the "alert from Prometheus metrics" use case, with varying degrees of legacy-ness and ease of use, namely:

  1. Alerts live in Grafana dashboards
    a. Currently, notifications go through Icinga, like any other alert.
    b. Going forward, Grafana will be an Alertmanager client (i.e. no Icinga). This quarter, for example, we'll be moving Performance alerts to AM. We don't have a formal onboarding procedure for AM yet, but happy to help. Note that Grafana at the moment does not support alerts for dashboards with template variables.
  2. Alerts live in Prometheus (as alerting rules) and show up as Alertmanager alerts. This means writing alerting rules and deploying them via Puppet (for now; we're working to deploy rules via another repo and thus make deployment self-service).
  3. Alerts live in Icinga with check_prometheus, as usual. This works, but such alerts will eventually move to Prometheus alerting rules.

In terms of workflow, using AM means that alerts will show up at https://alerts.wikimedia.org and must be ack'd from there. Notifications can happen via IRC and/or email, usually depending on the alert's severity.

Hope that helps! Let's sync up on IRC too.
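
To make option 2 above concrete, here is a minimal sketch of what a Prometheus alerting rule on the eventgate_validation_errors_total counter might look like; the alert name, threshold, and annotation text are purely illustrative, not the rule that was actually deployed:

```yaml
groups:
  - name: eventgate
    rules:
      # Fire when any single stream sustains a non-trivial validation error rate.
      # The 0.1 errors/sec threshold and 10m hold time are placeholder values.
      - alert: EventgateValidationErrorsHigh
        expr: sum by (stream) (rate(eventgate_validation_errors_total[5m])) > 0.1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "EventGate validation error rate is high for stream {{ $labels.stream }}"
```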

Change 661999 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Add alerts on eventgate_validation_errors_total rate for each eventgate service

https://gerrit.wikimedia.org/r/661999

Change 661999 merged by Ottomata:
[operations/puppet@production] Add alerts on eventgate_validation_errors_total rate for each eventgate service

https://gerrit.wikimedia.org/r/661999