
Add a prometheus metric exporter to all the Druid daemons
Closed, Resolved · Public · 21 Estimated Story Points

Description

Add and configure a Prometheus agent exposing the metrics described at http://druid.io/docs/0.9.2/operations/metrics.html. This requires two important actions:

  1. Decide with the team what metrics we want to expose.
  2. Create a custom ad-hoc Prometheus exporter. Druid offers only what is called an "Emitter", namely a way to push metrics to some target like an HTTP endpoint (via POST), a file or Graphite. Prometheus requires the opposite, namely polling an HTTP interface that returns metrics formatted in a predefined way.

Druid does not use jmxtrans yet, so we'll skip that step entirely and go directly to Prometheus.

Event Timeline

Change 382452 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] druid: add log4j logger to direct metrics to a specific file

https://gerrit.wikimedia.org/r/382452

Change 382452 merged by Elukey:
[operations/puppet@production] druid: add log4j logger to direct metrics to a specific file

https://gerrit.wikimedia.org/r/382452

elukey renamed this task from "Add the prometheus jmx exporter to all the Druid daemons" to "Add a prometheus metric exporter to all the Druid daemons". Oct 6 2017, 1:16 PM
elukey updated the task description. (Show Details)

All the daemons are now emitting metrics to /var/log/druid/$daemon-metrics.log, and it seems that we already have everything we need in there without further configuration. Now it is a matter of designing the Prometheus agent to read and aggregate the JSON data from these logs and expose it via an HTTP interface.
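To get a feel for the data, a minimal sketch of the log-parsing side could look like the following. The field names ("service", "metric", "value") are taken from the Druid metrics docs and the log path is one of the files mentioned above; both are assumptions to be adjusted against the actual output:

# Minimal sketch: read Druid metric events from the log file written by the
# log4j logger configured above. Each line is expected to be a JSON event.
import json

def parse_metric_line(line):
    """Return (service, metric, value) from a Druid metric JSON event, or None."""
    try:
        event = json.loads(line)
        return event['service'], event['metric'], event['value']
    except (ValueError, KeyError):
        return None  # not a well-formed metric event

if __name__ == '__main__':
    # Example: broker metrics, as per /var/log/druid/$daemon-metrics.log above.
    with open('/var/log/druid/broker-metrics.log') as log:
        for line in log:
            parsed = parse_metric_line(line)
            if parsed:
                print(parsed)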

Today I had a great chat with @fgiunchedi and came up with the following simple script, which handles HTTP POSTs with custom code and regular HTTP GETs returning Prometheus metrics:

# heavily inspired by https://github.com/prometheus/client_python's examples

from prometheus_client import make_wsgi_app, Summary
from wsgiref.simple_server import make_server
import time
import random

# Create a metric to track time spent and requests made.
REQUEST_TIME = Summary('request_processing_seconds', 'Time spent processing request')
app = make_wsgi_app()

# Decorate function with metric.
@REQUEST_TIME.time()
def process_request(t):
    """A dummy function that takes some time."""
    time.sleep(t)

def simple_app(environ, start_response):
    # POSTs get the custom handling (here just a dummy timed function);
    # everything else (e.g. GET /metrics) falls through to the
    # prometheus_client WSGI app.
    if environ['REQUEST_METHOD'] == 'POST':
        process_request(random.random())
        status = '200 OK'
        headers = [('Content-Type', 'text/plain')]
        start_response(status, headers)
        return [b'']
    else:
        return app(environ, start_response)

if __name__ == '__main__':
    httpd = make_server('localhost', 8000, simple_app)
    httpd.serve_forever()

Some simple tests:

$ curl http://localhost:8000/metrics
# HELP request_processing_seconds Time spent processing request
# TYPE request_processing_seconds summary
request_processing_seconds_count 0.0
request_processing_seconds_sum 0.0
# HELP python_info Python platform information
# TYPE python_info gauge
python_info{implementation="CPython",major="2",minor="7",patchlevel="10",version="2.7.10"} 1.0

$ curl -X POST http://localhost:8000
$ curl -X POST http://localhost:8000

$ curl http://localhost:8000/metrics
# HELP request_processing_seconds Time spent processing request
# TYPE request_processing_seconds summary
request_processing_seconds_count 2.0                  <== incremented after the two POSTs
request_processing_seconds_sum 0.7823622226715088
# HELP python_info Python platform information
# TYPE python_info gauge
python_info{implementation="CPython",major="2",minor="7",patchlevel="10",version="2.7.10"} 1.0

So the idea is to handle the POSTs coming from Druid (containing JSON metrics), store their content as Prometheus metrics, and return them when the master asks via GET. Filippo also explained to me how nginx latency metrics are collected, as an example of metrics that are not counters and that need to be aggregated (for example for p75/p90/etc.):

nginx_http_request_duration_seconds_bucket{host="thumbor.svc.codfw.wmnet",le="005.00"} 15994
nginx_http_request_duration_seconds_bucket{host="thumbor.svc.codfw.wmnet",le="010.00"} 16014
nginx_http_request_duration_seconds_bucket{host="thumbor.svc.codfw.wmnet",le="035.00"} 16149

The trick is basically to come up with "buckets" of request durations less than or equal to certain thresholds, and let those counters increase over time in the Prometheus agent. The Prometheus master will then collect them periodically and infer variations by itself (when asked via the Prometheus CLI).
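For completeness, here is a minimal sketch of that bucketing idea with prometheus_client; the metric name and the bucket boundaries are purely illustrative, not the ones used for the Druid exporter:

from prometheus_client import Histogram

# Illustrative histogram: each observation increments every bucket whose
# upper bound ("le" label) is >= the observed value, plus _count and _sum.
QUERY_TIME = Histogram(
    'druid_query_time_seconds',
    'Time spent serving Druid queries',
    buckets=(0.05, 0.1, 0.35, 1.0, 5.0, float('inf')))

QUERY_TIME.observe(0.2)  # lands in the 0.35 bucket and all larger ones

# The exporter only ever exposes these cumulative counters; the Prometheus
# server scrapes them periodically and computes rates/percentiles
# (e.g. via histogram_quantile) at query time.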

elukey triaged this task as High priority.
elukey edited projects, added Analytics-Kanban; removed Patch-For-Review, Analytics.
elukey moved this task from Next Up to In Progress on the Analytics-Kanban board.

Change 389475 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/software/druid_exporter@master] [WIP] First commit

https://gerrit.wikimedia.org/r/389475

Change 390393 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] druid: remove com.metamx.metrics.JvmMonitor from default monitors

https://gerrit.wikimedia.org/r/390393

Change 390393 merged by Elukey:
[operations/puppet@production] druid: remove com.metamx.metrics.JvmMonitor from default monitors

https://gerrit.wikimedia.org/r/390393

Change 390419 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::druid::broker: add prometheus jmx exporter config (jvm only)

https://gerrit.wikimedia.org/r/390419

Change 390419 merged by Elukey:
[operations/puppet@production] profile::druid::broker: add prometheus jmx exporter config (jvm only)

https://gerrit.wikimedia.org/r/390419

Change 390968 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::druid::*: add prometheus jvm monitoring via jmx exporter

https://gerrit.wikimedia.org/r/390968

Change 390968 merged by Elukey:
[operations/puppet@production] profile::druid::*: add prometheus jvm monitoring via jmx exporter

https://gerrit.wikimedia.org/r/390968

Change 390976 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::druid::monitoring::coordinator: fix source for jmx exporter

https://gerrit.wikimedia.org/r/390976

Change 390976 merged by Elukey:
[operations/puppet@production] profile::druid::monitoring::coordinator: fix source for jmx exporter

https://gerrit.wikimedia.org/r/390976

Mentioned in SAL (#wikimedia-operations) [2017-11-13T11:18:21Z] <elukey> restart of all the druid daemons on druid100[1-6] to apply the new prometheus jmx jvm exporters - T177459

Change 391007 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::prometheus::analytics: add druid jmx exporter settings

https://gerrit.wikimedia.org/r/391007

Change 391007 merged by Elukey:
[operations/puppet@production] role::prometheus::analytics: add druid jmx exporter settings

https://gerrit.wikimedia.org/r/391007

Change 391173 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] monitoring.yaml: add druid clusters

https://gerrit.wikimedia.org/r/391173

Change 391173 merged by Elukey:
[operations/puppet@production] monitoring.yaml: add druid clusters

https://gerrit.wikimedia.org/r/391173

Change 391179 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::prometheus::analytics: remove redundant target configs

https://gerrit.wikimedia.org/r/391179

Change 391179 merged by Elukey:
[operations/puppet@production] role::prometheus::analytics: remove redundant target configs

https://gerrit.wikimedia.org/r/391179

Change 389475 merged by Elukey:
[operations/software/druid_exporter@master] First commit

https://gerrit.wikimedia.org/r/389475

Change 392052 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::druid::*: add configuration for the Prometheus Druid exporter

https://gerrit.wikimedia.org/r/392052

Change 392052 merged by Elukey:
[operations/puppet@production] role::druid::*: add configuration for the Prometheus Druid exporter

https://gerrit.wikimedia.org/r/392052

Change 392422 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::druid::public::worker: add prometheus druid exporter

https://gerrit.wikimedia.org/r/392422

Change 392422 merged by Elukey:
[operations/puppet@production] role::druid::public::worker: add prometheus druid exporter

https://gerrit.wikimedia.org/r/392422

Change 392424 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/software/druid_exporter@master] Remove incomplete query/node/* metrics

https://gerrit.wikimedia.org/r/392424

Change 392424 merged by Elukey:
[operations/software/druid_exporter@master] Remove incomplete query/node/* metrics

https://gerrit.wikimedia.org/r/392424

Change 392841 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role:prometheus::analytics: add druid_exporter targets

https://gerrit.wikimedia.org/r/392841

Change 392841 merged by Elukey:
[operations/puppet@production] role:prometheus::analytics: add druid_exporter targets

https://gerrit.wikimedia.org/r/392841

Dashboard updated with new metrics: https://grafana.wikimedia.org/dashboard/db/prometheus-druid

There will probably be features to add and small bugs to fix, but in my opinion the goal of this task has been reached.

elukey set the point value for this task to 21. Nov 24 2017, 2:25 PM

Hi,

Is this prometheus druid metrics exporter open-sourced? really interested to integrate!

Awesome news! There is also a github mirror if you want to quickly check the code: https://github.com/wikimedia/operations-software-druid_exporter

We also have this dashboard: https://grafana.wikimedia.org/dashboard/db/prometheus-druid

If you have any questions feel free to follow up on IRC Freenode in #wikimedia-analytics.

Hi,

That's great, but does it support druid 0.10.0 already? The git repo says it only supports 0.9.2 by far.

We are currently using 0.9.2, so the exporter works with that version, but it should also be fine for 0.10 if the metric names/format have not changed:

http://druid.io/docs/0.9.2/operations/metrics.html vs http://druid.io/docs/0.10.0/operations/metrics.html

I still haven't checked/tested 0.10 but from a quick glance it should work fine.

Hi all,

I'm trying to run this program, but haven't figured out how.
I'm running 'python exporter.py -l xxxx:8000' on my server to produce all types of metrics, but it didn't work well. Can you show me the easiest way to get started quickly?
I guess I have to provide the hosts for both the Druid cluster and the Prometheus cluster, but I didn't find where to specify them. Please advise, thank you!

Hi! Can you give me a bit more detail about the "doesn't work well"?

https://github.com/wikimedia/operations-software-druid_exporter/blob/master/README.md is surely a good place to start, especially the "How does it work?" section. Druid must be configured to POST metrics to the Prometheus Druid exporter, which will collect/aggregate them and expose them when requested.
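If you want to check the plumbing by hand, a quick sketch like the following can simulate Druid's HTTP emitter. The host/port, URL path and event fields below are assumptions (check the README for the real defaults), and it is written for Python 2 to match the interpreter shown earlier in this task:

# Simulate a Druid HTTP emitter POST and then read back the /metrics page.
import json
import urllib2

events = [{
    'feed': 'metrics',
    'service': 'druid/broker',
    'metric': 'query/time',
    'value': 42,
}]

req = urllib2.Request(
    'http://localhost:8000/',
    data=json.dumps(events),
    headers={'Content-Type': 'application/json'})
urllib2.urlopen(req)

# The scraped view, i.e. what the Prometheus server would fetch:
print(urllib2.urlopen('http://localhost:8000/metrics').read())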

Hi,

I've made it work already! But I don't see any JVM/GC related metrics collected in Prometheus, which are displayed in your demo -> https://grafana.wikimedia.org/dashboard/db/prometheus-druid?orgId=1&var-datasource=eqiad%20prometheus%2Fanalytics&var-cluster=druid_analytics&var-druid_datasource=All .
I've also checked collector.py, which doesn't include any JVM/GC related metric items either; could you illustrate how to collect such metrics?

Good! We preferred not to include those metrics in the Druid exporter, because we already use https://github.com/prometheus/jmx_exporter for the JVM metrics (grabbed from MBeans), so they are not supported.

Druid.io doesn't seem to have a JMX interface, so how can I collect those JVM metrics with this jmx_exporter? I only see JVM/GC metrics produced by its HTTP emitter.

All the Druid daemons collect GC metrics via MBeans (I think) as part of the standard behavior of the JVM. We use the JMX Prometheus exporter as a javaagent (more info in their docs), so there is no need to even expose a JMX port.
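For reference, attaching the agent looks roughly like this in the daemon's JVM options; the jar path, port and config path below are placeholders, see the jmx_exporter docs for the exact invocation:

-javaagent:/usr/share/java/jmx_prometheus_javaagent.jar=9101:/etc/prometheus/jmx_exporter_druid.yaml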

Oh I see. Could you provide an example yaml file for applying this jmx_exporter to Druid, the one that produces the JVM metrics in your demo? Thank you!

We use the standard config:

---
lowercaseOutputLabelNames: true
lowercaseOutputName: false

That's it! This basically tells the jmx exporter to infer the metric names from the MBeans themselves. For Druid it does a pretty good job :)