Page MenuHomePhabricator

Add prometheus metrics for varnishkafka instances running on caching hosts
Closed, ResolvedPublic

Description

Each varnishkafka is currently set to write stats periodically to a file like /var/cache/varnishkafka/$instance_name.stats.json. We have a root cron that executes every minute logster, that reads the json and pushes the latest metrics to graphite.

Rather than adding HTTP support to varnishkafka to poll the same set of metrics, we could simply write a python script that reads the json file, picks the last datapoints and then exposes the metrics to Prometheus when polled.

Event Timeline

elukey triaged this task as Normal priority.May 31 2018, 12:21 PM
elukey created this task.
Restricted Application added a project: Operations. · View Herald TranscriptMay 31 2018, 12:21 PM
Restricted Application added a subscriber: Aklapper. · View Herald Transcript

As reference, prometheus::node_gdnsd might be an example about how to proceed.

ema moved this task from Triage to Caching on the Traffic board.Jun 4 2018, 9:09 AM
Vvjjkkii renamed this task from Add prometheus metrics for varnishkafka instances running on caching hosts to 1wbaaaaaaa.Jul 1 2018, 1:07 AM
Vvjjkkii removed elukey as the assignee of this task.
Vvjjkkii raised the priority of this task from Normal to High.
Vvjjkkii updated the task description. (Show Details)
Vvjjkkii removed a subscriber: Aklapper.
CommunityTechBot renamed this task from 1wbaaaaaaa to Add prometheus metrics for varnishkafka instances running on caching hosts.Jul 1 2018, 3:18 PM
CommunityTechBot assigned this task to elukey.
CommunityTechBot raised the priority of this task from High to Needs Triage.
CommunityTechBot updated the task description. (Show Details)
CommunityTechBot added a subscriber: Aklapper.
CommunityTechBot triaged this task as Normal priority.Jul 3 2018, 2:24 AM
mforns claimed this task.Oct 19 2018, 1:47 PM
mforns moved this task from Next Up to In Progress on the Analytics-Kanban board.Nov 13 2018, 2:18 PM
elukey added a comment.EditedNov 13 2018, 5:33 PM

@mforns created https://gerrit.wikimedia.org/r/#/admin/projects/operations/debs/prometheus-varnishkafka-exporter

Also added /home/mforns/webrequest.stats.json on stat1007 with an example of librdkafka metrics file that you can use :)

ema added a subscriber: ema.Mar 26 2019, 2:05 PM
mforns moved this task from In Progress to Paused on the Analytics-Kanban board.Apr 17 2019, 3:45 PM

Change 507632 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/debs/prometheus-varnishkafka-exporter@master] initial attempt at a varnishkafka exporter

https://gerrit.wikimedia.org/r/507632

Hm, sorry for this probably too late idea...but would it be worth building a C based prometheus plugin for librdkafka and/or varnishkafka instead or parsing this JSON file? librdkafka has built in constructs for callbacks to collect stats.

https://github.com/edenhill/librdkafka/blob/master/STATISTICS.md
https://github.com/edenhill/librdkafka/blob/master/src/rdkafka.h#L1749-L1775

elukey added a comment.May 2 2019, 2:30 PM

@Ottomata ahhhh you mean in varnishkafka itself! I thought that it would have needed a change in librdkafka.. So you are saying basically that we could create a C function in varnishkafka and register it via rd_kafka_conf_set_stats_cb or similar.. It would be nice, but from the rdkafka.h header it seems that the target is a json structure so we'd need to list all the metric names + values and read them anyway from the exporter no? (Trying to understand the pros/cons of one approach vs the other).

I think varnishkafka is already using this callback to write the stats out to the json file. I guess a pro would be not to have to manage a separate process...but then again maybe it is nice to have it completely separate. :)

elukey added a comment.May 2 2019, 2:40 PM

But something would need to be created (a simple exporter) to read the json with the Prometheus metrics and then report them via HTTP right? rd_kafka_conf_set_stats_cb seems to emit only json, but I might be wrong.. If we were able to emit a .prom file with no json but only the list of metrics and their values, we would be able to let the regular host exporter to read it and report metrics IIUC, but not with other formats.

Aye yeah I guess there'd have to be some pull service, ya.

Maybe converting whatever varnishkafka does to emit json to instead emit the prometheus format?

BTW, here's the nodejs code we use to convert from rdkafka json to statsd:
https://github.com/wikimedia/node-rdkafka-statsd

statsd-exporter then converts that using https://github.com/wikimedia/operations-deployment-charts/blob/master/charts/eventgate-analytics/config/prometheus-statsd.conf to prometheus.

ema added a comment.May 2 2019, 2:51 PM

Hm, sorry for this probably too late idea...but would it be worth building a C based prometheus plugin for librdkafka and/or varnishkafka instead or parsing this JSON file?

I'd suggest not spending too much time on this, just go for the simplest solution (probably parsing the JSON file?) given that varnish will eventually be replaced by ATS.

Ya good point

Hm but also, whatever we replace varnishkafka with will likely be librdkafka based. Perhaps a librdkafka/prometheus stats plugin would be more useful than for just varnishkafka.

Anyway, I'm fine with either solution, just wanted to put that idea out there.

Hm but also, whatever we replace varnishkafka with will likely be librdkafka based. Perhaps a librdkafka/prometheus stats plugin would be more useful than for just varnishkafka.

I think the general idea of having librdkafka with some Prometheus support is great, if anything else to have consistent naming for Prometheus metrics from all rdkafka users. We're offtopic here by now, what would be the best way to pitch the idea to librdkafka upstream?

I don't think Magnus would build it into librdkafka itself. But, a 3rd party C library could be built that apps could use as the rdkafka stats callback. varnishkafka/ats-kafka/eventgate+node-rdkafka would then require this C library and register some function with librdkafka. Maybe?

I don't think Magnus would build it into librdkafka itself. But, a 3rd party C library could be built that apps could use as the rdkafka stats callback. varnishkafka/ats-kafka/eventgate+node-rdkafka would then require this C library and register some function with librdkafka. Maybe?

After a brief chat with Andrew we asked kafka-clients folks for tips/insights: https://groups.google.com/forum/#!topic/kafka-clients/LkVkWXSDfd8

fgiunchedi reassigned this task from mforns to colewhite.EditedMay 9 2019, 3:05 PM
fgiunchedi added a subscriber: mforns.

Reporting here a chat between me and @Ottomata re: metric naming and librkdkafka.

The idea being that we'd like consistent metric names across librdkafka users (e.g. eventgate, varnishkafka). Getting consistent labels will be trickier but we should try anyways; https://gist.github.com/ottomata/360bdf87cc85caf0b27d15a709d04e2a contains an example of current metrics from eventgate, while P8501 has current varnishkafka-exporter metrics.

The proposal is to use rdkafka as a prefix for the metric name (i.e. no service name) and append the metric name from librdkafka with as little transformation as possible, additionally there are two "branches" in the librdkafka stats object (topics and brokers) that will also need to be flattened.

I think there are a few more branches:

  • producer, consumer
  • global, per broker, per topic, per topic-partition

The various combinations of these branches should all be in the metric name since together they ID specific metrics. Examples!

  • rdkafka_producer_txmsgs (global)
  • rdkafka_producer_broker_outbuf_msg_cnt (per broker)
  • rdkafka_producer_topic_batchsize (per topic)
  • rdkafka_producer_topic_partition_msgq_cnt (per topic-partition)

I don't know if I have good example consumer metrics on hand, but we should support these too. EventStreams is a service-runner + node-rdkafka consumer that currently uses statsd. Whenever we move it to k8s we'd use the same statsd exporter mappings we use for EventGate. I believe there will be similar metric name branches, but there will be some different final metric names, with different labels too (e.g. consumer_group).

Oh check it out!

https://github.com/edenhill/librdkafka/blob/master/STATISTICS.md

There is:

  • type - Instance type (producer or consumer)

We could use this to add producer or consumer to the metric name.

  • client_id - The configured (or default) client.id

We set this in EventGate. If we exported this via node-rdkafka-statsd (oof, might get messy), we could do away with the EventGate specific producer_type label and just use this. In EventGate we currently set client.id to 'eventgate-producer-hasty` or 'eventgate-producer-guaranteed'. Would be easy to do away with producer_type altogether and make client_id a standard rdkafka label instead.

BTW for reference here's what EvenGate is currently exporting:

https://gist.github.com/ottomata/360bdf87cc85caf0b27d15a709d04e2a

I agree with dropping the prefix in favor of "rdkafka".

I see the type branch (producer and consumer) as valuable, but no preference as to whether they should be labels or in the metric name.

I'm not sure I see the value in breaking up the broker name into broker_hostname, broker_id, and broker_port labels. Interested to hear from others on this point, though.

I think there are a few more branches:

  • producer, consumer
  • global, per broker, per topic, per topic-partition

The various combinations of these branches should all be in the metric name since together they ID specific metrics. Examples!

  • rdkafka_producer_txmsgs (global)
  • rdkafka_producer_broker_outbuf_msg_cnt (per broker)
  • rdkafka_producer_topic_batchsize (per topic)
  • rdkafka_producer_topic_partition_msgq_cnt (per topic-partition)

I don't know if I have good example consumer metrics on hand, but we should support these too. EventStreams is a service-runner + node-rdkafka consumer that currently uses statsd. Whenever we move it to k8s we'd use the same statsd exporter mappings we use for EventGate. I believe there will be similar metric name branches, but there will be some different final metric names, with different labels too (e.g. consumer_group).

Oh check it out!
https://github.com/edenhill/librdkafka/blob/master/STATISTICS.md
There is:

  • type - Instance type (producer or consumer)

We could use this to add producer or consumer to the metric name.

Agreed, rdkafka_<type>_<branch if not global>_<name> sounds good to me!

  • client_id - The configured (or default) client.id

We set this in EventGate. If we exported this via node-rdkafka-statsd (oof, might get messy), we could do away with the EventGate specific producer_type label and just use this. In EventGate we currently set client.id to 'eventgate-producer-hasty` or 'eventgate-producer-guaranteed'. Would be easy to do away with producer_type altogether and make client_id a standard rdkafka label instead.

Sounds good to me to have client_id be one of the standard labels.

I agree with dropping the prefix in favor of "rdkafka".
I see the type branch (producer and consumer) as valuable, but no preference as to whether they should be labels or in the metric name.
I'm not sure I see the value in breaking up the broker name into broker_hostname, broker_id, and broker_port labels. Interested to hear from others on this point, though.

I'm +1 on keeping the parsing of label values at a minimum, IOW having broker_name = <name> (name from rdkafka object) unparsed I think would be good. Unless there are use cases for splitting/parsing the url I don't know about ?

I'm not sure I see the value in breaking up the broker name into broker_hostname, broker_id, and broker_port labels.

Unless there are use cases for splitting/parsing the url I don't know about ?

I mostly like them split for ease of reading in dashboards. Maybe e.g. broker=kafka1001.eqiad.wmnet:9092 and broker_id=1001? I'm not too pick though I'd be fine with a full broker_name with id too.

Agreed, rdkafka_<type>_<branch if not global>_<name> sounds good to me!

+1

Change 507632 merged by Cwhite:
[operations/debs/prometheus-varnishkafka-exporter@master] initial attempt at a varnishkafka exporter

https://gerrit.wikimedia.org/r/507632

Change 521580 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/debs/prometheus-varnishkafka-exporter@master] set up debian packaging

https://gerrit.wikimedia.org/r/521580

Change 521580 merged by Cwhite:
[operations/debs/prometheus-varnishkafka-exporter@master] set up debian packaging

https://gerrit.wikimedia.org/r/521580

Mentioned in SAL (#wikimedia-operations) [2019-07-12T19:17:04Z] <shdubsh> add prometheus-varnishkafka-exporter 0.1 to apt repo T196066

Change 522556 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] prometheus: wire up prometheus-varnishkafka-exporter for deploy

https://gerrit.wikimedia.org/r/522556

Change 524288 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] hiera: canary enable varnishkafka_exporter on cp1088

https://gerrit.wikimedia.org/r/524288

Change 522556 merged by Cwhite:
[operations/puppet@production] prometheus: wire up prometheus-varnishkafka-exporter for deploy

https://gerrit.wikimedia.org/r/522556

Change 524288 merged by Cwhite:
[operations/puppet@production] hiera: canary enable varnishkafka_exporter on cp1088

https://gerrit.wikimedia.org/r/524288

Change 524545 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] prometheus: refresh service on configuration change

https://gerrit.wikimedia.org/r/524545

Change 524561 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] prometheus: add varnishkafka job to prometheus scrapes

https://gerrit.wikimedia.org/r/524561

Change 524622 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] hiera: deploy varnishkafka exporter to eqsin

https://gerrit.wikimedia.org/r/524622

Change 524561 merged by Cwhite:
[operations/puppet@production] prometheus: add varnishkafka job to prometheus scrapes

https://gerrit.wikimedia.org/r/524561

Change 524545 merged by Cwhite:
[operations/puppet@production] prometheus: refresh prometheus-varnishkafka-exporter on configuration change

https://gerrit.wikimedia.org/r/524545

Change 524622 merged by Cwhite:
[operations/puppet@production] hiera: deploy varnishkafka exporter to eqsin

https://gerrit.wikimedia.org/r/524622

Change 524930 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] hiera: deploy varnishkafka exporter to ulsfo

https://gerrit.wikimedia.org/r/524930

Change 524931 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] hiera: deploy varnishkafka exporter to esams

https://gerrit.wikimedia.org/r/524931

Change 524932 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] hiera: deploy varnishkafka exporter to codfw

https://gerrit.wikimedia.org/r/524932

Change 524933 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] hiera: deploy varnishkafka exporter to eqiad

https://gerrit.wikimedia.org/r/524933

Change 524934 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] hiera: cleanup per-site varnishkafka deploy flags

https://gerrit.wikimedia.org/r/524934

Change 524930 merged by Cwhite:
[operations/puppet@production] hiera: deploy varnishkafka exporter to ulsfo

https://gerrit.wikimedia.org/r/524930

Change 524931 merged by Cwhite:
[operations/puppet@production] hiera: deploy varnishkafka exporter to esams

https://gerrit.wikimedia.org/r/524931

Change 524933 merged by Cwhite:
[operations/puppet@production] hiera: deploy varnishkafka exporter to eqiad

https://gerrit.wikimedia.org/r/524933

Change 524932 merged by Cwhite:
[operations/puppet@production] hiera: deploy varnishkafka exporter to codfw

https://gerrit.wikimedia.org/r/524932

Change 524934 merged by Cwhite:
[operations/puppet@production] profile: cleanup per-site varnishkafka deploy flags

https://gerrit.wikimedia.org/r/524934

colewhite closed this task as Resolved.Mon, Jul 29, 3:38 PM

@colewhite awesome work, thanks! Should I start working on creating a new varnishkafka dashboard and compare graphite vs prometheus metrics? (so we'll be able to deprecate the graphite metrics entirely).

Sounds good to me! Happy to help however I can, just let me know.

Change 526397 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] labs: set prometheus::varnishkafka_exporter::stats_default

https://gerrit.wikimedia.org/r/526397

Change 526397 merged by Ema:
[operations/puppet@production] labs: set prometheus::varnishkafka_exporter::stats_default

https://gerrit.wikimedia.org/r/526397

Change 528439 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] ATS: add prometheus::varnishkafka_exporter::config

https://gerrit.wikimedia.org/r/528439

Change 528439 merged by Ema:
[operations/puppet@production] ATS: add prometheus::varnishkafka_exporter::config

https://gerrit.wikimedia.org/r/528439