
Port application-specific metrics from ganglia to prometheus
Closed, Resolved, Public


Machine-level metrics are covered in prometheus by node_exporter (tracked in T140646), but we also have application-specific metrics deployed in ganglia.
For prometheus to be a viable replacement for ganglia, it needs to have at least the same metrics (if not better ones).

See also for a list of ganglia plugins we are currently deploying. I'm listing below the ones I think are more important/urgent to have:

  • varnish
  • gdnsd
  • apache
  • vhtcpd
  • hhvm
  • memcache
  • redis (see T177196)
  • postgresql (see T177196)

The rrds updated in the last 30d are listed in P4571; their current status is:

  • fundraising-related stats for misc queues and donations T152562
  • cirrussearch slow log rate, in graphite via logstash
  • apache mod_socache_shmcb stats, we don't seem to use mod_socache anyway
  • elasticsearch stats, afaict those are in graphite already
  • exim, can be done with diamond/graphite or in prometheus via node_exporter (see T177196)
  • jenkins TODO? some stats might be already in graphite
  • kafka, in graphite
  • varnishkafka, in graphite
  • osm sync lag from /srv/osmosis/state.txt
  • powerdns, in graphite via diamond
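For the osm sync lag item above, a minimal sketch of how the lag could be derived from the state file and rendered in the Prometheus text format. It assumes `state.txt` is a Java-properties style file with a `timestamp=` key in UTC and colons escaped as `\:`. The metric name `osm_sync_lag_seconds` is hypothetical, and nothing here is taken from the existing ganglia plugin.

```python
from datetime import datetime, timezone


def parse_state(text):
    """Parse key=value lines, skipping blank lines and '#' comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        # Assumption: properties-style escaping, so '\:' stands for ':'.
        props[key.strip()] = value.strip().replace("\\:", ":")
    return props


def sync_lag_seconds(state_text, now=None):
    """Seconds between the state file's timestamp and `now` (UTC)."""
    props = parse_state(state_text)
    ts = datetime.strptime(props["timestamp"], "%Y-%m-%dT%H:%M:%SZ")
    ts = ts.replace(tzinfo=timezone.utc)
    if now is None:
        now = datetime.now(timezone.utc)
    return (now - ts).total_seconds()


def render_metric(lag):
    """Render the lag as a gauge in the Prometheus text exposition format."""
    # Hypothetical metric name, chosen for illustration only.
    return (
        "# TYPE osm_sync_lag_seconds gauge\n"
        "osm_sync_lag_seconds {:.0f}\n".format(lag)
    )
```

One way to use it would be a periodic job that reads `/srv/osmosis/state.txt`, calls `render_metric(sync_lag_seconds(contents))`, and writes the result to a file picked up by node_exporter's textfile collector.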

Event Timeline

Change 310557 had a related patch set uploaded (by Filippo Giunchedi):
prometheus: add varnish_exporter

fgiunchedi renamed this task from port application-specific metrics from ganglia to prometheus to Port application-specific metrics from ganglia to prometheus. Oct 5 2016, 1:51 PM
fgiunchedi triaged this task as Medium priority.

Change 310557 merged by Filippo Giunchedi:
prometheus: add varnish_exporter

I went over the ganglia rrds updated in the last 30d in P4571 and audited their origin and what to do with each; see the task description.
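For reference, a list like that could be produced along these lines. This is only a sketch: it uses file modification time as a proxy for "updated", the root directory passed in is whatever path the ganglia rrds live under (not stated in this task), and it is not necessarily how P4571 was generated.

```python
import os
import time


def recently_updated_rrds(root, days=30, now=None):
    """Return sorted paths of *.rrd files under `root` modified in the last `days` days."""
    if now is None:
        now = time.time()
    cutoff = now - days * 86400
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(".rrd"):
                continue
            path = os.path.join(dirpath, name)
            # mtime is used as a stand-in for "last updated".
            if os.path.getmtime(path) >= cutoff:
                found.append(path)
    return sorted(found)
```

Calling `recently_updated_rrds(root)` with the rrd root directory returns the candidate files to audit.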

Gehel added a subscriber: Gehel.

There is no reason to duplicate elasticsearch metrics in both graphite and prometheus. Let's just not port those metrics.

@hashar re: the jenkins stats in ganglia in P4571 are they all/some in graphite already?

The Jenkins metrics from P4571 can all be dropped. They are irrelevant to our setup:

The number of jobs per status: we do not use a linear build history, since jobs are used to test arbitrary patches with arbitrary parents. So those are unneeded:


Then there are stats related to the queue/busyness. Those are replaced by metrics from the CI systems that drive Jenkins (Zuul/Nodepool), e.g. the pool busyness and the Gearman queues on


So in short: we can drop all of those. What would be interesting, though, is to get Prometheus to monitor some statsd metrics and raise an alarm when a threshold is passed. Grafana 4 seems to have such support, so maybe that would be redundant with Prometheus.
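To make the threshold idea concrete, a rough sketch of the check itself. In Prometheus this would normally be an alerting rule over scraped metrics rather than code like this. The sketch assumes the plain statsd line format `name:value|type` and only considers absolute gauge values (type `g`); the metric name in the usage note is made up.

```python
def parse_statsd_line(line):
    """Split a statsd line 'name:value|type[|@rate]' into (name, value, type)."""
    name, _, rest = line.strip().partition(":")
    fields = rest.split("|")
    return name, float(fields[0]), fields[1]


def over_threshold(lines, metric, threshold):
    """True if the most recent gauge sample for `metric` exceeds `threshold`."""
    latest = None
    for line in lines:
        name, value, kind = parse_statsd_line(line)
        if name == metric and kind == "g":
            latest = value
    return latest is not None and latest > threshold
```

For example, `over_threshold(received_lines, "ci.queue.size", 50)` would flag a queue gauge that last reported more than 50.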

Change 331097 had a related patch set uploaded (by Filippo Giunchedi):
ganglia: display deprecation banner

Change 331097 merged by Dzahn:
ganglia: display deprecation banner

Ganglia now shows a deprecation banner.

Screenshot from 2017-01-31 16-55-28.png (367×1 px, 47 KB)

Change 355741 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] stop using exim4::ganglia

I uploaded a patch to stop using the exim4::ganglia plugin when I saw it play a role in the spam-from-labs issue in T166322. Then I saw that this ticket still has the check box for exim open. So it seems we don't have a replacement for these yet, right? (The exim stats are hard to see in the ganglia web UI because you have to click "no_group metrics" first.)

@Dzahn correct, there's no replacement yet. The easiest option ATM is likely the extendedeximcollector for diamond that we already use in toollabs.

Change 355741 abandoned by Dzahn:
stop using exim4::ganglia

Resolving as the work will be completed in T177196 by porting the missing Diamond collectors.

fgiunchedi claimed this task.

@fgiunchedi I was wondering if there is something here that replaces the PacketLossLogtailer from udp2log. Then again, udp2log is on the way out too, right? I see Ottomata said +1 to removing it.

@Dzahn not afaik, though as you said udp2log is on its way out, so we can live without PacketLossLogtailer IMO