Port application-specific metrics from ganglia to prometheus
Closed, ResolvedPublic

Description

Machine-level metrics are covered in prometheus by node_exporter (tracked in T140646) though we also have application-specific metrics deployed in ganglia.
For prometheus to be a viable replacement for ganglia we'd have to have at least the same metrics (if not better) in prometheus too.

See also https://wikitech.wikimedia.org/wiki/Prometheus#Replacing_Ganglia for a list of the ganglia plugins we currently deploy. Below are the ones I think are most important/urgent to have:

  • varnish
  • gdnsd
  • apache
  • vhtcpd
  • hhvm
  • memcache
  • redis (see T177196)
  • postgresql (see T177196)

The rrds updated in the last 30d are listed in P4571; their current status is:

  • fundraising-related stats for misc queues and donations T152562
  • cirrussearch slow log rate, in graphite via logstash
  • apache mod_socache_shmcb stats, we don't seem to use mod_socache anyway
  • elasticsearch stats, afaict those are in graphite already
  • exim, can be done with diamond/graphite or in prometheus via node_exporter (see T177196)
  • jenkins TODO? some stats might be already in graphite
  • kafka, in graphite
  • varnishkafka, in graphite
  • osm sync lag from /srv/osmosis/state.txt
  • powerdns, in graphite via diamond
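
For the osm sync lag item above, a minimal sketch of how the lag could be derived from /srv/osmosis/state.txt and written out in the Prometheus text format (for example for node_exporter's textfile collector). This assumes state.txt follows the usual osmosis layout, with a `timestamp=` line whose colons are backslash-escaped; the metric name `osm_sync_lag_seconds` and the function names are hypothetical, not something deployed.

```python
from datetime import datetime, timezone


def parse_state_timestamp(text):
    """Return the replication timestamp from osmosis state.txt content (assumed format)."""
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("timestamp="):
            # Java properties files escape colons, e.g. 2016-10-01T00\:00\:00Z
            raw = line.split("=", 1)[1].replace("\\:", ":")
            parsed = datetime.strptime(raw, "%Y-%m-%dT%H:%M:%SZ")
            return parsed.replace(tzinfo=timezone.utc)
    raise ValueError("no timestamp= line found")


def sync_lag_seconds(text, now=None):
    """Seconds between `now` (default: current UTC time) and the state timestamp."""
    if now is None:
        now = datetime.now(timezone.utc)
    return (now - parse_state_timestamp(text)).total_seconds()


def render_metric(lag):
    """Render the lag as a gauge in the Prometheus text exposition format."""
    return (
        "# HELP osm_sync_lag_seconds Seconds since the last replicated OSM change.\n"
        "# TYPE osm_sync_lag_seconds gauge\n"
        f"osm_sync_lag_seconds {lag:.0f}\n"
    )
```

A cron job could read the file, call `render_metric(sync_lag_seconds(content))`, and write the result to a `.prom` file in whatever directory the textfile collector is configured to read.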

Event Timeline

Change 310557 had a related patch set uploaded (by Filippo Giunchedi):
prometheus: add varnish_exporter

https://gerrit.wikimedia.org/r/310557

fgiunchedi renamed this task from "port application-specific metrics from ganglia to prometheus" to "Port application-specific metrics from ganglia to prometheus". Oct 5 2016, 1:51 PM
fgiunchedi triaged this task as Medium priority.

Change 310557 merged by Filippo Giunchedi:
prometheus: add varnish_exporter

https://gerrit.wikimedia.org/r/310557

I went over the ganglia rrds updated in the last 30d in P4571 and audited their origin and what to do, see task description.

Gehel subscribed.

There is no reason to duplicate elasticsearch metrics in both graphite and prometheus. Let's just not port those metrics.

@hashar re: the jenkins stats in ganglia in P4571, are all or some of them in graphite already?

The Jenkins metrics from P4571 can all be dropped. They are irrelevant to our setup:

The number of jobs per status: we do not use a linear build history, since jobs are used to test random patches with random parents. So these are not needed:

jenkins_jobs_aborted.rrd
jenkins_jobs_blue.rrd
jenkins_jobs_disabled.rrd
jenkins_jobs_grey.rrd
jenkins_jobs_notbuilt.rrd
jenkins_jobs_red.rrd
jenkins_jobs_total.rrd
jenkins_jobs_yellow.rrd

Then there are stats related to the queue/busyness. Those are replaced by metrics from the CI systems that drive Jenkins (Zuul/Nodepool), e.g. the pool busyness at https://grafana.wikimedia.org/dashboard/db/nodepool?panelId=1&fullscreen and the Gearman queues at https://grafana.wikimedia.org/dashboard/db/releng-zuul

jenkins_overallload_busy_executors.rrd
jenkins_overallload_queue_length.rrd
jenkins_overallload_total_executors.rrd

So in short: we can drop all of those. What would be interesting though is to get Prometheus to monitor some statsd metrics and alert when a threshold is passed. Then again, Grafana 4 seems to have such support, so maybe that is redundant with Prometheus.

Change 331097 had a related patch set uploaded (by Filippo Giunchedi):
ganglia: display deprecation banner

https://gerrit.wikimedia.org/r/331097

Change 331097 merged by Dzahn:
ganglia: display deprecation banner

https://gerrit.wikimedia.org/r/331097

Ganglia now shows a deprecation banner.

[Screenshot from 2017-01-31 16-55-28.png]

Change 355741 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] stop using exim4::ganglia

https://gerrit.wikimedia.org/r/355741

I uploaded a patch to stop using the exim4::ganglia plugin when I saw it play a role in the spam-from-labs issue in T166322. Only afterwards did I see that this ticket still has the open check box for exim. So it seems we don't have a replacement for these yet, right? (The exim stats are hard to see in the ganglia web UI because you have to click "no_group metrics" first.)

@Dzahn correct, there's no replacement yet. The easiest ATM is likely to use the extendedeximcollector for diamond that we already use in toollabs.

Change 355741 abandoned by Dzahn:
stop using exim4::ganglia

https://gerrit.wikimedia.org/r/355741

Resolving as the work will be completed in T177196 by porting the missing Diamond collectors.

fgiunchedi claimed this task.

@fgiunchedi I was wondering if there is something here that replaces the PacketLossLogtailer from udp2log (https://gerrit.wikimedia.org/r/#/c/382913/). udp2log is on the way out too, right? I see Ottomata said +1 to removing it.

@Dzahn not afaik, though as you said udp2log is on its way out so we can live without packetlosslogtailer IMO