Port application-specific metrics from ganglia to prometheus
Closed, Resolved (Public)

Description

Machine-level metrics are covered in prometheus by node_exporter (tracked in T140646) though we also have application-specific metrics deployed in ganglia.
For prometheus to be a viable replacement for ganglia we'd have to have at least the same metrics (if not better) in prometheus too.

See also https://wikitech.wikimedia.org/wiki/Prometheus#Replacing_Ganglia for a list of ganglia plugins we are currently deploying. I'm listing below the ones I think are more important/urgent to have:

  • varnish
  • gdnsd
  • apache
  • vhtcpd
  • hhvm
  • memcache
  • redis (see T177196)
  • postgresql (see T177196)

Below is the list of rrds updated in the last 30d (see P4571) and their current status:

  • fundraising-related stats for misc queues and donations T152562
  • cirrussearch slow log rate, in graphite via logstash
  • apache mod_socache_shmcb stats, we don't seem to use mod_socache anyway
  • elasticsearch stats, afaict those are in graphite already
  • exim, can be done with diamond/graphite or in prometheus via node_exporter (see T177196)
  • jenkins TODO? some stats might be already in graphite
  • kafka, in graphite
  • varnishkafka, in graphite
  • osm sync lag from /srv/osmosis/state.txt
  • powerdns, in graphite via diamond
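
For the osm sync lag item above, a minimal sketch of how the lag could be derived from /srv/osmosis/state.txt and written out in the Prometheus text exposition format (for example via node_exporter's textfile collector). This assumes state.txt is the usual osmosis properties-style file with a `timestamp=` line whose colons are escaped as `\:`. The metric name `osm_sync_lag_seconds` and the function names are hypothetical, not something already deployed.

```python
from datetime import datetime, timezone


def parse_state_timestamp(state_text):
    """Return the replication timestamp found in osmosis state.txt contents."""
    for line in state_text.splitlines():
        line = line.strip()
        if line.startswith("timestamp="):
            # Assumption: Java properties escaping, so colons appear as "\:".
            raw = line.split("=", 1)[1].replace("\\:", ":")
            parsed = datetime.strptime(raw, "%Y-%m-%dT%H:%M:%SZ")
            return parsed.replace(tzinfo=timezone.utc)
    raise ValueError("no timestamp= line found in state file")


def sync_lag_seconds(state_text, now=None):
    """Seconds elapsed between the state timestamp and `now` (UTC)."""
    if now is None:
        now = datetime.now(timezone.utc)
    return (now - parse_state_timestamp(state_text)).total_seconds()


def render_textfile_metric(lag_seconds):
    """Render a single gauge in the Prometheus text exposition format."""
    name = "osm_sync_lag_seconds"  # hypothetical metric name
    return (
        f"# HELP {name} Seconds since the last replicated OSM change.\n"
        f"# TYPE {name} gauge\n"
        f"{name} {lag_seconds}\n"
    )


if __name__ == "__main__":
    # Path taken from the task description; run on a host that has it.
    with open("/srv/osmosis/state.txt") as state_file:
        print(render_textfile_metric(sync_lag_seconds(state_file.read())), end="")
```

Run from cron with the output redirected to a `.prom` file in the textfile collector directory, this would expose the lag as a gauge without needing a dedicated exporter.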

Event Timeline

Change 310557 had a related patch set uploaded (by Filippo Giunchedi):
prometheus: add varnish_exporter

https://gerrit.wikimedia.org/r/310557

fgiunchedi renamed this task from port application-specific metrics from ganglia to prometheus to Port application-specific metrics from ganglia to prometheus. Oct 5 2016, 1:51 PM
fgiunchedi triaged this task as Medium priority.

Change 310557 merged by Filippo Giunchedi:
prometheus: add varnish_exporter

https://gerrit.wikimedia.org/r/310557

I went over the ganglia rrds updated in the last 30d in P4571 and audited their origin and what to do, see task description.

Gehel added a subscriber: Gehel.

There is no reason to duplicate elasticsearch metrics in both graphite and prometheus. Let's just not port those metrics.

@hashar re: the jenkins stats in ganglia in P4571, are all or some of them in graphite already?

The Jenkins metrics from P4571 can all be dropped. They are irrelevant to our setup:

First, the number of jobs per status. We do not use a linear build history, since jobs are used to test random patches having random parents, so those are unneeded:

jenkins_jobs_aborted.rrd
jenkins_jobs_blue.rrd
jenkins_jobs_disabled.rrd
jenkins_jobs_grey.rrd
jenkins_jobs_notbuilt.rrd
jenkins_jobs_red.rrd
jenkins_jobs_total.rrd
jenkins_jobs_yellow.rrd

Then there are stats related to the queue/busyness. Those are replaced by metrics from the CI systems that drive Jenkins (Zuul/Nodepool), e.g. the pool busyness at https://grafana.wikimedia.org/dashboard/db/nodepool?panelId=1&fullscreen and the Gearman queues at https://grafana.wikimedia.org/dashboard/db/releng-zuul

jenkins_overallload_busy_executors.rrd
jenkins_overallload_queue_length.rrd
jenkins_overallload_total_executors.rrd

So in short: we can drop all of those. What would be interesting, though, is to get Prometheus to monitor some statsd metrics and alert when a threshold is passed. Grafana 4 seems to have such support, however, so maybe that would be redundant with Prometheus.

Change 331097 had a related patch set uploaded (by Filippo Giunchedi):
ganglia: display deprecation banner

https://gerrit.wikimedia.org/r/331097

Change 331097 merged by Dzahn:
ganglia: display deprecation banner

https://gerrit.wikimedia.org/r/331097

Ganglia now shows a deprecation banner.

Screenshot from 2017-01-31 16-55-28.png (367×1 px, 47 KB)

Change 355741 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] stop using exim4::ganglia

https://gerrit.wikimedia.org/r/355741

I uploaded a patch to stop using the exim4::ganglia plugin when I saw it play a role in the spam-from-labs issue in T166322. Afterwards I saw that this ticket still has the open check box for exim. So it seems we don't have a replacement for these yet, right? (The exim stats are hard to see in the ganglia web UI because you have to click "no_group metrics" first.)

@Dzahn correct, there's no replacement yet. The easiest option ATM is likely the extendedeximcollector for diamond that we already use in toollabs.

Change 355741 abandoned by Dzahn:
stop using exim4::ganglia

https://gerrit.wikimedia.org/r/355741

Resolving as the work will be completed in T177196 by porting the missing Diamond collectors.

fgiunchedi claimed this task.

@fgiunchedi I was wondering if there is something here that replaces the PacketLossLogtailer from udp2log (https://gerrit.wikimedia.org/r/#/c/382913/). udp2log is on the way out too, right? I see Ottomata said +1 to removing it.

@Dzahn not afaik, though as you said udp2log is on its way out so we can live without packetlosslogtailer IMO