Port application-specific metrics from ganglia to prometheus
Open, Normal, Public

Description

Machine-level metrics are covered in prometheus by node_exporter (tracked in T140646), but we also have application-specific metrics deployed in ganglia.
For prometheus to be a viable replacement for ganglia, it needs to provide at least the same metrics (if not better ones).

See also https://wikitech.wikimedia.org/wiki/Prometheus#Replacing_Ganglia for a list of ganglia plugins we are currently deploying. I'm listing below the ones I think are more important/urgent to have:

  • varnish
  • gdnsd
  • apache
  • vhtcpd
  • hhvm
  • memcache
  • redis
  • postgresql

Below is the list of rrds updated in the last 30d (from P4571) and their current status:

  • fundraising-related stats for misc queues and donations T152562
  • cirrussearch slow log rate, in graphite via logstash
  • apache mod_socache_shmcb stats, we don't seem to use mod_socache anyway
  • elasticsearch stats, afaict those are in graphite already
  • exim, can be done with diamond/graphite or in prometheus via node_exporter
  • jenkins TODO? some stats might be already in graphite
  • kafka, in graphite
  • varnishkafka, in graphite
  • osm sync lag from /srv/osmosis/state.txt
  • powerdns, in graphite via diamond
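
For the osm sync lag item above, a minimal sketch of how the lag could be derived from /srv/osmosis/state.txt and written out in the Prometheus text format (e.g. for node_exporter's textfile collector). This assumes state.txt is a Java-properties-style file with a `timestamp=` line whose colons are backslash-escaped; the metric name `osm_sync_lag_seconds` and the function names are hypothetical, not something already deployed.

```python
from datetime import datetime, timezone


def parse_state_timestamp(state_text):
    """Return the replication timestamp found in state.txt content.

    Assumes a line such as: timestamp=2016-12-06T00\\:00\\:00Z
    """
    for line in state_text.splitlines():
        line = line.strip()
        if line.startswith("timestamp="):
            # Properties files escape ':' as '\:', so drop the backslashes.
            raw = line.split("=", 1)[1].replace("\\", "")
            parsed = datetime.strptime(raw, "%Y-%m-%dT%H:%M:%SZ")
            return parsed.replace(tzinfo=timezone.utc)
    raise ValueError("no timestamp= line found in state file")


def sync_lag_seconds(state_text, now=None):
    """Seconds elapsed between the state timestamp and `now` (UTC)."""
    now = now or datetime.now(timezone.utc)
    return (now - parse_state_timestamp(state_text)).total_seconds()


def render_metric(lag, name="osm_sync_lag_seconds"):
    """Render a single gauge in the Prometheus text exposition format."""
    return (
        f"# HELP {name} Seconds since the last replicated OSM change.\n"
        f"# TYPE {name} gauge\n"
        f"{name} {lag}\n"
    )


if __name__ == "__main__":
    # Path taken from the task description.
    with open("/srv/osmosis/state.txt") as f:
        print(render_metric(sync_lag_seconds(f.read())), end="")
```

Run from cron and redirected into a `.prom` file in the textfile collector directory, this would expose the lag without needing a dedicated exporter.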
Restricted Application added a subscriber: Aklapper. Sep 14 2016, 4:24 PM

Change 310557 had a related patch set uploaded (by Filippo Giunchedi):
prometheus: add varnish_exporter

https://gerrit.wikimedia.org/r/310557

elukey added a subscriber: elukey. Sep 14 2016, 5:50 PM
fgiunchedi changed the title from "port application-specific metrics from ganglia to prometheus" to "Port application-specific metrics from ganglia to prometheus". Oct 5 2016, 1:51 PM
fgiunchedi triaged this task as "Normal" priority.

Change 310557 merged by Filippo Giunchedi:
prometheus: add varnish_exporter

https://gerrit.wikimedia.org/r/310557

elukey edited the task description. Oct 19 2016, 12:58 PM
fgiunchedi added a comment. Edited Dec 6 2016, 12:08 AM

I went over the ganglia rrds updated in the last 30d in P4571 and audited their origin and what to do, see task description.

fgiunchedi edited the task description. Dec 6 2016, 12:09 AM
fgiunchedi edited the task description. Dec 6 2016, 10:54 PM
Gehel edited the task description. Dec 7 2016, 9:07 AM
Gehel added a subscriber: Gehel.

There is no reason to duplicate elasticsearch metrics in both graphite and prometheus. Let's just not port those metrics.

@hashar re: the jenkins stats in ganglia listed in P4571, are all or some of them in graphite already?

The Jenkins metrics from P4571 can all be dropped. They are irrelevant to our setup:

First, the number of jobs per status: we do not use a linear build history, since jobs are used to test random patches having random parents. So those are unneeded:

jenkins_jobs_aborted.rrd
jenkins_jobs_blue.rrd
jenkins_jobs_disabled.rrd
jenkins_jobs_grey.rrd
jenkins_jobs_notbuilt.rrd
jenkins_jobs_red.rrd
jenkins_jobs_total.rrd
jenkins_jobs_yellow.rrd

Then there are stats related to the queue/busyness. Those are replaced by metrics from the CI systems that drive Jenkins (Zuul/Nodepool), e.g. the pool busyness on https://grafana.wikimedia.org/dashboard/db/nodepool?panelId=1&fullscreen and the Gearman queues on https://grafana.wikimedia.org/dashboard/db/releng-zuul

jenkins_overallload_busy_executors.rrd
jenkins_overallload_queue_length.rrd
jenkins_overallload_total_executors.rrd

So in short: we can drop all of those. What would be interesting though is to get Prometheus to monitor some statsd metrics and alarm when a threshold is passed. Then again, Grafana 4 seems to have such support, so maybe that would be redundant with Prometheus.

fgiunchedi edited the task description. Dec 16 2016, 5:45 PM
fgiunchedi edited the task description. Jan 3 2017, 11:54 PM
hashar removed a subscriber: hashar. Jan 4 2017, 10:15 AM

Change 331097 had a related patch set uploaded (by Filippo Giunchedi):
ganglia: display deprecation banner

https://gerrit.wikimedia.org/r/331097

Change 331097 merged by Dzahn:
ganglia: display deprecation banner

https://gerrit.wikimedia.org/r/331097

Dzahn added a subscriber: Dzahn. Feb 1 2017, 12:56 AM

Ganglia now shows a deprecation banner.

Mentioned in SAL (#wikimedia-operations) [2017-02-01T00:57:48Z] <mutante> Ganglia is now deprecated in favor of Grafana (https://phabricator.wikimedia.org/T145659#2925104)

fgiunchedi edited the task description. Feb 1 2017, 11:16 AM