Change Details

Machine-level metrics are covered in prometheus by `node_exporter` (tracked in T140646), though we also have application-specific metrics deployed in ganglia. For prometheus to be a viable replacement for ganglia, we'd need at least the same metrics (if not better) in prometheus too.

See also https://wikitech.wikimedia.org/wiki/Prometheus#Replacing_Ganglia for a list of ganglia plugins we are currently deploying. I'm listing below the ones I think are more important/urgent to have:

[x] varnish
[x] gdnsd
[x] apache
[x] vhtcpd
[x] hhvm
[x] memcache
[] redis
[] postgresql

The list of rrds updated in the last 30d is in P4571; their current status is below:

[] fundraising-related stats for misc queues and donations T152562
[x] cirrussearch slow log rate, in graphite via logstash
[x] apache mod_socache_shmcb stats, we don't seem to use `mod_socache` anyway
[x] elasticsearch stats, afaict those are in graphite already
[] exim, can be done with diamond/graphite or in prometheus via node_exporter
[x] jenkins TODO? some stats might be already in graphite
[x] kafka, in graphite
[x] varnishkafka, in graphite
[x] osm sync lag from `/srv/osmosis/state.txt`
[x] powerdns, in graphite via diamond
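
For the osm sync lag item, here is a rough sketch of how the lag could be derived from `/srv/osmosis/state.txt` and rendered in the Prometheus text format, e.g. for the `node_exporter` textfile collector. It assumes the usual osmosis replication state layout (a Java properties file with a `timestamp=` line whose colons are escaped as `\:`). The metric name `osm_sync_lag_seconds` and the function names are hypothetical, not what is actually deployed.

```python
from datetime import datetime, timezone


def parse_state_timestamp(state_text):
    """Return the replication timestamp found in osmosis state.txt contents."""
    for line in state_text.splitlines():
        line = line.strip()
        if line.startswith("timestamp="):
            # Assumption: properties-style escaping, e.g. 2016-12-01T12\:00\:00Z
            value = line.split("=", 1)[1].replace("\\:", ":")
            parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
            return parsed.replace(tzinfo=timezone.utc)
    raise ValueError("no timestamp= line found in state file")


def sync_lag_seconds(state_text, now=None):
    """Seconds elapsed between the state timestamp and `now` (UTC)."""
    if now is None:
        now = datetime.now(timezone.utc)
    return (now - parse_state_timestamp(state_text)).total_seconds()


def render_metric(lag_seconds, name="osm_sync_lag_seconds"):
    """Format the lag as a gauge in the Prometheus text exposition format.

    The metric name is a placeholder chosen for this example.
    """
    return (
        f"# HELP {name} Seconds since the last replicated OSM state timestamp.\n"
        f"# TYPE {name} gauge\n"
        f"{name} {lag_seconds}\n"
    )
```

A cron job could read the state file, pass its contents to `sync_lag_seconds`, and write the output of `render_metric` to a `.prom` file in the directory the textfile collector watches.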