Port non-deprecated Diamond collectors to Prometheus
Closed, Resolved (Public)

Description

We're using Diamond to collect all sorts of metrics, both from services and from
the "machine" (the kernel) itself. We'll need to port some or all of these
collectors so that the metrics are collected by Prometheus.

There's a list of diamond collectors in use at
https://wikitech.wikimedia.org/wiki/Prometheus#Diamond

The list is also reproduced below, split into broad categories.

Implementation via cron and textfile for node-exporter to pick up

  • minimalpuppetagent.py / Report puppet stats from last_run_summary.yaml
    • Substituted by prometheus-puppet-agent-stats
  • localcrontab.py / Report the number of users' crontabs, mainly used in tools
  • cherry-pick-counter-collector.py / Report the number of cherry-pick patches in a given git repo
  • nagios.py / Execute nagios commands locally and report the exit code
  • sshsessions.py / Collect number of lines from who
  • dir_size_tracker.py / Collect the size of given directories
  • sge.py / Collect metrics from gridengine
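
The cron-plus-textfile approach for this group could look roughly like the sketch below, using the sshsessions.py case (counting lines from who) as the example. This is a minimal illustration, not the implementation used for this task: the metric name ssh_sessions, the output file name, and the /var/lib/prometheus/node.d directory are assumptions, and the directory must match whatever node_exporter is given via --collector.textfile.directory. The file is written to a temporary name and then renamed, so node_exporter never reads a half-written file.

```python
import os
import subprocess
import tempfile


def count_sessions(who_output: str) -> int:
    """Count the non-empty lines in the output of `who`."""
    return sum(1 for line in who_output.splitlines() if line.strip())


def render_gauge(name: str, help_text: str, value: float) -> str:
    """Render a single gauge in the Prometheus text exposition format."""
    return (
        f"# HELP {name} {help_text}\n"
        f"# TYPE {name} gauge\n"
        f"{name} {value}\n"
    )


def write_textfile(path: str, content: str) -> None:
    """Write to a temporary file in the same directory, then rename it into place."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(content)
        os.chmod(tmp_path, 0o644)  # node_exporter must be able to read it
        os.replace(tmp_path, path)  # atomic rename on the same filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def main(textfile_dir: str = "/var/lib/prometheus/node.d") -> None:
    # Both the directory and the metric name are hypothetical.
    who_output = subprocess.run(
        ["who"], capture_output=True, text=True, check=True
    ).stdout
    body = render_gauge(
        "ssh_sessions",
        "Number of lines reported by who",
        count_sessions(who_output),
    )
    write_textfile(os.path.join(textfile_dir, "ssh_sessions.prom"), body)
```

A cron entry would then call main() at whatever interval suits the metric. The textfile collector only reads files ending in .prom, so the temporary file is ignored until the rename.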

Implementation via a separate exporter

Misc

  • extendedexim.py / Parse exim's paniclog and queue stats by calling exim -bpr
  • etherpad.py / Parse localhost:9001/stats and report stats
    • Might make sense to contribute an etherpad plugin or patch for prometheus stats?
  • nfsd.py / Parse and report stats from /proc/net/rpc/nfsd and /proc/fs/nfsd/pool_stats
  • nfsiostat.py / Emulate iostat for NFS mount points using /proc/self/mountstats
    • Supported by node_exporter (nfs and mountstats collectors)
    • Metrics with both collectors enabled on one of tools-worker at https://phabricator.wikimedia.org/P6090 to compare with what we have now
  • nf_conntrack_counter.py / Report sysctl net.netfilter.nf_conntrack_count
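
For the collectors in this group that read /proc files, the parsing step is small. The sketch below assumes that each line of /proc/net/rpc/nfsd is a keyword followed by whitespace-separated numbers, and that the conntrack count is exposed as a single integer at /proc/sys/net/netfilter/nf_conntrack_count. The function names are hypothetical, and the sketch deliberately does not assign meanings to individual fields.

```python
def parse_nfsd_stats(text: str) -> dict:
    """Map each keyword in /proc/net/rpc/nfsd-style text to its numeric fields."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank lines or lines without values
        # Some lines may carry decimal values, so parse everything as float.
        stats[fields[0]] = [float(value) for value in fields[1:]]
    return stats


def read_conntrack_count(
    path: str = "/proc/sys/net/netfilter/nf_conntrack_count",
) -> int:
    """Read the current number of tracked connections (assumed path)."""
    with open(path) as handle:
        return int(handle.read().strip())
```

Either result could be fed to a textfile or served by a small dedicated exporter, depending on which implementation route is chosen for the collector.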

Related Objects

Status    Assigned
Resolved  fgiunchedi
Resolved  fgiunchedi
Resolved  fgiunchedi
Resolved  akosiaris
Declined  Gehel
Resolved  fgiunchedi
Resolved  fgiunchedi
Resolved  fgiunchedi
Resolved  MoritzMuehlenhoff
Resolved  MoritzMuehlenhoff
Resolved  Gehel
Resolved  MoritzMuehlenhoff
Resolved  fgiunchedi
Resolved  MoritzMuehlenhoff
Resolved  MoritzMuehlenhoff
Resolved  MoritzMuehlenhoff
Resolved  MoritzMuehlenhoff
Resolved  MoritzMuehlenhoff
Open      None
Resolved  bd808
Resolved  Bstorm
Open      None

Event Timeline

There are a very large number of changes, so older changes are hidden.

Change 382695 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: add conntrack/entropy/edac collectors

https://gerrit.wikimedia.org/r/382695

Change 382728 merged by Filippo Giunchedi:
[operations/software/hhvm_exporter@master] Collect APC info

https://gerrit.wikimedia.org/r/382728

Mentioned in SAL (#wikimedia-operations) [2017-10-19T11:40:42Z] <akosiaris> T177196 upload prometheus-postgres-exporter_0.2.0+ds-2 to apt.wikimedia.org/stretch-wikimedia/main and copied over to apt.wikimedia.org/jessie-wikimedia/main

cc cloud-services-team for input on some of the Diamond collectors we have in use, namely:

  • nfsiostat.py: we could replace it with node_exporter functionality (see the task description for a sample of metrics)
  • nfsd.py: not supported by node_exporter yet, but there's an upstream issue open

There are also some collectors at the top of the task description that we could reimplement with cron jobs that drop text files containing the metrics. I'm curious, though, which of them are actively used or useful.

Change 392438 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] Add postgresql::prometheus class

https://gerrit.wikimedia.org/r/392438

Change 392441 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] Add postgresql::prometheus class to user of postgresql

https://gerrit.wikimedia.org/r/392441

Is there an issue with the memcached exporter? This dashboard looks empty:

https://grafana.wikimedia.org/dashboard/db/memcached?orgId=1&var-server=All&from=now-90d&to=now

Indeed, it looks like that dashboard is Graphite-based; the Prometheus one is https://grafana.wikimedia.org/dashboard/db/prometheus-memcached-dc-stats?orgId=1. I've deleted the former dashboard to avoid confusion.

Change 394571 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: enable nfs/mountstats collectors for tools-bastion-03

https://gerrit.wikimedia.org/r/394571

Change 394571 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: enable nfs/mountstats collectors for tools-bastion-03

https://gerrit.wikimedia.org/r/394571

I talked @fgiunchedi into enabling the client collector on tools-bastion-03 in https://gerrit.wikimedia.org/r/#/c/394571/ so we could see how this works out on an instance with more data to expose. The tools worker in question is not busy at all, and it's hard to get a good picture from what I'm seeing.

https://tools-prometheus.wmflabs.org/tools/graph?g0.range_input=1d&g0.expr=rate(node_mountstats_nfs_read_bytes_total%7Bexport%3D%22nfs-tools-project.svc.eqiad.wmnet%3A%2Fproject%2Ftools%2Fproject%22%7D%5B5m%5D)&g0.tab=0
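
The graph link above carries its PromQL expression percent-encoded in the g0.expr query parameter, which makes it hard to read at a glance. As a small aid, the sketch below decodes it with the Python standard library; the helper name promql_from_graph_url is hypothetical.

```python
from urllib.parse import parse_qs, urlsplit


def promql_from_graph_url(url: str, panel: int = 0) -> str:
    """Return the decoded expression of one panel in a Prometheus graph URL."""
    query = parse_qs(urlsplit(url).query)
    return query[f"g{panel}.expr"][0]


GRAPH_URL = (
    "https://tools-prometheus.wmflabs.org/tools/graph"
    "?g0.range_input=1d"
    "&g0.expr=rate(node_mountstats_nfs_read_bytes_total"
    "%7Bexport%3D%22nfs-tools-project.svc.eqiad.wmnet"
    "%3A%2Fproject%2Ftools%2Fproject%22%7D%5B5m%5D)"
    "&g0.tab=0"
)

print(promql_from_graph_url(GRAPH_URL))
```

The decoded expression is the 5-minute rate of node_mountstats_nfs_read_bytes_total for the export nfs-tools-project.svc.eqiad.wmnet:/project/tools/project, graphed over one day.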

Change 392438 merged by Alexandros Kosiaris:
[operations/puppet@production] Add postgresql::prometheus class

https://gerrit.wikimedia.org/r/392438

Change 392441 merged by Alexandros Kosiaris:
[operations/puppet@production] Add prometheus::postgres_exporter class to users

https://gerrit.wikimedia.org/r/392441