Port non-deprecated Diamond collectors to Prometheus
Closed, Resolved, Public

Description

We're using Diamond to collect all sorts of metrics from both services and the
"machine" (the kernel) itself. We'll need to port some or all of these collectors
to Prometheus.

There's a list of diamond collectors in use at
https://wikitech.wikimedia.org/wiki/Prometheus#Diamond

The list is also reproduced below, split by macro category.

Implementation via cron and textfile for node-exporter to pick up

  • minimalpuppetagent.py / Report puppet stats from last_run_summary.yaml
    • Substituted by prometheus-puppet-agent-stats
  • localcrontab.py / Report the number of users' crontabs, mainly used in tools
  • cherry-pick-counter-collector.py / Report the number of cherry-pick patches in a given git repo
  • nagios.py / Execute nagios commands locally and report the exit code
  • sshsessions.py / Collect number of lines from who
  • dir_size_tracker.py / Collect the size of given directories
  • sge.py / Collect metrics from gridengine
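
The cron-and-textfile approach could look roughly like the following. This is a minimal sketch, assuming a script run from cron that counts the lines printed by who (as sshsessions.py does) and writes the result in the Prometheus text exposition format to a .prom file in the directory read by node-exporter's textfile collector. The metric name ssh_sessions_count, the output path, and the helper names are hypothetical and not taken from this task.

```python
import os
import subprocess
import tempfile


def count_sessions(who_output):
    """Count non-empty lines of `who` output, one per login session."""
    return sum(1 for line in who_output.splitlines() if line.strip())


def render_metric(name, help_text, value):
    """Render a single gauge in the Prometheus text exposition format."""
    return (
        f"# HELP {name} {help_text}\n"
        f"# TYPE {name} gauge\n"
        f"{name} {value}\n"
    )


def write_textfile(path, content):
    """Write to a temp file, then rename, so a scrape never sees a partial file."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(content)
        os.chmod(tmp_path, 0o644)
        os.replace(tmp_path, path)  # atomic within one filesystem
    except BaseException:
        os.unlink(tmp_path)
        raise


# Hypothetical output path; use whatever directory the textfile collector reads.
def main(path="/var/lib/prometheus/node.d/ssh_sessions.prom"):
    who_output = subprocess.run(
        ["who"], capture_output=True, text=True, check=True
    ).stdout
    content = render_metric(
        "ssh_sessions_count",  # hypothetical metric name
        "Number of lines reported by who",
        count_sessions(who_output),
    )
    write_textfile(path, content)
```

A cron entry would call main() periodically. The temporary file ends in .tmp rather than .prom so the textfile collector ignores it until the rename.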

Implementation via a separate exporter

Misc

  • extendedexim.py / Parse exim's paniclog and queue stats by calling exim -bpr
  • etherpad.py / Parse localhost:9001/stats and report stats
    • Might make sense to contribute an etherpad plugin or patch for prometheus stats?
  • nfsd.py / Parse and report stats from /proc/net/rpc/nfsd and /proc/fs/nfsd/pool_stats
  • nfsiostat.py / Emulate iostat for NFS mount points using /proc/self/mountstats
    • Supported by node_exporter (nfs and mountstats collectors)
    • Metrics from one of the tools-worker instances with both collectors enabled are at https://phabricator.wikimedia.org/P6090, for comparison with what we have now
  • nf_conntrack_counter.py / Report sysctl net.netfilter.nf_conntrack_count
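
To illustrate what a collector such as nfsd.py has to do, here is a minimal sketch that parses text in the style of /proc/net/rpc/nfsd, where each line is a label followed by whitespace-separated numbers, and renders it as Prometheus samples. The metric name nfsd_stat, the field and index labels, and the sample input are assumptions for illustration; this is not the Diamond collector's code.

```python
def _number(token):
    """Parse an integer where possible, falling back to float."""
    try:
        return int(token)
    except ValueError:
        return float(token)


def parse_nfsd_stats(text):
    """Map each line's leading label to its list of numeric fields."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or label-only lines
        stats[parts[0]] = [_number(token) for token in parts[1:]]
    return stats


def render_samples(stats, metric="nfsd_stat"):  # hypothetical metric name
    """Emit one sample per numeric field, labelled by line label and position."""
    lines = []
    for field in sorted(stats):
        for index, value in enumerate(stats[field]):
            lines.append(f'{metric}{{field="{field}",index="{index}"}} {value}')
    return "\n".join(lines) + "\n"


SAMPLE = "rc 0 12 340\nio 1024 2048\n"  # made-up values, not real output

print(render_samples(parse_nfsd_stats(SAMPLE)), end="")
```

A real exporter would give each field a meaningful metric name instead of a positional index; the positional form only keeps the sketch short.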

Related Objects

Status    Assigned           Task
Open      None
Resolved  fgiunchedi
Resolved  fgiunchedi
Resolved  akosiaris
Declined  Gehel
Resolved  fgiunchedi
Resolved  fgiunchedi
Resolved  fgiunchedi
Resolved  MoritzMuehlenhoff
Resolved  MoritzMuehlenhoff
Resolved  Gehel
Resolved  MoritzMuehlenhoff
Resolved  fgiunchedi
Resolved  MoritzMuehlenhoff
Resolved  MoritzMuehlenhoff
Resolved  MoritzMuehlenhoff
Resolved  MoritzMuehlenhoff
Resolved  MoritzMuehlenhoff
Open      None
Resolved  bd808
Open      Bstorm
Open      aborrero

Event Timeline

fgiunchedi updated the task description. Oct 10 2017, 9:27 AM
fgiunchedi moved this task from Backlog to Doing on the User-fgiunchedi board. Oct 11 2017, 12:48 PM
fgiunchedi updated the task description. Oct 13 2017, 1:42 PM
fgiunchedi updated the task description.

Change 382695 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: add conntrack/entropy/edac collectors

https://gerrit.wikimedia.org/r/382695

Change 382728 merged by Filippo Giunchedi:
[operations/software/hhvm_exporter@master] Collect APC info

https://gerrit.wikimedia.org/r/382728

fgiunchedi updated the task description. Oct 19 2017, 10:06 AM
fgiunchedi updated the task description. Oct 19 2017, 10:23 AM

Mentioned in SAL (#wikimedia-operations) [2017-10-19T11:40:42Z] <akosiaris> T177196 upload prometheus-postgres-exporter_0.2.0+ds-2 to apt.wikimedia.org/stretch-wikimedia/main and copied over to apt.wikimedia.org/jessie-wikimedia/main

cc cloud-services-team for input on some of the Diamond collectors we have in use, namely:

  • nfsiostat.py: we could replace it with node_exporter functionality (see task description for a sample of metrics)
  • nfsd.py: not supported by node_exporter yet, but there's an upstream issue open

There are also some collectors at the top of the task description that we could reimplement with cron jobs that drop text files with the metrics. I'd like to know which ones are actively used / useful, though.

fgiunchedi updated the task description. Nov 2 2017, 1:58 PM
fgiunchedi updated the task description.
MoritzMuehlenhoff triaged this task as High priority. Nov 6 2017, 8:01 AM

pinging myself here to not forget :)

Change 392438 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] Add postgresql::prometheus class

https://gerrit.wikimedia.org/r/392438

Change 392441 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] Add postgresql::prometheus class to user of postgresql

https://gerrit.wikimedia.org/r/392441

Is there an issue with the memcached exporter? This dashboard looks empty:

https://grafana.wikimedia.org/dashboard/db/memcached?orgId=1&var-server=All&from=now-90d&to=now

Indeed, it looks like that dashboard is Graphite-based; the Prometheus one is https://grafana.wikimedia.org/dashboard/db/prometheus-memcached-dc-stats?orgId=1. I've deleted the former dashboard to avoid confusion.

Change 394571 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: enable nfs/mountstats collectors for tools-bastion-03

https://gerrit.wikimedia.org/r/394571

Change 394571 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: enable nfs/mountstats collectors for tools-bastion-03

https://gerrit.wikimedia.org/r/394571

I talked @fgiunchedi into enabling the client collector on tools-bastion-03 in https://gerrit.wikimedia.org/r/#/c/394571/ so we could see how this works on an instance with more data to expose. The tools worker in question is not busy at all, so it's hard to get a clear picture from what I'm seeing there.

fgiunchedi updated the task description. Dec 4 2017, 10:49 AM
fgiunchedi updated the task description. Dec 4 2017, 10:53 AM

https://tools-prometheus.wmflabs.org/tools/graph?g0.range_input=1d&g0.expr=rate(node_mountstats_nfs_read_bytes_total%7Bexport%3D%22nfs-tools-project.svc.eqiad.wmnet%3A%2Fproject%2Ftools%2Fproject%22%7D%5B5m%5D)&g0.tab=0
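
The query in that link is percent-encoded, which makes it hard to read. A small sketch, using only the Python standard library, that recovers the PromQL expression from the URL above (the helper name graph_expression is made up for illustration):

```python
from urllib.parse import parse_qs, urlsplit

# The graph URL from the comment above, split across lines for readability.
GRAPH_URL = (
    "https://tools-prometheus.wmflabs.org/tools/graph"
    "?g0.range_input=1d"
    "&g0.expr=rate(node_mountstats_nfs_read_bytes_total"
    "%7Bexport%3D%22nfs-tools-project.svc.eqiad.wmnet"
    "%3A%2Fproject%2Ftools%2Fproject%22%7D%5B5m%5D)"
    "&g0.tab=0"
)


def graph_expression(url, panel="g0"):
    """Return the percent-decoded PromQL expression for one graph panel."""
    query = parse_qs(urlsplit(url).query)
    return query[f"{panel}.expr"][0]


print(graph_expression(GRAPH_URL))
```

The decoded expression is rate(node_mountstats_nfs_read_bytes_total{export="nfs-tools-project.svc.eqiad.wmnet:/project/tools/project"}[5m]), that is, the per-second rate of bytes read from that NFS export, averaged over 5-minute windows and graphed over one day.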

fgiunchedi updated the task description. Dec 14 2017, 9:44 AM

Change 392438 merged by Alexandros Kosiaris:
[operations/puppet@production] Add postgresql::prometheus class

https://gerrit.wikimedia.org/r/392438

fgiunchedi updated the task description. Dec 18 2017, 2:10 PM
fgiunchedi updated the task description. Dec 18 2017, 4:08 PM

Change 392441 merged by Alexandros Kosiaris:
[operations/puppet@production] Add prometheus::postgres_exporter class to users

https://gerrit.wikimedia.org/r/392441

akosiaris updated the task description.
akosiaris updated the task description.
fgiunchedi updated the task description. Dec 21 2017, 9:47 AM
fgiunchedi updated the task description. Feb 12 2018, 5:22 PM
fgiunchedi moved this task from Doing to Up next on the User-fgiunchedi board. Jul 11 2018, 1:50 PM
fgiunchedi moved this task from Backlog to In progress on the observability board.
bd808 updated the task description. Jan 24 2019, 4:03 PM
fgiunchedi closed this task as Resolved. Feb 5 2019, 8:43 AM
fgiunchedi claimed this task.

Agreed, resolving