Page MenuHomePhabricator

Deprecate Diamond collectors in Cloud VPS
Open, Stalled, MediumPublic

Description

Diamond is being deprecated, I checked for remaining collectors used in Cloud VPS:

  • role::labs::instance: diamond::collector { 'SSHSessions' }
    • This is a custom collector deployed via Puppet (modules/diamond/files/collector/sshsessions.py), it should be pretty straightforward to convert. I think this would even be a useful metric to be collected in general (i.e. also for production), integration could be done via the prometheus-node-exporter textfile collector (see prometheus::node_intel_microcode).
  • role::labs::instance: diamond::collector diamond::collector::minimalpuppetagent { 'minimal-puppet-agent' }
    • Replaced with prometheus::node_puppet_agent
  • class puppetmaster::gitsync (used by role::puppetmaster::standalone): diamond::collector { 'CherryPickCounter' }
    • This is a custom collector deployed via Puppet (modules/puppetmaster/files/cherry-pick-counter-collector.py). it should be pretty straightforward to convert.
  • toollabs::services: diamond::collector { 'SGE' } (custom exporter from puppet.git)
    • Will need to be ported to a custom exporter (if still needed).
  • profile::toolforge::services::basic: diamond::collector { 'SGE':
    • Will need to be ported to a custom exporter (if still needed).
  • profile::toolforge::redis: diamond::collector { 'Redis' }
    • In production metrics were ported to Prometheus in T148637, that can probably be reused/adapted.
    • NOTE: The redis collector setup in prod requires PuppetDB to discover the nodes to scrape
  • define labstore::nfs_mount: diamond::collector { 'Nfsiostat' }
    • This is a custom collector deployed via Puppet (modules/diamond/files/collector/nfsiostat.py), I doubt there's an existing Prometheus exporter for this use case, so probably the existing one needs to be converted. Can likely be replaced with the mountstats module of node_exporter (T210993#5769021)
  • toollabs::mailrelay: diamond::collector::extendedexim { 'extended_exim_collector': }; profile::toolforge::mailrelay: diamond::collector::extendedexim { 'extended_exim_collector': }
    • In production metrics were ported to Prometheus in T179565, that can probably be reused/adapted.
    • NOTE: The exim mtail collector setup in prod requires PuppetDB to discover the nodes to scrape
  • class toollabs: diamond::collector::localcrontab { 'localcrontabcollector': }; profile::toolforge::grid::base: diamond::collector::localcrontab { 'localcrontabcollector': }
    • Needs some investigation for replacement.
  • role::beta::availability_collector: diamond::collector { 'VarnishStatus' }
    • This is a custom collector deployed via Puppet (modules/diamond/files/collector/varnishstatus.py), it should be pretty straightforward to convert (per watroles it's currently in use on deployment-cache-upload04, so seems still in use).

Details

ProjectBranchLines +/-Subject
operations/puppetproduction+0 -30
operations/puppetproduction+0 -780
operations/puppetproduction+0 -102
operations/puppetproduction+0 -8
operations/puppetproduction+0 -3
operations/puppetproduction+6 -0
operations/puppetproduction+0 -3
operations/puppetproduction+0 -173
operations/puppetproduction+0 -492
operations/puppetproduction+0 -43
operations/puppetproduction+1 -1
operations/puppetproduction+89 -100
operations/puppetproduction+31 -4
operations/puppetproduction+1 -3
operations/puppetproduction+0 -1
operations/puppetproduction+2 -122
operations/puppetproduction+60 -27
operations/puppetproduction+2 -30
Show related patches Customize query in gerrit

Related Objects

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
GTirloni removed a subscriber: GTirloni.Mar 23 2019, 8:46 PM
fgiunchedi moved this task from Backlog to Radar on the User-fgiunchedi board.May 13 2019, 8:58 AM

Change 522405 had a related patch set uploaded (by Muehlenhoff; owner: Muehlenhoff):
[operations/puppet@production] Remove CherryPickCounter Diamond collector

https://gerrit.wikimedia.org/r/522405

Change 522405 merged by Muehlenhoff:
[operations/puppet@production] Remove CherryPickCounter Diamond collector

https://gerrit.wikimedia.org/r/522405

bd808 updated the task description. (Show Details)Oct 16 2019, 3:20 AM

Change 543268 had a related patch set uploaded (by BryanDavis; owner: Bryan Davis):
[operations/puppet@production] cloud: Replace SSHSessions diamond collector with prometheus

https://gerrit.wikimedia.org/r/543268

Change 543268 merged by Bstorm:
[operations/puppet@production] cloud: Replace SSHSessions diamond collector with prometheus

https://gerrit.wikimedia.org/r/543268

bd808 updated the task description. (Show Details)Nov 7 2019, 5:03 AM

Change 549241 had a related patch set uploaded (by BryanDavis; owner: Bryan Davis):
[operations/puppet@production] cloud: Replace diamond::collector::minimalpuppetagent

https://gerrit.wikimedia.org/r/549241

Change 549241 merged by Andrew Bogott:
[operations/puppet@production] cloud: Replace diamond::collector::minimalpuppetagent

https://gerrit.wikimedia.org/r/549241

Change 549484 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] cloud-vps: remove duplicate definition of node_puppet_agent

https://gerrit.wikimedia.org/r/549484

Change 549484 merged by Andrew Bogott:
[operations/puppet@production] cloud-vps: remove duplicate definition of node_puppet_agent

https://gerrit.wikimedia.org/r/549484

bd808 updated the task description. (Show Details)Nov 7 2019, 5:41 PM

JFTR, Diamond has been removed from Debian as part of the Python 2 removal: https://packages.qa.debian.org/d/diamond/news/20191127T071040Z.html

  • define labstore::nfs_mount: diamond::collector { 'Nfsiostat' }

This is a custom collector deployed via Puppet (modules/diamond/files/collector/nfsiostat.py), I doubt there's an existing Prometheus exporter for this use case, so probably the existing one needs to be converted.

I think that node_exporter's mountstats collector might have all the data we want/need to replace diamond::collector { 'Nfsiostat' }. It does not track data exactly like the diamond collector that Chase ported over from sysstat, but it appears to be very through. The biggest difference I see in a casual examination is that the Prometheus version does not recognize that /proc/self/mountstats reports total ops by mount point (because that is how the file is organized), but that these are actually stats which are computed per server, not per mount. This leads to duplicated data like:

node_mountstats_nfs_total_read_bytes_total{export="nfs-tools-project.svc.eqiad.wmnet:/project/tools/home",instance="tools-sgebastion-07:9100",job="node",protocol="tcp"}	57397540672
node_mountstats_nfs_total_read_bytes_total{export="nfs-tools-project.svc.eqiad.wmnet:/project/tools/project",instance="tools-sgebastion-07:9100",job="node",protocol="tcp"}	57397540672

Figuring out how to enable this collector selectively as we have been doing for the diamond collector within the labstore::nfs_mount define is a challenge. Our puppet code wants the prometheus::node_exporter::collectors_extra hiera setting to list any and all extra collectors to enable. One way we could handle it is just making setting up that hiera key a part of the process of adding NFS mounts to a Cloud VPS host. For projects like tools we can use prefix hiera to add the needed settings. Similar things can be done in other projects where NFS is project global or managed by instance prefix. I don't think that we want to add the collector for all Cloud VPS instances just so that we get metrics for the small(ish) number of instances which are NFS clients.

@Bstorm are there existing NFS client dashboards that you use in https://grafana-labs.wikimedia.org/ or is this diamond data something that we added at some point but never setup visualizations for? I imported an upstream dashboard for mountstats data that you can see at https://grafana-labs.wikimedia.org/d/keQ9P_-iz/nfs-mountstats?orgId=1&var-job=node&var-node=tools-sgebastion-07&var-port=9100&var-mountpoint=All. The only instance we seem to have the mountstats collector already enabled for is tools-sgebastion-07. This collector was enabled by @JHedden in rOPUP4b0033cc956e: tools: add mountstats to bastion node exporter about a month ago.

bd808 updated the task description. (Show Details)Jan 1 2020, 9:56 PM

Change 561379 had a related patch set uploaded (by BryanDavis; owner: Bryan Davis):
[operations/puppet@production] toolforge: replace diamond redis monitoring with prometheus

https://gerrit.wikimedia.org/r/561379

Change 561412 had a related patch set uploaded (by BryanDavis; owner: Bryan Davis):
[operations/puppet@production] toolforge: Monitor local crontabs with Prometheus

https://gerrit.wikimedia.org/r/561412

Change 561379 merged by Andrew Bogott:
[operations/puppet@production] toolforge: replace diamond redis monitoring with prometheus

https://gerrit.wikimedia.org/r/561379

bd808 updated the task description. (Show Details)Jan 2 2020, 4:20 AM
bd808 updated the task description. (Show Details)Jan 2 2020, 4:28 AM
bd808 updated the task description. (Show Details)Jan 2 2020, 4:51 AM

Change 561437 had a related patch set uploaded (by BryanDavis; owner: Bryan Davis):
[operations/puppet@production] toolforge: replace diamond redis monitoring with prometheus

https://gerrit.wikimedia.org/r/561437

Bstorm added a comment.Jan 2 2020, 9:28 PM

@Bstorm are there existing NFS client dashboards that you use in https://grafana-labs.wikimedia.org/ or is this diamond data something that we added at some point but never setup visualizations for? I imported an upstream dashboard for mountstats data that you can see at https://grafana-labs.wikimedia.org/d/keQ9P_-iz/nfs-mountstats?orgId=1&var-job=node&var-node=tools-sgebastion-07&var-port=9100&var-mountpoint=All. The only instance we seem to have the mountstats collector already enabled for is tools-sgebastion-07. This collector was enabled by @JHedden in rOPUP4b0033cc956e: tools: add mountstats to bastion node exporter about a month ago.

The diamond data doesn't seem to be used anywhere. It *has* caused problems in the past because monitoring NFS on the client side is a great way to build up load and cause issues whenever a mount becomes unavailable. I see little value in monitoring NFS mount stats on the client side by mount instead of by server because the problems typically affect all mounts from the server. I don't think we lose much with the prometheus exporter. I kind of wish it could be presented as "server" instead of "export", but we can always label things and mess with grafana to do that, right?

In general, I've always relied on nethogs to find the biggest consumers of NFS data. Using this might be handy with an alert watermark attached to it--especially since I see the prometheus exporter parses out every operation that takes place, which is really cool (saw that here https://tools-prometheus.wmflabs.org/tools/graph?g0.range_input=1h&g0.expr=node_mountstats_nfs_operations_sent_bytes_total%7Binstance%3D%22tools-sgebastion-07%3A9100%22%7D&g0.tab=1).

My only other comment on that topic is that we might be able to gate that exporter setting on profile::wmcs::nfsclient in general, which would make tools servers that have mount_nfs = false not use the exporter and respect the setting for all purposes--or something like that.

Change 561437 merged by Phamhi:
[operations/puppet@production] toolforge: replace diamond redis monitoring with prometheus

https://gerrit.wikimedia.org/r/561437

bd808 updated the task description. (Show Details)Jan 4 2020, 3:32 PM

Change 561412 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: Monitor local crontabs with Prometheus

https://gerrit.wikimedia.org/r/561412

Change 565530 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] prometheus: node_local_crontab: fix sudo permissions

https://gerrit.wikimedia.org/r/565530

Change 565530 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] prometheus: node_local_crontab: fix sudo permissions

https://gerrit.wikimedia.org/r/565530

bd808 updated the task description. (Show Details)Jan 17 2020, 4:23 PM

Change 582612 had a related patch set uploaded (by Jhedden; owner: Jhedden):
[operations/puppet@production] shinken: remove puppet agent checks

https://gerrit.wikimedia.org/r/582612

Change 582612 merged by Jhedden:
[operations/puppet@production] shinken: remove puppet agent checks

https://gerrit.wikimedia.org/r/582612

Nfsiostat is not only useless (it is a crash loop and nothing else), it causes crashes on nfs clients at this point. We have removed monitoring via prometheus of nfs client mounts for the same reason (though there are still nfs client stats provided by the node exporter). We should destroy that one with extreme prejudice.

Change 604931 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloud NFS: remove the nfsiostat diammond collector

https://gerrit.wikimedia.org/r/604931

Change 604931 merged by Bstorm:
[operations/puppet@production] cloud NFS: remove the nfsiostat diammond collector

https://gerrit.wikimedia.org/r/604931

Bstorm updated the task description. (Show Details)Jun 12 2020, 3:53 PM
fgiunchedi moved this task from Inbox to Radar on the observability board.Jul 20 2020, 1:27 PM
jbond updated the task description. (Show Details)Jul 27 2020, 3:45 PM
jbond added a subscriber: jbond.

updated to mark the following as done:

" toollabs::mailrelay: diamond::collector::extendedexim { 'extended_exim_collector': }; profile::toolforge::mailrelay: diamond::collector::extendedexim { 'extended_exim_collector': }"

This was fixed in the following CR

https://gerrit.wikimedia.org/r/c/operations/puppet/+/607324

This leaves VarnishStatus (added by @mmodell in 2015) running on deployment-cache-upload06.deployment-prep.eqiad.wmflabs as the last Diamond collector, is that still in use? (Or even useful given that we now use the ATS sandwich?)

Dzahn added subscribers: ema, Dzahn.Oct 5 2020, 10:50 PM

asked Mukunda about it and he says it is probably obsolete. asked -traffic because i do see @ema still logged in there not too long ago

Change 632351 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] delete role::beta::availability_collector, diamond varnishstatus.py

https://gerrit.wikimedia.org/r/632351

Change 632351 merged by Muehlenhoff:
[operations/puppet@production] delete role::beta::availability_collector, diamond varnishstatus.py

https://gerrit.wikimedia.org/r/632351

Change 632466 had a related patch set uploaded (by Muehlenhoff; owner: Muehlenhoff):
[operations/puppet@production] Remove obsolete Diamond collector

https://gerrit.wikimedia.org/r/632466

Change 632471 had a related patch set uploaded (by Muehlenhoff; owner: Muehlenhoff):
[operations/puppet@production] Stop using Diamond on Cloud VPS/Toolforge

https://gerrit.wikimedia.org/r/632471

Change 632477 had a related patch set uploaded (by Muehlenhoff; owner: Muehlenhoff):
[operations/puppet@production] Don't add diamond to new images

https://gerrit.wikimedia.org/r/632477

Change 632466 merged by Muehlenhoff:
[operations/puppet@production] Remove obsolete Diamond collector

https://gerrit.wikimedia.org/r/632466

Dzahn awarded a token.Oct 6 2020, 9:04 PM

Change 632569 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] labs_bootstrapvz: remove diamond from lists of installed packages

https://gerrit.wikimedia.org/r/632569

Change 632570 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] wmcs::instance: remove diamond removal remnants

https://gerrit.wikimedia.org/r/632570

Change 632571 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] toolforge/dynamicproxy: remove diamond monitoring proxy

https://gerrit.wikimedia.org/r/632571

Change 632572 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] delete the diamond module

https://gerrit.wikimedia.org/r/632572

Change 632477 merged by Muehlenhoff:
[operations/puppet@production] Don't add diamond to new images

https://gerrit.wikimedia.org/r/632477

Change 632569 abandoned by Dzahn:
[operations/puppet@production] labs_bootstrapvz: remove diamond from lists of installed packages

Reason:
dupe of https://gerrit.wikimedia.org/r/c/operations/puppet/ /632477

https://gerrit.wikimedia.org/r/632569

Change 632571 merged by Dzahn:
[operations/puppet@production] toolforge: delete profile::toolforge::services::basic

https://gerrit.wikimedia.org/r/632571

Dzahn changed the task status from Open to Stalled.Tue, Nov 17, 1:10 AM

blocked by T264920

Change 632572 abandoned by Dzahn:
[operations/puppet@production] delete the diamond module

Reason:
can't be merged before T264920

https://gerrit.wikimedia.org/r/632572

Change 632570 abandoned by Dzahn:
[operations/puppet@production] wmcs::instance: remove diamond removal remnants

Reason:

https://gerrit.wikimedia.org/r/632570