Uninstall ganglia from the fleet
Open, Normal, Public

Description

Now that fundraising is no longer using ganglia we can uninstall it from the fleet (gmetad / gmond / etc) and remove the relevant puppet bits.

Restricted Application added a subscriber: Aklapper. Oct 2 2017, 3:38 PM
Dzahn claimed this task. Oct 7 2017, 12:53 AM
Dzahn awarded a token.
Dzahn triaged this task as Normal priority.

Change 382904 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] site: decom ganglia-web host, rm aggregators, rm phab include

https://gerrit.wikimedia.org/r/382904

Change 382905 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] osm: remove all ganglia support

https://gerrit.wikimedia.org/r/382905

Change 382906 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] postgresql: remove all ganglia support

https://gerrit.wikimedia.org/r/382906

Change 382907 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] ocg: remove all ganglia support

https://gerrit.wikimedia.org/r/382907

Change 382909 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] apache: remove ganglia monitoring

https://gerrit.wikimedia.org/r/382909

Change 382913 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] udp2log: remove ganglia monitoring

https://gerrit.wikimedia.org/r/382913

Change 382914 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] bacula: remove ganglia backup sets

https://gerrit.wikimedia.org/r/382914

Change 382915 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] hhvm: remove ganglia monitoring

https://gerrit.wikimedia.org/r/382915

Change 382916 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] exim4/multiple roles: remove Ganglia exim stats

https://gerrit.wikimedia.org/r/382916

Change 382917 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] openstack: remove ganglia disk stats

https://gerrit.wikimedia.org/r/382917

Change 382918 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] authdns: remove ganglia support

https://gerrit.wikimedia.org/r/382918

Change 382920 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] cumin: drop ganglia::web role alias

https://gerrit.wikimedia.org/r/382920

Change 382921 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] zookeeper: remove jmxtrans for sending data to ganglia

https://gerrit.wikimedia.org/r/382921

Change 382922 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] memcached: remove ganglia monitoring

https://gerrit.wikimedia.org/r/382922

Change 382923 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] statsd: remove ganglia backend support

https://gerrit.wikimedia.org/r/382923

Change 382924 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] standard: decom ganglia plugin everywhere by default

https://gerrit.wikimedia.org/r/382924

Change 382926 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] standard: actually drop 'has_ganglia' param entirely

https://gerrit.wikimedia.org/r/382926

Change 382927 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] elasticsearch/logstash: drop ganglia monitoring

https://gerrit.wikimedia.org/r/382927

Change 382929 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] dnsrecursor: drop ganglia metrics support

https://gerrit.wikimedia.org/r/382929

Dzahn added a subscriber: Ottomata. Oct 7 2017, 3:04 AM

@Ottomata Hi, I'm wondering what you think should happen with modules/confluent/manifests/kafka/mirror/jmxtrans.pp and modules/confluent/manifests/kafka/broker/jmxtrans.pp once Ganglia gets removed?

Change 382930 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] pybal: support RunCommand everywhere, not just appservers?

https://gerrit.wikimedia.org/r/382930

Change 382931 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] hiera/wmflib: drop ganglia_clusters variable entirely?

https://gerrit.wikimedia.org/r/382931

Change 382932 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] ganglia: delete ganglia-web classes and role

https://gerrit.wikimedia.org/r/382932

Change 382933 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] ganglia: delete the module

https://gerrit.wikimedia.org/r/382933

Change 382914 merged by Alexandros Kosiaris:
[operations/puppet@production] bacula: remove ganglia backup sets

https://gerrit.wikimedia.org/r/382914

I've seen all the nice changes. While merging the bacula one, it got me thinking: how are we going to clean up the fleet? Should we use puppet (so all these changes require amending)? Should we use cumin (and we should come up with the sequence of commands to run)? Something else? We should take some care to craft the correct approach, as some hosts are not going to be reimaged for years and I would hate to see us repeatedly spending time in the future wondering why salt still runs on a host or why there's some salt configuration or some salt packages/modules around.

For dropping the salt minion, the removal was done in two stages, first a commit which purged the packages and when that had run across the fleet and WMCS, it was dropped in puppet. Could be done here as well.

> @Ottomata Hi, I'm wondering what you think should happen with modules/confluent/manifests/kafka/mirror/jmxtrans.pp and modules/confluent/manifests/kafka/broker/jmxtrans.pp once Ganglia gets removed?

Those don't necessarily use Ganglia. I don't think they do now, but if they do, we can just remove the Ganglia part of the configuration, and only use statsd/graphite.

BTW, we are working on porting Kafka configuration over to Prometheus anyway, and will not be using jmxtrans anymore. We need to keep these classes around for a while, as the main Kafka clusters still use them, and will for a while.

faidon added a subscriber: faidon. Oct 9 2017, 6:01 PM

I saw some of these commits fly by. These are obviously well agreed in principle but I think it's important to not have regressions here -- if we remove a service from being monitored by Ganglia, we should have the equivalent metrics in Prometheus and Graphite, and these need to show up in a suitable Grafana dashboard. Has this been taken into account?

> For dropping the salt minion, the removal was done in two stages, first a commit which purged the packages and when that had run across the fleet and WMCS, it was dropped in puppet. Could be done here as well.

I've been thinking about using cumin instead, but given the number of hosts we have it's almost certain some will be missed and will have to be followed up on individually, which is not ideal. I think the above approach is saner. It's not like we are in any rush to drop the code purging the packages from puppet anyway.

> I saw some of these commits fly by. These are obviously well agreed in principle but I think it's important to not have regressions here -- if we remove a service from being monitored by Ganglia, we should have the equivalent metrics in Prometheus and Graphite, and these need to show up in a suitable Grafana dashboard. Has this been taken into account?

Agreed. No, I don't think it has. I can think of at least one example (postgresql metrics/dashboard) that hasn't been migrated yet.

Dzahn added a comment. Oct 10 2017, 5:51 PM
> it got me thinking how are we going to clean up the fleet? Should we use puppet (so all these changes require amending)?

I was planning to use this: https://gerrit.wikimedia.org/r/#/c/382924/1/modules/standard/manifests/init.pp to uninstall the plugin everywhere, because I saw we already have a "decom" class. For the rest I was thinking more of cumin.

Dzahn added a comment. Oct 10 2017, 5:54 PM

> equivalent metrics in Prometheus and Graphite, and these need to show up in a suitable Grafana dashboard. Has this been taken into account?

I was planning to make this part of the review process. I thought to myself "well, some will be obvious and for some we need to check if there are equivalents", so split it up into many patches and let people chime in on Gerrit, and maybe have to ask different people on a "per module" basis anyways.

Change 382927 merged by Dzahn:
[operations/puppet@production] elasticsearch/logstash: drop ganglia monitoring

https://gerrit.wikimedia.org/r/382927

Mentioned in SAL (#wikimedia-operations) [2017-10-10T18:41:19Z] <mutante> logstash*: delete /usr/lib/ganglia/python_modules/elasticsearch_monitoring.py and /etc/ganglia/conf.d/elasticsearch.pyconf via cumin (T177225)

Change 382920 merged by Dzahn:
[operations/puppet@production] cumin: drop ganglia::web role alias

https://gerrit.wikimedia.org/r/382920

Change 382921 merged by Dzahn:
[operations/puppet@production] zookeeper: update comment about ganglia stats

https://gerrit.wikimedia.org/r/382921

Looks like this is the relevant list https://wikitech.wikimedia.org/wiki/Prometheus#Ganglia_plugins to see which plugins have been replaced with what. Can any updates be made to that list?

> Looks like this is the relevant list https://wikitech.wikimedia.org/wiki/Prometheus#Ganglia_plugins to see which plugins have been replaced with what. Can any updates be made to that list?

The updated list of what is there or missing can be found at T145659: Port application-specific metrics from ganglia to prometheus. Note that redis/pg/exim will also be ported this quarter as part of T177196: Port non-deprecated Diamond collectors to Prometheus.

> it got me thinking how are we going to clean up the fleet? Should we use puppet (so all these changes require amending)?

> I was planning to use this: https://gerrit.wikimedia.org/r/#/c/382924/1/modules/standard/manifests/init.pp to uninstall the plugin everywhere, because I saw we already have a "decom" class. For the rest I was thinking more of cumin.

That, as is at least, would not be enough (and hence the cumin work would be required). But I'd say that decom class also needs a forced removal of /etc/ganglia and /usr/lib/ganglia/python_modules (and others, if I am forgetting something). It would also probably be a good idea to change the state of the package from absent to purged. Maybe we can avoid the synchronous cumin work.

> equivalent metrics in Prometheus and Graphite, and these need to show up in a suitable Grafana dashboard. Has this been taken into account?

> I was planning to make this part of the review process. I thought to myself "well, some will be obvious and for some we need to check if there are equivalents", so split it up into many patches and let people chime in on Gerrit, and maybe have to ask different people on a "per module" basis anyways.

Ok, this is going to take long then, but was bound to anyways.

> That, as is at least, would not be enough (and hence the cumin work would be required). But I'd say that decom class also needs a forced removal of /etc/ganglia and /usr/lib/ganglia/python_modules (and others, if I am forgetting something).

> It would also probably be a good idea to change the state of the package from absent to purged. Maybe we can avoid the synchronous cumin work.

Ack, ganglia::monitor::decommission should use "ensure => purged"; /etc/ganglia/gmond.conf is a conffile owned by the ganglia-monitor package, so purging would remove it along with the package (same for the /etc/ganglia directory itself).
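
To illustrate the remove-versus-purge distinction made above: on Debian, a package that was removed but not purged keeps its conffiles, and dpkg reports it in the `config-files` state. The following is only a minimal sketch that classifies a status string of the kind printed by `dpkg-query -W -f='${Status}'`. The function name `classify_dpkg_status` and its labels are hypothetical and not part of any patch in this task.

```shell
#!/bin/sh
# Hypothetical helper: map a dpkg status string to a short label.
# The status has three words: wanted state, error flag, current state.
classify_dpkg_status() {
    case "$1" in
        *" installed")     echo "installed" ;;
        *" config-files")  echo "removed-conffiles-left" ;;
        *" not-installed") echo "purged" ;;
        # Anything else, including an empty string for a package
        # that dpkg does not know about, is reported as unknown.
        *)                 echo "unknown" ;;
    esac
}

# Possible call on a live host (assumes dpkg-query is available):
#   classify_dpkg_status "$(dpkg-query -W -f='${Status}' ganglia-monitor 2>/dev/null)"
```

A host where the package was only set to absent would be expected to report `removed-conffiles-left`, which is the state that "ensure => purged" is meant to avoid.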

Change 383945 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] ganglia: decom class should use "ensure => purged"

https://gerrit.wikimedia.org/r/383945

Change 383945 merged by Dzahn:
[operations/puppet@production] ganglia: decom class should use "ensure => purged"

https://gerrit.wikimedia.org/r/383945

Change 382918 merged by Dzahn:
[operations/puppet@production] authdns: remove ganglia support

https://gerrit.wikimedia.org/r/382918

Removed ganglia stats for gdnsd (authdns servers).

I checked that there was no regression because stats were ported in T147426, merged the above, then did:

[baham:~] $ sudo rm /usr/lib/ganglia/python_modules/gdnsd.py
[baham:~] $ sudo rm /etc/ganglia/conf.d/gdnsd.pyconf

on baham, eeden and radon.
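
The two manual `rm` commands above could be wrapped in a small idempotent function so the cleanup can be rerun safely on each host. This is a sketch under stated assumptions, not what was actually run: the function name `remove_gdnsd_ganglia_files` and the optional root-directory argument (added only so it can be tried against a scratch directory) are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: remove the gdnsd Ganglia plugin and its config.
# The optional first argument is a root prefix; it defaults to the
# real filesystem. Pass a scratch directory to try it out safely.
remove_gdnsd_ganglia_files() {
    root="${1:-}"
    # rm -f succeeds even when a file is already gone, so reruns are safe.
    rm -f "$root/usr/lib/ganglia/python_modules/gdnsd.py" \
          "$root/etc/ganglia/conf.d/gdnsd.pyconf"
}
```

Called with no argument it acts on the real paths and would need to be run as root.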

Change 383991 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] Purge /usr/lib/ganglia/ and /etc/ganglia

https://gerrit.wikimedia.org/r/383991

> removed ganglia stats for gdnsd (authdns servers).

> I checked that there was no regression because stats were ported in T147426, merged the above, then did:

Nice!

> [baham:~] $ sudo rm /usr/lib/ganglia/python_modules/gdnsd.py
> [baham:~] $ sudo rm /etc/ganglia/conf.d/gdnsd.pyconf

Yeah, let's actually automate this. The ensure => purged is a good first step, let's also ensure /usr/lib/ganglia/python_modules/. is purged. Done in https://gerrit.wikimedia.org/r/383991

> on baham, eeden and radon.

So these hosts now have only the "basic" ganglia stuff. Should we switch has_ganglia to no for them (or preferably for the role), so that the ganglia::monitor::decom class takes over and cleans up (and also to test it)?

Change 383991 merged by Dzahn:
[operations/puppet@production] Purge /usr/lib/ganglia/ and /etc/ganglia

https://gerrit.wikimedia.org/r/383991

Change 384070 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] authdns: purge ganglia files

https://gerrit.wikimedia.org/r/384070

Change 384070 merged by Dzahn:
[operations/puppet@production] authdns: purge ganglia files

https://gerrit.wikimedia.org/r/384070

Dzahn added a comment. Oct 13 2017, 4:18 PM

> So these hosts now have only the "basic" ganglia stuff. Should we switch has_ganglia to no for them (or preferably for the role), so that the ganglia::monitor::decom class takes over and cleans up (and also to test it)?

Yes, done :) And it worked fine. On radon:

Notice: /Stage[main]/Ganglia::Monitor::Decommission/Package[ganglia-monitor]/ensure: ensure changed '3.6.0-6' to 'purged'
Notice: /Stage[main]/Ganglia::Monitor::Decommission/File[/usr/lib/ganglia/]/ensure: removed
Notice: /Stage[main]/Ganglia::Monitor::Decommission/File[/etc/ganglia/]/ensure: removed
Notice: Finished catalog run in 22.04 seconds
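
A follow-up check for the result shown in this Puppet output might look like the sketch below, which lists whichever of the two purged directories still exist. The function name `ganglia_leftovers` and the optional root-prefix argument are assumptions made for illustration; only the two paths come from the log above.

```shell
#!/bin/sh
# Hypothetical check: print Ganglia paths that still exist under a root
# prefix (default: the real filesystem). Returns 0 if none are left,
# 1 if at least one was found.
ganglia_leftovers() {
    root="${1:-}"
    found=0
    for path in /etc/ganglia /usr/lib/ganglia; do
        if [ -e "$root$path" ]; then
            echo "$path"
            found=1
        fi
    done
    return "$found"
}
```

On a host where the decommission class has run cleanly, the function would be expected to print nothing and return 0.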

Change 382929 merged by Dzahn:
[operations/puppet@production] dnsrecursor: drop ganglia metrics support

https://gerrit.wikimedia.org/r/382929

Dzahn added a comment. Oct 13 2017, 4:49 PM

^ I amended the one for dnsrecursors to include the "ganglia: false" Hiera setting right away and then merged that too. Metrics were converted to a Diamond collector in T169600.

Change 382907 abandoned by Dzahn:
ocg: remove all ganglia support

Reason:
ocg already deleted in https://gerrit.wikimedia.org/r/#/c/383580/

https://gerrit.wikimedia.org/r/382907

faidon moved this task from Backlog to In progress on the monitoring board. Oct 16 2017, 1:32 PM

Change 382917 merged by Rush:
[operations/puppet@production] openstack: remove ganglia disk stats

https://gerrit.wikimedia.org/r/382917

Change 382913 merged by Dzahn:
[operations/puppet@production] udp2log: remove ganglia monitoring

https://gerrit.wikimedia.org/r/382913

Change 385218 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] xenon: decom ganglia from mwlog hosts

https://gerrit.wikimedia.org/r/385218

Change 385218 merged by Dzahn:
[operations/puppet@production] xenon: decom ganglia from mwlog hosts

https://gerrit.wikimedia.org/r/385218

Change 385229 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] ganglia: also remove systemd unit files on decom

https://gerrit.wikimedia.org/r/385229

Change 385229 merged by Dzahn:
[operations/puppet@production] ganglia: also remove systemd unit files on decom

https://gerrit.wikimedia.org/r/385229

Change 382909 merged by Dzahn:
[operations/puppet@production] apache: remove ganglia monitoring

https://gerrit.wikimedia.org/r/382909

Change 382915 merged by Dzahn:
[operations/puppet@production] hhvm: remove ganglia monitoring

https://gerrit.wikimedia.org/r/382915

Mentioned in SAL (#wikimedia-operations) [2017-10-20T17:38:57Z] <mutante> removing Apache and HHVM Ganglia stats from all appservers, part of retiring Ganglia (T177225) - some transient puppet issues while purging package and remnants - use grafana dashboard at https://grafana.wikimedia.org/dashboard/db/prometheus-apache-hhvm-dc-stats

Change 385412 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] ganglia::decom: also purge package libganglia1

https://gerrit.wikimedia.org/r/385412

Change 385412 merged by Dzahn:
[operations/puppet@production] ganglia::decom: also purge package libganglia1

https://gerrit.wikimedia.org/r/385412

Change 391558 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/software/tendril@master] Link to grafana rather than to ganglia on tendril

https://gerrit.wikimedia.org/r/391558