remove ganglia(old), replace with ganglia_new
Closed, Resolved (Public)

Description

As Mark said in today's ops meeting, clean up the ganglia puppet situation. We have different ways to do it: remove ganglia(old), merge into ganglia_new.

Event Timeline

Dzahn raised the priority of this task to Needs Triage.
Dzahn updated the task description.
Dzahn added a project: acl*sre-team.
Dzahn subscribed.

Change 199942 had a related patch set uploaded (by Dzahn):
move jobqueue monitoring out of ganglia.pp

https://gerrit.wikimedia.org/r/199942

Change 199943 had a related patch set uploaded (by Dzahn):
ganglia: remove class ganglia::logtailer

https://gerrit.wikimedia.org/r/199943

Change 198566 had a related patch set uploaded (by Dzahn):
ganglia: DRY, use hiera

https://gerrit.wikimedia.org/r/198566

Change 198721 had a related patch set uploaded (by Dzahn):
ganglia: autogenerate datasources from the list of clusters

https://gerrit.wikimedia.org/r/198721

Change 198720 had a related patch set uploaded (by Dzahn):
ganglia: remove unused configs from ganglia::collector::config

https://gerrit.wikimedia.org/r/198720

Dzahn triaged this task as Medium priority. (Mar 28 2015, 12:22 AM)

Change 198566 merged by Giuseppe Lavagetto:
ganglia: DRY, use hiera

https://gerrit.wikimedia.org/r/198566

Change 198720 merged by Giuseppe Lavagetto:
ganglia: remove unused configs from ganglia::collector::config

https://gerrit.wikimedia.org/r/198720

Change 198721 merged by Giuseppe Lavagetto:
ganglia: autogenerate datasources from the list of clusters

https://gerrit.wikimedia.org/r/198721

Change 199942 merged by Dzahn:
move jobqueue monitoring out of ganglia.pp

https://gerrit.wikimedia.org/r/199942

Change 199943 merged by Dzahn:
ganglia: remove class ganglia::logtailer

https://gerrit.wikimedia.org/r/199943

Change 204978 had a related patch set uploaded (by Dzahn):
make carbon a ganglia_new aggregator for eqiad

https://gerrit.wikimedia.org/r/204978

Change 204978 merged by Alexandros Kosiaris:
make carbon a ganglia_new aggregator for eqiad

https://gerrit.wikimedia.org/r/204978

carbon is now an aggregator for misc eqiad.

on uranium in /etc/ganglia/gmetad.conf there is:

data_source "Miscellaneous eqiad" carbon.wikimedia.org ms1004.eqiad.wmnet

i switched zirconium over to use ganglia_new with https://gerrit.wikimedia.org/r/#/c/205997/

Change 206043 had a related patch set uploaded (by Dzahn):
role::dumps::zim: hiera -> ganglia_class: "new"

https://gerrit.wikimedia.org/r/206043

Change 206043 merged by Dzahn:
role::dumps::zim: hiera -> ganglia_class: "new"

https://gerrit.wikimedia.org/r/206043

Change 206047 had a related patch set uploaded (by Dzahn):
ganglia: role::dumps -> ganglia_new

https://gerrit.wikimedia.org/r/206047

Change 206047 merged by Dzahn:
ganglia: role::dumps -> ganglia_new

https://gerrit.wikimedia.org/r/206047

Please note that right now the role lookup comes after *any* other lookup, so using role-based hiera defs for this won't work.

I am working on fixing this (by having one unified, logically well structured top-down hiera hierarchy to search through).
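
A minimal sketch of why the lookup order matters here, assuming a first-match (priority) lookup. The level names, the `lookup` helper and the `"old"` default are hypothetical; only the `ganglia_class: "new"` key comes from the changes above.

```python
from typing import Any, Dict, List, Optional, Tuple

Level = Tuple[str, Dict[str, Any]]


def lookup(key: str, hierarchy: List[Level]) -> Optional[Tuple[str, Any]]:
    """Return (level name, value) from the first level that defines key."""
    for level_name, data in hierarchy:
        if key in data:
            return level_name, data[key]
    return None


common = {"ganglia_class": "old"}  # hypothetical site-wide default
role = {"ganglia_class": "new"}    # role-based setting, as in the changes above

# Role searched last: the default is found first and the role value never wins.
role_last = [("host", {}), ("common", common), ("role", role)]
# Role searched before the default: the role value takes effect.
role_before_common = [("host", {}), ("role", role), ("common", common)]

print(lookup("ganglia_class", role_last))
print(lookup("ganglia_class", role_before_common))
```

With the role level searched last, any earlier level that defines the same key shadows it, which matches the behaviour described in this comment.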

Change 207128 had a related patch set uploaded (by Giuseppe Lavagetto):
hiera: Add a proxy backend

https://gerrit.wikimedia.org/r/207128

Change 207128 merged by Giuseppe Lavagetto:
hiera: Add a proxy backend

https://gerrit.wikimedia.org/r/207128

Change 209388 had a related patch set uploaded (by Dzahn):
ganglia: switch PDF cluster to ganglia_new

https://gerrit.wikimedia.org/r/209388

Change 209389 had a related patch set uploaded (by Dzahn):
ganglia: switch ocg servers to ganglia_new

https://gerrit.wikimedia.org/r/209389

Change 209389 merged by Dzahn:
ganglia: switch ocg servers to ganglia_new

https://gerrit.wikimedia.org/r/209389

Change 209388 merged by Dzahn:
ganglia: switch PDF cluster to ganglia_new

https://gerrit.wikimedia.org/r/209388

@Joe, re: our IRC talk: to try switching a cluster I picked 'PDF' and made this change to switch it over, and made the hosts use ganglia_new here.

The result was that on ganglia-web/uranium in /etc/ganglia/gmetad.conf

data_source "PDF servers eqiad" ocg1001.eqiad.wmnet

becomes just

data_source "PDF servers eqiad"

and that config broke ganglia-web.

So I set it back to ocg1001, which unbroke it, but PDF has now disappeared because the hosts already use ganglia_new.
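
A small check like the following could catch a data_source line that lost its hosts before gmetad.conf is deployed. This is only a sketch: the function name is hypothetical, and it assumes the usual gmetad syntax of a quoted name, an optional numeric polling interval, then one or more hosts.

```python
import shlex
from typing import List


def data_sources_without_hosts(conf_text: str) -> List[str]:
    """Return the names of data_source entries that list no host."""
    empty = []
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        tokens = shlex.split(line)  # keeps the quoted cluster name as one token
        if len(tokens) < 2 or tokens[0] != "data_source":
            continue
        name, rest = tokens[1], tokens[2:]
        # An optional polling interval (a plain integer) may precede the hosts.
        if rest and rest[0].isdigit():
            rest = rest[1:]
        if not rest:
            empty.append(name)
    return empty


sample = (
    'data_source "Miscellaneous eqiad" carbon.wikimedia.org ms1004.eqiad.wmnet\n'
    'data_source "PDF servers eqiad"\n'
)
print(data_sources_without_hosts(sample))
```

For the two lines quoted in this thread, only "PDF servers eqiad" is reported.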

ocg, jobrunners, tmh, memcached, parsoid, parsoid-cache, redis, elasticsearch, rcstream and restbase have been switched now

the rest will follow soon.

this is the pattern that works now:

https://gerrit.wikimedia.org/r/#/c/216658/

https://gerrit.wikimedia.org/r/#/q/status:merged+project:operations/puppet+branch:production+topic:ganglia-new,n,z

Change 223230 had a related patch set uploaded (by Dzahn):
logstash: switch to ganglia_new

https://gerrit.wikimedia.org/r/223230

Change 223231 had a related patch set uploaded (by Dzahn):
ganglia: add aggregator for ulsfo on bast4001

https://gerrit.wikimedia.org/r/223231

Change 223230 merged by Dzahn:
logstash: switch to ganglia_new

https://gerrit.wikimedia.org/r/223230

eqiad is completely switched now.

logstash was the last cluster.
caveat: only the unicast hosts are in the ganglia cluster:

elasticsearch::unicast_hosts:
  - logstash1004.eqiad.wmnet
  - logstash1005.eqiad.wmnet
  - logstash1006.eqiad.wmnet

Change 223231 merged by Dzahn:
ganglia: add aggregator for ulsfo on bast4001

https://gerrit.wikimedia.org/r/223231

Change 223705 had a related patch set uploaded (by Dzahn):
ganglia_new: add aggregator config for ulsfo

https://gerrit.wikimedia.org/r/223705

Change 223705 merged by Dzahn:
ganglia_new: add aggregator config for ulsfo

https://gerrit.wikimedia.org/r/223705

logstash1001-1003: These hosts are older than 1004-1006, and run Precise instead of Jessie. Gmond wouldn't stop or start.

gage@logstash1002:~$ sudo /usr/sbin/gmond -f
[apache_status] Received the following parameters
{'url': 'http://127.0.0.1:80/server-status', 'collect_ssl': 'False', 'metric_group': 'apache'}
Fatal Python error: PyThreadState_Get: no current thread
Jul  9 00:06:37 logstash1002 kernel: [13984018.998725] init: ganglia-monitor main process (14196) killed by ABRT signal
Jul  9 00:06:37 logstash1002 kernel: [13984018.998751] init: ganglia-monitor main process ended, respawning
Jul  9 00:06:37 logstash1002 kernel: [13984019.086671] init: ganglia-monitor main process (14210) killed by ABRT signal
Jul  9 00:06:37 logstash1002 kernel: [13984019.086698] init: ganglia-monitor main process ended, respawning

The above error messages were unhelpful, so I used strace -f /usr/sbin/gmond -f, saw that gmond parses /etc/ganglia/conf.d/* before aborting, and then did a binary search to remove files in conf.d/ until the problem disappeared.

There was an existing config file snippet /etc/ganglia/conf.d/modpython.conf which declares "python_module". ganglia_new declares this module within gmond.conf, leading to duplicate declaration of this resource after these hosts were migrated. Solution was to kill gmond manually, remove that file, and restart the service normally. Now all hosts appear in the Ganglia UI.
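
Instead of bisecting by hand, a scan for repeated module declarations might find the same culprit. The sketch below is an assumption-laden illustration: the helper name and the sample file contents are hypothetical, and it assumes modules are declared as `name = "python_module"` with double quotes.

```python
import re
from typing import Dict, List


def files_declaring(files: Dict[str, str], module_name: str) -> List[str]:
    """Return the sorted paths whose text declares the given gmond module name."""
    pattern = re.compile(r'name\s*=\s*"' + re.escape(module_name) + r'"')
    return sorted(path for path, text in files.items() if pattern.search(text))


# Hypothetical contents; on a real host you would read /etc/ganglia/gmond.conf
# and every file under /etc/ganglia/conf.d/ instead.
files = {
    "/etc/ganglia/gmond.conf": 'modules {\n  module {\n    name = "python_module"\n  }\n}\n',
    "/etc/ganglia/conf.d/modpython.conf": 'modules {\n  module {\n    name = "python_module"\n  }\n}\n',
    "/etc/ganglia/conf.d/other.conf": 'modules {\n  module {\n    name = "cpu_module"\n  }\n}\n',
}

hits = files_declaring(files, "python_module")
if len(hits) > 1:
    print("python_module is declared in more than one file:", hits)
```

The regex does not skip commented-out lines, so a hit still needs a quick manual look.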

Thanks very much for helping @Gage. That fixed the logstash issues.


So, i have:

.. so far so good. that did add aggregators on bast4001 for all the ULSFO clusters:

[bast4001:/etc/ganglia/aggregators] $ grep name *
4002.conf:  name = "LVS loadbalancers ulsfo"
4020.conf:  name = "Text caches ulsfo"
4021.conf:  name = "Bits caches ulsfo"
4022.conf:  name = "Upload caches ulsfo"
4028.conf:  name = "Mobile caches ulsfo"
[bast4001:/etc/ganglia/aggregators] $ grep port * | uniq
4002.conf:  port = 12651
4020.conf:  port = 12669
4021.conf:  port = 12670
4022.conf:  port = 12671
4028.conf:  port = 12677

Then i tried to move on by switching one cluster over to ganglia_new, like we did in eqiad:

https://gerrit.wikimedia.org/r/#/c/223706/

But that resulted in uranium (ganglia-web) setting:

data_source "Mobile caches ulsfo" carbon.wikimedia.org:9677

That _should_ be bast4001.wikimedia.org:12677 instead.

Why does it use carbon even though the cluster is in ulsfo and the aggregator config was added?

Is it related to the $default_sites being just eqiad and codfw?
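
One observation that may help when checking the expected value: every port in the bast4001 listing equals 8649 (gmond's default port) plus the number in the aggregator file name. The sketch below only encodes that arithmetic with hypothetical helper names. Whether Puppet computes the ports exactly this way is an assumption, as is reading 9677 as the same cluster offset under a 1000 prefix for eqiad.

```python
GMOND_DEFAULT_PORT = 8649


def aggregator_port(conf_id: int) -> int:
    """Port implied by an aggregator file name such as 4028.conf (assumed rule)."""
    return GMOND_DEFAULT_PORT + conf_id


def expected_data_source(name: str, host: str, conf_id: int) -> str:
    """Build the gmetad data_source line one would expect for an aggregator."""
    return f'data_source "{name}" {host}:{aggregator_port(conf_id)}'


# File name -> port pairs copied from the grep output on bast4001.
listing = {4002: 12651, 4020: 12669, 4021: 12670, 4022: 12671, 4028: 12677}
for conf_id, port in listing.items():
    assert aggregator_port(conf_id) == port

print(expected_data_source("Mobile caches ulsfo", "bast4001.wikimedia.org", 4028))
print(9677 - GMOND_DEFAULT_PORT)  # offset implied by the carbon port
```

The carbon port corresponds to an offset of 1028 rather than 4028, which is consistent with the data source having been generated with a non-ulsfo aggregator setting.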

Change 225111 had a related patch set uploaded (by Dzahn):
ganglia_new: add aggregator setting for ULSFO

https://gerrit.wikimedia.org/r/225111

Change 225111 merged by Dzahn:
ganglia_new: add aggregator setting for ULSFO

https://gerrit.wikimedia.org/r/225111

^ that was the issue for ULSFO. I was able to switch the cluster "mobile caches ulsfo" successfully now :)

and since all of ULSFO was done, Giuseppe has merged my change to also switch ULSFO to default:

https://gerrit.wikimedia.org/r/#/c/225276/

and could then rename ganglia_new to ganglia

https://gerrit.wikimedia.org/r/#/c/225881/

so the ticket is done, ganglia(old) is no more and ganglia_new is now ganglia