As Mark said in the ops meeting today: clean up the ganglia puppet situation. We currently have different ways of setting up ganglia: remove ganglia(old) and merge into ganglia_new.
Description
Details
Status | Assigned | Task
---|---|---
Resolved | Dzahn | T93776 remove ganglia(old), replace with ganglia_new
Declined | RobH | T95792 hardware for global ganglia aggregator in eqiad
Resolved | BBlack | T104036 networking: adjust ACLs to allow analytics clusters to talk to new ganglia aggregator
Event Timeline
Change 199942 had a related patch set uploaded (by Dzahn):
move jobqueue monitoring out of ganglia.pp
Change 199943 had a related patch set uploaded (by Dzahn):
ganglia: remove class ganglia::logtailer
Change 198721 had a related patch set uploaded (by Dzahn):
ganglia: autogenerate datasources from the list of clusters
Change 198720 had a related patch set uploaded (by Dzahn):
ganglia: remove unused configs from ganglia::collector::config
Change 198720 merged by Giuseppe Lavagetto:
ganglia: remove unused configs from ganglia::collector::config
Change 198721 merged by Giuseppe Lavagetto:
ganglia: autogenerate datasources from the list of clusters
Change 204978 had a related patch set uploaded (by Dzahn):
make carbon a ganglia_new aggregator for eqiad
Change 204978 merged by Alexandros Kosiaris:
make carbon a ganglia_new aggregator for eqiad
carbon is now an aggregator for misc eqiad.
on uranium in /etc/ganglia/gmetad.conf there is:
data_source "Miscellaneous eqiad" carbon.wikimedia.org ms1004.eqiad.wmnet
i switched zirconium over to use ganglia_new with https://gerrit.wikimedia.org/r/#/c/205997/
Change 206043 had a related patch set uploaded (by Dzahn):
role::dumps::zim: hiera -> ganglia_class: "new"
Change 206047 had a related patch set uploaded (by Dzahn):
ganglia: role::dumps -> ganglia_new
switching over to ganglia_new per host:
including roles with the role keyword to allow role-based hiera lookup:
question: why does the approach to use roles not work yet on examples like:
https://gerrit.wikimedia.org/r/#/c/206047/
or
https://gerrit.wikimedia.org/r/#/c/206039/
Please note that right now the role lookup comes after *any* other lookup, so using role-based hiera defs for this won't work.
I am working on fixing this (by having one unified, logically well structured top-down hiera hierarchy to search through).
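To illustrate why lookup order matters here, a minimal sketch of a first-match hierarchy search. This is not Hiera's actual implementation; the level names and the "old" value are made up for the example, and only ganglia_class: "new" comes from the patches above.

```python
from typing import Any, Dict, List, Optional, Tuple

Level = Tuple[str, Dict[str, Any]]

def lookup(key: str, hierarchy: List[Level]) -> Optional[Tuple[str, Any]]:
    """Return (level_name, value) from the first level that defines key."""
    for level_name, data in hierarchy:
        if key in data:
            return level_name, data[key]
    return None

# Hypothetical data: the role sets ganglia_class, but a broader level does too.
role_level: Level = ("role", {"ganglia_class": "new"})
common_level: Level = ("common", {"ganglia_class": "old"})

# Role searched after everything else: the broader value shadows the role.
role_last = [common_level, role_level]
# Role searched before the broader level: the role-based value wins.
role_first = [role_level, common_level]

print(lookup("ganglia_class", role_last))
print(lookup("ganglia_class", role_first))
```

With the role level at the end of the search path, any earlier level that defines the same key hides the role-based setting, which is the behaviour described above.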
Change 207128 had a related patch set uploaded (by Giuseppe Lavagetto):
hiera: Add a proxy backend
Change 209388 had a related patch set uploaded (by Dzahn):
ganglia: switch PDF cluster to ganglia_new
Change 209389 had a related patch set uploaded (by Dzahn):
ganglia: switch ocg servers to ganglia_new
@Joe so re: our IRC talk. To try the switch of a cluster i picked 'PDF' and made this change to switch it over. And made the hosts use ganglia_new here.
The result was that on ganglia-web/uranium in /etc/ganglia/gmetad.conf
data_source "PDF servers eqiad" ocg1001.eqiad.wmnet
becomes just
data_source "PDF servers eqiad"
and that config broke ganglia-web
so i set it back to ocg1001 which unbroke it but PDF disappeared now because the hosts use ganglia_new already.
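A check along these lines could catch a host-less data_source before it reaches ganglia-web. This is only a sketch: the function name is made up, and it assumes the usual gmetad.conf form of data_source "name" [polling interval] host [host ...].

```python
import shlex
from typing import List

def empty_data_sources(conf_text: str) -> List[str]:
    """Return the names of data_source entries that list no hosts."""
    empty = []
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line.startswith("data_source"):
            continue  # also skips comments and blank lines
        tokens = shlex.split(line, comments=True)
        if len(tokens) < 2 or tokens[0] != "data_source":
            continue
        name, rest = tokens[1], tokens[2:]
        # An optional polling interval (bare integer) may follow the name.
        if rest and rest[0].isdigit():
            rest = rest[1:]
        if not rest:
            empty.append(name)
    return empty

sample = (
    'data_source "Miscellaneous eqiad" carbon.wikimedia.org ms1004.eqiad.wmnet\n'
    'data_source "PDF servers eqiad"\n'
)
print(empty_data_sources(sample))
```

Running it against the generated file before restarting gmetad would flag "PDF servers eqiad" in the broken config shown above.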
ocg, jobrunners, tmh, memcached, parsoid, parsoid-cache,redis, elasticsearch, rcstream and restbase have been switched now
the rest will follow soon.
this is the pattern that works now:
https://gerrit.wikimedia.org/r/#/c/216658/
https://gerrit.wikimedia.org/r/217191
https://gerrit.wikimedia.org/r/217872
https://gerrit.wikimedia.org/r/217874
https://gerrit.wikimedia.org/r/217875
https://gerrit.wikimedia.org/r/217877
https://gerrit.wikimedia.org/r/217879
https://gerrit.wikimedia.org/r/217897
https://gerrit.wikimedia.org/r/217900
https://gerrit.wikimedia.org/r/217948
https://gerrit.wikimedia.org/r/217950
https://gerrit.wikimedia.org/r/218396
https://gerrit.wikimedia.org/r/218400
https://gerrit.wikimedia.org/r/218402
https://gerrit.wikimedia.org/r/218398
https://gerrit.wikimedia.org/r/218570
https://gerrit.wikimedia.org/r/218571
https://gerrit.wikimedia.org/r/218573
https://gerrit.wikimedia.org/r/218575
https://gerrit.wikimedia.org/r/217445
https://gerrit.wikimedia.org/r/218580
https://gerrit.wikimedia.org/r/218932
https://gerrit.wikimedia.org/r/218934
https://gerrit.wikimedia.org/r/218944
https://gerrit.wikimedia.org/r/219020
https://gerrit.wikimedia.org/r/219024
https://gerrit.wikimedia.org/r/219026
https://gerrit.wikimedia.org/r/219030
https://gerrit.wikimedia.org/r/219069
https://gerrit.wikimedia.org/r/219071
https://gerrit.wikimedia.org/r/219122
https://gerrit.wikimedia.org/r/219073
https://gerrit.wikimedia.org/r/219072
https://gerrit.wikimedia.org/r/219268
most other clusters are done, including db, lvs and all appserver types
missing because of issues: analytics, fundraising
wip: misc
https://gerrit.wikimedia.org/r/#/c/217214/
https://gerrit.wikimedia.org/r/#/c/219418/
https://gerrit.wikimedia.org/r/#/c/219079/
https://gerrit.wikimedia.org/r/219429
https://gerrit.wikimedia.org/r/219074
https://gerrit.wikimedia.org/r/219075
https://gerrit.wikimedia.org/r/219464
https://gerrit.wikimedia.org/r/219070
Change 223230 had a related patch set uploaded (by Dzahn):
logstash: switch to ganglia_new
Change 223231 had a related patch set uploaded (by Dzahn):
ganglia: add aggregator for ulsfo on bast4001
eqiad is completely switched now.
logstash was the last cluster.
caveat: only the unicast hosts are in the ganglia cluster:
elasticsearch::unicast_hosts:
  - logstash1004.eqiad.wmnet
  - logstash1005.eqiad.wmnet
  - logstash1006.eqiad.wmnet
Change 223705 had a related patch set uploaded (by Dzahn):
ganglia_new: add aggregator config for ulsfo
logstash1001-1003: These hosts are older than 1004-1006, and run Precise instead of Jessie. Gmond wouldn't stop or start.
gage@logstash1002:~$ sudo /usr/sbin/gmond -f
[apache_status] Received the following parameters
{'url': 'http://127.0.0.1:80/server-status', 'collect_ssl': 'False', 'metric_group': 'apache'}
Fatal Python error: PyThreadState_Get: no current thread
Jul 9 00:06:37 logstash1002 kernel: [13984018.998725] init: ganglia-monitor main process (14196) killed by ABRT signal
Jul 9 00:06:37 logstash1002 kernel: [13984018.998751] init: ganglia-monitor main process ended, respawning
Jul 9 00:06:37 logstash1002 kernel: [13984019.086671] init: ganglia-monitor main process (14210) killed by ABRT signal
Jul 9 00:06:37 logstash1002 kernel: [13984019.086698] init: ganglia-monitor main process ended, respawning
The above error messages were unhelpful, so I used strace -f /usr/sbin/gmond -f, saw that gmond parses /etc/ganglia/conf.d/* before aborting, and then did a binary search to remove files in conf.d/ until the problem disappeared.
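For reference, the binary search over conf.d/ can be sketched as below. It assumes exactly one file triggers the abort and that a caller-supplied fails() callback can test gmond against a subset of files; both the function name and the callback are hypothetical, and all file names except modpython.conf are invented.

```python
from typing import Callable, List, Sequence

def find_culprit(files: Sequence[str],
                 fails: Callable[[Sequence[str]], bool]) -> str:
    """Bisect files down to the single one that makes fails() return True.

    Assumes the full set fails and exactly one file is responsible.
    """
    candidates: List[str] = list(files)
    if not candidates:
        raise ValueError("no files to search")
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        if fails(half):
            candidates = half                      # culprit is in the first half
        else:
            candidates = candidates[len(half):]    # otherwise in the second half
    return candidates[0]

# Stand-in for "start gmond with only these snippets and see if it aborts".
conf_d = ["a.conf", "b.conf", "modpython.conf", "c.conf", "d.conf"]
print(find_culprit(conf_d, lambda subset: "modpython.conf" in subset))
```

Each round halves the candidate set, so even a large conf.d/ needs only a handful of gmond restarts.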
There was an existing config file snippet /etc/ganglia/conf.d/modpython.conf which declares "python_module". ganglia_new declares this module within gmond.conf, leading to duplicate declaration of this resource after these hosts were migrated. Solution was to kill gmond manually, remove that file, and restart the service normally. Now all hosts appear in the Ganglia UI.
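A quick scan like the following could spot the same kind of duplicate on other hosts before migrating them. It is a rough sketch: the regex only handles simple module { name = "..." } blocks, the function name is made up, and the sample file contents are illustrative rather than copied from these hosts.

```python
import re
from collections import defaultdict
from typing import Dict, List

# Matches: module { ... name = "something" ... } without nested braces.
MODULE_NAME = re.compile(r'\bmodule\s*\{[^{}]*?\bname\s*=\s*"([^"]+)"')

def duplicate_modules(configs: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each module name declared more than once to the files declaring it."""
    seen: Dict[str, List[str]] = defaultdict(list)
    for filename, text in configs.items():
        for name in MODULE_NAME.findall(text):
            seen[name].append(filename)
    return {name: files for name, files in seen.items() if len(files) > 1}

# Illustrative contents only.
configs = {
    "gmond.conf": 'modules {\n  module {\n    name = "python_module"\n'
                  '    path = "modpython.so"\n  }\n}\n',
    "conf.d/modpython.conf": 'modules {\n  module {\n    name = "python_module"\n'
                             '    path = "modpython.so"\n  }\n}\n',
}
print(duplicate_modules(configs))
```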
Thanks very much for helping @Gage. That fixed the logstash issues.
So, i have:
- added aggregator for ULSFO on bast4001 https://gerrit.wikimedia.org/r/#/c/223231/
- made it use ganglia_new https://gerrit.wikimedia.org/r/#/c/223703/
- added config to ganglia_new for ULSFO to be used https://gerrit.wikimedia.org/r/#/c/223705/
.. so far so good. that did add aggregators on bast4001 for all the ULSFO clusters:
[bast4001:/etc/ganglia/aggregators] $ grep name *
4002.conf: name = "LVS loadbalancers ulsfo"
4020.conf: name = "Text caches ulsfo"
4021.conf: name = "Bits caches ulsfo"
4022.conf: name = "Upload caches ulsfo"
4028.conf: name = "Mobile caches ulsfo"
[bast4001:/etc/ganglia/aggregators] $ grep port * | uniq
4002.conf: port = 12651
4020.conf: port = 12669
4021.conf: port = 12670
4022.conf: port = 12671
4028.conf: port = 12677
Then i tried to move on by switching one cluster over to ganglia_new, like we did in eqiad:
https://gerrit.wikimedia.org/r/#/c/223706/
But that resulted in uranium (ganglia-web) setting:
data_source "Mobile caches ulsfo" carbon.wikimedia.org:9677
That _should_ be bast4001.wikimedia.org:12677 instead.
Why does it use carbon even though the cluster is in ulsfo and the ulsfo config was added?
Is it related to the $default_sites being just eqiad and codfw?
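As a side note on the port numbers: the values in the listing above are consistent with port = 8649 (gmond's default port) plus the numeric id in the aggregator file name. That relation is inferred from the output here, not confirmed against the puppet code, and the function name is made up. A small sketch to check it:

```python
GMOND_BASE_PORT = 8649  # gmond's default port

def aggregator_port(aggregator_id: int) -> int:
    """Port implied by the observed pattern: base port plus aggregator id."""
    return GMOND_BASE_PORT + aggregator_id

# id -> port pairs taken from the grep output on bast4001.
observed = {4002: 12651, 4020: 12669, 4021: 12670, 4022: 12671, 4028: 12677}

for agg_id, port in observed.items():
    assert aggregator_port(agg_id) == port

# Under the same pattern, the wrong port 9677 corresponds to id 1028.
print(aggregator_port(4028), aggregator_port(1028))
```

If the pattern holds, 9677 on carbon is what an id of 1028 would produce, which fits gmetad falling back to eqiad-style settings for the ulsfo cluster.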
Change 225111 had a related patch set uploaded (by Dzahn):
ganglia_new: add aggregator setting for ULSFO
^ that was the issue for ULSFO. I was able to switch the cluster "mobile caches ulsfo" successfully now :)
merged:
https://gerrit.wikimedia.org/r/#/c/225268/
https://gerrit.wikimedia.org/r/#/c/225272/
https://gerrit.wikimedia.org/r/#/c/225273/
https://gerrit.wikimedia.org/r/#/c/225274/
https://gerrit.wikimedia.org/r/#/c/225275/
open:
https://gerrit.wikimedia.org/r/#/c/225276/
all ULSFO clusters switched :))
and since all of ULSFO was done, Giuseppe has merged my change to also switch ULSFO to default:
https://gerrit.wikimedia.org/r/#/c/225276/
and could then rename ganglia_new to ganglia
https://gerrit.wikimedia.org/r/#/c/225881/
so the ticket is done, ganglia(old) is no more and ganglia_new is now ganglia