Our Ganglia module is currently upstart-specific and aggregators running on precise/trusty cannot be upgraded to jessie until we port it to systemd.
Description
Details
| Status | Assigned | Task |
|---|---|---|
| Resolved | Dzahn | T123525 reduce amount of remaining Ubuntu 12.04 (precise) systems in production |
| Resolved | faidon | T123674 reinstall bast4001 with jessie |
| Resolved | Dzahn | T123712 Reimage hooft with jessie and rename to bast3001 |
| Resolved | hashar | T97402 Make services manageable by systemd |
| Resolved | Dzahn | T124197 Port Ganglia aggregator setup to systemd |
Event Timeline
Is https://github.com/wikimedia/operations-puppet/blob/production/modules/ganglia/templates/gmetad.upstart what should be converted to systemd?
I have not tested the following at all (I put it together quickly from the manuals, so it might not work), but hopefully it is still worth trying:
[Unit]
Description=Ganglia Metadata Daemon

[Service]
Type=forking
PIDFile=/var/run/gmetad.pid
Environment="PIDFILE=/var/run/gmetad.pid"
Environment="RRDCACHED_ADDRESS=<%= @rrdcached_socket %>"
Restart=on-failure
ExecStart=/usr/sbin/gmetad --pid-file=${PIDFILE}

[Install]
WantedBy=multi-user.target
See also https://github.com/ganglia/monitor-core/blob/master/gmetad/gmetad.service.in. It would be cool if someone can test it!
Actually, I guess you meant https://github.com/wikimedia/operations-puppet/blob/production/modules/ganglia/files/upstart/ganglia-monitor-aggregator.conf and https://github.com/wikimedia/operations-puppet/blob/production/modules/ganglia/files/upstart/ganglia-monitor-aggregator-instance.conf. I should have spent more time looking at the ganglia module, sorry. Discard my previous comment then.
Change 277340 had a related patch set uploaded (by Dzahn):
ganglia: on jessie, spawn aggregators with systemd
Change 277347 had a related patch set uploaded (by Dzahn):
ganglia: fix systemd instance service name
instances now get spawned by puppet on alsafi:
Notice: /Stage[main]/Ganglia::Monitor::Aggregator/Ganglia::Monitor::Aggregator::Site_instances[codfw]/Ganglia::Monitor::Aggregator::Instance[redis_codfw]/Service[ganglia-monitor-aggregator@2039.service]/ensure: ensure changed 'stopped' to 'running'
Info: /Stage[main]/Ganglia::Monitor::Aggregator/Ganglia::Monitor::Aggregator::Site_instances[codfw]/Ganglia::Monitor::Aggregator::Instance[redis_codfw]/Service[ganglia-monitor-aggregator@2039.service]: Unscheduling refresh on Service[ganglia-monitor-aggregator@2039.service]
Notice: /Stage[main]/Ganglia::Monitor::Aggregator/Ganglia::Monitor::Aggregator::Site_instances[codfw]/Ganglia::Monitor::Aggregator::Instance[rcstream_codfw]/Service[ganglia-monitor-aggregator@2044.service]/ensure: ensure changed 'stopped' to 'running'
Info: /Stage[main]/Ganglia::Monitor::Aggregator/Ganglia::Monitor::Aggregator::Site_instances[codfw]/Ganglia::Monitor::Aggregator::Instance[rcstream_codfw]/Service[ganglia-monitor-aggregator@2044.service]: Unscheduling refresh on Service[ganglia-monitor-aggregator@2044.service]
Notice: /Stage[main]/Ganglia::Monitor::Aggregator/Ganglia::Monitor::Aggregator::Site_instances[codfw]/Ganglia::Monitor::Aggregator::Instance[memcached_codfw]/Service[ganglia-monitor-aggregator@2033.service]/ensure: ensure changed 'stopped' to 'running'
Info: /Stage[main]/Ganglia::Monitor::Aggregator/Ganglia::Monitor::Aggregator::Site_instances[codfw]/Ganglia::Monitor::Aggregator::Instance[memcached_codfw]/Service[ganglia-monitor-aggregator@2033.service]: Unscheduling refresh on Service[ganglia-monitor-aggregator@2033.service]
...
..
ganglia 3941 0.0 0.3 48724 3236 ? Ssl 21:13 0:00 /usr/sbin/gmond -c /etc/ganglia/aggregators/2048.conf -p /var/run/gmond-2048.pid
ganglia 4059 0.0 0.3 48724 3240 ? Ssl 21:13 0:00 /usr/sbin/gmond -c /etc/ganglia/aggregators/2022.conf -p /var/run/gmond-2022.pid
ganglia 4151 0.0 0.3 48724 3260 ? Ssl 21:13 0:00 /usr/sbin/gmond -c /etc/ganglia/aggregators/2040.conf -p /var/run/gmond-2040.pid
Change 277354 had a related patch set uploaded (by Dzahn):
ganglia: do not start meta-service on jessie/systemd
Change 277451 had a related patch set uploaded (by Dzahn):
ganglia: don't install old init script if systemd is used
Change 277451 merged by Dzahn:
ganglia: don't install old init scripts if systemd is used
Change 277455 had a related patch set uploaded (by Dzahn):
ganglia: no dependency for old upstart service on systemd
Change 277455 merged by Dzahn:
ganglia: no dependency for old upstart service on systemd
Change 277458 had a related patch set uploaded (by Dzahn):
ganglia: fix me - service notify systemd (WIP)
This works now. You can see it on alsafi.
I can killall -u ganglia, run puppet, and puppet starts all the services:
ganglia 459 0.0 0.3 48724 3164 ? Ssl 18:58 0:00 /usr/sbin/gmond -c /etc/ganglia/aggregators/2020.conf -p /var/run/gmond-2020.pid
ganglia 460 0.0 0.2 48724 3000 ? Ssl 18:58 0:00 /usr/sbin/gmond -c /etc/ganglia/aggregators/2005.conf -p /var/run/gmond-2005.pid
ganglia 471 0.0 0.3 48724 3196 ? Ssl 18:58 0:00 /usr/sbin/gmond -c /etc/ganglia/aggregators/2034.conf -p /var/run/gmond-2034.pid
ganglia 477 0.0 0.3 48724 3196 ? Ssl 18:58 0:00 /usr/sbin/gmond -c /etc/ganglia/aggregators/2052.conf -p /var/run/gmond-2052.pid
ganglia 479 0.0 0.3 48724 3156 ? Ssl 18:58 0:00 /usr/sbin/gmond -c /etc/ganglia/aggregators/2050.conf -p /var/run/gmond-2050.pid
ganglia 485 0.0 0.3 48724 3196 ? Ssl 18:58 0:00 /usr/sbin/gmond -c /etc/ganglia/aggregators/2048.conf -p /var/run/gmond-2048.pid
ganglia 493 0.0 0.3 48724 3164 ? Ssl 18:58 0:00 /usr/sbin/gmond -c /etc/ganglia/aggregators/2011.conf -p /var/run/gmond-2011.pid
ganglia 495 0.0 0.3 48724 3068 ? Ssl 18:58 0:00 /usr/sbin/gmond -c /etc/ganglia/aggregators/2040.conf -p /var/run/gmond-2040.pid
ganglia 504 0.0 0.3 48724 3288 ? Ssl 18:58 0:00 /usr/sbin/gmond -c /etc/ganglia/aggregators/2022.conf -p /var/run/gmond-2022.pid
ganglia 506 0.0 0.2 48724 2980 ? Ssl 18:58 0:00 /usr/sbin/gmond -c /etc/ganglia/aggregators/2002.conf -p /var/run/gmond-2002.pid
ganglia 513 0.0 0.2 48724 3024 ? Ssl 18:58 0:00 /usr/sbin/gmond -c /etc/ganglia/aggregators/2051.conf -p /var/run/gmond-2051.pid
ganglia 518 0.0 0.3 48724 3220 ? Ssl 18:58 0:00 /usr/sbin/gmond -c /etc/ganglia/aggregators/2027.conf -p /var/run/gmond-2027.pid
ganglia 522 0.0 0.3 48724 3288 ? Ssl 18:58 0:00 /usr/sbin/gmond -c /etc/ganglia/aggregators/2008.conf -p /var/run/gmond-2008.pid
ganglia 523 0.0 0.3 48724 3164 ? Ssl 18:58 0:00 /usr/sbin/gmond -c /etc/ganglia/aggregators/2013.conf -p /var/run/gmond-2013.pid
ganglia 534 0.0 0.3 48724 3212 ? Ssl 18:58 0:00 /usr/sbin/gmond -c /etc/ganglia/aggregators/2012.conf -p /var/run/gmond-2012.pid
ganglia 536 0.0 0.3 48724 3140 ? Ssl 18:58 0:00 /usr/sbin/gmond -c /etc/ganglia/aggregators/2055.conf -p /var/run/gmond-2055.pid
ganglia 540 0.0 0.3 48724 3168 ? Ssl 18:58 0:00 /usr/sbin/gmond -c /etc/ganglia/aggregators/2033.conf -p /var/run/gmond-2033.pid
ganglia 544 0.0 0.3 48724 3196 ? Ssl 18:58 0:00 /usr/sbin/gmond -c /etc/ganglia/aggregators/2039.conf -p /var/run/gmond-2039.pid
ganglia 549 0.0 0.3 48724 3200 ? Ssl 18:58 0:00 /usr/sbin/gmond -c /etc/ganglia/aggregators/2031.conf -p /var/run/gmond-2031.pid
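Comparing a process listing like this against the config directory by eye is error-prone, so here is a rough sketch of how the check could be scripted. missing_aggregators is a hypothetical helper (nothing like it exists in the puppet module); it assumes one <id>.conf per aggregator and that each gmond is started with "-c <conf_dir>/<id>.conf" followed by more arguments, as in the listing above.

```shell
#!/usr/bin/env bash
# Sketch only: print aggregator IDs that have a config file but no matching
# gmond process. missing_aggregators is a hypothetical helper. It reads a
# process listing on stdin and takes the config directory (without a
# trailing slash) as its only argument.
missing_aggregators() {
    local conf_dir=$1 listing conf id
    listing=$(cat)
    for conf in "$conf_dir"/*.conf; do
        [ -e "$conf" ] || continue          # directory has no .conf files
        id=$(basename "$conf" .conf)
        # Fixed-string match on the "-c <config> " part of the command line.
        if ! printf '%s\n' "$listing" | grep -qF -- "-c $conf_dir/$id.conf "; then
            printf '%s\n' "$id"
        fi
    done
}
```

After a puppet run, something like "ps -u ganglia -o args= | missing_aggregators /etc/ganglia/aggregators" should then print nothing if every configured aggregator came back.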
I also tested rebooting the server, and the services come back fine.
template unit file from which instances are spawned:
root@alsafi:/# cat /etc/systemd/system/ganglia-monitor-aggregator\@.service
[Unit]
Description=Ganglia monitor aggregator (cluster %I)
Documentation=https://wikitech.wikimedia.org/wiki/Ganglia
After=network.target

[Service]
Type=simple
ExecStart=/usr/sbin/gmond -c /etc/ganglia/aggregators/%i.conf -p /var/run/gmond-%i.pid
SyslogIdentifier=ganglia-monitor-aggregator-%i

[Install]
WantedBy=multi-user.target
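For anyone not used to template units: systemd replaces %i with whatever comes after the @ in the instance name (and %I with the unescaped form of it). A small sketch of that substitution for one instance is below. expand_instance is a hypothetical helper written only for illustration; it ignores systemd's escaping rules, which should not matter for purely numeric IDs like ours.

```shell
#!/usr/bin/env bash
# Sketch only: mimic how systemd fills in %i / %I for a template unit.
# expand_instance is a hypothetical helper, not a systemd tool. It assumes
# the instance name contains no "/" or "&" (true for numeric IDs).
expand_instance() {
    local template=$1 instance=$2
    printf '%s\n' "$template" | sed "s/%[iI]/${instance}/g"
}

exec_start='/usr/sbin/gmond -c /etc/ganglia/aggregators/%i.conf -p /var/run/gmond-%i.pid'
expand_instance "$exec_start" 2002
# prints: /usr/sbin/gmond -c /etc/ganglia/aggregators/2002.conf -p /var/run/gmond-2002.pid
```

So starting ganglia-monitor-aggregator@2002.service runs gmond against 2002.conf, which matches what the process listing shows.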
puppet installing one of them for each instance in instance.pp:
if $::initsystem == 'systemd' {
    service { "ganglia-monitor-aggregator@${id}.service":
        ensure   => running,
        provider => systemd,
        enable   => true,
    }
}
snippet from systemctl status:
└─system-ganglia\x2dmonitor\x2daggregator.slice
  ├─ganglia-monitor-aggregator@2050.service
  │ └─479 /usr/sbin/gmond -c /etc/ganglia/aggregators/2050.conf -p /var/run/gmond-2050.pid
  ├─ganglia-monitor-aggregator@2031.service
  │ └─549 /usr/sbin/gmond -c /etc/ganglia/aggregators/2031.conf -p /var/run/gmond-2031.pid
  ├─ganglia-monitor-aggregator@2052.service
  │ └─477 /usr/sbin/gmond -c /etc/ganglia/aggregators/2052.conf -p /var/run/gmond-2052.pid
  ├─ganglia-monitor-aggregator@2012.service
  │ └─534 /usr/sbin/gmond -c /etc/ganglia/aggregators/2012.conf -p /var/run/gmond-2012.pid
  ├─ganglia-monitor-aggregator@2033.service
  │ └─540 /usr/sbin/gmond -c /etc/ganglia/aggregators/2033.conf -p /var/run/gmond-2033.pid
  ├─ganglia-monitor-aggregator@2027.service
  │ └─518 /usr/sbin/gmond -c /etc/ganglia/aggregators/2027.conf -p /var/run/gmond-2027.pid
  ├─ganglia-monitor-aggregator@2020.service
status of one random aggregator; each is its own service now:
root@alsafi:/# systemctl status ganglia-monitor-aggregator@2002.service
● ganglia-monitor-aggregator@2002.service - Ganglia monitor aggregator (cluster 2002)
Loaded: loaded (/etc/systemd/system/ganglia-monitor-aggregator@.service; enabled)
Active: active (running) since Tue 2016-03-15 18:58:14 UTC; 9min ago
Docs: https://wikitech.wikimedia.org/wiki/Ganglia
Main PID: 506 (gmond)
CGroup: /system.slice/system-ganglia\x2dmonitor\x2daggregator.slice/ganglia-monitor-aggregator@2002.service
└─506 /usr/sbin/gmond -c /etc/ganglia/aggregators/2002.conf -p /var/run/gmond-2002.pid
Mar 15 18:58:14 alsafi systemd[1]: Started Ganglia monitor aggregator (cluster 2002).
ehmm.. everything was alright earlier. Then an error popped up about puppet not being able to start ganglia-monitor-service (that is not the aggregator service this whole ticket is about). I just stopped and started it and ran puppet again, and the issue disappeared. Not sure yet, but it looks like the service just crashed.
Can't reproduce. I can killall -u ganglia and run puppet, and everything comes back normally without issue. Tried multiple times.
alsafi needed to be rebooted today and several of the aggregators failed to start (see "systemctl list-units | grep failed")
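To narrow that down to just the aggregator instances, a filter along these lines might help. This is only a sketch: failed_aggregators is a hypothetical helper, and it assumes the usual list-units column order (UNIT, LOAD, ACTIVE, SUB, DESCRIPTION) with the header and bullet markers suppressed.

```shell
#!/usr/bin/env bash
# Sketch only: print the IDs of aggregator instances whose ACTIVE state is
# "failed". failed_aggregators is a hypothetical helper that reads unit
# listing lines on stdin, with the unit name in column 1 and the ACTIVE
# state in column 3.
failed_aggregators() {
    awk '$1 ~ /^ganglia-monitor-aggregator@.+\.service$/ && $3 == "failed" {
        id = $1
        sub(/^ganglia-monitor-aggregator@/, "", id)   # strip unit prefix
        sub(/\.service$/, "", id)                     # strip unit suffix
        print id
    }'
}
```

It could be fed with something like "systemctl list-units --all --no-legend --plain | failed_aggregators", assuming the installed systemd supports those options.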
alsafi wasn't supposed to have the aggregator anymore. That class was applied to it in the past for testing but then removed. This issue popped up because a file in /etc/systemd/system was not removed when the puppet class was removed. I deleted it manually and puppet runs on alsafi are normal again. It should only have the regular ganglia from base, no aggregators.
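Since puppet does not clean up after a removed class on its own, a check like the following could catch leftovers of this kind. It is a sketch under assumptions: stale_aggregator_units is a hypothetical helper, and it treats an instance unit (or enable symlink) as stale when no <id>.conf exists for it in the aggregator config directory.

```shell
#!/usr/bin/env bash
# Sketch only: list aggregator instance units or enable symlinks under a
# systemd directory that have no matching <id>.conf in the config directory.
# stale_aggregator_units is a hypothetical helper; it only prints paths and
# never deletes anything.
stale_aggregator_units() {
    local systemd_dir=$1 conf_dir=$2 unit id
    find "$systemd_dir" -name 'ganglia-monitor-aggregator@*.service' | sort |
    while IFS= read -r unit; do
        id=$(basename "$unit" .service)
        id=${id#ganglia-monitor-aggregator@}
        # The template itself has an empty instance name; skip it.
        [ -n "$id" ] || continue
        [ -e "$conf_dir/$id.conf" ] || printf '%s\n' "$unit"
    done
}
```

For example, "stale_aggregator_units /etc/systemd/system /etc/ganglia/aggregators" would print candidates for manual review on a host like alsafi.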