Port Ganglia aggregator setup to systemd
Closed, Resolved (Public)

Description

Our Ganglia module is currently upstart-specific and aggregators running on precise/trusty cannot be upgraded to jessie until we port it to systemd.

Event Timeline

faidon raised the priority of this task to Medium.
faidon updated the task description.
faidon added a project: SRE.
faidon added subscribers: Krenair, Southparkfan, Aklapper and 2 others.

Is https://github.com/wikimedia/operations-puppet/blob/production/modules/ganglia/templates/gmetad.upstart what should be converted to systemd?

I have not tested the following at all (I put it together quickly from the manuals, so it might not even work), but hopefully it is still worth trying:

[Unit]
Description=Ganglia Metadata Daemon

[Service]
Type=forking
PIDFile=/var/run/gmetad.pid
Environment="PIDFILE=/var/run/gmetad.pid"
Environment="RRDCACHED_ADDRESS=<%= @rrdcached_socket %>"
Restart=on-failure
ExecStart=/usr/sbin/gmetad --pid-file=${PIDFILE}

[Install]
WantedBy=multi-user.target

See also https://github.com/ganglia/monitor-core/blob/master/gmetad/gmetad.service.in. It would be great if someone could test it!

Change 277340 had a related patch set uploaded (by Dzahn):
ganglia: on jessie, spawn aggregators with systemd

https://gerrit.wikimedia.org/r/277340

Change 277340 merged by Dzahn:
ganglia: on jessie, spawn aggregators with systemd

https://gerrit.wikimedia.org/r/277340

Change 277347 had a related patch set uploaded (by Dzahn):
ganglia: fix systemd instance service name

https://gerrit.wikimedia.org/r/277347

Change 277347 merged by Dzahn:
ganglia: fix systemd instance service name

https://gerrit.wikimedia.org/r/277347

Instances now get spawned by puppet on alsafi:

Notice: /Stage[main]/Ganglia::Monitor::Aggregator/Ganglia::Monitor::Aggregator::Site_instances[codfw]/Ganglia::Monitor::Aggregator::Instance[redis_codfw]/Service[ganglia-monitor-aggregator@2039.service]/ensure: ensure changed 'stopped' to 'running'
Info: /Stage[main]/Ganglia::Monitor::Aggregator/Ganglia::Monitor::Aggregator::Site_instances[codfw]/Ganglia::Monitor::Aggregator::Instance[redis_codfw]/Service[ganglia-monitor-aggregator@2039.service]: Unscheduling refresh on Service[ganglia-monitor-aggregator@2039.service]
Notice: /Stage[main]/Ganglia::Monitor::Aggregator/Ganglia::Monitor::Aggregator::Site_instances[codfw]/Ganglia::Monitor::Aggregator::Instance[rcstream_codfw]/Service[ganglia-monitor-aggregator@2044.service]/ensure: ensure changed 'stopped' to 'running'
Info: /Stage[main]/Ganglia::Monitor::Aggregator/Ganglia::Monitor::Aggregator::Site_instances[codfw]/Ganglia::Monitor::Aggregator::Instance[rcstream_codfw]/Service[ganglia-monitor-aggregator@2044.service]: Unscheduling refresh on Service[ganglia-monitor-aggregator@2044.service]
Notice: /Stage[main]/Ganglia::Monitor::Aggregator/Ganglia::Monitor::Aggregator::Site_instances[codfw]/Ganglia::Monitor::Aggregator::Instance[memcached_codfw]/Service[ganglia-monitor-aggregator@2033.service]/ensure: ensure changed 'stopped' to 'running'
Info: /Stage[main]/Ganglia::Monitor::Aggregator/Ganglia::Monitor::Aggregator::Site_instances[codfw]/Ganglia::Monitor::Aggregator::Instance[memcached_codfw]/Service[ganglia-monitor-aggregator@2033.service]: Unscheduling refresh on Service[ganglia-monitor-aggregator@2033.service]
...

..

ganglia   3941  0.0  0.3  48724  3236 ?        Ssl  21:13   0:00 /usr/sbin/gmond -c /etc/ganglia/aggregators/2048.conf -p /var/run/gmond-2048.pid
ganglia   4059  0.0  0.3  48724  3240 ?        Ssl  21:13   0:00 /usr/sbin/gmond -c /etc/ganglia/aggregators/2022.conf -p /var/run/gmond-2022.pid
ganglia   4151  0.0  0.3  48724  3260 ?        Ssl  21:13   0:00 /usr/sbin/gmond -c /etc/ganglia/aggregators/2040.conf -p /var/run/gmond-2040.pid

Change 277354 had a related patch set uploaded (by Dzahn):
ganglia: do not start meta-service on jessie/systemd

https://gerrit.wikimedia.org/r/277354

Change 277451 had a related patch set uploaded (by Dzahn):
ganglia: don't install old init script if systemd is used

https://gerrit.wikimedia.org/r/277451

Change 277451 merged by Dzahn:
ganglia: don't install old init scripts if systemd is used

https://gerrit.wikimedia.org/r/277451

Change 277455 had a related patch set uploaded (by Dzahn):
ganglia: no dependency for old upstart service on systemd

https://gerrit.wikimedia.org/r/277455

Change 277455 merged by Dzahn:
ganglia: no dependency for old upstart service on systemd

https://gerrit.wikimedia.org/r/277455

Change 277354 merged by Dzahn:
ganglia: do not start meta-service on jessie/systemd

https://gerrit.wikimedia.org/r/277354

Change 277458 had a related patch set uploaded (by Dzahn):
ganglia: fix me - service notify systemd (WIP)

https://gerrit.wikimedia.org/r/277458

Change 277458 merged by Dzahn:
ganglia: fix up for aggregator service on systemd

https://gerrit.wikimedia.org/r/277458

This works now. You can see it on alsafi.

I can run killall -u ganglia, then run puppet, and puppet starts all the services again:

ganglia    459  0.0  0.3  48724  3164 ?        Ssl  18:58   0:00 /usr/sbin/gmond -c /etc/ganglia/aggregators/2020.conf -p /var/run/gmond-2020.pid
ganglia    460  0.0  0.2  48724  3000 ?        Ssl  18:58   0:00 /usr/sbin/gmond -c /etc/ganglia/aggregators/2005.conf -p /var/run/gmond-2005.pid
ganglia    471  0.0  0.3  48724  3196 ?        Ssl  18:58   0:00 /usr/sbin/gmond -c /etc/ganglia/aggregators/2034.conf -p /var/run/gmond-2034.pid
ganglia    477  0.0  0.3  48724  3196 ?        Ssl  18:58   0:00 /usr/sbin/gmond -c /etc/ganglia/aggregators/2052.conf -p /var/run/gmond-2052.pid
ganglia    479  0.0  0.3  48724  3156 ?        Ssl  18:58   0:00 /usr/sbin/gmond -c /etc/ganglia/aggregators/2050.conf -p /var/run/gmond-2050.pid
ganglia    485  0.0  0.3  48724  3196 ?        Ssl  18:58   0:00 /usr/sbin/gmond -c /etc/ganglia/aggregators/2048.conf -p /var/run/gmond-2048.pid
ganglia    493  0.0  0.3  48724  3164 ?        Ssl  18:58   0:00 /usr/sbin/gmond -c /etc/ganglia/aggregators/2011.conf -p /var/run/gmond-2011.pid
ganglia    495  0.0  0.3  48724  3068 ?        Ssl  18:58   0:00 /usr/sbin/gmond -c /etc/ganglia/aggregators/2040.conf -p /var/run/gmond-2040.pid
ganglia    504  0.0  0.3  48724  3288 ?        Ssl  18:58   0:00 /usr/sbin/gmond -c /etc/ganglia/aggregators/2022.conf -p /var/run/gmond-2022.pid
ganglia    506  0.0  0.2  48724  2980 ?        Ssl  18:58   0:00 /usr/sbin/gmond -c /etc/ganglia/aggregators/2002.conf -p /var/run/gmond-2002.pid
ganglia    513  0.0  0.2  48724  3024 ?        Ssl  18:58   0:00 /usr/sbin/gmond -c /etc/ganglia/aggregators/2051.conf -p /var/run/gmond-2051.pid
ganglia    518  0.0  0.3  48724  3220 ?        Ssl  18:58   0:00 /usr/sbin/gmond -c /etc/ganglia/aggregators/2027.conf -p /var/run/gmond-2027.pid
ganglia    522  0.0  0.3  48724  3288 ?        Ssl  18:58   0:00 /usr/sbin/gmond -c /etc/ganglia/aggregators/2008.conf -p /var/run/gmond-2008.pid
ganglia    523  0.0  0.3  48724  3164 ?        Ssl  18:58   0:00 /usr/sbin/gmond -c /etc/ganglia/aggregators/2013.conf -p /var/run/gmond-2013.pid
ganglia    534  0.0  0.3  48724  3212 ?        Ssl  18:58   0:00 /usr/sbin/gmond -c /etc/ganglia/aggregators/2012.conf -p /var/run/gmond-2012.pid
ganglia    536  0.0  0.3  48724  3140 ?        Ssl  18:58   0:00 /usr/sbin/gmond -c /etc/ganglia/aggregators/2055.conf -p /var/run/gmond-2055.pid
ganglia    540  0.0  0.3  48724  3168 ?        Ssl  18:58   0:00 /usr/sbin/gmond -c /etc/ganglia/aggregators/2033.conf -p /var/run/gmond-2033.pid
ganglia    544  0.0  0.3  48724  3196 ?        Ssl  18:58   0:00 /usr/sbin/gmond -c /etc/ganglia/aggregators/2039.conf -p /var/run/gmond-2039.pid
ganglia    549  0.0  0.3  48724  3200 ?        Ssl  18:58   0:00 /usr/sbin/gmond -c /etc/ganglia/aggregators/2031.conf -p /var/run/gmond-2031.pid

I also tested rebooting the server: the services come back fine.
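
The kill / puppet run / reboot checks above could be scripted roughly as follows. This is only a sketch, not something from the puppet module: the function name is made up, and it assumes that each gmond writes the pid file given by its -p option (as in the ps output above) and that the check runs as root, since kill -0 fails for processes owned by another user.

```shell
#!/bin/sh
# Hypothetical check: list aggregator ids that have a config file but no
# live gmond process recorded in the matching pid file.
missing_aggregators() {
    confdir=${1:-/etc/ganglia/aggregators}
    piddir=${2:-/var/run}
    for conf in "$confdir"/*.conf; do
        [ -e "$conf" ] || continue   # glob matched nothing
        id=$(basename "$conf" .conf)
        pidfile="$piddir/gmond-$id.pid"
        # kill -0 only tests whether the process exists; it sends no signal.
        if [ -r "$pidfile" ] && kill -0 "$(cat "$pidfile")" 2>/dev/null; then
            continue
        fi
        echo "$id"
    done
}
```

Calling missing_aggregators with no arguments uses the paths seen on alsafi; empty output would mean every configured aggregator has a running process.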

Template unit file from which instances are spawned:

root@alsafi:/# cat /etc/systemd/system/ganglia-monitor-aggregator\@.service 
[Unit]
Description=Ganglia monitor aggregator (cluster %I)
Documentation=https://wikitech.wikimedia.org/wiki/Ganglia
After=network.target

[Service]
Type=simple
ExecStart=/usr/sbin/gmond -c /etc/ganglia/aggregators/%i.conf -p /var/run/gmond-%i.pid
SyslogIdentifier=ganglia-monitor-aggregator-%i

[Install]
WantedBy=multi-user.target

Puppet sets up one such service for each instance, in instance.pp:

if $::initsystem == 'systemd' {
    service { "ganglia-monitor-aggregator@${id}.service":
        ensure   => running,
        provider => systemd,
        enable   => true,
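
To check by hand which instance units puppet should end up managing, the mapping from config file to unit name can be sketched in shell. This is an illustration only: the function names are hypothetical, and it assumes one <id>.conf per aggregator under /etc/ganglia/aggregators, as the gmond command lines suggest.

```shell
#!/bin/sh
# Hypothetical helpers (not part of the puppet module): map each aggregator
# config file to the systemd instance unit spawned from the template.

# Print the unit name for one config path, e.g. ".../2039.conf".
unit_for_conf() {
    id=$(basename "$1" .conf)
    printf 'ganglia-monitor-aggregator@%s.service\n' "$id"
}

# Print one unit name per *.conf file found in the given directory.
# The default directory is taken from the gmond command lines in this task.
list_aggregator_units() {
    dir=${1:-/etc/ganglia/aggregators}
    for conf in "$dir"/*.conf; do
        [ -e "$conf" ] || continue   # glob matched nothing
        unit_for_conf "$conf"
    done
}
```

Each printed name can then be passed to systemctl status or systemctl is-active to compare against what puppet reports.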

Snippet from systemctl status:

└─system-ganglia\x2dmonitor\x2daggregator.slice
  ├─ganglia-monitor-aggregator@2050.service
  │ └─479 /usr/sbin/gmond -c /etc/ganglia/aggregators/2050.conf -p /var/run/gmond-2050.pid
  ├─ganglia-monitor-aggregator@2031.service
  │ └─549 /usr/sbin/gmond -c /etc/ganglia/aggregators/2031.conf -p /var/run/gmond-2031.pid
  ├─ganglia-monitor-aggregator@2052.service
  │ └─477 /usr/sbin/gmond -c /etc/ganglia/aggregators/2052.conf -p /var/run/gmond-2052.pid
  ├─ganglia-monitor-aggregator@2012.service
  │ └─534 /usr/sbin/gmond -c /etc/ganglia/aggregators/2012.conf -p /var/run/gmond-2012.pid
  ├─ganglia-monitor-aggregator@2033.service
  │ └─540 /usr/sbin/gmond -c /etc/ganglia/aggregators/2033.conf -p /var/run/gmond-2033.pid
  ├─ganglia-monitor-aggregator@2027.service
  │ └─518 /usr/sbin/gmond -c /etc/ganglia/aggregators/2027.conf -p /var/run/gmond-2027.pid
  ├─ganglia-monitor-aggregator@2020.service
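
The \x2d sequences in the slice name are systemd's escaping of the hyphens in the template name, since a plain "-" in a slice name separates levels of the slice hierarchy. A simplified sketch of how the name is derived (the function is hypothetical and only handles "-", unlike the real systemd-escape tool):

```shell
#!/bin/sh
# Simplified sketch: derive the slice that instances of a template unit are
# grouped under. Only "-" is escaped here; systemd-escape covers more cases.
slice_for_template() {
    # $1 is the template prefix, i.e. the unit name before the "@".
    escaped=$(printf '%s' "$1" | sed 's/-/\\x2d/g')
    printf 'system-%s.slice\n' "$escaped"
}
```

For the prefix ganglia-monitor-aggregator this yields the slice name shown in the status output.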

Status of one aggregator picked at random; each is its own service now:

root@alsafi:/# systemctl status ganglia-monitor-aggregator@2002.service
● ganglia-monitor-aggregator@2002.service - Ganglia monitor aggregator (cluster 2002)
   Loaded: loaded (/etc/systemd/system/ganglia-monitor-aggregator@.service; enabled)
   Active: active (running) since Tue 2016-03-15 18:58:14 UTC; 9min ago
     Docs: https://wikitech.wikimedia.org/wiki/Ganglia
 Main PID: 506 (gmond)
   CGroup: /system.slice/system-ganglia\x2dmonitor\x2daggregator.slice/ganglia-monitor-aggregator@2002.service
           └─506 /usr/sbin/gmond -c /etc/ganglia/aggregators/2002.conf -p /var/run/gmond-2002.pid

Mar 15 18:58:14 alsafi systemd[1]: Started Ganglia monitor aggregator (cluster 2002).

Remaining puppet issue that got overlooked earlier:

Hmm, everything was fine earlier. Then an error popped up about puppet not being able to start ganglia-monitor-service (which is not the aggregator service this ticket is about). I stopped and started it and ran puppet again, and the issue disappeared. Not sure yet, but it looks like the service had just crashed.

Can't reproduce. I can killall -u ganglia and run puppet, and everything comes back normally without issue, multiple times.

alsafi needed to be rebooted today and several of the aggregators failed to start (see "systemctl list-units | grep failed")
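
A narrower variant of that grep, limited to the aggregator instances, might look like the sketch below. The function names are made up, and it assumes a systemd version whose systemctl supports --plain, --no-legend and unit name patterns, so that the unit name is the first column and the ACTIVE state the third.

```shell
#!/bin/sh
# Print only the failed aggregator instance units.
# Reads "systemctl list-units"-style lines on stdin so the filter can be
# tried without systemd. Assumed columns: UNIT LOAD ACTIVE SUB DESCRIPTION.
filter_failed_aggregators() {
    awk '$1 ~ /^ganglia-monitor-aggregator@.*\.service$/ && $3 == "failed" { print $1 }'
}

# On a live host: list all aggregator instances and keep the failed ones.
failed_aggregators() {
    systemctl list-units --all --plain --no-legend 'ganglia-monitor-aggregator@*' \
        | filter_failed_aggregators
}
```

Matching on the ACTIVE column avoids false hits from units that merely have "failed" somewhere in their description.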

alsafi wasn't supposed to have the aggregator anymore. That class was applied to it in the past for testing but then removed. This issue popped up because a file in /etc/systemd/system was not removed when the puppet class was removed. I deleted it manually and puppet runs on alsafi are normal again. It should only have regular ganglia from base, no aggregators.
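
For reference, a manual cleanup of such a leftover template unit could be sketched like this. It is not what was actually run on alsafi: the helper names are hypothetical, the default path is the template unit file shown earlier in this task, and the script only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Hypothetical cleanup sketch. Prints the commands by default and only
# executes them when DRY_RUN=0 is set in the environment.
run() {
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    else
        echo "would run: $*"
    fi
}

cleanup_stale_aggregator_units() {
    unit_file=${1:-/etc/systemd/system/ganglia-monitor-aggregator@.service}
    # Stop any instances that are still loaded (the pattern is matched by systemctl).
    run systemctl stop 'ganglia-monitor-aggregator@*.service'
    # Remove the leftover template and make systemd forget it.
    run rm -f "$unit_file"
    run systemctl daemon-reload
    # Clear the "failed" state left behind by instances that did not start.
    run systemctl reset-failed
}
```

Enablement symlinks for individual instances may also be left behind and would need removing separately; having puppet ensure the file is absent when the class is removed would avoid the manual step altogether.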