
Port Ganglia aggregator setup to systemd
Closed, Resolved; Public

Description

Our Ganglia module is currently upstart-specific and aggregators running on precise/trusty cannot be upgraded to jessie until we port it to systemd.

Event Timeline

faidon created this task. Jan 20 2016, 6:27 PM
faidon raised the priority of this task from to Normal.
faidon updated the task description.
faidon added a project: Operations.
faidon added subscribers: Krenair, Southparkfan, Aklapper and 2 others.

Is https://github.com/wikimedia/operations-puppet/blob/production/modules/ganglia/templates/gmetad.upstart what should be converted to systemd?

I have not tested the following at all (I put it together quickly from the manuals, so it might not work), but hopefully you still want to try it:

[Unit]
Description=Ganglia Metadata Daemon

[Service]
Type=forking
PIDFile=/var/run/gmetad.pid
Environment="PIDFILE=/var/run/gmetad.pid"
Environment="RRDCACHED_ADDRESS=<%= @rrdcached_socket %>"
Restart=on-failure
# systemd only expands a variable inside a word when written as ${VAR}
ExecStart=/usr/sbin/gmetad --pid-file=${PIDFILE}

[Install]
WantedBy=multi-user.target

See also https://github.com/ganglia/monitor-core/blob/master/gmetad/gmetad.service.in. It would be cool if someone could test it!

Dzahn claimed this task. Mar 8 2016, 10:59 PM

Change 277340 had a related patch set uploaded (by Dzahn):
ganglia: on jessie, spawn aggregators with systemd

https://gerrit.wikimedia.org/r/277340

Change 277340 merged by Dzahn:
ganglia: on jessie, spawn aggregators with systemd

https://gerrit.wikimedia.org/r/277340

Change 277347 had a related patch set uploaded (by Dzahn):
ganglia: fix systemd instance service name

https://gerrit.wikimedia.org/r/277347

Change 277347 merged by Dzahn:
ganglia: fix systemd instance service name

https://gerrit.wikimedia.org/r/277347

Dzahn added a comment. Edited Mar 14 2016, 9:14 PM

instances now get spawned by puppet on alsafi:

Notice: /Stage[main]/Ganglia::Monitor::Aggregator/Ganglia::Monitor::Aggregator::Site_instances[codfw]/Ganglia::Monitor::Aggregator::Instance[redis_codfw]/Service[ganglia-monitor-aggregator@2039.service]/ensure: ensure changed 'stopped' to 'running'
Info: /Stage[main]/Ganglia::Monitor::Aggregator/Ganglia::Monitor::Aggregator::Site_instances[codfw]/Ganglia::Monitor::Aggregator::Instance[redis_codfw]/Service[ganglia-monitor-aggregator@2039.service]: Unscheduling refresh on Service[ganglia-monitor-aggregator@2039.service]
Notice: /Stage[main]/Ganglia::Monitor::Aggregator/Ganglia::Monitor::Aggregator::Site_instances[codfw]/Ganglia::Monitor::Aggregator::Instance[rcstream_codfw]/Service[ganglia-monitor-aggregator@2044.service]/ensure: ensure changed 'stopped' to 'running'
Info: /Stage[main]/Ganglia::Monitor::Aggregator/Ganglia::Monitor::Aggregator::Site_instances[codfw]/Ganglia::Monitor::Aggregator::Instance[rcstream_codfw]/Service[ganglia-monitor-aggregator@2044.service]: Unscheduling refresh on Service[ganglia-monitor-aggregator@2044.service]
Notice: /Stage[main]/Ganglia::Monitor::Aggregator/Ganglia::Monitor::Aggregator::Site_instances[codfw]/Ganglia::Monitor::Aggregator::Instance[memcached_codfw]/Service[ganglia-monitor-aggregator@2033.service]/ensure: ensure changed 'stopped' to 'running'
Info: /Stage[main]/Ganglia::Monitor::Aggregator/Ganglia::Monitor::Aggregator::Site_instances[codfw]/Ganglia::Monitor::Aggregator::Instance[memcached_codfw]/Service[ganglia-monitor-aggregator@2033.service]: Unscheduling refresh on Service[ganglia-monitor-aggregator@2033.service]
...

..

ganglia   3941  0.0  0.3  48724  3236 ?        Ssl  21:13   0:00 /usr/sbin/gmond -c /etc/ganglia/aggregators/2048.conf -p /var/run/gmond-2048.pid
ganglia   4059  0.0  0.3  48724  3240 ?        Ssl  21:13   0:00 /usr/sbin/gmond -c /etc/ganglia/aggregators/2022.conf -p /var/run/gmond-2022.pid
ganglia   4151  0.0  0.3  48724  3260 ?        Ssl  21:13   0:00 /usr/sbin/gmond -c /etc/ganglia/aggregators/2040.conf -p /var/run/gmond-2040.pid

Change 277354 had a related patch set uploaded (by Dzahn):
ganglia: do not start meta-service on jessie/systemd

https://gerrit.wikimedia.org/r/277354

Change 277451 had a related patch set uploaded (by Dzahn):
ganglia: don't install old init script if systemd is used

https://gerrit.wikimedia.org/r/277451

Change 277451 merged by Dzahn:
ganglia: don't install old init scripts if systemd is used

https://gerrit.wikimedia.org/r/277451

Change 277455 had a related patch set uploaded (by Dzahn):
ganglia: no dependency for old upstart service on systemd

https://gerrit.wikimedia.org/r/277455

Change 277455 merged by Dzahn:
ganglia: no dependency for old upstart service on systemd

https://gerrit.wikimedia.org/r/277455

Change 277354 merged by Dzahn:
ganglia: do not start meta-service on jessie/systemd

https://gerrit.wikimedia.org/r/277354

Change 277458 had a related patch set uploaded (by Dzahn):
ganglia: fix me - service notify systemd (WIP)

https://gerrit.wikimedia.org/r/277458

Change 277458 merged by Dzahn:
ganglia: fix up for aggregator service on systemd

https://gerrit.wikimedia.org/r/277458

Dzahn added a comment. Mar 15 2016, 7:02 PM

This works now. You can see it on alsafi.

I can run killall -u ganglia, then run puppet, and puppet starts all the services:

ganglia    459  0.0  0.3  48724  3164 ?        Ssl  18:58   0:00 /usr/sbin/gmond -c /etc/ganglia/aggregators/2020.conf -p /var/run/gmond-2020.pid
ganglia    460  0.0  0.2  48724  3000 ?        Ssl  18:58   0:00 /usr/sbin/gmond -c /etc/ganglia/aggregators/2005.conf -p /var/run/gmond-2005.pid
ganglia    471  0.0  0.3  48724  3196 ?        Ssl  18:58   0:00 /usr/sbin/gmond -c /etc/ganglia/aggregators/2034.conf -p /var/run/gmond-2034.pid
ganglia    477  0.0  0.3  48724  3196 ?        Ssl  18:58   0:00 /usr/sbin/gmond -c /etc/ganglia/aggregators/2052.conf -p /var/run/gmond-2052.pid
ganglia    479  0.0  0.3  48724  3156 ?        Ssl  18:58   0:00 /usr/sbin/gmond -c /etc/ganglia/aggregators/2050.conf -p /var/run/gmond-2050.pid
ganglia    485  0.0  0.3  48724  3196 ?        Ssl  18:58   0:00 /usr/sbin/gmond -c /etc/ganglia/aggregators/2048.conf -p /var/run/gmond-2048.pid
ganglia    493  0.0  0.3  48724  3164 ?        Ssl  18:58   0:00 /usr/sbin/gmond -c /etc/ganglia/aggregators/2011.conf -p /var/run/gmond-2011.pid
ganglia    495  0.0  0.3  48724  3068 ?        Ssl  18:58   0:00 /usr/sbin/gmond -c /etc/ganglia/aggregators/2040.conf -p /var/run/gmond-2040.pid
ganglia    504  0.0  0.3  48724  3288 ?        Ssl  18:58   0:00 /usr/sbin/gmond -c /etc/ganglia/aggregators/2022.conf -p /var/run/gmond-2022.pid
ganglia    506  0.0  0.2  48724  2980 ?        Ssl  18:58   0:00 /usr/sbin/gmond -c /etc/ganglia/aggregators/2002.conf -p /var/run/gmond-2002.pid
ganglia    513  0.0  0.2  48724  3024 ?        Ssl  18:58   0:00 /usr/sbin/gmond -c /etc/ganglia/aggregators/2051.conf -p /var/run/gmond-2051.pid
ganglia    518  0.0  0.3  48724  3220 ?        Ssl  18:58   0:00 /usr/sbin/gmond -c /etc/ganglia/aggregators/2027.conf -p /var/run/gmond-2027.pid
ganglia    522  0.0  0.3  48724  3288 ?        Ssl  18:58   0:00 /usr/sbin/gmond -c /etc/ganglia/aggregators/2008.conf -p /var/run/gmond-2008.pid
ganglia    523  0.0  0.3  48724  3164 ?        Ssl  18:58   0:00 /usr/sbin/gmond -c /etc/ganglia/aggregators/2013.conf -p /var/run/gmond-2013.pid
ganglia    534  0.0  0.3  48724  3212 ?        Ssl  18:58   0:00 /usr/sbin/gmond -c /etc/ganglia/aggregators/2012.conf -p /var/run/gmond-2012.pid
ganglia    536  0.0  0.3  48724  3140 ?        Ssl  18:58   0:00 /usr/sbin/gmond -c /etc/ganglia/aggregators/2055.conf -p /var/run/gmond-2055.pid
ganglia    540  0.0  0.3  48724  3168 ?        Ssl  18:58   0:00 /usr/sbin/gmond -c /etc/ganglia/aggregators/2033.conf -p /var/run/gmond-2033.pid
ganglia    544  0.0  0.3  48724  3196 ?        Ssl  18:58   0:00 /usr/sbin/gmond -c /etc/ganglia/aggregators/2039.conf -p /var/run/gmond-2039.pid
ganglia    549  0.0  0.3  48724  3200 ?        Ssl  18:58   0:00 /usr/sbin/gmond -c /etc/ganglia/aggregators/2031.conf -p /var/run/gmond-2031.pid

I also tested rebooting the server, and the services come back fine afterwards.
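
One way to double-check this after a puppet run or a reboot is to derive the expected unit names from the aggregator config files and ask systemd about each of them. The sketch below assumes there is one <id>.conf per aggregator under /etc/ganglia/aggregators, as the process list above suggests; the function name units_from_configs is made up for illustration and is not part of the puppet module.

```shell
#!/bin/sh
# Hypothetical helper: print the systemd unit name expected for every
# aggregator config file found in a directory.
# Assumption: each aggregator has a config file named <id>.conf.
units_from_configs() {
    dir="${1:-/etc/ganglia/aggregators}"
    for conf in "$dir"/*.conf; do
        # Skip the unexpanded glob pattern when no config files exist.
        [ -e "$conf" ] || continue
        id=$(basename "$conf" .conf)
        printf 'ganglia-monitor-aggregator@%s.service\n' "$id"
    done
}
```

The resulting list could then be handed to systemctl, for example `systemctl is-active $(units_from_configs)`, to see the state of each instance.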

Dzahn closed this task as Resolved. Mar 15 2016, 7:09 PM

template unit file from which instances are spawned:

root@alsafi:/# cat /etc/systemd/system/ganglia-monitor-aggregator\@.service 
[Unit]
Description=Ganglia monitor aggregator (cluster %I)
Documentation=https://wikitech.wikimedia.org/wiki/Ganglia
After=network.target

[Service]
Type=simple
ExecStart=/usr/sbin/gmond -c /etc/ganglia/aggregators/%i.conf -p /var/run/gmond-%i.pid
SyslogIdentifier=ganglia-monitor-aggregator-%i

[Install]
WantedBy=multi-user.target
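
As a rough illustration of how the %i specifier in this template turns into a concrete command per instance, the shell sketch below substitutes an instance name into the ExecStart line. systemd performs this expansion itself when an instance such as ganglia-monitor-aggregator@2002.service is started; the function name render_exec_start is hypothetical and only mimics the behaviour for simple numeric names like the ones used here.

```shell
#!/bin/sh
# Hypothetical helper: mimic how systemd expands %i in the template's
# ExecStart line for a given instance name.
# Assumption: the instance name is a plain id (no '/' or '&'), so it can
# be used directly in the sed replacement.
render_exec_start() {
    instance="$1"
    template='/usr/sbin/gmond -c /etc/ganglia/aggregators/%i.conf -p /var/run/gmond-%i.pid'
    printf '%s\n' "$template" | sed "s/%i/${instance}/g"
}

render_exec_start 2002
```

For instance 2002 this prints the same gmond command line that appears in the process list for that aggregator.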

puppet installs one of them for each instance, in instance.pp:

if $::initsystem == 'systemd' {
    service { "ganglia-monitor-aggregator@${id}.service":
        ensure   => running,
        provider => systemd,
        enable   => true,

snippet from systemctl status

└─system-ganglia\x2dmonitor\x2daggregator.slice
  ├─ganglia-monitor-aggregator@2050.service
  │ └─479 /usr/sbin/gmond -c /etc/ganglia/aggregators/2050.conf -p /var/run/gmond-2050.pid
  ├─ganglia-monitor-aggregator@2031.service
  │ └─549 /usr/sbin/gmond -c /etc/ganglia/aggregators/2031.conf -p /var/run/gmond-2031.pid
  ├─ganglia-monitor-aggregator@2052.service
  │ └─477 /usr/sbin/gmond -c /etc/ganglia/aggregators/2052.conf -p /var/run/gmond-2052.pid
  ├─ganglia-monitor-aggregator@2012.service
  │ └─534 /usr/sbin/gmond -c /etc/ganglia/aggregators/2012.conf -p /var/run/gmond-2012.pid
  ├─ganglia-monitor-aggregator@2033.service
  │ └─540 /usr/sbin/gmond -c /etc/ganglia/aggregators/2033.conf -p /var/run/gmond-2033.pid
  ├─ganglia-monitor-aggregator@2027.service
  │ └─518 /usr/sbin/gmond -c /etc/ganglia/aggregators/2027.conf -p /var/run/gmond-2027.pid
  ├─ganglia-monitor-aggregator@2020.service

status of one random aggregator; each is its own service now:

root@alsafi:/# systemctl status ganglia-monitor-aggregator@2002.service
● ganglia-monitor-aggregator@2002.service - Ganglia monitor aggregator (cluster 2002)
   Loaded: loaded (/etc/systemd/system/ganglia-monitor-aggregator@.service; enabled)
   Active: active (running) since Tue 2016-03-15 18:58:14 UTC; 9min ago
     Docs: https://wikitech.wikimedia.org/wiki/Ganglia
 Main PID: 506 (gmond)
   CGroup: /system.slice/system-ganglia\x2dmonitor\x2daggregator.slice/ganglia-monitor-aggregator@2002.service
           └─506 /usr/sbin/gmond -c /etc/ganglia/aggregators/2002.conf -p /var/run/gmond-2002.pid

Mar 15 18:58:14 alsafi systemd[1]: Started Ganglia monitor aggregator (cluster 2002).

Dzahn set Security to None.
Dzahn reopened this task as Open. Mar 15 2016, 10:26 PM

remaining puppet issue that got overlooked earlier

ehmm.. everything was fine earlier. then an error popped up about puppet not being able to start ganglia-monitor-service (which is not the aggregator service this whole ticket is about). i stopped and started it and ran puppet again, and the issue disappeared. not sure yet, but it looks like the service just crashed

Dzahn closed this task as Resolved. Mar 15 2016, 10:43 PM

can't reproduce. i can run killall -u ganglia and then puppet, and everything comes back normally without issue. tried multiple times

alsafi needed to be rebooted today and several of the aggregators failed to start (see "systemctl list-units | grep failed")

Dzahn added a comment. Mar 29 2016, 7:57 PM

alsafi wasn't supposed to have the aggregator anymore. that class was applied on it in the past for testing but then removed. this issue popped up because a file in /etc/systemd/system was not removed when the puppet class was removed. i deleted it manually and puppet runs on alsafi are normal again. it should only have regular ganglia from base, no aggregators.
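
For future cases like this, a manual cleanup might look roughly like the sketch below. The comment does not say which file was left behind, so this assumes it was the template unit shown earlier in this task; the function name cleanup_commands and the instance ids passed to it are illustrative only. The function prints the commands instead of running them, so the list can be reviewed first.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) the commands to remove leftover
# aggregator instances from a host that should no longer have them.
# Assumption: the stale file is the template unit in /etc/systemd/system;
# the original comment does not name the file.
cleanup_commands() {
    for id in "$@"; do
        echo "systemctl stop ganglia-monitor-aggregator@${id}.service"
        echo "systemctl disable ganglia-monitor-aggregator@${id}.service"
    done
    echo "rm /etc/systemd/system/ganglia-monitor-aggregator@.service"
    # Make systemd re-read its unit files and forget old failed states.
    echo "systemctl daemon-reload"
    echo "systemctl reset-failed"
}
```

Calling `cleanup_commands 2002 2005` would print the stop and disable commands for those two instances, followed by the file removal and the reload steps.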