
Move admin cron jobs to systemd timers
Closed, Resolved (Public)

Description

The backup cron is now a systemd timer, which reduces unread email spam and improves monitoring (a serious alert by icinga still needs to be tested, but in theory it should catch a systemd timer that goes bad and exits non-zero). It seems like we should move the other crons to timers as well. The backup timer from the "secondary" file server role is the best example of one working in puppet.

A good example of where I think we'd benefit is the issues with emails coming out from the keystone jobs.

  • profile::base::labs
  • profile::wmcs::tenants::libraryupgrader
  • profile::wmcs::monitoring
  • profile::toolforge::clush::master
  • role::prometheus::labs_project
  • role::prometheus::tools
  • role::labs::db::check_private_data
  • role::labs::nfs::secondary
  • role::openldap::labs
  • dnsrecursor::labsaliaser
  • labstore::fileserver::exports
  • graphite::wmcs::archiver
  • openstack::keystone::cleanup
  • openstack::glance::image_sync
  • openstack::wikitech::web
  • openstack::designate::dns_floating_ip_updater
  • openstack::puppet::master::instance_info_dumper

Event Timeline

systemd timers look like a good replacement for cron, if a bit more complex to set up (and ignoring systemd annoyances, since almost all Linux distros have decided to live with it). I like that stdout/stderr are captured and sent to journald, and that jobs become easier to monitor.

Would this be a good rough estimation of the number of places we would have to touch with this change? I'm surprised by the low number and feel like I'm missing something obvious.

$ find . -name '*.pp' | grep -E '(openstack|lab|wmcs|cloud|tool)' | xargs -I% grep -EH 'cron.*\{' % | wc -l
42
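
For a per-file breakdown rather than a single total, something like the sketch below may help when deciding which manifests to touch first. It reuses the same path filter and an equivalent pattern to the command above. The function name `list_cron_manifests` is made up for illustration, `xargs -r` assumes GNU xargs, and paths containing whitespace are not handled.

```shell
#!/bin/sh
# Hypothetical helper: print "path:count" for every Puppet manifest under
# a directory whose path looks cloud-related and which contains at least
# one line matching the cron pattern.
list_cron_manifests() {
    dir="${1:-.}"
    find "$dir" -name '*.pp' \
        | grep -E '(openstack|lab|wmcs|cloud|tool)' \
        | xargs -r grep -cE 'cron.*\{' /dev/null \
        | grep -v ':0$'
    # /dev/null is passed as an extra file so grep always prints
    # "file:count", even when only one manifest matches the path filter.
}

# Example, run from the root of a puppet checkout:
#   list_cron_manifests . | sort -t: -k2,2nr
```

The counts are matching lines, not parsed resources, so they are only a rough estimate, just like the total above.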

Since I'm just thinking of crons we puppetize (and it's pretty easy to do timers using puppet now), that's probably it. We've tended to move things into services when they would have been a cron.

Throwing this in the discussion column, though I don't think it will be very controversial as a background activity.

GTirloni triaged this task as Medium priority.

Change 489393 had a related patch set uploaded (by GTirloni; owner: GTirloni):
[operations/puppet@production] toolforge::clush::master - Convert cronjob to systemd timer

https://gerrit.wikimedia.org/r/489393

Change 489394 had a related patch set uploaded (by GTirloni; owner: GTirloni):
[operations/puppet@production] wmcs::monitoring - Convert cronjob to systemd timer

https://gerrit.wikimedia.org/r/489394

Change 489393 merged by GTirloni:
[operations/puppet@production] toolforge::clush::master - Convert cronjob to systemd timer

https://gerrit.wikimedia.org/r/489393

Change 490052 had a related patch set uploaded (by GTirloni; owner: GTirloni):
[operations/puppet@production] toolforge::clush::master - Fix systemd timer definition

https://gerrit.wikimedia.org/r/490052

Change 490052 merged by GTirloni:
[operations/puppet@production] toolforge::clush::master - Fix systemd timer definition

https://gerrit.wikimedia.org/r/490052

Change 490056 had a related patch set uploaded (by GTirloni; owner: GTirloni):
[operations/puppet@production] toolforge::clush::master - Fix typo

https://gerrit.wikimedia.org/r/490056

Change 490056 merged by GTirloni:
[operations/puppet@production] toolforge::clush::master - Fix typo

https://gerrit.wikimedia.org/r/490056

cron { 'update_tools_clush':
    ensure  => absent,
}

systemd::timer::job { 'toolfoge_clush_update':
    ensure                    => present,
    description               => 'Update list of Toolforge servers for clush',
    command                   => "/usr/local/sbin/tools-clush-generator /etc/clustershell/tools.yaml --observer-pass ${observer_pass}",
    interval                  => {
        'start'    => 'OnCalendar',
        'interval' => '*-*-* *:00:00', # hourly
    },
    logging_enabled           => false,
    monitoring_enabled        => true,
    monitoring_contact_groups => 'wmcs-team',
    user                      => 'root',
}
# systemctl status toolfoge_clush_update.timer --no-pager
● toolfoge_clush_update.timer - Periodic execution of toolfoge_clush_update.service
   Loaded: loaded (/lib/systemd/system/toolfoge_clush_update.timer; enabled; vendor preset: enabled)
   Active: active (waiting) since Tue 2019-02-12 13:49:51 UTC; 25min ago

Feb 12 13:49:51 tools-clushmaster-02 systemd[1]: Started Periodic execution of toolfoge_clush_update.service.

# systemctl status toolfoge_clush_update.service --no-pager
● toolfoge_clush_update.service - Update list of Toolforge servers for clush
   Loaded: loaded (/lib/systemd/system/toolfoge_clush_update.service; static; vendor preset: enabled)
   Active: inactive (dead) since Tue 2019-02-12 14:00:05 UTC; 15min ago
  Process: 17932 ExecStart=/usr/local/sbin/tools-clush-generator /etc/clustershell/tools.yaml --observer-pass Fs6Dq2RtG8KwmM2Z (code=exited, status=0/SUCCESS)
 Main PID: 17932 (code=exited, status=0/SUCCESS)

Feb 12 14:00:01 tools-clushmaster-02 systemd[1]: Started Update list of Toolforge servers for clush.

# ls -l /etc/clustershell/tools.yaml
-rw-r--r-- 1 root root 26977 Feb 12 14:00 /etc/clustershell/tools.yaml
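
The same check can be scripted. The sketch below is only an illustration and not how the icinga check is implemented: it parses the `Key=Value` lines that `systemctl show` prints and succeeds only when the last run of a unit finished cleanly. The helper name `last_run_ok` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: reads "Key=Value" lines on stdin, as printed by
#   systemctl show -p Result -p ExecMainStatus UNIT
# and exits 0 only if Result is "success" and the main process exited 0.
last_run_ok() {
    awk -F= '
        $1 == "Result"         { result = $2 }
        $1 == "ExecMainStatus" { status = $2 }
        END { exit !(result == "success" && status == "0") }
    '
}

# Example, on the host that runs the timer:
#   systemctl show -p Result -p ExecMainStatus toolfoge_clush_update.service \
#       | last_run_ok && echo "last run OK"
```

A missing property is treated as a failure, which seems the safer default for a job that is supposed to be monitored.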

Change 490112 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] labstore: convert our first systemd timer to the new format

https://gerrit.wikimedia.org/r/490112

Change 489394 merged by GTirloni:
[operations/puppet@production] wmcs::monitoring - Convert cronjob to systemd timer

https://gerrit.wikimedia.org/r/489394

Change 490137 had a related patch set uploaded (by GTirloni; owner: GTirloni):
[operations/puppet@production] wmcs::monitoring - Fix typo

https://gerrit.wikimedia.org/r/490137

Change 490137 merged by GTirloni:
[operations/puppet@production] wmcs::monitoring - Fix typo

https://gerrit.wikimedia.org/r/490137

Change 490197 had a related patch set uploaded (by GTirloni; owner: GTirloni):
[operations/puppet@production] openstack - Convert cron jobs to systemd timers

https://gerrit.wikimedia.org/r/490197

Mentioned in SAL (#wikimedia-operations) [2019-03-06T18:04:46Z] <bstorm_> disabled puppet and downtimed labstore2004 while deploying a change for T210818

Change 490112 merged by Bstorm:
[operations/puppet@production] labstore: convert our first systemd timer to the new format

https://gerrit.wikimedia.org/r/490112

Mentioned in SAL (#wikimedia-operations) [2019-03-06T18:08:52Z] <bstorm_> re-enabled puppet after observing the change works well on the partner for labstore2004 and T210818

Change 490197 had a related patch set uploaded (by GTirloni; owner: GTirloni):
[operations/puppet@production] openstack - Convert cron jobs to systemd timers

https://gerrit.wikimedia.org/r/490197

Change 490197 merged by GTirloni:
[operations/puppet@production] openstack - Convert cron jobs to systemd timers

https://gerrit.wikimedia.org/r/490197

Mentioned in SAL (#wikimedia-operations) [2019-03-21T13:18:00Z] <gtirloni> downtimed cloudcontrol*, cloudservices*, labcontrol*, labweb* (T210818)

Change 498085 had a related patch set uploaded (by GTirloni; owner: GTirloni):
[operations/puppet@production] openstack - Fix errors in timers definitions

https://gerrit.wikimedia.org/r/498085

Change 498085 merged by GTirloni:
[operations/puppet@production] openstack - Fix errors in timers definitions

https://gerrit.wikimedia.org/r/498085

Mentioned in SAL (#wikimedia-cloud) [2019-03-21T13:49:10Z] <gtirloni> converted openstack cronjobs to systemd timers (T210818)

Change 498141 had a related patch set uploaded (by GTirloni; owner: GTirloni):
[operations/puppet@production] profile::base::labs - Convert cronjobs to systemd timers

https://gerrit.wikimedia.org/r/498141

Change 498193 had a related patch set uploaded (by GTirloni; owner: GTirloni):
[operations/puppet@production] openstack::glance::image_sync - Fix systemd timer user

https://gerrit.wikimedia.org/r/498193

Change 498193 merged by GTirloni:
[operations/puppet@production] openstack::glance::image_sync - Fix systemd timer user

https://gerrit.wikimedia.org/r/498193

Change 498199 had a related patch set uploaded (by GTirloni; owner: GTirloni):
[operations/puppet@production] openstack::keystone::cleanup - Do not hide keystone-manage token_flush output

https://gerrit.wikimedia.org/r/498199

Change 498199 merged by GTirloni:
[operations/puppet@production] openstack::keystone::cleanup - Do not hide keystone-manage token_flush output

https://gerrit.wikimedia.org/r/498199

Mentioned in SAL (#wikimedia-operations) [2019-03-21T23:53:49Z] <gtirloni> downtimed systemd check in labwen1001 (T210818)

Change 498141 merged by GTirloni:
[operations/puppet@production] profile::base::labs - Convert cronjobs to systemd timers

https://gerrit.wikimedia.org/r/498141

Change 498358 had a related patch set uploaded (by GTirloni; owner: GTirloni):
[operations/puppet@production] profile::base::labs - Fix timer definition

https://gerrit.wikimedia.org/r/498358

Change 498358 merged by GTirloni:
[operations/puppet@production] profile::base::labs - Fix timer definition

https://gerrit.wikimedia.org/r/498358

Change 600928 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] wmflib: add systemd.timer onCalendar support to cron_splay

https://gerrit.wikimedia.org/r/600928

Change 600928 merged by Cwhite:
[operations/puppet@production] wmflib: add systemd.timer OnCalendar support to cron_splay

https://gerrit.wikimedia.org/r/600928