Page MenuHomePhabricator

systemd.timer not executing on cumin2001 after command was modified
Closed, ResolvedPublic

Description

cumin2001 systemd.timer for regular_backups no longer runs. As reported here: T255132#6650632 It failed once, on the day the puppet code was changed: https://gerrit.wikimedia.org/r/c/operations/puppet/+/643223/4/modules/profile/manifests/mariadb/backup/transfer.pp but it has not been executed (or attempted to execute, according to logs) since then -no regular backup (snapshots) on codfw since Tuesday.

Nov 25 16:41:48 cumin2001 systemd[1]: regular_snapshot.service: Current command vanished from the unit file, execution of the command list won't be resumed.

Funnily, the same change, applied to cumin1001 created no issue. Something special about cumin2001?

Today I disabled and reenabled the timer manually, to see if it helps. I had done multiple systemctl daemon-reload s before. Is there a way to reset all of systemd without restarting?

A systemd.timer that fails to execute and silently refuses to run without any errors is a big blocker to us (backup taking team). Luckly, we have monitoring based on the outcome of the commands, but that may not be the case for other commands. Either systemd.timers have to be as reliable to execute as crons, or puppet code needs changes to handle commands changing (and race conditions?).

Event Timeline

jcrespo triaged this task as Unbreak Now! priority.Nov 30 2020, 9:31 AM
jcrespo updated the task description. (Show Details)
jcrespo updated the task description. (Show Details)
Volans added a subscriber: Volans.

[removing the SRE-tools tag as this is not specific to SRE-tools but a general Puppet/Systemd issue]

Change 644197 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] P:mariadb::backup::transfer: use full path in systemd unit

https://gerrit.wikimedia.org/r/644197

Is there a different systemd version on cumin1001 vs cumin2001 (or active version, e.g. time since reboot)? That would explain why cumin1001 code keeps working.

Change 644197 merged by Jcrespo:
[operations/puppet@production] P:mariadb::backup::transfer: use full path in systemd unit

https://gerrit.wikimedia.org/r/644197

jcrespo lowered the priority of this task from Unbreak Now! to High.Nov 30 2020, 11:58 AM

Lowering priority, will wait for:

  • Checking next scheduled snapshot runs normally (I will take care of this)
  • Followup, if needed: Document the issue on wikitech/puppet code, search for examples where this could be happening, etc.

Is there a different systemd version on cumin1001 vs cumin2001 (or active version, e.g. time since reboot)? That would explain why cumin1001 code keeps working.

systemctl --version returns systemd 241 (241) on both, so that should not be an issue.

jcrespo raised the priority of this task from High to Unbreak Now!.EditedDec 2 2020, 8:42 AM

The previous patch did not have an effect- backups continue to be executed on eqiad, but codfw ones (cumin2001) do not run at all- they are not even attempted, according to journalctl.

Is there a way to "kick"/reload a timer (daemon-reload seems to have no effect).

I am not anti-systemd timer, but for practical reasons (not having to run the backups manually) could I revert to a cron until this is sorted out? Manual running backups is a huge toll to me, and it could also provide some sanity check it a systemd issue.

The previous patch did not have an effect- backups continue to be executed on eqiad, but codfw ones (cumin2001) do not run at all- they are not even attempted, according to journalctl.

Is there a way to "kick"/reload a timer (daemon-reload seems to have no effect).

I have just manuly restarted the service and it looks lie it is running now, i see the following in the logs

Dec  2 08:49:35 cumin2001 remote-backup-mariadb[17846]: [08:49:35]: INFO - Create a new empty directory at dbprov2001.codfw.wmnet
Dec  2 08:49:35 cumin2001 remote-backup-mariadb[17846]: [08:49:35]: INFO - Create a new empty directory at dbprov2002.codfw.wmnet
Dec  2 08:49:35 cumin2001 remote-backup-mariadb[17846]: [08:49:35]: INFO - Create a new empty directory at dbprov2003.codfw.wmnet
Dec  2 08:49:35 cumin2001 remote-backup-mariadb[17846]: [08:49:35]: INFO - Executing commands [cumin.transports.Command('/bin/mkdir /srv/backups/snapshots/ongoing/snapshot.x1.2020-12-02--08-49-35')] on '1' hosts: dbprov2003.codfw.wmnet
Dec  2 08:49:35 cumin2001 remote-backup-mariadb[17846]: [08:49:35]: INFO - Executing commands [cumin.transports.Command('/bin/mkdir /srv/backups/snapshots/ongoing/snapshot.s8.2020-12-02--08-49-35')] on '1' hosts: dbprov2001.codfw.wmnet
Dec  2 08:49:35 cumin2001 remote-backup-mariadb[17846]: [08:49:35]: INFO - Executing commands [cumin.transports.Command('/bin/mkdir /srv/backups/snapshots/ongoing/snapshot.s1.2020-12-02--08-49-35')] on '1' hosts: dbprov2002.codfw.wmnet
Dec  2 08:49:35 cumin2001 remote-backup-mariadb[17846]: [08:49:35]: INFO - Completed command '/bin/mkdir /srv/backups/snapshots/ongoing/snapshot.s1.2020-12-02--08-49-35'
Dec  2 08:49:35 cumin2001 remote-backup-mariadb[17846]: [08:49:35]: INFO - Completed command '/bin/mkdir /srv/backups/snapshots/ongoing/snapshot.s8.2020-12-02--08-49-35'
Dec  2 08:49:35 cumin2001 remote-backup-mariadb[17846]: [08:49:35]: INFO - 100.0% (1/1) success ratio (>= 100.0% threshold) for command: '/bin/mkdir /srv/...-12-02--08-49-35'.
Dec  2 08:49:35 cumin2001 remote-backup-mariadb[17846]: [08:49:35]: INFO - 100.0% (1/1) success ratio (>= 100.0% threshold) for command: '/bin/mkdir /srv/...-12-02--08-49-35'.
Dec  2 08:49:35 cumin2001 remote-backup-mariadb[17846]: [08:49:35]: INFO - 100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Dec  2 08:49:35 cumin2001 remote-backup-mariadb[17846]: [08:49:35]: INFO - 100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Dec  2 08:49:35 cumin2001 remote-backup-mariadb[17846]: [08:49:35]: INFO - Running XtraBackup at db2100.codfw.wmnet:3318 and sending it to dbprov2001.codfw.wmnet
Dec  2 08:49:35 cumin2001 remote-backup-mariadb[17846]: [08:49:35]: INFO - Running XtraBackup at db2097.codfw.wmnet:3311 and sending it to dbprov2002.codfw.wmnet
Dec  2 08:49:35 cumin2001 remote-backup-mariadb[17846]: [08:49:35]: INFO - Completed command '/bin/mkdir /srv/backups/snapshots/ongoing/snapshot.x1.2020-12-02--08-49-35'
Dec  2 08:49:35 cumin2001 remote-backup-mariadb[17846]: [08:49:35]: INFO - 100.0% (1/1) success ratio (>= 100.0% threshold) for command: '/bin/mkdir /srv/...-12-02--08-49-35'.
Dec  2 08:49:35 cumin2001 remote-backup-mariadb[17846]: [08:49:35]: INFO - 100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Dec  2 08:49:35 cumin2001 remote-backup-mariadb[17846]: [08:49:35]: INFO - Running XtraBackup at db2101.codfw.wmnet:3320 and sending it to dbprov2003.codfw.wmnet
Dec  2 08:49:54 cumin2001 remote-backup-mariadb[18091]: [08:49:54]: INFO - Create a new empty directory at dbprov2001.codfw.wmnet
Dec  2 08:49:54 cumin2001 remote-backup-mariadb[18091]: [08:49:54]: INFO - Create a new empty directory at dbprov2002.codfw.wmnet
Dec  2 08:49:54 cumin2001 remote-backup-mariadb[18091]: [08:49:54]: INFO - Create a new empty directory at dbprov2003.codfw.wmnet
Dec  2 08:49:55 cumin2001 remote-backup-mariadb[18091]: [08:49:55]: INFO - Executing commands [cumin.transports.Command('/bin/mkdir /srv/backups/snapshots/ongoing/snapshot.s8.2020-12-02--08-49-54')] on '1' hosts: dbprov2001.codfw.wmnet
Dec  2 08:49:55 cumin2001 remote-backup-mariadb[18091]: [08:49:55]: INFO - Executing commands [cumin.transports.Command('/bin/mkdir /srv/backups/snapshots/ongoing/snapshot.s1.2020-12-02--08-49-54')] on '1' hosts: dbprov2002.codfw.wmnet
Dec  2 08:49:55 cumin2001 remote-backup-mariadb[18091]: [08:49:55]: INFO - Executing commands [cumin.transports.Command('/bin/mkdir /srv/backups/snapshots/ongoing/snapshot.x1.2020-12-02--08-49-54')] on '1' hosts: dbprov2003.codfw.wmnet
Dec  2 08:49:55 cumin2001 remote-backup-mariadb[18091]: [08:49:55]: INFO - Completed command '/bin/mkdir /srv/backups/snapshots/ongoing/snapshot.s8.2020-12-02--08-49-54'
Dec  2 08:49:55 cumin2001 remote-backup-mariadb[18091]: [08:49:55]: INFO - 100.0% (1/1) success ratio (>= 100.0% threshold) for command: '/bin/mkdir /srv/...-12-02--08-49-54'.
Dec  2 08:49:55 cumin2001 remote-backup-mariadb[18091]: [08:49:55]: INFO - 100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Dec  2 08:49:55 cumin2001 remote-backup-mariadb[18091]: [08:49:55]: INFO - Running XtraBackup at db2100.codfw.wmnet:3318 and sending it to dbprov2001.codfw.wmnet
Dec  2 08:49:55 cumin2001 remote-backup-mariadb[18091]: [08:49:55]: INFO - Completed command '/bin/mkdir /srv/backups/snapshots/ongoing/snapshot.s1.2020-12-02--08-49-54'
Dec  2 08:49:55 cumin2001 remote-backup-mariadb[18091]: [08:49:55]: INFO - 100.0% (1/1) success ratio (>= 100.0% threshold) for command: '/bin/mkdir /srv/...-12-02--08-49-54'.
Dec  2 08:49:55 cumin2001 remote-backup-mariadb[18091]: [08:49:55]: INFO - 100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Dec  2 08:49:55 cumin2001 remote-backup-mariadb[18091]: [08:49:55]: INFO - Running XtraBackup at db2097.codfw.wmnet:3311 and sending it to dbprov2002.codfw.wmnet
Dec  2 08:49:55 cumin2001 remote-backup-mariadb[18091]: [08:49:55]: INFO - Completed command '/bin/mkdir /srv/backups/snapshots/ongoing/snapshot.x1.2020-12-02--08-49-54'
Dec  2 08:49:55 cumin2001 remote-backup-mariadb[18091]: [08:49:55]: INFO - 100.0% (1/1) success ratio (>= 100.0% threshold) for command: '/bin/mkdir /srv/...-12-02--08-49-54'.
Dec  2 08:49:55 cumin2001 remote-backup-mariadb[18091]: [08:49:55]: INFO - 100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Dec  2 08:49:55 cumin2001 remote-backup-mariadb[18091]: [08:49:55]: INFO - Running XtraBackup at db2101.codfw.wmnet:3320 and sending it to dbprov2003.codfw.wmnet

however for some reason the timer is not listed
scratch this looking for wrong job

Wed 2020-12-02 19:00:00 UTC  10h left      Wed 2020-12-02 08:49:34 UTC  10min ago    regular_snapshot.timer                           regular_snapshot.service

Another data point: The systemd::timer::job for clear_gerrit_logs also changed it's command and it worked fine there (both gerrit1001/2001)

Change 644773 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] keyholder: fix initscript

https://gerrit.wikimedia.org/r/644773

Change 644773 merged by Jbond:
[operations/puppet@production] keyholder: fix initscript

https://gerrit.wikimedia.org/r/644773

Change 644784 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] systemd: add validate command for unit files

https://gerrit.wikimedia.org/r/644784

I just got a new behaviour: yesterday didn't work on cumin1001 either (this is new), after adding the full path.

The timer there says:

OnCalendar=Sun,Tue,Wed,Fri *-*-* 19:00:00

Last run is from 29-30, after it had failed on cumin2001. I saw no "vanished" command there, though.
But no run happened yesterday. I was fooled by some manual runs I did.

Could it be the underscore? https://github.com/coreos/fleet/issues/579

I saw some people mentioning issues with systemd and underscore names in units.

Change 644801 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb-backups: Rename snapshoting systemd unit to avoid underscores

https://gerrit.wikimedia.org/r/644801

Change 644801 merged by Jcrespo:
[operations/puppet@production] mariadb-backups: Rename snapshoting systemd unit to avoid underscores

https://gerrit.wikimedia.org/r/644801

Change 644851 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb-backups: Rename snapshoting systemd unit to avoid underscores

https://gerrit.wikimedia.org/r/644851

Change 644851 merged by Jcrespo:
[operations/puppet@production] mariadb-backups: Rename snapshoting systemd unit to avoid underscores

https://gerrit.wikimedia.org/r/644851

Change 644784 merged by Jbond:
[operations/puppet@production] systemd: add validate command for unit files

https://gerrit.wikimedia.org/r/644784

This looks promising enough:

Active: active (waiting)
Started Periodic execution of database-backups-snapshots.

Change 644784 merged by Jbond:
[operations/puppet@production] systemd: add validate command for unit files

https://gerrit.wikimedia.org/r/644784

@jbond I got some strange failures today:

Notice: /Stage[main]/Profile::Analytics::Refinery::Job::Druid_load/Profile::Analytics::Refinery::Job::Eventlogging_to_druid_job[editattemptstep]/Profile::Analytics::Refinery::Job::Spark_job[eventlogging_to_druid_editattemptstep_daily]/Kerberos::Systemd_timer[eventlogging_to_druid_editattemptstep_daily]/Systemd::Timer::Job[eventlogging_to_druid_editattemptstep_daily]/Systemd::Unit[eventlogging_to_druid_editattemptstep_daily.service]/File[/lib/systemd/system/eventlogging_to_druid_editattemptstep_daily.service]/content:
--- /lib/systemd/system/eventlogging_to_druid_editattemptstep_daily.service	2020-06-25 16:30:24.993428461 +0000
+++ /tmp/puppet-file20201202-18158-1deuwq2	2020-12-02 16:54:54.995992540 +0000
@@ -4,4 +4,4 @@
 [Service]
 User=analytics
 SyslogIdentifier=eventlogging_to_druid_editattemptstep_daily
-ExecStart=/usr/local/bin/eventlogging_to_druid_editattemptstep_daily
+ExecStart=/usr/local/bin/kerberos-run-command analytics /usr/local/bin/eventlogging_to_druid_editattemptstep_daily

Info: Computing checksum on file /lib/systemd/system/eventlogging_to_druid_editattemptstep_daily.service
Info: /Stage[main]/Profile::Analytics::Refinery::Job::Druid_load/Profile::Analytics::Refinery::Job::Eventlogging_to_druid_job[editattemptstep]/Profile::Analytics::Refinery::Job::Spark_job[eventlogging_to_druid_editattemptstep_daily]/Kerberos::Systemd_timer[eventlogging_to_druid_editattemptstep_daily]/Systemd::Timer::Job[eventlogging_to_druid_editattemptstep_daily]/Systemd::Unit[eventlogging_to_druid_editattemptstep_daily.service]/File[/lib/systemd/system/eventlogging_to_druid_editattemptstep_daily.service]: Filebucketed /lib/systemd/system/eventlogging_to_druid_editattemptstep_daily.service to puppet with sum 4bf9bf19f560e1508030e503ea4d4279
Error: Execution of '/bin/systemd-analyze verify /lib/systemd/system/eventlogging_to_druid_editattemptstep_daily.service20201202-18158-iowwjz' returned 1: Failed to prepare filename /lib/systemd/system/eventlogging_to_druid_editattemptstep_daily.service20201202-18158-iowwjz: Invalid argument
Error: /Stage[main]/Profile::Analytics::Refinery::Job::Druid_load/Profile::Analytics::Refinery::Job::Eventlogging_to_druid_job[editattemptstep]/Profile::Analytics::Refinery::Job::Spark_job[eventlogging_to_druid_editattemptstep_daily]/Kerberos::Systemd_timer[eventlogging_to_druid_editattemptstep_daily]/Systemd::Timer::Job[eventlogging_to_druid_editattemptstep_daily]/Systemd::Unit[eventlogging_to_druid_editattemptstep_daily.service]/File[/lib/systemd/system/eventlogging_to_druid_editattemptstep_daily.service]/content: change from '{md5}4bf9bf19f560e1508030e503ea4d4279' to '{md5}1f3a3206d0101e79cbcf9b666a79d587' failed: Execution of '/bin/systemd-analyze verify /lib/systemd/system/eventlogging_to_druid_editattemptstep_daily.service20201202-18158-iowwjz' returned 1: Failed to prepare filename /lib/systemd/system/eventlogging_to_druid_editattemptstep_daily.service20201202-18158-iowwjz: Invalid argument

Now the weird thing is that if I make the change manually and run systemd-analyze verify, it works fine :(

(this is on an-launcher1002, I am doing manual changes, but there might be something to tune in the unit verification).

jcrespo lowered the priority of this task from Unbreak Now! to Medium.Dec 2 2020, 7:37 PM

Good news:

journalctl -u database-backups-snapshots


Dec 02 19:00:01 cumin2001 systemd[1]: Started Generate mysql snapshot backup batch.

confirms that backups started on both hosts on time. As far as I am concerned, this is fixed to me. Leaving it open in case someone else wants to do more research.