Page MenuHomePhabricator

labweb - wikitech_run_jobs.service in failed state
Closed, ResolvedPublic

Description

● wikitech_run_jobs.service - Run Wikitech runJobs.php maintenance script
   Loaded: loaded (/lib/systemd/system/wikitech_run_jobs.service; static; vendor preset: enabled)
   Active: failed (Result: start-limit-hit) since Thu 2019-03-21 13:46:21 UTC; 752ms ago
  Process: 20367 ExecStart=/usr/local/bin/mwscript maintenance/runJobs.php --wiki=labswiki > /dev/null 2>&
 Main PID: 20367 (code=exited, status=0/SUCCESS)

Mar 21 13:46:21 labweb1001 systemd[1]: Started Run Wikitech runJobs.php maintenance script.
Mar 21 13:46:21 labweb1001 mwscript[20367]: Job queue is empty.
Mar 21 13:46:22 labweb1001 systemd[1]: wikitech_run_jobs.service: Start request repeated too quickly.
Mar 21 13:46:22 labweb1001 systemd[1]: Failed to start Run Wikitech runJobs.php maintenance script.
Mar 21 13:46:22 labweb1001 systemd[1]: wikitech_run_jobs.service: Unit entered failed state.
Mar 21 13:46:22 labweb1001 systemd[1]: wikitech_run_jobs.service: Failed with result 'start-limit-hit'.

Event Timeline

GTirloni created this task.

Additionally, labtestweb2001 is paging for this error while labweb1001 is not. Pages are not expected because the timer is not defined as critical in Puppet.

To clarify, this is simply surfacing errors that were being ignored when this was a cronjob. Now that it's a systemd timer, we get these alerts.

Failure history:

Mar 21 14:15:58 labweb1001 systemd[1]: Started Run Wikitech runJobs.php maintenance script.
Mar 21 14:15:58 labweb1001 mwscript[36424]: Job queue is empty.
Mar 21 14:15:59 labweb1001 systemd[1]: Started Run Wikitech runJobs.php maintenance script.
Mar 21 14:15:59 labweb1001 mwscript[36446]: Job queue is empty.
Mar 21 14:16:07 labweb1001 systemd[1]: Started Run Wikitech runJobs.php maintenance script.
Mar 21 14:16:08 labweb1001 mwscript[36475]: Job queue is empty.
Mar 21 14:16:08 labweb1001 systemd[1]: Started Run Wikitech runJobs.php maintenance script.
Mar 21 14:16:08 labweb1001 mwscript[36487]: Job queue is empty.
Mar 21 14:16:09 labweb1001 systemd[1]: Started Run Wikitech runJobs.php maintenance script.
Mar 21 14:16:09 labweb1001 mwscript[36499]: Job queue is empty.
Mar 21 14:16:12 labweb1001 systemd[1]: Started Run Wikitech runJobs.php maintenance script.
Mar 21 14:16:12 labweb1001 mwscript[36534]: Job queue is empty.
Mar 21 14:16:13 labweb1001 systemd[1]: Started Run Wikitech runJobs.php maintenance script.
Mar 21 14:16:14 labweb1001 mwscript[36553]: Job queue is empty.
Mar 21 14:16:14 labweb1001 systemd[1]: wikitech_run_jobs.service: Start request repeated too quickly.
Mar 21 14:16:14 labweb1001 systemd[1]: Failed to start Run Wikitech runJobs.php maintenance script.
Mar 21 14:16:14 labweb1001 systemd[1]: wikitech_run_jobs.service: Unit entered failed state.
Mar 21 14:16:14 labweb1001 systemd[1]: wikitech_run_jobs.service: Failed with result 'start-limit-hit'.
Mar 21 14:16:15 labweb1001 systemd[1]: wikitech_run_jobs.service: Start request repeated too quickly.
Mar 21 14:16:15 labweb1001 systemd[1]: Failed to start Run Wikitech runJobs.php maintenance script.
Mar 21 14:16:15 labweb1001 systemd[1]: wikitech_run_jobs.service: Failed with result 'start-limit-hit'.
Mar 21 14:16:17 labweb1001 systemd[1]: Started Run Wikitech runJobs.php maintenance script.
Mar 21 14:16:18 labweb1001 mwscript[36606]: Job queue is empty.
Mar 21 14:16:18 labweb1001 systemd[1]: Started Run Wikitech runJobs.php maintenance script.
Mar 21 14:16:19 labweb1001 mwscript[36618]: Job queue is empty.
Mar 21 14:16:19 labweb1001 systemd[1]: Started Run Wikitech runJobs.php maintenance script.
Mar 21 14:16:19 labweb1001 mwscript[36645]: Job queue is empty.
Mar 21 14:16:22 labweb1001 systemd[1]: Started Run Wikitech runJobs.php maintenance script.
Mar 21 14:16:22 labweb1001 mwscript[36664]: Job queue is empty.
Mar 21 14:16:24 labweb1001 systemd[1]: Started Run Wikitech runJobs.php maintenance script.
Mar 21 14:16:25 labweb1001 mwscript[36692]: Job queue is empty.
Mar 21 14:16:25 labweb1001 systemd[1]: wikitech_run_jobs.service: Start request repeated too quickly.
Mar 21 14:16:25 labweb1001 systemd[1]: Failed to start Run Wikitech runJobs.php maintenance script.
Mar 21 14:16:25 labweb1001 systemd[1]: wikitech_run_jobs.service: Unit entered failed state.
Mar 21 14:16:25 labweb1001 systemd[1]: wikitech_run_jobs.service: Failed with result 'start-limit-hit'.
Mar 21 14:16:27 labweb1001 systemd[1]: wikitech_run_jobs.service: Start request repeated too quickly.
Mar 21 14:16:27 labweb1001 systemd[1]: Failed to start Run Wikitech runJobs.php maintenance script.
Mar 21 14:16:27 labweb1001 systemd[1]: wikitech_run_jobs.service: Failed with result 'start-limit-hit'.
Mar 21 14:16:28 labweb1001 systemd[1]: Started Run Wikitech runJobs.php maintenance script.
Mar 21 14:16:29 labweb1001 mwscript[36713]: Job queue is empty.
Mar 21 14:16:29 labweb1001 systemd[1]: Started Run Wikitech runJobs.php maintenance script.
Mar 21 14:16:30 labweb1001 mwscript[36726]: Job queue is empty.
Mar 21 14:16:30 labweb1001 systemd[1]: Started Run Wikitech runJobs.php maintenance script.
Mar 21 14:16:30 labweb1001 mwscript[36738]: Job queue is empty.

Mentioned in SAL (#wikimedia-operations) [2019-03-21T14:18:04Z] <gtirloni> downtimed labtestweb2001 (T218881)

Change 498099 had a related patch set uploaded (by GTirloni; owner: GTirloni):
[operations/puppet@production] wikitech: Do not hide output from mwscript runjobs

https://gerrit.wikimedia.org/r/498099

Change 498099 merged by GTirloni:
[operations/puppet@production] wikitech: Do not hide output from mwscript runjobs

https://gerrit.wikimedia.org/r/498099

Change 498105 had a related patch set uploaded (by GTirloni; owner: GTirloni):
[operations/puppet@production] wikitech: Fix error in timer specification (every 1s -> 1min)

https://gerrit.wikimedia.org/r/498105

Change 498105 merged by GTirloni:
[operations/puppet@production] wikitech: Fix error in timer specification (every 1s -> 1min)

https://gerrit.wikimedia.org/r/498105

Change 498107 had a related patch set uploaded (by GTirloni; owner: GTirloni):
[operations/puppet@production] wikitech: Add missing zero in time specification

https://gerrit.wikimedia.org/r/498107

Change 498107 merged by GTirloni:
[operations/puppet@production] wikitech: Add missing zero in time specification

https://gerrit.wikimedia.org/r/498107

Due to an error in the timer specification, mwscript was running every second instead of every minute. That was too much for systemd and the unit would be put in failed state ocassionally.

I've fixed the timer specification and also allowed the command to create logs (by removing "2>&1"). If this happens again, we should see why it failed.

This confirmed systemd timers have better monitoring than cronjobs, which was very nice to see in action.