
Refactor analytics cronjobs to alarm on failure reliably
Closed, Resolved | Public | 21 Estimated Story Points

Description

We currently log the stdout/stderr of most of our important cron jobs to files. This makes it hard to notice when a script fails, which can lead to outages or other issues in our infrastructure.
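
To illustrate the problem, here is a minimal sketch (not the actual cron setup; the helper name `run_and_log` and the failing command are made up for the example). The job's output lands in a log file whether it succeeds or fails, so the only reliable failure signal is the exit status, which nobody inspects when a cron entry simply appends to a log.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_and_log(command, log_path):
    """Run a command, append its stdout/stderr to a log file, return its exit code."""
    with open(log_path, "a") as log:
        result = subprocess.run(command, stdout=log, stderr=subprocess.STDOUT)
    return result.returncode


if __name__ == "__main__":
    # Hypothetical failing job: prints an error message and exits with status 3.
    failing_job = [sys.executable, "-c", "import sys; print('import failed'); sys.exit(3)"]
    with tempfile.TemporaryDirectory() as tmp:
        log_file = Path(tmp) / "job.log"
        code = run_and_log(failing_job, log_file)
        # The error text is only in the file; the exit code is what a monitor can act on.
        print("log contents:", log_file.read_text().strip())
        print("exit code:", code)
```

A scheduler that records the exit status per job, as systemd does for each unit, makes this signal available without parsing log files.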

We decided to replace all our crons with systemd timers (where possible).
List of critical jobs:

  • Camus import jobs
  • HDFS Balancer
  • profile::analytics::refinery::job::data_check
  • profile::analytics::refinery::job::data_purge
  • profile::analytics::refinery::job::refine_job (profile::analytics::refinery::job::spark_job)
  • profile::analytics::refinery::job::eventlogging_to_druid_job
  • profile::analytics::refinery::job::project_namespace_map
  • profile::analytics::refinery::job::sqoop_mediawiki
  • report updater jobs

And last but not least:

  • Final Puppet cleanup of old/unused code
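
As a rough sketch of what replacing a cron entry with a systemd timer involves, the Python below renders a `.service`/`.timer` unit pair as text. It is not the Puppet code used in the patches. The function names, the job name `hdfs-balancer`, the script path, the user and the schedule are all assumptions chosen for illustration; only the systemd directives themselves are standard.

```python
def render_service(name, command, user):
    """Render a oneshot .service unit; a non-zero exit leaves the unit in 'failed' state."""
    return "\n".join([
        "[Unit]",
        f"Description={name}",
        "",
        "[Service]",
        "Type=oneshot",
        f"User={user}",
        f"ExecStart={command}",
        # Tag log lines so they can be filtered per job in syslog/journald.
        f"SyslogIdentifier={name}",
        "",
    ])


def render_timer(name, on_calendar):
    """Render the .timer unit that triggers <name>.service on a calendar schedule."""
    return "\n".join([
        "[Unit]",
        f"Description=Periodic execution of {name}.service",
        "",
        "[Timer]",
        f"Unit={name}.service",
        f"OnCalendar={on_calendar}",
        "",
        "[Install]",
        "WantedBy=timers.target",
        "",
    ])


if __name__ == "__main__":
    # Hypothetical job definition, for illustration only.
    print(render_service("hdfs-balancer", "/usr/local/bin/hdfs-balancer", "hdfs"))
    print(render_timer("hdfs-balancer", "*-*-* 06:00:00"))
```

Because a failed run leaves the service unit in the `failed` state, a monitoring check can query it (for example with `systemctl is-failed <unit>`) instead of inspecting log files. How the alert is wired up is not shown here.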

Details

Project             Branch      Lines +/-
operations/puppet   production  +10 -27
operations/puppet   production  +23 -51
operations/puppet   production  +21 -40
operations/puppet   production  +15 -5
operations/puppet   production  +1 -1
operations/puppet   production  +1 -1
operations/puppet   production  +3 -1
operations/puppet   production  +38 -2
operations/puppet   production  +2 -0
operations/puppet   production  +4 -0
operations/puppet   production  +1 -0
operations/puppet   production  +160 -123
operations/puppet   production  +5 -4
operations/puppet   production  +1 -1
operations/puppet   production  +34 -17
operations/puppet   production  +9 -5
operations/puppet   production  +35 -17
operations/puppet   production  +21 -3
operations/puppet   production  +61 -1
operations/puppet   production  +20 -60
operations/puppet   production  +8 -18
operations/puppet   production  +8 -1
operations/puppet   production  +13 -0
operations/puppet   production  +2 -2
operations/puppet   production  +56 -1
operations/puppet   production  +159 -286
operations/puppet   production  +1 -1
operations/puppet   production  +2 -2
operations/puppet   production  +64 -2
operations/puppet   production  +58 -3
operations/puppet   production  +101 -1
analytics/refinery  master      +3 -1
operations/puppet   production  +9 -0
operations/puppet   production  +4 -4
operations/puppet   production  +1 -1
operations/puppet   production  +1 -1
operations/puppet   production  +18 -0
operations/puppet   production  +1 -1
operations/puppet   production  +1 -14
operations/puppet   production  +12 -1
operations/puppet   production  +19 -6
operations/puppet   production  +2 -1
operations/puppet   production  +14 -7
operations/puppet   production  +2 -2
operations/puppet   production  +1 -3
operations/puppet   production  +0 -6
operations/puppet   production  +53 -0
analytics/refinery  master      +7 -0
operations/puppet   production  +4 -25
operations/puppet   production  +2 -0
operations/puppet   production  +6 -0
operations/puppet   production  +1 -1
operations/puppet   production  +16 -6
operations/puppet   production  +3 -6
operations/puppet   production  +13 -1
operations/puppet   production  +4 -0
operations/puppet   production  +1 -1
operations/puppet   production  +62 -0

Event Timeline

There are a very large number of changes, so older changes are hidden.

Change 462642 merged by Elukey:
[operations/puppet@production] profile::mariadb::misc::el::sanitization: add correct user to timer

https://gerrit.wikimedia.org/r/462642

Change 463259 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Add the analytics-alerts contact to the analytics contact group

https://gerrit.wikimedia.org/r/463259

Change 463259 merged by Elukey:
[operations/puppet@production] Add the analytics-alerts contact to the analytics contact group

https://gerrit.wikimedia.org/r/463259

Change 463306 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Replace analytics team's contacts with analytics-alerts

https://gerrit.wikimedia.org/r/463306

Change 463742 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::analytics::refinery::job::data_purge: add two timers

https://gerrit.wikimedia.org/r/463742

Change 463742 merged by Elukey:
[operations/puppet@production] profile::analytics::refinery::job::data_purge: add two timers

https://gerrit.wikimedia.org/r/463742

Change 463306 merged by Elukey:
[operations/puppet@production] Replace analytics team's contacts with analytics-alerts

https://gerrit.wikimedia.org/r/463306

Change 465630 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Refactor type Systemd::Timer::DateTime to include more normal forms

https://gerrit.wikimedia.org/r/465630

Change 465630 merged by Elukey:
[operations/puppet@production] Refactor type Systemd::Timer::DateTime to include more normal forms

https://gerrit.wikimedia.org/r/465630

Change 467262 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Remove starting ' ' from Calendar date/tim in analytics' systemd timers

https://gerrit.wikimedia.org/r/467262

Change 467263 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::analytics::refinery::job::data_purge: add systemd timer

https://gerrit.wikimedia.org/r/467263

Change 467262 merged by Elukey:
[operations/puppet@production] Remove starting ' ' from Calendar date/tim in analytics' systemd timers

https://gerrit.wikimedia.org/r/467262

Change 467263 merged by Elukey:
[operations/puppet@production] profile::analytics::refinery::job::data_purge: add systemd timer

https://gerrit.wikimedia.org/r/467263

Change 467291 had a related patch set uploaded (by Elukey; owner: Elukey):
[analytics/refinery@master] refinery-drop-druid-snapshots: avoid exit 1 when no ds available

https://gerrit.wikimedia.org/r/467291

Change 467291 merged by Elukey:
[analytics/refinery@master] refinery-drop-druid-snapshots: avoid exit 1 when no ds available

https://gerrit.wikimedia.org/r/467291
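
The change above matters once a timer alerts on any non-zero exit: a run that finds nothing to do would otherwise raise a false alarm. The sketch below shows the general idea only. It is not the actual `refinery-drop-druid-snapshots` code; the function names, the `keep` parameter and the snapshot names are hypothetical.

```python
def snapshots_to_drop(snapshots, keep):
    """Return the oldest snapshots, leaving the `keep` most recent ones in place."""
    ordered = sorted(snapshots)  # ISO-style dates sort chronologically as strings
    if keep <= 0:
        return ordered
    return ordered[:-keep]


def main(snapshots, keep=2):
    """Return the process exit code: 0 also when there is nothing to clean up."""
    if not snapshots:
        # An empty result is a normal outcome, not a failure, so report success.
        print("No datasource snapshots found, nothing to do.")
        return 0
    for name in snapshots_to_drop(snapshots, keep):
        print(f"Would drop {name}")
    return 0


if __name__ == "__main__":
    # A real script would pass this value to sys.exit().
    exit_code = main([])
    print("exit code:", exit_code)
```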

Change 467299 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::analytics::refinery::job::data_purge: move crons to timers

https://gerrit.wikimedia.org/r/467299

Change 467299 merged by Elukey:
[operations/puppet@production] profile::analytics::refinery::job::data_purge: move crons to timers

https://gerrit.wikimedia.org/r/467299

Change 467956 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::hadoop::balancer: move to systemd timer

https://gerrit.wikimedia.org/r/467956

Change 467956 merged by Elukey:
[operations/puppet@production] profile::hadoop::balancer: move to systemd timer

https://gerrit.wikimedia.org/r/467956

Change 468251 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::analytics::refinery::job::sqoop_mw: move to timers

https://gerrit.wikimedia.org/r/468251

Change 468251 merged by Elukey:
[operations/puppet@production] profile::analytics::refinery::job::sqoop_mw: move to timers

https://gerrit.wikimedia.org/r/468251

Change 471211 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::analytics::refinery::job::sqoop_mediawiki: fix permissions

https://gerrit.wikimedia.org/r/471211

Change 471211 merged by Elukey:
[operations/puppet@production] profile::analytics::refinery::job::sqoop_mediawiki: fix permissions

https://gerrit.wikimedia.org/r/471211

Change 471669 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Fix refinery-sqoop-mediawiki's script escaping

https://gerrit.wikimedia.org/r/471669

Change 471669 merged by Elukey:
[operations/puppet@production] Fix refinery-sqoop-mediawiki's script escaping

https://gerrit.wikimedia.org/r/471669

Change 475077 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::analytics::refinery::job:data_purge: remove unused items

https://gerrit.wikimedia.org/r/475077

Change 475077 merged by Elukey:
[operations/puppet@production] profile::analytics::refinery::job:data_purge: remove unused items

https://gerrit.wikimedia.org/r/475077

Change 483069 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::refinery::job::camus: conver netflow to systemd timer

https://gerrit.wikimedia.org/r/483069

Change 483085 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] systemd::timer: allow more normal forms for datetime type

https://gerrit.wikimedia.org/r/483085

Change 483085 merged by Elukey:
[operations/puppet@production] systemd::timer: allow more normal forms for datetime type

https://gerrit.wikimedia.org/r/483085

Change 483069 merged by Elukey:
[operations/puppet@production] profile::refinery::job::camus: conver netflow to systemd timer

https://gerrit.wikimedia.org/r/483069

Change 483345 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] camus::job: properly clean up crons when systemd is used

https://gerrit.wikimedia.org/r/483345

Change 483345 merged by Elukey:
[operations/puppet@production] camus::job: properly clean up crons when systemd is used

https://gerrit.wikimedia.org/r/483345

Change 483359 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::analytics::refinery::job::camus: move more crons to timers

https://gerrit.wikimedia.org/r/483359

Change 483364 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] systemd::timer::job: allow to specify SyslogIdentifier

https://gerrit.wikimedia.org/r/483364

Change 483364 merged by Elukey:
[operations/puppet@production] systemd::timer::job: allow to specify SyslogIdentifier

https://gerrit.wikimedia.org/r/483364

Change 483359 merged by Elukey:
[operations/puppet@production] profile::analytics::refinery::job::camus: move more crons to timers

https://gerrit.wikimedia.org/r/483359

Change 483378 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::analytics::refinery::job::camus: move all crons to timers

https://gerrit.wikimedia.org/r/483378

Change 483378 merged by Elukey:
[operations/puppet@production] profile::analytics::refinery::job::camus: move all crons to timers

https://gerrit.wikimedia.org/r/483378

Change 483383 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] camus: clean up references about crons

https://gerrit.wikimedia.org/r/483383

Change 483383 merged by Elukey:
[operations/puppet@production] camus: clean up references about crons

https://gerrit.wikimedia.org/r/483383

Change 483426 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::analytics::refinery: move sanitize_eventlogging_analytics to timer

https://gerrit.wikimedia.org/r/483426

Change 483691 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] systemd::syslog: allow to modify the $local_logdir convention

https://gerrit.wikimedia.org/r/483691

Change 483698 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] systemd::syslog|timer: add proper handling of ensure

https://gerrit.wikimedia.org/r/483698

Change 483715 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::reportupdater::jobs::hadoop: move jobs to systemd timers

https://gerrit.wikimedia.org/r/483715

Change 483691 abandoned by Elukey:
systemd::syslog: allow to modify the $local_logdir convention

https://gerrit.wikimedia.org/r/483691

Change 483715 abandoned by Elukey:
profile::reportupdater::jobs::hadoop: move jobs to systemd timers

https://gerrit.wikimedia.org/r/483715

Change 485210 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::reportupdater::jobs::hadoop: move jobs to systemd timers

https://gerrit.wikimedia.org/r/485210

Change 483698 abandoned by Elukey:
systemd::syslog|timer: add proper handling of ensure

https://gerrit.wikimedia.org/r/483698

Change 485210 merged by Elukey:
[operations/puppet@production] profile::reportupdater::jobs::hadoop: move jobs to systemd timers

https://gerrit.wikimedia.org/r/485210

Change 485591 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] reportupdater::job: use absolute paths in timer's definition

https://gerrit.wikimedia.org/r/485591

Change 485591 merged by Elukey:
[operations/puppet@production] reportupdater::job: use absolute paths in timer's definition

https://gerrit.wikimedia.org/r/485591

Change 485656 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] reportupdate: move all jobs to systemd timers

https://gerrit.wikimedia.org/r/485656

Change 485656 merged by Elukey:
[operations/puppet@production] reportupdater: move all jobs to systemd timers

https://gerrit.wikimedia.org/r/485656

Change 483426 merged by Elukey:
[operations/puppet@production] profile::analytics::refinery: move sanitize_eventlogging_analytics to timer

https://gerrit.wikimedia.org/r/483426

Change 485689 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::refinery::job::spark_job: add shebang to sh template

https://gerrit.wikimedia.org/r/485689

Change 485689 merged by Elukey:
[operations/puppet@production] profile::refinery::job::spark_job: add shebang to sh template

https://gerrit.wikimedia.org/r/485689

Change 485744 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::refinery::job::refine: move all crons to timers

https://gerrit.wikimedia.org/r/485744

Change 485744 merged by Elukey:
[operations/puppet@production] profile::refinery::job::refine: move all crons to timers

https://gerrit.wikimedia.org/r/485744

Change 485750 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::analytics::refinery::job::eventlogging_to_druid_job: move to timers

https://gerrit.wikimedia.org/r/485750

Change 485750 merged by Elukey:
[operations/puppet@production] profile::analytics::refinery::job::eventlogging_to_druid_job: move to timers

https://gerrit.wikimedia.org/r/485750

Change 485755 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::analytics::refinery::job::project_namespace_map: move to timers

https://gerrit.wikimedia.org/r/485755

Change 485755 merged by Elukey:
[operations/puppet@production] profile::analytics::refinery::job::project_namespace_map: move to timers

https://gerrit.wikimedia.org/r/485755

Change 485856 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::analytics::refinery::job::project_namespace_map: fix timer's script

https://gerrit.wikimedia.org/r/485856

Change 485856 merged by Elukey:
[operations/puppet@production] profile::analytics::refinery::job::project_namespace_map: fix timer's script

https://gerrit.wikimedia.org/r/485856

Change 485858 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::analytics::refinery::job::project_namespace_map: fix variables in erb

https://gerrit.wikimedia.org/r/485858

Change 485858 merged by Elukey:
[operations/puppet@production] profile::analytics::refinery::job::project_namespace_map: fix variables in erb

https://gerrit.wikimedia.org/r/485858

Change 485860 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] refinery-download-project-namespace-map.sh.erb: remove unnecessary escapes

https://gerrit.wikimedia.org/r/485860

Change 485860 merged by Elukey:
[operations/puppet@production] refinery-download-project-namespace-map.sh.erb: remove unnecessary escapes

https://gerrit.wikimedia.org/r/485860

Change 486017 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] systemd::timer::job: add proper handling of ensure

https://gerrit.wikimedia.org/r/486017

Change 486024 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::analytics::refinery::job::spark_job: handle ensure and cleanup

https://gerrit.wikimedia.org/r/486024

Change 486017 merged by Elukey:
[operations/puppet@production] systemd::timer::job: add proper handling of ensure

https://gerrit.wikimedia.org/r/486017

Change 486024 merged by Elukey:
[operations/puppet@production] profile::analytics::refinery::job::spark_job: handle ensure and cleanup

https://gerrit.wikimedia.org/r/486024

Change 486037 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Clean up old code in Analytics' classes using systemd::timer::job

https://gerrit.wikimedia.org/r/486037

Change 486037 merged by Elukey:
[operations/puppet@production] Clean up old code in Analytics' classes using systemd::timer::job

https://gerrit.wikimedia.org/r/486037

Change 486040 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::analyytics::refinery::job: remove old cron-related parameters

https://gerrit.wikimedia.org/r/486040

Change 486040 merged by Elukey:
[operations/puppet@production] profile::analytics::refinery::job: remove old cron-related parameters

https://gerrit.wikimedia.org/r/486040

elukey set the point value for this task to 21.
elukey updated the task description.

This is such an awesome task to close.