We currently log the stdout/stderr of most of our important cron jobs to files. This makes it hard to notice when a script fails, which can lead to outages or other issues in our infrastructure.
We decided to replace all our cron jobs with systemd timers (where possible).
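To illustrate what such a conversion produces, here is a minimal sketch that renders the service/timer pair a single cron entry would turn into. It is not the Puppet code used for the real jobs: `render_units`, the unit name `hdfs-balancer`, the command path, the `hdfs` user and the schedule are all hypothetical. Only the systemd directives themselves (`Type=oneshot`, `ExecStart=`, `OnCalendar=`, `WantedBy=timers.target`) are standard.

```python
from textwrap import dedent


def render_units(name, command, on_calendar, user):
    """Return {filename: contents} for a oneshot service and the timer that triggers it.

    All arguments are illustrative; real values would come from the job definitions.
    """
    service = dedent(f"""\
        [Unit]
        Description={name} (converted from cron)

        [Service]
        Type=oneshot
        User={user}
        ExecStart={command}
        """)
    timer = dedent(f"""\
        [Unit]
        Description=Periodic run of {name}

        [Timer]
        OnCalendar={on_calendar}
        Unit={name}.service

        [Install]
        WantedBy=timers.target
        """)
    return {f"{name}.service": service, f"{name}.timer": timer}


# Hypothetical example: a daily 06:00 run of a balancer wrapper script.
units = render_units(
    name="hdfs-balancer",
    command="/usr/local/bin/hdfs-balancer",  # assumed path, not from the task
    on_calendar="*-*-* 06:00:00",
    user="hdfs",  # assumed user
)
for filename, contents in units.items():
    print(f"# {filename}\n{contents}")
```

With units like these, stdout/stderr go to the journal by default, and a non-zero exit leaves the service in the `failed` state, which `systemctl list-units --state=failed` reports and monitoring can alert on.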
List of critical jobs:
- Camus import jobs
- HDFS Balancer
- profile::analytics::refinery::job::data_check
- profile::analytics::refinery::job::data_purge
- profile::analytics::refinery::job::refine_job (profile::analytics::refinery::job::spark_job)
- profile::analytics::refinery::job::eventlogging_to_druid_job
- profile::analytics::refinery::job::project_namespace_map
- profile::analytics::refinery::job::sqoop_mediawiki
- report updater jobs
Last but not least:
- Final Puppet cleanup of old/unused code