Why
Some benefits:
- Telemetry into failures and progress of jobs, such as would have been useful during T282761.
- Less disruption to human consumers after DCs are switched: the logs stay in the same place as before.
- Not losing logs when the server is upgraded/re-imaged.
- Not losing logs when the server is lost.
- Not losing logs when we ensure => absent a job, as happened today.
How
Not sure. I believe we use rsyslog for this in other places, which seems like a good fit. It would mean the scripts continue to work "as expected" via stdout when invoked manually (including in production with production configuration), with no runtime coupling to Kafka and Elastic, and with all the batching/connectivity/retrying handled via a local buffer first.
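To make the idea concrete, here is a minimal sketch of the decoupling described above, assuming a small wrapper around the scheduled job. It is not how our jobs are set up today and may not be needed at all if the job runner already captures stdout. The function name `run_and_forward`, the tag, and the `LOG_LOCAL0` facility are made up for illustration. The script itself only writes to stdout; the wrapper hands each line to the local syslog daemon, which would then own buffering, retries, and shipping.

```python
import subprocess
import sys
import syslog


def run_and_forward(cmd, tag, emit=None):
    """Run cmd, echo its output to stdout, and hand each line to emit.

    By default emit sends the line to the local syslog daemon. The tag and
    the LOG_LOCAL0 facility are illustrative assumptions, not existing config.
    """
    if emit is None:
        syslog.openlog(ident=tag, facility=syslog.LOG_LOCAL0)

        def emit(line):
            syslog.syslog(syslog.LOG_INFO, line)

    # Merge stderr into stdout so failures end up in the same stream.
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    )
    for line in proc.stdout:
        sys.stdout.write(line)      # manual invocations still see plain stdout
        emit(line.rstrip("\n"))     # local hand-off only, no Kafka/Elastic here
    return proc.wait()
```

A scheduled job would call something like `run_and_forward(["php", "someScript.php"], "mw-maintenance-example")` (both arguments hypothetical), while a person running the script by hand skips the wrapper and sees the same output.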
What
- Decide how.
- Make it work.
- Update/write documentation on Wikitech for how to tail logs of scheduled maintenance scripts.