Could the Wikimedia Cloud run a cronjob monitoring system and make it accessible to external volunteer developers? After each successful run, cronjobs would send a heartbeat via HTTP POST. When no heartbeat has been received for a while, an alert should go to the group that maintains the dead job. A discussion on the Wikimedia Cloud mailing list suggested github.com/healthchecks/healthchecks, which is BSD-licensed.
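To make the proposal concrete, here is a minimal sketch of both halves of the heartbeat idea: the cronjob posting a heartbeat after a successful run, and the monitor deciding which jobs have been silent for too long. The function names, the ping URL format, and the choice to treat a job that has never reported as overdue are all assumptions for illustration; this is not how healthchecks implements it.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Optional
from urllib.request import Request, urlopen


def send_heartbeat(ping_url: str, timeout: float = 10.0) -> int:
    """Cronjob side: POST an empty body to the job's ping URL after a successful run.

    ping_url is hypothetical; the real URL depends on the monitoring service.
    """
    request = Request(ping_url, data=b"", method="POST")
    with urlopen(request, timeout=timeout) as response:
        return response.status


def overdue_jobs(
    last_heartbeat: Dict[str, Optional[datetime]],
    max_silence: Dict[str, timedelta],
    now: datetime,
) -> List[str]:
    """Monitor side: return the sorted names of jobs whose last heartbeat is too old.

    A job with no recorded heartbeat is reported as overdue (an assumption
    made for this sketch).
    """
    overdue = []
    for job, limit in max_silence.items():
        seen = last_heartbeat.get(job)
        if seen is None or now - seen > limit:
            overdue.append(job)
    return sorted(overdue)
```

The monitor would run overdue_jobs periodically and alert the maintainers of each job it returns. The allowed silence would normally be the job's schedule interval plus some grace time.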
Event Timeline
The problem, as I said on the mailing list, is that when the Wikimedia network is down, there is simply no way to send out the down notices, so the system would be pretty useless for that purpose. It would have to run in another data center or be hosted externally (i.e., by some chapter).
This seems like a duplicate of T278097: Monitoring and alerting for Toolforge tools or T53434: Establish an internal system or a recommended external system for monitoring user-created Toolforge web services.
We already have sufficient (partially externally hosted) monitoring if a large portion of our core network or the Cloud VPS / Toolforge platform itself goes down. So any solutions proposed here would only need to account for the tool itself having problems.
This seems like a duplicate
Somewhat, although this ticket was meant specifically for monitoring cronjob completions. That is different from (and simpler than) setting up Cortex/Thanos-like monitoring on metrics exposed by continuously running services.
any solutions proposed here would only need to account for the tool itself having problems
Agree. As the developer of a cronjob running on Toolforge, I can’t do anything when the entire Wikimedia network goes down, so I wouldn’t need to get alerted in this case. But when my cronjob hasn’t run for weeks because it’s been crashing while processing data (or has run out of disk, etc.), I’d find it super useful to get an alert.
But when my cronjob hasn’t run for weeks because it’s been crashing while processing data (or has run out of disk, etc.), I’d find it super useful to get an alert.
As long as the mailing service is operating correctly, you can have your tool send email alerts whenever the cronjob crashes (I do so by inspecting the return code):
m h * * * jsub my_script.sh > /dev/null

# my_script.sh
doSomethingAndReturnErrorCodeOnFailure || send-mail
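The same return-code approach can be sketched as a small wrapper that runs the real command and only builds a mail when the exit code is non-zero. The names run_and_report and build_failure_mail, and the recipient address in the usage note, are hypothetical. How the mail actually gets delivered is left to a caller-supplied send function, because the mail relay is deployment-specific.

```python
import subprocess
from email.message import EmailMessage


def build_failure_mail(command, returncode, stderr, recipient):
    """Build a plain-text mail describing the failed run."""
    message = EmailMessage()
    message["To"] = recipient
    message["Subject"] = f"Cronjob failed (exit code {returncode}): {' '.join(command)}"
    message.set_content(stderr or "(no output on stderr)")
    return message


def run_and_report(command, recipient, send=None):
    """Run command; on a non-zero exit code, pass a failure mail to send().

    send is any callable that accepts an EmailMessage. Returns the exit code.
    """
    result = subprocess.run(command, capture_output=True, text=True)
    if result.returncode != 0 and send is not None:
        send(build_failure_mail(command, result.returncode, result.stderr, recipient))
    return result.returncode
```

For example, run_and_report(["./my_script.sh"], "maintainers@example.org", send=smtp.send_message) would hand the mail to an already connected smtplib.SMTP object named smtp. Note that this only catches runs that start and then fail. A job that never starts sends nothing, which is the gap the heartbeat approach is meant to close.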
Just noting that grid jobs will soon be phased out, so any effort should probably focus on Toolforge jobs (Kubernetes-based).
I think Kubernetes should be logging all cronjob failures somewhere.
What if we extend the jobs framework emailer logic a bit to monitor for such cronjob failures and send emails about them?
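As a rough illustration of what such a check could look at: Kubernetes records a "Failed" condition on a Job's status, and Jobs created by a CronJob carry an owner reference to it. The sketch below filters the parsed output of `kubectl get jobs -o json` for failed runs. The function name is hypothetical, and nothing here reflects how the jobs framework emailer is actually implemented.

```python
def failed_cronjob_runs(job_list):
    """Return (cronjob_name, job_name, reason) for every failed Job owned by a CronJob.

    job_list is the parsed JSON document printed by `kubectl get jobs -o json`.
    """
    failures = []
    for job in job_list.get("items", []):
        metadata = job.get("metadata", {})
        # Jobs started by a CronJob reference it in ownerReferences.
        owners = [
            ref.get("name")
            for ref in metadata.get("ownerReferences") or []
            if ref.get("kind") == "CronJob"
        ]
        if not owners:
            continue
        for condition in job.get("status", {}).get("conditions") or []:
            if condition.get("type") == "Failed" and condition.get("status") == "True":
                failures.append((owners[0], metadata.get("name"), condition.get("reason")))
    return failures
```

An emailer extension could run a check like this periodically and mail the tool maintainers once per newly failed run. It would also need to remember which failures it has already reported, which this sketch leaves out.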