
Set up monitoring for community cronjobs
Open, Medium, Public

Description

Could the Wikimedia Cloud run a cronjob monitoring system and make it accessible to external volunteer developers? After each successful run, a cronjob would send a heartbeat via HTTP POST. When no heartbeat has been received for a while, an alert should go to the group that maintains the dead job. A discussion on the Wikimedia cloud mailing list suggested github.com/healthchecks/healthchecks, which is BSD-licensed.
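To illustrate the heartbeat idea, here is a minimal sketch of a wrapper that runs a job and pings the monitoring service only when the job succeeds. The URL and the function names `send_heartbeat` and `run_with_heartbeat` are made up for illustration; a real ping URL would be issued by whatever monitoring service ends up being used.

```shell
#!/bin/sh
# Hypothetical ping URL (assumption); override via the environment.
HEARTBEAT_URL="${HEARTBEAT_URL:-https://monitoring.example.org/ping/my-job}"

# Send one heartbeat via HTTP POST.
#   -f          treat HTTP error responses as failures
#   -sS         no progress meter, but still print errors
#   -m 10       give up after 10 seconds
#   --retry 3   retry transient failures up to 3 times
send_heartbeat() {
    curl -fsS -m 10 --retry 3 -X POST -o /dev/null "$HEARTBEAT_URL"
}

# Run the given command; ping only if it exits with status 0.
# A failed or crashed run sends nothing, so the monitor eventually
# notices the missing heartbeat.
run_with_heartbeat() {
    "$@" && send_heartbeat
}

# Example use at the end of a job script (my_job.sh is a placeholder):
#   run_with_heartbeat ./my_job.sh
```

Pinging only on success means a job that crashes, hangs, or never gets scheduled all look the same to the monitor: the heartbeat simply stops arriving.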

Event Timeline

The problem, as I said on the mailing list, is that when the Wikimedia network is down there is simply no way to send out the down notices, so the system would be of little use for this purpose. It would therefore have to run in another data center or be hosted externally (e.g. by a chapter).

This seems like a duplicate of T278097: Monitoring and alerting for Toolforge tools or T53434: Establish an internal system or a recommended external system for monitoring user-created Toolforge web services.

The problem, as I said on the mailing list, is that when the Wikimedia network is down there is simply no way to send out the down notices, so the system would be of little use for this purpose. It would therefore have to run in another data center or be hosted externally (e.g. by a chapter).

We already have sufficient (partially externally hosted) monitoring for the case where a large portion of our core network or the Cloud VPS / Toolforge platform itself goes down. So any solutions proposed here would only need to account for the tool itself having problems.

This seems like a duplicate

Somewhat, although this ticket was meant specifically for monitoring cronjob completions. This is different from (and simpler than) setting up Cortex/Thanos-like monitoring on metrics exposed by continuously running services.

any solutions proposed here would only need to account for the tool itself having problems

Agree. As the developer of a cronjob running on Toolforge, I can’t do anything when the entire Wikimedia network goes down, so I wouldn’t need to get alerted in this case. But when my cronjob hasn’t run for weeks because it’s been crashing while processing data (or has run out of disk, etc.), I’d find it super useful to get an alert.

But when my cronjob hasn’t run for weeks because it’s been crashing while processing data (or has run out of disk, etc.), I’d find it super useful to get an alert.

As long as the mailing service is operating correctly, you can have your tool send email alerts whenever the cronjob crashes (I do so by inspecting the return code):

# crontab entry (m and h stand for the minute and hour fields)
m h * * * jsub my_script.sh > /dev/null

# my_script.sh
doSomethingAndReturnErrorCodeOnFailure || send-mail
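Spelled out a little further, the same return-code check could look like the sketch below. It assumes a `mail` command (mailx-style, accepting `-s` for the subject) is available where the job runs. The recipient address and the function names `notify_failure` and `run_or_alert` are hypothetical.

```shell
#!/bin/sh
# Hypothetical recipient (assumption); override via the environment.
ALERT_TO="${ALERT_TO:-maintainers@example.org}"

# Mail a short failure notice.
#   $1: job name   $2: exit status
notify_failure() {
    printf 'Job %s exited with status %s at %s\n' "$1" "$2" "$(date -u)" |
        mail -s "[cron] $1 failed" "$ALERT_TO"
}

# Run a command under a given job name and alert on a non-zero exit.
#   $1: job name   remaining arguments: the command to run
run_or_alert() {
    name=$1
    shift
    status=0
    "$@" || status=$?
    if [ "$status" -ne 0 ]; then
        notify_failure "$name" "$status"
    fi
    # Pass the job's exit status through to the caller.
    return "$status"
}

# Example use inside my_script.sh (the command is a placeholder):
#   run_or_alert my-tool ./doSomethingAndReturnErrorCodeOnFailure
```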

Just noting that grid jobs will soon be phased out, so any effort should probably focus on Toolforge jobs (Kubernetes-based).

dcaro triaged this task as Medium priority. Jan 30 2024, 12:01 PM
dcaro moved this task from Backlog to Workspace for triaging whenever needed on the Toolforge board.

I think Kubernetes should be logging all cronjob failures somewhere.

What if we extend the jobs framework emailer logic a bit to monitor for such cronjob failures and send emails about them?