Set up monitoring for community cronjobs
Open, Medium, Public
Assigned To

None

Authored By

	Sascha
	Apr 25 2022, 11:44 AM

Description

Could the Wikimedia Cloud run a cronjob monitoring system and make it accessible to external volunteer developers? After each successful run, a cronjob would send a heartbeat via HTTP POST. When no heartbeat has been received for a while, an alert should go to the group that maintains the dead job. A discussion on the Wikimedia Cloud mailing list suggested github.com/healthchecks/healthchecks, which is BSD-licensed.
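
As an illustration of the heartbeat idea described above, here is a minimal sketch of a wrapper that runs a job and sends an HTTP POST only when the job exits successfully. The function names and the ping URL are hypothetical; no such endpoint exists on Wikimedia Cloud today, and the actual ping URL format would depend on whichever monitoring system gets deployed.

```shell
#!/usr/bin/env bash
# Sketch only: the ping URL passed in is a hypothetical endpoint,
# not an existing Wikimedia service.

# POST a heartbeat. -f makes curl fail on HTTP errors, -m caps the total time.
send_heartbeat() {
    local url="$1"
    curl -fsS -m 10 --retry 3 -o /dev/null -X POST "$url"
}

# Run a command and send the heartbeat only if it exits with status 0.
run_with_heartbeat() {
    local url="$1"
    shift
    local status=0
    "$@" || status=$?
    if [ "$status" -eq 0 ]; then
        # A failed ping must not mask the job's own success.
        send_heartbeat "$url" || echo "heartbeat failed" >&2
    fi
    return "$status"
}

# Usage (hypothetical URL and script name):
# run_with_heartbeat "https://monitoring.example.org/ping/my-job" ./my_script.sh
```

Because the heartbeat is only sent on success, a job that crashes, hangs, or never starts simply goes silent, and the monitoring side can alert on the silence.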

Related Objects

Mentioned Here: T53434: Establish an internal system or a recommended external system for monitoring user-created Toolforge web services
T278097: Monitoring and alerting for Toolforge tools

Event Timeline

Sascha created this task. Apr 25 2022, 11:44 AM

Restricted Application added a subscriber: Aklapper. Apr 25 2022, 11:44 AM

-jem- subscribed. Apr 25 2022, 11:59 AM

RhinosF1 added a project: Cloud-Services. Apr 25 2022, 11:59 AM

Restricted Application added a subscriber: RhinosF1. Apr 25 2022, 11:59 AM

Aklapper added a project: observability. Apr 25 2022, 12:09 PM

The problem, as I said on the mailing list, is that when the Wikimedia network is down there is simply no way to send out the down notices, so the system would be pretty useless for that purpose. It would have to run in another DC or be hosted externally (e.g. by a chapter).

This seems like a duplicate of T278097: Monitoring and alerting for Toolforge tools or T53434: Establish an internal system or a recommended external system for monitoring user-created Toolforge web services.

In T306790#7876656, @revi wrote:

The problem, as I said on the mailing list, is that when the Wikimedia network is down there is simply no way to send out the down notices, so the system would be pretty useless for that purpose. It would have to run in another DC or be hosted externally (e.g. by a chapter).

We already have sufficient (partially externally hosted) monitoring for the case where a large portion of our core network or the Cloud VPS / Toolforge platform itself goes down. So any solutions proposed here would only need to account for the tool itself having problems.

This seems like a duplicate

Somewhat, although this ticket was meant specifically for monitoring cronjob completions. This is different from (and simpler than) setting up Cortex/Thanos-like monitoring on metrics exposed by continuously running services.

any solutions proposed here would only need to account for the tool itself having problems

Agree. As the developer of a cronjob running on Toolforge, I can’t do anything when the entire Wikimedia network goes down, so I wouldn’t need to get alerted in this case. But when my cronjob hasn’t run for weeks because it’s been crashing while processing data (or has run out of disk, etc.), I’d find it super useful to get an alert.
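
To make the "hasn't run for weeks" case concrete, the check on the monitoring side amounts to comparing the time of the last heartbeat with an allowed maximum age. The following is only a sketch under assumptions: the function names and the "job_name last_seen_epoch" input format are hypothetical and do not describe how healthchecks or any Toolforge component stores its data.

```shell
#!/usr/bin/env bash
# Sketch only: input format and function names are hypothetical.

# Return 0 (stale) when the last heartbeat is older than max_age seconds.
# Arguments: last_seen_epoch max_age_seconds [now_epoch]
is_stale() {
    local last_seen="$1"
    local max_age="$2"
    local now="${3:-$(date +%s)}"
    [ $(( now - last_seen )) -gt "$max_age" ]
}

# Read "job_name last_seen_epoch" lines on stdin and print the stale job names.
# Arguments: max_age_seconds [now_epoch]
report_stale_jobs() {
    local max_age="$1"
    local now="${2:-$(date +%s)}"
    local name last_seen
    while read -r name last_seen; do
        if is_stale "$last_seen" "$max_age" "$now"; then
            echo "$name"
        fi
    done
}
```

Each job would get its own maximum age (for example, a bit more than its schedule interval), so a daily job that misses one run is flagged well before weeks have passed.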

TheresNoTime awarded a token. Apr 25 2022, 1:15 PM

TheresNoTime subscribed.

But when my cronjob hasn’t run for weeks because it’s been crashing while processing data (or has run out of disk, etc.), I’d find it super useful to get an alert.

As long as the mailing service is operating correctly, you can have your tool send email alerts whenever the cronjob crashes (I do so by inspecting the return code):

m h * * * jsub my_script.sh > /dev/null

# my_script.sh
doSomethingAndReturnErrorCodeOnFailure || send-mail
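
A slightly fuller sketch of the same return-code approach is shown below. It is an assumption-laden illustration: `notify_on_failure` and `send_alert` are hypothetical names, and `send_alert` assumes a working `mail` command on the host, which may not match the mail setup actually available to a tool.

```shell
#!/usr/bin/env bash
# Sketch only: function names are hypothetical, and a working `mail`
# command on the host is an assumption.

# Send an alert email with the given subject.
send_alert() {
    local recipient="$1"
    local subject="$2"
    echo "$subject" | mail -s "$subject" "$recipient"
}

# Run a command; if it exits non-zero, send an alert that includes the exit code.
notify_on_failure() {
    local recipient="$1"
    shift
    local status=0
    "$@" || status=$?
    if [ "$status" -ne 0 ]; then
        send_alert "$recipient" "cronjob '$*' failed with exit code $status" \
            || echo "alert failed" >&2
    fi
    return "$status"
}

# Usage (hypothetical address and script name):
# notify_on_failure "maintainers@example.org" ./my_script.sh
```

Note that this only catches runs that start and then fail. A job that is never scheduled at all sends nothing, which is the gap the heartbeat approach in the task description is meant to cover.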

lmata moved this task from Inbox to Radar on the observability board. Jun 24 2022, 1:42 PM

TheresNoTime removed a subscriber: RhinosF1. Dec 15 2022, 11:35 PM

fnegri edited projects, added Toolforge; removed Cloud-Services. Jul 11 2023, 10:12 AM

Just noting that grid jobs will soon be phased out, so any effort should probably focus on Toolforge jobs (Kubernetes-based).

dcaro triaged this task as Medium priority. Jan 30 2024, 12:01 PM

dcaro moved this task from Backlog to Workspace for triaging whenever needed on the Toolforge board.

dcaro moved this task from Workspace for triaging whenever needed to Ready to be worked on on the Toolforge board. Feb 21 2024, 4:03 PM

I think Kubernetes should be logging all cronjob failures somewhere.

What if we extend the jobs framework emailer logic a bit to monitor for such cronjob failures and send emails about them?

aborrero added a project: User-aborrero. Mar 7 2024, 1:47 PM

Superyetkin subscribed. Mar 7 2024, 2:44 PM

Don-vip awarded a token. Mar 15 2024, 6:11 PM

Don-vip subscribed.
