Monitoring and alerting for Toolforge tools
Open, Stalled, Needs Triage, Public


Does Toolforge have shared monitoring and alerting infrastructure? If not, could/should this be added? If yes, could it be better documented?

For example, my dinky little tool computes a data file and offers it for download. I’d like to get alerted when the data goes stale, which could happen when the pipeline that computes the data file has a problem. My webserver exports Prometheus metrics, and I’d like to get alerted when time() - qrank_last_modified_time_seconds exceeds four weeks (subtracting the timestamp from the current time, as per Prometheus recommendations). Being new to Toolforge, I couldn’t find any docs on where to add such monitoring rules. How do other tools currently get monitored? Does every tool run its own Prometheus server? (That would seem a little wasteful.)
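
The staleness condition described above can be sketched in Python as follows. This is only an illustration of the arithmetic, not how Prometheus or Toolforge evaluates alerts: the helper names read_gauge and is_stale and the sample metric value are hypothetical, and the parser assumes the metric is exported without labels in the Prometheus text exposition format.

```python
from typing import Optional

# Four weeks expressed in seconds (4 * 7 days).
FOUR_WEEKS_SECONDS = 4 * 7 * 24 * 60 * 60

# Hypothetical sample of what the tool's metrics endpoint might return.
SAMPLE_METRICS = """\
# TYPE qrank_last_modified_time_seconds gauge
qrank_last_modified_time_seconds 1.6e+09
"""


def read_gauge(exposition_text: str, metric_name: str) -> Optional[float]:
    """Return the value of an unlabeled metric, or None if it is absent."""
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and HELP/TYPE comment lines.
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        if len(parts) >= 2 and parts[0] == metric_name:
            return float(parts[1])
    return None


def is_stale(last_modified: float, now: float,
             max_age: float = FOUR_WEEKS_SECONDS) -> bool:
    """True when now - last_modified is strictly greater than max_age."""
    return now - last_modified > max_age
```

A periodic job could fetch the metrics text, pass it to read_gauge, and send a notification when is_stale(value, time.time()) returns True. How to schedule that job and deliver the notification on Toolforge is exactly the open question of this task, so it is left out of the sketch.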

As an external volunteer with limited time, I (sadly) can’t permanently keep an eye on my service. That’s why it would be quite important for me to automatically receive alerts when things go wrong. Surely other tool authors will be in a similar situation.

Event Timeline

Sascha changed the task status from Open to Stalled. Edited Mar 22 2021, 10:43 AM

Thanks for the pointer! Indeed, I was hoping that Wikimedia Cloud had something like Cortex or Thanos running on behalf of custom tools. Hm, considering how long these discussions seem to have been going on already, it doesn’t really look like this will be coming anytime soon. So I’m closing this ticket as stalled; things won’t go any faster with more tickets around.