
Monitoring and alerting for Toolforge tools
Closed, Duplicate (Public)

Description

Does Toolforge have shared monitoring and alerting infrastructure? If not, could/should this be added? If yes, could it be better documented?

For example, my dinky little tool computes a data file and offers it for download. I’d like to get alerted when the data goes stale, which could happen when the pipeline that computes the data file has a problem. My webserver is exporting Prometheus metrics on https://qrank.toolforge.org/metrics and I’d like to get alerted when time() - qrank_last_modified_time_seconds exceeds four weeks (subtracting timestamps from the current time, as per Prometheus recommendations). Being new to Toolforge, I couldn’t find any docs on where to add such monitoring rules. How do other tools currently get monitored? Does every tool run its own Prometheus server? (That would seem a little wasteful.)
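To make the staleness condition above concrete, here is a minimal sketch in Python of the check time() - qrank_last_modified_time_seconds > four weeks, done by hand rather than by a Prometheus alerting rule. It is only an illustration under assumptions: the helper names (parse_metric, is_stale, check) are hypothetical, the metric is assumed to be exported without labels, and nothing here reflects how Toolforge itself monitors tools.

```python
import time
import urllib.request

METRICS_URL = "https://qrank.toolforge.org/metrics"  # URL from the task description
METRIC_NAME = "qrank_last_modified_time_seconds"     # metric from the task description
MAX_AGE_SECONDS = 4 * 7 * 24 * 60 * 60               # four weeks


def parse_metric(text, name):
    """Return the value of the first unlabeled sample called `name`, or None.

    Assumes the Prometheus text exposition format, where a sample line
    looks like `metric_name value [timestamp]` and comments start with '#'.
    """
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        if len(parts) >= 2 and parts[0] == name:
            return float(parts[1])
    return None


def is_stale(last_modified, now, max_age=MAX_AGE_SECONDS):
    """True when the data is older than max_age seconds."""
    return now - last_modified > max_age


def check(url=METRICS_URL):
    """Fetch the metrics page and report whether the data file is stale."""
    with urllib.request.urlopen(url, timeout=30) as response:
        text = response.read().decode("utf-8")
    last_modified = parse_metric(text, METRIC_NAME)
    if last_modified is None:
        raise RuntimeError(f"metric {METRIC_NAME} not found at {url}")
    return is_stale(last_modified, time.time())
```

Calling check() periodically and sending a notification when it returns True would approximate the alert, though a shared alerting service would still be the less wasteful option.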

As an external volunteer with limited time, I (sadly) can’t permanently keep an eye on my service. That’s why it would be quite important for me to automatically receive alerts when things go wrong. Surely other tool authors will be in a similar situation.

Event Timeline

Sascha changed the task status from Open to Stalled. (Edited, Mar 22 2021, 10:43 AM)

Thanks for the pointer! Indeed, I was hoping the Wikimedia Cloud had something like Cortex or Thanos running on behalf of custom tools. Hm, considering how long these discussions seem to have been going on already, it doesn’t really look like this will be coming anytime soon. So, closing this ticket here as stalled; things won’t go any faster with more tickets around.

I'd just like to add my support to this idea. Many of the tools that run in Toolforge are critical parts of the technical infrastructure that keeps the project going. They deserve all the normal logging, alerting and monitoring support that any serious production system has.

I'd love to see something like https://en.wikipedia.org/wiki/Graphite_(software) set up, so that any tool could easily feed performance data to it and tool maintainers could build their own dashboards. There's really no reason for each tool developer to reinvent the wheel on this kind of stuff.