Does Toolforge have shared monitoring and alerting infrastructure? If not, could/should this be added? If yes, could it be better documented?
For example, my dinky little tool computes a data file and offers it for download. I’d like to get alerted when the data goes stale, which could happen when the pipeline that computes the file has a problem. My webserver exports Prometheus metrics at https://qrank.toolforge.org/metrics and I’d like to get alerted when time() - qrank_last_modified_time_seconds exceeds four weeks (exporting the timestamp and subtracting it from the current time, as per Prometheus recommendations). Being new to Toolforge, I couldn’t find any docs on where to add such monitoring rules. How do other tools currently get monitored? Does every tool run its own Prometheus server? (That would seem a little wasteful.)
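To make the desired check concrete, here is a minimal Python sketch of the same staleness test done outside Prometheus. It is only an illustration, not an existing Toolforge facility: the function names (parse_metric, is_stale, check) are hypothetical, and it assumes the metric is exposed as a single unlabelled sample in the Prometheus text format.

```python
import time
import urllib.request

METRICS_URL = "https://qrank.toolforge.org/metrics"
METRIC_NAME = "qrank_last_modified_time_seconds"
MAX_AGE_SECONDS = 4 * 7 * 24 * 60 * 60  # four weeks


def parse_metric(text, name):
    """Return the value of the first unlabelled sample called `name`, or None.

    Assumption: the sample line looks like "<name> <value>" with no labels.
    """
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        parts = line.split()
        if len(parts) >= 2 and parts[0] == name:
            return float(parts[1])
    return None


def is_stale(last_modified, now, max_age=MAX_AGE_SECONDS):
    """True when the data is older than max_age seconds."""
    return now - last_modified > max_age


def check(url=METRICS_URL):
    """Fetch the metrics page and report whether the data file is stale."""
    with urllib.request.urlopen(url, timeout=30) as response:
        text = response.read().decode("utf-8")
    last_modified = parse_metric(text, METRIC_NAME)
    if last_modified is None:
        raise RuntimeError(f"metric {METRIC_NAME} not found at {url}")
    return is_stale(last_modified, time.time())
```

A scheduled job could call check() and send a notification when it returns True, but that is exactly the kind of per-tool plumbing that shared alerting infrastructure would make unnecessary.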
As an external volunteer with limited time, I (sadly) can’t keep a constant eye on my service. That’s why it is quite important for me to receive alerts automatically when things go wrong. Surely other tool authors are in a similar situation.