
Decide on monitoring solution
Open, Needs Triage, Public, 4 Estimated Story Points

Description

Figure out which monitoring solution we should use for our servers.

Requirements

  • Alert via email (to our "drift"-address)
  • Monitor if a server is not reachable (for a given period of time)
    • May or may not be needed since we have DownNotifier
  • Abnormal usage of resources
    • E.g. high CPU usage for an extended period of time indicates that something is wrong
    • Preferably with the option to automatically shut down the server
  • Low disk space remaining
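
As a rough illustration of what the resource and disk requirements amount to, here is a minimal Python sketch using only the standard library. The function names, the thresholds, the idea of sampling CPU usage at fixed intervals, and the placeholder addresses are all assumptions made for illustration; none of this comes from a candidate tool.

```python
import shutil
from email.message import EmailMessage


def disk_is_low(path="/", min_free_fraction=0.10):
    """Return True if the free space on `path` is below the given fraction."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total < min_free_fraction


def sustained_high(samples, threshold, min_samples):
    """Return True if the last `min_samples` samples all exceed `threshold`.

    This models "high CPU usage for an extended period": a single spike
    is ignored, only an unbroken run of high readings counts.
    """
    if len(samples) < min_samples:
        return False
    return all(s > threshold for s in samples[-min_samples:])


def build_alert(subject, body, to_addr, from_addr):
    """Build (but do not send) an alert email; the addresses are placeholders."""
    msg = EmailMessage()
    msg["Subject"] = subject
    msg["To"] = to_addr
    msg["From"] = from_addr
    msg.set_content(body)
    return msg
```

Requiring an unbroken run of high samples is what separates "something is wrong" from a short, harmless spike. A real monitoring tool would normally express this as an alert condition with a duration rather than hand-written code.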

Candidates

These haven't been checked against requirements.

Event Timeline

Can't think of any additional requirements at this time.

The remaining two issues, the OS becoming outdated and certificates expiring, are something we will likely have to monitor in other ways.

Checking if the OS is still maintained feels like something that could be automated. We should look into that.
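
One way this could be automated, sketched in Python under some assumptions: the server exposes an os-release file (as most Linux distributions do at `/etc/os-release`), and we maintain our own table of end-of-life dates. The table below contains a made-up placeholder entry, not real dates, and the function names are hypothetical.

```python
from datetime import date


def parse_os_release(text):
    """Parse the KEY=value lines of an os-release file into a dict."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        info[key] = value.strip().strip('"').strip("'")
    return info


def is_maintained(info, eol_dates, today):
    """Return True/False if the release is listed, None if it is unknown."""
    key = (info.get("ID"), info.get("VERSION_ID"))
    eol = eol_dates.get(key)
    if eol is None:
        return None
    return today <= eol


# Placeholder entry for illustration only; real dates must be filled in by hand.
EOL_DATES = {("exampleos", "1.0"): date(2030, 1, 1)}
```

On a server this would be fed with the contents of `/etc/os-release` and `date.today()`. Returning `None` for unknown releases keeps a missing table entry from being mistaken for "still maintained".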

As for certificates (I'm assuming that you are referring to HTTPS certificates): Certbot, which we are currently using, sends an email if a certificate is about to expire, meaning that an automatic renewal has failed.
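
If we wanted an independent check instead of relying only on that email, something like the following Python sketch could work. It uses only the standard library; the function names and the idea of running it periodically are assumptions, not something we have in place.

```python
import socket
import ssl
import time


def seconds_left(not_after, now=None):
    """Seconds until a certificate's notAfter timestamp (negative if expired)."""
    now = time.time() if now is None else now
    return ssl.cert_time_to_seconds(not_after) - now


def fetch_not_after(host, port=443, timeout=10):
    """Connect to `host` over TLS and return its certificate's notAfter field."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

Note that with the default context an already expired certificate makes the TLS handshake fail with an exception, so a caller would need to treat that as an alert too.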

@kalle, do you have any experience setting up things like this? Do you know which tools work, or if something is missing from the requirements?

We could probably do something lightweight with Google Apps Script to check whether the sites are up. I found this script, although the code looks like it would be hard to adapt.
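
For comparison, the core of such an uptime check is small. Here is a minimal sketch in Python (not Apps Script) using only the standard library; the function name and the timeout are assumptions, and scheduling and alerting are left out.

```python
import urllib.error
import urllib.request


def site_is_up(url, timeout=10):
    """Return True if `url` answers with a 2xx status, False otherwise.

    Redirects are followed by urlopen; 4xx/5xx responses raise HTTPError,
    which is a subclass of URLError and is therefore reported as down.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError, ValueError):
        return False
```

A single failed request does not necessarily mean the site is down, so in practice the check would probably be repeated a few times before an alert is sent.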

I don't know what kind of servers and services are going to be monitored, but these requirements would also be satisfied by the simplest Prometheus+Grafana setup:

  • Prometheus with node-exporter (for CPU metrics etc.) and blackbox exporter (for pings and HTTPS requests), data retention a few weeks;
  • Grafana with alerts enabled.

It might even be enough to have Grafana alone, if you have a compatible datasource offered by your hosting provider. You can then set up Grafana alerts, which are enough when you don't need particularly sophisticated conditions.

There are a few subscription-based services that will run Grafana for you, for instance https://grafana.com/products/cloud/ (which shouldn't contain any proprietary add-ons, but I'm not sure how to check) and https://aiven.io/grafana (probably vanilla upstream Grafana).

@kalle set up a server running Influx for the Wikispeech server. If that works, we should be able to add it to the other servers too.