Decide on monitoring solution
Closed, Resolved (Public)

Description

Figure out which monitoring solution we should use for our servers.

Requirements

  • Alert via email (to our "drift"-address)
  • Monitor if a server is not reachable (for a given period of time)
    • May or may not be needed since we have DownNotifier
  • Abnormal usage of resources
    • E.g. high CPU usage for an extended period of time indicates that something is wrong
    • Preferably with the option to automatically shut down the server
  • Low disk space remaining
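
To make the resource requirements above more concrete, here is a rough sketch in Python of what "high CPU usage for an extended period" and "low disk space" could mean as checks. It is not tied to any of the candidate tools. The function names, the thresholds (90 % CPU, 5 consecutive samples, 10 % free disk) and the sample format are all assumptions made up for illustration.

```python
import shutil


def cpu_high_for_extended_period(samples, threshold=90.0, min_consecutive=5):
    """Return True if the latest `min_consecutive` samples all exceed `threshold`.

    `samples` is a list of CPU usage percentages, oldest first (hypothetical
    format; a real tool would collect these at a fixed interval).
    """
    if len(samples) < min_consecutive:
        return False
    return all(s > threshold for s in samples[-min_consecutive:])


def disk_space_low(path="/", min_free_fraction=0.10):
    """Return True if the free share of the filesystem at `path` is below the limit."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        # Some virtual filesystems report a size of zero; treat that as "not low".
        return False
    return usage.free / usage.total < min_free_fraction
```

Requiring several consecutive samples above the threshold is what keeps a short spike from triggering an alert (or an automatic shutdown).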

Candidates

These haven't been checked against requirements.

Event Timeline

Can't think of any additional requirements at this time.

The remaining two issues, the OS becoming outdated and certificates expiring, are things we will likely have to monitor in other ways.

Checking if the OS is still maintained feels like something that could be automated. We should look into that.

As for certificates (I'm assuming that you mean HTTPS certificates): Certbot, which we are currently using, sends an email when a certificate is about to expire, which means that an automatic renewal has failed.

@kalle, do you have any experience of setting up things like this? Do you know what tools work or if something is missing from the requirements?

We could probably do something lightweight with Google Apps Script to check whether the sites are up. I found this script, although the code looks like it would be hard to adapt.
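
For comparison, a lightweight reachability check does not need much code. The sketch below uses Python's standard library rather than Google Apps Script, so it is not the script mentioned above. The function names are made up, and sending the result to the drift address is left out.

```python
import urllib.error
import urllib.request


def site_is_up(url, timeout=10, opener=urllib.request.urlopen):
    """Return True if `url` answers with a 2xx or 3xx status within `timeout` seconds.

    `opener` is only a parameter so the function can be tested without network access.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError, ValueError):
        # Covers HTTP errors (4xx/5xx), DNS failures, timeouts and malformed URLs.
        return False


def find_down_sites(urls, check=site_is_up):
    """Return the URLs in `urls` for which `check` reports failure."""
    return [url for url in urls if not check(url)]
```

Run from cron every few minutes with a list of our site URLs, a non-empty result from `find_down_sites` would be the trigger for a notification. A single failed request can be a network blip, so a real setup would probably want a few retries first.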

I don't know what kind of servers and services are going to be monitored, but these requirements would also be satisfied by the simplest Prometheus + Grafana setup:

  • Prometheus with node-exporter (for CPU metrics etc.) and blackbox exporter (for pings and HTTPS requests), data retention a few weeks;
  • Grafana with alerts enabled.
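
As an illustration of how the pieces in the list above fit together: the blackbox exporter exposes a `probe_success` metric (1 for a successful probe, 0 for a failed one), and Prometheus can be asked for its current value through its HTTP query API. The Python sketch below assumes a Prometheus instance on `http://localhost:9090` and that each target is identified by its `instance` label. Both are assumptions about the setup, and the function names are made up.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url, promql):
    """Build the URL for an instant query against Prometheus' /api/v1/query endpoint."""
    query = urllib.parse.urlencode({"query": promql})
    return base_url.rstrip("/") + "/api/v1/query?" + query


def failed_probes(payload):
    """Return the instances whose value is 0 in a decoded instant-query response."""
    if payload.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    failed = []
    for series in payload["data"]["result"]:
        _timestamp, value = series["value"]  # the value arrives as a string
        if float(value) == 0.0:
            failed.append(series["metric"].get("instance", "<unknown>"))
    return failed


def query_failed_probes(base_url="http://localhost:9090"):
    """Fetch `probe_success` and list failing targets (needs a running Prometheus)."""
    with urllib.request.urlopen(build_query_url(base_url, "probe_success")) as resp:
        return failed_probes(json.load(resp))
```

In practice a Grafana alert rule on the same metric would do this job, so the code is only meant to show what the alert is evaluating.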

It might even be enough to have Grafana alone, if you have a compatible datasource offered by your hosting provider. You can then set up Grafana alerts, which are enough when you don't need particularly sophisticated conditions.

There are a few subscription-based services which will run Grafana for you, for instance https://grafana.com/products/cloud/ (which shouldn't contain any proprietary add-ons, but I'm not sure how to check) and https://aiven.io/grafana (probably vanilla upstream Grafana).

@kalle set up a server running Influx for the Wikispeech server. If that works we should be able to add it to the other servers too.

I've created an account on Grafana Cloud. Setting up the integration on the Sites server was easy enough using the Linux template. It comes with a bunch of metrics and alerts, and I don't know what many of them do, but I guess it's all good stuff 🙂

Synthetic monitoring can check reachability. Again there are some default rules that are probably fine.

The only thing I haven't gotten to work yet is alerts. They fire in Grafana, but no email is sent, even though drift@ is set as default.

After a bit of flailing around I managed to get it to send emails for alerts. I think what did it was adding things to "Group by" under Alerting → Notification policies. It makes little sense to me why this was needed or why the email settings differ between Alertmanagers, but 🤷.

I'd say we go with Grafana for now. It seems to work fine for what we need.

Lokal_Profil claimed this task.

Grafana seems to work fine

Documentation at T332679