
Add basic MediaWiki/web site up alerting to the Beta Cluster
Open, MediumPublic

Description

Not wishing to duplicate T53497: Setup monitoring for Beta Cluster (tracking), we should implement some very basic alerting for Beta Cluster availability.

I've run https://uptime.theresnotime.io/status/beta-cluster for a few months now, and have it configured to "page" me via pushover (as does jenkins-watch for Beta deployment jobs going overdue) — I'm happy to continue running this, and could potentially have Beta alerts feed into AlertManager etc via a webhook?

Actionable from https://wikitech.wikimedia.org/wiki/Incidents/2022-08-16_Beta_Cluster_502#Actionables

Event Timeline

Change 832326 had a related patch set uploaded (by Samtar; author: Samtar):

[operations/puppet@production] profile::beta::monitoring: Add blackbox check for meta.wikimedia.beta

https://gerrit.wikimedia.org/r/832326

Change 833782 had a related patch set uploaded (by Samtar; author: Samtar):

[operations/puppet@production] [WIP] prometheus/alerts_beta.yml: Add HostDown alert

https://gerrit.wikimedia.org/r/833782

Change 833782 abandoned by Samtar:

[operations/puppet@production] prometheus/alerts_beta.yml: Add HostDown alert

Reason:

https://gerrit.wikimedia.org/r/833782

Change 832326 abandoned by Samtar:

[operations/puppet@production] profile::beta::monitoring: Add blackbox check for meta.wikimedia.beta

Reason:

https://gerrit.wikimedia.org/r/832326

Joe subscribed.

Incident followup tags should be reserved for production issues.

We have prometheus powered alerts for things like Puppet breaking in deployment-prep now that get reported to the https://lists.wikimedia.org/postorius/lists/betacluster-alerts.lists.wikimedia.org/ mailing list and also to the #wikimedia-releng IRC channel. We should figure out how to add some web status alerting now.

I think in theory that could use something like the prometheus::blackbox::check::http stuff like @TheresNoTime proposed in https://gerrit.wikimedia.org/r/c/operations/puppet/+/832326, but it needs to be picked up by the Prometheus stack in the metricsinfra Cloud VPS project instead of the prod Prometheus stack. Maybe @taavi can clue us in on whether that is actually reasonably possible and what to do?

See also:

bd808 renamed this task from Add basic alerting to the Beta Cluster to Add basic MediaWiki/web site up alerting to the Beta Cluster. Jun 11 2025, 12:48 AM
bd808 triaged this task as Medium priority.
bd808 moved this task from To Triage to Backlog on the Beta-Cluster-Infrastructure board.

In theory we could set up a Prometheus server in beta cluster (I thought there was one already, but apparently not?), set it up to import Blackbox probes from PuppetDB, and feed alerts from there to the metricsinfra alertmanager. This is what we do in Toolforge and friends. But:

Prometheus and alertmanager are not really multi-tenant software, and metricsinfra is working around that by restricting the configuration of both pieces of software to only come from trusted sources. Unfortunately this means that we can't hook up any Prometheus instances in deployment-prep to the metricsinfra alertmanager, because I don't want to give everyone with deployment-prep access (200+ people) the ability to send arbitrary alerts or pages to anyone relying on that setup.

So that leaves us with two options in the short term:

  1. Set up an entirely separate alerting stack in deployment-prep, in addition to the separate Prometheus stack.
  2. Define any additional alerts for deployment-prep via the metricsinfra config database.

So that leaves us with two options in the short term:

  1. Set up an entirely separate alerting stack in deployment-prep, in addition to the separate Prometheus stack.

This feels like a poor solution, if only because we are still struggling to decide how anything in Beta Cluster is maintained. Adding more monitoring software there feels fragile when things that are already there are broken.

  2. Define any additional alerts for deployment-prep via the metricsinfra config database.

This sounds reasonable to me, especially if you are willing to give me a hand in figuring out how to make things work. I don't want to see you get stuck being the only human who knows how the bits fit together.

I don't want to see you get stuck being the only human who knows how the bits fit together.

I'm afraid I dug myself into this hole already when I built the very elaborate system in the hopes of eventually making this self-service but then never actually finished building a proper user interface for that system.

Dumping some notes to my future self here.

Following hints from https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Monitoring#Monitoring_for_Cloud_VPS and T386416#10566954 I figured out that the current alerting we have is from these contact group mappings.

MariaDB [prometheusconfig]> select cgm.id, cg.name, cgm.type, cgm.value from projects as p join contact_groups as cg join contact_group_members as cgm on p.default_contact_group_id=cg.id and cg.id=cgm.contact_group_id where p.name = 'deployment-prep';
+----+-------------+-------+----------------------------------------+
| id | name        | type  | value                                  |
+----+-------------+-------+----------------------------------------+
|  9 | betacluster | EMAIL | betacluster-alerts@lists.wikimedia.org |
| 10 | betacluster | IRC   | #wikimedia-releng                      |
| 11 | betacluster | PHAB  | PHID-PROJ-dthybs72vou24ydhgpbq         |
+----+-------------+-------+----------------------------------------+
3 rows in set (0.003 sec)
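Any new rows we add later will need the deployment-prep project_id as a foreign key. A minimal lookup sketch, assuming only the table and column names visible in the query above:

```sql
-- Hypothetical lookup: find the deployment-prep project id to use as
-- project_id in new alerts/scrapes rows.
SELECT id, name, default_contact_group_id
FROM projects
WHERE name = 'deployment-prep';
```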

To add new alerts to the system we need to add new rows into the alerts table:

MariaDB [prometheusconfig]> describe alerts;
+-------------+---------------+------+-----+---------+----------------+
| Field       | Type          | Null | Key | Default | Extra          |
+-------------+---------------+------+-----+---------+----------------+
| id          | int(11)       | NO   | PRI | NULL    | auto_increment |
| project_id  | int(11)       | NO   | MUL | NULL    |                |
| name        | varchar(255)  | NO   |     | NULL    |                |
| expr        | varchar(2048) | NO   |     | NULL    |                |
| duration    | varchar(32)   | NO   |     | 1m      |                |
| severity    | varchar(32)   | NO   |     | warn    |                |
| annotations | longtext      | NO   |     | NULL    |                |
+-------------+---------------+------+-----+---------+----------------+
7 rows in set (0.003 sec)

In our case these alerts will probably also need new Prometheus metrics from a custom endpoint defined in the scrapes table:

MariaDB [prometheusconfig]> describe scrapes;
+------------+----------------------------------+------+-----+----------+----------------+
| Field      | Type                             | Null | Key | Default  | Extra          |
+------------+----------------------------------+------+-----+----------+----------------+
| id         | int(11)                          | NO   | PRI | NULL     | auto_increment |
| project_id | int(11)                          | NO   | MUL | NULL     |                |
| name       | varchar(255)                     | NO   |     | NULL     |                |
| path       | varchar(255)                     | NO   |     | /metrics |                |
| scheme     | enum('tcp','udp','http','https') | NO   |     | http     |                |
+------------+----------------------------------+------+-----+----------+----------------+
5 rows in set (0.004 sec)

The quarry project has a custom alert that is roughly what we want to add:

MariaDB [prometheusconfig]> select p.name, a.* from alerts a join projects p on
a.project_id = p.id where p.name = 'quarry'\G
*************************** 1. row ***************************
       name: quarry
         id: 29
 project_id: 16
       name: Down
       expr: up{project="quarry",job="app"} == 0
   duration: 5m
   severity: critical
annotations: {"summary": "Quarry application is unreachable"}
1 row in set (0.002 sec)

MariaDB [prometheusconfig]> select s.id, p.name, s.name, s.scheme, sds.host, s.path, sds.port from scrapes s join scrape_discovery_static sds on s.id = sds.scrape_id join projects p on s.project_id = p.id where p.name = 'quarry' and s.name
= 'app'\G
*************************** 1. row ***************************
    id: 10
  name: quarry
  name: app
scheme: https
  host: quarry.wmcloud.org
  path: /app/metrics
  port: 443
1 row in set (0.004 sec)

There are several different scrape types:

MariaDB [prometheusconfig]> show tables like 'scrape\_%';
+----------------------------------------+
| Tables_in_prometheusconfig (scrape\_%) |
+----------------------------------------+
| scrape_blackbox_dns                    |
| scrape_blackbox_http                   |
| scrape_discovery_openstack             |
| scrape_discovery_static                |
+----------------------------------------+
4 rows in set (0.003 sec)

We may only need a scrape_blackbox_http row to accomplish our desired "is meta.beta up?" check:

MariaDB [prometheusconfig]> describe scrape_blackbox_http;
+------------------------+--------------+------+-----+---------+----------------+
| Field                  | Type         | Null | Key | Default | Extra          |
+------------------------+--------------+------+-----+---------+----------------+
| id                     | int(11)      | NO   | PRI | NULL    | auto_increment |
| scrape_id              | int(11)      | YES  | UNI | NULL    |                |
| method                 | varchar(255) | NO   |     | NULL    |                |
| headers                | longtext     | YES  |     | NULL    |                |
| follow_redirects       | tinyint(1)   | NO   |     | NULL    |                |
| valid_status_codes     | longtext     | YES  |     | NULL    |                |
| require_body_match     | longtext     | YES  |     | NULL    |                |
| require_body_not_match | longtext     | YES  |     | NULL    |                |
| host                   | varchar(255) | YES  |     | NULL    |                |
+------------------------+--------------+------+-----+---------+----------------+
9 rows in set (0.004 sec)
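Based on the columns above, registering a Beta probe might look roughly like the following. This is an untested sketch: the scrape name, path, hostname, and the LAST_INSERT_ID() wiring are all assumptions modeled on the project-proxy rows below, not verified config.

```sql
-- Hypothetical sketch: register a blackbox HTTP probe for Beta's meta wiki.
-- All literal values here are placeholders based on the schemas above.
INSERT INTO scrapes (project_id, name, path, scheme)
VALUES (
  (SELECT id FROM projects WHERE name = 'deployment-prep'),
  'beta-meta-https', '/wiki/Main_Page', 'https'
);

-- Attach the blackbox HTTP settings to the scrape row just created.
INSERT INTO scrape_blackbox_http
  (scrape_id, method, follow_redirects, valid_status_codes, require_body_match, host)
VALUES (
  LAST_INSERT_ID(), 'GET', 1, '[200]', '["MediaWiki"]',
  'meta.wikimedia.beta.wmflabs.org'  -- assumed hostname, needs checking
);
```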
MariaDB [prometheusconfig]> select s.id, p.name, s.name, sbbh.host, sbbh.method, sbbh.valid_status_codes, sbbh.require_body_match from scrape_blackbox_http sbbh join scrapes s on sbbh.scrape_id = s.id join projects p on s.project_id = p.id where p.name = 'project-proxy'\G
*************************** 1. row ***************************
                id: 21
              name: project-proxy
              name: main-nginx-https
              host: download.wmcloud.org
            method: GET
valid_status_codes: [200]
require_body_match: ["Index of"]
*************************** 2. row ***************************
                id: 28
              name: project-proxy
              name: main-proxy-https-vip
              host: NULL
            method: GET
valid_status_codes: [404]
require_body_match: ["Not Found"]
2 rows in set (0.003 sec)

If my guesses are right, that main-nginx-https blackbox scrape could be alerted on with an expr like up{job="main-nginx-https",project="project-proxy"} == 0. (up might not be the right metric there; blackbox exporter checks usually alert on probe_success instead...)
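Putting the pieces together, the alert row for deployment-prep could then be sketched roughly as below, modeled on the quarry 'Down' alert earlier. The alert name, the job label, and the annotations are placeholders; the job label in particular has to match whatever scrape name actually gets created.

```sql
-- Hypothetical sketch of the alert row; not a tested configuration.
INSERT INTO alerts (project_id, name, expr, duration, severity, annotations)
VALUES (
  (SELECT id FROM projects WHERE name = 'deployment-prep'),
  'BetaMetaDown',
  -- probe_success is the usual blackbox exporter signal; 'up' only says
  -- whether the exporter itself was scraped successfully.
  'probe_success{project="deployment-prep",job="beta-meta-https"} == 0',
  '5m',
  'critical',
  '{"summary": "Beta Cluster meta wiki is unreachable"}'
);
```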