Part of making sure that everything in the MW pipeline is monitored in beta as well.
Description
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Open | | None | T53494 Use Beta cluster as a true canary for code deployments (epic) |
| Stalled | | None | T53497 Setup monitoring for Beta Cluster (tracking) |
| Open | | None | T87093 Setup monitoring for database servers in beta cluster |
Event Timeline
From T97120
The beta cluster MySQL servers turned out to be down for a few hours (T96905) and there is no monitoring for it.
We would need a check on both instances (deployment-db1 and deployment-db2) to ensure the mysql process is running.
The command line looks like:
/usr/sbin/mysqld --basedir=/usr --datadir=/mnt/sqldata \
    --plugin-dir=/usr/lib/mysql/plugin --user=mysql \
    --log-error=/mnt/sqldata/deployment-db1.err \
    --pid-file=/mnt/sqldata/deployment-db1.pid \
    --socket=/tmp/mysql.sock --port=3306
I guess we can just monitor whether a /usr/sbin/mysqld process is present.
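A process check along these lines could be sketched as follows. This is only an illustration, not an existing check: it assumes a Linux host with procfs and a Nagios/Icinga-style plugin convention (exit code 0 for OK, 2 for CRITICAL), which the task does not specify. The names `find_pids`, `check` and `main` are hypothetical; the binary path is the one from the command line quoted above.

```python
#!/usr/bin/env python3
"""Illustrative sketch: report whether a mysqld process is running."""
import os

# Path taken from the mysqld command line quoted in this task.
MYSQLD_PATH = "/usr/sbin/mysqld"

# Assumed Nagios/Icinga-style plugin exit codes.
OK, CRITICAL = 0, 2


def find_pids(executable, proc_root="/proc"):
    """Return the PIDs whose argv[0] equals `executable`, read from procfs."""
    pids = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "cmdline"), "rb") as fh:
                argv = fh.read().split(b"\0")
        except OSError:
            continue  # process exited meanwhile or is not readable
        if argv and argv[0].decode(errors="replace") == executable:
            pids.append(int(entry))
    return sorted(pids)


def check(executable=MYSQLD_PATH, proc_root="/proc"):
    """Return (exit_code, message) in the usual plugin style."""
    pids = find_pids(executable, proc_root)
    if pids:
        pid_list = ", ".join(str(pid) for pid in pids)
        return OK, "OK: %s running (pid %s)" % (executable, pid_list)
    return CRITICAL, "CRITICAL: no %s process found" % executable


def main():
    """Print the status line and return the exit code for the caller."""
    code, message = check()
    print(message)
    return code
```

To use it as a standalone plugin, the script would end with `import sys; sys.exit(main())`. Note that this only detects a missing process; it says nothing about whether the server accepts connections or is writable.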
Per beta cluster weekly triage:
The MySQL databases have only gone down a couple of times over 4 years, and we quickly noticed when it happened. The lack of monitoring is annoying but not a big deal, hence lowering priority.
The previous comments don't explain what/who exactly this task is stalled on ("If a report is waiting for further input (e.g. from its reporter or a third party) and can currently not be acted on"). Hence resetting task status.
If this task should not be worked on and fixing this is not worth the effort, then the task should have the "Declined" status.
"a check to ensure the mysql process is running."
Doesn't Puppet already ensure that the mysql process is running?
And if Puppet has a failure, aren't there already automatic emails to project admins?
Is it still possible that these are down for hours without any notification, as it was 10 years ago?
No, and as I understand it this is a deliberate setting from the DBAs, to help ensure that a prod server restart does not further corrupt files on disk. I don't remember, though, whether MariaDB doesn't start at all or starts in read-only mode and has to be manually switched to read-write. Either way, no, Puppet is not sufficient.
"Is it still possible that these are down for hours without any notification, as it was 10 years ago?"
The schema update job in Jenkins would whine to IRC, but I think that is the extent of notification.
Ah, right, I remember something like this now (Puppet is not supposed to auto-restart the mysqld process).
Is this blocked on "existence of monitoring system" (thinking of back when Icinga existed in Cloud VPS) or is there one that just has to be configured with some extra checks?
There is some monitoring for deployment-prep happening these days via https://wikitech.wikimedia.org/wiki/Help:Cloud_VPS_managed_monitoring. Setting up new monitors there is a bit involved and very manual at the moment. See T315695: Add basic MediaWiki/web site up alerting to the Beta Cluster for some notes.
I see! Thank you.
FTR, I was actually thinking that getting the old Icinga system back might not be the hardest of all the options. It was already there once, the Puppet module still exists, and configuring some basic checks like "is the host up" felt easier to me than the modern monitoring systems that replaced it.