Setup monitoring for database servers in beta cluster
Open, Low, Public

Description

Part of making sure that everything in the MW pipeline is monitored in beta as well.

Event Timeline

yuvipanda claimed this task.
yuvipanda raised the priority of this task from to Medium.
yuvipanda updated the task description.
yuvipanda added subscribers: greg, scfc, Krinkle and 8 others.
greg renamed this task from Setup monitoring for database servers in betalabs to Setup monitoring for database servers in beta cluster. Mar 10 2015, 8:47 PM
greg set Security to None.
hashar added subscribers: Ryasmeen, Shizhao, thcipriani, Krenair.

From T97120

The beta cluster MySQL servers turned out to be down for a few hours (T96905) and there is no monitoring for it.

We would need on both instances (deployment-db1 and deployment-db2) a check to ensure the mysql process is running.

The command line looks like:

/usr/sbin/mysqld --basedir=/usr --datadir=/mnt/sqldata \
  --plugin-dir=/usr/lib/mysql/plugin --user=mysql \
  --log-error=/mnt/sqldata/deployment-db1.err \
  --pid-file=/mnt/sqldata/deployment-db1.pid \
  --socket=/tmp/mysql.sock --port=3306

I guess we can just monitor whether a /usr/sbin/mysqld process is running.
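A minimal sketch of such a process check, assuming a Linux host with /proc and the /usr/sbin/mysqld path from the command line above. The function names are hypothetical, and the return values 0 (OK) and 2 (CRITICAL) follow the usual Nagios plugin convention; nothing here is taken from an existing check in the beta cluster.

```python
import os


def mysqld_running(cmdlines, binary="/usr/sbin/mysqld"):
    """Return True if the executable (argv[0]) of any command line equals `binary`.

    Each item in `cmdlines` is the raw content of /proc/<pid>/cmdline,
    where arguments are separated by NUL characters.
    """
    for cmdline in cmdlines:
        argv = cmdline.split("\0")
        if argv[0] == binary:
            return True
    return False


def read_cmdlines(proc_root="/proc"):
    """Yield the command line of every process listed under `proc_root` (Linux only)."""
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a PID directory
        try:
            with open(os.path.join(proc_root, entry, "cmdline"), "rb") as fh:
                yield fh.read().decode("utf-8", "replace")
        except OSError:
            continue  # the process exited while we were scanning


def main():
    """Print a one-line status and return a Nagios-style exit code."""
    if mysqld_running(read_cmdlines()):
        print("OK: mysqld is running")
        return 0
    print("CRITICAL: mysqld is not running")
    return 2
```

A wrapper script could call `sys.exit(main())` so that a monitoring system picks up the exit code. Matching on argv[0] rather than on a substring avoids false positives from unrelated processes that merely mention mysqld in their arguments.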

hashar lowered the priority of this task from Medium to Low. Jun 15 2015, 7:17 PM

Per beta cluster weekly triage:

The MySQL databases have only gone down a couple of times over 4 years, and we quickly noticed when it happened. The lack of monitoring is certainly annoying but is not a big deal, hence lowering the priority.

hashar changed the task status from Open to Stalled. Oct 30 2015, 10:51 PM
Aklapper changed the task status from Stalled to Open. May 19 2020, 4:00 PM

The previous comments don't explain what/who exactly this task is stalled on ("If a report is waiting for further input (e.g. from its reporter or a third party) and can currently not be acted on"). Hence resetting task status.

If this task should not be worked on and fixing this is not worth the effort, then the task should have the "Declined" status.

greg changed the task status from Open to Stalled. May 19 2020, 4:13 PM

Reflecting reality of team resourcing.

bd808 changed the task status from Stalled to Open. Sep 17 2025, 9:22 PM

> a check to ensure the mysql process is running.

Doesn't puppet already ensure that the mysql process is running?

And if puppet has a failure there are already automatic emails to project admins?

Is it still possible that these are down for hours without any notification.. as it was 10 years ago?

> a check to ensure the mysql process is running.

> Doesn't puppet already ensure that the mysql process is running?

No, and as I understand it this is a deliberate setting from the DBAs, to help ensure that a prod server restart does not further corrupt files on disk. I don't remember whether mariadb doesn't start at all or starts in a read-only mode that has to be manually switched to read-write. Either way, no, Puppet is not sufficient.
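Since Puppet will not bring the server back up, an external probe is one way to notice that it is down. The following is only a sketch of a TCP reachability check against the port from the command line quoted earlier (3306); the function name is hypothetical. It only shows that something accepts connections on the port, so it would not detect a server that is up but in read-only mode.

```python
import socket


def port_accepts_connections(host, port=3306, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

For example, `port_accepts_connections("deployment-db1")` could be run from another instance, assuming that short hostname resolves from there.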

> Is it still possible that these are down for hours without any notification.. as it was 10 years ago?

The schema update job in Jenkins would whine to IRC, but I think that is the extent of notification.

Ah, right, I remember something like this now (puppet not supposed to auto-restart mysqld process).

Is this blocked on "existence of monitoring system" (thinking of back when icinga existed in cloud VPS) or is there one that just has to be configured with some extra checks?

> Is this blocked on "existence of monitoring system" (thinking of back when icinga existed in cloud VPS) or is there one that just has to be configured with some extra checks?

There is some monitoring for deployment-prep happening these days via https://wikitech.wikimedia.org/wiki/Help:Cloud_VPS_managed_monitoring. Setting up new monitors there is a bit involved and very manual at the moment. See T315695: Add basic MediaWiki/web site up alerting to the Beta Cluster for some notes.

I see! thank you.

ftr, I was actually thinking that getting the old icinga system back might not be the hardest of all the options. It was already there once, the puppet module still exists and configuring some basic checks like "is the host up" felt easier to me than the modern monitoring systems that replaced it.