
Investigate pt-heartbeat-wikimedia failure modes
Closed, Resolved, Public

Description

It looks like it will run forever if it can't connect to mariadb, but fails if the table schema isn't correct. Needs more investigation to confirm, and check other corner-cases too.

Summary of results: If pt-heartbeat-wikimedia is able to connect to mariadb on start, and the table schema meets its requirements, it will stay running even if mariadb goes away. Start-up problems are all fatal.
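
The behaviour summarised above can be sketched roughly as follows. This is only an illustration of the observed control flow, not the actual pt-heartbeat-wikimedia code (which is Perl); run_heartbeat, StartupError and the connect / check_schema / write_heartbeat callables are hypothetical names made up for the example.

```python
import time


class StartupError(Exception):
    """Raised when a start-up check fails; the caller treats it as fatal."""


def run_heartbeat(connect, check_schema, write_heartbeat, log,
                  interval=1.0, max_ticks=None, sleep=time.sleep):
    """Hypothetical model of the observed failure modes.

    connect, check_schema and write_heartbeat are caller-supplied
    callables (assumptions for this sketch, not real pt-heartbeat APIs).
    """
    # Start-up phase: a failed connection, an unknown database or a bad
    # table schema all abort immediately.
    try:
        conn = connect()
        check_schema(conn)
    except Exception as exc:
        raise StartupError(str(exc)) from exc

    # Steady state: errors are logged once per interval and retried,
    # so the loop survives the database stopping or restarting.
    ticks = 0
    while max_ticks is None or ticks < max_ticks:
        try:
            if conn is None:
                conn = connect()
            write_heartbeat(conn)
        except Exception as exc:
            log(str(exc))
            conn = None  # force a reconnect attempt on the next tick
        ticks += 1
        sleep(interval)
    return ticks
```

With max_ticks left at None the loop never returns, which matches the "does not exit" results in the tests below; max_ticks and sleep exist only so the sketch can be exercised without a real database.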

Event Timeline

Starting state: pt-hb running, mariadb running.
State change: stopping mariadb
Result: pt-hb starts logging this once per second:

Jun 17 09:55:01 zarcillo0 pt-heartbeat-wikimedia[3982]: Can't connect to local MySQL server through socket '/run/mysqld/mysqld.sock' (2)

It does not exit.

Starting state: pt-hb running, mariadb running.
State change: restarting mariadb
Result: pt-hb starts logging this once per second while mariadb is down, and then continues to work when mariadb is back:

Jun 17 09:55:01 zarcillo0 pt-heartbeat-wikimedia[3982]: Can't connect to local MySQL server through socket '/run/mysqld/mysqld.sock' (2)

It does not exit.

Starting state: pt-hb stopped, mariadb stopped.
State change: starting pt-hb
Result: pt-hb exits immediately with this error:

Jun 17 09:58:11 zarcillo0 pt-heartbeat-wikimedia[22267]: DBI connect('heartbeat;mysql_read_default_file=/dev/null;host=localhost;mysql_socket=/run/mysqld/mysqld.sock;mysql_read_default_group=client','root',...) failed: Can't connect to local MySQL server through socket '/run/mysqld/mysqld.sock' (2) at /usr/local/bin/pt-heartbeat-wikimedia line 2137.

Starting state: pt-hb stopped, mariadb running.
State change: starting pt-hb with invalid database name
Result: pt-hb exits immediately with this error:

Jun 17 10:02:32 zarcillo0 pt-heartbeat-wikimedia[23007]: DBI connect('heartbeatTEST;mysql_read_default_file=/dev/null;host=localhost;mysql_socket=/run/mysqld/mysqld.sock;mysql_read_default_group=client','root',...) failed: Unknown database 'heartbeatTEST' at /usr/local/bin/pt-heartbeat-wikimedia line 2137.

Starting state: pt-hb stopped, mariadb running.
State change: starting pt-hb with invalid table schema
Result: pt-hb exits immediately with this error:

Jun 17 10:05:07 zarcillo0 pt-heartbeat-wikimedia[23044]: Heartbeat table `heartbeatTEST`.`heartbeat` does not have a ts column at /usr/local/bin/pt-heartbeat-wikimedia line 4933.

This is quite a good approach.
The only one that can bite us is the one that happens after a reboot:

  • Both services stopped
  • Mariadb gets started before pt-heartbeat
  • pt-heartbeat remains stopped (which is fixed by the next puppet run), but we do have an alert for it.

Change 700898 had a related patch set uploaded (by Kormat; author: Kormat):

[operations/puppet@production] mariadb: Monitor pt-heartbeat for expected status.

https://gerrit.wikimedia.org/r/700898

Mentioned in SAL (#wikimedia-operations) [2021-06-22T13:49:31Z] <kormat> disabling puppet on A:db-all for T285079

Change 700898 merged by Kormat:

[operations/puppet@production] mariadb: Monitor pt-heartbeat for expected status.

https://gerrit.wikimedia.org/r/700898

We now have decent monitoring for pt-heartbeat - if it's not in the expected status (running on 'masters', stopped everywhere else) then we'll get an alert after 2 minutes.

The only thing left to do is to make that alert actually page when it should (specifically: only page if it's not running on a master in an active DC), and to notify in #wikimedia-operations.
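
As a rough sketch of the rule described above (expected status, plus when to page), assuming a hypothetical Host record and evaluate function. None of these names come from the puppet patch, and the 2-minute delay before alerting is not modelled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Host:
    """Hypothetical description of a database host, for illustration only."""
    name: str
    is_master: bool
    dc_active: bool


def expected_running(host: Host) -> bool:
    # pt-heartbeat should be running on masters and stopped everywhere else.
    return host.is_master


def evaluate(host: Host, is_running: bool) -> str:
    """Return "ok", "alert" or "page" for the observed process state."""
    if is_running == expected_running(host):
        return "ok"
    # Page only when it is missing on a master in an active DC.
    if host.is_master and host.dc_active and not is_running:
        return "page"
    # Any other mismatch (e.g. running on a replica) is a non-paging alert.
    return "alert"
```

For example, evaluate(Host("db-example", is_master=True, dc_active=True), is_running=False) gives "page", while the same master in a passive DC gives "alert" ("db-example" is a placeholder name).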

LSobanski moved this task from In progress to Backlog on the DBA board.

Change #1240680 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] mariadb: Alert on pt-heartbeat not running

https://gerrit.wikimedia.org/r/1240680

Change #1240680 merged by Marostegui:

[operations/puppet@production] mariadb: Alert on pt-heartbeat not running

https://gerrit.wikimedia.org/r/1240680

Change #1243070 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db2230: Enable notifications

https://gerrit.wikimedia.org/r/1243070

Change #1243070 merged by Marostegui:

[operations/puppet@production] db2230: Enable notifications

https://gerrit.wikimedia.org/r/1243070

Tested this alert on core hosts and it worked well (the linked doc doesn't exist yet):

[06:44:53]  <+icinga-wm> PROBLEM - pt-heartbeat-wikimedia process on db2230 is CRITICAL: PROCS CRITICAL: 0 processes with args pt-heartbeat-wikimedia https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23pt-heartbeat
[06:45:53]  <+icinga-wm> RECOVERY - pt-heartbeat-wikimedia process on db2230 is OK: PROCS OK: 1 process with args pt-heartbeat-wikimedia https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23pt-heartbeat

Change #1243594 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] mariadb: Add monitor_heartbeat to core hosts.

https://gerrit.wikimedia.org/r/1243594

Change #1243594 merged by Marostegui:

[operations/puppet@production] mariadb: Add monitor_heartbeat to core hosts.

https://gerrit.wikimedia.org/r/1243594

Marostegui claimed this task.

I've deployed this alert to core hosts (and parsercache).
NOT deployed to misc.

I've written the documentation that is attached to the alert: https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting#pt-heartbeat-wikimedia