wmf-auto-reinstall fails on hosts that run pt-heartbeat
Closed, Resolved, Public

Description

The post-install puppet run fails with:

May 12 12:01:40 pc2010 puppet-agent[8551]: (/Stage[main]/Mariadb::Heartbeat/Exec[pt-heartbeat]/returns) DBI connect('heartbeat;mysql_read_default_file=/dev/null;host=localhost;mysql_socket=/run/mysqld/mysqld.sock;mysql_read_default_group=client','root',...) failed: Can't connect to local MySQL server through socket '/run/mysqld/mysqld.sock' (2) at /usr/local/bin/pt-heartbeat-wikimedia line 2137.
May 12 12:01:40 pc2010 puppet-agent[8551]: '/usr/bin/perl /usr/local/bin/pt-heartbeat-wikimedia --defaults-file=/dev/null --user=root --host=localhost -D heartbeat --shard=pc1 --datacenter=codfw --update --replace --interval=1 --set-vars="binlog_format=STATEMENT" -S /run/mysqld/mysqld.sock --daemonize --pid /var/run/pt-heartbeat.pid' returned 2 instead of one of [0]
May 12 12:01:40 pc2010 puppet-agent[8551]: (/Stage[main]/Mariadb::Heartbeat/Exec[pt-heartbeat]/returns) change from 'notrun' to ['0'] failed: '/usr/bin/perl /usr/local/bin/pt-heartbeat-wikimedia --defaults-file=/dev/null --user=root --host=localhost -D heartbeat --shard=pc1 --datacenter=codfw --update --replace --interval=1 --set-vars="binlog_format=STATEMENT" -S /run/mysqld/mysqld.sock --daemonize --pid /var/run/pt-heartbeat.pid' returned 2 instead of one of [0]

Event Timeline

Marostegui subscribed.

This is "expected" as the host doesn't have MySQL up and running.
This is pretty much the last step of the script, so even if the installation is reported as failed, it has actually completed fine; it just didn't achieve a successful puppet run.

The workaround is to start mysql.
Ideally this service should check whether mysql is up before attempting to start.
We should probably move pt-heartbeat to a deb package + systemd unit, like we did with pt-kill, so that systemd handles it and ties it to the mysql service as a dependency.
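
As a rough illustration of the "check whether mysql is up" idea, a pre-start check could probe the socket before launching pt-heartbeat. This is only a sketch: the function name `mysql_socket_is_up` is hypothetical, the default socket path is taken from the log above, and a successful connect only shows that something is listening on the socket, not that MySQL is ready to serve queries.

```python
import socket


def mysql_socket_is_up(socket_path: str = "/run/mysqld/mysqld.sock",
                       timeout: float = 1.0) -> bool:
    """Return True if a process accepts connections on the Unix socket.

    Hypothetical helper: this only tests that the socket is connectable.
    It does not authenticate or run a query against the server.
    """
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        sock.connect(socket_path)
    except OSError:
        # Covers a missing socket file, connection refused, and timeouts.
        return False
    else:
        return True
    finally:
        sock.close()
```

A wrapper could call this and skip starting pt-heartbeat-wikimedia when it returns False, rather than failing the puppet run.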

Change 665324 had a related patch set uploaded (by Kormat; owner: Kormat):
[operations/puppet@production] mariadb: Convert pt-heartbeat to a systemd service.

https://gerrit.wikimedia.org/r/665324

Change 665337 had a related patch set uploaded (by Kormat; owner: Kormat):
[operations/software/wmfmariadbpy@master] switchover: Use heartbeat systemd service.

https://gerrit.wikimedia.org/r/665337

Change 665337 merged by jenkins-bot:

[operations/software/wmfmariadbpy@master] switchover: Use heartbeat systemd service.

https://gerrit.wikimedia.org/r/665337

Change 665324 merged by Kormat:

[operations/puppet@production] mariadb: Convert pt-heartbeat to a systemd service.

https://gerrit.wikimedia.org/r/665324

Kormat claimed this task.

This is now fixed. Puppet will no longer start/stop heartbeat. That is managed by db-switchover when changing masters. This does mean that pt-heartbeat-wikimedia needs to be started manually after a boot, however.

This does mean that pt-heartbeat-wikimedia needs to be started manually after a boot, however.

@Kormat is this captured somewhere in documentation?

If you can point me to an appropriate place, I'd be happy to add it.