
check_mariadb_dump failing on alert[12]* hosts
Closed, Resolved, Public

Description

Presumably because grants need to be adjusted?

For example, these checks are failing on alert1001:

[1597755894] SERVICE ALERT: alert1001;dump of s1 in codfw;CRITICAL;HARD;3;We could not connect to the backup metadata database
[1597755894] SERVICE ALERT: alert1001;dump of s8 in codfw;CRITICAL;HARD;3;We could not connect to the backup metadata database
[1597755894] SERVICE ALERT: alert1001;snapshot of s1 in eqiad;CRITICAL;HARD;3;We could not connect to the backup metadata database
[1597755894] SERVICE ALERT: alert1001;snapshot of s8 in eqiad;CRITICAL;HARD;3;We could not connect to the backup metadata database
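
For anyone triaging similar reports, the log lines above follow the usual Icinga/Nagios `SERVICE ALERT` layout (`[epoch] SERVICE ALERT: host;service;state;state type;attempt;output`). A minimal Python sketch for splitting such a line into fields follows; the `ServiceAlert` type and `parse_service_alert` function are illustrative names, not part of any existing tooling.

```python
from datetime import datetime, timezone
from typing import NamedTuple, Optional


class ServiceAlert(NamedTuple):
    """One parsed SERVICE ALERT log entry (hypothetical helper type)."""
    timestamp: datetime
    host: str
    service: str
    state: str        # e.g. OK, WARNING, CRITICAL
    state_type: str   # SOFT or HARD
    attempt: int
    output: str


def parse_service_alert(line: str) -> Optional[ServiceAlert]:
    """Parse a '[epoch] SERVICE ALERT: ...' line; return None if it does not match."""
    prefix, sep, rest = line.strip().partition(" SERVICE ALERT: ")
    if not sep or not (prefix.startswith("[") and prefix.endswith("]")):
        return None
    epoch = prefix[1:-1]
    # Split at most 5 times so semicolons inside the plugin output survive.
    fields = rest.split(";", 5)
    if not epoch.isdigit() or len(fields) != 6 or not fields[4].isdigit():
        return None
    host, service, state, state_type, attempt, output = fields
    return ServiceAlert(
        timestamp=datetime.fromtimestamp(int(epoch), tz=timezone.utc),
        host=host,
        service=service,
        state=state,
        state_type=state_type,
        attempt=int(attempt),
        output=output,
    )
```

Filtering the parsed entries on `state == "CRITICAL"` and `state_type == "HARD"` would isolate the alerts quoted in this description.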

The new hosts in question are:

alert1001.wikimedia.org has address 208.80.154.88
alert1001.wikimedia.org has IPv6 address 2620:0:861:3:208:80:154:88

alert2001.wikimedia.org has address 208.80.153.84
alert2001.wikimedia.org has IPv6 address 2620:0:860:3:208:80:153:84

Event Timeline

Marostegui triaged this task as Medium priority.
Marostegui moved this task from Triage to Pending comment on the DBA board.
Marostegui subscribed.

Assigning to Jaime to see if he can take an initial look during the week.

From T247966 I understand that alert1001 and alert2001 are new icinga hosts similar to the existing ones, right? If so, the only change needed is to add them to the allow list for grants on the dbbackups database.
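
As a rough illustration of what extending the allow list could look like, the Python sketch below generates one MariaDB `GRANT` statement per new host address. Only the addresses and the `dbbackups` database name come from this task; the account name `icinga_check`, the `SELECT` privilege, and the choice to grant per IP address are assumptions for illustration, and the real grants may differ. Credentials are deliberately left out.

```python
from typing import Iterable, List

# Addresses taken from the task description.
NEW_ICINGA_HOSTS = {
    "alert1001": ["208.80.154.88", "2620:0:861:3:208:80:154:88"],
    "alert2001": ["208.80.153.84", "2620:0:860:3:208:80:153:84"],
}


def build_grants(
    addresses: Iterable[str],
    user: str = "icinga_check",   # hypothetical account name
    database: str = "dbbackups",
    privilege: str = "SELECT",    # assumed read-only access
) -> List[str]:
    """Return one GRANT statement per client address."""
    return [
        f"GRANT {privilege} ON `{database}`.* TO '{user}'@'{address}';"
        for address in addresses
    ]


def all_addresses() -> List[str]:
    """Flatten the per-host address lists, keeping host order."""
    return [addr for addrs in NEW_ICINGA_HOSTS.values() for addr in addrs]
```

Printing `build_grants(all_addresses())` yields four statements that could be reviewed before being applied by hand or moved into configuration management.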

@fgiunchedi I've added the extra grants to fix the issue. I'm waiting on your confirmation that it is fixed (or at least that the new hosts show the same state as icinga1001) before puppetizing the grants for the new icinga hosts.

If it still doesn't work, then it is probably a firewall issue, although I would expect the firewall side to be handled transparently?
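
One way to tell a firewall problem apart from a grants problem is to test plain TCP reachability from the monitoring host first. The Python sketch below does only that; `port_reachable` is a hypothetical helper, and the database host and port are not stated in this task, so the caller has to supply them.

```python
import socket


def port_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    A refused connection, a timeout, or a name resolution failure all
    count as unreachable.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Usage (placeholders, not real values from this task):
#   port_reachable("<metadata-db-host>", <port>)
```

If the port is reachable but the check still reports that it cannot connect, missing grants are the more likely cause; if the port is not reachable, the firewall or network path is worth checking first.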

That's correct @jcrespo, those alert* hosts will be replacing the existing icinga hosts. I can confirm that we're OK now; the check works:

[1598259185] SERVICE ALERT: alert1001;dump of m3 in codfw;CRITICAL;SOFT;1;We could not connect to the backup metadata database
[1598259185] SERVICE ALERT: alert1001;dump of s5 in eqiad;CRITICAL;SOFT;1;We could not connect to the backup metadata database
[1598259245] SERVICE ALERT: alert1001;dump of m3 in codfw;OK;SOFT;2;Last dump for m3 at codfw (db2078.codfw.wmnet:3323) taken on 2020-08-18 00:56:43 (57 GB)
[1598259245] SERVICE ALERT: alert1001;dump of s5 in eqiad;OK;SOFT;2;Last dump for s5 at eqiad (db1145.eqiad.wmnet:3315) taken on 2020-08-18 00:00:02 (102 GB)

Change 622970 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb-backups: productionize backup stats and check database grants

https://gerrit.wikimedia.org/r/622970

Change 622970 merged by Jcrespo:
[operations/puppet@production] mariadb-backups: productionize backup stats and check database grants

https://gerrit.wikimedia.org/r/622970