
check_mariadb_dump failing on alert[12]* hosts
Closed, Resolved · Public

Description

Presumably because grants need to be adjusted?

For example, these checks are failing on alert1001:

[1597755894] SERVICE ALERT: alert1001;dump of s1 in codfw;CRITICAL;HARD;3;We could not connect to the backup metadata database
[1597755894] SERVICE ALERT: alert1001;dump of s8 in codfw;CRITICAL;HARD;3;We could not connect to the backup metadata database
[1597755894] SERVICE ALERT: alert1001;snapshot of s1 in eqiad;CRITICAL;HARD;3;We could not connect to the backup metadata database
[1597755894] SERVICE ALERT: alert1001;snapshot of s8 in eqiad;CRITICAL;HARD;3;We could not connect to the backup metadata database
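The alert lines above appear to follow the usual Nagios/Icinga log layout (an epoch timestamp in brackets, then semicolon-separated fields). As a rough sketch for triage, assuming that layout holds, they could be split into fields like this; the names `ServiceAlert` and `parse_service_alert` are made up for illustration and are not part of Icinga or of the check itself:

```python
import re
from datetime import datetime, timezone
from typing import NamedTuple, Optional

# Assumed layout:
# [epoch] SERVICE ALERT: host;service;state;state_type;attempt;output
ALERT_RE = re.compile(
    r"^\[(?P<epoch>\d+)\] SERVICE ALERT: "
    r"(?P<host>[^;]+);(?P<service>[^;]+);(?P<state>[^;]+);"
    r"(?P<state_type>[^;]+);(?P<attempt>\d+);(?P<output>.*)$"
)


class ServiceAlert(NamedTuple):
    """One parsed SERVICE ALERT log line (hypothetical helper type)."""
    when: datetime
    host: str
    service: str
    state: str
    state_type: str
    attempt: int
    output: str


def parse_service_alert(line: str) -> Optional[ServiceAlert]:
    """Return the parsed alert, or None if the line does not match."""
    m = ALERT_RE.match(line.strip())
    if m is None:
        return None
    return ServiceAlert(
        # The bracketed number is seconds since the Unix epoch.
        when=datetime.fromtimestamp(int(m["epoch"]), tz=timezone.utc),
        host=m["host"],
        service=m["service"],
        state=m["state"],
        state_type=m["state_type"],
        attempt=int(m["attempt"]),
        # The plugin output is taken verbatim, even if it contains ';'.
        output=m["output"],
    )
```

Filtering the parsed records on `state == "CRITICAL"` and `state_type == "HARD"` would list the checks that have exhausted their retries, which is the case for all four lines quoted here.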

The new hosts in question are:

alert1001.wikimedia.org has address 208.80.154.88
alert1001.wikimedia.org has IPv6 address 2620:0:861:3:208:80:154:88

alert2001.wikimedia.org has address 208.80.153.84
alert2001.wikimedia.org has IPv6 address 2620:0:860:3:208:80:153:84

Event Timeline

Restricted Application added a subscriber: Aklapper. · Aug 18 2020, 1:09 PM
Marostegui triaged this task as Medium priority.
Marostegui moved this task from Triage to Next on the DBA board.
Marostegui added a subscriber: Marostegui.

Assigning to Jaime to see if he can take an initial look during the week.

Marostegui reassigned this task from Marostegui to jcrespo. · Mon, Aug 24, 5:39 AM

From T247966 I understand that alert1001 and alert2001 are new icinga hosts similar to the existing ones, right? If so, the only change needed is to add them to the allow list for grants on the dbbackups database.
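For illustration only, here is a minimal sketch of what extending such an allow list might look like, rendered as MariaDB `GRANT` statements for the four addresses listed in the description. The account name `backup_check`, the `SELECT` privilege, and the assumption that grants are keyed by IP address are all hypothetical; the actual grants applied for this task are not shown in the thread.

```python
from typing import Iterable, List

# Addresses of the new hosts, copied from the task description.
NEW_HOSTS = [
    "208.80.154.88",               # alert1001 (IPv4)
    "2620:0:861:3:208:80:154:88",  # alert1001 (IPv6)
    "208.80.153.84",               # alert2001 (IPv4)
    "2620:0:860:3:208:80:153:84",  # alert2001 (IPv6)
]


def render_grants(
    hosts: Iterable[str],
    user: str = "backup_check",   # hypothetical account name
    database: str = "dbbackups",  # database named in the comment above
    privileges: str = "SELECT",   # assumed: the check only reads metadata
) -> List[str]:
    """Render one GRANT statement per client address.

    No escaping is done, so this is only suitable for trusted,
    hard-coded input such as the list above.
    """
    return [
        f"GRANT {privileges} ON `{database}`.* TO '{user}'@'{host}';"
        for host in hosts
    ]


if __name__ == "__main__":
    for statement in render_grants(NEW_HOSTS):
        print(statement)
```

Generating the statements from a single host list keeps the IPv4 and IPv6 entries for each host together, which makes it harder to forget one address family when a new monitoring host is added.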

jcrespo moved this task from Next to Done on the DBA board. · Mon, Aug 24, 8:57 AM

@fgiunchedi I've added the extra grants to fix the issue. I'm waiting on your confirmation that it is fixed (or at least that the new hosts show the same state as icinga1001) before puppetizing the grants for the new icinga hosts.

If it still doesn't work, it is most likely a firewall issue, although I would expect the firewall to be handled transparently?

That's correct @jcrespo, those alert* hosts will be replacing the existing icinga hosts. I can confirm that we're OK now, the check works:

[1598259185] SERVICE ALERT: alert1001;dump of m3 in codfw;CRITICAL;SOFT;1;We could not connect to the backup metadata database
[1598259185] SERVICE ALERT: alert1001;dump of s5 in eqiad;CRITICAL;SOFT;1;We could not connect to the backup metadata database
[1598259245] SERVICE ALERT: alert1001;dump of m3 in codfw;OK;SOFT;2;Last dump for m3 at codfw (db2078.codfw.wmnet:3323) taken on 2020-08-18 00:56:43 (57 GB)
[1598259245] SERVICE ALERT: alert1001;dump of s5 in eqiad;OK;SOFT;2;Last dump for s5 at eqiad (db1145.eqiad.wmnet:3315) taken on 2020-08-18 00:00:02 (102 GB)
jcrespo moved this task from Done to In progress on the DBA board. · Mon, Aug 24, 9:08 AM
lmata moved this task from Inbox to Radar on the observability board. · Thu, Aug 27, 2:16 PM

Change 622970 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb-backups: productionize backup stats and check database grants

https://gerrit.wikimedia.org/r/622970

Change 622970 merged by Jcrespo:
[operations/puppet@production] mariadb-backups: productionize backup stats and check database grants

https://gerrit.wikimedia.org/r/622970

jcrespo closed this task as Resolved. · Fri, Aug 28, 9:26 AM