mariadb: Replication lag monitoring does not support circular replication
Closed, Resolved (Public)

Assigned To

Authored By

	• Kormat
	Feb 23 2021, 1:29 PM

Description

The current puppet logic for replication lag monitoring (monitor_replication.pp#L47-L52) is:

# check the lag towards the mw_primary datacenter's master
$mw_primary = mediawiki::state('primary_dc')
nrpe::monitor_service { "mariadb_replica_sql_lag_${name}":
    description   => "MariaDB Replica Lag: ${name}",
    nrpe_command  => "${check_mariadb} --check=slave_sql_lag \
                      --shard=${name} --datacenter=${mw_primary} \
                      --sql-lag-warn=${lag_warn} \
                      --sql-lag-crit=${lag_crit}",

This causes the master in the primary DC to monitor lag from itself. In the case of unidirectional replication, this is a no-op as check-mariadb.pl notices that there's no slave thread running, and skips the check. With circular replication (a la x2, or core sections in the lead-up to a DC switchover), monitoring should be looking at the lag from the other DC, but currently does not.
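The intended behaviour described above can be sketched outside Puppet. The Python below is illustrative only: the function name `lag_source_dc`, the `DATACENTERS` tuple, and the boolean arguments are assumptions made for this sketch, not code from operations/puppet or check-mariadb.pl.

```python
from typing import Optional

# Assumed for illustration: the two datacenters named in this task.
DATACENTERS = ("eqiad", "codfw")


def lag_source_dc(host_dc: str, mw_primary: str, is_master: bool,
                  circular: bool) -> Optional[str]:
    """Return the DC whose master this host's lag should be measured against.

    None means there is nothing meaningful to check.
    """
    if not is_master or host_dc != mw_primary:
        # Same as the current check: measure lag towards the primary DC's master.
        return mw_primary
    if circular:
        # The primary-DC master also replicates from the other DC,
        # so that is the lag worth watching.
        return next(dc for dc in DATACENTERS if dc != host_dc)
    # Unidirectional replication: the primary-DC master has no upstream.
    return None
```

Under these assumptions, the eqiad master with eqiad as primary gets `None` for unidirectional replication and `"codfw"` for circular replication, which is the case the current check misses.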

Details

Subject	Repo	Branch	Lines +/-
mariadb: Use section params: remaining profiles.	operations/puppet	production	+29 -28
mariadb: Use section parameters: smaller misc profiles	operations/puppet	production	+56 -33
mariadb: Set misc nodes in codfw as 'master'	operations/puppet	production	+8 -3
mariadb: Add section parameters: misc	operations/puppet	production	+14 -15
mariadb: Add section parameters: core::multiinstance	operations/puppet	production	+8 -5
mariadb: Add section parameters	operations/puppet	production	+173 -40


Event Timeline

• Kormat created this task. Feb 23 2021, 1:29 PM

Restricted Application added a subscriber: Aklapper. Feb 23 2021, 1:29 PM

• Kormat updated the task description. Feb 23 2021, 1:31 PM

profile::mariadb::replication_lag has a similar issue:

# Don't monitor replication lag for 'standalone' hosts, or section masters in the primary DC
if ($role == 'master' and !$is_on_primary_dc) or $role == 'slave' {
    monitoring::check_prometheus { "mariadb-prolonged-lag-${title}":
        description     => 'MariaDB sustained replica lag',
        dashboard_links => ["https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=${::hostname}&var-port=${prom_port}"],
        query           => "scalar(avg_over_time(mysql_slave_status_seconds_behind_master{instance=\"${::hostname}:${prom_port}\"}[5m]))",

A related issue is that when we switch over to codfw as primary DC, we do _not_ switch the misc sections, so puppet code which depends on mw_primary == section primary is then wrong.

Proposal

For every section, define:

writeable DC: mwprimary/eqiad/codfw/both
replication type: none/unidirectional/circular

This will allow correct monitoring of both inter-DC replication lag and the read-only status of masters.
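As a rough illustration of how the two proposed per-section values could drive those checks, here is a minimal Python sketch. The class and function names are hypothetical, and the rule used for unidirectional replication (only a master outside the writeable DC has an upstream) is an assumption of this sketch, not something taken from the merged patches.

```python
from dataclasses import dataclass

# Assumed for illustration: the two datacenters named in this task.
DATACENTERS = ("eqiad", "codfw")


@dataclass(frozen=True)
class SectionParams:
    writeable_dc: str  # "mwprimary", "eqiad", "codfw" or "both"
    replication: str   # "none", "unidirectional" or "circular"


def writeable_dcs(params: SectionParams, mw_primary: str) -> set:
    """Resolve the writeable_dc setting to a concrete set of DC names."""
    if params.writeable_dc == "mwprimary":
        return {mw_primary}
    if params.writeable_dc == "both":
        return set(DATACENTERS)
    return {params.writeable_dc}


def master_should_be_read_only(params: SectionParams, host_dc: str,
                               mw_primary: str) -> bool:
    """A section master outside the writeable DC(s) is expected to be read-only."""
    return host_dc not in writeable_dcs(params, mw_primary)


def master_has_upstream(params: SectionParams, host_dc: str,
                        mw_primary: str) -> bool:
    """Whether a section master replicates from the other DC,
    so that inter-DC lag is worth checking on it."""
    if params.replication == "none":
        return False
    if params.replication == "circular":
        return True
    # Unidirectional (assumption): only the non-writeable side replicates.
    return host_dc not in writeable_dcs(params, mw_primary)
```

For a misc section pinned with `SectionParams("eqiad", "unidirectional")`, a switchover that makes codfw the MediaWiki primary changes nothing: the codfw master is still reported as read-only and still has an upstream to check, because neither answer depends on `mw_primary`.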

As we discussed on IRC, we'd need to switch those flags as a pre-step of the DC switchover, since for XX days before and after the switchover we enable circular replication on the sX, x1, pcX and esX sections (for the last few switchovers we've left mX aside, but it is only a matter of time before we start switching those too).

Marostegui triaged this task as Medium priority. Feb 25 2021, 6:47 AM

Marostegui moved this task from Triage to In progress on the DBA board.

Change 667547 had a related patch set uploaded (by Kormat; owner: Kormat):
[operations/puppet@production] mariadb: Add section parameters

https://gerrit.wikimedia.org/r/667547

gerritbot added a project: Patch-For-Review. Mar 1 2021, 3:42 PM

Mentioned in SAL (#wikimedia-operations) [2021-03-04T09:30:01Z] <kormat> disabling puppet on all db hosts while deploying a puppet monitoring change T275497

Change 667547 merged by Kormat:
[operations/puppet@production] mariadb: Add section parameters

https://gerrit.wikimedia.org/r/667547

Maintenance_bot removed a project: Patch-For-Review. Mar 4 2021, 10:11 AM

Deployed to s4 without issues. Deploying to s5 now.

Deployment complete.

Ah, that was premature. This still needs to be fixed for the other profiles.

Change 668031 had a related patch set uploaded (by Kormat; owner: Kormat):
[operations/puppet@production] mariadb: Add section parameters: core::multiinstance

https://gerrit.wikimedia.org/r/668031

Change 668444 had a related patch set uploaded (by Kormat; owner: Kormat):
[operations/puppet@production] mariadb: Add section parameters: misc

https://gerrit.wikimedia.org/r/668444

Change 668031 merged by Kormat:
[operations/puppet@production] mariadb: Add section parameters: core::multiinstance

https://gerrit.wikimedia.org/r/668031

Change 668444 merged by Kormat:
[operations/puppet@production] mariadb: Add section parameters: misc

https://gerrit.wikimedia.org/r/668444

• Kormat removed a parent task: T269324: Productionize x2 databases. Mar 4 2021, 2:56 PM

Maintenance_bot removed a project: Patch-For-Review. Mar 4 2021, 3:12 PM

Change 668464 had a related patch set uploaded (by Kormat; owner: Kormat):
[operations/puppet@production] mariadb: Use section parameters: misc profiles.

https://gerrit.wikimedia.org/r/668464

gerritbot added a project: Patch-For-Review. Mar 4 2021, 3:32 PM

Change 669821 had a related patch set uploaded (by Kormat; owner: Kormat):
[operations/puppet@production] mariadb: Set misc nodes in codfw as 'master'

https://gerrit.wikimedia.org/r/669821

Change 669821 merged by Kormat:
[operations/puppet@production] mariadb: Set misc nodes in codfw as 'master'

https://gerrit.wikimedia.org/r/669821

Change 668464 merged by Kormat:
[operations/puppet@production] mariadb: Use section parameters: smaller misc profiles

https://gerrit.wikimedia.org/r/668464

Change 669845 had a related patch set uploaded (by Kormat; owner: Kormat):
[operations/puppet@production] mariadb: Use section params: remaining profiles.

https://gerrit.wikimedia.org/r/669845

Change 669845 merged by Kormat:
[operations/puppet@production] mariadb: Use section params: remaining profiles.

https://gerrit.wikimedia.org/r/669845

This should all be in place now.

• Kormat moved this task from In progress to Done on the DBA board. Mar 11 2021, 10:29 AM
