
mariadb: Replication lag monitoring does not support circular replication
Closed, Resolved, Public

Description

The current Puppet logic for replication lag monitoring, in monitor_replication.pp#L47-L52, is:

# check the lag towards the mw_primary datacenter's master
$mw_primary = mediawiki::state('primary_dc')
nrpe::monitor_service { "mariadb_replica_sql_lag_${name}":
    description   => "MariaDB Replica Lag: ${name}",
    nrpe_command  => "${check_mariadb} --check=slave_sql_lag \
                      --shard=${name} --datacenter=${mw_primary} \
                      --sql-lag-warn=${lag_warn} \
                      --sql-lag-crit=${lag_crit}",

This causes the master in the primary DC to monitor lag from itself. In the case of unidirectional replication, this is a no-op as check-mariadb.pl notices that there's no slave thread running, and skips the check. With circular replication (a la x2, or core sections in the lead-up to a DC switchover), monitoring should be looking at the lag from the other DC, but currently does not.
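The failure mode can be modelled in a few lines of Python. This is only an illustrative sketch, not the real check: the `Master` class and all function names are hypothetical, and the only behaviour taken from the description above is that the check is always pointed at the mw_primary DC and is skipped when no replica thread is running.

```python
from dataclasses import dataclass
from typing import Optional

DATACENTERS = ("eqiad", "codfw")


@dataclass(frozen=True)
class Master:
    """Hypothetical model of a section master in one datacenter."""
    dc: str
    replica_thread_running: bool  # True if it replicates from the other DC


def checked_dc_today(mw_primary: str) -> str:
    # The existing Puppet always passes --datacenter=${mw_primary}.
    return mw_primary


def upstream_dc(master: Master) -> Optional[str]:
    # Assumption: a master with a running replica thread replicates
    # from the master in the other datacenter.
    if not master.replica_thread_running:
        return None
    return next(dc for dc in DATACENTERS if dc != master.dc)


def check_outcome(master: Master, mw_primary: str) -> str:
    upstream = upstream_dc(master)
    if upstream is None:
        return "skipped"   # no replica thread: the check is a no-op
    if checked_dc_today(mw_primary) == upstream:
        return "ok"        # lag is measured against the real upstream
    return "wrong-dc"      # lag is measured against the host's own DC
```

With eqiad as mw_primary, `check_outcome(Master("eqiad", False), "eqiad")` gives "skipped" (the unidirectional case), while `check_outcome(Master("eqiad", True), "eqiad")` gives "wrong-dc" (the circular case this task is about).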

Event Timeline

profile::mariadb::replication_lag has a similar issue:

# Don't monitor replication lag for 'standalone' hosts, or section masters in the primary DC
    if ($role == 'master' and !$is_on_primary_dc) or $role == 'slave' {
        monitoring::check_prometheus { "mariadb-prolonged-lag-${title}":
            description     => 'MariaDB sustained replica lag',
            dashboard_links => ["https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=${::hostname}&var-port=${prom_port}"],
            query           => "scalar(avg_over_time(mysql_slave_status_seconds_behind_master{instance=\"${::hostname}:${prom_port}\"}[5m]))",

A related issue is that when we switch over to codfw as primary DC, we do _not_ switch the misc sections, so puppet code which depends on mw_primary == section primary is then wrong.

Proposal

For every section, define:

  • writeable DC: mwprimary/eqiad/codfw/both
  • replication type: none/unidirectional/circular

This will allow correct monitoring for inter-DC replication-lag, and read-only master status.
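To make the proposal concrete, here is a rough Python sketch of how the two per-section settings could drive both decisions. It is not the Puppet that was eventually written: `SectionParams` and the function names are hypothetical, and it assumes that unidirectional replication flows from the writeable DC to the other one.

```python
from dataclasses import dataclass
from typing import Optional, Set

DATACENTERS = ("eqiad", "codfw")


@dataclass(frozen=True)
class SectionParams:
    """Hypothetical per-section settings, mirroring the two bullets above."""
    writeable_dc: str  # "mwprimary", "eqiad", "codfw" or "both"
    replication: str   # "none", "unidirectional" or "circular"


def writeable_dcs(params: SectionParams, mw_primary: str) -> Set[str]:
    if params.writeable_dc == "both":
        return set(DATACENTERS)
    if params.writeable_dc == "mwprimary":
        return {mw_primary}
    return {params.writeable_dc}


def master_should_be_read_only(params: SectionParams, host_dc: str,
                               mw_primary: str) -> bool:
    return host_dc not in writeable_dcs(params, mw_primary)


def master_lag_source(params: SectionParams, host_dc: str,
                      mw_primary: str) -> Optional[str]:
    """DC to measure a section master's lag against, or None to skip."""
    other = next(dc for dc in DATACENTERS if dc != host_dc)
    if params.replication == "none":
        return None
    if params.replication == "circular":
        return other  # each master replicates from the other DC
    # Unidirectional (assumption): only the master outside the
    # writeable DC has an upstream to lag behind.
    if host_dc in writeable_dcs(params, mw_primary):
        return None
    return other
```

For example, a section pinned to eqiad with unidirectional replication keeps its codfw master read-only and monitored against eqiad even when mw_primary is codfw, which is the misc-section case described above. The section values used here are illustrative only.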

As we discussed on IRC, we'd need to switch those flags as pre-steps of the DC switchover, since for XX days before and after the switchover we enable circular replication on the sX, x1, pcX and esX sections (for the last few switchovers we've left mX aside, but it is only a matter of time before we start switching those too).

Marostegui moved this task from Triage to In progress on the DBA board.

Change 667547 had a related patch set uploaded (by Kormat; owner: Kormat):
[operations/puppet@production] mariadb: Add section parameters

https://gerrit.wikimedia.org/r/667547

Mentioned in SAL (#wikimedia-operations) [2021-03-04T09:30:01Z] <kormat> disabling puppet on all db hosts while deploying a puppet monitoring change T275497

Change 667547 merged by Kormat:
[operations/puppet@production] mariadb: Add section parameters

https://gerrit.wikimedia.org/r/667547

Deployed to s4 without issues. Deploying to s5 now.

Deployment complete.

Ah, that was premature. This still needs to be fixed for the other profiles.

Change 668031 had a related patch set uploaded (by Kormat; owner: Kormat):
[operations/puppet@production] mariadb: Add section parameters: core::multiinstance

https://gerrit.wikimedia.org/r/668031

Change 668444 had a related patch set uploaded (by Kormat; owner: Kormat):
[operations/puppet@production] mariadb: Add section parameters: misc

https://gerrit.wikimedia.org/r/668444

Change 668031 merged by Kormat:
[operations/puppet@production] mariadb: Add section parameters: core::multiinstance

https://gerrit.wikimedia.org/r/668031

Change 668444 merged by Kormat:
[operations/puppet@production] mariadb: Add section parameters: misc

https://gerrit.wikimedia.org/r/668444

Change 668464 had a related patch set uploaded (by Kormat; owner: Kormat):
[operations/puppet@production] mariadb: Use section parameters: misc profiles.

https://gerrit.wikimedia.org/r/668464

Change 669821 had a related patch set uploaded (by Kormat; owner: Kormat):
[operations/puppet@production] mariadb: Set misc nodes in codfw as 'master'

https://gerrit.wikimedia.org/r/669821

Change 669821 merged by Kormat:
[operations/puppet@production] mariadb: Set misc nodes in codfw as 'master'

https://gerrit.wikimedia.org/r/669821

Change 668464 merged by Kormat:
[operations/puppet@production] mariadb: Use section parameters: smaller misc profiles

https://gerrit.wikimedia.org/r/668464

Change 669845 had a related patch set uploaded (by Kormat; owner: Kormat):
[operations/puppet@production] mariadb: Use section params: remaining profiles.

https://gerrit.wikimedia.org/r/669845

Change 669845 merged by Kormat:
[operations/puppet@production] mariadb: Use section params: remaining profiles.

https://gerrit.wikimedia.org/r/669845

This should all be in place now.