
Make sure multi-instance slaves page
Closed, Resolved · Public

Description

We should enable paging for multi-instance slaves, as right now they only alert on IRC.
We had one case of a lagging multi-instance slave which, given how the LB does (not) work (T180918), caused an outage on Wikidata (T198049).

Pages should come if:

  • Number of processes is smaller than the one defined in hiera
  • Replication is broken
  • Replication is lagging

But only under the following conditions:

  • We are in active-passive setup:
    • Core single and multi-instance should page for replication status and lag only on the primary datacenter
    • It should just warn on irc for the passive datacenters
  • We are in active-active setup:
    • It should page for replication status and lag on all active datacenters.
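
The conditions above can be sketched as a small decision function. This is only an illustration in Python, not the Puppet code from the patches on this task; the function name `alert_channels` and the `active_dcs` parameter are assumptions made for the example. It applies the same datacenter rule to every failing check, including the process count, which matches the follow-up later in this task.

```python
from __future__ import annotations


def alert_channels(host_dc: str, active_dcs: set[str]) -> list[str]:
    """Return where a failing check on a host in host_dc should alert.

    Every failing check warns on IRC. It additionally pages only when the
    host lives in a datacenter that is currently active.
    """
    channels = ["irc"]
    if host_dc in active_dcs:
        channels.append("page")
    return channels


# Active-passive: eqiad is the primary datacenter, codfw is passive.
print(alert_channels("eqiad", {"eqiad"}))
print(alert_channels("codfw", {"eqiad"}))

# Active-active: both datacenters are active, so both page.
print(alert_channels("codfw", {"eqiad", "codfw"}))
```

Passing the set of active datacenters as a parameter, rather than hardcoding eqiad, keeps the same function valid for both the active-passive and the active-active setups.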

Event Timeline

Marostegui triaged this task as Normal priority.
Restricted Application added a subscriber: Aklapper. · Jul 27 2018, 10:59 AM

Change 449698 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] multiinstance.pp: Page based on the number of processess

https://gerrit.wikimedia.org/r/449698

Marostegui updated the task description. · Aug 1 2018, 11:32 AM

Change 449698 merged by Marostegui:
[operations/puppet@production] multiinstance.pp: Page based on the number of processess

https://gerrit.wikimedia.org/r/449698

Marostegui updated the task description. · Aug 1 2018, 12:51 PM

Change 449711 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] instance.pp: Page for replication on multiinstance hosts

https://gerrit.wikimedia.org/r/449711

So https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/449711/ would also make replication status and lag page, but I guess it suffers from the same problem and codfw will alert as well?

Yes, a decision should be taken on what pages and what doesn't, and it shouldn't differ between roles, so that multi-instance hosts behave the same as single-instance ones. However, because mw_primary is not supposed to be used, it would also require migrating the existing checks to etcd, plus a decision that accounts for the possibility of active-active DCs in the future.

In fact, to avoid issues with this, the functionality should be shared between both roles, not implemented twice (or many times) per role.

So my opinion for now (active-passive)

  • Single and multi-instance should page for replication status and lag only in eqiad, and alert on IRC only in codfw.

Obviously once we have active-active, the above line doesn't apply and everything (eqiad and codfw) should page for replication status and lag.

jcrespo added a comment. · Edited · Aug 1 2018, 1:32 PM

I am ok with that, go ahead and implement it- I think it was the only reason why I didn't do it (not trivial). Note hardcoding eqiad was a big no from mark when I first set it up.

Marostegui added a comment. · Edited · Aug 1 2018, 1:34 PM

I am ok with that, go ahead and implement it- I think it was the only reason why I didn't do it (not trivial). Note hardcoding eqiad was a big no from mark when I first set it up.

Yeah, it is not trivial at all as mw_primary will be gone (T199124)

jcrespo updated the task description. · Aug 1 2018, 1:38 PM

I have updated the description with your proposal; please correct it if I made a mistake or misunderstood you.

Marostegui added a comment. · Edited · Aug 1 2018, 1:46 PM

Another discussion to have is whether we hold off on paging until the refactor is done, and keep suffering T180918 whenever a multi-instance core slave lags, or whether we page for everything, including codfw, until we have solved this complex problem (which might not be until we are active-active).

Marostegui moved this task from Triage to In progress on the DBA board. · Aug 2 2018, 3:12 PM

Change 459675 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-codfw.php: Depool db2075 and db2084:3315

https://gerrit.wikimedia.org/r/459675

Change 459675 merged by jenkins-bot:
[operations/mediawiki-config@master] db-codfw.php: Depool db2075 and db2084:3315

https://gerrit.wikimedia.org/r/459675

Mentioned in SAL (#wikimedia-operations) [2018-09-11T07:27:23Z] <marostegui> Disable puppet on all the DBs for alert testing - https://phabricator.wikimedia.org/T200509

Change 449711 merged by Marostegui:
[operations/puppet@production] mariadb: Set pages for multi-instance hosts

https://gerrit.wikimedia.org/r/449711

Mentioned in SAL (#wikimedia-operations) [2018-09-11T07:33:19Z] <marostegui> Stop replication on db2084:3315 for alert testing T200509

Mentioned in SAL (#wikimedia-operations) [2018-09-11T07:49:16Z] <marostegui> Stop replication on db2075 for alert testing T200509

Change 459687 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1096:3315, db1100

https://gerrit.wikimedia.org/r/459687

Change 459687 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1096:3315, db1100

https://gerrit.wikimedia.org/r/459687

Mentioned in SAL (#wikimedia-operations) [2018-09-11T08:14:20Z] <marostegui> Stop replication on db1096:3315 for new alert testing (this should generate a page) T200509

Mentioned in SAL (#wikimedia-operations) [2018-09-11T08:27:56Z] <marostegui> Stop replication on db1100 for new alert testing (this should generate a page) T200509

Change 459734 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] multiinstance.pp: Make multi-instance slaves page

https://gerrit.wikimedia.org/r/459734

Change 459734 merged by Marostegui:
[operations/puppet@production] multiinstance.pp: Make multi-instance slaves page

https://gerrit.wikimedia.org/r/459734

Mentioned in SAL (#wikimedia-operations) [2018-09-11T10:27:40Z] <marostegui> db1096:3315 and db1100 were test pages - NO MORE TEST PAGES ARE EXPECTED FROM NOW ON - T200509

Paging for lagging or broken replication has been tested successfully on replicas in both the active and the passive DC:

codfw: passive DC
db2084:3315 only alerted on IRC
db2075 only alerted on IRC

eqiad: active DC
db1096:3315 alerted on IRC and paged
db1100 alerted on IRC and paged

Still pending on this task:
Only active DC hosts should page if the number of mysqld processes is less than the one defined in $host.yaml (right now both active and non-active DC hosts page).
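
The intended behaviour for the process-count check could look roughly like the following Python sketch. It is not the Puppet change itself: `process_alert`, `running_instances`, `expected_instances`, and `active_dcs` are hypothetical names, and `expected_instances` stands in for the per-host value defined in the host's yaml file.

```python
def process_alert(running_instances: int, expected_instances: int,
                  host_dc: str, active_dcs: set) -> str:
    """Classify the result of a mysqld process-count check.

    Returns "ok" when enough processes are running, "page" when processes
    are missing on a host in an active datacenter, and "irc" when they are
    missing on a host in a passive datacenter.
    """
    if running_instances >= expected_instances:
        return "ok"
    return "page" if host_dc in active_dcs else "irc"


# One of two expected instances is down, with eqiad active and codfw passive.
print(process_alert(1, 2, "eqiad", {"eqiad"}))
print(process_alert(1, 2, "codfw", {"eqiad"}))
# All expected instances are running.
print(process_alert(2, 2, "eqiad", {"eqiad"}))
```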

Marostegui updated the task description. · Sep 11 2018, 10:33 AM

Change 459764 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] [WIP]: Get only active replicas to page for mysqld process number

https://gerrit.wikimedia.org/r/459764

Mentioned in SAL (#wikimedia-operations) [2018-09-19T07:12:22Z] <marostegui> Disable puppet on databases to test new alerts - T200509 https://phabricator.wikimedia.org/T172489

Change 461288 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad,db-codfw.php: Depool some hosts

https://gerrit.wikimedia.org/r/461288

Change 461288 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad,db-codfw.php: Depool some hosts

https://gerrit.wikimedia.org/r/461288

Change 459764 merged by Marostegui:
[operations/puppet@production] mariadb: Get only active replicas to page for mysqld process number

https://gerrit.wikimedia.org/r/459764

Mentioned in SAL (#wikimedia-operations) [2018-09-19T07:45:51Z] <marostegui> Stop MySQL on db1096:3316 for alert testing - T200509

Mentioned in SAL (#wikimedia-operations) [2018-09-19T07:53:29Z] <marostegui> Stop MySQL on db1110 for alert testing - T200509

Change 461351 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] core.pp: Fix monitor_process and monitor_disk pages

https://gerrit.wikimedia.org/r/461351

Change 461351 merged by Marostegui:
[operations/puppet@production] core.pp: Fix monitor_process and monitor_disk pages

https://gerrit.wikimedia.org/r/461351

Mentioned in SAL (#wikimedia-operations) [2018-09-19T08:35:20Z] <marostegui> Stop MySQL on db1110 for alert testing - T200509

Marostegui updated the task description. · Sep 19 2018, 9:22 AM
Marostegui closed this task as Resolved.
Marostegui claimed this task.

This is done.