Prepare and indicate proper master db failover candidates for all eqiad database sections (s1-s8, x1)
Closed, ResolvedPublic

Description

After T186320 is closed, each section should have a master and a proper master candidate for failover that uses STATEMENT-based replication, does not replicate ROW to sanitarium, and is physically separated from the original master. T186320 is only a soft blocker; some of the work can be done already.

  • s1
    • master: db1052
    • candidate: db1067
  • s2
    • master: db1054
    • candidate: db1076
  • s3
    • master: db1075
    • candidate: db1078
  • s4
    • master: db1068
    • candidate: db1081
  • s5
    • master: db1070
    • candidate: db1100
  • s6
    • master: db1061
    • candidate: db1093
  • s7
    • master: db1062
    • candidate: db1069 (only pending its movement to another row - T186699)
  • s8
    • master: db1071
    • candidate: db1104
  • x1
    • master: db1055
    • candidate: db1056 (we always use ROW for non-s* hosts, although it would be nice to move it to another row)
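
The three criteria in the description (STATEMENT binlog, not the ROW source for sanitarium, and physical separation from the master, which later comments treat as "a different row") can be read as a simple filter. The Python sketch below is illustrative only: the `Host` structure, its field names, and the sample hosts are assumptions made for the example and are not taken from the real configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Host:
    name: str
    row: str                 # datacenter row letter, e.g. "A" (assumed field)
    binlog_format: str       # "STATEMENT" or "ROW"
    sanitarium_master: bool  # True if it replicates ROW to sanitarium


def is_valid_candidate(master: Host, host: Host) -> bool:
    """Apply the three criteria from the task description."""
    return (
        host.name != master.name
        and host.binlog_format == "STATEMENT"
        and not host.sanitarium_master
        and host.row != master.row
    )


# Hypothetical hosts, named generically on purpose.
master = Host("master-host", "A", "STATEMENT", False)
replicas = [
    Host("replica-1", "A", "STATEMENT", False),  # same row as the master
    Host("replica-2", "C", "ROW", True),         # sanitarium master, ROW
    Host("replica-3", "B", "STATEMENT", False),  # meets all criteria
]

candidates = [h.name for h in replicas if is_valid_candidate(master, h)]
print(candidates)
```

Hardware age and past failures, which also come up in the comments, are deliberately left out of this filter.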

Event Timeline

Change 407633 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db1100: Swtiching it to STATEMENT

https://gerrit.wikimedia.org/r/407633

Change 407635 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Clarifying that db1100 status

https://gerrit.wikimedia.org/r/407635

Change 407635 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Clarifying that db1100 status

https://gerrit.wikimedia.org/r/407635

Change 407633 merged by Marostegui:
[operations/puppet@production] db1100: Switch it to STATEMENT

https://gerrit.wikimedia.org/r/407633

Marostegui triaged this task as Medium priority.Feb 2 2018, 3:45 PM
Marostegui updated the task description. (Show Details)
Marostegui moved this task from Pending comment to In progress on the DBA board.

Mentioned in SAL (#wikimedia-operations) [2018-02-02T15:49:44Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Depool db1100 - T186321 (duration: 00m 55s)

Mentioned in SAL (#wikimedia-operations) [2018-02-02T15:50:03Z] <marostegui> Restart MySQL on db1100 - T186321

Mentioned in SAL (#wikimedia-operations) [2018-02-02T16:02:00Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Slowly repool db1100 - T186321 (duration: 00m 54s)

For s2:

'db1054' => 0,   # A3 2.8TB  96GB, master
'db1053' => 0,   # A2 2.8TB  96GB, vslow, dump
'db1060' => 1,   # C2 2.8TB  96GB, api #master for db1102 (sanitarium 3)
'db1074' => 300, # A2 3.6TB 512GB, api
'db1076' => 500, # B1 3.6TB 512GB
'db1090' => 500, # C3 3.6TB 512GB
'db1103:3312' => 1,  # A3 3.6TB 512GB # rc, log: s2 and s4
'db1105:3312' => 1,   # C3 3.6TB 512GB # rc, log: s1 and s2

My suggestion would be db1074 (which would need to be moved to another row), as db1053 and db1060 will go away in Q4.
Thoughts?

Actually, it doesn't matter whether it is db1074 or db1076, so maybe db1076, which saves us from having to move db1074 to another row.
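
The row argument can be checked against the listing above. The sketch below assumes that the leading letter of each rack position (A3, A2, B1, ...) is the row. The rack values are copied from the listing, and `row_of` is a hypothetical helper.

```python
# Rack positions copied from the listing above.
# Assumption: the leading letter of the rack position is the row.
racks = {
    "db1054": "A3",  # current master
    "db1074": "A2",
    "db1076": "B1",
    "db1090": "C3",
}


def row_of(rack: str) -> str:
    """Return the row letter of a rack position such as 'A3'."""
    return rack[0]


master_row = row_of(racks["db1054"])

# True means the host shares a row with the master and would have to be
# moved before it could serve as a physically separated candidate.
needs_move = {
    host: row_of(rack) == master_row
    for host, rack in racks.items()
    if host != "db1054"
}
print(needs_move)
```

Under that assumption, db1074 shares row A with the master, while db1076 is already in row B.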

This works ok for now, but we may need a proper longer-term strategy for the others: db1061-db1073 will be the only non-500GB hosts, which we should use for masters, but many could be used for misc(??). Having large servers as masters could be a waste, but also a win in performance. Then there is the possibility of introducing more multi-instance hosts in the future for low-usage sections. I do not yet have a clear plan or philosophy for longer than a few months.

I would start moving those towards misc and start using new servers for masters even if they are powerful. It can be a waste, but the non-500GB hosts are now almost 4-5 years old, so at some point they will start hitting BBU issues (at the very least), as we have been seeing with hosts older than those.
So inevitably, we will end up with powerful hosts as masters (under warranty as well).

Change 408238 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1078

https://gerrit.wikimedia.org/r/408238

Change 408239 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db1078: Change it to statement, update socket

https://gerrit.wikimedia.org/r/408239

Change 408238 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1078

https://gerrit.wikimedia.org/r/408238

Mentioned in SAL (#wikimedia-operations) [2018-02-05T08:30:34Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Depool db1078 - T186321 (duration: 00m 55s)

Mentioned in SAL (#wikimedia-operations) [2018-02-05T08:44:51Z] <marostegui> Stop MySQL on db1078, upgrade mariadb, kernel and socket location - T186321

Change 408239 merged by Marostegui:
[operations/puppet@production] db1078: Change it to statement, update socket

https://gerrit.wikimedia.org/r/408239

Mentioned in SAL (#wikimedia-operations) [2018-02-05T09:28:37Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Repool db1078 with low traffic - T186321 (duration: 00m 53s)

Mentioned in SAL (#wikimedia-operations) [2018-02-05T09:38:19Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Increase db1078 traffic - T186321 (duration: 00m 55s)

For s6 I would suggest db1094. It is a powerful host, but the only non-powerful host available is db1063, which already had a hard failure (T180714), so I don't feel comfortable with it being a master again.
db1094 is already in a different row.
Thoughts?

For s7 it should probably be db1069: it is the only non-powerful host that doesn't need to be decommissioned in the next batch, it is currently serving vslow, and it has old master data.
It needs to be moved to a different row though.

Change 408257 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: db1078 is the candidate master

https://gerrit.wikimedia.org/r/408257

Change 408257 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: db1078 is the candidate master

https://gerrit.wikimedia.org/r/408257

Change 408774 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1069

https://gerrit.wikimedia.org/r/408774

Change 408775 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db1069: Switch its binlog to STATEMENT

https://gerrit.wikimedia.org/r/408775

Change 408774 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1069

https://gerrit.wikimedia.org/r/408774

Mentioned in SAL (#wikimedia-operations) [2018-02-07T10:58:38Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Depool db1069 - T186321 (duration: 01m 09s)

Change 408775 merged by Marostegui:
[operations/puppet@production] db1069: Switch its binlog to STATEMENT

https://gerrit.wikimedia.org/r/408775

Mentioned in SAL (#wikimedia-operations) [2018-02-07T11:04:52Z] <marostegui> Stop MySQL on db1069 for MySQL upgrade, kernel upgrade and change binlog format to statement - T186321

Mentioned in SAL (#wikimedia-operations) [2018-02-07T11:54:23Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Repool db1069 - T186321 (duration: 01m 11s)

For s1 probably the right candidate is db1067: not a 512G host, old master, different row.

Change 413153 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: db1067 is now candidate master in s1

https://gerrit.wikimedia.org/r/413153

Change 413153 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: db1067 is now candidate master in s1

https://gerrit.wikimedia.org/r/413153

Mentioned in SAL (#wikimedia-operations) [2018-02-21T12:35:34Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Clarify that db1067 is now s1 candidate master - T186321 (duration: 01m 13s)

For s8 I propose db1104: there are only large servers there, and this one is in a different row.

Change 413322 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db11104

https://gerrit.wikimedia.org/r/413322

Change 413323 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db1104: Switch binlog to STATEMENT

https://gerrit.wikimedia.org/r/413323

Change 413322 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db11104

https://gerrit.wikimedia.org/r/413322

Mentioned in SAL (#wikimedia-operations) [2018-02-22T09:20:03Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Depool db1104 - T186321 (duration: 01m 13s)

Mentioned in SAL (#wikimedia-operations) [2018-02-22T09:20:17Z] <marostegui> Stop MySQL on db1104 to switch its binlog to statement - T186321

Change 413323 merged by Marostegui:
[operations/puppet@production] db1104: Switch binlog to STATEMENT

https://gerrit.wikimedia.org/r/413323

Mentioned in SAL (#wikimedia-operations) [2018-02-22T09:31:26Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Slowly repool db1104 - T186321 (duration: 01m 12s)

For s2 I suggest db1076.
The only non-big server is db1060, which will go away soon (T186320) and is the sanitarium master (running ROW).
The rest of the servers are large ones, and db1076 is in a different row than the current master.

Mentioned in SAL (#wikimedia-operations) [2018-02-22T09:47:42Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Increase traffic for db1104 - T186321 (duration: 01m 12s)

Mentioned in SAL (#wikimedia-operations) [2018-02-22T10:22:30Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Increase traffic for db1104 - T186321 (duration: 01m 14s)

Change 413677 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db1076: Change the binlog to STATEMENT

https://gerrit.wikimedia.org/r/413677

Change 413677 merged by Marostegui:
[operations/puppet@production] db1076: Change the binlog to STATEMENT

https://gerrit.wikimedia.org/r/413677

Change 413703 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1076

https://gerrit.wikimedia.org/r/413703

Change 413703 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1076

https://gerrit.wikimedia.org/r/413703

Mentioned in SAL (#wikimedia-operations) [2018-02-23T11:28:59Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Depool db1076 for binlog format change - T186321 (duration: 01m 08s)

Mentioned in SAL (#wikimedia-operations) [2018-02-23T11:29:17Z] <marostegui> Restart mariadb on db1076 for binlog format change - T186321

jcrespo renamed this task from Prepare and indicate proper master db failover candidates for all database sections (s1-s8, x1) to Prepare and indicate proper master db failover candidates for all eqiad database sections (s1-s8, x1).Feb 23 2018, 11:33 AM

This is for eqiad only. We can consider in the future, with much lower priority, whether we want to do the same for codfw; it would be much easier there because there is no labs replication.

Yeah - agreed. codfw can wait (but should definitely be done at some point)

Mentioned in SAL (#wikimedia-operations) [2018-02-23T11:38:57Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Slowly repool db1076 - T186321 (duration: 01m 13s)

Mentioned in SAL (#wikimedia-operations) [2018-02-23T12:09:14Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Increase traffic for db1076 - T186321 (duration: 01m 12s)

For s4 my suggestion is db1081.
Reasons: the only non-large server is db1064, which is the sanitarium master, so that is not an option.
The rest are large servers, and db1081 is in a different row.

Change 415018 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db1081.yaml: Change binlog to statement based

https://gerrit.wikimedia.org/r/415018

Change 415018 merged by Marostegui:
[operations/puppet@production] db1081.yaml: Change binlog to statement based

https://gerrit.wikimedia.org/r/415018

Change 415020 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1081

https://gerrit.wikimedia.org/r/415020

Change 415020 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1081

https://gerrit.wikimedia.org/r/415020

Mentioned in SAL (#wikimedia-operations) [2018-02-27T15:37:12Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Depool db1081 - T186321 (duration: 00m 55s)

Mentioned in SAL (#wikimedia-operations) [2018-02-27T15:37:37Z] <marostegui> Stop MySQL and reboot db1081 for kernel ugprade, mariadb upgrade and binlog format change - T186321

Change 415023 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Slowly repool db1081

https://gerrit.wikimedia.org/r/415023

Change 415023 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Slowly repool db1081

https://gerrit.wikimedia.org/r/415023

Mentioned in SAL (#wikimedia-operations) [2018-02-27T15:47:54Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Slowly repool db1081 - T186321 (duration: 00m 56s)

For s6 probably db1063 is the only host which is not a large server.
However, I wouldn't like to place db1063 as a candidate master due to its past HW issues: storage crash (T180714#3767308) and thermal issues (T164107)

So, if we discard db1063, all the pending servers are large, so I would choose db1093 (different row than the current master)

Change 415585 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1093

https://gerrit.wikimedia.org/r/415585

Change 415586 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db1093: Change binlog format

https://gerrit.wikimedia.org/r/415586

Change 415586 merged by Marostegui:
[operations/puppet@production] db1093: Change binlog format

https://gerrit.wikimedia.org/r/415586

Change 415585 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1093

https://gerrit.wikimedia.org/r/415585

Mentioned in SAL (#wikimedia-operations) [2018-03-01T16:35:57Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Depool db1093 - T186321 (duration: 01m 13s)

Mentioned in SAL (#wikimedia-operations) [2018-03-01T16:36:16Z] <marostegui> Restart mariadb on db1093 for binlog format change - T186321

Marostegui claimed this task.