
Add external store templates to SwitchMaster tool
Closed, Resolved (Public)

Description

The master switchovers have been tremendously simplified since Amir created https://switchmaster.toolforge.org/. Right now it supports sX and x1.
We need to include external store sections (only those which are writable - currently es6 and es7).
The template needs to be adapted, as it is slightly different from the sX one.

Depending on the active DC (as we do with sX) we need to provide two different templates, the one for the primary DC (where writes need to be disabled) and the one for the secondary DC, where writes don't need any changes.
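
A minimal sketch of that selection, assuming the tool knows which DC is currently active. The function name, the `active_dc` parameter and the template identifiers are hypothetical and are not taken from SwitchMaster's code:

```python
def choose_es_template(section_dc: str, active_dc: str) -> str:
    """Return which external store template to render for a switchover.

    Hypothetical helper: names and return values are illustrative only.
    """
    # Writes only have to be disabled when the switchover happens in the
    # datacenter that is currently active (the primary DC).
    if section_dc == active_dc:
        return "es_primary_dc"
    # In the secondary DC, writes do not need any changes.
    return "es_secondary_dc"


if __name__ == "__main__":
    print(choose_es_template("eqiad", "eqiad"))
    print(choose_es_template("codfw", "eqiad"))
```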

This would be a template for the primary DC (where writes HAVE to be disabled):

**When:** Anytime, writes will be disabled

**Prerequisites:** https://wikitech.wikimedia.org/wiki/MariaDB/Primary_switchover

**Checklist:**

NEW primary: es1020
OLD primary: es1021

[] Check configuration differences between new and old primary:
```
sudo pt-config-diff --defaults-file /root/.my.cnf h=es1021.eqiad.wmnet h=es1020.eqiad.wmnet
```

[] Disable writes in es4 by merging: https://gerrit.wikimedia.org/r/922376
[] Check es4 is indeed read-only

**Failover prep:**
[] Silence alerts on all hosts:
```
sudo cookbook sre.hosts.downtime --hours 1 -r "Primary switchover es4 T337283" 'A:db-section-es4'
```
[] Set NEW primary with weight 0 (and depool it from API or vslow/dump groups if it is present).
```
sudo dbctl instance es1020 set-weight 0
sudo dbctl config commit -m "Set es1020 with weight 0 T337283"
```
[] Topology changes, move all replicas under NEW primary
```
sudo db-switchover --timeout=25 --only-slave-move es1021 es1020
```
[] Disable puppet on both nodes
```
sudo cumin 'es1021* or es1020*' 'disable-puppet "primary switchover T337283"'
```
[] Merge gerrit puppet change to promote NEW primary: https://gerrit.wikimedia.org/r/c/operations/puppet/+/922453

**Failover:**
[] Log the failover:
```
!log Starting es4 eqiad failover from es1021 to es1020 - T337283
```
[] Switch primaries:
```
sudo db-switchover --skip-slave-move es1021 es1020
echo "===== es1021 (OLD)"; sudo db-mysql es1021 -e 'show slave status\G'
echo "===== es1020 (NEW)"; sudo db-mysql es1020 -e 'show slave status\G'
```

[] Promote NEW primary in dbctl, and remove read-only
```
sudo dbctl --scope eqiad section es4 set-master es1020
sudo dbctl config commit -m "Promote es1020 to es4 primary T337283"
```

[] Restart puppet on both hosts:
```
sudo cumin 'es1021* or es1020*' 'run-puppet-agent -e "primary switchover T337283"'
```

**Clean up tasks:**
[] Clean up heartbeat table(s).
```
sudo db-mysql es1020 heartbeat -e "delete from heartbeat where file like 'es1021%';"
```
[] Change events for query killer:
```
events_coredb_master.sql on the new primary es1020
events_coredb_slave.sql on the new slave es1021
```
[] Update DNS: https://gerrit.wikimedia.org/r/922455
[] Update candidate primary dbctl and orchestrator notes
```
sudo dbctl instance es1021 set-candidate-master --section es4 true
sudo dbctl instance es1020 set-candidate-master --section es4 false
(dborch1001): sudo orchestrator-client -c untag -i es1020 --tag name=candidate
(dborch1001): sudo orchestrator-client -c tag -i es1021 --tag name=candidate
```
[] Check zarcillo was updated
** db-switchover should do this. If it fails, do it manually: https://phabricator.wikimedia.org/P13956
```
sudo db-mysql db1215 zarcillo -e "select * from masters where section = 'es4';"
```
[] (If needed): Depool es1021 for maintenance.
```
sudo dbctl instance es1021 depool
sudo dbctl config commit -m "Depool es1021 T337283"
```
[] Change es1021 weight to mimic the previous weight of es1020:
```
sudo dbctl instance es1021 edit
```

[] Enable writes in es4 by merging the revert patch: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/922387

[] Update/resolve this ticket.

Some considerations for the above template:

  • Puppet patches should be generated automatically like we do with sX
  • DNS patches should be generated automatically like we do with sX
  • The MediaWiki config patch that disables writes doesn't need to be generated in this iteration; the template can just reference an example patch such as: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/922376
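
As an illustration of how the host-, section- and task-specific parts of the template above could be filled in, here is a small sketch using Python's `string.Template`. The command text is copied from the failover prep steps of the template; the placeholder names (`$section`, `$new`, `$task`) and the `render_prep_commands` helper are assumptions made for this example and do not describe how SwitchMaster actually renders its templates:

```python
from string import Template

# Command text copied from the "Failover prep" steps of the template;
# the placeholder names are hypothetical.
PREP_COMMANDS = Template(
    "sudo cookbook sre.hosts.downtime --hours 1 "
    "-r \"Primary switchover $section $task\" 'A:db-section-$section'\n"
    "sudo dbctl instance $new set-weight 0\n"
    "sudo dbctl config commit -m \"Set $new with weight 0 $task\""
)


def render_prep_commands(section: str, new: str, task: str) -> str:
    """Fill the section, new primary and task ID into the prep commands."""
    return PREP_COMMANDS.substitute(section=section, new=new, task=task)


if __name__ == "__main__":
    print(render_prep_commands("es4", "es1020", "T337283"))
```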

This would be a template for the secondary DC (where writes do not need to be touched):

**When:** Anytime - not in use

**Prerequisites:** https://wikitech.wikimedia.org/wiki/MariaDB/Primary_switchover

**Checklist:**

NEW primary: es2020
OLD primary: es2021

[] Check configuration differences between new and old primary:
```
sudo pt-config-diff --defaults-file /root/.my.cnf h=es2021.codfw.wmnet h=es2020.codfw.wmnet
```

**Failover prep:**
[] Silence alerts on all hosts:
```
sudo cookbook sre.hosts.downtime --hours 1 -r "Primary switchover es4 T337203" 'A:db-section-es4'
```
[] Set NEW primary with weight 0 (and depool it from API or vslow/dump groups if it is present).
```
sudo dbctl instance es2020 set-weight 0
sudo dbctl config commit -m "Set es2020 with weight 0 T337203"
```
[] Topology changes, move all replicas under NEW primary
```
sudo db-switchover --read-only-master --replicating-master --timeout=25 --only-slave-move es2021 es2020
```
[] Disable puppet on both nodes
```
sudo cumin 'es2021* or es2020*' 'disable-puppet "primary switchover T337203"'
```
[] Merge gerrit puppet change to promote NEW primary: https://gerrit.wikimedia.org/r/c/operations/puppet/+/921772

**Failover:**
[] Log the failover:
```
!log Starting es4 codfw failover from es2021 to es2020 - T337203
```
[] Switch primaries:
```
sudo db-switchover --read-only-master --replicating-master --skip-slave-move es2021 es2020
echo "===== es2021 (OLD)"; sudo db-mysql es2021 -e 'show slave status\G'
echo "===== es2020 (NEW)"; sudo db-mysql es2020 -e 'show slave status\G'
```

[] Promote NEW primary in dbctl, and remove read-only
```
sudo dbctl --scope codfw section es4 set-master es2020
sudo dbctl --scope codfw section es4 rw
sudo dbctl config commit -m "Promote es2020 to es4 primary and set section read-write T337203"
```

[] Restart puppet on both hosts:
```
sudo cumin 'es2021* or es2020*' 'run-puppet-agent -e "primary switchover T337203"'
```

**Clean up tasks:**
[] Clean up heartbeat table(s).
```
sudo db-mysql es2020 heartbeat -e "delete from heartbeat where file like 'es2021%';"
```
[] Change events for query killer:
```
events_coredb_master.sql on the new primary es2020
events_coredb_slave.sql on the new slave es2021
```
[] Update candidate primary dbctl and orchestrator notes
```
sudo dbctl instance es2021 set-candidate-master --section es4 true
sudo dbctl instance es2020 set-candidate-master --section es4 false
(dborch1001): sudo orchestrator-client -c untag -i es2020 --tag name=candidate
(dborch1001): sudo orchestrator-client -c tag -i es2021 --tag name=candidate
```
[] Check zarcillo was updated
** db-switchover should do this. If it fails, do it manually: https://phabricator.wikimedia.org/P13956
```
sudo db-mysql db1215 zarcillo -e "select * from masters where section = 'es4';"
```
[] (If needed): Depool es2021 for maintenance.
```
sudo dbctl instance es2021 depool
sudo dbctl config commit -m "Depool es2021 T337203"
```
[] Change es2021 weight to mimic the previous weight of es2020:
```
sudo dbctl instance es2021 edit
```
[] Update/resolve this ticket.

Some considerations for the above template:

  • Puppet patches should be generated automatically like we do with sX
  • DNS patches aren't needed for secondary DC switches.
  • MediaWiki references aren't needed for secondary DC switches.
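
The two templates differ mainly in the `db-switchover` flags and in which patches they reference. The sketch below captures just those differences; the flags are the ones that appear in the two templates, while the function names and the `primary_dc` parameter are hypothetical and only meant to illustrate the idea:

```python
from __future__ import annotations


def db_switchover_commands(old: str, new: str, primary_dc: bool) -> list[str]:
    """Build the replica-move and switch commands for one switchover.

    Hypothetical helper; the flags mirror the two templates.
    """
    # In the secondary DC the old primary is read-only and replicating,
    # so the secondary template passes two extra flags.
    extra = [] if primary_dc else ["--read-only-master", "--replicating-master"]
    move = ["sudo", "db-switchover", *extra, "--timeout=25", "--only-slave-move", old, new]
    switch = ["sudo", "db-switchover", *extra, "--skip-slave-move", old, new]
    return [" ".join(move), " ".join(switch)]


def patches_referenced(primary_dc: bool) -> list[str]:
    """List which kinds of patches the template has to reference."""
    if primary_dc:
        # Puppet and DNS are generated; MediaWiki config is an example link.
        return ["puppet", "dns", "mediawiki-config"]
    return ["puppet"]


if __name__ == "__main__":
    for command in db_switchover_commands("es2021", "es2020", primary_dc=False):
        print(command)
    print(patches_referenced(primary_dc=False))
```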

This task is only for writable sections (as of today es6 and es7), as RO sections (es1, es2, es3, es4, es5) only need a dbctl command (which is really a NOOP, so there is no need to cover them).

For now the Switchover Master menu should only allow the user to use es6 and es7. If any other esX is used, an error should be displayed.
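
A minimal sketch of that check, assuming the section name arrives as a plain string. The constant, the function name and the error message are hypothetical and do not reflect the tool's actual code:

```python
# Writable external store sections as of this task (from the description).
WRITABLE_ES_SECTIONS = {"es6", "es7"}


def validate_section(section: str) -> str:
    """Reject external store sections other than the writable ones.

    Hypothetical helper: non-es sections (sX, x1) are passed through untouched.
    """
    if section.startswith("es") and section not in WRITABLE_ES_SECTIONS:
        allowed = ", ".join(sorted(WRITABLE_ES_SECTIONS))
        raise ValueError(
            f"{section} is not supported; only {allowed} can be switched over"
        )
    return section


if __name__ == "__main__":
    print(validate_section("es6"))
    try:
        validate_section("es4")
    except ValueError as error:
        print(f"Error: {error}")
```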

Event Timeline

Marostegui triaged this task as Medium priority. May 16 2024, 8:27 AM
Marostegui moved this task from Triage to Ready on the DBA board.
Marostegui updated the task description.

So far looks okay but we need to set candidate masters in puppet as comments. None has it.

Change #1038925 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] es6,es7: Add candidate masters

https://gerrit.wikimedia.org/r/1038925

Change #1038925 merged by Marostegui:

[operations/puppet@production] es6,es7: Add candidate masters

https://gerrit.wikimedia.org/r/1038925

> So far looks okay but we need to set candidate masters in puppet as comments. None has it.

Fixed :)

Change #1039185 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/dns@master] wmnet: Add CNAMEs for es6 and es7

https://gerrit.wikimedia.org/r/1039185

Change #1039185 merged by Marostegui:

[operations/dns@master] wmnet: Add CNAMEs for es6 and es7

https://gerrit.wikimedia.org/r/1039185

Deployed. I made another test one to check: T366682: Switchover es7 master (es1035 -> es1039)

If that looks okay to you, we can close this ticket @Marostegui

Thank you Amir - this is great to have! Thanks for working on it!