
Pre DC switchover eqiad -> codfw DB work
Closed, Resolved (Public)

Description

Date of the switchover: 28th June 2021. Before we switch over, we have to make sure that:

  • Replication codfw -> eqiad is running
    • s1
    • s2
    • s3
    • s4
    • s5
    • s6
    • s7
    • s8
    • x1
    • x2
    • es4
    • es5
    • pc1
    • pc2
    • pc3
  • Check and disable GTID on codfw masters for the above sections. T284897#7174224
  • Check that all codfw slaves have GTID enabled T284897#7172371
  • Check which notifications are disabled for codfw hosts
  • Check event scheduler is enabled on codfw hosts
  • Check that query killers are installed and enabled on codfw hosts
  • Update section_params in hieradata/common/profile/mariadb.yaml https://gerrit.wikimedia.org/r/c/operations/puppet/+/701335
  • Review DB MW weights
    • s1
    • s2
    • s3
    • s4
    • s5
    • s6
    • s7
    • s8
    • x1
    • x2
    • es4
    • es5
    • pc1
    • pc2
    • pc3

Event Timeline

Marostegui renamed this task from Pre DC switchover DB work to Pre DC switchover eqiad -> eqiad DB work. Jun 14 2021, 8:10 AM
Marostegui triaged this task as Medium priority.
Marostegui moved this task from Triage to Blocked on the DBA board.

GTID checked across codfw and eqiad.
Only missing db2095:3312 (sanitarium host). I have enabled it there.

Marostegui renamed this task from Pre DC switchover eqiad -> eqiad DB work to Pre DC switchover eqiad -> codfw DB work. Jun 23 2021, 1:29 PM
Marostegui updated the task description.

GTID disabled on es% and s%

db2112
                   Using_Gtid: No
db2107
                   Using_Gtid: No
db2105
                    Using_Gtid: No
db2090
                   Using_Gtid: No
db2123
                    Using_Gtid: No
db2129
                    Using_Gtid: No
db2118
                   Using_Gtid: No
db2079
                   Using_Gtid: No
es2021
                    Using_Gtid: No
es2023
                    Using_Gtid: No

codfw -> eqiad replication has been enabled everywhere:

# for i in `mysql.py -BN  -hdb1115 -A zarcillo -e "select instance from masters where section like 'es%' OR section like  's%'"`; do echo $i; mysql.py -h$i -e "show slave status\G" | egrep "Master_Host|Seconds|Using" ; done
es2021
                   Master_Host: es1021.eqiad.wmnet
         Seconds_Behind_Master: 0
                    Using_Gtid: No
es1021
                   Master_Host: es2021.codfw.wmnet
         Seconds_Behind_Master: 0
                    Using_Gtid: No
es2023
                   Master_Host: es1024.eqiad.wmnet
         Seconds_Behind_Master: 0
                    Using_Gtid: No
es1024
                   Master_Host: es2023.codfw.wmnet
         Seconds_Behind_Master: 0
                    Using_Gtid: No
db2112
                  Master_Host: db1163.eqiad.wmnet
        Seconds_Behind_Master: 0
                   Using_Gtid: No
db1163
                  Master_Host: db2112.codfw.wmnet
        Seconds_Behind_Master: 0
                   Using_Gtid: No
db2107
                  Master_Host: db1122.eqiad.wmnet
        Seconds_Behind_Master: 0
                   Using_Gtid: No
db1122
                  Master_Host: db2107.codfw.wmnet
        Seconds_Behind_Master: 0
                   Using_Gtid: No
db2105
                   Master_Host: db1157.eqiad.wmnet
         Seconds_Behind_Master: 0
                    Using_Gtid: No
db1157
                   Master_Host: db2105.codfw.wmnet
         Seconds_Behind_Master: 0
                    Using_Gtid: No
db2090
                  Master_Host: db1138.eqiad.wmnet
        Seconds_Behind_Master: 0
                   Using_Gtid: No
db1138
                  Master_Host: db2090.codfw.wmnet
        Seconds_Behind_Master: 0
                   Using_Gtid: No
db2123
                   Master_Host: db1130.eqiad.wmnet
         Seconds_Behind_Master: 0
                    Using_Gtid: No
db1130
                   Master_Host: db2123.codfw.wmnet
         Seconds_Behind_Master: 0
                    Using_Gtid: No
db2129
                   Master_Host: db1173.eqiad.wmnet
         Seconds_Behind_Master: 0
                    Using_Gtid: No
db1173
                   Master_Host: db2129.codfw.wmnet
         Seconds_Behind_Master: 0
                    Using_Gtid: No
db2118
                  Master_Host: db1136.eqiad.wmnet
        Seconds_Behind_Master: 0
                   Using_Gtid: No
db1136
                  Master_Host: db2118.codfw.wmnet
        Seconds_Behind_Master: 0
                   Using_Gtid: No
db2079
                  Master_Host: db1104.eqiad.wmnet
        Seconds_Behind_Master: 0
                   Using_Gtid: No
db1104
                  Master_Host: db2079.codfw.wmnet
        Seconds_Behind_Master: 0
                   Using_Gtid: No
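
The output above can also be checked mechanically rather than by eye. The following is a minimal sketch, assuming the dump keeps the shape shown here (a bare hostname line followed by indented `Key: value` lines); the function names `parse_status_dump` and `find_problems` are hypothetical and not part of any existing tooling. It flags any master whose peer does not replicate back, any non-zero lag, and any master that still has GTID enabled. It is meant for the master list only, since replicas are expected to keep GTID enabled.

```python
def parse_status_dump(text):
    """Parse bare hostname lines followed by indented 'Key: value' lines."""
    hosts = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if ":" in line:
            if current is None:
                continue  # field line before any hostname, ignore it
            key, _, value = line.partition(":")
            hosts[current][key.strip()] = value.strip()
        else:
            current = line
            hosts[current] = {}
    return hosts


def find_problems(hosts):
    """Return human-readable problems; an empty list means all checks passed."""
    problems = []
    for name, fields in hosts.items():
        # Master_Host is a FQDN in the dump; compare on the short hostname.
        master = fields.get("Master_Host", "").split(".")[0]
        if master not in hosts:
            problems.append(f"{name}: master {master or '?'} not in dump")
        elif hosts[master].get("Master_Host", "").split(".")[0] != name:
            problems.append(f"{name}: {master} does not replicate back")
        if fields.get("Seconds_Behind_Master") != "0":
            problems.append(f"{name}: lag is {fields.get('Seconds_Behind_Master')}")
        if fields.get("Using_Gtid") != "No":
            problems.append(f"{name}: Using_Gtid is {fields.get('Using_Gtid')}")
    return problems


# Sample copied from the s4 pair in the output above.
SAMPLE = """\
db2090
                  Master_Host: db1138.eqiad.wmnet
        Seconds_Behind_Master: 0
                   Using_Gtid: No
db1138
                  Master_Host: db2090.codfw.wmnet
        Seconds_Behind_Master: 0
                   Using_Gtid: No
"""

if __name__ == "__main__":
    print(find_problems(parse_status_dump(SAMPLE)) or "no problems found")
```

Feeding it the full output of the loop should return an empty list when every pair replicates in both directions.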

While doing this, I realised that the zarcillo database had the wrong entry for the s4 eqiad master:

root@db1115.eqiad.wmnet[zarcillo]> select * from masters where section like 's4';
+---------+-------+----------+
| section | dc    | instance |
+---------+-------+----------+
| s4      | codfw | db2090   |
| s4      | eqiad | db1173   |
+---------+-------+----------+
2 rows in set (0.001 sec)

The s4 eqiad master is db1138; db1173 is the s6 master, not s4's. I have fixed this.

root@db1115.eqiad.wmnet[zarcillo]> update masters set instance='db1138' where section='s4' and dc='eqiad' limit 1;
Query OK, 1 row affected (0.001 sec)
Rows matched: 1  Changed: 1  Warnings: 0

root@db1115.eqiad.wmnet[zarcillo]> select * from masters where section like 's4';
+---------+-------+----------+
| section | dc    | instance |
+---------+-------+----------+
| s4      | codfw | db2090   |
| s4      | eqiad | db1138   |
+---------+-------+----------+
2 rows in set (0.001 sec)
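
A wrong entry like this one can be spotted with a query for instances registered as master of more than one section. Below is a rough sketch using an in-memory SQLite table as a stand-in for zarcillo. It assumes `masters` has only the three columns shown in the output above, and the s6 rows are inferred from the replication output and the comment above, not read from zarcillo. `build_sample` and `find_shared_masters` are hypothetical helper names.

```python
import sqlite3

# Any instance listed as master for more than one row is suspicious.
SHARED_MASTERS_SQL = """
    SELECT instance, COUNT(*) AS n
    FROM masters
    GROUP BY instance
    HAVING COUNT(*) > 1
    ORDER BY instance
"""


def build_sample():
    """Build a stand-in masters table containing the wrong s4 entry."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE masters (section TEXT, dc TEXT, instance TEXT)")
    conn.executemany(
        "INSERT INTO masters VALUES (?, ?, ?)",
        [
            ("s4", "codfw", "db2090"),
            ("s4", "eqiad", "db1173"),  # wrong: db1173 is the s6 master
            ("s6", "codfw", "db2129"),  # inferred from the replication output
            ("s6", "eqiad", "db1173"),  # inferred from the comment above
        ],
    )
    return conn


def find_shared_masters(conn):
    """Return (instance, count) rows for instances listed more than once."""
    return conn.execute(SHARED_MASTERS_SQL).fetchall()


if __name__ == "__main__":
    conn = build_sample()
    print("before fix:", find_shared_masters(conn))
    # Same correction as the UPDATE above, without LIMIT (stock SQLite rejects it).
    conn.execute(
        "UPDATE masters SET instance='db1138' WHERE section='s4' AND dc='eqiad'"
    )
    print("after fix:", find_shared_masters(conn))
```

The same `GROUP BY ... HAVING` query should work unchanged on MariaDB, although that has not been run against zarcillo here.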

Mentioned in SAL (#wikimedia-operations) [2021-06-24T07:26:57Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Adjust s8 weights T284897', diff saved to https://phabricator.wikimedia.org/P16710 and previous config saved to /var/cache/conftool/dbconfig/20210624-072657-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2021-06-24T07:42:00Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Adjust s7 weights T284897', diff saved to https://phabricator.wikimedia.org/P16711 and previous config saved to /var/cache/conftool/dbconfig/20210624-074200-marostegui.json

Change 701335 had a related patch set uploaded (by Kormat; author: Kormat):

[operations/puppet@production] mariadb: Set most sections to bidi replication.

https://gerrit.wikimedia.org/r/701335

Mentioned in SAL (#wikimedia-operations) [2021-06-24T07:56:14Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Adjust s6 weights T284897', diff saved to https://phabricator.wikimedia.org/P16712 and previous config saved to /var/cache/conftool/dbconfig/20210624-075613-marostegui.json

Change 701335 merged by Kormat:

[operations/puppet@production] mariadb: Set most sections to bidi replication.

https://gerrit.wikimedia.org/r/701335

Mentioned in SAL (#wikimedia-operations) [2021-06-24T08:08:19Z] <kormat@cumin1001> START - Cookbook sre.hosts.downtime for 0:45:00 on 216 hosts with reason: Change replication monitoring config T284897

Mentioned in SAL (#wikimedia-operations) [2021-06-24T08:09:37Z] <kormat@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:45:00 on 216 hosts with reason: Change replication monitoring config T284897

Mentioned in SAL (#wikimedia-operations) [2021-06-24T08:09:46Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Remove db1130 from s5 api T284897', diff saved to https://phabricator.wikimedia.org/P16713 and previous config saved to /var/cache/conftool/dbconfig/20210624-080945-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2021-06-24T08:11:38Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Adjust s5 weights T284897', diff saved to https://phabricator.wikimedia.org/P16714 and previous config saved to /var/cache/conftool/dbconfig/20210624-081137-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2021-06-24T08:12:52Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Adjust s5 weights T284897', diff saved to https://phabricator.wikimedia.org/P16715 and previous config saved to /var/cache/conftool/dbconfig/20210624-081251-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2021-06-24T08:14:10Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Adjust s5 weights T284897', diff saved to https://phabricator.wikimedia.org/P16716 and previous config saved to /var/cache/conftool/dbconfig/20210624-081409-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2021-06-24T08:41:47Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Adjust s3 weights T284897', diff saved to https://phabricator.wikimedia.org/P16717 and previous config saved to /var/cache/conftool/dbconfig/20210624-084147-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2021-06-24T09:17:54Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Adjust s2 weights T284897', diff saved to https://phabricator.wikimedia.org/P16718 and previous config saved to /var/cache/conftool/dbconfig/20210624-091753-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2021-06-24T09:19:49Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Adjust s1 weights T284897', diff saved to https://phabricator.wikimedia.org/P16719 and previous config saved to /var/cache/conftool/dbconfig/20210624-091949-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2021-06-24T09:20:29Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Adjust s1 weights T284897', diff saved to https://phabricator.wikimedia.org/P16720 and previous config saved to /var/cache/conftool/dbconfig/20210624-092029-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2021-06-24T09:21:05Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Adjust s1 weights T284897', diff saved to https://phabricator.wikimedia.org/P16721 and previous config saved to /var/cache/conftool/dbconfig/20210624-092105-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2021-06-24T09:21:57Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Adjust s1 weights T284897', diff saved to https://phabricator.wikimedia.org/P16722 and previous config saved to /var/cache/conftool/dbconfig/20210624-092157-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2021-06-24T09:22:27Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Adjust s1 weights T284897', diff saved to https://phabricator.wikimedia.org/P16723 and previous config saved to /var/cache/conftool/dbconfig/20210624-092226-marostegui.json

Marostegui claimed this task.
Marostegui updated the task description.

This is all done.
Weights might need to be adjusted ad hoc once we start getting live traffic.

Change 701891 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] dbbackups: Temporarily change backup schedules to fit better dc switch

https://gerrit.wikimedia.org/r/701891

Change 701891 merged by Jcrespo:

[operations/puppet@production] dbbackups: Temporarily change backup schedules to fit better dc switch

https://gerrit.wikimedia.org/r/701891

Change 701948 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] dbbackups: Temporarily disable s4 snapshots to prevent conflict with dumps

https://gerrit.wikimedia.org/r/701948

Change 701948 merged by Jcrespo:

[operations/puppet@production] dbbackups: Temporarily disable s4 snapshots to prevent conflict with dumps

https://gerrit.wikimedia.org/r/701948