Pre DC switchover codfw -> eqiad DB work
Closed, Resolved (Public)

Description

MW switch back date: Tue 14th Sept 2021

  • Replication eqiad -> codfw is running
  • Check and disable GTID on eqiad masters for the above sections.
  • Check that all eqiad slaves have GTID enabled
  • Check which notifications are disabled for eqiad hosts
  • Check event scheduler is enabled on eqiad hosts
  • Check that query killers are installed and enabled on eqiad hosts
  • Update section_params in hieradata/common/profile/mariadb.yaml: https://gerrit.wikimedia.org/r/c/operations/puppet/+/719168
  • Review DB MW weights
    • s1
    • s2
    • s3
    • s4
    • s5
    • s6
    • s7
    • s8
    • x1
    • x2
    • es4
    • es5
    • pc1
    • pc2
    • pc3
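
The GTID items in the checklist above can be expressed as a small rule. The following is a minimal sketch only, not the tooling used for this task: the function names, the role labels and the sample host list are hypothetical, and it assumes the Using_Gtid value of each host has already been collected (MariaDB reports it as No, Slave_Pos or Current_Pos).

```python
from typing import Iterable, List, Tuple

# MariaDB reports Using_Gtid as "No", "Slave_Pos" or "Current_Pos".
ENABLED_GTID_MODES = {"Slave_Pos", "Current_Pos"}


def gtid_ok(role: str, using_gtid: str) -> bool:
    """Return True if Using_Gtid matches the checklist expectation for the role.

    Assumption: eqiad masters should have GTID disabled ("No") and
    eqiad replicas should have it enabled, as the checklist states.
    """
    if role == "master":
        return using_gtid == "No"
    if role == "replica":
        return using_gtid in ENABLED_GTID_MODES
    raise ValueError(f"unknown role: {role!r}")


def hosts_failing_gtid_check(hosts: Iterable[Tuple[str, str, str]]) -> List[str]:
    """Return the names of hosts whose Using_Gtid value violates the rule."""
    return [name for name, role, using_gtid in hosts if not gtid_ok(role, using_gtid)]


# Illustrative input only; the roles and values here are not taken from real output.
sample_hosts = [
    ("db1173", "master", "No"),
    ("db1125", "replica", "No"),
]
failing = hosts_failing_gtid_check(sample_hosts)
```

With the illustrative input, only the replica that reports No ends up in the failing list.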

Event Timeline

Marostegui moved this task from Triage to Blocked on the DBA board.

x1, x2, es4 and es5 never had their bidirectional replication disconnected. I have double-checked, so I am marking that as done.

Reminder for s6 replication filters for codfw master: Replicate_Wild_Ignore_Table: labswiki.%

So that would be:
change master to master host etc etc etc;
SET GLOBAL replicate_wild_ignore_table='labswiki.%';
show slave status\G -> check that Replicate_Wild_Ignore_Table shows labswiki.%
start slave;
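
The "check" step above matters because starting replication without the filter would try to apply labswiki changes on codfw. A minimal sketch of that gate follows, assuming the status has already been fetched into a dict keyed by field name. The helper names are hypothetical, and the CHANGE MASTER statement is deliberately left to the caller because its arguments are not given here.

```python
from typing import Dict, List

S6_FILTER = "labswiki.%"


def statements_before_check(change_master_sql: str) -> List[str]:
    """Return the statements to run before inspecting the slave status.

    change_master_sql is supplied by the caller; this sketch does not
    guess its arguments.
    """
    return [
        change_master_sql,
        f"SET GLOBAL replicate_wild_ignore_table='{S6_FILTER}'",
    ]


def safe_to_start_slave(status: Dict[str, str]) -> bool:
    """Return True only if the ignore filter for labswiki is in place."""
    return status.get("Replicate_Wild_Ignore_Table", "") == S6_FILTER


def final_statement(status: Dict[str, str]) -> str:
    """Return START SLAVE if the filter check passes, otherwise raise."""
    if not safe_to_start_slave(status):
        raise RuntimeError("replicate_wild_ignore_table is not set to labswiki.%")
    return "START SLAVE"
```

The point of splitting it this way is that the filter is verified from the server's own status output rather than assumed from the SET GLOBAL having been typed.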

Marostegui updated the task description.

Mentioned in SAL (#wikimedia-operations) [2021-09-07T12:27:09Z] <marostegui@cumin1001> dbctl commit (dc=all): 'fix s1 weights T288594', diff saved to https://phabricator.wikimedia.org/P17246 and previous config saved to /var/cache/conftool/dbconfig/20210907-122708-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2021-09-07T12:27:48Z] <marostegui@cumin1001> dbctl commit (dc=all): 'fix s1 weights T288594', diff saved to https://phabricator.wikimedia.org/P17247 and previous config saved to /var/cache/conftool/dbconfig/20210907-122747-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2021-09-07T13:02:44Z] <marostegui@cumin1001> dbctl commit (dc=all): 'fix s8 weights T288594', diff saved to https://phabricator.wikimedia.org/P17248 and previous config saved to /var/cache/conftool/dbconfig/20210907-130244-marostegui.json

test-s4 db1125 didn't have GTID enabled. Even though it is a test host, I have enabled it.

Change 719270 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/software@master] check_flags_per_dc.sh: One liner to check a few things

https://gerrit.wikimedia.org/r/719270

Mentioned in SAL (#wikimedia-operations) [2021-09-07T14:17:33Z] <marostegui> No more db maintenance on eqiad T288594

Change 719270 merged by Marostegui:

[operations/software@master] check_flags_per_dc.sh: One liner to check a few things

https://gerrit.wikimedia.org/r/719270

Change 719397 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db2094,db2095: Enable notifications

https://gerrit.wikimedia.org/r/719397

Change 719397 merged by Marostegui:

[operations/puppet@production] db2094,db2095: Enable notifications

https://gerrit.wikimedia.org/r/719397

Replication has been re-enabled eqiad -> codfw everywhere (GTID is not yet disabled on the eqiad masters). Please keep in mind that no maintenance, especially schema changes, should be deployed until we have done the DC switch. Anything done in eqiad will now be replicated to codfw (e.g. ALTERs, OPTIMIZEs...).

s6 has a special configuration in order to replicate labswiki from m5. Details as follows:

Replication from eqiad to codfw has a special replication filter to exclude labswiki (wikitech) as that database isn't present on codfw yet (see the wikitech move doc for more detail):

root@cumin1001:~# mysql.py -hdb2129 -e "show slave status\G"
*************************** 1. row ***************************
                Slave_IO_State: Waiting for master to send event
                   Master_Host: db1173.eqiad.wmnet
                   Master_User: repl
                   Master_Port: 3306
                 Connect_Retry: 60
               Master_Log_File: db1173-bin.000537
           Read_Master_Log_Pos: 984172754
                Relay_Log_File: db2129-relay-bin.000002
                 Relay_Log_Pos: 237998
         Relay_Master_Log_File: db1173-bin.000537
              Slave_IO_Running: Yes
             Slave_SQL_Running: Yes
               Replicate_Do_DB:
           Replicate_Ignore_DB:
            Replicate_Do_Table:
        Replicate_Ignore_Table:
       Replicate_Wild_Do_Table:
   Replicate_Wild_Ignore_Table: labswiki.% <<<<<<<<<<<<<<<<<<< replication filter
                    Last_Errno: 0
                    Last_Error:
                  Skip_Counter: 0
           Exec_Master_Log_Pos: 984172754
               Relay_Log_Space: 238308
               Until_Condition: None
                Until_Log_File:
                 Until_Log_Pos: 0
            Master_SSL_Allowed: Yes
            Master_SSL_CA_File:
            Master_SSL_CA_Path:
               Master_SSL_Cert:
             Master_SSL_Cipher:
                Master_SSL_Key:
         Seconds_Behind_Master: 0
 Master_SSL_Verify_Server_Cert: No
                 Last_IO_Errno: 0
                 Last_IO_Error:
                Last_SQL_Errno: 0
                Last_SQL_Error:
   Replicate_Ignore_Server_Ids:
              Master_Server_Id: 171978805
                Master_SSL_Crl:
            Master_SSL_Crlpath:
                    Using_Gtid: No
                   Gtid_IO_Pos:
       Replicate_Do_Domain_Ids:
   Replicate_Ignore_Domain_Ids:
                 Parallel_Mode: conservative
                     SQL_Delay: 0
           SQL_Remaining_Delay: NULL
       Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it
              Slave_DDL_Groups: 17
Slave_Non_Transactional_Groups: 0
    Slave_Transactional_Groups: 218116336
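
Output like the above is easier to check in bulk once it is turned into key/value pairs. The following is a minimal sketch of such a parser, not the check_flags_per_dc.sh script mentioned earlier. The function name is hypothetical and the sample is a shortened copy of the db2129 output.

```python
from typing import Dict


def parse_vertical_status(text: str) -> Dict[str, str]:
    # Parses one row of vertical (\G) SHOW SLAVE STATUS output into a
    # field -> value dict. Only the first colon on each line is treated
    # as the separator, so values that contain colons stay intact.
    status: Dict[str, str] = {}
    for line in text.splitlines():
        if line.lstrip().startswith("*") or ":" not in line:
            continue  # skip the "1. row" separator and blank lines
        key, _, value = line.partition(":")
        status[key.strip()] = value.strip()
    return status


SAMPLE = """\
*************************** 1. row ***************************
                   Master_Host: db1173.eqiad.wmnet
              Slave_IO_Running: Yes
             Slave_SQL_Running: Yes
   Replicate_Wild_Ignore_Table: labswiki.%
         Seconds_Behind_Master: 0
                    Using_Gtid: No
"""

status = parse_vertical_status(SAMPLE)
replicating = (
    status["Slave_IO_Running"] == "Yes" and status["Slave_SQL_Running"] == "Yes"
)
```

Any hand-added annotation on a line (such as a marker pointing at the filter) would end up in the parsed value, so it should be stripped before comparing.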

db1173 (s6 eqiad master) has multi-source configured, one thread to replicate from m5 (only wikitech) and another one for s6:

root@cumin1001:~# mysql.py -hdb1173 -e "show slave status\G"
root@cumin1001:~#


root@cumin1001:~# mysql.py -hdb1173 -e "show slave 'm5' status\G"
*************************** 1. row ***************************
                Slave_IO_State: Waiting for master to send event
                   Master_Host: db1128.eqiad.wmnet
                   Master_User: repl
                   Master_Port: 3306
                 Connect_Retry: 60
               Master_Log_File: db1128-bin.000249
           Read_Master_Log_Pos: 807795637
                Relay_Log_File: db1173-relay-bin-m5.000010
                 Relay_Log_Pos: 807795937
         Relay_Master_Log_File: db1128-bin.000249
              Slave_IO_Running: Yes
             Slave_SQL_Running: Yes
               Replicate_Do_DB:
           Replicate_Ignore_DB:
            Replicate_Do_Table:
        Replicate_Ignore_Table:
       Replicate_Wild_Do_Table: labswiki.% <<<<<<<<<<<<<<<<<<< replication filter
   Replicate_Wild_Ignore_Table:
                    Last_Errno: 0
                    Last_Error:
                  Skip_Counter: 0
           Exec_Master_Log_Pos: 807795637
               Relay_Log_Space: 807796298
               Until_Condition: None
                Until_Log_File:
                 Until_Log_Pos: 0
            Master_SSL_Allowed: Yes
            Master_SSL_CA_File:
            Master_SSL_CA_Path:
               Master_SSL_Cert:
             Master_SSL_Cipher:
                Master_SSL_Key:
         Seconds_Behind_Master: 0
 Master_SSL_Verify_Server_Cert: No
                 Last_IO_Errno: 0
                 Last_IO_Error:
                Last_SQL_Errno: 0
                Last_SQL_Error:
   Replicate_Ignore_Server_Ids:
              Master_Server_Id: 171966562
                Master_SSL_Crl:
            Master_SSL_Crlpath:
                    Using_Gtid: No
                   Gtid_IO_Pos:
       Replicate_Do_Domain_Ids:
   Replicate_Ignore_Domain_Ids:
                 Parallel_Mode: conservative
                     SQL_Delay: 0
           SQL_Remaining_Delay: NULL
       Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it
              Slave_DDL_Groups: 36
Slave_Non_Transactional_Groups: 0
    Slave_Transactional_Groups: 4989012


root@cumin1001:~# mysql.py -hdb1173 -e "show slave 's6' status\G"
*************************** 1. row ***************************
                Slave_IO_State: Waiting for master to send event
                   Master_Host: db2129.codfw.wmnet
                   Master_User: repl
                   Master_Port: 3306
                 Connect_Retry: 60
               Master_Log_File: db2129-bin.001905
           Read_Master_Log_Pos: 597378818
                Relay_Log_File: db1173-relay-bin-s6.000298
                 Relay_Log_Pos: 597096264
         Relay_Master_Log_File: db2129-bin.001905
              Slave_IO_Running: Yes
             Slave_SQL_Running: Yes
               Replicate_Do_DB:
           Replicate_Ignore_DB:
            Replicate_Do_Table:
        Replicate_Ignore_Table:
       Replicate_Wild_Do_Table:
   Replicate_Wild_Ignore_Table:
                    Last_Errno: 0
                    Last_Error:
                  Skip_Counter: 0
           Exec_Master_Log_Pos: 597378818
               Relay_Log_Space: 597096625
               Until_Condition: None
                Until_Log_File:
                 Until_Log_Pos: 0
            Master_SSL_Allowed: Yes
            Master_SSL_CA_File:
            Master_SSL_CA_Path:
               Master_SSL_Cert:
             Master_SSL_Cipher:
                Master_SSL_Key:
         Seconds_Behind_Master: 0
 Master_SSL_Verify_Server_Cert: No
                 Last_IO_Errno: 0
                 Last_IO_Error:
                Last_SQL_Errno: 0
                Last_SQL_Error:
   Replicate_Ignore_Server_Ids:
              Master_Server_Id: 180367475
                Master_SSL_Crl:
            Master_SSL_Crlpath:
                    Using_Gtid: No
                   Gtid_IO_Pos:
       Replicate_Do_Domain_Ids:
   Replicate_Ignore_Domain_Ids:
                 Parallel_Mode: conservative
                     SQL_Delay: 0
           SQL_Remaining_Delay: NULL
       Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it
              Slave_DDL_Groups: 6
Slave_Non_Transactional_Groups: 0
    Slave_Transactional_Groups: 198878355
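
The two connections on db1173 are expected to carry different filters: m5 replicates only labswiki, while s6 has no filter. A minimal sketch of that comparison follows, assuming each connection's status has already been parsed into a dict. The names EXPECTED_FILTERS and filter_mismatches are hypothetical.

```python
from typing import Dict, List

# Expected filters per named connection on db1173, as shown in the output above.
EXPECTED_FILTERS: Dict[str, Dict[str, str]] = {
    "m5": {"Replicate_Wild_Do_Table": "labswiki.%", "Replicate_Wild_Ignore_Table": ""},
    "s6": {"Replicate_Wild_Do_Table": "", "Replicate_Wild_Ignore_Table": ""},
}


def filter_mismatches(statuses: Dict[str, Dict[str, str]]) -> List[str]:
    """Return human-readable problems; an empty list means all filters match."""
    problems: List[str] = []
    for conn, expected in EXPECTED_FILTERS.items():
        actual = statuses.get(conn)
        if actual is None:
            problems.append(f"{conn}: connection missing")
            continue
        for field, want in expected.items():
            got = actual.get(field, "")
            if got != want:
                problems.append(f"{conn}: {field} is {got!r}, expected {want!r}")
    return problems


observed = {
    "m5": {"Replicate_Wild_Do_Table": "labswiki.%", "Replicate_Wild_Ignore_Table": ""},
    "s6": {"Replicate_Wild_Do_Table": "", "Replicate_Wild_Ignore_Table": ""},
}
problems = filter_mismatches(observed)
```

A missing connection is reported explicitly, since a plain "show slave status" on a multi-source host returns nothing when every connection is named.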

@Kormat if you think we are fully ready to merge: https://gerrit.wikimedia.org/r/c/operations/puppet/+/719168 please do so if you have time today.

Deployed, all looks good.

Thank you!

All the pre work is now done. Before closing this task I am going to run a few compares per section to make sure everything is ok, data-wise.

Marostegui claimed this task.

Initial runs look good. As part of the warmups I will do more compares so closing this.