
Disconnect codfw -> eqiad replication
Closed, Resolved · Public

Description

After the successful DC switch back from codfw to eqiad (T243318), we need to disconnect the codfw -> eqiad replication thread on the eqiad master of each of the following sections:

  • s1 db1083
  • s2 db1122
  • s3 db1123
  • s4 db1081
  • s5 db1100
  • s6 db1131
  • s7 db1086
  • s8 db1104
  • x1 db1103
  • pc1 pc1007
  • pc2 pc1008
  • pc3 pc1009
  • es4 es1021
  • es5 es1024
  • Enable GTID on all the sections' codfw masters
    • s1 db2112
    • s2 db2107
    • s3 db2105
    • s4 db2090
    • s5 db2123
    • s6 db2129
    • s7 db2118
    • s8 db2079
    • x1 db2096
    • es4 es2021
    • es5 es2023
    • pc2007
    • pc2008
    • pc2009
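
The task does not record the exact statements that were run. As a minimal sketch, assuming standard MariaDB syntax, the two steps above could look like the following. The function names `disconnect_sql` and `enable_gtid_sql` are hypothetical, and the helpers only print SQL rather than executing it. The codfw masters are assumed to keep replicating from eqiad, so only their GTID mode changes.

```shell
# Hypothetical helpers: they only print SQL and never connect to a database.

# Statements that remove a replication link entirely
# (intended for the eqiad master of a section).
disconnect_sql() {
  printf '%s\n' \
    'STOP SLAVE;' \
    'RESET SLAVE ALL;'
}

# Statements that switch an existing replica connection to GTID
# (intended for the codfw master of a section).
enable_gtid_sql() {
  printf '%s\n' \
    'STOP SLAVE;' \
    'CHANGE MASTER TO MASTER_USE_GTID = slave_pos;' \
    'START SLAVE;'
}

# Print both statement sets for review before running anything.
disconnect_sql
enable_gtid_sql
```

Printing the statements first leaves room to review them per host before piping them into a `mysql` client on the intended master.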

Event Timeline

Restricted Application added a subscriber: Aklapper. Wed, Oct 28, 1:00 PM
Marostegui triaged this task as Medium priority. Wed, Oct 28, 1:01 PM
Marostegui moved this task from Triage to Blocked on the DBA board.

Not before Thursday 29th Oct 2020

Mentioned in SAL (#wikimedia-operations) [2020-10-29T05:58:27Z] <marostegui> Disconnect replication codfw -> eqiad on pc1, pc2 and pc3 T266663

Marostegui updated the task description. Thu, Oct 29, 6:00 AM

Mentioned in SAL (#wikimedia-operations) [2020-10-29T06:07:24Z] <marostegui> Disconnect replication codfw -> eqiad on x1 T266663

Marostegui updated the task description. Thu, Oct 29, 6:07 AM

Mentioned in SAL (#wikimedia-operations) [2020-10-29T06:10:53Z] <marostegui> Disconnect replication codfw -> eqiad on es4 and es5 T266663

Marostegui updated the task description. Thu, Oct 29, 6:11 AM
Marostegui moved this task from Blocked to In progress on the DBA board.

Mentioned in SAL (#wikimedia-operations) [2020-10-29T06:23:25Z] <marostegui> Disconnect replication codfw -> eqiad on s5 T266663

Marostegui updated the task description. Thu, Oct 29, 6:23 AM
Marostegui updated the task description.

Mentioned in SAL (#wikimedia-operations) [2020-10-29T06:36:08Z] <marostegui> Disconnect replication codfw -> eqiad on s6 T266663

Marostegui updated the task description. Thu, Oct 29, 6:36 AM

Mentioned in SAL (#wikimedia-operations) [2020-10-29T06:38:17Z] <marostegui> Disconnect replication codfw -> eqiad on s7 T266663

Marostegui updated the task description. Thu, Oct 29, 6:38 AM
Marostegui updated the task description. Thu, Oct 29, 6:45 AM
Marostegui updated the task description. Thu, Oct 29, 6:52 AM

Mentioned in SAL (#wikimedia-operations) [2020-10-29T06:52:34Z] <marostegui> Disconnect replication codfw -> eqiad on s2 T266663

Mentioned in SAL (#wikimedia-operations) [2020-10-29T07:46:11Z] <marostegui> Disconnect replication codfw -> eqiad on s3 T266663

Marostegui updated the task description. Thu, Oct 29, 7:46 AM
Marostegui updated the task description. Thu, Oct 29, 7:49 AM

Mentioned in SAL (#wikimedia-operations) [2020-10-29T07:54:39Z] <marostegui> Disconnect replication codfw -> eqiad on s4 T266663

Marostegui updated the task description. Thu, Oct 29, 7:54 AM

Mentioned in SAL (#wikimedia-operations) [2020-10-29T08:02:51Z] <marostegui> Disconnect replication codfw -> eqiad on s1 T266663

Marostegui updated the task description. Thu, Oct 29, 8:02 AM
Marostegui updated the task description. Edited Thu, Oct 29, 8:05 AM

This is all done. Before closing the task, I am doing a quick data check on some tables across all wikis to make sure nothing has drifted.

Progress

  • s1
  • s2
  • s3
  • s4
  • s5
  • s6
  • s7
  • s8
  • x1
  • es4
  • es5
Marostegui updated the task description. Thu, Oct 29, 8:06 AM

GTID enabled everywhere on codfw masters:

sudo cumin "P{P:mariadb::mysql_role%role = master and *.codfw.wmnet}" 'mysql -e "show slave status\G" | grep Using'
15 hosts will be targeted:
db[2079,2090,2096,2105,2107,2112,2118,2123,2129].codfw.wmnet,es[2021,2023].codfw.wmnet,pc[2007-2010].codfw.wmnet
Confirm to continue [y/n]? y
===== NODE GROUP =====
(7) db2096.codfw.wmnet,es[2021,2023].codfw.wmnet,pc[2007-2010].codfw.wmnet
----- OUTPUT of 'mysql -e "show s...\G" | grep Using' -----
                    Using_Gtid: Slave_Pos
===== NODE GROUP =====
(8) db[2079,2090,2105,2107,2112,2118,2123,2129].codfw.wmnet
----- OUTPUT of 'mysql -e "show s...\G" | grep Using' -----
                   Using_Gtid: Slave_Pos
================
PASS |██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (15/15) [00:01<00:00, 10.00hosts/s]
FAIL |                                                                                                                                                                                                                               |   0% (0/15) [00:01<?, ?hosts/s]
100.0% (15/15) success ratio (>= 100.0% threshold) for command: 'mysql -e "show s...\G" | grep Using'.
100.0% (15/15) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
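
The `grep Using` check above relies on reading the output by eye. A minimal sketch of a stricter check, assuming it is fed one host's `show slave status\G` output on stdin, is shown below. The function name `gtid_enabled` is hypothetical, and `Slave_Pos` is taken as the only accepted value because that is what the output above reports.

```shell
# Hypothetical helper: exit 0 only if every Using_Gtid line on stdin
# reads Slave_Pos, and at least one such line is present.
gtid_enabled() {
  awk -F': *' '
    /Using_Gtid:/ { found = 1; if ($2 != "Slave_Pos") bad = 1 }
    END {
      if (found && !bad) exit 0
      exit 1
    }'
}

# Example with a single line shaped like the output above.
if printf '                    Using_Gtid: Slave_Pos\n' | gtid_enabled; then
  echo "GTID enabled"
else
  echo "GTID not enabled"
fi
```

Failing when no `Using_Gtid` line is found means a host with no replication configured is not silently counted as passing.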

Replication is also disabled on all eqiad masters. Only pc1010 returns any output, and it can be ignored because it replicates from pc1007 within eqiad:

sudo cumin "P{P:mariadb::mysql_role%role = master and *.eqiad.wmnet}" 'mysql -e "show slave status\G"'
20 hosts will be targeted:
db[1080-1081,1083,1086,1100,1103-1104,1107,1115,1122-1123,1128,1131-1132].eqiad.wmnet,es[1021,1024].eqiad.wmnet,pc[1007-1010].eqiad.wmnet
Confirm to continue [y/n]? y
===== NODE GROUP =====
(1) pc1010.eqiad.wmnet
----- OUTPUT of 'mysql -e "show slave status\G"' -----
*************************** 1. row ***************************
                Slave_IO_State: Waiting for master to send event
                   Master_Host: pc1007.eqiad.wmnet
                   Master_User: repl
                   Master_Port: 3306
                 Connect_Retry: 60
               Master_Log_File: pc1007-bin.137067
           Read_Master_Log_Pos: 131496335
                Relay_Log_File: pc1010-relay-bin.037718
                 Relay_Log_Pos: 131496635
         Relay_Master_Log_File: pc1007-bin.137067
              Slave_IO_Running: Yes
             Slave_SQL_Running: Yes
               Replicate_Do_DB:
           Replicate_Ignore_DB:
            Replicate_Do_Table:
        Replicate_Ignore_Table:
       Replicate_Wild_Do_Table:
   Replicate_Wild_Ignore_Table:
                    Last_Errno: 0
                    Last_Error:
                  Skip_Counter: 0
           Exec_Master_Log_Pos: 131496335
               Relay_Log_Space: 131497586
               Until_Condition: None
                Until_Log_File:
                 Until_Log_Pos: 0
            Master_SSL_Allowed: Yes
            Master_SSL_CA_File:
            Master_SSL_CA_Path:
               Master_SSL_Cert:
             Master_SSL_Cipher:
                Master_SSL_Key:
         Seconds_Behind_Master: 0
 Master_SSL_Verify_Server_Cert: No
                 Last_IO_Errno: 0
                 Last_IO_Error:
                Last_SQL_Errno: 0
                Last_SQL_Error:
   Replicate_Ignore_Server_Ids:
              Master_Server_Id: 171966644
                Master_SSL_Crl:
            Master_SSL_Crlpath:
                    Using_Gtid: Slave_Pos
                   Gtid_IO_Pos: 0-171966644-48816503458
       Replicate_Do_Domain_Ids:
   Replicate_Ignore_Domain_Ids:
                 Parallel_Mode: conservative
                     SQL_Delay: 0
           SQL_Remaining_Delay: NULL
       Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it
              Slave_DDL_Groups: 2
Slave_Non_Transactional_Groups: 0
    Slave_Transactional_Groups: 1862570640
================
PASS |██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (20/20) [00:00<00:00, 31.77hosts/s]
FAIL |                                                                                                                                                                                                                               |   0% (0/20) [00:00<?, ?hosts/s]
100.0% (20/20) success ratio (>= 100.0% threshold) for command: 'mysql -e "show slave status\G"'.
100.0% (20/20) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
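
The pc1010 case above is harmless because its master is in the same datacenter. A minimal sketch of that distinction, assuming one host's `show slave status\G` output on stdin and the `<host>.<dc>.wmnet` naming seen in the output, could be written as follows. The function name `replicates_cross_dc` is hypothetical.

```shell
# Hypothetical helper: exit 0 if the status on stdin names a master
# outside the datacenter given as $1, and exit 1 otherwise
# (including when no replication is configured at all).
replicates_cross_dc() {
  awk -v dc="$1" -F': *' '
    /Master_Host:/ { host = $2 }
    END {
      if (host == "") exit 1              # empty status: nothing configured
      suffix = "." dc ".wmnet"
      if (substr(host, length(host) - length(suffix) + 1) == suffix) exit 1
      exit 0                              # master lives in another datacenter
    }'
}

# Example with the pc1010 line from the output above.
if printf '                   Master_Host: pc1007.eqiad.wmnet\n' | replicates_cross_dc eqiad; then
  echo "still replicating from another datacenter"
else
  echo "no cross-datacenter replication"
fi
```

Run per host, a check like this would flag any eqiad master still pointing at codfw while ignoring local links such as pc1010 -> pc1007.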
Marostegui closed this task as Resolved. Fri, Oct 30, 5:55 AM

The data check came back clean. Resolving!