
Enable DB replication codfw -> eqiad before the switchover and some other checks
Closed, Resolved, Public

Description

The date of the switchover is still to be defined, but before we switch over we have to make sure that:

  • Replication codfw -> eqiad is running
    • s1
    • s2
    • s3
    • s4
    • s5
    • s6
    • s7
    • s8
    • x1
    • es4
    • es5
    • pc1
    • pc2
    • pc3
  • Check and disable GTID on codfw masters for the above sections.
  • Check that all codfw slaves have GTID enabled
  • Check which notifications are disabled for codfw hosts
  • Check event scheduler is enabled on codfw hosts
  • Check that query killers are installed and enabled on codfw hosts
  • Review DB MW weights
    • s1
    • s2
    • s3
    • s4
    • s5
    • s6
    • s7
    • s8
    • x1
    • es4
    • es5
    • pc1
    • pc2
    • pc3

Event Timeline

Restricted Application added a subscriber: Aklapper. Jan 22 2020, 7:41 AM
Marostegui renamed this task from Enable replication codfw -> eqiad before the switchover to Enable DB replication codfw -> eqiad before the switchover. Jan 22 2020, 7:41 AM
Marostegui triaged this task as Medium priority.
Marostegui moved this task from Triage to Backlog on the DBA board.

@jcrespo @Kormat thoughts on codfw -> eqiad replication on misc sections? Initially I didn't include them as we are not switching those services, but maybe for consistency we should

Mentioned in SAL (#wikimedia-operations) [2020-08-27T07:35:06Z] <marostegui> Move pc2010 under pc2007 T243373

Mentioned in SAL (#wikimedia-operations) [2020-08-27T08:13:22Z] <marostegui> Enable replication codfw -> eqiad on pc1 T243373

Mentioned in SAL (#wikimedia-operations) [2020-08-27T08:44:32Z] <kormat> enabling replication from pc2008 to pc1008 (pc2) T243373

Mentioned in SAL (#wikimedia-operations) [2020-08-27T08:53:40Z] <kormat> enabling replication from pc2009 to pc1009 (pc3) T243373

Mentioned in SAL (#wikimedia-operations) [2020-08-27T09:07:08Z] <kormat> enabling replication from db2090 to db1081 (s4) T243373

Mentioned in SAL (#wikimedia-operations) [2020-08-27T09:15:07Z] <kormat> enabling replication from db2079 to db1109 (s8) T243373

Mentioned in SAL (#wikimedia-operations) [2020-08-27T09:20:54Z] <kormat> enabling replication from db2105 to db1123 (s3) T243373

Mentioned in SAL (#wikimedia-operations) [2020-08-27T10:23:57Z] <kormat> enabling replication from es2021 to es1021 (es4) T243373

Mentioned in SAL (#wikimedia-operations) [2020-08-27T10:30:38Z] <kormat> enabling replication from es2023 to es1024 (es5) T243373

Kormat updated the task description.

Mentioned in SAL (#wikimedia-operations) [2020-08-27T10:51:18Z] <kormat> enabling replication from db2123 to db1100 (s5) T243373

I am re-doing a compare on all those sections where cross replication is enabled to be double sure (and to keep warming up tables), and so far so good.

Checked all codfw hosts with notifications disabled.
Only found db2135 (m5)

Mentioned in SAL (#wikimedia-operations) [2020-08-27T11:22:13Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Adjust db2126 weight T243373', diff saved to https://phabricator.wikimedia.org/P12394 and previous config saved to /var/cache/conftool/dbconfig/20200827-112213-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2020-08-27T11:45:10Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Adjust s1 codfw weights T243373', diff saved to https://phabricator.wikimedia.org/P12395 and previous config saved to /var/cache/conftool/dbconfig/20200827-114509-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2020-08-27T11:51:11Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Adjust s4 codfw weights T243373', diff saved to https://phabricator.wikimedia.org/P12396 and previous config saved to /var/cache/conftool/dbconfig/20200827-115110-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2020-08-27T11:59:35Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Adjust s5 eqiad weights T243373', diff saved to https://phabricator.wikimedia.org/P12397 and previous config saved to /var/cache/conftool/dbconfig/20200827-115934-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2020-08-27T12:02:11Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Adjust s5 codfw weights T243373', diff saved to https://phabricator.wikimedia.org/P12398 and previous config saved to /var/cache/conftool/dbconfig/20200827-120211-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2020-08-27T12:08:16Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Adjust s6 weights T243373', diff saved to https://phabricator.wikimedia.org/P12399 and previous config saved to /var/cache/conftool/dbconfig/20200827-120816-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2020-08-27T12:14:54Z] <kormat> enabling replication from db2129 to db1093 (s6) T243373

Mentioned in SAL (#wikimedia-operations) [2020-08-27T12:24:10Z] <marostegui> Fix password format for in db2129 (s6 codfw master) T243373

Mentioned in SAL (#wikimedia-operations) [2020-08-27T12:30:03Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Adjust s7 weights T243373', diff saved to https://phabricator.wikimedia.org/P12400 and previous config saved to /var/cache/conftool/dbconfig/20200827-123003-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2020-08-27T12:30:28Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Adjust s7 weights T243373', diff saved to https://phabricator.wikimedia.org/P12401 and previous config saved to /var/cache/conftool/dbconfig/20200827-123028-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2020-08-27T12:43:39Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Adjust s8 weights T243373', diff saved to https://phabricator.wikimedia.org/P12402 and previous config saved to /var/cache/conftool/dbconfig/20200827-124338-marostegui.json

Weights are tackled, but I will give them another review tomorrow morning.

Mentioned in SAL (#wikimedia-operations) [2020-08-27T13:01:31Z] <kormat> enabling replication from db2118 to db1086 (s7) T243373

Mentioned in SAL (#wikimedia-operations) [2020-08-27T13:14:02Z] <kormat> enabling replication from db2096 to db1103 (x1) T243373

Mentioned in SAL (#wikimedia-operations) [2020-08-27T13:18:44Z] <kormat> enabling replication from db2107 to db1122 (s2) T243373

@jcrespo @Kormat thoughts on codfw -> eqiad replication on misc sections? Initially I didn't include them as we are not switching those services, but maybe for consistency we should

profile::mariadb::replication_lag assumes the 'master' is in mediawiki::state('primary_dc'). This will not be true once the dc switch happens.

(This is independent of codfw->eqiad replication.)

I have gone through the logs from the past switchover, and we did indeed leave the mX hosts out when re-enabling replication codfw -> eqiad, so let's not enable it for them.

Mentioned in SAL (#wikimedia-operations) [2020-08-27T13:50:34Z] <kormat> disabling GTID on db2107 (s2) T243373

Mentioned in SAL (#wikimedia-operations) [2020-08-27T13:51:24Z] <kormat> disabling GTID on db2105 (s3) T243373

Mentioned in SAL (#wikimedia-operations) [2020-08-27T13:52:05Z] <kormat> disabling GTID on db2090 (s4) T243373

Mentioned in SAL (#wikimedia-operations) [2020-08-27T13:52:46Z] <kormat> disabling GTID on db2123 (s5) T243373

Mentioned in SAL (#wikimedia-operations) [2020-08-27T13:53:53Z] <kormat> disabling GTID on db2129 (s6), db2118 (s7), db2079 (s8) T243373

Mentioned in SAL (#wikimedia-operations) [2020-08-27T13:56:31Z] <kormat> disabling GTID on db2096 (x1), es2021 (es4), es2023 (es5) T243373

Mentioned in SAL (#wikimedia-operations) [2020-08-27T13:58:39Z] <kormat> disabling GTID on pc2007 (pc1), pc2008 (pc2), pc2009 (pc3) T243373

GTID disabled on all codfw masters except for s1.
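
The log entries above do not record the exact statements that were run. As a minimal sketch, assuming the standard MariaDB approach of switching a replica from GTID to binlog-position replication, the sequence could be generated like this (the helper names are hypothetical and not part of any tool mentioned in this task):

```python
def disable_gtid_statements() -> list[str]:
    """Statements that stop replication, turn GTID off, and restart replication.

    Assumption: MariaDB syntax, where MASTER_USE_GTID accepts
    no / slave_pos / current_pos.
    """
    return [
        "STOP SLAVE",
        "CHANGE MASTER TO MASTER_USE_GTID = no",
        "START SLAVE",
    ]


def render_mysql_command(statements: list[str]) -> str:
    """Join the statements into a single `mysql -e` invocation (illustrative only)."""
    joined = "; ".join(statements) + ";"
    return f'mysql -e "{joined}"'


if __name__ == "__main__":
    print(render_mysql_command(disable_gtid_statements()))
```

Replication has to be stopped before `CHANGE MASTER TO` can be applied, which is why the sketch wraps the change in `STOP SLAVE` / `START SLAVE`.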

Mentioned in SAL (#wikimedia-operations) [2020-08-28T08:22:41Z] <kormat> enabling replication from db2112 to db1083 (s1) T243373

Confirmed that GTID is disabled on all codfw masters:

sudo cumin 'A:db-role-master and A:codfw' 'mysql -e "show slave status\G" | grep Using_Gtid'
15 hosts will be targeted:
db[2079,2090,2096,2105,2107,2112,2118,2123,2129].codfw.wmnet,es[2021,2023].codfw.wmnet,pc[2007-2010].codfw.wmnet
Confirm to continue [y/n]? y
===== NODE GROUP =====
(1) pc2010.codfw.wmnet
----- OUTPUT of 'mysql -e "show s... grep Using_Gtid' -----
                    Using_Gtid: Slave_Pos
===== NODE GROUP =====
(6) db2096.codfw.wmnet,es[2021,2023].codfw.wmnet,pc[2007-2009].codfw.wmnet
----- OUTPUT of 'mysql -e "show s... grep Using_Gtid' -----
                    Using_Gtid: No
===== NODE GROUP =====
(8) db[2079,2090,2105,2107,2112,2118,2123,2129].codfw.wmnet
----- OUTPUT of 'mysql -e "show s... grep Using_Gtid' -----
                   Using_Gtid: No
================
PASS |████████████████████████████████████████| 100% (15/15) [00:01<00:00,  2.61hosts/s]
FAIL |                                        |   0% (0/15) [00:01<?, ?hosts/s]
100.0% (15/15) success ratio (>= 100.0% threshold) for command: 'mysql -e "show s... grep Using_Gtid'.
100.0% (15/15) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.

(pc2010 is a false positive, as all parsercaches are 'masters')
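
The same check can be scripted instead of read by eye. The following is a rough sketch, not the tooling used here: it assumes the text of `show slave status\G` for one host is already available as a string, and the function names are made up for illustration.

```python
import re

# Matches lines such as "                    Using_Gtid: No"
USING_GTID_RE = re.compile(r"^[ \t]*Using_Gtid:[ \t]*(\S+)[ \t]*$", re.MULTILINE)


def using_gtid_values(slave_status: str) -> list[str]:
    """Return every Using_Gtid value found in `show slave status\\G` output."""
    return USING_GTID_RE.findall(slave_status)


def gtid_disabled(slave_status: str) -> bool:
    """True only if at least one Using_Gtid line exists and all of them say No."""
    values = using_gtid_values(slave_status)
    return bool(values) and all(value.lower() == "no" for value in values)


if __name__ == "__main__":
    sample = "               Slave_IO_Running: Yes\n                    Using_Gtid: No\n"
    print(gtid_disabled(sample))
```

A host with no replication configured produces no `Using_Gtid` line at all, so the sketch treats empty output as "not confirmed" rather than as disabled.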

I have reviewed the weights again so this is all done.
As part of T260042 I am warming up all the hosts in codfw, including the masters, parsercache, es...

Thanks Stephen for taking care of the replication re-enablement.

Marostegui renamed this task from Enable DB replication codfw -> eqiad before the switchover to Enable DB replication codfw -> eqiad before the switchover and some other checks. Aug 31 2020, 5:02 AM
Marostegui updated the task description.

All codfw slaves have been checked for GTID, and they all have it enabled (this excludes the codfw masters).

Mentioned in SAL (#wikimedia-operations) [2020-09-01T06:20:50Z] <marostegui> Install query killers on db2137:3314 T243373

I have checked that the event scheduler is enabled everywhere within codfw.
Same for query killers.
The query killer wasn't present on db2137:3314 and db2137:3315, so I have installed it there.

On es3, es2017 only had the master query killers, so I have installed the slave ones as well.
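
How the event scheduler was checked is not recorded in this task. One possible way, sketched here under the assumption that the value of `SELECT @@GLOBAL.event_scheduler` has already been collected per instance, is to flag every instance whose value is not `ON`. The function name and the input mapping are hypothetical.

```python
def instances_without_event_scheduler(values: dict[str, str]) -> list[str]:
    """Return the instances whose event_scheduler value is anything other than ON.

    `values` maps an instance label (for example "host:port") to the string
    returned by `SELECT @@GLOBAL.event_scheduler`, which MariaDB reports as
    ON, OFF, or DISABLED.
    """
    return sorted(
        instance
        for instance, value in values.items()
        if value.strip().upper() != "ON"
    )
```

An empty result means the scheduler is enabled on every instance that was queried; it says nothing about instances missing from the input.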

I have checked that replication works from codfw to eqiad by checking a few tables. Sections checked:

  • pc1, pc2, pc3
  • es4, es5
  • s1 enwiki
  • s2 itwiki
  • s3 ilowiki (created my own user to check)
  • s4 commons
  • s5 dewiki
  • s6 frwiki
  • s7 eswiki
  • s8 wikidatawiki
  • x1 enwiki for echo_event
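
The task does not say how the tables were compared. As an illustration only, assuming the rows of a small table have been fetched from both the codfw master and its eqiad replica, an order-independent digest is one simple way to confirm the two copies agree. Both function names are hypothetical.

```python
import hashlib
from typing import Iterable, Sequence


def table_digest(rows: Iterable[Sequence[object]]) -> str:
    """SHA-256 over the sorted textual form of each row, so row order does not matter."""
    digest = hashlib.sha256()
    for line in sorted(repr(tuple(row)) for row in rows):
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def tables_match(source_rows: Iterable[Sequence[object]],
                 replica_rows: Iterable[Sequence[object]]) -> bool:
    """True if both sides contain exactly the same set of rows (including duplicates)."""
    return table_digest(source_rows) == table_digest(replica_rows)
```

This loads and sorts every row in memory, so it only suits spot checks on small tables; it is not a substitute for a proper data-comparison tool on large ones.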