
Database maintenance scheduled while eqiad datacenter is non primary (after the DC switchover)
Closed, Resolved · Public

Description

Let's create a meta ticket to list the things that can take advantage of the DC switchover from a DB point of view.
Feel free to add/remove tasks as needed.
Depending on how long we keep eqiad on standby, we will need to decide which of these tasks get done first.

Recap of things to do: https://wikitech.wikimedia.org/wiki/Switch_Datacenter/planned_db_maintenance

Related Objects

This task is connected to more than 200 other tasks. Only direct parents and subtasks are shown here. Use View Standalone Graph to show more of the graph.
Status     Assigned      Task
Resolved   Joe
Resolved   None
Resolved   Marostegui
Resolved   Marostegui
Resolved   Marostegui
Resolved   Marostegui
Resolved   jcrespo
Resolved   Marostegui
Resolved   Marostegui
Resolved   Marostegui
Resolved   jcrespo
Resolved   None
Resolved   Marostegui
Resolved   Marostegui
Resolved   Cmjohnson
Resolved   Marostegui
Resolved   Marostegui

Event Timeline

Restricted Application added a subscriber: Aklapper. · Jan 11 2017, 5:19 PM
Marostegui triaged this task as Medium priority. · Jan 11 2017, 5:26 PM
Marostegui added a project: Epic.
Marostegui moved this task from Triage to Meta/Epic on the DBA board.
Marostegui updated the task description. · Jan 11 2017, 5:45 PM
Marostegui added a subtask: Unknown Object (Task). · Jan 11 2017, 10:48 PM

Things that we need to do while codfw is active:

  • ALTER eqiad s4 master: T73563
  • ALTER eqiad shards: T130067
  • ALTER eqiad shards: T147166
  • ALTER eqiad masters (and, if we finish in time, the slaves for that task): T132416

Feel free to edit this comment and add/change stuff as needed.

> Feel free to edit this comment and add/change stuff as needed.

Sadly, Phabricator is not a wiki: either put it in the task header, on a wiki, or on an etherpad :-)

Marostegui updated the task description. · Mar 31 2017, 8:55 AM

I have migrated that to the wiki for security reasons, as I have added details about it, some extra tasks, and questions about others: https://wikitech.wikimedia.org/wiki/Switch_Datacenter/planned_db_maintenance

jcrespo updated the task description. · Mar 31 2017, 9:28 AM
jcrespo renamed this task from Meta DBA ticket for the DC switchover to Database maintenance scheduled while eqiad datacenter is non primary (after the DC switchover). · Apr 3 2017, 10:47 AM

I have added the "failover schedule" to the wiki page.

Change 348440 had a related patch set uploaded (by Marostegui):
[operations/dns@master] templates/wmnet: Switch dns master alias to codfw

https://gerrit.wikimedia.org/r/348440

I have been reviewing all the steps on https://wikitech.wikimedia.org/wiki/Switch_Datacenter
The post-switch phase, as described at https://wikitech.wikimedia.org/wiki/Switch_Datacenter#Phase_9_-_post_read-only, will not take care of:

> Update DNS records for new database masters. This is not covered by the switchdc script

So I have submitted the patch above, to be merged once we are on phase 9 of the switch. @Volans and @jcrespo, please take a look at it.

Change 348440 merged by Marostegui:
[operations/dns@master] templates/wmnet: Switch dns master alias to codfw

https://gerrit.wikimedia.org/r/348440

Mentioned in SAL (#wikimedia-operations) [2017-04-24T18:18:28Z] <jynus> disabling mysql replication eqiad -> codfw on s[1-7] and x1 shards T155099

I double-checked this is ok, but a second pair of eyes will be required tomorrow:

$ cumin 'db2* and R:Class = Role::Mariadb::Groups and R:Class%mysql_group = core and R:Class%mysql_role = master' "mysql --skip-ssl -e \"SHOW SLAVE STATUS\""
8 hosts will be targeted:
db[2016-2019,2023,2028-2029,2033].codfw.wmnet
Confirm to continue [y/n]? y
===== NO OUTPUT =====                                                                                                          
PASS |████████████████████████████████████████████████████████████████████████| 100% (8/8) [00:00<00:00, 11.94hosts/s]         
FAIL |                                                                                |   0% (0/8) [00:00<?, ?hosts/s]         
100.0% (8/8) success ratio (>= 100.0% threshold) for command: 'mysql --skip-ssl...OW SLAVE STATUS"'.
100.0% (8/8) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.

We should also set a reminder to re-enable it before the 3rd; I will do that now.
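As a cross-check of the "8 hosts will be targeted" line above, here is a minimal sketch that expands the folded host notation printed by cumin into individual host names. It is a hand-rolled illustration, not cumin's own code: the helper name `expand_hosts` is hypothetical, and it only handles a single bracket group of numbers and numeric ranges, which is all the output above needs.

```python
import re
from typing import List


def expand_hosts(folded: str) -> List[str]:
    """Expand e.g. 'db[2016-2019,2023].codfw.wmnet' into individual host names.

    Hypothetical helper for illustration; supports one bracket group only.
    """
    match = re.fullmatch(
        r"(?P<prefix>[^\[\]]*)\[(?P<body>[^\[\]]+)\](?P<suffix>[^\[\]]*)", folded
    )
    if match is None:
        # No bracket group: the string is already a single host name.
        return [folded]

    hosts = []
    for part in match["body"].split(","):
        # Each part is either a single number ("2023") or a range ("2016-2019").
        start, _, end = part.partition("-")
        end = end or start
        width = len(start)  # keep any zero padding from the original notation
        for number in range(int(start), int(end) + 1):
            hosts.append(f"{match['prefix']}{number:0{width}d}{match['suffix']}")
    return hosts


hosts = expand_hosts("db[2016-2019,2023,2028-2029,2033].codfw.wmnet")
print(len(hosts))  # 8, matching the cumin output above
```

The list has one entry per codfw master that the `SHOW SLAVE STATUS` check ran on, so it can be compared against the expected masters of s[1-7] and x1 during the second review.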

Change 350824 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/dns@master] wmnet: Switch dns master alias to eqiad

https://gerrit.wikimedia.org/r/350824

Mentioned in SAL (#wikimedia-operations) [2017-05-03T09:50:17Z] <marostegui> Restart db1097 to change its binlog to STATEMENT - T155099

Change 350824 merged by Marostegui:
[operations/dns@master] wmnet: Switch dns master alias to eqiad

https://gerrit.wikimedia.org/r/350824

jcrespo closed this task as Resolved. · May 3 2017, 4:43 PM
jcrespo assigned this task to Marostegui.

This is now done. The tasks that stay open are the ones still pending on codfw (with eqiad already done) or the slave-only eqiad tasks (with the eqiad masters already done).

Marostegui removed Marostegui as the assignee of this task. · May 3 2017, 4:47 PM

We have BOTH worked a lot on this, so it is not fair to assign it to me!
Thanks for the massive amount of great work you have done here.