
Implement (or refactor) a script to move slaves when the master is not available
Open, Medium, Public

Description

Right now we use repl.pl to move slaves around, i.e. when a master failover is needed, we use it to move all the slaves under the new master.

However, this script doesn't work when the master is unavailable.

It would be a good start to either refactor repl.pl or create a new script that could move slaves under a different host when the master is unavailable, e.g. the master has crashed and we have to move all the slaves to replicate from the candidate master during an emergency.
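
To make the emergency case concrete, the operation that would need to be batched across all the slaves is a plain replica repoint. A minimal sketch of what would be run on each slave, assuming direct MySQL access via pymysql and that the matching binlog coordinates on the candidate master have already been worked out by hand (hostnames, credentials and coordinates below are placeholders):

import pymysql

# Placeholders: in a real emergency these come from the DBA driving the failover.
NEW_MASTER_HOST = 'db1229.eqiad.wmnet'
NEW_MASTER_PORT = 3306
NEW_MASTER_LOG_FILE = 'db1229-bin.001234'  # hypothetical binlog file on the candidate
NEW_MASTER_LOG_POS = 4


def repoint_slave(host, port=3306):
    """Point one slave at the candidate master (the old master is unreachable)."""
    conn = pymysql.connect(host=host, port=port, user='admin', password='***')
    try:
        with conn.cursor() as cur:
            cur.execute('STOP SLAVE')
            cur.execute(
                'CHANGE MASTER TO MASTER_HOST=%s, MASTER_PORT=%s, '
                'MASTER_LOG_FILE=%s, MASTER_LOG_POS=%s',
                (NEW_MASTER_HOST, NEW_MASTER_PORT,
                 NEW_MASTER_LOG_FILE, NEW_MASTER_LOG_POS))
            cur.execute('START SLAVE')
    finally:
        conn.close()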

Details

Title: [WIP] Introduce emergency switchover
Reference: repos/sre/wmfmariadbpy!3
Author: ladsgroup
Source branch: emergency_switchover
Dest branch: main

Event Timeline

Marostegui triaged this task as Medium priority. Jun 4 2018, 1:12 PM
Marostegui created this task.
Marostegui moved this task from Triage to Backlog on the DBA board.
Vvjjkkii renamed this task from Implement (or refactor) a script to move slaves when the master is not available to pobaaaaaaa. Jul 1 2018, 1:05 AM
Vvjjkkii raised the priority of this task from Medium to High.
Vvjjkkii updated the task description. (Show Details)
Vvjjkkii removed a subscriber: Aklapper.
Marostegui renamed this task from pobaaaaaaa to Implement (or refactor) a script to move slaves when the master is not available. Jul 2 2018, 5:15 AM
Marostegui lowered the priority of this task from High to Medium.
Marostegui updated the task description. (Show Details)

@jcrespo - I have been thinking about this ticket lately.
Given that switchover.py works so well already, do you think it would be doable to add a --emergency-slave-switch $new_master option (or whatever we want to call it) to move the slaves under a given host without checking the master?
This would allow us to do emergency failovers if a master isn't reachable. Obviously this needs to be executed carefully, but during an emergency it can simplify the process of having to run the CHANGE MASTER commands to point everything at the preferred host.
A human should still check:

  1. Which host is the most advanced in terms of replication, so that one can be promoted (in case not all the hosts stopped at the same position).
  2. That the preferred host is running with binlog_format=STATEMENT.
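
For what it's worth, both checks could be scripted against the replicas themselves. A rough sketch, assuming direct access with pymysql (the host list and credentials are placeholders), that prints how far each replica got in the dead master's binlog and which binlog format it runs:

import pymysql

REPLICAS = ['db1229.eqiad.wmnet', 'db1233.eqiad.wmnet']  # placeholder host list


def replica_state(host, port=3306):
    conn = pymysql.connect(host=host, port=port, user='check', password='***',
                           cursorclass=pymysql.cursors.DictCursor)
    with conn.cursor() as cur:
        cur.execute('SHOW SLAVE STATUS')
        status = cur.fetchone()
        cur.execute('SELECT @@binlog_format AS fmt')
        fmt = cur.fetchone()['fmt']
    conn.close()
    # Relay_Master_Log_File / Exec_Master_Log_Pos = how far the SQL thread
    # got in the (now dead) master's binlog.
    return status['Relay_Master_Log_File'], int(status['Exec_Master_Log_Pos']), fmt


states = {host: replica_state(host) for host in REPLICAS}
# Most advanced replica first; the preferred candidate should also
# show binlog_format=STATEMENT.
for host, (log_file, pos, fmt) in sorted(states.items(),
                                         key=lambda kv: kv[1][:2], reverse=True):
    print('%s: %s:%d binlog_format=%s' % (host, log_file, pos, fmt))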

Sadly switchover.py wouldn't be reusable or helpful for an emergency (the replication and other libraries may be); it has to start from 0. switchover.py assumes all hosts are reachable, have very low lag, replication is working, etc., which won't be the case in a failover. A failover is a much harder case where every possibility of breakage has to be contemplated separately and some safe compromises have to be taken (e.g. what to do if we detect that X amount of data has been lost).

Ah, I see!
Yeah, I was thinking about a very primitive way to do it (for now), which would require human intervention to decide which is the most suitable host to be the new master, and then the script would actually execute the batch of CHANGE MASTER ... MASTER_HOST statements.

Yeah, I understood you meant that (not a fully automated and autonomous script), but even that is not easy and still not reusable, as it would have to work without using the master, and that requires arbitrary master changes that neither GTID nor WMFReplication.move() allow yet. We would need to implement binlog position matching first, and a way to detect the replicas of a master that is down (tendril replacement "zarcillo" database?). All doable, but not immediate or reusable from existing code.

and a way to detect the replicas of a master that is down (tendril replacement "zarcillo" database?)

Good point - with the master down there is no canonical place, apart from tendril/zarcillo, to detect which hosts are hanging off it indeed.

With the great work done by @Ladsgroup at T281249: Create or modify an existing tool that quickly shows the db replication status in case of master failure, I think we are a step closer to getting this done.
Once we have that script, we could implement another one based on it (rather than refactoring db-switchover) which, once passed the right candidate master, would simply configure replication on all the other replicas.

As a safety measure, the script should disallow hosts that have any of the following (a rough sketch of these checks follows the list):

  • Multi-instance
  • Other slaves hanging
  • binlog format not STATEMENT
  • Not in the active DC
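
A rough sketch of those checks, reading the list as constraints on the host proposed as the new master. The port and hostname heuristics (a non-3306 port meaning multi-instance, .eqiad.wmnet meaning active DC) and the credentials are assumptions, and SHOW SLAVE HOSTS only sees replicas that registered themselves, so this is best-effort:

import pymysql

ACTIVE_DC_SUFFIX = '.eqiad.wmnet'  # assumption: eqiad is the active DC


def candidate_problems(host, port):
    """Return the reasons why host:port should NOT be promoted (empty list = OK)."""
    problems = []
    if port != 3306:
        # heuristic: a non-standard port means a multi-instance host
        problems.append('multi-instance (port %d)' % port)
    if not host.endswith(ACTIVE_DC_SUFFIX):
        problems.append('not in the active DC')
    conn = pymysql.connect(host=host, port=port, user='check', password='***',
                           cursorclass=pymysql.cursors.DictCursor)
    with conn.cursor() as cur:
        cur.execute('SELECT @@binlog_format AS fmt')
        if cur.fetchone()['fmt'] != 'STATEMENT':
            problems.append('binlog format is not STATEMENT')
        cur.execute('SHOW SLAVE HOSTS')
        if cur.fetchall():
            # best-effort: only replicas that registered with report_host show up
            problems.append('has other slaves hanging off it')
    conn.close()
    return problems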

@Ladsgroup would you be ok working on this task?

Definitely. I can start next week.

@Ladsgroup: Removing task assignee as this open task has been assigned for more than two years - see the email sent to all task assignees on 2024-04-15.
Please assign this task to yourself again if you still realistically [plan to] work on this task - it would be welcome! :)
If this task has been resolved in the meantime, or should not be worked on by anybody ("declined"), please update its task status via "Add Action… 🡒 Change Status".
Also see https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup for tips on how to best manage your individual work in Phabricator. Thanks!

@Marostegui To get the list of direct replicas, something like this would work in cumin:

import argparse
import json

import requests

parser = argparse.ArgumentParser()
parser.add_argument('section', help='Must be the section name in orchestrator')
args = parser.parse_args()
# Fetch all instances of the section (cluster) from orchestrator's API.
data_ = requests.get(
    'https://orchestrator.wikimedia.org/api/cluster/alias/' +
    args.section).json()
db_data = []
for db in data_:
    if db['MasterKey']['Hostname'] + ':' + \
            str(db['MasterKey']['Port']) != db['ClusterName']:
        # not a direct replica of the cluster's master
        continue
    db_data.append(db['Key']['Hostname'] + ':' + str(db['Key']['Port']))

print('direct replicas')
for db in db_data:
    print(json.dumps(db))

Which outputs something like this:

ladsgroup@cumin1002:~/ladsgroup/software2/dbtools$ python3 direct_replicas.py s2
direct replicas
"db1156.eqiad.wmnet:3306"
"db1182.eqiad.wmnet:3306"
"db1188.eqiad.wmnet:3306"
"db1197.eqiad.wmnet:3306"
"db1222.eqiad.wmnet:3306"
"db1225.eqiad.wmnet:3312"
"db1229.eqiad.wmnet:3306"
"db1233.eqiad.wmnet:3306"
"db1239.eqiad.wmnet:3312"
"db1246.eqiad.wmnet:3306"
"db2207.codfw.wmnet:3306"
"dbstore1007.eqiad.wmnet:3312"

(it only works from cumin)

We can make it find direct replicas for the secondary DC too. The hardest part is refactoring db-switchover to take that list (in itself it's not hard; it's that it assumes in many places that the old master is reachable, which is a good assumption for the main use case). Maybe I should copy-paste it into a new file and see what happens.
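
For the secondary DC, the same orchestrator payload can be filtered by the intermediate master instead of by ClusterName. A small sketch reusing data_ from the snippet above (db2207 is just the current codfw master from the s2 output, used here as a placeholder):

# Replicas hanging off a given intermediate master (e.g. the codfw master),
# reusing `data_` from the snippet above.
secondary_master = 'db2207.codfw.wmnet:3306'  # placeholder
secondary_replicas = [
    db['Key']['Hostname'] + ':' + str(db['Key']['Port'])
    for db in data_
    if db['MasterKey']['Hostname'] + ':' +
    str(db['MasterKey']['Port']) == secondary_master
]
print('replicas of ' + secondary_master)
for db in secondary_replicas:
    print(json.dumps(db))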

Keep in mind that you don't really need the list of replicas for the secondary DC; if you have its master, that is all you need. You don't need to touch the secondary DC's replicas. That is, all you would need is to reconfigure db2207 to replicate under the new primary master; its replicas don't need anything.
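
For completeness, the batch itself could then be generated from the chosen candidate plus the direct-replica list from the snippet above, skipping the candidate itself. A dry-run sketch (hostnames and binlog coordinates are placeholders, not real values):

# Dry-run sketch: print the statements that would be run on each direct replica.
candidate = 'db1229.eqiad.wmnet:3306'                    # placeholder choice
candidate_file, candidate_pos = 'db1229-bin.001234', 4   # hypothetical coordinates
direct_replicas = [                                      # e.g. output of direct_replicas.py
    'db1233.eqiad.wmnet:3306',
    'db2207.codfw.wmnet:3306',
]

cand_host, cand_port = candidate.rsplit(':', 1)
for replica in direct_replicas:
    if replica == candidate:
        continue
    print('-- on %s' % replica)
    print('STOP SLAVE;')
    print("CHANGE MASTER TO MASTER_HOST='%s', MASTER_PORT=%s, "
          "MASTER_LOG_FILE='%s', MASTER_LOG_POS=%d;"
          % (cand_host, cand_port, candidate_file, candidate_pos))
    print('START SLAVE;')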

I agree, let's create db-emergency-switchover and work there without touching the current db-switchover for now.