
find a viable replication source on codfw
Closed, Resolved, Public

Description

As seen in this patch: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1071570, db2223 collides with another DB replication source: https://fault-tolerance.toolforge.org/map?cluster=db-masters

@Ladsgroup could you help me find a better suited candidate?

If you give me your train of thought I'll attach it to a task to create a cookbook "find.a.better.suited.candidate.master"

Event Timeline

ABran-WMF changed the task status from Open to In Progress. Sep 17 2024, 12:54 PM
ABran-WMF triaged this task as High priority.
ABran-WMF moved this task from Triage to Pending comment on the DBA board.
ABran-WMF added subscribers: Marostegui, Volans.
ABran-WMF renamed this task from "find a viable replication codfw source" to "find a viable replication source on codfw". Sep 18 2024, 9:50 AM

I'd say for now, pool it as a normal replica (remove the candidate master part from the puppet patch). Don't decommission the old one. Then let's go through the pooled replicas of s5 and pick any that:
1- Isn't multiinstance
2- Isn't sharing a rack with any other master or candidate master
3- Isn't the sanitarium master
4- Isn't about to be decommissioned

Once that's found, we need to change the binlog format to STATEMENT (make sure to set the binlog format both live and in puppet). Then run flush logs. Then update tags and it's done. I'm not sure we can really automate this. We need to decide on it in every case. Fun.
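The four checks above can be sketched as a simple filter. This is only an illustration, assuming the facts about each replica are already known: the `Replica` fields, the `viable_candidates` helper, and the host and rack names are all hypothetical, not taken from any existing tool or inventory.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    """Hypothetical record of the facts needed to judge one pooled replica."""
    name: str
    rack: str
    multiinstance: bool = False
    sanitarium_master: bool = False
    pending_decommission: bool = False


def viable_candidates(replicas, master_racks):
    """Keep only replicas that pass all four checks listed in the comment.

    master_racks is the set of racks already hosting a master or a
    candidate master (assumed to be supplied by the caller).
    """
    return [
        r for r in replicas
        if not r.multiinstance                 # 1- isn't multiinstance
        and r.rack not in master_racks         # 2- no shared rack
        and not r.sanitarium_master            # 3- isn't sanitarium master
        and not r.pending_decommission         # 4- isn't about to be decommissioned
    ]


if __name__ == "__main__":
    # Made-up hosts and racks, purely for illustration.
    pool = [
        Replica("host-a", "A1"),
        Replica("host-b", "B2", multiinstance=True),
        Replica("host-c", "C3"),
    ]
    print([r.name for r in viable_candidates(pool, master_racks={"C3"})])
```

A filter like this only narrows the list; as noted, the final pick still needs a human decision in every case.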

ABran-WMF lowered the priority of this task from High to Medium. Sep 24 2024, 5:41 AM

Fun indeed! That lowers the priority, then. I'll take care of it soon, thanks!

Change #1083813 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/puppet@production] mariadb: add db2223

https://gerrit.wikimedia.org/r/1083813

Change #1083813 merged by Arnaudb:

[operations/puppet@production] mariadb: add db2223

https://gerrit.wikimedia.org/r/1083813

Is there anything needed or pending here?

db2223 has been pooled as a plain replica. Since db2123 is scheduled for removal, we'll have to pick a new candidate master for this section. I guess this could also be a good time for us to document and/or automate the decision process.

Why not pick db2192, for instance?

The process is quite hard to actually maintain in place, as we move servers around logically a lot, so whatever state we leave it in now may not be the same one we'll have next week, or even tomorrow, if we have crashes, switchovers, decommissionings, etc.
I think a good approach would be to start using Amir's tool to suggest single replicas (with no special features like RBR or hanging replicas) and see how that goes.

Why not pick db2192, for instance?

It could be a good candidate. I haven't had time to work through the process described in T374951#10168440, but I fully trust your choice, so I'll swap it with db2123.

The process is quite hard to actually maintain in place, as we move servers around logically a lot, so whatever state we leave it in now may not be the same one we'll have next week, or even tomorrow, if we have crashes, switchovers, decommissionings, etc.
I think a good approach would be to start using Amir's tool to suggest single replicas (with no special features like RBR or hanging replicas) and see how that goes.

I think tooling to help pick that kind of host could be useful, as it's mostly sieving down a set of nodes against another set of information. As far as I found, Amir's tool (just to be 100% sure, you're talking about this one, right? https://fault-tolerance.toolforge.org/map) was a good help, but we can automate this even further based on the information it provides. Otherwise, we should at least document on Wikitech the set of criteria to be used to pick a new candidate source.
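As a rough sketch of what "sieving down" with documented criteria could look like, the snippet below keeps the criteria as a named list and reports why each host is rejected, so the output doubles as documentation of the decision. Everything here is an assumption for illustration: the `CRITERIA` labels are paraphrased from the comments in this task, and the `sieve` helper and the dictionary keys are hypothetical, not the data model of the fault-tolerance tool.

```python
# Each entry: (human-readable reason, predicate returning True when the host
# must be rejected). Predicates receive the host facts and the set of racks
# already holding a master or candidate master. All keys are hypothetical.
CRITERIA = [
    ("multi-instance host",
     lambda facts, racks: facts.get("multiinstance", False)),
    ("shares a rack with a master or candidate master",
     lambda facts, racks: facts.get("rack") in racks),
    ("sanitarium master",
     lambda facts, racks: facts.get("sanitarium_master", False)),
    ("about to be decommissioned",
     lambda facts, racks: facts.get("pending_decommission", False)),
    ("uses row-based replication (RBR)",
     lambda facts, racks: facts.get("rbr", False)),
    ("has hanging replicas",
     lambda facts, racks: facts.get("has_hanging_replicas", False)),
]


def sieve(hosts, master_racks):
    """Map each host name to the list of reasons it was rejected.

    An empty list means the host passed every criterion and can be
    suggested to a human for the final decision.
    """
    return {
        name: [label for label, rejects in CRITERIA
               if rejects(facts, master_racks)]
        for name, facts in hosts.items()
    }


if __name__ == "__main__":
    # Made-up hosts and racks, purely for illustration.
    hosts = {
        "host-a": {"rack": "A1"},
        "host-b": {"rack": "B2", "rbr": True},
    }
    for name, reasons in sieve(hosts, master_racks={"A1"}).items():
        print(name, "->", reasons or "viable")
```

Keeping the reasons alongside the result makes it easier to review a suggestion when the topology has changed since the last run.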

Change #1109678 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db2191: Make it s5 candidate master

https://gerrit.wikimedia.org/r/1109678

Change #1109678 merged by Marostegui:

[operations/puppet@production] db2191: Make it s5 candidate master

https://gerrit.wikimedia.org/r/1109678

Marostegui claimed this task.

Why not pick db2192, for instance?

It could be a good candidate. I haven't had time to work through the process described in T374951#10168440, but I fully trust your choice, so I'll swap it with db2123.

Done.