Move sX to STATEMENT based replication
Closed, Resolved (Public)

Description

This has been going on in my mind for years and I'd like to get some thoughts.
We've been running SBR on masters forever, but RBR on slaves. Our candidate masters run SBR.

The motivation for this was to be able to migrate to RBR everywhere at some point, but this has proven unlikely over the last few years (more than 5), for various reasons, starting with not being able to fully ensure our data is consistent everywhere, especially on those tables that didn't have a PK for a long time (in some cases until recently).

By mistake, some years ago we ran s5 with RBR everywhere for many months and nothing ever happened, but we were always a bit nervous about it.

The fact that we run RBR on the slaves but not on the candidate master makes us very limited in our choice of candidate masters in case of a master failure.

I'd like to propose we migrate sX to SBR everywhere (not on backup sources, sanitarium master, sanitarium and clouddb*).
x1 should remain RBR as it's been proven for years it is all good there.
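
A minimal sketch of the scope proposed above, assuming a hypothetical inventory in which every host carries a section and a role label. The host names, role strings, and the `hosts_on_sbr` helper are invented for illustration and are not the real inventory or tooling.

```python
from dataclasses import dataclass

# Roles that the proposal keeps on RBR. The labels themselves are hypothetical.
EXCLUDED_ROLES = {"backup_source", "sanitarium_master", "sanitarium", "clouddb"}

# Sections that the proposal keeps on RBR entirely.
EXCLUDED_SECTIONS = {"x1"}


@dataclass(frozen=True)
class Host:
    name: str
    section: str
    role: str  # e.g. "master", "candidate_master", "replica"


def hosts_on_sbr(hosts):
    """Return the hosts that would end up on STATEMENT under the proposal."""
    return [
        host
        for host in hosts
        if host.section not in EXCLUDED_SECTIONS and host.role not in EXCLUDED_ROLES
    ]


# Made-up inventory, not real hosts.
inventory = [
    Host("db-a", "s6", "master"),
    Host("db-b", "s6", "candidate_master"),
    Host("db-c", "s6", "replica"),
    Host("db-d", "s6", "sanitarium_master"),
    Host("db-e", "s6", "backup_source"),
    Host("db-f", "x1", "replica"),
]

sbr_names = [host.name for host in hosts_on_sbr(inventory)]
print(sbr_names)
```

With every regular replica in the result, any of them could in principle be considered as a candidate master, which is the flexibility the proposal is after.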

@Ladsgroup @jcrespo @FCeratto-WMF thoughts? Opinions either way, to move forward or to keep the current status, are welcome.

Progress

  • s1
    • eqiad
    • codfw
  • s2
    • eqiad
    • codfw
  • s3
    • eqiad
    • codfw
  • s4
    • eqiad
    • codfw
  • s5
    • eqiad
    • codfw
  • s6
    • eqiad
    • codfw
  • s7
    • eqiad
    • codfw
  • s8
    • eqiad
    • codfw
  • x3
    • eqiad
    • codfw

Event Timeline

Marostegui triaged this task as Medium priority. Jan 15 2025, 4:24 PM
Marostegui moved this task from Triage to Refine on the DBA board.

Well, the biggest issue is that Oracle is deprecating it :-/ https://dev.mysql.com/doc/refman/8.0/en/replication-options-binary-log.html#sysvar_binlog_format

Re: backup sources, they don't have binlog to start with, so not applicable there.

MariaDB isn't doing it

Re: backup sources, they don't have binlog to start with, so not applicable there.

Ah true!

@jcrespo though, now that I think about it, if we want to backup binlogs, I think they should be on RBR anyway, so we have the full status of the rows - but that is to be discussed in some other task.

So my previous comment is my only contribution, I don't feel any way, we would love to have the flexibility of SBR but the reliability of RBR, but we sadly cannot have both. The only thing I can think of is: let's push T207253 and it won't matter!

@jcrespo though, now that I think about it, if we want to backup binlogs, I think they should be on RBR anyway, so we have the full status of the rows - but that is to be discussed in some other task.

My intention was to backup both formats, as long as it was possible. I want double backup of everything :-D. But yeah, a discussion for another time, and shouldn't affect this ticket.

So my previous comment is my only contribution, I don't feel any way, we would love to have the flexibility of SBR but the reliability of RBR, but we sadly cannot have both. The only thing I can think of is: let's push T207253 and it won't matter!

I talked to @FCeratto-WMF about that one actually today :-)

What we could do is move to SBR, and in parallel try to build the trust and checks in our data and eventually move to RBR (for real this time). Even though I've talked to MariaDB about this to confirm, we never know if in the future they'll stop supporting it.

Thanks for your comments Jaime.
@Ladsgroup @FCeratto-WMF that sounds like a plan?

Change #1111955 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] mariadb: Set RBR to all sanitarium masters

https://gerrit.wikimedia.org/r/1111955

Change #1111955 merged by Marostegui:

[operations/puppet@production] mariadb: Set RBR to all sanitarium masters

https://gerrit.wikimedia.org/r/1111955

I generally like the idea, as it increases the number of potential candidates for master in case the master goes down and the designated candidate master doesn't have the most recent entries.

My worry right now (which is not a major one) would be vendor lock-in. If only MariaDB sticks to keeping SBR and for any reason (Redis situation, hostile takeover, etc.) we needed to switch to something else ASAP, it would be quite painful for us.

We are, unfortunately, already locked in due to the way MariaDB implemented replication and GTID

Change #1112040 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] mariadb: Declare RBR in sanitariums.

https://gerrit.wikimedia.org/r/1112040

Change #1112040 merged by Marostegui:

[operations/puppet@production] mariadb: Declare RBR in sanitariums.

https://gerrit.wikimedia.org/r/1112040

I am going to start with s6 for this.
First of all, I will set all the hosts to STATEMENT-based replication in hiera, and once everything is migrated, I will move all the defaults in puppet to SBR.
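
A rough model of that two-phase plan, assuming a simple lookup where a per-host value wins over a global default. The dictionaries and the function names here are hypothetical and only illustrate the ordering; they are not the actual hiera keys or puppet code.

```python
def effective_binlog_format(host, overrides, default):
    """Per-host override wins; otherwise the global default applies."""
    return overrides.get(host, default)


def redundant_overrides(overrides, default):
    """Overrides that merely repeat the default and could be cleaned up."""
    return sorted(host for host, fmt in overrides.items() if fmt == default)


hosts = ["db-a", "db-b", "db-sanitarium"]  # made-up host names

# Phase 1: default is still ROW, migrated hosts get an explicit STATEMENT.
phase1_default = "ROW"
phase1_overrides = {"db-a": "STATEMENT", "db-b": "STATEMENT"}

# Phase 2: default flips to STATEMENT. Hosts that must stay on RBR need an
# explicit ROW override *before* the flip, or they would silently change.
phase2_default = "STATEMENT"
phase2_overrides = dict(phase1_overrides, **{"db-sanitarium": "ROW"})

phase1 = {h: effective_binlog_format(h, phase1_overrides, phase1_default) for h in hosts}
phase2 = {h: effective_binlog_format(h, phase2_overrides, phase2_default) for h in hosts}
cleanup = redundant_overrides(phase2_overrides, phase2_default)
print(phase1, phase2, cleanup)
```

In this model no host changes format when the default flips, and the per-host STATEMENT entries become removable afterwards.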

Change #1147424 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] mariadb: Move s5 to SBR

https://gerrit.wikimedia.org/r/1147424

In the discussion around T207253 (comparing data across instances) we did not get into a timeline. Do we perhaps want to increase its priority due to the switch to SBR? If so, we should also identify how much data is to be compared, how frequently, and what we can do if we detect differences.

I believe @Ladsgroup had a PoC for comparing tables/data, but I am not fully sure what state it is in; it's been a while.
For a basic comparison we could use the user table, as it is relatively fast and it would be rare for it not to have any additions every 24h.
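
As an illustration of what such a basic comparison could look like, here is a minimal sketch that hashes a table row by row on two instances and compares the digests. It is not the PoC mentioned above: it uses in-memory SQLite as a stand-in for MariaDB so that it runs anywhere, and the two-column `user` schema and the `table_checksum` helper are assumptions made for the example.

```python
import hashlib
import sqlite3


def table_checksum(conn, table, key_column):
    """Hash every row of `table`, read in `key_column` order.

    Identifiers are interpolated into the query, so only pass trusted names.
    """
    digest = hashlib.sha256()
    cursor = conn.execute(f"SELECT * FROM {table} ORDER BY {key_column}")
    for row in cursor:
        digest.update(repr(row).encode("utf-8"))
    return digest.hexdigest()


def make_instance(rows):
    """Build a throwaway in-memory database with a simplified user table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE user (user_id INTEGER PRIMARY KEY, user_name TEXT)")
    conn.executemany("INSERT INTO user VALUES (?, ?)", rows)
    return conn


primary = make_instance([(1, "Alice"), (2, "Bob")])
in_sync = make_instance([(1, "Alice"), (2, "Bob")])
drifted = make_instance([(1, "Alice"), (2, "Bobby")])

reference = table_checksum(primary, "user", "user_id")
print("in_sync matches:", table_checksum(in_sync, "user", "user_id") == reference)
print("drifted matches:", table_checksum(drifted, "user", "user_id") == reference)
```

On a live replica the comparison would also have to account for replication lag, for example by comparing only up to a fixed primary key value, and a large table would likely be hashed in primary key ranges so that a mismatch can be narrowed down.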

Change #1147424 merged by Marostegui:

[operations/puppet@production] mariadb: Move s5 to SBR

https://gerrit.wikimedia.org/r/1147424

Mentioned in SAL (#wikimedia-operations) [2025-05-19T10:14:53Z] <marostegui> Move eqiad s5 replicas (except sanitarium master and backup sources) to SBR dbmaint T383795

In the discussion around T207253 (comparing data across instances) we did not get into a timeline. Do we perhaps want to increase its priority due to the switch to SBR? If so, we should also identify how much data is to be compared, how frequently, and what we can do if we detect differences.

I believe @Ladsgroup had a PoC for comparing tables/data, but I am not fully sure what state it is in; it's been a while.
For a basic comparison we could use the user table, as it is relatively fast and it would be rare for it not to have any additions every 24h.

It's blocked on making the table catalog fully canonical so it can be used.

Change #1151185 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db2*: Remove sanitarium masters

https://gerrit.wikimedia.org/r/1151185

Change #1151185 merged by Marostegui:

[operations/puppet@production] db2*: Remove sanitarium masters

https://gerrit.wikimedia.org/r/1151185

Mentioned in SAL (#wikimedia-operations) [2025-05-28T12:38:55Z] <marostegui> dbmaint x3 codfw make it SBR T383795

Change #1152070 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db2211,db2228: Make them SBR

https://gerrit.wikimedia.org/r/1152070

Change #1152070 merged by Marostegui:

[operations/puppet@production] db2211,db2228: Make them SBR

https://gerrit.wikimedia.org/r/1152070

Change #1153550 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] s6 codfw: Migrate to SBR

https://gerrit.wikimedia.org/r/1153550

Change #1153550 merged by Marostegui:

[operations/puppet@production] s6 codfw: Migrate to SBR

https://gerrit.wikimedia.org/r/1153550

Mentioned in SAL (#wikimedia-operations) [2025-06-04T08:28:21Z] <marostegui> Change s6 codfw dbmaint to SBR T383795

Mentioned in SAL (#wikimedia-operations) [2025-06-04T08:38:05Z] <marostegui> Change s6 eqiad dbmaint to SBR T383795

Change #1153552 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] s6 eqiad: Migrate to SBR

https://gerrit.wikimedia.org/r/1153552

Change #1153552 merged by Marostegui:

[operations/puppet@production] s6 eqiad: Migrate to SBR

https://gerrit.wikimedia.org/r/1153552

Change #1154032 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] mariadb s2 codfw: Migrate to SBR

https://gerrit.wikimedia.org/r/1154032

Mentioned in SAL (#wikimedia-operations) [2025-06-05T13:51:31Z] <marostegui> Migrate s2 codfw to SBR dbmaint T383795

Change #1154032 merged by Marostegui:

[operations/puppet@production] mariadb s2 codfw: Migrate to SBR

https://gerrit.wikimedia.org/r/1154032

Change #1154775 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] mariadb: Migrate s2 eqiad to SBR

https://gerrit.wikimedia.org/r/1154775

Mentioned in SAL (#wikimedia-operations) [2025-06-09T09:42:55Z] <marostegui> Migrate s2 eqiad dbmaint to SBR T383795

Change #1154775 merged by Marostegui:

[operations/puppet@production] mariadb: Migrate s2 eqiad to SBR

https://gerrit.wikimedia.org/r/1154775

Change #1166969 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] s3 codfw: Migrate to SBR

https://gerrit.wikimedia.org/r/1166969

Change #1166969 merged by Marostegui:

[operations/puppet@production] s3 codfw: Migrate to SBR

https://gerrit.wikimedia.org/r/1166969

Mentioned in SAL (#wikimedia-operations) [2025-07-08T05:52:43Z] <marostegui> Migrate s3 codfw to SBR T383795

Change #1167142 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] s3 eqiad: Migrate to SBR

https://gerrit.wikimedia.org/r/1167142

Change #1167142 merged by Marostegui:

[operations/puppet@production] s3 eqiad: Migrate to SBR

https://gerrit.wikimedia.org/r/1167142

Mentioned in SAL (#wikimedia-operations) [2025-07-08T07:54:21Z] <marostegui> Migrate s3 eqiad to SBR T383795

Marostegui updated the task description. (Show Details)

This is finally done