Page MenuHomePhabricator

Migrate MySQLs to use ROW-based replication
Open, NormalPublic

Description

Row based replication has some important advantages:

  • It minimizes in most cases the replication lag (that we suffer in some hosts, sometimes)
  • It minimizes slave drift, that we are suffering now
  • More prone to break replication if a schema or data difference is detected (fails faster)
  • It usually end up with reduced contention, better performance and less locking needed

It has some disadvantages:

  • Increased size of binary logs, that affects both disk space and bandwidth needed. In edge cases (large blobs), it could impact binary logs performance when written to disk (configuration should be tuned)
  • It makes difficult to do one-host-at-a-time schema changes (which is the main mode we do them for mediawiki core hosts right now)
  • More prone to break replication if a schema or data difference is detected (fails faster) - this could be a curse or a blessing
  • Performance may not be great if slaves do not have proper primary keys; it can on the other side improve perforamnce
  • It requires an external tool (mysqlbinlog) in order to know the underlying ongoing queries (for example, if they are stuck)- changes are not shown on show processlist by the system threads
  • Sanitarium requires statement-based binary logs in order to filter rows to labsdbs Not anymore since triggers can happen replica side from 10.1+ (row based replication triggers)
  • It makes impossible or more difficult to use some tools like pt-table-checksum, specially on multi-tiered setups (for codfw, for labs)

This ticket is to decide if this change is worth it, how to do it, where (maybe not all servers require it), when and what blockers there are.

Related Objects

Event Timeline

jcrespo created this task.Aug 15 2015, 2:48 PM
jcrespo updated the task description. (Show Details)
jcrespo raised the priority of this task from to Needs Triage.
jcrespo added subscribers: jcrespo, Springle.
Restricted Application added subscribers: Matanya, Aklapper. · View Herald TranscriptAug 15 2015, 2:48 PM
akosiaris triaged this task as Normal priority.Aug 25 2015, 1:02 PM
akosiaris assigned this task to jcrespo.
akosiaris added a subscriber: akosiaris.

Assigning to @jcrespo as the man for the job

jcrespo updated the task description. (Show Details)Aug 25 2015, 1:04 PM
jcrespo set Security to None.

I would like to do a full-scale test of this feature on codfw to validate it and test its configuration (sadly binlog_row_image is not available until 10.1.6).

Change 256660 had a related patch set uploaded (by Jcrespo):
Add additional variable binlog_format that can be used on templates

https://gerrit.wikimedia.org/r/256660

Change 256660 merged by Jcrespo:
Add additional variable binlog_format that can be used on templates

https://gerrit.wikimedia.org/r/256660

Enwiki on codfw is now using ROW-based replication as a test, to check regressions and compare its performance to mixed (mostly, statement) on eqiad.

Should we continue doing ROW performance testing on codfw at the same time than T121207 is worked on?

jcrespo removed jcrespo as the assignee of this task.Apr 22 2016, 5:06 PM
Phabricator_maintenance renamed this task from Migrate MySQLs to use ROW-based replication (tracking) to Migrate MySQLs to use ROW-based replication.Aug 14 2016, 12:20 AM
Marostegui moved this task from Triage to Meta/Epic on the DBA board.May 31 2017, 7:25 AM

What's the status of this task? The previous comment is from over a year ago.

This is an Epic task and quite hard to achieve in short or even medium term.
To give you an example, row based replication is quite strict with data drifts and can break replication if data isn't exactly the same on all the tables.
We are right now in process of checking out all the data across the shards (running pt-table-checksum and some in house scripts) to evaluate (and fix) any data drifts we have.
The new labs infra does use row based replication.

There are other things we need to evaluate when slowly moving things to row based replication such as binlog sizes, potential regression on replication speed (specially for big data changes) etc

@MZMcBride Note this doesn't depend on us DBAs- changing to row based replication is a one-time change that is instantaneous. However, the application has to work well with it. Mediawiki developers have so far expressed no interest on it as production itself works more or less ok for them, you can see by no comment here by mediawiki hackers. There is nothing we can do without that. Please help us convincing the application developer this is worth seeking.

However, what we can do now is creating an intermediate replica that is in row-based and make labs use row based replication exclusively. That has already happened (it is not allegedly, it is real), and it is working, as it can be seen on the newly setup servers under the temporary dns labsdb-analytics.eqiad.wmnet and labsdb-web.eqiad.wmnet which, as far as I can see, they have been maintained in sync with production from several months already. You can use them already for testing.

Sorry, I misunderstood the scope of this task. I thought this task was about Wikimedia Labs using row-based replication, not Wikimedia production. I think I'm actually looking for T138967: Labs database replica drift.

However, what we can do now is creating an intermediate replica that is in row-based and make labs use row based replication exclusively. That has already happened (it is not allegedly, it is real), and it is working, as it can be seen on the newly setup servers under the temporary dns labsdb-analytics.eqiad.wmnet and labsdb-web.eqiad.wmnet which, as far as I can see, they have been maintained in sync with production from several months already. You can use them already for testing.

In the context of Wikimedia Labs, the word testing is confusing to me. Isn't all of Labs for testing? It's nice to hear that the data integrity issues we've been having on Wikimedia Labs might soon be resolved.

bd808 added a subscriber: bd808.Jun 20 2017, 10:41 PM

In the context of Wikimedia Labs, the word testing is confusing to me. Isn't all of Labs for testing? It's nice to hear that the data integrity issues we've been having on Wikimedia Labs might soon be resolved.

This is a pretty serious threadjack, but no, all of Labs/Cloud Services isn't for testing. The deployment-prep project is for testing MediaWiki and related services. The contint project is for running Jenkins workers that perform tests for code. Tool Labs/Toolforge and many other projects are the only "production" environment for the services that they supply. At least one academic paper has been written about the on-wiki impact of tools being off-line. We try very hard not to break things in the infrastructure for this reason. The replica databases are very much treated as services that should be available and stable as much as possible. The common association of 'labs' and 'experiments' is one of the drivers for rebranding Wikimedia Labs as a suite of products under the Wikimedia Cloud Services team.

Marostegui added a comment.EditedNov 20 2017, 8:28 AM

I don't think migrating to ROW is something we can actually do now after seeing the breakage caused when s5 master died (T180714) and there was a schema change on-going, and the replacement host was a host that was running row based replication.

Having ROW on the master makes some concrete schema changes (ie: adding a column) impossible unless they are done directly on the master and letting it replicate through replication.

We believe that since s5 was accidentally migrated to ROW, the lag is improved; so it did on labsdbs despite not having any kind of replication control, unlike production.

jcrespo updated the task description. (Show Details)Dec 18 2017, 1:47 PM
jcrespo updated the task description. (Show Details)
jcrespo removed a project: Goal.Feb 7 2018, 2:47 PM

This is important, but not a goal for this quarter- we are still blocked on mediawiki extension maintainers to be compatible with it; however, all databases (misc, x1, parsercache, es) have been meanwhile migrated to ROW already with great success.

There are some issues to solve regarding schema changes, but this is still a desirable change; at least to have the option, even if we go for MIXED or STATEMENT.