
Look into Maria 10 parallel-replication
Closed, Resolved · Public

Description

https://mariadb.com/kb/en/mariadb/documentation/replication/standard-replication/parallel-replication/

This helps with shards like s3 (multiple DBs per server) and with group commits off the bat. How many workers should we use? Is out-of-order replication also OK for MediaWiki?

Event Timeline

aaron raised the priority of this task to Needs Triage.
aaron updated the task description.
aaron added a project: MediaWiki-Core-Team.
aaron moved this task to Backlog on the MediaWiki-Core-Team board.
aaron set Security to None.
aaron added a subscriber: Springle.
aaron subscribed.

We've been trialing this, along with multi-source replication, for a few months between sanitarium and labsdb, as that's the only place where we have a MariaDB 10 master and slave so far.

The ideal number of parallel worker threads is likely to differ between shards.
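For reference, a minimal my.cnf sketch of what per-shard tuning could look like (the values are illustrative, not recommendations; the variable names are MariaDB 10.0's):

    [mysqld]
    # Number of parallel applier worker threads on the slave (0 = disabled).
    slave_parallel_threads = 4
    # Per-worker limit, in bytes, on how far the SQL thread reads ahead in the
    # relay log looking for transactions that can be applied in parallel.
    slave_parallel_max_queued = 131072
    # Optional cap on how many of those workers a single replication domain
    # (or master connection) may occupy.
    slave_domain_parallel_threads = 2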

Pros:

  • Improved group commit with gtid + domain ids.
  • Supports configurable read-ahead of the relay log -- really nice, and potentially more important than all the rest.
  • Out-of-order is useful for doing certain schema changes without replag.

Cons:

  • Not quite a silver bullet. Even with out-of-order, there are still situations where, if one binlog stream lags, everything lags until it catches up.
  • The mess with labsdb falling out of sync recently was a multi-source replication bug (multi-source and parallel code are inherently interlinked).

+1 to researching MediaWiki's tolerance for out-of-order.

aaron added a project: Sustainability.
aaron added a subscriber: Gilles.

Comparing https://tendril.wikimedia.org/host/view/db2016.codfw.wmnet/3306 and https://tendril.wikimedia.org/host/view/db1072.eqiad.wmnet/3306 suggests that the primary enwiki slaves are loaded enough around some peak times that they lag more than normal (20 sec vs the usual 1-2 sec). Given the massive qps of the two main slaves (https://dbtree.wikimedia.org/), I wonder if we should add another slave (or at least shuffle the weight/grouping around).

Parallel replication may or may not help there, depending on whether there is just huge I/O saturation.

I believe we switched the slaves over to 10, right? What benefits are we getting already, if any?

Although db1055 had a spike at 01:00 today similar to db1072 (https://tendril.wikimedia.org/chart?hosts=db10%2855%29&vars=^seconds_behind_master&mode=delta). Even with its low qps (watchlist queries mostly), it fell behind then.

So it's not just a matter of 72/73 not keeping up. That's probably good news, since the cause is likely some awful write query that we have control over. I wonder if it comes from some script on a cron job not using wfWaitForSlaves.

Tempted to at least partly blame the misc::maintenance::purge_abusefilter job. It runs at 1AM daily and uses LIMIT with UPDATE, which we don't support, so the LIMIT is ignored... patch incoming.
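For illustration, the batching idea would look roughly like this (the table and column names are illustrative only, not the actual AbuseFilter schema or the pending patch): select a bounded set of primary keys, update just those rows, wait for slaves, and repeat, so each transaction replicates as a small unit rather than one huge write.

    -- Hypothetical sketch of a batched purge; names and values are illustrative.
    SELECT afl_id
      FROM abuse_filter_log
     WHERE afl_ip != ''
       AND afl_timestamp < '20150101000000'
     ORDER BY afl_id
     LIMIT 1000;

    -- Then, for only the ids returned above:
    UPDATE abuse_filter_log
       SET afl_ip = ''
     WHERE afl_id IN (/* ids from the batch above */);

    -- ...wait for slaves to catch up (wfWaitForSlaves() on the MediaWiki side),
    -- then repeat until nothing matches.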

Taking that issue to task https://phabricator.wikimedia.org/T95382?workflow=create.

db1072 and 73 QPS loads look high, but they aren't really struggling even at peak. I'd tend to agree that most replag spikes are due to slow writes, and not individual slave traffic.

That said, we will certainly assign another S1 slave for general (non API/watchlist/pager/vslow) traffic once the next batch of EQIAD boxes is upgraded in the new FY and we can spread them across more racks.

Re MariaDB 10 benefits: There isn't any obvious leap in performance to point to compared to 5.5. Some stuff is faster, some isn't. The biggest benefits have been:

  • The thread pool. Decoupling max_connections (now 5000, and could be more) from the maximum number of threads (512) has definitely helped during spikes; see the config sketch after this list.
  • The persistent index statistics tables. These are being generated on dbstore1001 and exported to slaves for some wikis. They've definitely helped address some optimizer edge cases.
  • More options for online schema changes are nice, though not really any faster than using Percona Toolkit.
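A hedged my.cnf sketch of the settings referred to above (the 5000 and 512 figures come from this comment; everything else is illustrative):

    [mysqld]
    # Thread pool: client connections no longer map 1:1 to server threads.
    thread_handling         = pool-of-threads
    thread_pool_max_threads = 512      # cap on pool threads (the "512" above)
    max_connections         = 5000     # now decoupled from the thread count
    # Persistent index statistics, stored in the mysql.innodb_*_stats tables,
    # which can be generated on one host and imported on the slaves.
    innodb_stats_persistent = ON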

Looking at http://kristiannielsen.livejournal.com/18435.html it seems like even in-order could help a lot. I'd be curious how that goes before even considering out-of-order.

I have disabled parallel replication on all labsdb hosts. It was causing replication to get blocked.

I would focus right now on multi-tenant replication (replicating different wikis on different threads) and reevaluate group-commit-dependent replication later.

I've researched options. We want to enable GTID replication. Parallel replication seems to be tested and working correctly only on InnoDB, so we have to leave labs and analytics out of the picture for now.

GTID replication (T133385) will allow us to set different domains for data streams. Those streams (assuming they are independent) will be executed in parallel, with up to the maximum number of workers on the slave. There are 3 main applications I can see for that:

  • Different masters having different domains. Not very interesting, as multi-master will only be done in a small number of cases (analytics, labs, maybe failovers).
  • Maintenance operations done in parallel - combined with online schema changes, this will allow us to execute ALTER TABLEs on the master under a different domain without affecting regular replication (see the sketch after this list). This will not solve schema changes in all cases, but it will in the most common case.
  • Different wikis replicating in parallel - this is the most interesting case, and one of the largest issues on s3. Assuming wikis' writes are not affected by each other, that would basically create an independent channel for each wiki. One domain per wiki may be too much (800 channels?), but that can be tested. The main issue is that, to differentiate channels, the application must set the domain manually (which means a SET command on every write, which can potentially increase latency).
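A rough sketch of the maintenance case (the domain id, table, and index names are illustrative; the statements themselves are standard MariaDB GTID syntax):

    -- On the master: run the DDL under its own replication domain so it
    -- replicates as an independent GTID stream.
    SET SESSION gtid_domain_id = 99;    -- illustrative maintenance domain
    ALTER TABLE some_table ADD INDEX some_idx (some_column);
    SET SESSION gtid_domain_id = 0;     -- back to the default domain

    -- On the slaves: out-of-order execution of independent domains only
    -- happens when parallel workers are enabled.
    STOP SLAVE;
    SET GLOBAL slave_parallel_threads = 4;
    START SLAVE;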

Of course, there are still the in-order (group commit) and out-of-order replication options; of the two, in-order is the one I would be more excited about (it may not gain much, but out-of-order is risky).

Aaron, do you think it is worth going for the domain/channel option? If yes, I would change this task to "test MariaDB 10 domain-based parallel replication". On the downside, pt-heartbeat may stop being reliable because it would go on a different channel (or it would have to be changed to run once per channel).

LoadBalancer assumes one lag per server. This is nice because:

  • It is easy to cache (one value per server in memcached/apc)
  • It avoids complexity such as trying to figure out whether to (a) just call selectDB() to reuse a connection or (b) make a new connection. If wiki1 is not lagged, and you get a connection there, it's nice being able to reuse that connection for wiki2 if it's on the same server. Varying lags would mean a separate heartbeat query and possibly another connection to a less-lagged (for that wiki) slave, as sketched below. It could be written, I suppose...
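To make the "separate heartbeat query" point concrete, per-channel lag tracking could look something like the sketch below. This is purely hypothetical: it assumes a pt-heartbeat-style table extended with a domain_id column (one row per replication channel), which is not how the current single-lag setup works.

    -- Hypothetical: heartbeat table with one row per replication domain.
    SELECT domain_id,
           TIMESTAMPDIFF(SECOND, MAX(ts), UTC_TIMESTAMP()) AS lag_seconds
      FROM heartbeat.heartbeat
     GROUP BY domain_id;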

Schema change speed isn't too interesting to me. I can't really speak for labs and analytics, though I know the latter have some scalability issues at the moment.

I'd like to first try less "invasive" things like in-order replication and fixing all the issues on T95501.

I am now testing parallel replication (conservative method) on db1073.

I will leave it on for a while to see how it behaves (especially, checking that no data corruption happens due to our unsafe statements).

I forgot to say that I set it to 5 at random, as I don't have a fact-based reason to pick any other value, although based on the implementation, it probably has to be a small number.
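For the record, turning this on amounts to something like the following (a sketch, not the exact change applied to db1073; the mode variable only exists on MariaDB 10.1+):

    STOP SLAVE;
    SET GLOBAL slave_parallel_threads = 5;            -- the "5" mentioned above
    -- On 10.1+ the method is explicit and defaults to conservative:
    SET GLOBAL slave_parallel_mode = 'conservative';
    START SLAVE;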

We would need to enable parallel replication for: T130067

This thread is more about group commit and several synchronous-commit techniques/features. While GTID domain-based replication is parallel replication, as Aaron said above, it is not going to be deployed for MediaWiki, only for maintenance tasks. I am just clarifying this for performance. I wonder how that will affect the GTID issues, as it will likely generate a new GTID string.

db2062 now has MariaDB 10.1 as a test, which enables conservative parallel replication by default. We will see how stable and reliable (data-wise) that is, and if it really helps in our specific use case.
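A quick way to confirm what the 10.1 test host is actually doing (standard MariaDB statements, nothing db2062-specific):

    -- Current parallel-replication settings:
    SHOW GLOBAL VARIABLES LIKE 'slave_parallel%';
    -- Per-connection view, including the Parallel_Mode and
    -- Seconds_Behind_Master fields:
    SHOW ALL SLAVES STATUS\G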

Marostegui assigned this task to jcrespo.

Closing this task; it's been a while and Parallel_Mode: conservative is the default everywhere. I don't think we are moving to any other mode any time soon.
We've not investigated it further for deploying schema changes in parallel; not being 100% confident with GTID also plays a role here.
Please re-open if needed.