
Implement mariadb 10.0 masters
Closed, ResolvedPublic

Description

We already have s2 with mariadb10 masters; implement that upgrade on the other shards.

The last 5.5 slave in s[1-7] will shortly be upgraded, pending some table partitioning, so it's time to think about 10.0 on masters.

  • Do we trust 10.0 enough?
  • Which shard should we do first?
  • Should we attempt to trial 10.0 under real(ish) master load (thread pool, concurrency, not just serial replicated transactions, etc)?

Once we decide to move ahead, there is other stuff we could achieve at the same time, since we're doing master rotations anyway:

  • Fix the remaining tables with unique/primary keys on promoted slaves.
  • Formally switch to MIXED or ROW (currently prod cnf is MIXED) -- and are we ready across the whole tree?
  • Same for GTID.
  • Do any shards need rebalancing or splitting (e.g., any wikis to move from s3 to ??)
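As a rough sketch of the binlog-format and GTID parts of that checklist (hedged examples only; the exact rollout order and scope would need to be agreed first):

```sql
-- Check the current binlog format on a master (prod cnf says MIXED).
SHOW GLOBAL VARIABLES LIKE 'binlog_format';

-- Switching to ROW can be done dynamically, but is only safe once every
-- slave in the whole tree can apply row-based events:
SET GLOBAL binlog_format = 'ROW';

-- On MariaDB 10.0, the GTID binlog position can be inspected with:
SHOW GLOBAL VARIABLES LIKE 'gtid_binlog_pos';
```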

Other questions:

  • Is Sanitarium affected at all?
  • Do we want to consider a better HA solution for masters? Automatic failover might still be unwise for MediaWiki, but what else would help?

Event Timeline

Springle created this task.Jul 8 2015, 10:49 AM
Springle raised the priority of this task from to Needs Triage.
Springle updated the task description.
Springle added subscribers: Springle, jcrespo.
Restricted Application added a subscriber: Aklapper. Jul 8 2015, 10:49 AM

To start some discussion: the questions are easy, the details are not.

Do we trust 10.0 enough?

Yes, once we have done proper testing (profiling that is already ongoing, and more will be done).

Which shard should we do first?

All shards are production, so it doesn't matter. The important thing is having a plan to roll back the version and the changes (each one separately).

Should we attempt to trial 10.0 under real(ish) master load (thread pool, concurrency, not just serial replicated transactions, etc)?

Yes.

My main issue with 10 is that we need GTID-like features, but we cannot close our doors by chaining ourselves to diverging implementations and exclusive features. But that is not what this ticket is about.

(Please associate a project with this task if possible so it can be found in project searches / workboards. Is this Operations?)

Springle set Security to None.
Restricted Application added a subscriber: Matanya. Jul 20 2015, 2:30 AM

All s1-7 slaves are now 10.0.

Which shard should we do first?

All shards are production, so it doesn't matter. The important thing is having a plan to roll back the version and the changes (each one separately).

Agree in theory (and technically), but in terms of potential fallout after a mistake, they are not equal:

  • s1 and s4 are huge and would be painful to get wrong (in terms of upset users/public)
  • s7 has centralauth, so problems would affect all logged-in users on all shards
  • s5 is wikidata, which has non-standard write load and replag
  • s3 has many more tables/file handles (maybe not a big deal)

I'm thinking that doing s2 or s6 first would be safest.

My main issue with 10 is that we need GTID-like features, but we cannot close our doors by chaining ourselves to diverging implementations and exclusive features. But that is not what this ticket is about.

I think this ticket should be about that, because if we do not discuss it now, before the migration is complete... then when :-)? So, please feel free to throw up road blocks now!
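To make the divergence concrete (a hedged sketch; the point is that MariaDB and MySQL GTIDs are not interchangeable, which is the lock-in concern above):

```sql
-- MariaDB 10.0: GTIDs are domain-server-sequence triples, and a slave
-- opts in per connection:
CHANGE MASTER TO MASTER_USE_GTID = slave_pos;

-- MySQL 5.6: GTIDs are uuid:sequence pairs, require gtid_mode=ON across
-- the whole tree, and use a different statement on the slave:
-- CHANGE MASTER TO MASTER_AUTO_POSITION = 1;
```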

Things I think are big issues:

  • Multisource replication (it does not work well; bugs with parallel replication?)
  • TokuDB (the versions we use have very important bugs, but I do not know if all of them are fixed in the latest versions)
  • Filtering (works well, but makes some tools more complex to use and error-prone)

None of these, I think, are related to masters:

I think we can avoid filtering, at least on production and maybe on labs, with some work.

TokuDB is not used except on dbstores/labs. We should update those to the latest version. Maybe for some of them (research) we should evaluate alternatives for OLAP.
For storage savings, I think InnoDB compressed was used for a while (what was your experience like?).
It would also be lovely to have MASTER_DELAY on those, which is not available on MariaDB 10.0, only on MySQL 5.6.
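For reference, the delayed-replication feature mentioned here looks like this on MySQL 5.6 (a sketch; since MariaDB 10.0 has no equivalent, the dbstores would need an external tool such as Percona's pt-slave-delay instead):

```sql
-- MySQL 5.6 only: keep this slave a fixed number of seconds behind
-- its master (here, one hour).
STOP SLAVE;
CHANGE MASTER TO MASTER_DELAY = 3600;
START SLAVE;
```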

Multisource is the big issue; there is an alternative on MySQL 5.7, but that is not a "short term" thing.

The main issue for me is GTID and ROW, but I do not have an answer to those yet, plus there is a dependency on Sanitarium.

Now, my question is: why replace 5.5 now? It works well and should work on jessie. Is there any specific issue we want to solve that requires the upgrade? (This is a rhetorical question, but I want to know what the main pain points are for you: queuing, online alters?) HA is needed, but it is independent of the master version.

jcrespo triaged this task as Normal priority.Jul 20 2015, 3:51 PM
jcrespo moved this task from Triage to Backlog on the DBA board.

I am trying to get a list of blockers (we can edit the task description):

  • Prepare a deployment and rollback plan
  • Perform a table checksum on all affected servers
  • Find the best node on each shard to failover to (may need reinstall)
  • Create/review the role for mariadb production master
  • Test write load (for example, with pt-upgrade or pt-table-checksum)
  • Fix pending issues with jessie installer [soft block]
  • Setup proxies and put them into production [soft block]
  • Add extra monitoring only available on the mysql::coredb role [soft block]
    • Ganglia mysql parameters
    • Extra Nagios checks (evaluate things like pt-heartbeat, read-only mode, or strange query states)

Some of these can be done in small increments (only one shard or server at a time).
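A minimal sketch of what the extra monitoring checks could query (assuming pt-heartbeat writes to its default heartbeat.heartbeat table; the exact queries and thresholds are illustrative, not decided):

```sql
-- A master should not be read-only; every slave should be.
SELECT @@global.read_only;

-- Replication lag as seen by pt-heartbeat (default table assumed):
SELECT TIMESTAMPDIFF(SECOND, MAX(ts), UTC_TIMESTAMP()) AS lag_seconds
FROM heartbeat.heartbeat;
```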

Adding the replication error as a blocking task; let's not rush, and let's rule out any replication issues with 10 masters first.

RobH changed the status of subtask T109116: dbproxy servers for codfw from Open to Stalled.Aug 18 2015, 7:26 PM
Restricted Application added a project: codfw-rollout. Jan 25 2016, 7:57 PM

Doing s2 next week due to T122048

jcrespo claimed this task.Jan 28 2016, 9:08 PM

We will switchover from db1024 -> db1018

jcrespo raised the priority of this task from Normal to High.Jan 28 2016, 9:09 PM
jcrespo renamed this task from prepare for mariadb 10.0 masters to Implement mariadb 10.0 masters.Mar 8 2016, 2:11 PM
jcrespo updated the task description.
jcrespo moved this task from Backlog to Next on the DBA board.Mar 23 2016, 2:14 PM
jcrespo moved this task from Next to In progress on the DBA board.Apr 15 2016, 1:13 PM

All s* servers now have mariadb10 masters. We are keeping the old 5.5 masters in case a rollback is needed.

There are still some core servers on 5.5, mainly x1.

Change 284455 had a related patch set uploaded (by Volans):
Change eqiad masters for s1,s3-s7

https://gerrit.wikimedia.org/r/284455

Change 284455 merged by jenkins-bot:
Change eqiad masters for s1,s3-s7

https://gerrit.wikimedia.org/r/284455

Mentioned in SAL [2016-04-20T14:56:36Z] <volans@tin> Synchronized wmf-config/db-eqiad.php: Change eqiad masters for s1,s3-s7 - T105135 (duration: 00m 28s)

jcrespo closed this task as Resolved.Apr 22 2016, 4:24 PM

Done, though not without some issues: T133309

Pending tasks tracked separately: T109179 T133385