
Implement mariadb 10.0 masters
Closed, ResolvedPublic

Description

We already have s2 with mariadb10 masters; implement that upgrade on the other shards.

The last 5.5 slave in s[1-7] will shortly be upgraded, pending some table partitioning, so it's time to think about 10.0 on masters.

  • Do we trust 10.0 enough?
  • Which shard should we do first?
  • Should we attempt to trial 10.0 under real(ish) master load (thread pool, concurrency, not just serial replicated transactions, etc)?

Once we decide to move ahead, there is other stuff we could achieve at the same time, since we're doing master rotations anyway:

  • Fix the remaining tables with unique/primary keys on promoted slaves.
  • Formally switch to MIXED or ROW (currently prod cnf is MIXED) -- and are we ready across the whole tree?
  • Same for GTID.
  • Do any shards need rebalancing or splitting (e.g., any wikis to move from s3 to ??)
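As a rough sketch of the binlog-format and GTID parts of that checklist (hedged examples only; the exact rollout order and scope would need to be agreed first):

```sql
-- Check the current binlog format on a master (prod cnf says MIXED).
SHOW GLOBAL VARIABLES LIKE 'binlog_format';

-- Switching to ROW can be done dynamically, but is only safe once every
-- slave in the whole tree can apply row-based events:
SET GLOBAL binlog_format = 'ROW';

-- On MariaDB 10.0, the GTID binlog position can be inspected with:
SHOW GLOBAL VARIABLES LIKE 'gtid_binlog_pos';
```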

Other questions:

  • Is Sanitarium affected at all?
  • Do we want to consider a better HA solution for masters? Automatic failover might still be unwise for MediaWiki, but what else would help?

Event Timeline

Springle created this task.Jul 8 2015, 10:49 AM
Springle raised the priority of this task from to Needs Triage.
Springle updated the task description.
Springle added subscribers: Springle, jcrespo.
Restricted Application added a subscriber: Aklapper. Jul 8 2015, 10:49 AM

To start some discussion: the questions are easy, the details are not.

Do we trust 10.0 enough?

Yes, once we have done proper testing (profiling that is already ongoing, and more will be done).

Which shard should we do first?

All shards are production, so it doesn't matter. The important thing is having a plan to roll back the version and the changes (each one separately).

Should we attempt to trial 10.0 under real(ish) master load (thread pool, concurrency, not just serial replicated transactions, etc)?

Yes.

My main issue with 10 is that we need GTID-like features, but we cannot close our doors by chaining ourselves to diverging implementations and exclusive features. But that is not what this ticket is about.

(Please associate a project with this task if possible so it can be found in project searches / workboards. Is this Operations?)

Springle set Security to None.
Restricted Application added a subscriber: Matanya. Jul 20 2015, 2:30 AM

All s1-7 slaves are now 10.0.

Which shard should we do first?

All shards are production, so it doesn't matter. The important thing is having a plan to roll back the version and the changes (each one separately).

Agree in theory (and technically), but in terms of potential fallout after a mistake, they are not equal:

  • s1 and s4 are huge and would be painful to get wrong (in terms of upset users/public)
  • s7 has centralauth, so problems would affect all logged-in users on all shards
  • s5 is wikidata, which has non-standard write load and replag
  • s3 has many more tables/file handles (maybe not a big deal)

I'm thinking that doing s2 or s6 first would be safest.

My main issue with 10 is that we need GTID-like features, but we cannot close our doors by chaining ourselves to diverging implementations and exclusive features. But that is not what this ticket is about.

I think this ticket should be about that, because if we do not discuss it now, before the migration is complete... then when :-)? So, please feel free to throw up road blocks now!
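To make the divergence concrete (a hedged sketch; the point is that MariaDB and MySQL GTIDs are not interchangeable, which is the lock-in concern above):

```sql
-- MariaDB 10.0: GTIDs are domain-server-sequence triples, and a slave
-- opts in per connection:
CHANGE MASTER TO MASTER_USE_GTID = slave_pos;

-- MySQL 5.6: GTIDs are uuid:sequence pairs, require gtid_mode=ON across
-- the whole tree, and use a different statement on the slave:
-- CHANGE MASTER TO MASTER_AUTO_POSITION = 1;
```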

Things I think are big issues:

  • Multisource replication (it does not work well; bugs with parallel replication?)
  • TokuDB (the versions we use have very important bugs, but I do not know if all of them are fixed in the latest versions)
  • Filtering (works well, but makes some tools more complex to use and error-prone)

None of these, I think, are related to masters:

I think we can avoid filtering, at least on production and maybe on labs, with some work.

TokuDB is not used except on dbstores/labs. We should update those to the latest version. Maybe for some of them (research) we should evaluate alternatives for OLAP.
For storage savings, I think InnoDB compressed was used for a while (what was your experience like?).
It would also be lovely to have MASTER_DELAY on those, which is not available on MariaDB 10.0, only on MySQL 5.6.
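For reference, the delayed-replication feature mentioned here looks like this on MySQL 5.6 (a sketch; since MariaDB 10.0 has no equivalent, the dbstores would need an external tool such as Percona's pt-slave-delay instead):

```sql
-- MySQL 5.6 only: keep this slave a fixed number of seconds behind
-- its master (here, one hour).
STOP SLAVE;
CHANGE MASTER TO MASTER_DELAY = 3600;
START SLAVE;
```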

Multisource is the big issue; there is an alternative on MySQL 5.7, but that is not a "short term" thing.

The main issue for me is GTID and ROW, but I do not have an answer to those yet, plus there is a dependency on Sanitarium.

Now, my question is: why replace 5.5 now? It works well and should work on jessie. Is there any specific issue we want to solve that requires the upgrade? (This is a rhetorical question, but I want to know what the main pain points are for you: queuing, online alters?) HA is needed, but it is independent of the master version.

jcrespo triaged this task as Normal priority.Jul 20 2015, 3:51 PM
jcrespo moved this task from Triage to Backlog on the DBA board.

I am trying to get a list of blockers (we can edit the task description):

  • Prepare a deployment and rollback plan
  • Perform a table checksum on all affected servers
  • Find the best node on each shard to failover to (may need reinstall)
  • Create/review the role for mariadb production master
  • Test write load (for example, with pt-upgrade or pt-table-checksum)
  • Fix pending issues with jessie installer [soft block]
  • Setup proxies and put them into production [soft block]
  • Add extra monitoring only available on the mysql::coredb role [soft block]
    • Ganglia mysql parameters
    • Extra Nagios checks (evaluate things like pt-heartbeat, read-only mode, or strange query states)

Some of these can be done in small increments (only one shard or server at a time).
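A minimal sketch of what the extra monitoring checks could query (assuming pt-heartbeat writes to its default heartbeat.heartbeat table; the exact queries and thresholds are illustrative, not decided):

```sql
-- A master should not be read-only; every slave should be.
SELECT @@global.read_only;

-- Replication lag as seen by pt-heartbeat (default table assumed):
SELECT TIMESTAMPDIFF(SECOND, MAX(ts), UTC_TIMESTAMP()) AS lag_seconds
FROM heartbeat.heartbeat;
```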

Adding the replication error as a blocking task; let's not rush, and let's rule out any replication issues with 10 masters first.

RobH changed the status of subtask T109116: dbproxy servers for codfw from Open to Stalled.Aug 18 2015, 7:26 PM
Restricted Application added a project: codfw-rollout. Jan 25 2016, 7:57 PM

Doing s2 next week due to T122048

jcrespo claimed this task.Jan 28 2016, 9:08 PM

We will switchover from db1024 -> db1018

jcrespo raised the priority of this task from Normal to High.Jan 28 2016, 9:09 PM
jcrespo renamed this task from prepare for mariadb 10.0 masters to Implement mariadb 10.0 masters.Mar 8 2016, 2:11 PM
jcrespo updated the task description.
jcrespo moved this task from Backlog to Next on the DBA board.Mar 23 2016, 2:14 PM
jcrespo moved this task from Next to In progress on the DBA board.Apr 15 2016, 1:13 PM

All s* servers now have mariadb10 masters. We are keeping the old 5.5 masters in case a rollback is needed.

There are still some core servers on 5.5, mainly x1.

Change 284455 had a related patch set uploaded (by Volans):
Change eqiad masters for s1,s3-s7

https://gerrit.wikimedia.org/r/284455

Change 284455 merged by jenkins-bot:
Change eqiad masters for s1,s3-s7

https://gerrit.wikimedia.org/r/284455

Mentioned in SAL [2016-04-20T14:56:36Z] <volans@tin> Synchronized wmf-config/db-eqiad.php: Change eqiad masters for s1,s3-s7 - T105135 (duration: 00m 28s)

jcrespo closed this task as Resolved.Apr 22 2016, 4:24 PM

Done, though not without some issues: T133309

Pending tasks tracked separately: T109179 T133385