
Meta ticket: Migrate multi-source database hosts to multi-instance
Closed, Resolved · Public

Description

During some of the discussions between @jcrespo and myself, we have been talking about the future of multi-source in our infrastructure, and whether it is worth pursuing or whether it is better to move towards multi-instance (several MySQL processes running on different ports).

There are several pros and cons to each design, and at the moment we have both solutions in place:

  • dbstore servers + labsdb servers + db1095 (new sanitarium) -> multi-source
  • db1069 (old sanitarium) -> multi-instance
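For readers less familiar with the two designs, here is a minimal sketch of what a multi-instance layout can look like, e.g. in a mysqld_multi-style config (the instance names, ports, and paths are purely illustrative, not our actual configuration):

```ini
; Hypothetical mysqld_multi-style layout: each replica set gets its own
; mysqld process, with a dedicated port, socket, and datadir.
[mysqld1]
port    = 3311
socket  = /run/mysqld/mysqld.s1.sock
datadir = /srv/sqldata.s1

[mysqld2]
port    = 3312
socket  = /run/mysqld/mysqld.s2.sock
datadir = /srv/sqldata.s2
```

Multi-source, by contrast, is a single mysqld process replicating from several masters over named replication connections.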

We have found a bug in MariaDB (https://jira.mariadb.org/browse/MDEV-12012) that doesn't allow us to turn on GTID on multi-source just yet, which prevents us from having "crash safe" replication.
This means that if a multi-source slave crashes, it might corrupt data and we might need to rebuild it. A multi-source slave is, by nature, a server with lots of data to import, so that work can take weeks to complete.
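As background, MariaDB multi-source replication uses named connections, and crash-safe positioning comes from MASTER_USE_GTID (the slave position is then stored transactionally in mysql.gtid_slave_pos). A hedged sketch of the combination that MDEV-12012 blocked (hostnames are illustrative only):

```sql
-- Hypothetical multi-source setup: one named connection per master.
CHANGE MASTER 's1' TO
  MASTER_HOST = 'db-s1.example.org',
  MASTER_USE_GTID = slave_pos;  -- crash-safe position tracking
CHANGE MASTER 's2' TO
  MASTER_HOST = 'db-s2.example.org',
  MASTER_USE_GTID = slave_pos;
START ALL SLAVES;
```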

Those issues have triggered the discussion of which direction we should head in: multi-source vs multi-instance.

This is just the epic ticket so we can discuss and try to decide what we want to go for in the near future.

Feel free to change the description if needed.

Related Objects

Status · Subtype · Assigned · Task
ResolvedNone
ResolvedMarostegui
ResolvedMarostegui
Resolvedjcrespo
Resolvedjcrespo
ResolvedMarostegui
ResolvedMarostegui
ResolvedMarostegui
ResolvedMarostegui
ResolvedMarostegui
DeclinedNone
ResolvedMarostegui
ResolvedMarostegui
ResolvedMarostegui
ResolvedMarostegui
Resolvedjcrespo
ResolvedMarostegui
ResolvedRobH
ResolvedMarostegui

Event Timeline

Restricted Application added a subscriber: Aklapper. · Mar 2 2017, 10:01 AM
Marostegui moved this task from Triage to Meta/Epic on the DBA board. · Mar 2 2017, 10:01 AM

The issue we see on dbstore2001 (T165033) could probably be mitigated by having multi-instance rather than multi-source.

Another issue was faced when trying to import compressed tablespaces into a host (really only needed for multi-source hosts, as the others can be rebuilt with a physical copy): T153743#3299482

Does it make sense to keep this open if we have already decided to go for multi-instance?

Can we rename it to "Implement X" #meta ?

jcrespo renamed this task from Meta ticket: The future of multi source replication slaves vs multi instance ones. to Meta ticket: Migrate multi-source database hosts to multi-instance. · Jul 3 2017, 10:24 AM
Marostegui triaged this task as Medium priority. · Jul 3 2017, 10:24 AM

Only dbstore1002 is pending; we are in talks with Analytics to replace it soon. CC @Ottomata @elukey (not sure if there is a ticket already).

We had a chat with the Research team during the offsite and they are on board with having a multi host/instance version of dbstore1002, but we still haven't reached out to all the data analysts to get their opinion. Some preliminary questions:

  1. Does Analytics need to request hardware for the next FY for a multi-instance dbstore1002?
  2. How soon are you guys planning to work on this?

The Analytics team's vision is to move all our data to HDFS and eventually move all our users away from MySQL, both for the log database and for all the ones on dbstore1002. We would like to implement this plan during the next two years, which corresponds to the expected lifespan of the analytics databases.

@elukey we have done our math about the dbstore1002 replacement and redundancy for the next two years, and this is what we came up with:

  • If we want to keep dbstore1002, we'd need 3 hosts for redundancy.
  • If we DO NOT want to keep dbstore1002, we'd need 6 hosts for replacement + redundancy.

I guess you'd need to create a procurement ticket with whatever you decide (standard DB hosts, meaning 512G RAM + SSDs).

First of all, thanks a lot for all the work put into figuring out a good config!

I had a chat with my team and we would prefer not to add budget for all 6 nodes, but for something "cheaper" like 2/3 nodes max (possibly without re-using dbstore1002, which is getting a bit old). As far as we know, people will be fine if some databases hosted on a broken host are unavailable for 2 weeks of maintenance, so given this assumption, would you guys be OK with provisioning a less redundant set of hosts?

Sorry for the extra pain points :(

So just to be clear, you just want 3 hosts and also decommission dbstore1002 - meaning that there will be no redundancy for the service. Is that understanding correct?

Basically yes, the bare minimum set of hosts that would allow us to replace dbstore1002. We don't know exactly how many people are heavily relying on dbstore1002 at the moment for their daily work, but from all the feedback we got, it seems tolerable to have 2/3 weeks of downtime if a major hardware issue comes up.

This absolutely doesn't mean that we (as Analytics) don't care about the service, but due to the high budget that a fully replicated environment would require, we'd prefer to spend time and resources on:

  1. replacing dbstore1002 with a "similar" configuration (no redundancy at the moment)
  2. accelerating the work to eventually move all our use cases to HDFS only.

Let me know if you have any doubts or suggestions :)

All clear! Thanks :-)
Then 3 hosts should be it.

Or 2 + eventlogging, more realistically, with the current budget.

@Marostegui I wanted to set up db1113 as sanitarium multiinstance, but I see sanitarium_multiinstance.pp is very poorly puppetized. Compare it to core_multiinstance.pp.

Yeah, it was done quickly to get us out of the woods. It needs some care and time, which we might not have right now :(

jcrespo added a subtask: Unknown Object (Task). · Mar 26 2018, 8:02 PM
Marostegui changed the status of subtask Unknown Object (Task) from Open to Stalled. · Apr 2 2018, 6:03 AM
RobH added a subtask: Unknown Object (Task). · May 15 2018, 4:46 PM
RobH changed the status of subtask Unknown Object (Task) from Stalled to Open.

Just wanted to confirm that the Analytics team will go forward with the replacement of dbstore1002 with a multi-instance setup during the upcoming fiscal year. I'll keep updating T172410 with more Analytics-related info.

Awesome news! Thanks for the heads up!

jcrespo added a subscriber: faidon. · Jun 12 2018, 7:05 AM

@elukey: Just wanted to confirm that the Analytics team will go forward with the replacement of dbstore1002 with a multi-instance setup during the upcoming fiscal year. I'll keep updating T172410 with more Analytics-related info.

cc @faidon ^

RobH closed subtask Unknown Object (Task) as Resolved. · Jun 22 2018, 7:09 PM
RobH closed subtask Unknown Object (Task) as Resolved.
elukey mentioned this in Unknown Object (Task). · Jun 26 2018, 9:05 AM

Just for the record.
The only hosts we have running with multi-source at the moment are:

  • labsdb1009-1011
  • dbstore1002 (which will be gone once the new HW is bought and set up)

@jcrespo @Marostegui: we have been discussing the future setup of dbstore1002's replacements, and I have a question about accounts. The famous research user is an umbrella account that we use from the stat machines, and it has been bad for us for several reasons (no real accountability, one password shared by many users, etc.). Would it be possible to avoid re-creating it on the new DB hosts, and have something like passwordless auth via unix socket for all the users that are now using the research account? Or any other feasible solution that you have in mind. This would really be awesome for Analytics, since we'd be able to delete the research account and know our users and their query patterns, to eventually move them to Hadoop completely.

Does it sound reasonable?

@elukey the only way to use some sort of password-less auth for MySQL would indeed be unix socket authentication; however, users would need to be logged in to the server (via ssh) to be able to use it, as it can only be used locally :-(
How many users do you have/will you have? The Cloud Team developed a script to create and manage users and passwords; each user then gets assigned to a profile which has certain grants.
This still doesn't solve your issue, as each user still has a password, but at least you can track each user's queries individually.
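For reference, this is roughly what unix socket authentication looks like in MariaDB: the unix_socket plugin maps a MySQL account onto the identically-named OS user of the connecting local process, which is exactly why it cannot work for remote clients. A sketch with a hypothetical user and grant:

```sql
-- Hypothetical example: 'alice' can log in only from a local shell
-- whose OS user is also 'alice'; no password is stored at all.
INSTALL SONAME 'auth_socket';
CREATE USER 'alice'@'localhost' IDENTIFIED VIA unix_socket;
GRANT SELECT ON enwiki.* TO 'alice'@'localhost';
```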

elukey added a comment (Edited). · Sep 24 2018, 7:26 AM

This is a good question. I don't know a precise number, but I'd say between 10 and 20? (very quick guess). My main goal is to figure out who is using these databases and why.

It shouldn't be a huge issue to ask users to have their own password, if there is automation around it :)

You might want to talk to @bd808 @chasemp or @Bstorm to get some more details about that script they use to handle those users.

If you mean "Can analytics prepare and setup a different authentication backend/system for the analytics MySQL Server? Will you help us do that?" the answer is "of course". MySQL supports many authentication methods, or you can check/ask how cloud manages its own for wikireplicas.

Socket authentication is out of the question, that only works for local login, which we will not allow for analytics (only roots should have local host login).

If you mean "Can analytics prepare and setup a different authentication backend/system for the analytics MySQL Server? Will you help us do that?" the answer is "of course". MySQL supports many authentication methods, or you can check/ask how cloud manages its own for wikireplicas.

All right I'll follow up with them :)

Socket authentication is out of the question, that only works for local login, which we will not allow for analytics (only roots should have local host login).

I imagined but wanted to double check with you guys first. Thanks!

Marostegui closed this task as Resolved. · Sep 28 2018, 2:50 PM

This ticket was meant to be a discussion ticket but turned into a tracking ticket.
We are going to close it, knowing that what is pending to migrate is:

  • dbstore1002 -> "ongoing"; there is HW that will hopefully be bought soon.
  • labsdb10XX hosts -> will take a few years to migrate and a full re-architecture, as users won't be able to JOIN wikis from different sections directly on the DB.

Thanks to everyone who helped make multi-instance a reality on core and misc, using resources much more efficiently!