
Meta ticket: Migrate multi-source database hosts to multi-instance
Closed, Resolved · Public

Description

During some of the discussions between @jcrespo and myself, we have been talking about the future of multi-source in our infrastructure, and whether it is worth pursuing or whether it is better to move towards multi-instance (several MySQL processes running on different ports).

There are several pros and cons to each design, and at the moment we have both solutions in place:

  • dbstore servers + labsdb servers + db1095 (new sanitarium) -> multi-source
  • db1069 (old sanitarium) -> multi-instance
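For readers less familiar with the two designs, here is a minimal sketch of what a multi-instance layout can look like, e.g. in a mysqld_multi-style config (the instance names, ports, and paths are purely illustrative, not our actual configuration):

```ini
; Hypothetical mysqld_multi-style layout: each replica set gets its own
; mysqld process, with a dedicated port, socket, and datadir.
[mysqld1]
port    = 3311
socket  = /run/mysqld/mysqld.s1.sock
datadir = /srv/sqldata.s1

[mysqld2]
port    = 3312
socket  = /run/mysqld/mysqld.s2.sock
datadir = /srv/sqldata.s2
```

Multi-source, by contrast, is a single mysqld process replicating from several masters over named replication connections.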

We have found a bug in MariaDB (https://jira.mariadb.org/browse/MDEV-12012) that doesn't allow us to turn on GTID on multi-source just yet, which prevents us from having "crash safe" replication.
This means that if a multi-source slave crashes, it might corrupt data and we might need to rebuild it. A multi-source slave is, by nature, a server with lots of data to import, so that work can take weeks to complete.
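As background, MariaDB multi-source replication uses named connections, and crash-safe positioning comes from MASTER_USE_GTID (the slave position is then stored transactionally in mysql.gtid_slave_pos). A hedged sketch of the combination that MDEV-12012 blocked (hostnames are illustrative only):

```sql
-- Hypothetical multi-source setup: one named connection per master.
CHANGE MASTER 's1' TO
  MASTER_HOST = 'db-s1.example.org',
  MASTER_USE_GTID = slave_pos;  -- crash-safe position tracking
CHANGE MASTER 's2' TO
  MASTER_HOST = 'db-s2.example.org',
  MASTER_USE_GTID = slave_pos;
START ALL SLAVES;
```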

Those issues have triggered the discussion of which direction we should head in: multi-source vs multi-instance.

This is just the epic ticket so we can discuss and try to decide what we want to go for in the near future.

Feel free to change the description if needed.

Related Objects

Status · Subtype · Assigned · Task
ResolvedNone
ResolvedMarostegui
ResolvedMarostegui
Resolvedjcrespo
Resolvedjcrespo
ResolvedMarostegui
ResolvedMarostegui
ResolvedMarostegui
ResolvedMarostegui
ResolvedMarostegui
DeclinedNone
ResolvedMarostegui
ResolvedMarostegui
ResolvedMarostegui
ResolvedMarostegui
Resolvedjcrespo
ResolvedMarostegui
ResolvedRobH
ResolvedMarostegui

Event Timeline

Restricted Application added a subscriber: Aklapper. · Mar 2 2017, 10:01 AM
Marostegui moved this task from Triage to Meta/Epic on the DBA board. · Mar 2 2017, 10:01 AM

The issue we see on dbstore2001 (T165033) could probably be mitigated by having multi-instance rather than multi-source.

Another issue was faced when trying to import compressed tablespaces into a host (really only needed for multi-source hosts, as the others can be rebuilt with a physical copy): T153743#3299482

Does it make sense to keep this open if we have already decided to go for multi-instance?

Can we rename it to "Implement X" #meta ?

jcrespo renamed this task from Meta ticket: The future of multi source replication slaves vs multi instance ones. to Meta ticket: Migrate multi-source database hosts to multi-instance. · Jul 3 2017, 10:24 AM
Marostegui triaged this task as Medium priority. · Jul 3 2017, 10:24 AM

Only dbstore1002 is pending; we are in talks with Analytics to replace it soon. CC @Ottomata @elukey (not sure if there is a ticket already).

We had a chat with the Research team during the offsite and they are on board with having a multi host/instance version of dbstore1002, but we still haven't reached out to all the data analysts to get their opinion. Some preliminary questions:

  1. Does Analytics need to request hardware for the next FY for a multi-instance dbstore1002?
  2. How soon are you guys planning to work on this?

The Analytics team's vision is to move all our data to HDFS and eventually move all our users away from MySQL, both for the log database and for all the ones on dbstore1002. We would like to implement this plan during the next two years, which corresponds to the expected lifespan of the analytics databases.

@elukey we have done our math about the dbstore1002 replacement and redundancy for the next two years, and this is what we came up with:

  • If we want to keep dbstore1002, we'd need 3 hosts for redundancy.
  • If we DO NOT want to keep dbstore1002, we'd need 6 hosts for replacement + redundancy.

I guess you'd need to create a procurement ticket with whatever you decide (standard DB hosts, meaning 512G RAM + SSDs).

First of all, thanks a lot for all the work put into figuring out a good config!

I had a chat with my team and we would prefer not to add budget for all 6 nodes, but for something "cheaper" like 2/3 nodes max (possibly without re-using dbstore1002, which is getting a bit old). As far as we know, people will be fine if some databases hosted on a broken host are unavailable for 2 weeks of maintenance, so given this assumption, would you guys be OK with provisioning a less redundant set of hosts?

Sorry for the extra pain points :(

So just to be clear, you just want 3 hosts and also decommission dbstore1002 - meaning that there will be no redundancy for the service. Is that understanding correct?

Basically yes, the bare minimum set of hosts that would allow us to replace dbstore1002. We don't know exactly how many people are heavily relying on dbstore1002 at the moment for their daily work, but from all the feedback we got, it seems tolerable to have 2/3 weeks of downtime if a major hardware issue comes up.

This absolutely doesn't mean that we (as Analytics) don't care about the service, but due to the high budget that a fully replicated environment would require, we'd prefer to spend time and resources on:

  1. replacing dbstore1002 with a "similar" configuration (no redundancy at the moment)
  2. accelerating the work to eventually move all our use cases to HDFS only.

Let me know if you have any doubts or suggestions :)

All clear! Thanks :-)
Then 3 hosts should be it.

Or 2 + eventlogging, more realistically, with the current budget.

@Marostegui I wanted to set up db1113 as sanitarium multiinstance, but I see sanitarium_multiinstance.pp is very poorly puppetized. Compare it to core_multiinstance.pp.

Yeah, it was done quickly to get us out of the woods. It needs some care and time, which we might not have right now :(

jcrespo added a subtask: Unknown Object (Task). · Mar 26 2018, 8:02 PM
Marostegui changed the status of subtask Unknown Object (Task) from Open to Stalled. · Apr 2 2018, 6:03 AM
RobH added a subtask: Unknown Object (Task). · May 15 2018, 4:46 PM
RobH changed the status of subtask Unknown Object (Task) from Stalled to Open.

Just wanted to confirm that the Analytics team will go forward with the replacement of dbstore1002 with a multi-instance setup during the upcoming fiscal year. I'll keep updating T172410 with more Analytics-related info.

Awesome news! Thanks for the heads up!

jcrespo added a subscriber: faidon. · Jun 12 2018, 7:05 AM

@elukey: Just wanted to confirm that the Analytics team will go forward with the replacement of dbstore1002 with a multi-instance setup during the upcoming fiscal year. I'll keep updating T172410 with more Analytics-related info.

cc @faidon ^

RobH closed subtask Unknown Object (Task) as Resolved. · Jun 22 2018, 7:09 PM
RobH closed subtask Unknown Object (Task) as Resolved.
elukey mentioned this in Unknown Object (Task). · Jun 26 2018, 9:05 AM

Just for the record.
The only hosts we have running with multi-source at the moment are:

  • labsdb1009-1011
  • dbstore1002 (which will be gone once the new HW is bought and set up)

@jcrespo @Marostegui: we have been discussing the future setup of dbstore1002's replacements, and I have a question about accounts. The famous research user is an umbrella account that we use from the stat machines, and it has been bad for us for several reasons (no real accountability, one password shared by many users, etc.). Would it be possible to avoid re-creating it on the new DB hosts, and have something like passwordless auth via unix socket for all the users that are now using the research account? Or any other feasible solution that you have in mind. This would really be awesome for Analytics, since we'd be able to delete the research account and know our users and their query patterns, to eventually move them to Hadoop completely.

Does it sound reasonable?

@elukey the only way to use some sort of password-less auth for MySQL would indeed be unix socket authentication; however, users would need to be logged in to the server (via ssh) to be able to use it, as it can only be used locally :-(
How many users do you have/will you have? The Cloud Team developed a script to create and manage users and passwords; each user then gets assigned to a profile which has certain grants.
This still doesn't solve your issue, as each user still has a password, but at least you can track each user's queries individually.
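For reference, this is roughly what unix socket authentication looks like in MariaDB: the unix_socket plugin maps a MySQL account onto the identically-named OS user of the connecting local process, which is exactly why it cannot work for remote clients. A sketch with a hypothetical user and grant:

```sql
-- Hypothetical example: 'alice' can log in only from a local shell
-- whose OS user is also 'alice'; no password is stored at all.
INSTALL SONAME 'auth_socket';
CREATE USER 'alice'@'localhost' IDENTIFIED VIA unix_socket;
GRANT SELECT ON enwiki.* TO 'alice'@'localhost';
```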

elukey added a comment (Edited). · Sep 24 2018, 7:26 AM

This is a good question. I don't know a precise number, but I'd say between 10 and 20? (very quick guess). My main goal is to figure out who is using these databases and why.

It shouldn't be a huge issue to ask users to have their own password, if there is automation around it :)

You might want to talk to @bd808 @chasemp or @Bstorm to get some more details about that script they use to handle those users.

If you mean "Can analytics prepare and setup a different authentication backend/system for the analytics MySQL Server? Will you help us do that?" the answer is "of course". MySQL supports many authentication methods, or you can check/ask how cloud manages its own for wikireplicas.

Socket authentication is out of the question, that only works for local login, which we will not allow for analytics (only roots should have local host login).

If you mean "Can analytics prepare and setup a different authentication backend/system for the analytics MySQL Server? Will you help us do that?" the answer is "of course". MySQL supports many authentication methods, or you can check/ask how cloud manages its own for wikireplicas.

All right I'll follow up with them :)

Socket authentication is out of the question, that only works for local login, which we will not allow for analytics (only roots should have local host login).

I imagined but wanted to double check with you guys first. Thanks!

Marostegui closed this task as Resolved. · Sep 28 2018, 2:50 PM

This ticket was meant to be a discussion ticket but turned into a tracking ticket.
We are going to close it, knowing that what is pending to migrate is:

  • dbstore1002 -> "ongoing"; there is HW that will hopefully be bought soon.
  • labsdb10XX hosts -> will take a few years to migrate and a full re-architecture, as users won't be able to JOIN wikis from different sections directly on the DB.

Thanks to everyone who helped make multi-instance a reality on core and misc, using resources much more efficiently!