
Implement a frontend failover solution for labsdb replicas
Closed, Resolved · Public

Description

Currently we have no good way to depool a labsdb host for normal maintenance or in case of failure.

The model as it stands:

  • Every production database replica exists on each labsdb server.
  • We have a rough-and-dirty array in puppet that maps service hostnames to the physical host serving each DB (e.g. "enwiki.labsdb", "wikidatawiki.labsdb", ...).
  • User tables can be created on any given physical host, have no life expectancy beyond that physical host (i.e. no replication or backup), and have no uptime guarantees, since they are tied to a single fallible physical host.
  • When a physical host has an unplanned issue, it is always an outage for the users whose replica access points at that host; when a physical host has a planned maintenance window, it is always an outage for any user tables stored on it.

The model we would like to consider:

  • All DB replicas could exist on all servers or not, but probably would, for node parity.
  • We would have a proxy or intermediary process pooling/depooling backend replicas for maintenance or in case of failure. We use haproxy for this in production, and it would be really advisable IMO to keep this consistent with production (a rough sketch follows this list).
  • Service hostnames for DBs would point to the proxy, which would ensure service integrity as much as possible.
  • Making changes for better availability of user tables is currently undecided. Should they remain the same? They are problematic in any setup with an abstracted labsdb replication service, because keeping user connections pinned to the same backend that stores both their tables and the replica DBs is problematic, and we have no mechanism that solves this. In a modern setup we would either shard user databases across multiple physical hosts (which removes the ability to perform SQL JOINs with production replica tables), or accept the transient nature of user tables on replica servers.
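As a rough illustration of the pooling/depooling idea above, a minimal haproxy listener could look like the sketch below. This is an assumption for discussion only, not the actual puppetized configuration; hostnames, addresses, ports and check timings are illustrative.

# sketch only: a TCP listener that depools any replica failing its health checks
listen labsdb-replicas
    bind *:3306
    mode tcp
    balance roundrobin
    option tcpka
    # 'check' makes haproxy probe each backend; a server that fails 3 checks
    # in a row is depooled automatically and repooled after 2 successes
    server labsdb1009 labsdb1009.eqiad.wmnet:3306 check inter 3s fall 3 rise 2
    server labsdb1010 labsdb1010.eqiad.wmnet:3306 check inter 3s fall 3 rise 2
    server labsdb1011 labsdb1011.eqiad.wmnet:3306 check inter 3s fall 3 rise 2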

Event Timeline

I think that, of all things to fix, this one is a hard blocker for T140452.

bd808 renamed this task from labsdb* has no HA solution to labsdb* has no High Availability solution. Jul 22 2016, 3:45 PM
bd808 updated the task description.

I do not much like this title. I think we all understand the problem, but labsdb does have an HA solution (unless you take the hard definition of 99.999% availability, which we definitely do not want): there is high availability (if a server fails we do not lose the service nor the data); it is just that it is fully manual, very inflexible, and inconvenient for our users, and implementing it is not obvious.

I think the title could mislead readers who are not in the loop.

I think the title could mislead readers who are not in the loop.

Please improve the title as needed. I just expanded the "HA" jargon into words that non-sysadmin users might understand, after a reasonably computer-literate user asked me on IRC what "HA" meant.

jcrespo renamed this task from labsdb* has no High Availability solution to labsdb* has no automatic failover solution. Aug 11 2016, 3:07 PM

@chasemp, we cannot wait any longer on this; @mark has specifically asked me to unblock it so we can move on with T142807.

@jcrespo, that's not what I said. :)

It sounds like this is not an easy decision that we can make without consequences and without gathering more information first. I've proposed we discuss this in a session at the Ops offsite.

Brief proposal for the discussion:

  • new labsdb boxes
  • haproxy (somewhere?)
  • service url for reaching these new labsdb boxes to access replicas
  • service url for reaching one of these labsdb boxes for user tables

At the moment user tables are basically short to medium term: they live on a single server and die with that server, and they may or may not be preserved depending on any number of factors (size, use, etc). They are an artifact of having no better offering for a certain subset of use cases. In the outline above we maintain the same persistence and support expectations for them while allowing normal maintenance and process for the backend pool, i.e. user table availability remains best effort.

open questions:

  • how many user tables are defined?
  • where are they used from?
  • what are they used for? (hardest to answer)
  • how much space do they occupy? (can we cram them onto one new server)

can we cram them onto one new server

I do not have exact numbers right now, and some may be excessively large, but yes, we can cram them onto a single server, at least onto one of the 3 new ones.

I think a good strategy to get rid of unused data is to migrate those tables in an opt-in way, so we do not waste resources on long-dead projects. Also, we can convert user tables to InnoDB while transferring them. The number is small enough that we can contact each user individually.

So a proposal would be:

  • Set up a long-running query service with user tables (most should be scratch data anyway)
  • Contact users to migrate user tables to toolsdb (labsdb100[45]), spread the love of replication; if extra performance is needed, I would accept not moving labsdb1008 back to enwiki, but keeping it as the "user database"
  • For those that cannot migrate, convert to InnoDB and move to the single-host "slow" query service, with no redundancy guarantees
  • Set up a "fast" query service with a limit of X seconds and no user tables; this service can fail over to either of the other 2
  • Let people decide what to do with the third server; it can be a backup (but active) server for both services
  • I think I could load balance writes so they go automatically to one server, but I would have to check (see the sketch after this list)
  • Potentially deprecate user tables (e.g. disallow creating new databases) and slowly move them to a separate server
  • Announce those changes in advance to get user feedback
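A rough sketch of the "writes go automatically to one server" bullet above, assuming haproxy: the backup keyword keeps one backend active and only promotes the others when it fails its health checks. This is a hypothetical example, not any merged configuration; names and ports are illustrative.

# sketch only: single active writer, remaining servers as hot spares
listen labsdb-user-tables
    bind *:3307
    mode tcp
    # all traffic goes to the first healthy non-backup server;
    # 'backup' servers only receive connections if it is marked down
    server labsdb1009 labsdb1009.eqiad.wmnet:3306 check
    server labsdb1010 labsdb1010.eqiad.wmnet:3306 check backup
    server labsdb1011 labsdb1011.eqiad.wmnet:3306 check backup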
chasemp renamed this task from labsdb* has no automatic failover solution to Implement a frontend failover solution for labsdb (possibly HAProxy). Sep 29 2016, 1:00 PM
Marostegui renamed this task from Implement a frontend failover solution for labsdb (possibly HAProxy) to Implement a frontend failover solution for labsdb replicas. Sep 30 2016, 9:35 AM

Change 316558 had a related patch set uploaded (by Jcrespo):
New labs::db::proxy role to load balance and failover replicas

https://gerrit.wikimedia.org/r/316558

Change 316558 merged by Jcrespo:
Create labs::db::proxy role to load balance and failover replicas

https://gerrit.wikimedia.org/r/316558

Change 317173 had a related patch set uploaded (by Jcrespo):
proxysql: Setup dbproxy1011 as a test host for labs::db::proxy

https://gerrit.wikimedia.org/r/317173

Change 317173 merged by Jcrespo:
proxysql: Setup dbproxy1011 as a test host for labs::db::proxy

https://gerrit.wikimedia.org/r/317173

jcrespo added a subscriber: Marostegui.

dbproxy1010 and dbproxy1011 are now serving as proxies for labsdb1009/10/11 on the labs-support network (they have just been reinstalled). Proper puppetization and the final configuration, including the DNS point of entry, are still to be decided, but at least they work as a proof of concept right now. CC @Marostegui

How is the haproxy layer failed over (between nodes) in prod atm? LVS or ucarp/VRRP or ?

How is the haproxy layer failed over (between nodes) in prod atm? LVS or ucarp/VRRP or ?

There is no redundancy at the moment, not while we do not have enough machines to provide it. In production, the plan is to set up a proxy per client on localhost. In labs, I would like to do a simpler DNS-based failover. I would like to do virtual IP/heartbeat, but Faidon says it is not possible cross-rack. Note that performance is not a concern on labs, LVS on top of HAProxy makes no sense, and our LVS setup is not designed for stateful services.

Change 324893 had a related patch set uploaded (by Jcrespo):
Deploy HAProxy to the new labsdb proxies

https://gerrit.wikimedia.org/r/324893

Change 324893 merged by Jcrespo:
Deploy HAProxy to the new labsdb proxies

https://gerrit.wikimedia.org/r/324893

Change 324905 had a related patch set uploaded (by Jcrespo):
Add dns alias for analytics and web requests labsdb service

https://gerrit.wikimedia.org/r/324905

Change 324905 merged by Jcrespo:
Add dns alias for analytics and web requests labsdb service

https://gerrit.wikimedia.org/r/324905

$ mysql -h labsdb-web -u u<SANITIZED> -p$PASS
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 885492
Server version: 5.5.5-10.1.19-MariaDB MariaDB Server

Copyright (c) 2000, 2016, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB MARIADB labsdb-web (none) > SELECT @@hostname;
+------------+
| @@hostname |
+------------+
| labsdb1011 |
+------------+
1 row in set (0.00 sec)

MariaDB MARIADB labsdb-web (none) > Bye
$ mysql -h labsdb-analytics -u u<SANITIZED> -p$PASS
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 459489
Server version: 5.5.5-10.1.19-MariaDB MariaDB Server

Copyright (c) 2000, 2016, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB MARIADB labsdb-analytics (none) > SELECT @@hostname;
+------------+
| @@hostname |
+------------+
| labsdb1009 |
+------------+
1 row in set (0.00 sec)

MariaDB MARIADB labsdb-analytics (none) > Bye

We have settled on DNS-based failover with haproxy for the time being, between the two dbproxy hosts (1010 and 1011). Currently our web and analytics endpoints from T147051 terminate on different hosts, and we can move endpoints between them as needed in case of issue. While this needs to be iterated on, it is functional now.
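For illustration, the per-endpoint split described above could look roughly like this on the two proxies, consistent with the session output earlier (labsdb-web landing on labsdb1011, labsdb-analytics on labsdb1009). The real configuration lives in the merged labs::db::proxy puppet changes; this sketch only shows the shape of it, and all names and ports are illustrative.

# sketch: proxy behind labsdb-web prefers labsdb1011, others are hot spares
listen mariadb-web
    bind *:3306
    mode tcp
    server labsdb1011 labsdb1011.eqiad.wmnet:3306 check
    server labsdb1009 labsdb1009.eqiad.wmnet:3306 check backup
    server labsdb1010 labsdb1010.eqiad.wmnet:3306 check backup

# sketch: proxy behind labsdb-analytics prefers labsdb1009
listen mariadb-analytics
    bind *:3306
    mode tcp
    server labsdb1009 labsdb1009.eqiad.wmnet:3306 check
    server labsdb1010 labsdb1010.eqiad.wmnet:3306 check backup
    server labsdb1011 labsdb1011.eqiad.wmnet:3306 check backup

If one proxy host has an issue, the corresponding DNS alias can be repointed at the other dbproxy host.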