
Implement a frontend failover solution for labsdb replicas
Closed, Resolved · Public

Description

Currently we have no good way to depool a labsdb host for normal maintenance or in case of failure.

The model as it stands:

  • Every production database replica exists on each labsdb server.
  • We have a rough-and-dirty array in puppet that maps service hostnames to the physical host serving each DB (e.g. "enwiki.labsdb", "wikidatawiki.labsdb", ...).
  • User tables can be created on any given physical host, have no life expectancy beyond that physical host (i.e. no replication or backup), and have no uptime guarantees, since they are tied to a single fallible physical host.
  • When a physical host has an unplanned issue, it is always an outage for the users whose replica access points at that host; when a physical host has a planned maintenance window, it is always an outage for any user tables stored on it.

The model we would like to consider:

  • All DB replicas could exist on all servers or not, but probably would, for node parity.
  • We would have a proxy or intermediary process pooling/depooling backend replicas for maintenance or in case of failure. We use haproxy for this in production, and it would be really advisable IMO to keep this consistent with production (a rough sketch follows this list).
  • Service hostnames for DBs would point to the proxy, which would ensure service integrity as much as possible.
  • Making changes for better availability of user tables is currently undecided. Should they remain the same? They are problematic in any setup with an abstracted labsdb replication service, because keeping user connections pinned to the same backend that stores both their tables and the replica DBs is problematic, and we have no mechanism that solves this. In a modern setup we would either shard user databases across multiple physical hosts (which removes the ability to perform SQL JOINs with production replica tables), or accept the transient nature of user tables on replica servers.
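As a rough illustration of the pooling/depooling idea above, a minimal haproxy listener could look like the sketch below. This is an assumption for discussion only, not the actual puppetized configuration; hostnames, addresses, ports and check timings are illustrative.

# sketch only: a TCP listener that depools any replica failing its health checks
listen labsdb-replicas
    bind *:3306
    mode tcp
    balance roundrobin
    option tcpka
    # 'check' makes haproxy probe each backend; a server that fails 3 checks
    # in a row is depooled automatically and repooled after 2 successes
    server labsdb1009 labsdb1009.eqiad.wmnet:3306 check inter 3s fall 3 rise 2
    server labsdb1010 labsdb1010.eqiad.wmnet:3306 check inter 3s fall 3 rise 2
    server labsdb1011 labsdb1011.eqiad.wmnet:3306 check inter 3s fall 3 rise 2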

Event Timeline

I think that, of all things to fix, this one is a hard blocker for T140452.

bd808 renamed this task from labsdb* has no HA solution to labsdb* has no High Availability solution. Jul 22 2016, 3:45 PM
bd808 updated the task description.

I do not much like this title. I think we all understand the problem, but labsdb does have an HA solution (unless you take the hard definition of 99.999% availability, which we definitely do not want): there is high availability (if a server fails we do not lose the service nor the data); it is just that it is fully manual, very inflexible, and inconvenient for our users, and implementing it is not obvious.

I think the title could mislead readers who are not in the loop.

I think the title could mislead readers who are not in the loop.

Please improve the title as needed. I just expanded the "HA" jargon into words that non-sysadmin users might understand, after a reasonably computer-literate user asked me on IRC what "HA" meant.

jcrespo renamed this task from labsdb* has no High Availability solution to labsdb* has no automatic failover solution. Aug 11 2016, 3:07 PM

@chasemp, we cannot wait any longer on this; @mark has specifically asked me to unblock it so we can move on with T142807.

@jcrespo, that's not what I said. :)

It sounds like this is not an easy decision that we can make without consequences and without gathering more information first. I've proposed we discuss this in a session at the Ops offsite.

Brief proposal for the discussion:

  • new labsdb boxes
  • haproxy (somewhere?)
  • service url for reaching these new labsdb boxes to access replicas
  • service url for reaching one of these labsdb boxes for user tables

At the moment user tables are basically short to medium term: they live on a single server and die with that server, and they may or may not be preserved depending on any number of factors (size, use, etc). They are an artifact of having no better offering for a certain subset of use cases. In the outline above we maintain the same persistence and support expectations for them while allowing normal maintenance and process for the backend pool, i.e. user table availability remains best effort.

open questions:

  • how many user tables are defined?
  • where are they used from?
  • what are they used for? (hardest to answer)
  • how much space do they occupy? (can we cram them onto one new server)

can we cram them onto one new server

I do not have exact numbers right now, and some may be excessively large, but yes, we can cram them onto a single server, at least onto one of the 3 new ones.

I think a good strategy to get rid of unused data is to migrate those tables in an opt-in way, so we do not waste resources on long-dead projects. Also, we can convert user tables to InnoDB while transferring them. The number is small enough that we can contact each user individually.

So a proposal would be:

  • Set up a long-running query service with user tables (most should be scratch data anyway)
  • Contact users to migrate user tables to toolsdb (labsdb100[45]), spread the love of replication; if extra performance is needed, I would accept not moving labsdb1008 back to enwiki, but keeping it as the "user database"
  • For those that cannot migrate, convert to InnoDB and move to the single-host "slow" query service, with no redundancy guarantees
  • Set up a "fast" query service with a limit of X seconds and no user tables; this service can fail over to either of the other 2
  • Let people decide what to do with the third server; it can be a backup (but active) server for both services
  • I think I could load balance writes so they go automatically to one server, but I would have to check (see the sketch after this list)
  • Potentially deprecate user tables (e.g. disallow creating new databases) and slowly move them to a separate server
  • Announce those changes in advance to get user feedback
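A rough sketch of the "writes go automatically to one server" bullet above, assuming haproxy: the backup keyword keeps one backend active and only promotes the others when it fails its health checks. This is a hypothetical example, not any merged configuration; names and ports are illustrative.

# sketch only: single active writer, remaining servers as hot spares
listen labsdb-user-tables
    bind *:3307
    mode tcp
    # all traffic goes to the first healthy non-backup server;
    # 'backup' servers only receive connections if it is marked down
    server labsdb1009 labsdb1009.eqiad.wmnet:3306 check
    server labsdb1010 labsdb1010.eqiad.wmnet:3306 check backup
    server labsdb1011 labsdb1011.eqiad.wmnet:3306 check backup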
chasemp renamed this task from labsdb* has no automatic failover solution to Implement a frontend failover solution for labsdb (possibly HAProxy). Sep 29 2016, 1:00 PM
Marostegui renamed this task from Implement a frontend failover solution for labsdb (possibly HAProxy) to Implement a frontend failover solution for labsdb replicas. Sep 30 2016, 9:35 AM

Change 316558 had a related patch set uploaded (by Jcrespo):
New labs::db::proxy role to load balance and failover replicas

https://gerrit.wikimedia.org/r/316558

Change 316558 merged by Jcrespo:
Create labs::db::proxy role to load balance and failover replicas

https://gerrit.wikimedia.org/r/316558

Change 317173 had a related patch set uploaded (by Jcrespo):
proxysql: Setup dbproxy1011 as a test host for labs::db::proxy

https://gerrit.wikimedia.org/r/317173

Change 317173 merged by Jcrespo:
proxysql: Setup dbproxy1011 as a test host for labs::db::proxy

https://gerrit.wikimedia.org/r/317173

jcrespo added a subscriber: Marostegui.

dbproxy1010 and dbproxy1011 are now serving as proxies for labsdb1009/10/11 on the labs-support network (they have just been reinstalled). Proper puppetization and the final configuration, including the DNS point of entry, are still to be decided, but at least they work as a proof of concept right now. CC @Marostegui

How is the haproxy layer failed over (between nodes) in prod atm? LVS or ucarp/VRRP or ?

How is the haproxy layer failed over (between nodes) in prod atm? LVS or ucarp/VRRP or ?

There is no redundancy at the moment, not while we do not have enough machines to provide it. In production, the plan is to set up a proxy per client on localhost. In labs, I would like to do a simpler DNS-based failover. I would like to do virtual IP/heartbeat, but Faidon says it is not possible cross-rack. Note that performance is not a concern on labs, LVS on top of HAProxy makes no sense, and our LVS setup is not designed for stateful services.

Change 324893 had a related patch set uploaded (by Jcrespo):
Deploy HAProxy to the new labsdb proxies

https://gerrit.wikimedia.org/r/324893

Change 324893 merged by Jcrespo:
Deploy HAProxy to the new labsdb proxies

https://gerrit.wikimedia.org/r/324893

Change 324905 had a related patch set uploaded (by Jcrespo):
Add dns alias for analytics and web requests labsdb service

https://gerrit.wikimedia.org/r/324905

Change 324905 merged by Jcrespo:
Add dns alias for analytics and web requests labsdb service

https://gerrit.wikimedia.org/r/324905

$ mysql -h labsdb-web -u u<SANITIZED> -p$PASS
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 885492
Server version: 5.5.5-10.1.19-MariaDB MariaDB Server

Copyright (c) 2000, 2016, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB MARIADB labsdb-web (none) > SELECT @@hostname;
+------------+
| @@hostname |
+------------+
| labsdb1011 |
+------------+
1 row in set (0.00 sec)

MariaDB MARIADB labsdb-web (none) > Bye
$ mysql -h labsdb-analytics -u u<SANITIZED> -p$PASS
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 459489
Server version: 5.5.5-10.1.19-MariaDB MariaDB Server

Copyright (c) 2000, 2016, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB MARIADB labsdb-analytics (none) > SELECT @@hostname;
+------------+
| @@hostname |
+------------+
| labsdb1009 |
+------------+
1 row in set (0.00 sec)

MariaDB MARIADB labsdb-analytics (none) > Bye

We have settled on DNS-based failover with haproxy for the time being, between the two dbproxy hosts (1010 and 1011). Currently our web and analytics endpoints from T147051 terminate on different hosts, and we can move endpoints between them as needed in case of issue. While this needs to be iterated on, it is functional now.
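For illustration, the per-endpoint split described above could look roughly like this on the two proxies, consistent with the session output earlier (labsdb-web landing on labsdb1011, labsdb-analytics on labsdb1009). The real configuration lives in the merged labs::db::proxy puppet changes; this sketch only shows the shape of it, and all names and ports are illustrative.

# sketch: proxy behind labsdb-web prefers labsdb1011, others are hot spares
listen mariadb-web
    bind *:3306
    mode tcp
    server labsdb1011 labsdb1011.eqiad.wmnet:3306 check
    server labsdb1009 labsdb1009.eqiad.wmnet:3306 check backup
    server labsdb1010 labsdb1010.eqiad.wmnet:3306 check backup

# sketch: proxy behind labsdb-analytics prefers labsdb1009
listen mariadb-analytics
    bind *:3306
    mode tcp
    server labsdb1009 labsdb1009.eqiad.wmnet:3306 check
    server labsdb1010 labsdb1010.eqiad.wmnet:3306 check backup
    server labsdb1011 labsdb1011.eqiad.wmnet:3306 check backup

If one proxy host has an issue, the corresponding DNS alias can be repointed at the other dbproxy host.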