Currently we have no good way to depool a labsdb host for normal maintenance or in case of failure.
The model as it stands:
- Every production database replica exists on each labsdb server.
- We have a really rough and dirty array in puppet that maps service hostnames to the physical host serving a given database (e.g. "enwiki.labsdb", "wikidatawiki.labsdb", ...); a sketch of what that mapping might look like follows this list.
- User tables can be created on any physical host. They have no life expectancy beyond that host (no replication or backups) and no uptime guarantees, since they are tied to a single fallible machine.
- When a physical host has an unplanned issue, it is always an outage for users who reach the replicas through it; when a physical host has a planned maintenance window, it is always an outage for any user tables stored on it.
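For illustration only, the static mapping described above is roughly of this shape (variable name and host names are made up here, not the actual values in the puppet repo):

```
# Hypothetical sketch of the current static hostname -> host mapping.
# One service hostname is pinned to one physical server, so there is
# nothing to fail over to when that server is down.
$labsdb_hosts = {
  'enwiki.labsdb'       => 'labsdb1001',
  'wikidatawiki.labsdb' => 'labsdb1002',
  'commonswiki.labsdb'  => 'labsdb1003',
}
```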
The model we would like to consider:
- All DB replicas could exist on every server; they would not have to, but probably would, to keep the nodes interchangeable.
- A proxy or intermediary process would pool and depool backend replicas for maintenance or in case of failure. We use haproxy for this in production, and IMO it would be really advisable to keep this consistent with production (see the sketch after this list).
- Service hostnames for DBs would point at the proxy, which would preserve service availability as much as possible.
- Making changes for better availability of user tables is currently undecided. Should they remain the same? They are problematic in any setup with an abstracted labsdb replication service, because a user's connections must stay pinned to the one backend that stores both their tables and the replica DBs they join against, and we have no mechanism that solves that. In a modern setup we would either shard user databases across multiple physical hosts (which removes the ability to perform SQL JOINs against production replica tables), or accept that user tables on replica servers are transient.
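As a rough sketch of the proxy idea, an haproxy configuration along the following lines would let the service hostnames point at one address while backends are pooled and depooled behind it. Backend names, ports, and the check user are illustrative assumptions, not actual production or labs values:

```
# Minimal sketch: one TCP listener fronting the replica backends.
listen mariadb-replicas
    bind 0.0.0.0:3306
    mode tcp
    balance roundrobin
    # mysql-check needs a passwordless MySQL user created for haproxy.
    option mysql-check user haproxy
    server labsdb1001 labsdb1001.example.wmnet:3306 check
    server labsdb1002 labsdb1002.example.wmnet:3306 check
    # A host can be depooled for planned maintenance by commenting it out
    # (or disabling it via the haproxy admin socket) without touching the
    # service hostnames:
    # server labsdb1003 labsdb1003.example.wmnet:3306 check
```

With something like this in place, a failed health check depools a backend automatically, and a planned depool is an haproxy change (or an admin-socket "disable server" command, if the stats socket is configured with admin level) rather than a DNS or puppet hostname change. The unresolved part remains user tables: the proxy can only route safely to backends that hold the user's tables, which is exactly the pinning problem described above.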