Currently the LockManager needs 3 different servers. However, we only have 2 primary physical Redis servers per DC. We have accepted the risk of using 3 different Redis instances (that is, Redis ports), where 2 of them are hosted on the same physical host.
For each DC we have 2 primary hosts and 2 secondary (replication) hosts, with separate Redis instances running on different ports.
Pair 1
| Primary | Secondary |
|---|---|
| PrimaryA-6378 | SecondaryA-6378 |
| PrimaryA-6379 | SecondaryA-6379 |
| PrimaryA-6380 | SecondaryA-6380 |
| PrimaryA-6381 | SecondaryA-6381 |
| PrimaryA-6382 | SecondaryA-6382 |
Pair 2
| Primary | Secondary |
|---|---|
| PrimaryB-6378 | SecondaryB-6378 |
| PrimaryB-6379 | SecondaryB-6379 |
| PrimaryB-6380 | SecondaryB-6380 |
| PrimaryB-6381 | SecondaryB-6381 |
| PrimaryB-6382 | SecondaryB-6382 |
The majority of the applications and services using these 2 physical primary Redis nodes can tolerate some unavailability, for example during server rebooting rounds. As far as LockManager is concerned, we usually just go ahead and reboot the Redis servers, primarily because we know that the chances of something going wrong are slim.
LockManager
In T366938: Reduce relying on database locks a more sophisticated logic is introduced, which makes this part more fault tolerant. It uses consistent hashing, where for each key, every server in the pool is hashed together with that key.
- To pick the voters for a key, we sort the servers by that combined hash and take the first N.
- Adding a server gives every key one new hash to consider.
- Removing a server hands its keys over to whichever server ranks next for them.
- The pool can grow or shrink without reshuffling the whole keyspace.
This decouples the size of the voter set (N) from the size of the pool. The pool can be larger than N, and per key we simply take the top N. That is what gives us graceful degradation: if a server disappears, only the keys for which it was a voter are affected, and those keys will go to the next server in their ranking.
Options
1. Two instances per primary host, with the secondaries as manual standbys
| Slot | Instance |
|---|---|
| Slot1 | PrimaryA-6378 |
| Slot2 | PrimaryA-6379 |
| Slot3 | PrimaryB-6378 |
| Slot4 | PrimaryB-6379 |
2 hosts for 4 voters
The pool of four instances spans only two physical fault domains: PrimaryA hosts two of them and PrimaryB hosts the other two.
Problems:
- For any key, at least two of its voters will share a host.
- Rebooting a single primary takes out two voters simultaneously.
- One physical host going away still breaks locking.
- In effect this is the situation we already have, just with more ports.
2. One non-replicating instance per primary and one per secondary
| Slot | Instance |
|---|---|
| Slot1 | PrimaryA-6378 |
| Slot2 | SecondaryA-6378 |
| Slot3 | PrimaryB-6378 |
| Slot4 | SecondaryB-6378 |
4 hosts for 4 voters
The four instances sit on four distinct physical hosts, so any single reboot removes only one voter and quorum holds.
Problems:
- The established mental model is that the two primaries replicate to their secondaries, and that a secondary exists *replace* for its primary.
- Here the secondaries are instead independent, non-replicating LockManager voters that happen to live on the replication hosts, which creates a special case.
- When PrimaryA fails and an SRE works through the usual (very draining) replacement, promoting SecondaryA application by application.
- For LockManager we can't go down that path, since SecondaryA is already a separate member of the voter pool and not a replica of PrimaryA's LockManager instance.
- We can have that information documented in the code
3. A third pair of hosts (VMs)
| Primary | Secondary |
|---|---|
| PrimaryA-6378 | SecondaryA-6378 |
| PrimaryB-6378 | SecondaryB-6378 |
| PrimaryC-6378 | SecondaryC-6378 |
3 hosts for 3 voters
This adds a third primary host (PrimaryC) with its own replicating secondary (SecondaryC), so that three voters map one to one onto three physical primary hosts.
- Each voter has its own host, so a single reboot costs us one voter out of three and quorum holds.
- PrimaryC replicates to SecondaryC just like the existing pairs, so the primary/secondary model is preserved and host replacement and reboot rounds need no LockManager-specific procedure.
- In case of hardware issues, VMs can be transparently migrated to healthy nodes
Problem:
- We are adding 2 VMs per DC for a single service.
- That is real operational and capacity overhead.
- It is can be justfied, given how much the database wellness will be at stake with T366938
- We do have separate hosts dedicated to poolcounter due to its importance