Redis solution for LockManager
Open, Medium, Public

Description

Currently the LockManager needs 3 different servers. However, we only have 2 primary physical Redis servers per DC. We have accepted the risk of using 3 different Redis instances (that is, Redis ports), where 2 of them are hosted on the same physical host.

For each DC we have 2 primary hosts and 2 secondary (replication) hosts, with separate Redis instances running on different ports.

Pair 1

Primary          Secondary
PrimaryA-6378    SecondaryA-6378
PrimaryA-6379    SecondaryA-6379
PrimaryA-6380    SecondaryA-6380
PrimaryA-6381    SecondaryA-6381
PrimaryA-6382    SecondaryA-6382

Pair 2

Primary          Secondary
PrimaryB-6378    SecondaryB-6378
PrimaryB-6379    SecondaryB-6379
PrimaryB-6380    SecondaryB-6380
PrimaryB-6381    SecondaryB-6381
PrimaryB-6382    SecondaryB-6382

The majority of the applications and services using these 2 physical primary Redis nodes can tolerate some unavailability, for example during server reboot rounds. As far as LockManager is concerned, we usually just go ahead and reboot the Redis servers, primarily because we know that the chances of something going wrong are slim.

LockManager

T366938: Reduce relying on database locks introduces more sophisticated logic that makes this part more fault tolerant. It uses a form of consistent hashing in which, for each key, every server in the pool is hashed together with that key (a scheme often called rendezvous hashing).

  • To pick the voters for a key, we sort the servers by that combined hash and take the first N.
  • Adding a server gives every key one new hash to consider.
  • Removing a server hands its keys over to whichever server ranks next for them.
  • The pool can grow or shrink without reshuffling the whole keyspace.

This decouples the size of the voter set (N) from the size of the pool. The pool can be larger than N, and per key we simply take the top N. That is what gives us graceful degradation: if a server disappears, only the keys for which it was a voter are affected, and those keys will go to the next server in their ranking.
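The selection described above can be sketched as follows. This is only an illustration, not the code from T366938: the hash function (SHA-256), the `server|key` concatenation, the function names, the key names, and the choice of N = 3 over a pool of four are all assumptions made for the example.

```python
import hashlib


def rank_servers(key: str, pool: list[str]) -> list[str]:
    """Order the pool for one key by hashing each server together with the key."""
    def score(server: str) -> int:
        # ASSUMPTION: SHA-256 over "server|key"; the real implementation may differ.
        digest = hashlib.sha256(f"{server}|{key}".encode()).digest()
        return int.from_bytes(digest[:8], "big")
    return sorted(pool, key=score)


def pick_voters(key: str, pool: list[str], n: int) -> list[str]:
    """Take the first n servers of the key's ranking as its voters."""
    return rank_servers(key, pool)[:n]


# Hypothetical pool and voter-set size, chosen so that the pool is larger than N.
pool = ["PrimaryA-6378", "SecondaryA-6378", "PrimaryB-6378", "SecondaryB-6378"]
N = 3
keys = [f"lock:{i}" for i in range(1000)]

before = {k: pick_voters(k, pool, N) for k in keys}

# Simulate one server disappearing from the pool.
removed = "PrimaryB-6378"
smaller_pool = [s for s in pool if s != removed]
after = {k: pick_voters(k, smaller_pool, N) for k in keys}

affected = [k for k in keys if removed in before[k]]
unaffected = [k for k in keys if removed not in before[k]]

print(f"{len(affected)} keys had {removed} as a voter and gained a replacement")
print(f"{len(unaffected)} keys kept exactly the same voters")
```

Keys that never had the removed server among their top N keep an identical voter list, and the affected keys keep their surviving voters and pick up the next server in their ranking, which is the graceful degradation the paragraph above describes.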

Options
1. Two instances per primary host, with the secondaries as manual standbys
Slot     Instance
Slot1    PrimaryA-6378
Slot2    PrimaryA-6379
Slot3    PrimaryB-6378
Slot4    PrimaryB-6379

2 hosts for 4 voters
The pool of four instances spans only two physical fault domains: PrimaryA hosts two of them and PrimaryB hosts the other two.

Problems:

  • For any key, at least two of its voters will share a host.
  • Rebooting a single primary takes out two voters simultaneously.
  • One physical host going away still breaks locking.
  • In effect this is the situation we already have, just with more ports.
2. One non-replicating instance per primary and one per secondary
Slot     Instance
Slot1    PrimaryA-6378
Slot2    SecondaryA-6378
Slot3    PrimaryB-6378
Slot4    SecondaryB-6378

4 hosts for 4 voters
The four instances sit on four distinct physical hosts, so any single reboot removes only one voter and quorum holds.

Problems:

  • The established mental model is that the two primaries replicate to their secondaries, and that a secondary exists to *replace* its primary.
  • Here the secondaries are instead independent, non-replicating LockManager voters that happen to live on the replication hosts, which creates a special case.
  • When PrimaryA fails, an SRE works through the usual (very draining) replacement, promoting SecondaryA application by application.
    • For LockManager we can't go down that path, since SecondaryA is already a separate member of the voter pool and not a replica of PrimaryA's LockManager instance.
  • We can have that information documented in the code.
3. A third pair of hosts (VMs)
Primary          Secondary
PrimaryA-6378    SecondaryA-6378
PrimaryB-6378    SecondaryB-6378
PrimaryC-6378    SecondaryC-6378

3 hosts for 3 voters
This adds a third primary host (PrimaryC) with its own replicating secondary (SecondaryC), so that three voters map one to one onto three physical primary hosts.

  • Each voter has its own host, so a single reboot costs us one voter out of three and quorum holds.
  • PrimaryC replicates to SecondaryC just like the existing pairs, so the primary/secondary model is preserved and host replacement and reboot rounds need no LockManager-specific procedure.
  • In case of hardware issues, VMs can be transparently migrated to healthy nodes.

Problems:

  • We are adding 2 VMs per DC for a single service.
  • That is real operational and capacity overhead.
    • It can be justified, given how much database health will be at stake with T366938.
    • We do have separate hosts dedicated to poolcounter due to its importance.
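The single-host-failure reasoning behind the three options can be checked with a small sketch. It assumes that quorum means a strict majority of the voters, which is consistent with the statements above (two survivors out of four break locking, three out of four or two out of three do not) but is not spelled out in this task. The function and variable names are hypothetical.

```python
from collections import Counter

# Voter instance -> physical host, copied from the tables for each option.
LAYOUTS = {
    "option 1": {
        "PrimaryA-6378": "PrimaryA",
        "PrimaryA-6379": "PrimaryA",
        "PrimaryB-6378": "PrimaryB",
        "PrimaryB-6379": "PrimaryB",
    },
    "option 2": {
        "PrimaryA-6378": "PrimaryA",
        "SecondaryA-6378": "SecondaryA",
        "PrimaryB-6378": "PrimaryB",
        "SecondaryB-6378": "SecondaryB",
    },
    "option 3": {
        "PrimaryA-6378": "PrimaryA",
        "PrimaryB-6378": "PrimaryB",
        "PrimaryC-6378": "PrimaryC",
    },
}


def quorum(voters: int) -> int:
    # ASSUMPTION: quorum is a strict majority of the voter set.
    return voters // 2 + 1


def survives_single_host_loss(layout: dict[str, str]) -> bool:
    """Check the worst case: the host carrying the most voters goes away."""
    voters_per_host = Counter(layout.values())
    worst_case_survivors = len(layout) - max(voters_per_host.values())
    return worst_case_survivors >= quorum(len(layout))


results = {name: survives_single_host_loss(layout) for name, layout in LAYOUTS.items()}
for name, ok in results.items():
    print(f"{name}: {'quorum holds' if ok else 'quorum lost'}")
# option 1: quorum lost
# option 2: quorum holds
# option 3: quorum holds
```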

Event Timeline

Solution #2 is more fault tolerant, but at the cost of some hidden complexity, which can be documented in the code.

My preference is #3. It is less fault tolerant and its operational cost is two extra servers per DC, but beyond that everything stays clean and behaves as expected.

MLechvien-WMF subscribed.

Thanks for the detailed analysis, @jijiki.
Moving to next quarter for discussion as we're oversubscribed this quarter. @Ladsgroup FYI