
Spike: How to make our new limbo implementation high-availability?
Closed, Resolved | Public | 2 Story Points

Description

Conclusions

The easiest solution from a development perspective would be to use Redis Cluster, which is available in Redis >= 3.0.0 and is recommended by the Redis authors as the best current practice. It transparently supports sharding, automatic failover, and node migration. Unfortunately, a Cluster-capable Redis has not yet landed in Ubuntu's stable distro, and I can't find any examples of successful production deployment, so it's out of the question for now.

Instead, we're going to shard manually and use simple master-slave replication. The topology will be:

  • Payments 1001-1003: Each run a Redis master.
  • Payments 1004: Runs three Redis slaves, one for each master.
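As a rough sketch, the topology above could be written down as a lookup table for client code. The host names and port numbers here are assumptions made for illustration (the task only names the boxes "Payments 1001-1004"); the one constraint it encodes is that three slaves sharing a box each need their own port.

```python
# Hypothetical host names and ports, for illustration only.
# 6379 is the default Redis port; the slave ports are made up.
TOPOLOGY = {
    "payments1001": {"master_port": 6379, "slave": ("payments1004", 6380)},
    "payments1002": {"master_port": 6379, "slave": ("payments1004", 6381)},
    "payments1003": {"master_port": 6379, "slave": ("payments1004", 6382)},
}


def slave_for(master_host):
    """Return the (host, port) of the slave replicating the given master."""
    return TOPOLOGY[master_host]["slave"]


def all_slaves():
    """Return every slave address, e.g. for a frontend's fallback reads."""
    return [entry["slave"] for entry in TOPOLOGY.values()]
```

A frontend on one of the master boxes would talk to localhost for normal reads and writes, and consult `all_slaves()` only when a limbo entry is missing locally.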

The following changes will need to be made to client code:

  • The DonationInterface frontends will read and write to Redis on localhost, each box maintaining its own limbo. If the completed transaction hook cannot find a limbo entry for the donation being processed, it should check whether the data is available on any of the three slaves. Failure to pop from the FIFO queue during transaction completion is fine, because the orphan slayer is already robust against that.
  • The orphan slayer will pop from each master, going round-robin and taking one message from each queue.
  • The slayer should probably use slaves only for actual read operations (getting limbo data), and should not substitute a slave read for a master pop (pull and dequeue most recent record) like we do from the frontend. Otherwise we'll have to come up with a mechanism to prevent repeatedly chasing our tail.
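The client-side behaviour described in the list above can be sketched roughly as follows. This is an illustration only: `FakeLimbo` is a hypothetical in-memory stand-in for a Redis instance (it does not model replication), the function names are invented, and the queue ordering inside `pop` is an assumption. It shows the frontend's fallback read across slaves and the slayer's round-robin pop, which touches masters only.

```python
from collections import deque


class FakeLimbo:
    """Hypothetical in-memory stand-in for one Redis instance."""

    def __init__(self):
        self.data = {}        # limbo entries keyed by donation id
        self.queue = deque()  # ids waiting for the orphan slayer

    def put(self, key, value):
        self.data[key] = value
        self.queue.append(key)

    def get(self, key):
        """Plain read; leaves the entry and the queue untouched."""
        return self.data.get(key)

    def pop(self):
        """Dequeue one id and remove its entry; None when the queue is empty."""
        if not self.queue:
            return None
        key = self.queue.popleft()
        return key, self.data.pop(key, None)


def find_limbo_entry(key, local, slaves):
    """Frontend read: try the local master first, then each slave in turn."""
    value = local.get(key)
    if value is not None:
        return value
    for slave in slaves:
        value = slave.get(key)
        if value is not None:
            return value
    return None


def slay_round_robin(masters):
    """Slayer: take one message from each master per pass until all are empty.

    Only masters are popped; slaves are never used as a substitute for a pop.
    """
    while True:
        got_any = False
        for master in masters:
            item = master.pop()
            if item is not None:
                got_any = True
                yield item
        if not got_any:
            return


# Usage: three masters with uneven queues.
m1, m2, m3 = FakeLimbo(), FakeLimbo(), FakeLimbo()
m1.put("a1", {"amount": 5})
m1.put("a2", {"amount": 7})
m2.put("b1", {"amount": 3})
popped_ids = [key for key, _ in slay_round_robin([m1, m2, m3])]

# Usage: an entry missing locally is found on a slave copy.
local = FakeLimbo()
slave_a, slave_b = FakeLimbo(), FakeLimbo()
slave_b.put("c1", {"amount": 9})
found = find_limbo_entry("c1", local, [slave_a, slave_b])
```

Taking one message per master on each pass keeps a busy box from starving the others, at the cost of a few empty pops against idle masters.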

Event Timeline

awight created this task. Jun 20 2015, 12:41 AM
awight claimed this task.
awight raised the priority of this task from to High.
awight updated the task description.
awight added a project: Unplanned-Sprint-Work.
awight set Security to None.
awight edited a custom field.
awight added subscribers: awight, Aklapper.

It doesn't seem that the phpredis extension supports clustering natively. Looking at rediscluster now to decide whether we need the abstraction.

awight added a subscriber: Jgreen. Jun 20 2015, 1:11 AM

RedisCluster doesn't support transactions yet, either. I tried prying around in the innards but achieved nothing beyond causing a stack trace.

@Jgreen, what do you think about a single master with slaves, or some clever load-balancing situation so our client only connects to a single server?

Note to self: Another way out of this hole would be to rewrite the transactions as pipelines, and become robust to the predictable ways data will be corrupted.

awight moved this task from Backlog to PCI on the Fundraising-Backlog-Old board. Jun 23 2015, 1:04 AM
awight renamed this task from "Find a PHP Redis client that supports CAS transactions against a cluster" to "Spike: How to make our new limbo implementation high-availability?". Jul 1 2015, 8:40 PM
awight updated the task description.
awight updated the task description. Jul 1 2015, 8:54 PM
awight updated the task description. Jul 1 2015, 9:00 PM
awight updated the task description.
awight moved this task from Doing to Done on the Fundraising Sprint N*E*R*D board. Jul 1 2015, 10:04 PM
awight closed this task as Resolved. Jul 3 2015, 6:02 PM
mmodell removed a subscriber: awight. Jun 22 2017, 9:38 PM