
Modify DonationInterface limbo code for high availability deployment
Closed, Resolved | Public | 2 Estimated Story Points

Description

Implement the plan set out in T103206. Make changes to our code and configuration so that we are minimally affected by the following types of failure:

  • One payments box becomes unreachable for a few minutes.
  • A payments box dies and can be rebuilt with intact data.
  • A payments box dies and is rebuilt with no data.

See updated documentation at https://wikitech.wikimedia.org/wiki/Fundraising#Message_queues

Rework code to handle limbo queues as an object rather than a global, and add logic to choose which backends we connect to in each case.

  • Frontend code writes to the queue on localhost, i.e. on whichever of payments100[1-3] it is running on. No code change, just configuration.
  • Orphan slayer connects to these three queues in round-robin order. Configure.
  • If the orphan slayer fails to connect to a server, eliminate that server from this batch run and carry on with the remaining ones.
  • Configuration for a single queue backend should behave as it did before.
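
The round-robin and failover behavior described above can be sketched roughly as follows. This is an illustrative Python sketch only, not the DonationInterface code: the `QueuePool` class, its `next_backend` method, and the `connect` callable are hypothetical names, and a failed connection is assumed to surface as a `ConnectionError`.

```python
from collections import deque
from typing import Callable, Deque, Dict, List, Optional


class QueuePool:
    """Hypothetical pool: round-robin over queue backends, dropping any
    backend that fails to connect for the rest of the batch run."""

    def __init__(self, hosts: List[str], connect: Callable[[str], object]):
        self._hosts: Deque[str] = deque(hosts)
        self._connect = connect  # assumed to raise ConnectionError on failure
        self._connections: Dict[str, object] = {}
        self.failed: List[str] = []

    def next_backend(self) -> Optional[object]:
        """Return the next reachable backend, or None if every one has failed."""
        while self._hosts:
            host = self._hosts[0]
            self._hosts.rotate(-1)  # move this host to the back of the rotation
            if host in self._connections:
                return self._connections[host]
            try:
                conn = self._connect(host)
            except ConnectionError:
                # Eliminate the unreachable server from this batch run.
                self._hosts.remove(host)
                self.failed.append(host)
                continue
            self._connections[host] = conn
            return conn
        return None
```

With a single configured host, `next_backend` returns the same connection on every call, so the single-backend case behaves like a plain connection. With three hosts where one is unreachable, calls alternate between the two remaining hosts and the unreachable one is recorded in `failed`.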

Event Timeline

awight claimed this task.
awight raised the priority of this task to High.
awight updated the task description.
awight added subscribers: atgo, awight.

@awight Is this blocked on something? Where should this fit in our workflow?

awight edited a custom field.
awight added a project: Unplanned-Sprint-Work.

Change 226948 had a related patch set uploaded (by Awight):
WIP Implement high-availability queue pool

https://gerrit.wikimedia.org/r/226948

Change 226948 abandoned by Awight:
WIP Implement high-availability queue pool

Reason:
squashed.

https://gerrit.wikimedia.org/r/226948