
payments-listener server high availability
Open, Medium, Public

Description

The payments-listener service is hosted on a single machine in eqiad, with a standby machine in codfw. Switching to the backup in the event of a failure is a two-step process: the codfw machine needs to be flipped out of 'maintenance mode' in puppet, and the DNS record for payments-listener.wikimedia.org needs to be changed. The payments-listener service can tolerate outages: it's an API for callbacks from payment providers, and the providers will keep retrying if it's down.

We host a second site, fundraising.wikimedia.org, on the same webserver. The site is very simple: all it does is serve redirects in nginx. However, it is user-facing, because it hosts redirects for URLs sent to potential donors to sign up for donor events. So we should improve our reliability standards for this webserver.

The way we've handled this in other cases (payments) is to add one or more webservers, and put a pair of pybal/LVS servers in front of them in an LVS-DR configuration. That's a large investment considering how little these services do. We also need to look at how payments-listener will behave if LVS shifts traffic across webservers mid-session.
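To make the mid-session concern concrete, here is a minimal sketch of hash-based backend selection, assuming the balancer picks a webserver from the client's source address. This is a simplified model, not how pybal/LVS is actually configured here; the function name, hostnames, and IP address are hypothetical.

```python
import hashlib


def pick_backend(client_ip: str, backends: list[str]) -> str:
    """Map a client address to one backend with a stable hash (illustrative only)."""
    # hashlib gives the same digest on every run, unlike Python's built-in hash() for str.
    digest = hashlib.sha256(client_ip.encode()).digest()
    index = int.from_bytes(digest[:8], "big") % len(backends)
    return backends[index]


# Hypothetical backend names and client address, for illustration only.
backends = ["listener1001", "listener1002"]
first = pick_backend("198.51.100.7", backends)

# While the backend set is unchanged, the same client always lands on the same webserver.
assert pick_backend("198.51.100.7", backends) == first

# If that webserver is removed (for example after a failed health check), the client
# is moved to a different one, which is the mid-session shift to look at.
remaining = [b for b in backends if b != first]
moved_to = pick_backend("198.51.100.7", remaining)
```

In this model a shift only happens when the backend set changes, so the open question is whether payments-listener keeps any per-connection state that would not survive that move.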

Event Timeline

Jgreen created this task. Jul 11 2017, 8:06 PM
Jgreen created this object with visibility "WMF-NDA (Project)".
Jgreen created this object with edit policy "WMF-NDA (Project)".
Ejegg moved this task from Triage to FR-Ops on the Fundraising-Backlog board. Jul 11 2017, 8:08 PM
Jgreen claimed this task. Jul 11 2017, 9:00 PM
Jgreen removed a parent task: Restricted Task. Dec 14 2017, 8:49 PM

Historically we would use pybal/LVS-DR for this, but I think it would be simpler and more efficient to use something like Bird to make a pair of webservers advertise themselves by BGP as routes to a VIP bound to loopback. The end effect would be similar to what we do with LVS-DR, but without the separate load balancers. If we can make this work, I would also like to use it to deprecate the pay-lvs servers.

Arzhel has already started implementing this strategy for things like DNS servers, and has a bunch of work awaiting code review.

https://gerrit.wikimedia.org/r/#/c/391149/
https://gerrit.wikimedia.org/r/#/c/397723/
https://github.com/unixsurfer/anycast_healthchecker
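As a rough illustration of the idea, each webserver would only advertise the VIP while its local service passes health checks, and would withdraw it after repeated failures. The sketch below models just that decision, assuming rise/fall thresholds on consecutive check results. The class, its fields, and the example prefix are hypothetical, and this is not the actual logic or API of Bird or anycast_healthchecker.

```python
from dataclasses import dataclass


@dataclass
class VipAnnouncer:
    """Decide whether a VIP should be advertised, based on health-check results."""
    vip: str                 # hypothetical prefix, e.g. "192.0.2.10/32"
    rise: int = 2            # consecutive successes required before announcing
    fall: int = 3            # consecutive failures required before withdrawing
    announced: bool = False  # current state: is the VIP being advertised?
    streak: int = 0          # consecutive results that contradict the current state

    def record(self, healthy: bool) -> bool:
        """Feed in one health-check result and return the resulting state."""
        if healthy == self.announced:
            # The result agrees with the current state, so any pending flip is cancelled.
            self.streak = 0
        else:
            self.streak += 1
            threshold = self.rise if healthy else self.fall
            if self.streak >= threshold:
                self.announced = healthy
                self.streak = 0
        return self.announced


def prefixes_to_advertise(announcers: list[VipAnnouncer]) -> list[str]:
    """Return the VIPs that should currently be handed to the routing daemon."""
    return [a.vip for a in announcers if a.announced]


listener = VipAnnouncer(vip="192.0.2.10/32")
for result in (True, True):
    listener.record(result)
print(prefixes_to_advertise([listener]))  # the VIP is announced after two successes
```

The thresholds are there so a single flapping check does not withdraw the route; the values shown are placeholders rather than recommendations.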

(12:45:13 PM) XioNoX: We're now using it in prod, with doc on https://wikitech.wikimedia.org/wiki/Anycast

Jgreen moved this task from Triage to Backlog on the fundraising-tech-ops board. Feb 19 2020, 10:51 PM
Jgreen removed Jgreen as the assignee of this task. Jun 16 2020, 8:28 PM
Jgreen changed the visibility from "WMF-NDA (Project)" to "Public (No Login Required)".
Jgreen changed the edit policy from "WMF-NDA (Project)" to "All Users".
Jgreen removed a subscriber: K4-713.