The payments-listener service is hosted on a single machine in eqiad, with a standby machine in codfw. Switching to the backup in the event of a failure is a two-step process: the codfw machine has to be flipped out of 'maintenance mode' in puppet, and the DNS record for payments-listener.wikimedia.org has to be changed. The payments-listener service can tolerate outages because it is an API for callbacks from payment providers, and the providers will keep retrying if it's down.
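To make the outage tolerance concrete, here is a rough sketch of a provider retrying a callback while the manual failover is in progress. The retry schedule and the failover duration are made-up numbers for illustration, not values taken from any provider or from our runbooks, and `deliver_with_retries` is a hypothetical helper.

```python
def deliver_with_retries(is_listener_up, retry_delays):
    """Simulate a provider retrying one callback.

    is_listener_up: function taking elapsed seconds, returning True if the
                    listener would accept the callback at that moment.
    retry_delays:   seconds to wait before each retry (after the first try).
    Returns (attempt_number, elapsed_seconds) on success, or None if the
    provider runs out of retries.
    """
    elapsed = 0
    # The first attempt happens immediately, so it has a delay of 0.
    for attempt, delay in enumerate([0] + list(retry_delays), start=1):
        elapsed += delay
        if is_listener_up(elapsed):
            return attempt, elapsed
    return None


# ASSUMPTION: the puppet flip plus the DNS change take 15 minutes in total.
FAILOVER_SECONDS = 900
# ASSUMPTION: an invented provider retry schedule, in seconds.
RETRY_DELAYS = [60, 300, 900, 3600]


def listener_up(elapsed_seconds):
    # The listener is unreachable until the failover has completed.
    return elapsed_seconds >= FAILOVER_SECONDS


result = deliver_with_retries(listener_up, RETRY_DELAYS)
print(result)
```

Under these assumptions the callback is delivered on a later retry rather than lost. The real margin depends on each provider's actual retry policy, which would need to be checked against how long a failover takes.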
We host a second site, fundraising.wikimedia.org, on the same webserver. The site is very simple: all it does is serve redirects from nginx. It is user-facing, though, because it hosts the redirects for URLs sent to potential donors to sign up for donor events, so we should raise our reliability standards for this webserver.
The way we've handled this in other cases (payments) is to add one or more webservers and put a pair of pybal/LVS servers in front of them in an LVS-DR configuration. That's a large investment considering how little these services do. We also need to look at how the payments-listener will behave if LVS shifts traffic across webservers mid-session.
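One way to frame the mid-session question is whether the listener keeps any state local to the host. The sketch below is purely hypothetical: we have not confirmed that payments-listener tracks callbacks this way, and the `Backend` class and `txn-1` ID are invented for illustration. It compares two backends with per-host state against two backends sharing a store when a retried callback lands on a different host.

```python
class Backend:
    """Hypothetical webserver that remembers which callback IDs it has seen."""

    def __init__(self, name, store):
        self.name = name
        self.store = store  # set of callback IDs already handled

    def handle(self, callback_id):
        # Report 'processed' the first time an ID is seen, 'duplicate' after.
        if callback_id in self.store:
            return "duplicate"
        self.store.add(callback_id)
        return "processed"


def replay(backends, route):
    """Send each (backend_index, callback_id) pair in order; collect outcomes."""
    return [backends[index].handle(callback_id) for index, callback_id in route]


# The same callback arrives twice, and the second delivery is routed
# to the other backend (as could happen if LVS shifts traffic).
route = [(0, "txn-1"), (1, "txn-1")]

# Case 1: each backend keeps its own local state.
local_backends = [Backend("a", set()), Backend("b", set())]
local_outcome = replay(local_backends, route)

# Case 2: both backends share one store (e.g. a common database).
shared_store = set()
shared_backends = [Backend("a", shared_store), Backend("b", shared_store)]
shared_outcome = replay(shared_backends, route)

print(local_outcome)
print(shared_outcome)
```

With per-host state the repeated callback is processed twice, while with a shared store the second delivery is recognized as a duplicate. If the real service keeps nothing on the local host, a traffic shift should not matter, but that is what needs checking.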