Reliance on a single web proxy (both within tools and outside of tools) means that a partial labs outage reads to the outside world as a complete labs outage. We need some kind of failover or redundancy for this.
|Open||None||T90534 Make toolforge reliable enough (tracking)|
|Open||None||T91068 Set up a schedule for doing failover exercises for toollabs|
|Resolved||Andrew||T90542 Make sure that toollabs can function fully even with one virt* host fully down|
|Resolved||yuvipanda||T89995 ToolLabs web proxy tolerate the failure of virt host|
|Declined||yuvipanda||T91484 Monitor that the redundant webproxies have same state in terms of what they are proxying to whom|
Mentioned In
- rOPUP5df81e066818: tools: Keep proxylistener socket open
- rOPUP92b4e2ad85b3: tools: Make uwsgi & nodejs services also ping additional proxies
- rOPUP2fee2a672582: tools: Make portgrabber also ping two additional webproxies
- T90542: Make sure that toollabs can function fully even with one virt* host fully down
It seems to me unlikely that the proxy can be made HA without a redesign, but having a warm standby ready to take over at the flip of a switch would probably be a sufficient recovery method (presuming, of course, that it lives on different hardware).
I think Labs is too intertwined for having the Labs/Tools proxy back up sooner than the rest to be very useful. Fixing this hardware failure for the most part in less than three hours was pretty spectacular (thanks!) and (IMHO) very good for a less-than-production-grade environment.
The SPOF is probably more "North America, day time" :-). I'd prefer it if the ops team could make sure that the know-how is also available on a Saturday night with the "usual suspects" partying, hiking, ill, etc., i.e. could another ops engineer resolve a similar situation, using the existing training and documentation, in, say, less than twice that time frame?
So we will eventually have two proxies - tools-webproxy-01 and tools-webproxy-02, and they'll be hotspares. Webservices will register with each of them when they are started. Switching between them would be just moving the floating IP from one to the other. Will add documentation soon.
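The "register with each of them" step could look roughly like the sketch below. This is only an illustration, not the actual portgrabber/proxylistener code: the `send` callable, its argument order, and the function name `register_with_all` are assumptions made up for the example. Only the two host names come from this comment.

```python
from typing import Callable, Iterable, List

# Host names from the comment above.
PROXIES = ["tools-webproxy-01", "tools-webproxy-02"]


def register_with_all(
    tool: str,
    backend: str,
    proxies: Iterable[str],
    send: Callable[[str, str, str], None],
) -> List[str]:
    """Register `tool -> backend` with every proxy.

    `send(proxy, tool, backend)` is a hypothetical transport function that
    raises OSError (or a subclass) when the proxy cannot be reached.
    Returns the list of proxies where registration failed.
    """
    failed = []
    for proxy in proxies:
        try:
            send(proxy, tool, backend)
        except OSError:
            # One proxy being down must not block registration with the others.
            failed.append(proxy)
    return failed
```

Returning the failed proxies instead of raising means a webservice can still start while one proxy is down, and the caller can log or retry the missing registration later.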
Alright, so toollabs webproxy is now running on tools-webproxy-01, with a hotspare in tools-webproxy-02. To switch to the spare, go to Special:NovaAddress on wikitech, and reassociate the IP for tools.wmflabs.org to the other one. tools-webproxy, the old machine, is also kept on for now, but will be removed next week. The new machines are also trusty, rather than precise.
I'm not particularly sure where to document this.
My understanding is that dynamicproxy holds the proxying information essentially in memory, i.e. it starts with a clean slate. In that case, tools that start their webservice between start(tools-webproxy-01/dynamicproxy) and start(tools-webproxy-02/dynamicproxy) would lose their proxy when the floating IP is switched to tools-webproxy-02. This can be worked around by restarting all webservices that are not registered with the active proxy (which could be automatically monitored).
Should we just document this/set up monitoring, or do you want to implement some data sharing between the proxies?
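For the monitoring option, the check boils down to comparing the routing tables of the two proxies. A minimal sketch, assuming each proxy's state can be dumped as a plain mapping of tool name to backend address (how dynamicproxy actually stores or exposes its state is not covered here; `diff_routes` and the dict layout are hypothetical):

```python
from typing import Dict, Set


def diff_routes(active: Dict[str, str], spare: Dict[str, str]) -> Dict[str, Set[str]]:
    """Compare two routing tables (tool name -> backend address).

    Returns the tools that are missing on either side, plus tools that are
    present on both proxies but point at different backends.
    """
    active_tools, spare_tools = set(active), set(spare)
    return {
        "missing_on_spare": active_tools - spare_tools,
        "missing_on_active": spare_tools - active_tools,
        "mismatched": {
            tool
            for tool in active_tools & spare_tools
            if active[tool] != spare[tool]
        },
    }
```

Any tool listed under `missing_on_spare` would lose its proxy on a failover, so those are the webservices to restart or to alert on.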