
ToolLabs web proxy tolerate the failure of virt host
Closed, Resolved (Public)

Description

As per https://wikitech.wikimedia.org/wiki/Incident_documentation/20150217-LabsOutage

Reliance on a single web proxy (both within tools and outside of tools) means that a partial labs outage reads to the outside world as a complete labs outage. We need some kind of failover or redundancy for this.

Event Timeline

Andrew raised the priority of this task from to Needs Triage.
Andrew updated the task description.
Andrew added a project: Cloud-Services.
Andrew added a subscriber: Andrew.

It seems to me unlikely that the proxy can be made highly available (HA) without a redesign, but having a warm standby ready to take over at the flip of a switch would probably be a sufficient recovery method (presuming, of course, that it lives on different hardware).

I think Labs is too intertwined for having the Labs/Tools proxy back up sooner than the rest to be very useful. Fixing this hardware failure for the most part in less than three hours was pretty spectacular (thanks!) and (IMHO) very good for a less-than-production-grade environment.

The SPOF is probably more "North America, day time" :-). I'd prefer it if the ops team could make sure that the know-how is also available on a Saturday night with the "usual suspects" partying, hiking, ill, etc., i.e., could another ops engineer solve a similar situation with the existing training and documentation in, say, less than twice that time frame?

@Andrew, you are proposing to work on this task during the Wikimedia Hackathon in Lyon. Please consider associating this task with Wikimedia-Hackathon-2015.

Qgil triaged this task as Low priority. (Feb 20 2015, 2:36 PM)
yuvipanda raised the priority of this task from Low to High. (Feb 24 2015, 11:53 AM)

Toollabs was down twice over the last few days because tools-webproxy has no redundancy.

Change 193334 had a related patch set uploaded (by Yuvipanda):
tools: Make portgrabber also ping two additional webproxies

https://gerrit.wikimedia.org/r/193334

Change 193334 merged by Yuvipanda:
tools: Make portgrabber also ping two additional webproxies

https://gerrit.wikimedia.org/r/193334

Change 193343 had a related patch set uploaded (by Yuvipanda):
tools: Make uwsgi & nodejs services also ping additional proxies

https://gerrit.wikimedia.org/r/193343

yuvipanda renamed this task from "Labs web proxy should be load-balanced and tolerate the failure of virt host" to "ToolLabs web proxy should be load-balanced and tolerate the failure of virt host". (Feb 27 2015, 8:34 AM)
yuvipanda edited projects, added Toolforge; removed Wikimedia-Hackathon-2015.

(Removing the Hackathon project since this needs to be fixed *now*)

Change 193343 merged by Yuvipanda:
tools: Make uwsgi & nodejs services also ping additional proxies

https://gerrit.wikimedia.org/r/193343

So we will eventually have two proxies, tools-webproxy-01 and tools-webproxy-02, and they'll be hot spares for each other. Webservices will register with each of them when they are started. Switching between them would just mean moving the floating IP from one to the other. Will add documentation soon.
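To illustrate the "register with each of them" idea, here is a minimal sketch in Python. It is not the actual portgrabber or proxylistener code: the `send` callable is a hypothetical stand-in for whatever registration call the real services make, and the only names taken from this task are the two proxy host names. The point it shows is that a failure to reach one proxy should not prevent registration with the other.

```python
from typing import Callable, Dict, Iterable

# Proxy host names as mentioned in this task.
PROXIES = ["tools-webproxy-01", "tools-webproxy-02"]


def register_with_all(
    tool: str,
    backend: str,
    proxies: Iterable[str],
    send: Callable[[str, str, str], None],
) -> Dict[str, bool]:
    """Try to register tool -> backend with every proxy.

    send(proxy, tool, backend) is a hypothetical stand-in for the real
    registration call; it is assumed to raise OSError on network failure.
    Returns a map of proxy name to whether registration succeeded.
    """
    results: Dict[str, bool] = {}
    for proxy in proxies:
        try:
            send(proxy, tool, backend)
            results[proxy] = True
        except OSError:
            # A dead proxy must not block registration with the live one.
            results[proxy] = False
    return results
```

A caller could log or alert on any `False` entry, since a webservice that is missing from the standby would lose its route after a floating IP switch.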

Change 193345 had a related patch set uploaded (by Yuvipanda):
tools: Keep proxylistener socket open

https://gerrit.wikimedia.org/r/193345

Change 193345 merged by Yuvipanda:
tools: Keep proxylistener socket open

https://gerrit.wikimedia.org/r/193345

Alright, so toollabs webproxy is now running on tools-webproxy-01, with a hot spare in tools-webproxy-02. To switch to the spare, go to Special:NovaAddress on wikitech, and reassociate the IP for tools.wmflabs.org to the other one. tools-webproxy, the old machine, is also kept on for now, but will be removed next week. The new machines also run Ubuntu Trusty rather than Precise.

I'm not particularly sure where to document this.

yuvipanda renamed this task from "ToolLabs web proxy should be load-balanced and tolerate the failure of virt host" to "ToolLabs web proxy tolerate the failure of virt host". (Feb 27 2015, 1:09 PM)
yuvipanda claimed this task.

Tested the failover. Worked perfectly.

Haven't tested instructions on bringing back a dead instance, though.

My understanding is that dynamicproxy holds the proxying information essentially in memory, i.e., it starts with a clean slate. In that case, tools that start their webservice between start(tools-webproxy-01/dynamicproxy) and start(tools-webproxy-02/dynamicproxy) would lose their proxy when the floating IP is switched to tools-webproxy-02. This can be worked around by restarting all webservices that are not registered with the active proxy (which could be automatically monitored).

Should we just document this/set up monitoring, or do you want to implement some data sharing between the proxies?
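As a rough sketch of the monitoring option, the check could compare the routing tables of both proxies and report which tools each one is missing. How the tables would be read out of dynamicproxy is not covered here; the function below assumes they have already been fetched into plain dictionaries, and the function name and data shape are hypothetical.

```python
from typing import Dict, Set


def find_unregistered(tables: Dict[str, Dict[str, str]]) -> Dict[str, Set[str]]:
    """Report, per proxy, the tools known to some proxy but missing on this one.

    tables maps a proxy name to its routing table (tool name -> backend).
    This data shape is an assumption made for illustration.
    """
    # Union of every tool name seen on any proxy.
    all_tools: Set[str] = set().union(*(routes.keys() for routes in tables.values()))
    return {proxy: all_tools - set(routes) for proxy, routes in tables.items()}
```

Any tool reported as missing on the standby would need its webservice restarted (or re-registered) before a failover, otherwise it would become unreachable once the floating IP moves.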