
There was an outage of phab.wm.o over the weekend, and it seems to have disrupted some work. Phabricator is a service for which we need to clarify the level of HA we provide.

If the web front end dies, it is possible to spin up a new box, cut it over to the existing DB backend, and have //most// of the content and functionality. Some things, such as Diffusion repositories, are stored locally on the server and would not carry over, since they are not managed through Puppet (and are not supposed to be). IMO that would be a reasonable compromise in the face of a much larger outage, provided it is documented.

I have been in wait-and-see mode on this for a while, as upstream is resolving this issue for their own purposes here: https://secure.phabricator.com/T4209. Their SaaS offerings obviously face many of these same problems, and upstream is far, far better placed to handle the edge cases that real HA requires.

My proposal for now is to define explicitly the level of service we offer. This is a nuanced conversation, and I'm not sure whether there is a full list of all offered services with a breakdown of what is most important in the event that //everything// is down. If Phab is down for X hours and we believe it will not come back up without intervention as part of a greater resolution (such as during a larger network event), should we prioritize bringing it back up during that outage?

This is really a business priority question that then has to be followed up on from a technical perspective, and I'm not the right person to define the first part of it :)