
Determine phabricator.wikimedia.org service level
Closed, Resolved · Public

Description

There was an outage for phab.wm.o over the weekend, and it seems this disrupted some work. Phabricator is a service that needs clarification on the level of HA provided. If the web front end is dead, it is possible to spin up a new box, cut over to the DB backend, and have...most...of the content and functionality. Some things, such as Diffusion repositories, are stored locally on the server and would not carry over, since they are not managed through Puppet (and are not supposed to be). IMO that would be a reasonable compromise in the face of a much larger outage, if documented.

I have been in a wait-and-see mode on this for a while, as upstream is resolving this issue for their own purposes here: https://secure.phabricator.com/T4209. Their SAAS offerings obviously have many of these same problems, and they handle the edge cases far, far better for real HA.

My proposal for now would be to define the level of service we offer explicitly. This is a nuanced conversation, and I'm not sure whether there is a full list of all offered services with a breakdown of what is most important in the event that everything is down.

If Phab is down for X hours and we believe it will not come back up without intervention as part of a greater resolution (such as during a larger network event), should we prioritize bringing it back up during that outage?

This is really a business-priority question that then has to be followed up from a technical perspective, and I'm not the right person to define the first part of it :)

Event Timeline

chasemp raised the priority of this task from to Needs Triage.
chasemp updated the task description. (Show Details)
chasemp changed Security from none to None.
chasemp subscribed.
chasemp renamed this task from determine phabricator.wikimedia.org service level to Determine phabricator.wikimedia.org service level. Dec 1 2014, 8:09 PM
Qgil triaged this task as Medium priority.
Qgil added subscribers: RobLa-WMF, mark, Aklapper.

Currently Phabricator is getting the same service level that Bugzilla had. Looking at the whole Wikimedia picture, I think this is the most sensible option. I don't see any strong reason to change it.

Bugzilla was down unexpectedly several times in the past years, and if Ops was able to react more quickly, it is just because we were luckier with the cause, timing, and location of the breaks. If we had had Bugzilla instead of Phabricator in the rack that went down this weekend, the service provided by Ops would have been exactly the same.

We can reopen this discussion when planning the migration of code review and (eventually) continuous integration. For now, I think we are good. This is the opinion of the #Engineering-Community team. If this also works for Operations and Platform Engineering, then we can resolve this task.

PS: About the downtime itself, 5 hours on a weekend is clearly unfortunate, but imho nothing that should make us revise the current service level. Was anybody left unable to work, sitting with their arms crossed? Was any project delayed? I'm counting volunteers as much as employees. Personally, I learned about the downtime only from wikitech-l, having used Phabricator on Saturday-Sunday night at 1am CET and then on Sunday at 1pm.

If we are to migrate other services to Phab, it would definitely need more reliability than what Bugzilla had.

I think HA is "High-Availability" and SA[A]S is "Software as a Service".

Is there an incident report from the outage on 2014-11-29 somewhere on wikitech.wikimedia.org? I'm not sure the service level needs to be revisited, but it would seem that slightly greater redundancy is needed somewhere, if possible.

We do have Icinga monitoring for Phabricator, and we have exact downtimes:

https://icinga.wikimedia.org/cgi-bin/icinga/avail.cgi?t1=1417401318&t2=1417487718&show_log_entries=&host=iridium&service=https%3A%2F%2Fphabricator.wikimedia.org&assumeinitialstates=yes&assumestateretention=yes&assumestatesduringnotrunning=yes&includesoftstates=no&initialassumedhoststate=0&initialassumedservicestate=0&timeperiod=last7days&backtrack=4

But these checks are not set to "critical => true", which would make them send out SMS to ops. If a service is critical, it should be changed to that so we get pages; if it's not, then not. Personally, I don't think it is.
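For illustration, a minimal Puppet sketch of what flipping that check to paging could look like, assuming a monitoring::service-style define with a critical parameter as implied by the "critical => true" remark above; the resource name and check command are illustrative, not the actual operations/puppet definition:

    # Sketch only: the resource name and check command are assumptions,
    # not the real definition in operations/puppet.
    monitoring::service { 'https-phabricator':
        description   => 'https://phabricator.wikimedia.org',
        check_command => 'check_https_url!phabricator.wikimedia.org!/',
        critical      => true,   # page ops via SMS instead of IRC/email only
    }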

Is there an incident report from the outage on 2014-11-29 somewhere on wikitech.wikimedia.org?

Yes: https://wikitech.wikimedia.org/wiki/Incident_documentation/20141130-Eqiad-Rack-C4

Cool, thanks.

My take on this is we should have the Operations team assess whether installing a redundant network switch is feasible. (And perhaps investigate why the Redis job queue didn't fail over?)

I agree with @Dzahn that we can leave the current notification levels as they are for now. But as @MaxSem notes, if/when Phabricator becomes more than an issue tracker, we should probably increase the notification/service level.

My take on this is we should have the Operations team assess whether installing a redundant network switch is feasible.

Our network config is fairly robust as it is, and we have several switches. The design philosophy in general is that if something needs to be fault-tolerant, it is distributed over multiple boxes which don't share a rack (and thus PDUs and switches and whatnot). I think this fairly reflects the realities of the situation. Generally network and PDU hardware are significantly more resilient than an actual, complex Linux machine. If it made sense statistically to add a redundant switch to support a certain service, then it makes even more sense to add a redundant host in a separate rack, which gives you a redundant switch (and other bits) intrinsically.

The takeaway here for Phabricator resiliency is that if we want to be immune to isolated hardware failures, we need to structure it to support some kind of actual fault tolerance at the host machine level. Alternatively, if it's deemed that the added complexity for that isn't worth it in uptime terms, a compromise solution is to have an easy process for re-creating the phabricator service in another rack on another arbitrary host in cases where hardware can't be quickly revived remotely, which means good puppetization (which I believe we already have), timely backups of the data (perhaps very timely, e.g. a replicated database), and a written plan for how to quickly bring it up with live data on alternate hardware.
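As a rough illustration of that compromise, a hypothetical Puppet sketch: keep the same phabricator puppetization applicable to a standby host in another rack, plus a database replica so the live data already exists elsewhere. The node and role names below are made up for illustration and do not reflect the actual operations/puppet layout.

    # Hypothetical sketch only: node and role names are illustrative.
    # The idea is that full puppetization plus a replicated database lets
    # an arbitrary host in another rack take over the service.
    node 'phab-standby.example.wmnet' {
        include role::phabricator::main      # same puppetization as the primary
    }
    node 'db-phab-replica.example.wmnet' {
        include role::mariadb::phabricator   # near-real-time copy of the live data
    }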

Alternatively, if it's deemed that the added complexity for that isn't worth it in uptime terms, a compromise solution is to have an easy process for re-creating the phabricator service in another rack on another arbitrary host in cases where hardware can't be quickly revived remotely, which means good puppetization (which I believe we already have), timely backups of the data (perhaps very timely, e.g. a replicated database), and a written plan for how to quickly bring it up with live data on alternate hardware.

++

...and that written plan outlines the expectation of when to commit to the move vs when to wait it out. Or at least who makes the call. Seems like the most realistic outcome for this.

A written plan with expectations on how to handle all possible cases of downtime? For Phabricator? Sounds a wee bit overkill to me. Documentation on technical details and how to recover/replicate systems is great and very helpful during these incidents, but let's keep it at that. I'd like to continue to use common sense and good judgement, as we've always done. The time required for deciding on/writing policy manuals is better spent on improving service HA, should that be needed. :)

Just for the record, I checked with @RobLa-WMF and he also agreed to keep the current service level until further notice.

I'm resolving this task. I will leave the decision about required documentation to the Ops team, since I have little to add.