Test performance and reliability of a loaded Phabricator
Closed, Resolved, Public

Description

We are assuming that Phabricator can handle 100,000 tasks, hundreds of repos, and the contribution load that our Bugzilla + Gerrit handle, with good performance and uptime reliability. The theoretical case we are assuming is "If it works for Facebook, it should work for us".

Is there a way to run a simulation and test the results before Day 1? Having data about other big Phabricator instances would be good as well.

Details

Reference
fl322

Related Objects

Status    Assigned
Resolved  Dzahn
Resolved  Cmjohnson
Resolved  Dzahn
Resolved  Danny_B
Resolved  Paladox
Open      Dzahn
Resolved  demon
Resolved  demon
Resolved  Paladox
Resolved  Nemo_bis
Resolved  demon
Resolved  Paladox
Resolved  Krenair
Resolved  mmodell
Invalid   None
Declined  None
Resolved  demon
Resolved  chasemp

Event Timeline

flimport raised the priority of this task from to Medium. Sep 12 2014, 1:35 AM
flimport added a project: Gerrit-Migration.
flimport set Reference to fl322.

epriestley wrote on 2014-05-11 22:41:46 (UTC)

A few general notes:

  • You may be able to use bin/lipsum to generate "somewhat realistic" test data in order to populate an instance with a large number of fake objects, although it will take a while to build huge numbers of objects, probably has a few bugs, and may not be completely representative of what real data will look like. It also can only generate some types of objects (users, tasks, revisions) and not others (repositories).
  • Facebook's internal install has >1M reviews, very large repositories (tens of GB, million-ish commits), and millions of other objects. They use a separate task system, though. I'm not sure what the largest number of tasks on an install is. In the past, I've generated ~200-300K locally without issues.
  • The largest number of repositories I'm aware of on a single install is something like 600-700. The only scale issue I've heard about with this is organizing them in the UI (some discussion in https://secure.phabricator.com/T3102). It is somewhat common to have at least 100-200 repositories.
  • We probably have the largest number of user accounts (about 30K), although most aren't very active. Facebook probably has the largest number of active regular users (maybe around 1,000 using it daily?).

Things which do have limitations now:

  • The Node.js notification server doesn't scale well right now, because the architecture is very simple/dumb. https://secure.phabricator.com/T4324 discusses resolving this.
  • Phabricator can be spread across a small number of web/application servers (say, 4-8) without problems, but currently needs to retain a working copy of each repository on each application server, which won't scale to large numbers of servers (say, 100+). https://secure.phabricator.com/T4209 has a plan to resolve this.
  • Phabricator puts all hosted repositories on one machine right now. https://secure.phabricator.com/T4292 has a plan to resolve this.
  • Cloning repositories over HTTPS buffers an unnecessary amount of data in process memory, which can lead to problems if repositories have a large total size. This might be trivial to fix or very difficult to resolve; it mostly depends on whether we can detect webserver buffering behavior from within PHP. I think there's discussion somewhere but I can't immediately dig it up. SSH does not have these issues.
  • Phabricator is not efficient at managing very large files right now. You should find another solution for sharing files larger than, say, 16MB, or expect lackluster performance until we can put some work into this.
  • We use MySQL as a task queue to avoid having a dependency on a real queue server. It scales easily to tens of thousands of tasks per hour, but we may need to let installs swap it out for a real queue server eventually.
  • Spreading Phabricator across hosts isn't as easy as it could be; we'll build the tooling out as we build out the features it will support.
  • Ultimately, the database is the only bottleneck that we don't have a technical approach for escaping. We do have some approaches to mitigate it:
    • With a small amount of work, we can send reads to replicas. Some discussion in https://secure.phabricator.com/T1969
    • The database is structured so it can be partitioned vertically (each application already has a dedicated logical database, and we perform no joins between them).
    • We can offload a significant amount of the query load by introducing Memcache into the stack. A sizable chunk of the query load goes through an abstraction called "object handles" which was partly designed to be cacheable.
    • If any install ever manages to hit a scale that exceeds all of the headroom provided by these mechanisms, horizontal sharding isn't impossible. However, because of the one-install-per-organization nature of Phabricator, we don't currently anticipate any install ever having enough activity to surpass these limits.
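The "MySQL as a task queue" point above can be illustrated with a minimal sketch of a database-backed queue that hands out tasks under a lease. This is not Phabricator's actual implementation: SQLite stands in for MySQL so the example is self-contained, and the table layout and the function names (`open_queue`, `enqueue`, `lease_one`, `complete`) are hypothetical.

```python
import sqlite3
import time


def open_queue():
    """Create an in-memory queue table (SQLite stands in for MySQL here)."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE task ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " payload TEXT NOT NULL,"
        " lease_owner TEXT,"
        " lease_expires REAL NOT NULL DEFAULT 0)"
    )
    return db


def enqueue(db, payload):
    """Add a task and return its id."""
    with db:
        cur = db.execute("INSERT INTO task (payload) VALUES (?)", (payload,))
    return cur.lastrowid


def lease_one(db, worker, lease_seconds=60.0, now=None):
    """Claim the oldest unleased (or lease-expired) task.

    Returns (id, payload), or None if nothing is available.
    """
    now = time.time() if now is None else now
    with db:
        row = db.execute(
            "SELECT id, payload FROM task "
            "WHERE lease_expires <= ? ORDER BY id LIMIT 1",
            (now,),
        ).fetchone()
        if row is None:
            return None
        # Conditional UPDATE: it only matches if the lease is still free,
        # so a worker that lost a race gets rowcount == 0.
        cur = db.execute(
            "UPDATE task SET lease_owner = ?, lease_expires = ? "
            "WHERE id = ? AND lease_expires <= ?",
            (worker, now + lease_seconds, row[0], now),
        )
        if cur.rowcount != 1:
            return None
    return row


def complete(db, task_id, worker):
    """Delete a finished task, but only if this worker holds the lease."""
    with db:
        cur = db.execute(
            "DELETE FROM task WHERE id = ? AND lease_owner = ?",
            (task_id, worker),
        )
    return cur.rowcount == 1
```

The lease expiry is what lets a task be retried if a worker dies mid-task; a dedicated queue server would typically provide the same guarantee with less load on the database.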

Broadly:

  • We believe we're in good shape at the scales discussed, subject to the caveats above.
  • Scalability is important to us, and if you do hit scalability issues we'll consider it a priority to resolve them.
  • We have experience with working at scale at Facebook (I spent a good deal of my time there working on infrastructure and performance) and Phabricator's architecture is largely based on proven approaches from Facebook's architecture, so we have at least some reason to believe these approaches are realistic and workable.
  • The exception is that the architecture does not anticipate horizontal sharding, because we don't think any organization will ever require it.

qgil wrote on 2014-05-13 19:21:03 (UTC)

Thank you @epriestley for what could be an Ars Technica article with just a bit of copy editing!

Our community is a bit concerned about the uptime of the one and only tool we would all be using. Today, if Bugzilla goes down, at least you can entertain yourself in Gerrit, etc. It is a legitimate concern, but I hope it can be rebutted with tests and facts.

CCing the people who can follow up in detail.

bd808 wrote on 2014-05-15 01:22:29 (UTC)

@epriestley Does any of your analysis change based on the PHP interpreter in use? More specifically is this data based on using HHVM as the runtime?

epriestley wrote on 2014-05-15 01:25:19 (UTC)

This data is based on normal Zend PHP. We don't formally support HHVM, but you might be able to get Facebook to help you if you run into issues (or maybe not).

Some discussion here:

http://www.quora.com/HipHop-for-PHP/What-are-the-downsides-of-using-HipHop-for-PHP
http://www.quora.com/Phabricator/Does-Phabricator-run-on-HipHop-for-PHP

mmodell wrote on 2014-07-29 16:39:09 (UTC)

I believe that this is a non-issue. I trust Evan's input on this and I really don't foresee any performance surprises.

Nemo_bis wrote on 2014-08-09 10:41:08 (UTC)

https://reviews.facebook.net/robots.txt
I don't know what this "/diffusion/" thing is, but it's not nice to be uncrawlable. We really need things to be searchable; see https://www.mediawiki.org/wiki/Wikimedia_technical_search. Are there docs on why they have such a robots.txt?

aklapper wrote on 2014-08-09 11:41:44 (UTC)

In T322#15, @Nemo_bis wrote:

I don't know what this "/diffusion/" thing is, but it's not nice to be uncrawlable.

Diffusion is the code repository browser. See https://secure.phabricator.com/book/phabricator/article/diffusion/

Nemo_bis wrote on 2014-08-15 10:00:51 (UTC)

Are there docs on why they have such a robots.txt?

The question stands.

Another thing: in https://reviews.facebook.net/diffusion/query/all/ I only see 15 repositories (3 before logging in). Is there a Phabricator install where I can see a listing of about a thousand repositories? MediaWiki's README currently links to a page that takes 20 s to load, with 10 s to first byte. http://www.webpagetest.org/result/140815_K2_C14/1/details/
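For comparing numbers like "10 s to first byte, 20 s to load" across installs, a rough measurement can be scripted. The sketch below is only an assumption about how one might do this, using Python's standard library; `time_request` is a hypothetical helper, and it times only the HTML response, not the sub-resources a browser-based tool would also fetch.

```python
import time
import urllib.request


def time_request(url, timeout=30.0, opener=urllib.request.urlopen):
    """Return (seconds_to_first_body_byte, seconds_to_full_body) for one GET.

    `opener` is injectable so the function can be exercised without a network.
    """
    start = time.perf_counter()
    with opener(url, timeout=timeout) as resp:
        resp.read(1)  # blocks until the first byte of the body arrives
        first_byte = time.perf_counter() - start
        resp.read()   # drain the rest of the body
        total = time.perf_counter() - start
    return first_byte, total


if __name__ == "__main__":
    # Example target taken from the thread; results depend on network and login state.
    ttfb, total = time_request("https://reviews.facebook.net/diffusion/query/all/")
    print(f"first byte: {ttfb:.2f} s, total: {total:.2f} s")
```

Repeating the request several times and comparing medians would give a fairer picture than a single sample.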

Rush wrote on 2014-08-15 16:55:45 (UTC)

As far as I know, the repository transition for code review is down the road; the only things agreed upon to migrate are RT and Bugzilla, with the addendum that we welcome the outliers (Trello, etc.). My impression was that this ticket referred to issue/ticket management and performance, which would put the Differential use case out of scope.

qgil wrote on 2014-08-15 17:04:15 (UTC)

I think we can consider this task not-a-problem for Day 1 (since enough evidence exists on Maniphest performance) and keep it under #code_review_in_phabricator only.

Qgil lowered the priority of this task from Medium to Low. Oct 9 2014, 11:14 PM
Qgil added a subscriber: Qgil.

Is it reasonable to performance-compare gitblit and diffusion on the same Labs box somehow?

See @epriestley's reply above, providing some data about the dimensions of Facebook's code review instance. I'm not saying that replies solve all concerns, and I'm not even pretending to know anything about performance myself. Just connecting new questions with old answers. :)

In T192#12661, @Qgil wrote:

Is it reasonable to performance-compare gitblit and diffusion on the same Labs box somehow?

See @epriestley's reply above, providing some data about the dimensions of Facebook's code review instance. I'm not saying that replies solve all concerns, and I'm not even pretending to know anything about performance myself. Just connecting new questions with old answers. :)

Thanks. Given gitblit's, umm, poor record I'm willing to JFDI at least for a few repos and go from there.

MediaWiki is a good example to start with. It's by far the largest and there's no question what the callsign will be.

In T192#13013, @Chad wrote:

Thanks. Given gitblit's, umm, poor record I'm willing to JFDI at least for a few repos and go from there.

MediaWiki is a good example to start with. It's by far the largest and there's no question what the callsign will be.

Just to make sure we're all exactly agreed, I've created a clone of MediaWiki in the test Phabricator instance for perusal.

chasemp claimed this task.

We ran into several performance problems during the large task migrations and have either mitigated or solved the ones I am aware of. As this instance continues to grow we will need (I'm sure) to revisit the resources allocated, but I would rather that be in a distinct and actionable task.