Test performance and reliability of a loaded Phabricator
Closed, Resolved, Public

Assigned To

Authored By

	Qgil
	May 11 2014, 9:17 PM

Description

We are assuming that Phabricator can handle 100,000 tasks, hundreds of repos, and the contribution load that our Bugzilla + Gerrit handle today, with good performance and uptime reliability. The theoretical case we are assuming is "If it works for Facebook, it should work for us".

Is there a way to run a simulation and test the results before Day 1? Having data about other big Phabricator instances would be good as well.
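
One way to approach such a simulation is to time a repeated action against a test instance and compare the summary figures between runs. The following is a minimal Python sketch under stated assumptions, not a Phabricator tool: `fake_page_load` is a hypothetical stand-in for "fetch one page from a test instance", and all function names are invented for illustration.

```python
import math
import statistics
import time
from typing import Callable, Dict, List


def time_calls(action: Callable[[], object], runs: int) -> List[float]:
    """Call `action` `runs` times and return each duration in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        action()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations: List[float]) -> Dict[str, float]:
    """Reduce raw durations to a few figures worth comparing between runs."""
    ordered = sorted(durations)
    # Nearest-rank 95th percentile: the smallest sample such that at least
    # 95% of all samples are less than or equal to it.
    rank = max(1, math.ceil(0.95 * len(ordered)))
    return {
        "count": len(ordered),
        "median": statistics.median(ordered),
        "p95": ordered[rank - 1],
        "max": ordered[-1],
    }


def fake_page_load() -> None:
    # Hypothetical stand-in: replace with a real request to a test instance.
    time.sleep(0.001)


if __name__ == "__main__":
    print(summarize(time_calls(fake_page_load, runs=50)))
```

Reporting the median together with a high percentile, rather than only an average, makes slow outliers visible, which matters for the uptime and responsiveness concerns raised here.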

Details

Reference: fl322

Related Objects

Status	Assigned	Task
Resolved	Dzahn	T123525 reduce amount of remaining Ubuntu 12.04 (precise) systems in production
Resolved	• Cmjohnson	T138978 decom antimony (datacenter)
Resolved	Dzahn	T123718 Phase out antimony.wikimedia.org (git.wikimedia.org / gitblit)
Resolved	Danny_B	T138986 Archive #Gitblit-Deprecate
Resolved	Paladox	T137353 Update all on-wiki references to git.wikimedia.org and replace them with the Phabricator equivalent
Open	Dzahn	T323073 Make https://git.wikimedia.org not redirect to Phabricator Diffusion
Resolved	• demon	T111465 [keyresult] Deprecate gitblit in favor of Diffusion
Resolved	• demon	T752 Use Diffusion as canonical location for browsing code repos (not gitblit)
Resolved	Paladox	T108864 Update mediawiki.org templates to link to Diffusion, not gitblit
Resolved	Nemo_bis	T101358 Update {{git file}} to link to diffusion
Resolved	• demon	T110607 redirect gerrit repo paths to diffusion callsigns
Resolved	Paladox	T111887 Diffusion replacement for tarfile download from git.wikimedia.org
Resolved	Krenair	T122674 Make all /r/project/* paths in phabricator accessible without login
Resolved	• mmodell	T129447 Diffusion redirect from name to callsign doesn't always work
Invalid	None	T135336 Update Module:Callsigns in mediawiki.org
Declined	None	T76788 ExtensionDistributor should use Phabricator/Diffusion instead of Gerrit
Resolved	• demon	T616 Import all gerrit.wikimedia.org repositories with Diffusion
Resolved	• chasemp	T192 Test performance and reliability of a loaded Phabricator

Event Timeline

epriestley wrote on 2014-05-11 22:41:46 (UTC)

A few general notes:

You may be able to use bin/lipsum to generate "somewhat realistic" test data in order to populate an instance with a large number of fake objects, although it will take a while to build huge numbers of objects, probably has a few bugs, and may not be completely representative of what real data will look like. It also can only generate some types of objects (users, tasks, revisions) and not others (repositories).
Facebook's internal install has >1M reviews, very large repositories (tens of GB, million-ish commits), and millions of other objects. They use a separate task system, though. I'm not sure what the largest number of tasks on an install is. In the past, I've generated ~200-300K locally without issues.
The largest number of repositories I'm aware of on a single install is something like 600-700. The only scale issue I've heard about with this is organizing them in the UI (some discussion in https://secure.phabricator.com/T3102). It is somewhat common to have at least 100-200 repositories.
We probably have the largest number of user accounts (about 30K), although most aren't very active. Facebook probably has the largest number of active regular users (maybe around 1,000 using it daily?).

Things which do have limitations now:

The Node.js notification server doesn't scale well right now, because the architecture is very simple/dumb. https://secure.phabricator.com/T4324 discusses resolving this.
Phabricator can be spread across a small number of web/application servers (say, 4-8) without problems, but currently needs to retain a working copy of each repository on each application server, which won't scale to large numbers of servers (say, 100+). https://secure.phabricator.com/T4209 has a plan to resolve this.
Phabricator puts all hosted repositories on one machine right now. https://secure.phabricator.com/T4292 has a plan to resolve this.
Cloning repositories over HTTPS buffers an unnecessary amount of data in process memory, which can lead to problems if repositories have a large total size. This might be trivial to fix or might be very difficult to resolve; it mostly depends on whether we can detect webserver buffering behavior from within PHP. I think there's discussion somewhere but I can't immediately dig it up. SSH does not have these issues.
Phabricator is not efficient at managing very large files right now. You should find another solution for sharing files larger than, say, 16MB, or expect lackluster performance until we can put some work into this.
We use MySQL as a task queue to avoid having a dependency on a real queue server. It scales easily to tens of thousands of tasks per hour, but we may need to let installs swap it out for a real queue server eventually.
Spreading Phabricator across hosts isn't as easy as it could be; we'll build the tooling out as we build out the features it will support.
Ultimately, the database is the only bottleneck that we don't have a technical approach for escaping. We do have some approaches to mitigate it:
- With a small amount of work, we can send reads to replicas. Some discussion in https://secure.phabricator.com/T1969
- The database is structured so it can be partitioned vertically (each application already has a dedicated logical database, and we perform no joins between them).
- We can offload a significant amount of the query load by introducing Memcache into the stack. A sizable chunk of the query load goes through an abstraction called "object handles" which was partly designed to be cacheable.
- If any install ever manages to hit a scale that exceeds all of the headroom provided by these mechanisms, horizontal sharding isn't impossible. However, because of the one-install-per-organization nature of Phabricator, we don't currently anticipate any install ever having enough activity to surpass these limits.
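
The first two mitigations above (reads to replicas, one logical database per application) can be sketched as a small routing table. This is only an illustration in Python, not Phabricator's implementation: the `AppDatabase` and `Router` classes and all host names are hypothetical.

```python
import itertools
from typing import Dict, Iterable


class AppDatabase:
    """Database hosts serving one application's logical database."""

    def __init__(self, primary: str, replicas: Iterable[str] = ()) -> None:
        self.primary = primary
        # Round-robin over replicas; fall back to the primary if there are none.
        self._readers = itertools.cycle(list(replicas) or [primary])

    def host_for(self, write: bool) -> str:
        # Writes always go to the primary; reads are spread over replicas.
        return self.primary if write else next(self._readers)


class Router:
    """Maps application name -> hosts (vertical partitioning).

    This only works because no query joins across applications.
    """

    def __init__(self, layout: Dict[str, AppDatabase]) -> None:
        self._layout = layout

    def host_for(self, application: str, write: bool = False) -> str:
        return self._layout[application].host_for(write)


if __name__ == "__main__":
    # Hypothetical host names, for illustration only.
    router = Router({
        "maniphest": AppDatabase("db1", ["db1-r1", "db1-r2"]),
        "differential": AppDatabase("db2"),
    })
    print(router.host_for("maniphest", write=True))
    print(router.host_for("maniphest"))
```

A real deployment would also have to deal with replication lag (a read issued right after a write may not see it on a replica), which this sketch ignores.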

Broadly:

We believe we're in good shape at the scales discussed, subject to the caveats above.
Scalability is important to us, and if you do hit scalability issues we'll consider it a priority to resolve them.
We have experience with working at scale at Facebook (I spent a good deal of my time there working on infrastructure and performance) and Phabricator's architecture is largely based on proven approaches from Facebook's architecture, so we have at least some reason to believe these approaches are realistic and workable.
The exception is that the architecture does not anticipate horizontal sharding, because we don't think any organization will ever require it.
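
To illustrate the "database as a task queue" idea mentioned in the limitations above, here is a minimal lease-based sketch in Python. It uses `sqlite3` as a stand-in for MySQL purely so the example is self-contained, and the table, columns, and function names are hypothetical; it does not reflect Phabricator's actual schema or worker code.

```python
import sqlite3
import time
from typing import Optional, Tuple


def open_queue() -> sqlite3.Connection:
    """Create an in-memory queue table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE task ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " payload TEXT NOT NULL,"
        " lease_owner TEXT,"
        " lease_expires REAL)"
    )
    return conn


def enqueue(conn: sqlite3.Connection, payload: str) -> int:
    cur = conn.execute("INSERT INTO task (payload) VALUES (?)", (payload,))
    conn.commit()
    return cur.lastrowid


def lease(conn: sqlite3.Connection, worker: str, seconds: float = 60.0,
          now: Optional[float] = None) -> Optional[Tuple[int, str]]:
    """Claim the oldest task that is unleased or whose lease has expired."""
    now = time.time() if now is None else now
    row = conn.execute(
        "SELECT id, payload FROM task"
        " WHERE lease_owner IS NULL OR lease_expires < ?"
        " ORDER BY id LIMIT 1",
        (now,),
    ).fetchone()
    if row is None:
        return None
    # The UPDATE repeats the lease check, so if a competing worker claimed
    # the row in the meantime this statement matches zero rows.
    cur = conn.execute(
        "UPDATE task SET lease_owner = ?, lease_expires = ?"
        " WHERE id = ? AND (lease_owner IS NULL OR lease_expires < ?)",
        (worker, now + seconds, row[0], now),
    )
    conn.commit()
    return (row[0], row[1]) if cur.rowcount == 1 else None


def complete(conn: sqlite3.Connection, task_id: int) -> None:
    """Remove a finished task so it is never leased again."""
    conn.execute("DELETE FROM task WHERE id = ?", (task_id,))
    conn.commit()
```

The lease expiry is what lets the queue recover from a crashed worker: its task simply becomes claimable again once the lease runs out, with no separate queue server involved.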

qgil wrote on 2014-05-13 19:21:03 (UTC)

Thank you @epriestley for what could be an Ars Technica article with just a bit of copy editing!

Our community is a bit concerned about the uptime of the one and only tool we would all be using. Right now, if Bugzilla goes down, at least you can keep yourself busy in Gerrit, etc. It is a legitimate concern, but I hope it can be rebutted with tests and facts.

CCing the people that can follow up in detail.

bd808 wrote on 2014-05-15 01:22:29 (UTC)

@epriestley Does any of your analysis change based on the PHP interpreter in use? More specifically is this data based on using HHVM as the runtime?

epriestley wrote on 2014-05-15 01:25:19 (UTC)

This data is based on normal Zend PHP. We don't formally support HHVM, but you might be able to get Facebook to help you if you run into issues (or maybe not).

Some discussion here:

http://www.quora.com/HipHop-for-PHP/What-are-the-downsides-of-using-HipHop-for-PHP
http://www.quora.com/Phabricator/Does-Phabricator-run-on-HipHop-for-PHP

mmodell wrote on 2014-07-29 16:39:09 (UTC)

I believe that this is a non-issue. I trust Evan's input on this and I really don't foresee any performance surprises.

Nemo_bis wrote on 2014-08-09 10:41:08 (UTC)

https://reviews.facebook.net/robots.txt
I don't know what this "/diffusion/" thing is, but it's not nice to be uncrawlable. We really need things to be searchable; see https://www.mediawiki.org/wiki/Wikimedia_technical_search. Are there docs on why they have such a robots.txt?
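
Whether a given path is crawlable under a set of robots.txt rules can be checked with Python's standard `urllib.robotparser`. The sketch below is illustrative only: the rules in `EXAMPLE_ROBOTS_TXT` are hypothetical and are not the actual contents of any site's robots.txt, and `crawlable` is a helper name invented here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration; NOT copied from any real site.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /diffusion/
"""


def crawlable(robots_txt: str, path: str, agent: str = "*") -> bool:
    """Return True if `agent` may fetch `path` under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    for path in ("/diffusion/X/", "/T192"):
        print(path, crawlable(EXAMPLE_ROBOTS_TXT, path))
```

Running the same check against a real install's rules would show which parts of it (repository browser, tasks, and so on) search engines are allowed to index.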

aklapper wrote on 2014-08-09 11:41:44 (UTC)

In T322#15, @Nemo_bis wrote:

I don't know what this "/diffusion/" thing is, but it's not nice to be uncrawlable.

Diffusion is the code repository browser. See https://secure.phabricator.com/book/phabricator/article/diffusion/

Nemo_bis wrote on 2014-08-15 10:00:51 (UTC)

Are there docs on why they have such a robots.txt?

The question stands.

Another thing: in https://reviews.facebook.net/diffusion/query/all/ I only see 15 repositories (3 before logging in). Is there a Phabricator install where I can see a listing of about a thousand repositories? MediaWiki's README currently links to a page which takes 20 s to load and 10 s to first byte: http://www.webpagetest.org/result/140815_K2_C14/1/details/

Rush wrote on 2014-08-15 16:55:45 (UTC)

As far as I know, the repository transition for code review is further down the road; the only things agreed upon for migration are RT and Bugzilla, with the addendum that we welcome the outliers (Trello, etc.). My impression was that this ticket referred to issue/ticket management and performance, which would put the Differential use case out of scope.

qgil wrote on 2014-08-15 17:04:15 (UTC)

I think we can consider this task not-a-problem for Day 1 (since enough evidence exists on Maniphest performance) and keep it under #code_review_in_phabricator only.

• flimport added a subscriber: • chasemp. Oct 1 2014, 10:33 PM

• flimport added a subscriber: • mmodell. Oct 1 2014, 10:38 PM

• flimport added a subscriber: Aklapper. Oct 1 2014, 11:03 PM

• flimport added a subscriber: greg. Oct 2 2014, 9:58 PM

• flimport added a subscriber: bd808. Oct 3 2014, 3:00 PM

• flimport added a subscriber: scfc. Oct 7 2014, 3:01 AM

Qgil lowered the priority of this task from Medium to Low. Oct 9 2014, 11:14 PM

Qgil subscribed.

In T616#12592, @Jdforrester-WMF wrote:

Is it reasonable to performance-compare gitblit and diffusion on the same Labs box somehow?

See @epriestley's reply above, providing some data about the dimensions of Facebook's code review instance. I'm not saying that replies solve all concerns, and I'm not even pretending to know anything about performance myself. Just connecting new questions with old answers. :)

In T192#12661, @Qgil wrote:

In T616#12592, @Jdforrester-WMF wrote:

Is it reasonable to performance-compare gitblit and diffusion on the same Labs box somehow?

See @epriestley's reply above, providing some data about the dimensions of Facebook's code review instance. I'm not saying that replies solve all concerns, and I'm not even pretending to know anything about performance myself. Just connecting new questions with old answers. :)

Thanks. Given gitblit's, umm, poor record I'm willing to JFDI at least for a few repos and go from there.

In T192#13003, @Jdforrester-WMF wrote:

In T192#12661, @Qgil wrote:

In T616#12592, @Jdforrester-WMF wrote:

Is it reasonable to performance-compare gitblit and diffusion on the same Labs box somehow?

See @epriestley's reply above, providing some data about the dimensions of Facebook's code review instance. I'm not saying that replies solve all concerns, and I'm not even pretending to know anything about performance myself. Just connecting new questions with old answers. :)

Thanks. Given gitblit's, umm, poor record I'm willing to JFDI at least for a few repos and go from there.

MediaWiki is a good example to start with. It's by far the largest and there's no question what the callsign will be.

In T192#13013, @Chad wrote:

In T192#13003, @Jdforrester-WMF wrote:

Thanks. Given gitblit's, umm, poor record I'm willing to JFDI at least for a few repos and go from there.

MediaWiki is a good example to start with. It's by far the largest and there's no question what the callsign will be.

Just to make sure we're all exactly agreed, I've created a clone of MediaWiki in the test Phabricator instance for perusal.

We ran into several performance problems during the large task migrations and have either mitigated or solved the ones I am aware of. As this instance continues to grow we will (I'm sure) need to revisit the allocated resources, but I would rather that be tracked in a distinct and actionable task.

Qgil mentioned this in T160: Configure the size limit of the file upload configuration for tasks to a higher limit than 10MB. Apr 6 2015, 6:12 PM

greg moved this task from To Triage to Done/Archive on the Gerrit-Migration board. Sep 24 2015, 11:35 PM
