
[Spike] Benchmark built-in HTTP server options for scap3 fanout
Closed, Resolved · Public · 2 Estimated Story Points

Description

Goals

  • Ideally we want minimal dependencies and zero configuration for ease of maintenance.
  • Something that's integrated into Python would be good, but a standalone binary would also work.
  • Lightweight is desirable; that said, performance should not be a major issue with only ~40 simultaneous connections.

Our options thus far:

| name             | language | depends on                           | comments |
|------------------|----------|--------------------------------------|----------|
| gpack            | Python   | git, gevent                          | Git "smart" protocol (HTTP) server implemented in Python as a WSGI app. |
| aiohttp          | Python 3 | Python >= 3.4.1, chardet or cchardet | Simple asyncio HTTP implementation. Requires just a bit of boilerplate Python code to manage the server (sketch below). |
| twistd           | Python   | A bunch of them                      | twisted does include a standalone web server as an example, and packages are in Debian/Ubuntu. It fails the 'lightweight' and 'simple' goals, though. |
| SimpleHTTPServer | Python   | Python                               | No extra dependencies and no configuration required. Unacceptable performance and scalability. Renamed to http.server in Python 3.x. |
| Gunicorn         | Python   | Ubuntu >= 12.04 or Debian Jessie     | WSGI server front-end. This or uWSGI could be used as a front-end for gpack or sina. |
| uWSGI            | Python   | ?                                    | WSGI server front-end. Apparently this is already used in production as a front-end for other WMF services. |
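To give a sense of the "bit of boilerplate" the aiohttp option needs, here is a minimal sketch of a static file server exposing a bare git repository over the dumb HTTP protocol. The repository path and port are placeholders, and it uses the current aiohttp web API, which may differ from the version that was current when this task was filed.

  # Rough sketch, not production code: serve a bare git repo over HTTP with
  # aiohttp, roughly what "python -m SimpleHTTPServer" does but on an
  # asyncio event loop. The repo path and port below are placeholders.
  from aiohttp import web

  app = web.Application()
  # Expose the repository directory as static files so "dumb" protocol
  # clients can fetch info/refs and the pack files directly.
  app.router.add_static('/', path='/srv/mediawiki.git', show_index=True)

  if __name__ == '__main__':
      web.run_app(app, port=8000)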

Performance/dependency testing/benchmark things

Event Timeline

dduvall moved this task from Needs triage to Experiments on the Scap board.
mmodell updated the task description.

Played around with an idea this evening: I wrote a Python script that spun up 11 DigitalOcean droplets.
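For reference, here is a rough sketch of how such a script could spin up droplets with the python-digitalocean library; this is not the actual script, and the token, names, region, image, and size below are placeholders.

  # Hypothetical sketch using the python-digitalocean library; not the
  # script actually used for this test. All values below are placeholders.
  import digitalocean

  TOKEN = 'YOUR_DO_API_TOKEN'

  def spin_up(count=11, prefix='bench'):
      droplets = []
      for i in range(count):
          droplet = digitalocean.Droplet(
              token=TOKEN,
              name='{}-{}'.format(prefix, i),
              region='nyc3',
              image='ubuntu-14-04-x64',
              size_slug='512mb',
          )
          droplet.create()  # provisioning happens asynchronously on DO's side
          droplets.append(droplet)
      return droplets

  if __name__ == '__main__':
      spin_up()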

On droplet 1 I ran:

git clone https://github.com/wikimedia/mediawiki.git
cd mediawiki
git update-server-info    # generate the metadata needed for "dumb" HTTP cloning
cd .git
python -m SimpleHTTPServer    # serve the bare repo over HTTP on port 8000

On the remaining 10 droplets, I used clustershell to run git clone http://[droplet1-ip]:8000/ cats in parallel.

The droplets were named fuck-do-it[0-9]. You can see their output from clustershell below:

fuck-do-it[1-3,5,8-9]: Cloning into 'cats'...
fatal: unable to access 'http://10.132.28.178:8000/': Recv failure: Connection reset by peer
fuck-do-it[0,6]: Cloning into 'cats'...
error: Unable to get pack file http://10.132.28.178:8000/objects/pack/pack-9a9078d4d54b056d12b20dc730e55891b265f453.pack
Recv failure: Connection reset by peer
error: Unable to find d4d086b0cee47b8110eeb12bf870b7a75bd0e05a under http://10.132.28.178:8000
Cannot obtain needed object d4d086b0cee47b8110eeb12bf870b7a75bd0e05a
error: fetch failed.
fuck-do-it[4,7]: Cloning into 'cats'...
Checking out files: 100% (6065/6065), done.
Checking out files: 100% (6065/6065)

Only 2/10 droplets managed to get 100% of the clone :(

Running git -C cats status across the cluster shows:

fuck-do-it[0-3,5-6,8-9]: fatal: Cannot change to 'cats': No such file or directory
fuck-do-it[4,7]: On branch master
Your branch is up-to-date with 'origin/master'.
nothing to commit, working directory clean

I'm confident in saying that SimpleHTTPServer should be out of the race.

aiohttp seems to have decent performance, judging from someone else's testing:
https://groups.google.com/forum/#!topic/aio-libs/3TaOnHqryn0

It should comfortably handle ~40 simultaneous connections.

I'll put together a test or two.
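As a starting point, a test could fire ~40 concurrent requests at the server with aiohttp's client and check that they all complete. A rough sketch follows; the target URL and concurrency level are placeholders, and it assumes Python 3.5+ for the async/await syntax.

  # Sketch of a simple concurrency check using the aiohttp client; the
  # target URL and request count are placeholders.
  import asyncio
  import aiohttp

  URL = 'http://localhost:8000/info/refs'
  CONCURRENCY = 40

  async def fetch(session, i):
      async with session.get(URL) as resp:
          body = await resp.read()
          return i, resp.status, len(body)

  async def main():
      async with aiohttp.ClientSession() as session:
          return await asyncio.gather(
              *(fetch(session, i) for i in range(CONCURRENCY)))

  if __name__ == '__main__':
      loop = asyncio.get_event_loop()
      for i, status, size in loop.run_until_complete(main()):
          print('client {}: HTTP {}, {} bytes'.format(i, status, size))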

@mmodell that gpack thing is pretty cool, seems to work well.

I forked it and made some tweaks: https://github.com/thcipriani/gpack
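Since gpack is a WSGI app, running it behind gevent is straightforward. A minimal sketch follows; the application import is hypothetical, so check the repo for the actual entry point.

  # Minimal sketch of serving a WSGI app (such as gpack) with gevent's
  # built-in WSGI server. The gpack import below is hypothetical; see the
  # repo for its real entry point.
  from gevent.pywsgi import WSGIServer
  from gpack import application  # hypothetical import path

  if __name__ == '__main__':
      WSGIServer(('0.0.0.0', 8000), application).serve_forever()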

I ran the same DigitalOcean test, this time with 30 instances, using my quick-and-dirty DigitalOcean script that spins up lots of instances and runs lots of commands in parallel: https://gist.github.com/thcipriani/6c2edf3fa8161aa79676

It worked great: all boxes were able to pull down the repo fairly quickly and reliably.

@thcipriani: I tested it also, and had similarly good results.

mmodell set the point value for this task to 2.

gpack seems to meet all of our needs.