
Implement connection pooling for elasticsearch connections
Closed, ResolvedPublic

Description

Setting up an HTTPS connection increases the connection overhead from 2-3ms to 30-40ms. This would double the 95th percentile completion suggester latency and increase the 50th percentile by 5x. Implement connection pooling to spread this initialization cost across many requests rather than http

Event Timeline

Restricted Application added a subscriber: Aklapper. · Mar 17 2016, 4:04 PM
Restricted Application added a project: Discovery. · Mar 17 2016, 4:05 PM

curl_init_pooled, an HHVM builtin, looks like a reasonable way to implement connection pooling. https://gerrit.wikimedia.org/r/#/c/277919/ has been merged; it configures two pools with default values, one for eqiad and one for codfw. https://gerrit.wikimedia.org/r/#/c/277907/ has also been merged; it adds a transport for the Elastica library that uses these connection pools.

Still todo:

  • Evaluate the default settings. I don't know whether the pool size of 5 is too small or reasonable. Elasticsearch averages 600 open connections across the entire cluster, so our average is much lower than 5 per server, but that doesn't account for peaks. Additionally, the default timeout is 5 seconds; compared to our 95th percentile time of 100ms, this is absolutely massive.
  • Improve HHVM error handling of exhausted pools. Currently HHVM throws a fatal exception on pool exhaustion.
  • Improve visibility into how long cirrus spends waiting for a handle to become available. If we are waiting for handles, as opposed to waiting for the network, we need to know about that.

Maybe todo:

  • Make connection pools runtime configurable, rather than having to be deployed as part of puppet. This limitation means that the entire cluster of hhvm servers needs a rolling restart to update one connection pool setting.
Gehel added a subscriber: Gehel. Mar 22 2016, 9:16 AM

About the sizing of the pool, I'm looking into graphite to see if we have any data on the number of open connections to elasticsearch. I can't find any (but there are sooo many metrics, I might just have missed it). If we do introduce a pool, we should probably have metrics about its utilization (current size of the pool, number of active connections, wait time on the pool, ...). Is this something simple enough to put in place?

There is some info about open connections in the elasticsearch-percentiles graph iirc. This info comes from elasticsearch itself and typically maxes out in the 600-700 range.

Metrics are a little harder. We can measure the wait time for a pooled handle in PHP, but open connections are handled down at the curl level, below the pool itself. Basically the pool is for handles, and these handles are created when the pool is initialized. Whether they are open or not isn't exposed.
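
As a rough illustration of measuring that wait time in PHP, here is a minimal sketch. It is not the code from the patches on this task: timedAcquire() is a hypothetical helper name, and the closure is a stand-in that sleeps for about 2 ms in place of a real pool, so the snippet runs on plain PHP as well as HHVM.

```php
<?php
// Hypothetical helper: time how long any "acquire a handle" callable takes.
// Returns array(handle, wait in milliseconds).
function timedAcquire($acquire) {
    $start = microtime(true);
    $handle = call_user_func($acquire);
    $waitMs = (microtime(true) - $start) * 1000.0;
    return array($handle, $waitMs);
}

// Stand-in for a pool that makes the caller wait roughly 2 ms for a handle.
// This is an assumption for the demo, not real pool behaviour.
$slowAcquire = function () {
    usleep(2000);
    return 'placeholder-handle';
};

list($handle, $waitMs) = timedAcquire($slowAcquire);
printf("waited %.2f ms for %s\n", $waitMs, $handle);
```

On HHVM the callable would presumably wrap curl_init_pooled() with a pool name and URL, and the measured value would be sent to whatever stats backend is in use. Timing only the acquisition keeps time spent waiting for a handle separate from time spent waiting on the network.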

We do know, though, that elasticsearch allows HTTP connections to be held open indefinitely, so it might be safe to assume all are open. The pool uses the LIFO std::stack from C++, though, so if the pool has never been completely utilized that won't be true. HHVM does have a stats API, and I'm sure this information could be sourced on the C++ side. I'll see if it could be added without too much pain.
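
To make the LIFO point concrete, here is a toy model. It is only a sketch under stated assumptions: ToyLifoPool and distinctHandlesUsed() are hypothetical names, the handles are plain strings, and nothing here mirrors HHVM's actual C++ implementation. It shows that a stack-ordered pool only ever touches as many handles as the peak concurrency it has seen.

```php
<?php
// Toy LIFO pool of placeholder handles (an illustration, not HHVM's code).
class ToyLifoPool {
    private $idle = array(); // used as a stack: push and pop at the end

    public function __construct($size) {
        for ($i = 0; $i < $size; $i++) {
            $this->idle[] = "handle-$i";
        }
    }

    public function acquire() {
        if (count($this->idle) === 0) {
            throw new RuntimeException('pool exhausted');
        }
        return array_pop($this->idle); // most recently released handle first
    }

    public function release($handle) {
        $this->idle[] = $handle;
    }
}

// Count how many distinct handles get used when $concurrency handles are
// held at once, repeated for $rounds rounds.
function distinctHandlesUsed($poolSize, $concurrency, $rounds) {
    $pool = new ToyLifoPool($poolSize);
    $seen = array();
    for ($r = 0; $r < $rounds; $r++) {
        $held = array();
        for ($c = 0; $c < $concurrency; $c++) {
            $h = $pool->acquire();
            $seen[$h] = true;
            $held[] = $h;
        }
        foreach ($held as $h) {
            $pool->release($h);
        }
    }
    return count($seen);
}

echo distinctHandlesUsed(5, 1, 100), "\n"; // serial use touches one handle
echo distinctHandlesUsed(5, 2, 100), "\n"; // two at a time touches two
```

In this model a pool of 5 that only ever serves one request at a time keeps reusing the same handle, so the other four would never have opened a connection.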

Note though that upstreaming these things may take time, and @Joe has indicated that due to past issues we will only use backports of accepted patches in prod hhvm instances. Once the appropriate code has been put together we should expect a couple weeks to a month or more of turnaround time.

Change 279064 had a related patch set uploaded (by EBernhardson):
Collect timing information for getting a pooled curl handle

https://gerrit.wikimedia.org/r/279064

Change 279064 merged by jenkins-bot:
Collect timing information for getting a pooled curl handle

https://gerrit.wikimedia.org/r/279064

Change 279380 had a related patch set uploaded (by Gehel):
CirrusSearch labs configured to use HTTPS connection pool

https://gerrit.wikimedia.org/r/279380

Change 279380 merged by Gehel:
CirrusSearch labs configured to use HTTPS connection pool

https://gerrit.wikimedia.org/r/279380

Deskana triaged this task as Medium priority. Mar 24 2016, 4:38 PM
Deskana moved this task from Needs triage to On Sprint Board on the Discovery board.

Mentioned in SAL [2016-03-24T16:41:13Z] <ebernhardson@tin> Synchronized wmf-config/CirrusSearch-labs.php: T130219 CirrusSearch labs configured to use HTTPS connection pool (duration: 00m 30s)

Gehel added a comment. Mar 24 2016, 5:04 PM

The HTTP pool is now deployed on labs. As we are using client-side load balancing (a list of elasticsearch servers is configured on the MW side), the pool is used, but connections do not seem to be reused. I will try to configure a single host and see how that goes...

Change 279393 had a related patch set uploaded (by Gehel):
Configure a single elasticsearch server for CirrusSearch beta cluster

https://gerrit.wikimedia.org/r/279393

Change 279393 merged by Gehel:
Configure a single elasticsearch server for CirrusSearch beta cluster

https://gerrit.wikimedia.org/r/279393

Mentioned in SAL [2016-03-24T21:05:36Z] <gehel> deploying mediawiki-config: single elasticsearch server for CirrusSearch beta cluster (T130219)

Mentioned in SAL [2016-03-24T21:08:28Z] <gehel@tin> Synchronized wmf-config/LabsServices.php: T130219 Configure a single elasticsearch server for CirrusSearch beta cluster (duration: 00m 38s)

Gehel added a comment. Mar 24 2016, 9:18 PM

Labs is now configured to use only elastic05. Generating some traffic manually and checking connections (watch -n 1 -c 'sudo netstat -pn | grep 9243'), I do see 2 connections kept open for a fairly long time (>30 seconds). This looks good!

Upstream patches are:

Gehel added a comment. Mar 30 2016, 1:24 PM

It seems that we are still using HHVM 3.6.5 on deployment-prep, so curl_init_pooled should not be available.

gehel@deployment-mediawiki03:~$ hhvm --version
HipHop VM 3.6.5 (rel)
Compiler: 1454064375_587847189
Repo schema: c1d1b6a039457472a47a9b28a9307d37703525c0
Extension API: 20150212

I do not understand why I was seeing long-running connections before.

Gehel added a comment. Edited Mar 30 2016, 1:45 PM

Testing on deployment-mediawiki01, I see much better timings with pooling than without (not really a surprise). mwrepl was used again, with the following one-liner:

$time = 0; for ( $i = 0; $i < 100; ++$i) { $ch = curl_init_pooled("cirrus-eqiad", "https://deployment-elastic08.deployment-prep.eqiad.wmflabs:9243/"); curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); curl_exec($ch); $time += curl_getinfo($ch, CURLINFO_TOTAL_TIME);} echo $time/100;

Data so far:

  • HTTPS, no pool: 30-40 ms / call
  • HTTPS, with pool: 1-3 ms / call
  • HTTP, no pool: 4-5 ms / call
  • HTTP, with pool: < 1 ms / call

It looks like adding HTTPS and connection pooling might actually improve performance... (let's wait until we have more realistic data than this test).

Change 282359 had a related patch set uploaded (by Gehel):
CirrusSearch on Labs uses the full Elasticsearch cluster again

https://gerrit.wikimedia.org/r/282359

Restricted Application added a project: Discovery-Search. · Apr 8 2016, 1:30 PM

Change 282359 merged by Gehel:
CirrusSearch on Labs uses the full Elasticsearch cluster again

https://gerrit.wikimedia.org/r/282359

Deskana closed this task as Resolved. May 11 2016, 10:44 PM