
Look into encrypting Elasticsearch traffic
Closed, ResolvedPublic

Description

Search traffic is currently sent over plain HTTP. Elasticsearch doesn't provide encryption by default, but there are some options in the plugin space for it (notably: https://github.com/floragunncom/search-guard).

Right now we don't really do much cross-DC traffic with Elasticsearch (by design), but we could in theory (failover, etc).

Should at least look into feasibility/need.

Event Timeline


@Joe @mark Could anyone clarify the concerns around using a per-server nginx proxy for SSL termination? It was mentioned in today's meeting that it's not optimal, but I'm not clear on the reasons.

Discussion indicates that a local nginx reverse proxy for SSL termination is a non-optimal solution. I do not understand why. It seems to me that deploying nginx locally would not require additional hardware (well, additional CPU to do encryption, but that's also true - even more so - with an elasticsearch plugin). In my former life, an nginx reverse proxy for SSL termination was the preferred solution, even for appservers that could support SSL termination.

I also had a look at search-guard (very short look) and it seems to me that this does a lot more than just encryption (authentication, authorization, and a few other security related features). I'm not very keen to add stuff into the elasticsearch instances if it's not absolutely required (KISS).

I'm probably missing some of the requirements, so the above analysis is most probably flawed. Can anyone point me to what I'm missing?

My personal take is that nginx is perfectly fine in general, given it is well able to handle our huge production traffic.

It might be a bit of overkill just to expose one (internal) encrypted socket to elasticsearch; I wouldn't suggest this if ES wasn't so demanding on resource availability.

So I would also consider "cheaper" (in terms of resources) alternatives like stunnel for this specific task.

In all, I'm not opposed to using nginx; I'd just check the resource usage on our varnishes and try to evaluate if we can tolerate that kind of resource usage on the ES boxes.

So using nginx (or similar) would obviously work for port 9200, since that's the HTTP/REST API. However, I don't believe we could use that to secure our transport on port 9300. AIUI, that's not HTTP, but docs are sparse; please correct me if I'm wrong.

Considering a request can end up hitting many nodes, and the question here is about encrypting already inside-WMF traffic, it would make sense to me to also make sure we're securing the inter-node transport.

(Although I'm probably being overly paranoid here)

9300 is the internal transport port, right? As these are independent clusters, we're not using that across the datacenter barrier, correct?

I'd say that our first priority is the cross-DC flows. Securing the intra-DC cluster is something that should ideally happen as well (not just encryption, but also authentication), but I'm guessing it won't be that easy and will likely need an ElasticSearch plugin (binary protocol, client support within ES as well).

That reminds me: nginx has been working fairly well for our frontend use-case (which is a huge plus), but on the other hand perhaps we don't need something that speaks HTTP here — a simple TLS terminator (like haproxy/stud/hitch) could also do the job just as easily and potentially with less complexity. Up to you :)

As far as I can see (grep in operations/puppet) we are using neither stud nor hitch. I'd prefer not to add a new dependency if what we have works well enough. So looking first into nginx / haproxy...

WRT elasticsearch's internal (non-GPL) product, all they do is enable TLS for the inter-node (port 9300) traffic and HTTPS on the main port (9200).

In my head, the biggest open question here is about the certs. I'm guessing we would use self-signed certs for internal usage? Paying an outside company for certs seems a bit odd in this case.

It's been a while, but my understanding is SSL certs are for a single domain; our elasticsearch servers receive production work on search.svc.{eqiad,codfw}.wmnet. Monitoring talks to the servers directly, at, for example, elastic1001.eqiad.wmnet. Do we need to install two certs per server, one for the LVS domain and one for their own domain? It's been too long since I dealt with SSL to say for sure....

If we go with self-signed internal certs, is there infrastructure in place for this already? Or do we need to install certs onto any server that may contact the search service? That would basically be all the mediawiki servers, the hadoop cluster, and the servers themselves (for monitoring). Anything else?

I could also be off base here... it's been some time since I dealt with SSL.

Some constraints about certs are exposed in T111654, it probably applies here as well.

Change 273254 had a related patch set uploaded (by Gehel):
Expose elasticsearch through HTTP

https://gerrit.wikimedia.org/r/273254

Change 274382 had a related patch set uploaded (by Gehel):
Factorized code exposing Puppet SSL certs

https://gerrit.wikimedia.org/r/274382

For the PHP end of things, whatever SSL certs are put together need to be provided to PHP. We use curl to talk to elasticsearch, and it will use the certs specified in the curl.cainfo php configuration key:

http://php.net/manual/en/curl.configuration.php#ini.curl.cainfo

Also, mostly for reference, switching user search traffic from eqiad to codfw is done by changing this line and deploying it:

https://github.com/wikimedia/operations-mediawiki-config/blob/master/wmf-config/InitialiseSettings.php#L15884-L15886

Change 'local' to 'codfw' and all search traffic will flow to codfw.

For the PHP end of things, whatever SSL certs are put together need to be provided to PHP. We use curl to talk to elasticsearch, and it will use the certs specified in the curl.cainfo php configuration key:

http://php.net/manual/en/curl.configuration.php#ini.curl.cainfo

This can also be set on a per-request basis via curl_setopt() and CURLOPT_CAPATH if for some reason we want it somewhere different or don't want to set it globally for all requests.
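For reference, a minimal sketch of that per-request variant (the service name/port and CA paths are assumptions; the paths are the canonical Debian locations mentioned further down in this task):

<?php
// Minimal sketch: per-request CA configuration instead of the global curl.cainfo
// ini setting. Host/port and CA paths are placeholders.
$ch = curl_init( 'https://search.svc.eqiad.wmnet:9243/' );
curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );
curl_setopt( $ch, CURLOPT_SSL_VERIFYPEER, true ); // curl's default, shown explicitly
curl_setopt( $ch, CURLOPT_CAINFO, '/etc/ssl/certs/ca-certificates.crt' ); // single CA bundle
// ...or point at a c_rehash'ed directory of CA certs instead:
// curl_setopt( $ch, CURLOPT_CAPATH, '/etc/ssl/certs' );
$body = curl_exec( $ch );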

It looks like the Elastica library has a 'curl' config option on the Connection object that holds an array of curl_setopt k=>v pairs, if we go that route.
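A hedged sketch of that route (the exact nesting of the 'curl' option differs between Elastica versions, and the host name is just the LVS service name used later in this task, so treat the shape below as illustrative rather than copy-paste ready):

<?php
// Illustrative only: pass curl_setopt key => value pairs through the connection
// configuration so every request to elasticsearch carries the CA settings.
$client = new \Elastica\Client( [
    'connections' => [ [
        'host'      => 'search.svc.eqiad.wmnet', // assumed LVS service name
        'port'      => 9243,
        'transport' => 'Https',
        'config'    => [
            'curl' => [
                CURLOPT_CAINFO => '/etc/ssl/certs/ca-certificates.crt',
            ],
        ],
    ] ],
] );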

Change 274711 had a related patch set uploaded (by Gehel):
Expose elasticsearch through HTTP

https://gerrit.wikimedia.org/r/274711

Discussion with @EBernhardson:

For the pool counter part I filed a task, T128761, and wrote a patch which is up now. I used (p75 + cross-DC latency) / p75 as the metric for how far to increase the worker count. I'm not 100% sure this is correct... but load testing codfw showed it has significantly more capacity than the eqiad cluster, so I'm not *too* worried about the amount of load this might cause.
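As a toy worked example of that metric (all numbers are made up to show the shape of the calculation, not measurements):

<?php
// Toy example of the (p75 + cross-DC latency) / p75 scaling metric; numbers are hypothetical.
$p75Ms     = 30.0; // hypothetical p75 of a cheap query type such as prefixsearch
$crossDcMs = 36.0; // hypothetical extra cross-DC round-trip time
$factor    = ( $p75Ms + $crossDcMs ) / $p75Ms; // => 2.2
echo 'scale the pool counter worker count by ~' . round( $factor, 1 ) . "x\n";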

For configuring CirrusSearch to use HTTPS connections and utilize a specific pem file, I have put together a patch in gerrit. The patch was just attached. It is a labs-only change so it can be merged at any time, but has to be synced out to prod for consistency. I will take care of that either tonight or tomorrow morning, since it looks like nginx is already deployed to deployment-elastic0?.deployment-prep.eqiad.wmflabs.

Change 274877 had a related patch set uploaded (by EBernhardson):
Use https to talk to elasticsearch in beta cluster

https://gerrit.wikimedia.org/r/274877

In terms of persistent connections, I'm not sure if we have anything in HHVM for doing persistent SSL connections. One option (unreviewed, untested) would be something like hhvm-ext-pcurl, an HHVM extension to expose a copy of the curl library that uses connection sharing between web requests.

For configuring CirrusSearch to use HTTPS connections and utilize a specific pem file, I have put together a patch in gerrit. The patch was just attached.

Thanks. Regarding the "specific pem file": we shouldn't, as this would be too error-prone. Do you know off-hand what it defaults to if we don't supply it? If it's "no verification at all", that's something bad that we should fix.

/etc/ssl/certs (for a c_rehashed CApath) or /etc/ssl/certs/ca-certificates.crt (for a CAfile) are the canonical paths containing the system-wide CA store.

For configuring CirrusSearch to use HTTPS connections and utilize a specific pem file, I have put together a patch in gerrit. The patch was just attached.

Thanks. Regarding the "specific pem file": we shouldn't, as this would be too error-prone. Do you know off-hand what it defaults to if we don't supply it? If it's "no verification at all", that's something bad that we should fix.

/etc/ssl/certs (for a c_rehashed CApath) or /etc/ssl/certs/ca-certificates.crt (for a CAfile) are the canonical paths containing the system-wide CA store.

curl verifies by default; it looks like it's probably checking /etc/ssl/certs, because I pushed the configuration to the beta cluster and it worked without needing to specify anything explicitly.

One downside we have happening right now is that we do not have persistent connections to the elasticsearch cluster between requests. On the beta cluster this has induced an additional ~25ms latency for setting up connections. Options to solve this include bringing in a PHP HTTP client implementation that uses raw PHP sockets, which do support persistent connections, or installing a plugin to HHVM that enables persistent curl (via pcurl_* methods). Generally I would feel better about adding PHP-level code than a new C extension, so I will see about writing up an Elastica transport using Zend_Http_Client and see if it works as expected.
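As a minimal sketch of the raw-sockets option (host, CA path, and timeout are placeholders; a real implementation would live inside a custom Elastica transport rather than at the top level like this):

<?php
// Sketch only: a TLS stream that can survive across web requests thanks to
// STREAM_CLIENT_PERSISTENT. Host, CA file and timeout are placeholders.
$context = stream_context_create( [
    'ssl' => [
        'verify_peer' => true,
        'cafile'      => '/etc/ssl/certs/ca-certificates.crt',
    ],
] );
$sock = stream_socket_client(
    'tls://search.svc.eqiad.wmnet:9243',
    $errno,
    $errstr,
    5, // connect timeout in seconds
    STREAM_CLIENT_CONNECT | STREAM_CLIENT_PERSISTENT, // reuse an existing connection when possible
    $context
);
if ( $sock === false ) {
    die( "connect failed: $errstr ($errno)\n" );
}
// A real transport would speak proper HTTP/1.1 with keep-alive over this stream.
fwrite( $sock, "GET / HTTP/1.1\r\nHost: search.svc.eqiad.wmnet\r\nConnection: keep-alive\r\n\r\n" );
echo fread( $sock, 4096 );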

The extra latency of SSL isn't exactly a blocker, but it is rather undesirable for requests like prefixsearch that have a p75 of ~30ms prior to the SSL patch.

You can check the cost per call with the following one-liner in mwrepl:

$time = 0; for ( $i = 0; $i < 100; ++$i) { $ch = curl_init("https://deployment-elastic08.deployment-prep.eqiad.wmflabs:9243/"); curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); curl_exec($ch); $time += curl_getinfo($ch, CURLINFO_TOTAL_TIME);} echo $time/100;

The idea was to reuse the same code to expose puppet certificates, which implies some refactoring of the k8s module. This seems to be blocked on T120159 (per @yuvipanda). So I'll take the k8s refactoring part out of this patch and keep it for later...

Change 276427 had a related patch set uploaded (by Gehel):
Make k8s module use base::expose_puppet_certs

https://gerrit.wikimedia.org/r/276427

Change 273254 abandoned by Gehel:
Expose elasticsearch through HTTP

Reason:
replaced by https://gerrit.wikimedia.org/r/#/c/274711/

https://gerrit.wikimedia.org/r/273254

Patch deployed on beta cluster (again). This time it does not include any dependency on k8s, so I should be able to push it to prod relatively soon.

Sounds good. In the weekly codfw meeting we agreed to swap elasticsearch traffic over to codfw next week. We didn't choose an exact day though; any preference?

I have a slight preference for Monday, Tuesday or Thursday. I'm having lunch with friends on Wednesday and Friday, so I'm going to be a little less available around lunch break. But in the end, any day is fine for me.

Looking at some tcpdump traces to see if there is something to improve. My SSL-fu is quite rusty, so I might be doing this completely wrong. Still...

  • there are 10 packets exchanged before application data (this seems like much more than is strictly necessary)
  • it seems that we do not use TLS session resumption (not surprising as we do not keep any state client side)
  • it seems that we do not use TLS false start (the option seems to have been added in a recent curl version, but is not yet available on our system)

In the end, looking into this in more detail is probably wasted time; the right solution to reduce SSL handshake impact is HTTP connection pooling...
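For what it's worth, curl's timing counters give a rough client-side breakdown of where that handshake time goes (host is a placeholder; this is a sketch, not something deployed):

<?php
// Sketch: split one HTTPS request's latency into TCP connect, TLS handshake and total.
$ch = curl_init( 'https://search.svc.eqiad.wmnet:9243/' );
curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );
curl_exec( $ch );
$tcp   = curl_getinfo( $ch, CURLINFO_CONNECT_TIME );    // time until TCP connect completed
$tls   = curl_getinfo( $ch, CURLINFO_APPCONNECT_TIME ); // time until TLS handshake completed
$total = curl_getinfo( $ch, CURLINFO_TOTAL_TIME );
printf( "tcp=%.1fms handshake=%.1fms total=%.1fms\n",
    $tcp * 1000, ( $tls - $tcp ) * 1000, $total * 1000 );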

I have a slight preference for Monday, Tuesday or Thursday. I'm having lunch with friends on Wednesday and Friday, so I'm going to be a little less available around lunch break. But in the end, any day is fine for me.

I've added a tentative date of Thursday the 17th under the datacenter switch page. Please confirm/amend, add a time as well, and let RelEng know :)

I've added a tentative date of Thursday the 17th under the datacenter switch page. Please confirm/amend, add a time as well, and let RelEng know :)

I confirm the date and added the time (1300 UTC). @dcausse and @EBernhardson will be available as well as myself.

How should I let RelEng know? Is a message in #wikimedia-releng sufficient?

Change 274382 merged by Gehel:
Factorized code exposing Puppet SSL certs

https://gerrit.wikimedia.org/r/274382

I'm probably missing something obvious, but there is something I do not understand about our hiera hierarchy as part of change 274711.

What I am trying to achieve:

  1. activate HTTPS for elastic search in both eqiad and codfw
  2. not activate it for logstash (it would not hurt, but it does not seem like we need it at the moment)
  3. define base::puppet::dns_alt_names for elasticsearch servers to 'search.svc.eqiad.wmnet' and 'search.svc.codfw.wmnet' respectively

How I'm trying to achieve this:

  1. Adding the property elasticsearch::https::ensure: 'present' in hieradata/role/common/elasticsearch/server.yaml
  2. Adding the property elasticsearch::https::ensure: 'absent' in hieradata/role/common/logstash/server.yaml
  3. Adding the property base::puppet::dns_alt_names: 'search.svc.[codfw|eqiad].wmnet' in hieradata/role/{codfw|eqiad}/elasticsearch/server.yaml respectively.

What makes me think it is not working:

The excellent puppet-compiler gives me a view of the change catalog, which tells me that:

  • elasticsearch::https is ensured 'present' for both elasticsearch and logstash nodes
  • dns_alt_names is not defined for any node

Does anyone have a pointer on what I'm missing?

It looks like this was figured out? I'm seeing http://puppet-compiler.wmflabs.org/2048/ which looks to have dns_alt_names set correctly and the logstash node shows no changes.

Strange ... I can't find what I would have changed to fix this. Maybe I just had the changes locally and forgot to push them? Well, it works...

Change 274711 merged by Gehel:
Expose elasticsearch through HTTP

https://gerrit.wikimedia.org/r/274711

Change 277329 had a related patch set uploaded (by Gehel):
Puppet SSL dir is not the same on Production or Labs

https://gerrit.wikimedia.org/r/277329

Change 277329 merged by Gehel:
Puppet SSL dir is not the same on Production or Labs

https://gerrit.wikimedia.org/r/277329

HTTPS is now active on all elasticsearch servers on port 9243. Still to do (non-exhaustive list):

  • re-generate SSL certificates to include the service name as a SAN entry (search.svc.[codfw|eqiad].wmnet)
  • configure LVS
  • enable HTTP connection pooling client-side (in MediaWiki / the CirrusSearch extension)
  • configure the client to actually use HTTPS

@Smalyshev Any thoughts on persistent HTTP connections from PHP? My plan right now is to evaluate zend-http, which reimplements the HTTP protocol in PHP. This allows it to use the PHP stream APIs, specifically STREAM_CLIENT_PERSISTENT. We would have to extend the Elastica AbstractTransport to utilize zend-http, but most of the transport implementations are ~200 lines, so it's probably not a big deal.

Are you perhaps aware of better HTTP client libraries that offer persistent connections that I should also be evaluating?

Reimplementing full HTTP is kind of a PITA, and I'm not sure how up to date Zend/Http is with all the new HTTP stuff. If it covers everything we need to talk to ES, then we can use it, but we'd have to import a significant chunk of ZF dependencies.

But I'd also look into cURL - it would reuse connections at least within the request; I'm not sure if it's possible to make it do the same across requests. cURL does support overriding socket creation via CURLOPT_OPENSOCKETFUNCTION, but I don't think the PHP binding supports it now.

Off the top of my head, I don't remember any library that implements persistent HTTP in PHP. Most implementations prefer to use cURL.

curl_init_pooled looks very interesting. Unfortunately it is new as of 3.9.0 and prod is on 3.6.5

If upgrading is hard, I imagine extracting just that part of the curl extension and backporting it may be possible. Depending on priorities we could look into it.

curl_init_pooled looks very interesting. Unfortunately it is new as of 3.9.0 and prod is on 3.6.5

Prod is 3.12 as of last week (cf. T119637) :)
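If we do end up on a version that ships it, usage would presumably look roughly like the sketch below; the pool name is made up, and how the named pool gets declared on the appservers is deliberately not shown, because that needs to be checked against the HHVM docs rather than guessed here.

<?php
// Heavily hedged sketch of HHVM's pooled curl handles, guarded so it degrades to a
// plain handle on runtimes without the feature. 'cirrus-elastic' is a made-up pool name.
$url = 'https://search.svc.eqiad.wmnet:9243/';
if ( function_exists( 'curl_init_pooled' ) ) {
    // Reuse an already-established (TCP + TLS) connection from the named pool.
    $ch = curl_init_pooled( 'cirrus-elastic', $url );
} else {
    // Fallback: a fresh handle, paying the connection setup cost on every request.
    $ch = curl_init( $url );
}
curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );
$body = curl_exec( $ch );
curl_close( $ch );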

Change 277956 had a related patch set uploaded (by Gehel):
Enabling HTTPS access to elasticsearch via LVS

https://gerrit.wikimedia.org/r/277956

Change 277956 merged by Gehel:
Enabling HTTPS access to elasticsearch via LVS.

https://gerrit.wikimedia.org/r/277956

LVS has been configured and activated for eqiad and codfw. Elasticsearch is available through HTTPS via the usual service names:

  • search.svc.eqiad.wmnet
  • search.svc.codfw.wmnet

LVS has been configured and activated for eqiad and codfw. Elasticsearch is available through HTTPS via the usual service names:

  • search.svc.eqiad.wmnet
  • search.svc.codfw.wmnet

Working for me! Awesomeeeee!!!!!

demon@tin ~$ curl -XGET https://search.svc.eqiad.wmnet:9243
{
  "status" : 200,
  "name" : "elastic1004",
  "cluster_name" : "production-search-eqiad",
  "version" : {
    "number" : "1.7.5",
    "build_hash" : "00f95f4ffca6de89d68b7ccaf80d148f1f70e4d4",
    "build_timestamp" : "2016-02-02T09:55:30Z",
    "build_snapshot" : false,
    "lucene_version" : "4.10.4"
  },
  "tagline" : "You Know, for Search"
}

Change 279064 had a related patch set uploaded (by EBernhardson):
Collect timing information for getting a pooled curl handle

https://gerrit.wikimedia.org/r/279064

I created a minimal gatling project (attached) to do some experiments. Results from a run with pools and HTTPS enabled are also attached. If you want to run it:

  1. untar the project
  2. mvn gatling:execute (or mvn package, which will run the test and create a .tar.gz with the results)

Prerequisite: maven should be available in your path...

Deskana subscribed.

Yay! Thanks for also adding this to this week's email update. :-)

Change 276427 abandoned by Gehel:
Make k8s module use base::expose_puppet_certs

Reason:
This needs to wait for a lot of refactoring to happen. As there is not enough intelligence here, let's drop it and re-implement if the time comes.

https://gerrit.wikimedia.org/r/276427