Setup elasticsearch on cloudelastic100[1-4]
Closed, Resolved, Public

Description

Set up CirrusSearch on the cloudelastic servers.

Create a system group to give Discovery team members access.
Prepare role for cloudelastic servers (puppet)
Prepare profile and hiera config for cloudelastic servers (puppet)
Prepare discovery entries for cloudelastic
Install elastic on one of the new cloudelastic nodes to test and confirm
Install elastic on all cloudelastic nodes and confirm that all nodes belong to the cluster
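
The last checklist item (confirming that every node has joined the cluster) could be checked against the Elasticsearch cluster health API. A minimal sketch, assuming the JSON body of `GET /_cluster/health` has already been fetched and parsed. The helper name `all_nodes_joined` and the sample response are illustrative only and are not taken from the actual cluster.

```python
EXPECTED_NODES = 4  # cloudelastic1001 through cloudelastic1004


def all_nodes_joined(health: dict, expected: int = EXPECTED_NODES) -> bool:
    """Return True when the health response reports the expected node count."""
    return health.get("number_of_nodes") == expected


# Hypothetical response body, trimmed to the fields used here.
sample_health = {
    "cluster_name": "cloudelastic-example",
    "status": "green",
    "number_of_nodes": 4,
}
print(all_nodes_joined(sample_health))
```

A count below four after the final install step would suggest that a node failed to join and should be investigated before the task is closed.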

Details

Subject	Repo	Branch	Lines +/-
elasticsearch: split plugin into base and cirrus	operations/puppet	production	+18 -7
icinga: Ok when total shards is zero	operations/puppet	production	+2 -0
cloudelastic: use acme_chief to get ssl cert	operations/puppet	production	+18 -24
tlsproxy::localssl: split title and acme cert name	operations/puppet	production	+7 -2
acme_chief: generate cert for each cirrus clusters	operations/puppet	production	+20 -0
acme_chief: Issue a certificate for cloudelastic100[1-4].wm.o	operations/puppet	production	+10 -0
cloudelastic: allow elastic to bind to public ip	operations/puppet	production	+5 -0
cloudelastic: add missing monitoring clusters	operations/puppet	production	+4 -0
cloudelastic: Add cloudelastic configs	operations/puppet	production	+126 -1
elasticsearch: add profile for icinga checks	operations/puppet	production	+100 -46
elasticsearch: add profile for icinga checks	operations/puppet	production	+100 -46
elasticsearch: split plugin into base and cirrus	operations/puppet	production	+43 -32
elasticsearch: use standard resources for icinga checks	operations/puppet	production	+22 -42
use hostname instead of fqdn	operations/puppet	production	+53 -83
elasticsearch: refactor elastic icinga checks	operations/puppet	production	+83 -61
elasticsearch: move nagios check to profile	operations/puppet	production	+6 -6
relforge: switch relforge to use localssl	operations/puppet	production	+0 -2
cirrus: fallback option to use localssl via acme subject	operations/puppet	production	+33 -9
Revert "acme_chief: Issue a certificate for cloudelastic100[1-4].wm.o"	operations/puppet	production	+0 -10
acme_chief: Issue a certificate for cloudelastic100[1-4].wm.o	operations/puppet	production	+10 -0

Related Objects

Status	Subtype	Assigned	Task
			Unknown Object (Task)
Resolved		• Mathew.onipe	T194186 rack/setup/install cloudelastic100[1-4].eqiad.wmnet systems
Declined	Feature	None	T71489 Expose mwgrep functionality on-wiki
Resolved		None	T109715 Replicate production elasticsearch indices to labs
Resolved		• Mathew.onipe	T214921 Setup elasticsearch on cloudelastic100[1-4]
Resolved		debt	T214922 Create cloudelastic-root group

Event Timeline

• Mathew.onipe triaged this task as Medium priority. Jan 29 2019, 2:34 PM

• Mathew.onipe created this task.

• Mathew.onipe updated the task description.

• Mathew.onipe updated the task description. Jan 29 2019, 3:38 PM

TJones edited projects, added Discovery-Search; removed Discovery-Search (Current work). Jan 29 2019, 9:56 PM

TJones moved this task from needs triage to Ops / SRE on the Discovery-Search board.

RobH unsubscribed. Jan 29 2019, 9:58 PM

• Mathew.onipe updated the task description. Jan 30 2019, 8:28 AM

• Mathew.onipe edited projects, added Discovery-Search (Current work); removed Discovery-Search.

• Mathew.onipe moved this task from Incoming to not in use - please delete on the Discovery-Search (Current work) board.

Change 487129 had a related patch set uploaded (by Mathew.onipe; owner: Mathew.onipe):
[operations/puppet@production] cloudelastic: Add cloudelastic configs

https://gerrit.wikimedia.org/r/487129

gerritbot added a project: Patch-For-Review. Jan 30 2019, 11:52 AM

debt moved this task from not in use - please delete to Incoming on the Discovery-Search (Current work) board. Jan 31 2019, 10:54 PM

bd808 added a parent task: T109715: Replicate production elasticsearch indices to labs. Feb 10 2019, 8:52 PM

Framawiki subscribed. Feb 10 2019, 11:40 PM

debt closed subtask T214922: Create cloudelastic-root group as Resolved. Feb 15 2019, 7:00 PM

• Mathew.onipe updated the task description. Feb 19 2019, 10:38 AM

• Mathew.onipe removed a project: Patch-For-Review. Feb 19 2019, 6:33 PM

Looking at this, it would be nice to know what we should enable for cloudelastic and what we should not. That would at least help move cloudelastic forward.
@dcausse suggested we don't need to read from the kafka_msearch_daemon topic: https://gerrit.wikimedia.org/r/c/operations/puppet/+/487129/3/hieradata/role/common/elasticsearch/cloudelastic.yaml#61
@EBernhardson, what do you think? You could also review https://gerrit.wikimedia.org/r/487129 :)

Also, I think we might have to add $DOMAIN to the ferm range (https://gerrit.wikimedia.org/r/c/operations/puppet/+/487129/3/hieradata/role/common/elasticsearch/cloudelastic.yaml#11) to allow access to Kafka etc. Please correct me if I'm wrong.

Added a review. If opening up kafka access is problematic there is no hard requirement to read kafka on these machines. It might be nice to offer the updates that come through kafka, but without them the service will still provide 99% of the data we use in prod search.

If opening up kafka access is problematic

It's going to be problematic! :)

Oh, maybe it isn't...If these are nodes in production networks then it could be fine.

In T214921#4965991, @Ottomata wrote:

Oh, maybe it isn't...If these are nodes in production networks then it could be fine.

These nodes should be, if I understand it right, living in the same space as the cloud mariadb replicas. The servers live in the production network and have a port opened up to the cloud network somehow.

Ah, you're still going to have problems then. In T207321: Figure out networking details for new cloud-analytics-eqiad Hadoop/Presto cluster we were told that the network hole for the replicas was bad and they wanted to get rid of it. We are still working on finding the right solution.

The servers live in the production network and have a port opened up to the cloud network somehow.

Actually, this is exactly what we were told we weren't allowed to do with Presto. There should be no holes between Cloud and Prod. The only way for them to talk to each other is via public networks. :/

In T214921#4966002, @EBernhardson wrote:

In T214921#4965991, @Ottomata wrote:

Oh, maybe it isn't...If these are nodes in production networks then it could be fine.

These nodes should be, if I understand it right, living in the same space as the cloud mariadb replicas. The servers live in the production network and have a port opened up to the cloud network somehow.

The Wiki Replica databases (labsdb10{09,10,11}) live in the deprecated labs-support* VLAN. These new cloudelastic100[1-4] servers are in the public* VLAN. We should be able to use ferm rules on the servers themselves to control which source IPs are allowed to connect to whatever reverse proxy layer ends up in front of the actual Elasticsearch service. The reverse proxy will likely exist only to give us a way to force authentication to the service which in turn will allow us to cut off access to any accidental (or malicious) bad actors. The public* VLAN is a place where things can be opened up to the entire Internet, but not necessarily a place where things must be opened up to the entire Internet.

My understanding of the current network topology is incomplete, but I believe that the security model is such that things in both the private* VLAN and the cloud-hosts* VLAN can access resources in the public* VLAN, but that the reverse is not typically allowed. This allows for data flows where a host in the private* VLAN pushes data to a host in the public* VLAN and clients in the cloud-hosts* VLAN access that data via a service on the public* VLAN host. This is broadly the model that the dumps.wikimedia.org servers use.

I was under the impression we would do the same as relforge (T142211), which is in prod and accessible from labs. Looking into it more closely, that is different from what we need here. The relforge instances are not part of the private prod networks; the only prod hosts that can talk to them are the mwmaint hosts. The new instances will need to be accessible from job runners, and probably fall under the same problem ottomata points out.

In T214921#4967027, @bd808 wrote:

In T214921#4966002, @EBernhardson wrote:

In T214921#4965991, @Ottomata wrote:

Oh, maybe it isn't...If these are nodes in production networks then it could be fine.

These nodes should be, if I understand it right, living in the same space as the cloud mariadb replicas. The servers live in the production network and have a port opened up to the cloud network somehow.

The Wiki Replica databases (labsdb10{09,10,11}) live in the deprecated labs-support* VLAN. These new cloudelastic100[1-4] servers are in the public* VLAN. We should be able to use ferm rules on the servers themselves to control which source IPs are allowed to connect to whatever reverse proxy layer ends up in front of the actual Elasticsearch service. The reverse proxy will likely exist only to give us a way to force authentication to the service which in turn will allow us to cut off access to any accidental (or malicious) bad actors. The public* VLAN is a place where things can be opened up to the entire Internet, but not necessarily a place where things must be opened up to the entire Internet.

My understanding of the current network topology is incomplete, but I believe that the security model is such that things in both the private* VLAN and the cloud-hosts* VLAN can access resources in the public* VLAN, but that the reverse is not typically allowed. This allows for data flows where a host in the private* VLAN pushes data to a host in the public* VLAN and clients in the cloud-hosts* VLAN access that data via a service on the public* VLAN host. This is broadly the model that the dumps.wikimedia.org servers use.

This sounds like everything should "just work" in the public vlan, in terms of being able to open tcp connections. I suppose we will just wait and find out then. The primary connections we need are from the mw job runners to cloudelastic, to send updates as they occur.

Actually, looking back in otto's task, this was said by one of our network engineers:

In T207321#4882980, @ayounsi wrote:

I think there is a distinction to make here when saying "prod", as it's made of several vlans/networks, especially:

public: hosts have public IPs and are reachable from the Internet (including Cloud), protected by iptables

private: hosts have private IPs, only reachable from private and prod public, protected by firewalls and iptables

analytics: similar to private, with firewalls limiting outbound flows as well (data leak, etc)

SRE won't be opening up any networking holes between Cloud and Prod.

From Cloud to public prod, everything is already open, it's from Cloud to private prod that traffic shouldn't be permitted.

There will be no private data at all in this cluster.
All we need is that this Presto (JDBC) endpoint is accessible via Cloud VPS

Because of that, it looks like the public vlan is a good candidate, using iptables to make sure the endpoint is only exposed to Cloud IP ranges.
That doesn't remove the need for proper secure authentication, but reduces considerably the attack surface.
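
The source-range restriction described in the quote above can be illustrated with a small allow-list check. This is only a sketch of the idea: the real filtering would be done by iptables/ferm rules on the hosts, and the networks below are reserved documentation ranges standing in for the Cloud IP ranges, which this thread does not list.

```python
import ipaddress

# Placeholder networks (RFC 5737 documentation ranges), NOT the real Cloud ranges.
ALLOWED_SOURCES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_allowed(source_ip: str) -> bool:
    """Return True when source_ip falls inside one of the allowed networks."""
    addr = ipaddress.ip_address(source_ip)
    return any(addr in net for net in ALLOWED_SOURCES)


print(is_allowed("192.0.2.17"))   # inside the first placeholder range
print(is_allowed("203.0.113.5"))  # outside both placeholder ranges
```

As the quote notes, a check like this narrows the attack surface but does not replace authentication on the endpoint itself.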

We have:
private: mw job runners
public: cloudelastic
cloudvirts (same as internet traffic, but known ip range)

cloud virts -> cloudelastic : open
mw job runners -> cloudelastic : closed
cloudelastic -> mw job runners: open

Unfortunately this is the exact opposite of what we need wrt cloudelastic <-> mw job runners.
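
The mismatch can be made explicit by comparing the flows listed above with the flows the service needs. A minimal sketch: the open/closed values simply restate this comment's reading of the network rules (not an authoritative firewall policy), and the names `OPEN_FLOWS`, `REQUIRED_FLOWS` and `blocked_flows` are made up for illustration.

```python
# (source, destination) -> whether a connection can be opened, per the comment above.
OPEN_FLOWS = {
    ("cloudvirts", "cloudelastic"): True,
    ("mw_job_runners", "cloudelastic"): False,
    ("cloudelastic", "mw_job_runners"): True,
}

# Flows the service needs: job runners push updates, cloud clients run queries.
REQUIRED_FLOWS = [
    ("mw_job_runners", "cloudelastic"),
    ("cloudvirts", "cloudelastic"),
]


def blocked_flows(required, open_flows):
    """Return the required flows that are not open (unknown flows count as closed)."""
    return [flow for flow in required if not open_flows.get(flow, False)]


print(blocked_flows(REQUIRED_FLOWS, OPEN_FLOWS))
```

Under this reading, the only blocked requirement is the write path from the job runners to cloudelastic.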

I don't know how we would make it happen, but one potential option could be having separate data and query clusters. The data cluster could live in the private vlan and accept writes from the mw job runners. The query cluster could live in the public vlan and would be able to open connections to the data nodes. To avoid the wonkery of trying to put nodes that can only open a tcp connection in one direction into the same cluster, the query nodes would have to be their own independent cluster that uses elasticsearch cross-cluster search to reach out to the data nodes.

This is roundabout and somewhat complicated, but could plausibly work. A major sticking point is where to get nodes to form the "query cluster". These would be very lightweight instances, but afaik there is no way to spin up a vm (Ganeti or whatever) in the public vlan?
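
For reference, cross-cluster search addresses a remote index as `<remote alias>:<index>` in the request path. A minimal sketch of how a query node would address the data cluster, assuming the data cluster has been registered on the query cluster under the alias `datacluster`; that alias and the index name are hypothetical, and no request is actually sent here.

```python
def cross_cluster_search_path(remote_alias: str, index: str) -> str:
    """Build the search path for an index that lives on a registered remote cluster."""
    return f"/{remote_alias}:{index}/_search"


# Hypothetical alias and index name, for illustration only.
print(cross_cluster_search_path("datacluster", "enwiki_content"))
```

The query cluster opens the connections to the remote cluster, which matches the one-directional connectivity described above.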

Sorry for making everything confusing here. Let's run with the assumption for now that the job runners can talk to cloudelastic, and if not, deal with that then.

mw job runners -> cloudelastic : closed

We have the same problem with updating data in a Presto cluster in the public VLAN. The easiest thing for us to do would be to allow Analytics Hadoop -> Public Presto, but we were told not to do that. We're going to have to generate a static dump file and put it somewhere publicly accessible (likely dumps.wm.org), and then download and load that dump into Public Presto.

Change 492966 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] acme_chief: Issue a certificate for cloudelastic100[1-4].wm.o

https://gerrit.wikimedia.org/r/492966

gerritbot added a project: Patch-For-Review. Feb 26 2019, 8:23 AM

Change 492966 merged by Vgutierrez:
[operations/puppet@production] acme_chief: Issue a certificate for cloudelastic100[1-4].wm.o

https://gerrit.wikimedia.org/r/492966

Vgutierrez subscribed. Feb 26 2019, 8:35 AM

This comment was removed by Vgutierrez.

Change 492974 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] Revert "acme_chief: Issue a certificate for cloudelastic100[1-4].wm.o"

https://gerrit.wikimedia.org/r/492974

Change 492974 merged by Vgutierrez:
[operations/puppet@production] Revert "acme_chief: Issue a certificate for cloudelastic100[1-4].wm.o"

https://gerrit.wikimedia.org/r/492974

Change 493048 had a related patch set uploaded (by Mathew.onipe; owner: Mathew.onipe):
[operations/puppet@production] cirrus: fallback option to localssl using acme subject

https://gerrit.wikimedia.org/r/493048

Change 493063 had a related patch set uploaded (by Mathew.onipe; owner: Mathew.onipe):
[operations/puppet@production] relforge: switch relforge to use localssl

https://gerrit.wikimedia.org/r/493063

Change 493048 merged by Volans:
[operations/puppet@production] cirrus: fallback option to use localssl via acme subject

https://gerrit.wikimedia.org/r/493048

Change 493063 merged by Volans:
[operations/puppet@production] relforge: switch relforge to use localssl

https://gerrit.wikimedia.org/r/493063

Change 494471 had a related patch set uploaded (by Mathew.onipe; owner: Mathew.onipe):
[operations/puppet@production] elasticsearch: move nagios check to profile

https://gerrit.wikimedia.org/r/494471

Change 494499 had a related patch set uploaded (by Mathew.onipe; owner: Mathew.onipe):
[operations/puppet@production] elasticsearch: refactor icinga checks

https://gerrit.wikimedia.org/r/494499

Change 494471 merged by Dzahn:
[operations/puppet@production] elasticsearch: move nagios check to profile

https://gerrit.wikimedia.org/r/494471

Change 494499 merged by Gehel:
[operations/puppet@production] elasticsearch: refactor elastic icinga checks

https://gerrit.wikimedia.org/r/494499

Dzahn unsubscribed. Mar 13 2019, 11:46 AM

Change 496164 had a related patch set uploaded (by Mathew.onipe; owner: Mathew.onipe):
[operations/puppet@production] use hostname instead of fqdn

https://gerrit.wikimedia.org/r/496164

Change 496164 merged by Gehel:
[operations/puppet@production] use hostname instead of fqdn

https://gerrit.wikimedia.org/r/496164

Change 496782 had a related patch set uploaded (by Mathew.onipe; owner: Mathew.onipe):
[operations/puppet@production] elasticsearch: add profile for icinga checks

https://gerrit.wikimedia.org/r/496782

• GTirloni edited projects, added cloud-services-team (Kanban); removed cloud-services-team. Mar 24 2019, 10:16 AM

Change 499511 had a related patch set uploaded (by Mathew.onipe; owner: Mathew.onipe):
[operations/puppet@production] elasticsearch: use standard resources for icinga checks

https://gerrit.wikimedia.org/r/499511

Change 499511 merged by Gehel:
[operations/puppet@production] elasticsearch: use standard resources for icinga checks

https://gerrit.wikimedia.org/r/499511

Change 499790 had a related patch set uploaded (by Mathew.onipe; owner: Mathew.onipe):
[operations/puppet@production] elasticsearch: split plugin into base and cirrus

https://gerrit.wikimedia.org/r/499790

Change 499790 merged by Gehel:
[operations/puppet@production] elasticsearch: split plugin into base and cirrus

https://gerrit.wikimedia.org/r/499790

Change 496782 merged by Gehel:
[operations/puppet@production] elasticsearch: add profile for icinga checks

https://gerrit.wikimedia.org/r/496782

Change 500525 had a related patch set uploaded (by Mathew.onipe; owner: Mathew.onipe):
[operations/puppet@production] elasticsearch: add profile for icinga checks

https://gerrit.wikimedia.org/r/500525

Change 500525 merged by Gehel:
[operations/puppet@production] elasticsearch: add profile for icinga checks

https://gerrit.wikimedia.org/r/500525

Change 487129 merged by Gehel:
[operations/puppet@production] cloudelastic: Add cloudelastic configs

https://gerrit.wikimedia.org/r/487129

Change 500742 had a related patch set uploaded (by Volans; owner: Volans):
[operations/puppet@production] cloudelastic: add missing monitoring clusters

https://gerrit.wikimedia.org/r/500742

Change 500742 merged by Volans:
[operations/puppet@production] cloudelastic: add missing monitoring clusters

https://gerrit.wikimedia.org/r/500742

Change 500773 had a related patch set uploaded (by Mathew.onipe; owner: Mathew.onipe):
[operations/puppet@production] cloudelastic: allow elastic to bind to public ip

https://gerrit.wikimedia.org/r/500773

Change 500773 merged by Gehel:
[operations/puppet@production] cloudelastic: allow elastic to bind to public ip

https://gerrit.wikimedia.org/r/500773

Change 500940 had a related patch set uploaded (by Mathew.onipe; owner: Mathew.onipe):
[operations/puppet@production] acme_chief: Issue a certificate for cloudelastic100[1-4].wm.o

https://gerrit.wikimedia.org/r/500940

Change 500940 abandoned by Mathew.onipe:
acme_chief: Issue a certificate for cloudelastic100[1-4].wm.o

https://gerrit.wikimedia.org/r/500940

Change 500940 restored by Mathew.onipe:
acme_chief: Issue a certificate for cloudelastic100[1-4].wm.o

https://gerrit.wikimedia.org/r/500940

Change 500940 merged by Vgutierrez:
[operations/puppet@production] acme_chief: Issue a certificate for cloudelastic100[1-4].wm.o

https://gerrit.wikimedia.org/r/500940

Change 501158 had a related patch set uploaded (by Mathew.onipe; owner: Mathew.onipe):
[operations/puppet@production] cloudelastic: use acme_chief to get ssl cert

https://gerrit.wikimedia.org/r/501158

Change 501174 had a related patch set uploaded (by Mathew.onipe; owner: Mathew.onipe):
[operations/puppet@production] acme_chief: generate cert for each cirrus clusters

https://gerrit.wikimedia.org/r/501174

Change 501187 had a related patch set uploaded (by Mathew.onipe; owner: Mathew.onipe):
[operations/puppet@production] tlsproxy::localssl: split title and acme cert name

https://gerrit.wikimedia.org/r/501187

Change 501174 abandoned by Mathew.onipe:
acme_chief: generate cert for each cirrus clusters

Reason:
for this: https://gerrit.wikimedia.org/r/c/operations/puppet/+/501187

https://gerrit.wikimedia.org/r/501174

Change 501187 merged by Vgutierrez:
[operations/puppet@production] tlsproxy::localssl: split title and acme cert name

https://gerrit.wikimedia.org/r/501187

Change 501158 merged by Vgutierrez:
[operations/puppet@production] cloudelastic: use acme_chief to get ssl cert

https://gerrit.wikimedia.org/r/501158

Change 501462 had a related patch set uploaded (by Mathew.onipe; owner: Mathew.onipe):
[operations/puppet@production] icinga: Ok when total shards is zero

https://gerrit.wikimedia.org/r/501462

Change 501462 merged by Gehel:
[operations/puppet@production] icinga: Ok when total shards is zero

https://gerrit.wikimedia.org/r/501462

Gehel mentioned this in T220205: Define constraints for cloudelastic use cases. Apr 5 2019, 2:08 PM

• Mathew.onipe moved this task from Incoming to Needs Reporting on the Discovery-Search (Current work) board. Apr 6 2019, 12:49 PM

debt closed this task as Resolved. Apr 15 2019, 6:03 PM

Change 499785 abandoned by Mathew.onipe:
elasticsearch: split plugin into base and cirrus

https://gerrit.wikimedia.org/r/499785