LVS for Druid
Closed, Resolved · Public · 8 Estimated Story Points

Description

In T176223, we decided that in order to use Druid as a backend for AQS (which will in turn be used as a backend for the new Wikistats 2.0 website), we need to make a new 'public' Druid cluster, separate from the existing 'analytics' Druid cluster.

In order to use Druid as a backend, we need LVS to load balance client queries to the Druid broker service, which runs on all Druid nodes. Yesterday, I tried to set this up in https://gerrit.wikimedia.org/r/#/c/378956/. This would enable LVS for the existing Druid analytics brokers. While we don't strictly need LVS for the internal analytics Druid cluster, it would be nice to have, and since we do need it for the soon-to-be-created 'public' Druid cluster, we might as well do it for the analytics one too.

Anyway, this failed and was reverted because (among other reasons) Druid lives in the Analytics VLAN. According to @BBlack, router LVS settings are only configured to work in production VLANs.

So, our options are (in order of our preference):

  • A. Configure routers so that LVS will work in Analytics VLAN.
  • B. Put the public Druid cluster in the production network.
  • C. Set up LVS servers inside of the Analytics VLAN.

If we can do A., then we can use LVS for both Druid clusters. One question, though: if we do LVS for the analytics Druid cluster in the Analytics VLAN, we still want to restrict incoming connections to the analytics Druid broker service to $ANALYTICS_NETWORK hosts only. Will the existing ferm rules on the Druid boxes that already do this be enough if the connection comes in via LVS? We believe so, since the Druid boxes should see the source IP of the client, not of the LVS hosts. Just double-checking with those of you who have more LVS knowledge.
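
For illustration only, here is a minimal ipvsadm sketch of the kind of setup we mean, assuming the usual LVS direct-routing (DR) mode; the service IP and backend name are placeholders and this is not the actual pybal configuration. The point is that with DR the Druid box receives packets that still carry the client's source IP, so saddr-based ferm rules keep matching:

  # hypothetical sketch, not the real LVS/pybal config
  ipvsadm -A -t $SERVICE_IP:8082 -s wrr                   # virtual service on the broker port
  ipvsadm -a -t $SERVICE_IP:8082 -r $DRUID_HOST:8082 -g
  # -g (gatewaying / direct routing): packets are forwarded unmodified, so the
  # realserver sees the original client source IP, not the LVS host's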

If we can't do A., we will do B., but we will then have to make special firewall rules (both ferm and network ACLs) to allow connections between Hadoop and the public Druid cluster in the production network.

So! Can we do A.? If so, can we do it soon? :)

Event Timeline

We've had part of this discussion in the #wikimedia-netops IRC channel. I can post a backlog (we don't yet have a bot archiving that channel), but a first (and partial) consensus seems to be to go for B. The premise is that a service meant to be public and used by end users should not be hosted in the analytics cluster but rather in the production cluster, with thorium currently being an implicit exception to that "rule". The entire "rule" should be discussed more thoroughly at some point in the future (I don't think it makes sense to sidetrack the Druid task for the larger discussion).

AFAICT, and based on the above, A doesn't really make that much sense, since production cluster LVS servers should not be routing to services in the analytics cluster. C could possibly be implemented for internal use by the analytics cluster, but should not be used to power services crossing the production-analytics boundary. I cannot judge how useful that would be, but if there is little to be gained from it, maybe postpone it?

FWIW, I am not using the terminology "analytics VLAN" but rather "analytics cluster", as there are in reality four VLANs (one per eqiad rack row).

Option B does indeed seem the best course of action, even though the caveat is that all the Hadoop nodes' ferm rules will need to be updated to allow the new public cluster (outside of the analytics cluster boundaries) to communicate with them.
The second use case, namely having a load-balancing solution in front of the private Druid cluster, can surely wait; in my opinion it is not pressing enough to justify the effort for C.
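
As a rough sketch of that caveat, the Hadoop nodes' ferm config would need an extra allowance along these lines (plain ferm shown for illustration, not the actual puppet-managed rules; 8020 is just the standard HDFS NameNode RPC port used as an example, and the addresses are the ones the druid hosts receive later in this task):

  # hypothetical ferm snippet for a Hadoop node: also accept the new public Druid hosts
  @def $DRUID_PUBLIC_HOSTS = (10.64.0.35 10.64.16.172 10.64.48.171);
  chain INPUT {
      proto tcp dport 8020 saddr ($DRUID_PUBLIC_HOSTS) ACCEPT;
  }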

@akosiaris, what are the steps to take to move druid100[456] outside the analytics VLAN(s)? From what I know, we'd need to:

  1. Properly remove the hosts from service (already done by Andrew; they now have role spare)
  2. Assign new IPs to druid100[456] via operations/dns (a rough sketch follows at the end of this comment)
  3. Configure the switch ports so that the Analytics VLAN ID is no longer applied
  4. Reimage the hosts

Is the above right, or am I missing something?

Thanks!
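
For step 2, a rough sketch of what the operations/dns change amounts to (the record layout and TTL are illustrative; the addresses are the ones assigned further down in this task, and the matching reverse PTR records move as well):

  ; forward records in the eqiad.wmnet zone (illustrative)
  druid1004    1H    IN A    10.64.0.35
  druid1005    1H    IN A    10.64.16.172
  druid1006    1H    IN A    10.64.48.171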

One addition:

  5. Remove druid100[456] from any router ACL entry (garbage collection).

I'll do 5; the rest all LGTM.

Change 383318 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/dns@master] Reassign IPs to druid100[456] to move them out of the Analytics VLAN

https://gerrit.wikimedia.org/r/383318

Step 5 is done (actually there was nothing to do: there are no router ACL entries for the druid hosts).

I am going to reuse the step-5 slot to add the new druid100[456] IPs to the analytics-in4 filter :)

Mentioned in SAL (#wikimedia-operations) [2017-10-10T10:33:02Z] <akosiaris> T177511 switch druid100[456] to private1-x-eqiad VLANs

Change 383318 merged by Elukey:
[operations/dns@master] Reassign IPs to druid100[456] to move them out of the Analytics VLAN

https://gerrit.wikimedia.org/r/383318

Script wmf-auto-reimage was launched by elukey on neodymium.eqiad.wmnet for hosts:

['druid1004.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201710101118_elukey_19905.log.

Completed auto-reimage of hosts:

['druid1004.eqiad.wmnet']

Of which those FAILED:

['druid1004.eqiad.wmnet']

Script wmf-auto-reimage was launched by elukey on neodymium.eqiad.wmnet for hosts:

['druid1004.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201710101159_elukey_14560.log.

Completed auto-reimage of hosts:

['druid1004.eqiad.wmnet']

Of which those FAILED:

['druid1004.eqiad.wmnet']

Script wmf-auto-reimage was launched by elukey on neodymium.eqiad.wmnet for hosts:

['druid1004.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201710101244_elukey_14320.log.

Completed auto-reimage of hosts:

['druid1004.eqiad.wmnet']

Of which those FAILED:

['druid1004.eqiad.wmnet']

Script wmf-auto-reimage was launched by elukey on neodymium.eqiad.wmnet for hosts:

['druid1004.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201710101326_elukey_31466.log.

Completed auto-reimage of hosts:

['druid1004.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by elukey on neodymium.eqiad.wmnet for hosts:

['druid1005.eqiad.wmnet', 'druid1006.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201710101405_elukey_8347.log.

Updated cr1/cr2 eqiad with the following:

elukey@re0.cr2-eqiad> show system rollback compare 1 0
[edit firewall family inet filter analytics-in4]
       term default { ... }
+      term druid {
+          from {
+              destination-address {
+                  10.64.0.35/32;
+                  10.64.16.172/32;
+                  10.64.48.171/32;
+              }
+              protocol tcp;
+              port [ 8090 8082 ];
+          }
+          then accept;
+      }

Mentioned in SAL (#wikimedia-operations) [2017-10-10T14:22:26Z] <elukey> add druid public cluster's IPs to analytics-in4 on cr1/cr2 - T177511

Also added the following:

elukey@re0.cr2-eqiad# show | compare
[edit firewall family inet filter analytics-in4]
      term puppet { ... }
!      term druid { ... }
[edit firewall family inet filter analytics-in4 term druid from destination-address]
+        /* druid1004 */
         10.64.0.35/32 { ... }
[edit firewall family inet filter analytics-in4 term druid from destination-address]
+        /* druid1005 */
         10.64.16.172/32 { ... }
[edit firewall family inet filter analytics-in4 term druid from destination-address]
+        /* druid1006 */
         10.64.48.171/32 { ... }

I forgot to insert the new term before the default one that rejects everything, so the new rules were not applied. I also added annotations to the IPs.
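
The exact commands are not shown above, but in Junos configuration mode the fix looks roughly like this sketch (reordering the new term ahead of the catch-all default term and annotating the addresses):

  [edit firewall family inet filter analytics-in4]
  elukey@re0.cr2-eqiad# insert term druid before term default
  [edit firewall family inet filter analytics-in4 term druid from destination-address]
  elukey@re0.cr2-eqiad# annotate 10.64.0.35/32 "druid1004"
  elukey@re0.cr2-eqiad# annotate 10.64.16.172/32 "druid1005"
  elukey@re0.cr2-eqiad# annotate 10.64.48.171/32 "druid1006"
  elukey@re0.cr2-eqiad# commit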