Maniphest T207321

Figure out networking details for new cloud-analytics-eqiad Hadoop/Presto cluster
Closed, Resolved, Public, 3 Estimated Story Points

Assigned To

Authored By

	Ottomata
	Oct 17 2018, 7:38 PM

Description

We're creating a second Hadoop cluster on which we will also install Hive and Presto. This cluster will host public datasets (no private data) like the mediawiki_history table (which itself is built from the replicated tables in labsdb). This cluster is the 'big data'/OLAP version of labsdb.

This cluster needs to be queryable (on restricted ports TBD) from Cloud VPS networks, and it also needs to be accessible by nodes in the Analytics VLAN. Datasets like mediawiki_history will be computed in the existent Analytics Hadoop cluster and pushed over to the cloud-analytics Hadoop cluster for querying from Cloud VPS.

@faidon mentioned that we need to discuss to figure out where these nodes should live, networking-wise. There are two tickets to set them up, 5 are bare metal, the other 3 are ganeti instances: T207194: rack/setup/install cloudvirtan100[1-5].eqiad.wmnet and T207205: Set up 3 Ganeti VMs for datalake cloud analytics Hadoop cluster.

Can we use the same networking model we use for labsdb hosts, or do we need to do something different/better?

Related Objects

Status	Assigned	Task
Open	None	T204950 Public Edit Data Lake: Mediawiki history snapshots available in SQL data store to cloud (labs) users
Declined	Ottomata	T204951 Presto cluster online and usable with test data pushed from analytics prod infrastructure accessible by Cloud (labs) users
Resolved	Ottomata	T207321 Figure out networking details for new cloud-analytics-eqiad Hadoop/Presto cluster

Event Timeline

Ottomata triaged this task as High priority.Oct 17 2018, 7:38 PM

Ottomata created this task.

Ottomata updated the task description. (Show Details)

Krenair subscribed.Oct 17 2018, 10:18 PM

Most of the cloud infrastructure hosts either are in the public vlan or are moving there as we update and replace hardware. The labsdb10(09,10,11) hosts are in the labs-support vlan that we are incrementally deprecating. See https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Network_and_Policy for more details.

• fdans added a project: Analytics-Kanban.Oct 18 2018, 4:57 PM

• fdans moved this task from Incoming to Smart Tools for Better Data on the Analytics board.

ayounsi claimed this task.Oct 19 2018, 7:03 AM

We currently have the following relevant vlans:

private: 10/8 IPs, not reachable from cloud instances and the Internet
public: public IPs, reachable from cloud, the Internet and private vlans
analytics: similar to private, but with extra firewall rules for traffic exiting the vlan (to protect PII data)
labs-support: similar to private but reachable from cloud instances, being deprecated as said by bd808

Cloud instances should be considered as the public Internet, and based on the task description, no PII will be stored on those servers, so to me the public vlan is where those servers should reside.

ayounsi reassigned this task from ayounsi to Ottomata.Oct 22 2018, 7:20 PM

How many servers are we talking about both right now, as well as in the mid-term e.g. in the next year or two?

How will data flow into this cluster as well as between the cluster members? Can you describe a little bit the design you're thinking of? That may help drive our decisions in terms of network design :)

Thanks in advance!

Right now, 8: 3 ganeti instances and 5 bare metal worker nodes. We wouldn't be adding more nodes within this FY, but depending on data size and demand, we might add more next FY. If we did, I'd expect +3 more worker nodes. But, 5 might be enough for quite a while too.

The data that will be loaded into this cluster will be computed in the existent analytics-hadoop cluster. It will then be loaded into this new cloud-analytics Hadoop cluster, and served mostly via Hive and Presto. We'd like to hook into tools that Cloud VPS users are used to, like Quarry. Quarry could then be used to query the Hive/Presto databases in the new cloud-analytics Hadoop cluster.

For performance reasons, we don't want to open up any ports to the public internet. If this does go into the 'public' VLAN, could we restrict access to these nodes using some simple ferm rules? E.g. deny all but private, analytics and Cloud VPS networks?

Where are the labsdb hosts going to live if they are being moved out of the labs-support VLAN? Likely these cloud-analytics nodes should live wherever labsdb lives.

If this does go into the 'public' VLAN, could we restrict access to these nodes using some simple ferm rules? E.g. deny all but private, analytics and Cloud VPS networks?

yes, we can auto-generate ferm rules based on network ranges defined in Puppet's constants.
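
To make the "deny all but private, analytics and Cloud VPS networks" idea concrete, here is a minimal sketch in Python rather than ferm syntax. It only illustrates the allow-list logic that generated rules would encode. The network ranges are assumptions: 10.0.0.0/8 is the "10/8" private space mentioned above, while the analytics and Cloud VPS ranges are made-up placeholders; the real values would come from the network constants in Puppet.

```python
import ipaddress
from typing import Optional

# Hypothetical allow-list. Only 10.0.0.0/8 ("10/8") appears in this thread;
# the analytics and Cloud VPS ranges below are placeholders, not real ranges.
# More specific networks are listed first so they win the lookup.
ALLOWED_NETWORKS = {
    "analytics": ipaddress.ip_network("10.200.0.0/16"),    # placeholder
    "private": ipaddress.ip_network("10.0.0.0/8"),
    "cloud-vps": ipaddress.ip_network("198.51.100.0/24"),  # placeholder
}


def matching_network(source_ip: str) -> Optional[str]:
    """Return the name of the first allowed network containing source_ip."""
    address = ipaddress.ip_address(source_ip)
    for name, network in ALLOWED_NETWORKS.items():
        if address in network:
            return name
    return None


def is_allowed(source_ip: str) -> bool:
    """Default deny: accept only sources inside an allowed network."""
    return matching_network(source_ip) is not None


for ip in ("10.1.2.3", "10.200.4.5", "198.51.100.7", "203.0.113.9"):
    print(ip, "ACCEPT" if is_allowed(ip) else "DROP")
```

The same default-deny shape applies whether the filtering is done with ferm, plain iptables, or router ACLs; only the source of the network list changes.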

Where are the labsdb hosts going to live if they are being moved out of the labs-support VLAN?

From https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Network_and_Policy#labsdb100[4567]
"These hosts are out of warranty and are being replaced by virtual machines inside of the tenant network with dedicated cloud[lab]virt hypervisors. This work is being tracked in T193264"

In T207321#4687651, @ayounsi wrote:

Where are the labsdb hosts going to live if they are being moved out of the labs-support VLAN?

From https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Network_and_Policy#labsdb100[4567]
"These hosts are out of warranty and are being replaced by virtual machines inside of the tenant network with dedicated cloud[lab]virt hypervisors. This work is being tracked in T193264"

That appears to be about the toolsdb things rather than prod DB replicas?

Ok, great, then it sounds like this should go in the public VLAN, with ACLs in the Analytics VLAN to allow us to push data there, as well as ferm rules to allow Cloud VPS in.

In T207321#4687656, @Krenair wrote:

In T207321#4687651, @ayounsi wrote:

Where are the labsdb hosts going to live if they are being moved out of the labs-support VLAN?

From https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Network_and_Policy#labsdb100[4567]
"These hosts are out of warranty and are being replaced by virtual machines inside of the tenant network with dedicated cloud[lab]virt hypervisors. This work is being tracked in T193264"

That appears to be about the toolsdb things rather than prod DB replicas?

Correct. The wiki replica database servers are the https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Network_and_Policy#labsdb10[08|09|10] servers which currently live in the labs-support vlan, but should move somewhere else eventually to drain the deprecated vlan. As noted on the wiki page a firm decision has not yet been made on moving them to the public vlan or another location. We probably will not virtualize them inside cloud services itself due to the presence of sensitive data in the raw database tables which we currently only redact via the view layer that Cloud clients are limited to interacting with.

Not impacting that task, but for labsdb10[08|09|10], the presence of sensitive data + need to be reached from Cloud might require longer conversations. Please open a task as soon as you have a migration timeline.

Ayounsi, for this ticket, shall we ask for these to be set up in the public VLAN?

That sounds good to me but will have @faidon doublecheck.

Ideally please distribute those servers across multiple rows (they all have public vlans).

Faidon asked for a diagram to help understand the data flow. Here we go!

cloud-analytics data lake.png (900×1 px, 79 KB)

faidon mentioned this in T207536: Move various support services for Cloud VPS currently in prod into their own instances.Oct 25 2018, 12:14 AM

hey hey heyyy, the nodes are in! https://phabricator.wikimedia.org/T204177#4695147

How can we move this forward @faidon?

So, this is quite the can of worms :) There are several pieces to this, and honestly, I feel like VLANs is kind of a secondary question, with the primary being the overall design of this new infrastructure especially from a security perspective. Questions such as "what services should we be opening up to the public (WMCS/Internet)", "how should data flow from the Analytics cluster", etc.

Personally, I still don't feel like I fully understand all of the pieces, the data flows or even simple stuff like TCP ports, but at the same time don't think this is something that we should discuss as part of this -very specific- task. I think this warrants a bit more planning and a (hopefully small!) platform security review first, the result of which will drive some of the choices that we're going to be executing here.

I realize this may not be easy to hear when you've gathered momentum for this, but this is a piece of infrastructure that on one side interfaces with the most public/open infrastructure we operate (WMCS, which is effectively the Internet), and on the other with the most sensitive infrastructure that we operate (Analytics), and so I believe some extra caution and slowdown is warranted.

Practically speaking, I think this conversation needs to expand to a few more folks. I reached out to @chasemp earlier today, to ask for his help and especially given y'all worked on Analytics Security recently. This is a bit of a busy week for everyone and especially the security team, but we're going to sync up next week again and possibly set up a meeting with various involved parties. Hope that's OK?

Ottomata mentioned this in T207194: rack/setup/install cloudvirtan100[1-5].eqiad.wmnet.Oct 25 2018, 7:02 PM

Ping @Nuria too

This is a bit of a busy week for everyone and especially the security team, but we're going to sync up next week again and possibly set up a meeting with various involved parties. Hope that's OK?

Sounds good, let's talk about this next week. Let's keep in mind that all data that will flow to this system from analytics is already in WMCS/Internet. I understand your concerns have more to do with the network and connections rather than with the data per se but I thought this point is worth clarifying.

Just had a great meeting with @chasemp, @faidon, @JAllemandou and @Nuria. The main action item (after Nuria had to go) was to talk with Cloud VPS engineers to see if we could make this cluster on Cloud Virts instead of bare metal in prod. That would be totally fine with us, and actually even preferred. I think we thought this was not possible originally, but if it is, and we can do it within a couple of weeks, we'd like to proceed that way.

So! @bd808 and @Andrew, what do you think? Our planned bare metal resource usage is:

5 x workers: 128G RAM, 48T storage, 48 cores
2 x hadoop masters: 16ishG RAM, 4ish cores, etc. This is flexible.
1 or more various smaller 'coordinator' type nodes (Hive Server & MySQL catalog metastore, Presto coordinator, etc.). Also flexible.

The main resource consumption is the worker nodes. Would it be possible to get equivalent dedicated Cloud Virt resources for this?

My notes from the 2018-10-31 meeting:

https://phabricator.wikimedia.org/T207321#4691776

* hosts that push data into public data lake?
- ACL holes to make this happen
- can we generate this from labsdb directly instead of analytics vlan?
- 7 to 8 hours on 50 machines right now in private hadoop cluster (5-10 days total process right now)

  hadoop distcp
  - bidirectional
  - https://www.cloudera.com/documentation/enterprise/5-5-x/topics/cdh_admin_distcp_data_cluster_migrate.html
https://mapr.com/docs/52/ReferenceGuide/hadoop-distcp.html
https://stackoverflow.com/questions/31862904/how-to-do-i-copy-data-from-one-hdfs-to-another-hdfs

* access from analytics for other purposes?
- what would the ACL allowances look like?
- hdfs open on cloud analytics side namenode and datanode ports
- scope full pools from a -> b

* access from cloud for purposes?
- presto coordinator port

* single source of some data? mediawiki history reconstruction?

administration with 
    - query killer
    - ?
    - quota resource per user in presto
    - user auth/authz
    - https://github.com/prestodb/presto/wiki/Security-Troubleshooting-Guide
    - https://prestodb.io/docs/current/security/ldap.html
    - presto maybe has ldap auth?
    - credential distribution?

"You can further restrict the set of users allowed to connect to the Presto coordinator based on their group membership by setting the optional ldap.group-auth-pattern and ldap.user-base-dn properties in addition to the basic LDAP authentication properties."

    - capacity
    - auditing
    
- large cloud VPS hypervisors?
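
The notes above mention `hadoop distcp` as one way to push data from the existing cluster to the new one. As a rough illustration only, the sketch below composes such a command from Python using the basic `hadoop distcp <source> <destination>` form. The namenode hostnames, port 8020 and the dataset path are hypothetical, and the function only builds the argument list; it does not run anything.

```python
import shlex
from typing import List


def build_distcp_command(src_namenode: str, dst_namenode: str,
                         path: str, port: int = 8020) -> List[str]:
    """Build (but do not run) a cluster-to-cluster distcp command line."""
    source = f"hdfs://{src_namenode}:{port}{path}"
    destination = f"hdfs://{dst_namenode}:{port}{path}"
    return ["hadoop", "distcp", source, destination]


# Hypothetical hostnames and path, for illustration only.
command = build_distcp_command(
    "analytics-namenode.example",
    "cloud-analytics-namenode.example",
    "/data/mediawiki_history/snapshot=2018-10",
)
print(shlex.join(command))
```

A copy like this needs the source side to reach the namenode and datanode ports on the destination cluster, which is the ACL question raised in the notes.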

In T207321#4709800, @Ottomata wrote:

Just had a great meeting with @chasemp, @faidon, @JAllemandou and @Nuria. The main action item (after Nuria had to go) was to talk with Cloud VPS engineers to see if we could make this cluster on Cloud Virts instead of bare metal in prod. That would be totally fine with us, and actually even preferred. I think we thought this was not possible originally, but if it is, and we can do it within a couple of weeks, we'd like to proceed that way.

So! @bd808 and @Andrew, what do you think? Our planned bare metal resource usage is:

5 x workers: 128G RAM, 48T storage, 48 cores

2 x hadoop masters: 16ishG RAM, 4ish cores, etc. This is flexible.

Are these numbers per, or total? Are we talking about 48 cores total for the workers, or 240?

1 or more various smaller 'coordinator' type nodes (Hive Server & MySQL catalog metastore, Presto coordinator, etc.). Also flexible.

The main resource consumption is the worker nodes. Would it be possible to get equivalent dedicated Cloud Virt resources for this?

Are these numbers per, or total?

Per worker. This number is also flexible, it's just what we were aiming for with our bare metal hardware order.

Oh, I think we won't necessarily need so much storage. CPU and RAM are more important. Faster disks might actually be better than larger ones in this case.

If we want to do all this work on VMs, there are two clear ways forward. In either case, though, we should probably run some tests to make sure that a VM can do everything that y'all need.

Option one (easiest for cloud team):

So if we were to do this with dedicated cloudvirts, the easiest thing would be to piggyback on our current (almost-completed) order, T201352. Each of those has 72 cores, 512G and about 10Tb in raid10 each.

Your ask is for around 130 physical cores, 520G ram, 240Tb of storage.

So, we could hit those specs with two dedicated cloudvirts, except we would be nowhere near the amount of storage you're suggesting (we'd have less than a tenth as much.)

We'd have to pay for this somehow, of course.

Option two (still totally doable):

Rack and install 5 of your worker nodes as cloudvirts, and put one giant VM on each. Then use either one dedicated cloudvirt or, more likely, the normal cloud pool for your miscellaneous boxes.

The only concerns I have with this is that these boxes are BIG so I'd want to double-check with the DC people that we can fit them. They are also 10g nics, which we may not have switch space for.

I think either of these options is fine with me. With option one there's a greater chance of allowing self-serve for the analytics people (I'm not positive but I can probably just bind their project to those boxes and let them create/destroy VMs at will). With option two the cloud team would probably have to allocate the giant nodes by hand but that's quite easy.

If possible, I think I slightly prefer option 1. We may need more storage in the future, but I think for the time being it should be fine. @JAllemandou can correct me if I'm wrong, but we might not need more than 10TB or so to host a few mediawiki_history snapshots at once.

Option 2 is fine as well, especially if there is 10G switch room.

I can probably just bind their project

I think we'd prefer a new project for this. analytics project is mostly for prototyping and upgrade testing.

A new project is fine.

Just to clarify, with option one and current specs you'd be getting 5 nodes with:

48 logical cores
128Gb RAM
Either 2Tb or 3Tb storage per node <- note this relatively tiny number!

Let's hear what @JAllemandou thinks. The mediawiki_history dataset is under 1TB (snappy compressed parquet) per snapshot, and we want to keep a few snapshots around. We also may need some space to 'stage' the dataset copy while we load it into HDFS (not sure about this yet). I think ~15TB to start with should be ok.

TL;DR: I think 2/3Tb per host is just enough for a start, but might quickly become too small.
Details:
In terms of storage, Hadoop has a default replication factor of 3, giving you (actual space / 3) usable space; so roughly 5Tb.
As @Ottomata pointed out, the dataset we want to provide is less than 1Tb (currently 770Gb, growing). We will probably keep some of them (let's assume 3/4, taking into account the copying time period), and we will need enough space on the machine used as an entry-point to the dataset to hold it in full (copy locally from dumps, then to HDFS; might be doable by streaming).
Lastly, we also need to consider space for logs. Hadoop generates a big bunch of logs, and they have proven very useful when debugging. I don't know enough yet about Presto, but we should keep some space for this.
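
The storage estimate above can be written out as a small calculation. This is a sketch under the figures quoted in this thread (5 nodes, 2Tb or 3Tb per node, replication factor 3, 770Gb per snapshot); keeping 4 snapshots and treating 1 Tb as 1000 Gb are assumptions made for the example, and logs and staging space are not counted.

```python
def usable_hdfs_tb(nodes: int, raw_tb_per_node: float, replication: int = 3) -> float:
    """Usable HDFS capacity: total raw capacity divided by the replication factor."""
    return nodes * raw_tb_per_node / replication


def snapshots_tb(snapshot_gb: float, count: int) -> float:
    """Space needed to keep `count` snapshots, using 1 Tb = 1000 Gb."""
    return snapshot_gb * count / 1000


NODES = 5
SNAPSHOT_GB = 770    # current dataset size quoted above
SNAPSHOTS_KEPT = 4   # assumption: upper end of the "3/4" mentioned above

needed = snapshots_tb(SNAPSHOT_GB, SNAPSHOTS_KEPT)
for raw_tb in (2, 3):
    usable = usable_hdfs_tb(NODES, raw_tb)
    print(f"{raw_tb} Tb/node: usable {usable:.2f} Tb, "
          f"snapshots {needed:.2f} Tb, headroom {usable - needed:.2f} Tb")
```

With 2Tb per node the headroom after four snapshots is only about a quarter of a Tb before logs and staging, which matches the "just enough for a start" assessment.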

OH duh, I forgot to account for the HDFS replication. Right. Ok in that case, let's go with option 2. Is there room on the switches for 10g? :D

We do have enough 10G ports, I can't speak for rack space though.

Two points I was wondering about:
1/ Will all those hosts need to be in the same vlan/row (eg. cloud-hosts1-b-eqiad)? Ideally they should be spread across multiple rows to avoid the scenario of one row (aka. failure domain) outage taking the whole service down
2/ It looks like the guest VMs will need to communicate to private analytics hosts (on 10/8 space). I think the long term goal is to treat VMs as the public Internet, and thus only communicating with our public IPs.

1/ Will all those hosts need to be in the same vlan/row (eg. cloud-hosts1-b-eqiad)? Ideally they should be spread across multiple rows to avoid the scenario of one row (aka. failure domain) outage taking the whole service down

Yeah, they should be spread out. Ideally between at least 3 rows.

2/ It looks like the guest VMs will need to communicate to private analytics hosts (on 10/8 space). I think the long term goal is to treat VMs as the public Internet, and thus only communicating with our public IPs.

For now, no. In the meeting we had we decided to get the data we need over to these nodes in a more indirect manner; possibly by parking the data on dumps.wm.org and downloading it in Cloud VPS.

In T207321#4715878, @Ottomata wrote:

1/ Will all those hosts need to be in the same vlan/row (eg. cloud-hosts1-b-eqiad)? Ideally they should be spread across multiple rows to avoid the scenario of one row (aka. failure domain) outage taking the whole service down

Yeah, they should be spread out. Ideally between at least 3 rows.

We don't currently support virt hosting outside of row B.

Interesting...I suppose this service isn't quite as critical as our prod ones. Maybe this is ok?

If row b is down then so is cloud vps so not much point in this hadoop cluster being up :D

Joking aside, if these are instances they can be spread out beyond row b as cloud multi-row becomes a thing along with other instances. There is a nice parity there.

Ok, so plan:

Let's rack the 5 new Hadoop nodes in Row B and set them up as a dedicated Cloud Virts pool in a new project. Is 'cloud-analytics' an ok name for the project? What do we need to move this forward? @faidon, can @Cmjohnson go ahead and do this to finish up T207194: rack/setup/install cloudvirtan100[1-5].eqiad.wmnet?

Another Q: @Andrew, Can we have some other non dedicated resources in this project for smaller non-worker nodes? We'd need the same resources we requested in T207205: Set up 3 Ganeti VMs for datalake cloud analytics Hadoop cluster.

@Cmjohnson, please name these 5 boxes 'cloudvirtanalyticsXXXX' starting with cloudvirtanalytics1001. And rack them in row B with normal cloudvirt cabling. (If need be I can figure out in better detail what I mean by 'cloudvirt cabling' but @ayounsi is probably the best to ask about that.)

In T207321#4721075, @Ottomata wrote:

Another Q: @Andrew, Can we have some other non dedicated resources in this project for smaller non-worker nodes? We'd need the same resources we requested in T207205: Set up 3 Ganeti VMs for datalake cloud analytics Hadoop cluster.

Probably! 8 cores per box sounds like a lot but it's not out of the question. Can you make a new project request for the project to contain these new works, and include that request for the extra resources?

Cool, done: T208756: New Cloud VPS project 'cloud-analytics'

Ack, +1. Only thing I'd nitpick is that cloudvirtanalytics1001 may be too long for things like physical labels. I think dumps labvirts were just named "labvirts", could we go for that? If not, something shorter would be great. Maybe cloudvirt-an1001 or cloudvirt-dl1001 (for "data lake")?

if cloudvirtanalyticsXXXX is really too long then let's go with cloudvirtdlXXXX

I'd prefer if we used 'analytics' instead of 'data lake'. Can we do cloudvirtanXXXX? cloudvirt-anXXXX?

Let's make this happen! @Andrew are you ok with cloudvirt-anXXXX? @Cmjohnson would you prefer to coordinate racking of these nodes on this ticket or on T207194: rack/setup/install cloudvirtan100[1-5].eqiad.wmnet (we'll have to rename the nodes).

I'd prefer without the dash -- just cloudvirtan1XXX if cloudvirtanalytics1xxx won't fit.

Ok, @Cmjohnson your call then: we'd prefer cloudvirtanalytics1xxx, but if that is too long, then use cloudvirtan1xxx. How should we now proceed?

@Ottomata Let's go with cloudvirtan1xxx.

Ok!

@Cmjohnson I updated T207194: rack/setup/install cloudvirtan100[1-5].eqiad.wmnet to reflect the new naming. Please proceed and then assign to Cloud VPS folks for OS install / puppetization setup as Cloud Virt nodes.

Ottomata moved this task from Next Up to Done on the Analytics-Kanban board.Nov 29 2018, 8:52 PM

I've created a proof-of-concept VM, hadoop-worker-01.cloud-analytics.eqiad.wmflabs. Please check that out and confirm that the specs are adequate (and that it behaves reasonably.) Please don't put any actually useful work there :)

Once/if you're satisfied with that VM, please open a new task for me to create all five fresh VMs on your new hardware. And let me know what you want me to call them. Thanks!

*bump*

AH a task! I missed that. Making now.

Tomorrow (Jan 16) we have a meeting with some SRE folks to revisit this. We've got the cloud-analytics Hadoop and Presto cluster up and running in Cloud VPS (thanks Andrew!). But as Bryan says

One by one you are about to rediscover all the nice things in production network that are not included as out of the box services for Cloud VPS customers. :/

Tomorrow we should discuss (again) what it would take to stand up this cluster in production instead of Cloud VPS. I'll use this task to outline some of my thoughts.

All we need is that this Presto (JDBC) endpoint is accessible via Cloud VPS. We've been told previously that SRE won't be opening up any networking holes between Cloud and Prod. Ok. We consider this service similar to AQS (aka pageview API), except that we don't want the entire internet to have access by default. There will be no private data at all in this cluster.

Perhaps we can set up this cluster in prod, and expose the JDBC/Presto port to the public internet. Both Hive and Presto support LDAP authentication, so in theory we could require that users authenticate via LDAP when using a Presto CLI or a JDBC connection. This would look something like:

Presto service user authenticates to Hive as 'presto' user and has read-only privileges.
Users authenticate to public Presto service as themselves via LDAP. i.e. they must first have a wikitech account.

The tricky bit might be secure authentication. In order to use ldaps, will the client need to have our LDAP server's cacert? Or is our LDAP cacert signed with some public CA?

Some links:

I think there is a distinction to make here when saying "prod", as it's made of several vlans/networks, especially:

public, hosts have public IPs and are reachable from the Internet (including Cloud), protected by iptables
private, hosts have private IPs, only reachable from private and prod public, protected by firewalls and iptables
analytics, similar to private, with firewalls limiting outbound flows as well (data leak, etc)

SRE won't be opening up any networking holes between Cloud and Prod.

From Cloud to public prod, everything is already open, it's from Cloud to private prod that traffic shouldn't be permitted.

There will be no private data at all in this cluster.
All we need is that this Presto (JDBC) endpoint is accessible via Cloud VPS

Because of that it looks like the public vlan is a good candidate, using iptables to make sure the endpoint is only exposed to Cloud IP ranges.
That doesn't remove the need for proper secure authentication, but reduces considerably the attack surface.

Great meeting!

We realized that the same problem we are encountering with the available Cloud VPS infrastructure (monitoring, logging, maintenance, etc.) will also be encountered if the LabsDB MySQL replicas ever get moved into Cloud VPS. Bryan suggested that one day the Cloud VPS team might have an 'infrastructure' project that certain vetted projects could link into to use Prometheus, PuppetDB, Icinga, etc. But that day is not today!

We are going to pursue moving the Presto cluster back into production. Before we do, we want to understand a couple of things:

A. Bryan noted that we won't want to use users' regular LDAP accounts for authentication to this service, as we don't want them to e.g. hardcode their LDAP password in some tool code on a VPS box somewhere. So, LDAP/wikitech users will need auxiliary 'community analytics' (name TBD) accounts. @bd808, if we do this, can we auto create those counterpart accounts when users sign up in wikitech? (Similar to how yall create labsdb users for tool projects.)

B. How will clients actually use the JDBC endpoint. To use LDAPS, will they need to download a cacert? I will try to set up LDAP use in the existing cloud-analytics project as it is now to understand this.

In T207321#4885112, @Ottomata wrote:

A. Bryan noted that we won't want to use users' regular LDAP accounts for authentication to this service, as we don't want them to e.g. hardcode their LDAP password in some tool code on a VPS box somewhere. So, LDAP/wikitech users will need auxiliary 'community analytics' (name TBD) accounts. @bd808, if we do this, can we auto create those counterpart accounts when users sign up in wikitech? (Similar to how yall create labsdb users for tool projects.)

There are 2 ways we could go here:

Extend the existing maintain-dbusers that provisions the $HOME/replica.my.cnf database credentials for each Toolforge member and tool to do similar work for the new presto db (generate username & random password, create LDAP record, store username + password in a file that the user/tool can read).
Build a process for people to self-manage credentials for this new datastore. Self-managed credentials has been on the wishlist for the Wiki Replicas for a while as a replacement for maintain-dbusers. If we avoid worrying about provisioning a physical file this would be pretty easy to add into Striker. Striker already has the ability to create/edit LDAP records.
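
To illustrate the first option, here is a minimal sketch of the "generate username & random password, store them in a file only the owner can read" steps. It is not how maintain-dbusers is implemented. The `p_` username prefix, the `presto.cnf` file name and the `[client]` section are hypothetical choices for the example, and creating the matching LDAP record is left out entirely.

```python
import configparser
import os
import secrets
import stat
import tempfile


def provision_credentials(account: str, directory: str) -> str:
    """Generate credentials and store them in an owner-read-only file."""
    config = configparser.ConfigParser(interpolation=None)
    config["client"] = {
        "user": f"p_{account}",                  # hypothetical naming scheme
        "password": secrets.token_urlsafe(24),   # random, URL-safe password
    }
    path = os.path.join(directory, "presto.cnf")  # hypothetical file name
    # O_EXCL refuses to overwrite credentials that already exist.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "w") as handle:
        config.write(handle)
    os.chmod(path, 0o400)  # read-only for the owning user or tool
    return path


def read_credentials(path: str) -> dict:
    """Load the stored username and password back from the file."""
    config = configparser.ConfigParser(interpolation=None)
    config.read(path)
    return dict(config["client"])


# Demonstration in a throwaway directory instead of a real $HOME.
with tempfile.TemporaryDirectory() as demo_dir:
    demo_path = provision_credentials("exampletool", demo_dir)
    demo_mode = stat.S_IMODE(os.stat(demo_path).st_mode)
    loaded = read_credentials(demo_path)

print(loaded["user"], oct(demo_mode))
```

The second option (self-managed credentials) would replace the file-writing step with a web form, but the generated account would still be separate from the user's regular LDAP password.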

B. How will clients actually use the JDBC endpoint. To use LDAPS, will they need to download a cacert? I will try to set up LDAP use in the existing cloud-analytics project as it is now to understand this.

I don't think the end user will have any idea that LDAP is involved in this process. Any connection to the LDAP directory would be encapsulated in Presto's configuration. The user should just be passing a username and password to the Presto server via whatever client library they are using. I was trying to figure out how this will actually look for the end user, but at least for the python client I struck out. It looks like they only have Kerberos support right now. Ultimately though it would be similar, something in the client that adds the needed HTTP Basic auth header to the requests to Presto which then validates the credentials against the LDAP directory and allows or denies the request.
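
As a rough sketch of what "something in the client that adds the needed HTTP Basic auth header" could look like, the Python below builds such a request with only the standard library. The coordinator URL is hypothetical, and the `/v1/statement` path and `X-Presto-User` header reflect my understanding of Presto's HTTP client protocol, so treat them as assumptions to check against the Presto documentation. The request is only constructed, never sent.

```python
import base64
import urllib.request


def basic_auth_header(username: str, password: str) -> dict:
    """Return the HTTP Basic Authorization header for the given credentials."""
    raw = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return {"Authorization": f"Basic {token}"}


def build_statement_request(coordinator_url: str, sql: str,
                            username: str, password: str) -> urllib.request.Request:
    """Build (but do not send) a query submission carrying Basic auth."""
    headers = basic_auth_header(username, password)
    headers["X-Presto-User"] = username  # assumed Presto protocol header
    return urllib.request.Request(
        f"{coordinator_url}/v1/statement",  # assumed statement endpoint
        data=sql.encode("utf-8"),
        headers=headers,
        method="POST",
    )


# Hypothetical coordinator URL and credentials, for illustration only.
request = build_statement_request(
    "https://presto.example.net:8443", "SELECT 1", "test", "secret"
)
print(request.get_method(), request.full_url)
print(request.get_header("Authorization"))
```

Because the header is only base64-encoded, not encrypted, anyone who can observe the traffic can recover the password, so the endpoint would need to be served over HTTPS.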

I was trying to figure out how this will actually look for the end user

It'd be in the JDBC connection, e.g.

jdbc:presto://example.net:8080/hive/sales?user=test&password=secret

https://prestodb.io/docs/current/installation/jdbc.html

Or from a presto CLI somewhere (scroll down to the Presto CLI section).
https://prestodb.io/docs/current/security/ldap.html

There they mention truststore paths, which is where my question comes from.
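
For reference, a small sketch of how a JDBC URL with TLS settings might be assembled. The parameter names `SSL` and `SSLTrustStorePath` are taken from my reading of the Presto JDBC driver documentation and should be verified there; the host, port 8443 and truststore path are hypothetical. The helper performs no escaping of values, so it is only suitable for simple illustrative input.

```python
from typing import Dict, Optional


def presto_jdbc_url(host: str, port: int, catalog: str, schema: str,
                    params: Optional[Dict[str, str]] = None) -> str:
    """Assemble a Presto JDBC URL; parameter values are not escaped."""
    base = f"jdbc:presto://{host}:{port}/{catalog}/{schema}"
    if not params:
        return base
    query = "&".join(f"{key}={value}" for key, value in params.items())
    return f"{base}?{query}"


# Hypothetical values; SSL and SSLTrustStorePath are assumed parameter names.
url = presto_jdbc_url(
    "example.net", 8443, "hive", "sales",
    {
        "user": "test",
        "password": "secret",
        "SSL": "true",
        "SSLTrustStorePath": "/path/to/truststore.jks",
    },
)
print(url)
```

If the certificate on the Presto HTTPS endpoint chains to a CA that the client's Java installation already trusts, the truststore parameter may not be needed at all.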

In T207321#4885829, @Ottomata wrote:
I was trying to figure out how this will actually look for the end user

It'd be in the JDBC connection, e.g.
jdbc:presto://example.net:8080/hive/sales?user=test&password=secret
https://prestodb.io/docs/current/installation/jdbc.html

That works for a java developer, but nobody else (including Quarry). Quarry is a Python app which is why I was looking for support in their python client.

Or from a presto CLI somewhere (scroll down to the Presto CLI section).
https://prestodb.io/docs/current/security/ldap.html

There they mention truststore paths, which is where my question comes from.

That page says:

Access to the Presto coordinator should be through HTTPS when using LDAP authentication. The Presto CLI can use either a Java Keystore file or Java Truststore for its TLS configuration.

I take that to mean that the SSL certificate (or its signing cert) that is securing the Presto HTTPS endpoint, not the LDAP server, will be needed by the java cli client. I have no idea if the java packages we provide on the Toolforge grid engine nodes already know how to verify Let's Encrypt certs or not. That's probably something to look into. If you end up putting this new service on a *.wikimedia.org vhost behind the misc (or is it all text now?) varnish cluster then that should just work. The reason for recommending TLS for the Presto endpoint is that this is HTTP Basic auth which is trivial to intercept and reuse via packet capture.

That works for a java developer, but nobody else (including Quarry). Quarry is a Python app which is why I was looking for support in their python client.

Shouldn't matter, right? JDBC is pretty ubiquitous. Hmm, maybe it will. https://pypi.org/project/JayDeBeApi/

• Nuria closed this task as Resolved.Jan 17 2019, 2:26 PM

• Nuria set the point value for this task to 3.

Ottomata mentioned this in T214921: Setup elasticsearch on cloudelastic100[1-4].Feb 19 2019, 7:48 PM

	F26768307: cloud-analytics data lake.png
	Oct 24 2018, 3:08 PM

	F26768261: cloud-analytics data lake.pdf
	Oct 24 2018, 3:06 PM
