Order Data Lake Hardware
The mw history dataset is about 0.5 TB. If we are to use Presto, the ideal scenario is that a 3-node cluster can keep the whole dataset in memory.

We should also have disk space, if possible, for at least two dumps.

Hardware request task for the Druid Analytics cluster:

3 nodes with 64G of RAM and four Intel 1.6T SSD disks. We are currently setting up a 2.9T RAID10 LVM ext4 partition on each node.
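
As a quick check of the disk side, here is a small Python sketch of the RAID10 arithmetic for the Druid-style layout above, compared against the "at least two dumps" requirement. The dump size of 0.5 TB is an assumption taken from the mw history size mentioned earlier; the actual on-disk size of a dump may differ.

```python
def raid10_usable_tb(disks, disk_tb):
    """Usable capacity of a RAID10 array: half of the raw capacity,
    since every block is mirrored once."""
    if disks < 4 or disks % 2 != 0:
        raise ValueError("RAID10 needs an even number of disks, at least 4")
    return disks * disk_tb / 2

DISKS = 4
DISK_TB = 1.6          # four Intel 1.6T SSDs per node
PARTITION_TB = 2.9     # partition size currently used on the Druid nodes
DUMP_TB = 0.5          # assumption: one dump is about the size of mw history
DUMPS_TO_KEEP = 2

usable = raid10_usable_tb(DISKS, DISK_TB)
needed = DUMP_TB * DUMPS_TO_KEEP

print(f"RAID10 usable capacity per node: {usable:.1f} TB")
print(f"Space needed for {DUMPS_TO_KEEP} dumps: {needed:.1f} TB")
print(f"Fits in the {PARTITION_TB} TB partition: {needed <= PARTITION_TB}")
```

Under these assumptions, two dumps would fit on a single node's partition with room to spare, before counting any HDFS replication.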

I am pretty sure that we can easily order 128/256G of RAM each, but I'd like to understand how Presto handles memory before committing to a huge value like 0.5T of overall RAM, which seems like a lot.
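
To put rough numbers on this, the following Python sketch computes how much RAM each node would need to hold the ~0.5 TB dataset fully in memory on a 3-node cluster, and what the 64G/128G/256G options would give overall. The dataset size of 512 GB and the `headroom` factor are assumptions for illustration; `headroom` is not a Presto setting, and how much of a node's RAM Presto can actually use for data depends on its configuration.

```python
import math

DATASET_GB = 512   # "about .5 TB" of mw history, taken as 512 GB (assumption)
NODES = 3

def per_node_ram_gb(dataset_gb, nodes, headroom=1.0):
    """RAM per node needed to hold the dataset in memory across the cluster.

    headroom > 1.0 reserves extra space for the OS, JVM and query working
    memory; it is a hypothetical knob, not a Presto parameter.
    """
    return math.ceil(dataset_gb * headroom / nodes)

def cluster_ram_gb(ram_per_node_gb, nodes):
    """Total RAM across the cluster for a given per-node size."""
    return ram_per_node_gb * nodes

print(f"RAM needed per node: {per_node_ram_gb(DATASET_GB, NODES)} GB")
for option in (64, 128, 256):
    total = cluster_ram_gb(option, NODES)
    print(f"{option}G per node -> {total} GB total, "
          f"holds dataset: {total >= DATASET_GB}")
```

Under these assumptions, 128G per node falls short (384 GB in total), while 256G per node covers the dataset with some room left for query working memory.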

I'm pretty sure that whatever we end up using for this, the more memory we have the better.

Summary from the team discussion: we are going to set the hardware specs to match those of the Hadoop worker nodes, since it seems that Presto will need HDFS to work properly. The target is 3 nodes for the moment, but since we have a lot of nodes to order for the Hadoop cluster (refresh + expansion), we reserve the option of moving a couple of nodes from Hadoop to the public data lake if needed.

Thanks for the very accurate summary @elukey :)

Another point to figure out is what kind of security level we are aiming for, just to do our homework before ordering hardware and choosing Presto + Hadoop as technology for this project.

The public data lake project should be:

  • a way to load and offer public analytics datasets in labs for various explorations (without the need to sign NDAs, get access to production, etc.).
  • a 3 to 5 host cluster in labs, hosting a bare-minimum Hadoop (able to provide an HDFS layer).

What level of accounting, authentication and access control is required for a project like this one?

I had a chat with @MoritzMuehlenhoff about this use case; here are some more notes:

  • there will be no data shared with the Hadoop production cluster or any other host in production.
  • we (as analytics) will periodically load public data (no PII) to this new cluster in labs, which will effectively be a new small-scale Hadoop cluster in labs.

There seem to be no concerns about the use case, so I'd proceed with the hardware procurement task.

