Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | elukey | T198694 Q1 2018/19 Analytics procurement
Resolved | | elukey | T198424 Order Data Lake Hardware
Event Timeline
The MediaWiki history dataset is about 0.5 TB. If we are to use Presto, the ideal scenario is that with a 3-node cluster we can keep the whole dataset in memory.
We should also have disk space, if possible, for at least two dumps.
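A rough back-of-the-envelope check of what that implies per node (the 0.5 TB dataset size and the 3-node count are from above; the per-dump size is an illustrative assumption, not a measured value):

```python
# Rough sizing check for the 3-node public data lake cluster.
# Figures from the discussion: ~0.5 TB MediaWiki history dataset, 3 nodes.
# The per-dump size below is an assumption for illustration only.

DATASET_TB = 0.5      # MediaWiki history, kept fully in memory in the ideal Presto scenario
NODES = 3
DUMPS_TO_KEEP = 2
DUMP_TB = 0.5         # assume a dump is roughly the size of the dataset itself

ram_per_node_tb = DATASET_TB / NODES
disk_for_dumps_tb = DUMPS_TO_KEEP * DUMP_TB

print(f"RAM needed per node just for the dataset: ~{ram_per_node_tb * 1000:.0f} GB")
print(f"Disk needed across the cluster for {DUMPS_TO_KEEP} dumps: ~{disk_for_dumps_tb} TB")
# => ~167 GB of RAM per node, plus headroom for the OS, Presto itself and query scratch space.
```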
Hardware request task for the Druid Analytics cluster: https://phabricator.wikimedia.org/T166510
3 nodes, each with 64G of RAM and four Intel 1.6T SSDs. We are currently setting up a 2.9T RAID 10 LVM ext4 partition on each node.
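For reference, the 2.9T figure roughly checks out against the disks; a quick sketch of the RAID 10 arithmetic (treating the vendor "1.6T" as decimal terabytes, which is an assumption about how the drives are labelled):

```python
# RAID 10 on four 1.6 TB SSDs: half the raw capacity is usable (mirrored pairs, striped).
DISKS = 4
DISK_TB = 1.6                      # decimal terabytes, as SSD vendors usually label them

raw_tb = DISKS * DISK_TB           # 6.4 TB raw
usable_tb = raw_tb / 2             # 3.2 TB after RAID 10 mirroring
usable_tib = usable_tb * 1e12 / 2**40

print(f"Usable: {usable_tb:.1f} TB = {usable_tib:.2f} TiB")
# => ~3.2 TB, i.e. ~2.9 TiB, matching the 2.9T partition mentioned above (before ext4 overhead).
```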
I am pretty sure that we can easily order 128/256G of RAM each, but I'd like to understand how Presto handles memory before committing to a huge value like 0.5T of overall RAM, which seems like a lot.
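For context on how that per-node RAM would be carved up, here is an illustrative sizing sketch; the config.properties / jvm.config property names are Presto's standard memory knobs as far as I know, while the ratios are purely assumptions rather than recommendations:

```python
# Illustrative Presto memory budgeting for one worker node (assumed ratios, not a recommendation).
# Presto limits per-query user memory via query.max-memory-per-node (per node) and
# query.max-memory (cluster-wide) in config.properties; the JVM heap comes from -Xmx in jvm.config.

node_ram_gb = 128                                  # candidate order size discussed above
jvm_heap_gb = int(node_ram_gb * 0.75)              # leave room for the OS and page cache (assumption)
max_memory_per_node_gb = int(jvm_heap_gb * 0.5)    # rule of thumb: roughly half the heap (assumption)

print(f"-Xmx{jvm_heap_gb}G")
print(f"query.max-memory-per-node={max_memory_per_node_gb}GB")
print(f"query.max-memory={max_memory_per_node_gb * 3}GB   # 3 worker nodes")
```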
I'm pretty sure that whatever we end up using for this, the more memory we have the better.
Summary from the team discussion: we are going to set the hardware specs to match Hadoop worker nodes, since it seems that Presto will need HDFS to work properly. The target is 3 nodes for the moment, but since we have a lot of nodes to order for the Hadoop cluster (refresh + expansion), we reserve the option of re-shuffling a couple of nodes from Hadoop to the public data lake if needed.
Another point to figure out is what security level we are aiming for, just to do our homework before ordering hardware and choosing Presto + Hadoop as the technology for this project.
The public data lake project should be:
- a way to load and offer public analytics datasets in labs for various explorations (without the need to sign NDAs, get production access, etc.).
- a 3- to 5-host cluster in labs, hosting a bare-minimum Hadoop installation (able to provide an HDFS layer).
What level of accounting, authentication, and access control is required for a project like this one?
I had a chat with @MoritzMuehlenhoff about this use case, here's some more notes:
- there will be no data shared with the Hadoop production cluster or any other host in production.
- we (as analytics) will periodically load public data (no PII) onto this new cluster in labs, which will effectively be a new small-scale Hadoop cluster in labs.
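As an illustration of what "periodically loading public data" could look like against the labs HDFS layer, here is a minimal sketch using the Python hdfs (WebHDFS) client; the hostname, port, user and paths are placeholders, not the actual setup:

```python
# Minimal sketch: push a public (non-PII) dataset onto the labs HDFS cluster over WebHDFS.
# Hostname, port, user and paths are placeholders; requires the `hdfs` package (pip install hdfs).
from hdfs import InsecureClient

# No Kerberos here, matching the "bare minimum Hadoop" idea; tighten auth if the security
# review above calls for it.
client = InsecureClient('http://namenode.example.wmflabs.org:50070', user='analytics')

client.makedirs('/public-datasets/mediawiki-history')
client.upload('/public-datasets/mediawiki-history/2018-06.tsv.gz',
              'mediawiki-history-2018-06.tsv.gz', overwrite=True)

print(client.list('/public-datasets/mediawiki-history'))
```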
There seem to be no concerns about the use case, so I'd proceed with the hardware procurement task.