
Consider replacing Cloudera's hadoop-hdfs-fuse with a newer, better and writeable HDFS FS mount
Closed, Declined (Public)

Description

Cloudera's hadoop-hdfs-fuse package has always been a pain. It was never reliable for writes, so we mount HDFS read-only. But even the read-only mount breaks often and requires manual intervention.

However, I think we still need an HDFS FS mount. Ideally we could find a newer solution that works better for both reads and (small) writes. There seem to be a lot out there!

In T224658: Newpyter - SWAP Juypter Rewrite, I'm exploring whether we can create a fully thin JupyterHub solution: one that runs both the user notebook servers and all kernels on a Hadoop worker or in YARN. To really do this and make it usable, users need to be able to write files from a notebook or kernel process running on any worker node, and later access them from another worker node. The only way I can think of to accomplish this is a shared filesystem of some kind, and I'd much prefer to leverage HDFS for this.

Let's explore the HDFS-FS bridges out there and see if we can find one that will fit our needs.

Event Timeline

The candidates are bad. Declining.

Candidates:

Hadoop NFS Gateway

This is actually built into Hadoop! It might be less performant (?) than FUSE-based mounts. A possible con is the usual NFS UID problem: files are owned by numeric UID, so a user whose UID differs between hosts ends up mapped to the wrong account. For regular users this won't matter, since we ensure that UIDs are the same on all hosts. It might not be enforced for system-installed users like hdfs or hive, though.
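
To get a feel for how big the UID problem would be for system users, something like the following could be run against passwd data collected from each host. This is only a rough sketch: the host names and UIDs in the sample are made up, and it assumes the passwd text for each host (e.g. the output of `getent passwd`) has already been gathered by some other means.

```python
from collections import defaultdict


def parse_passwd(text):
    """Return {username: uid} parsed from passwd-format text."""
    uids = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(":")
        # passwd format: name:password:UID:GID:gecos:home:shell
        if len(fields) < 3 or not fields[2].isdigit():
            continue
        uids[fields[0]] = int(fields[2])
    return uids


def find_uid_mismatches(passwd_by_host, names=None):
    """Return {username: {host: uid}} for users whose UID differs between hosts.

    passwd_by_host maps a host name to that host's passwd-format text.
    names optionally restricts the check to a set of usernames.
    """
    seen = defaultdict(dict)
    for host, text in passwd_by_host.items():
        for name, uid in parse_passwd(text).items():
            if names is None or name in names:
                seen[name][host] = uid
    return {
        name: hosts
        for name, hosts in seen.items()
        if len(set(hosts.values())) > 1
    }


if __name__ == "__main__":
    # Hypothetical sample data; real input would come from each worker.
    sample = {
        "worker1": "hdfs:x:996:993::/var/lib/hadoop-hdfs:/bin/bash\n"
                   "alice:x:2001:500::/home/alice:/bin/bash\n",
        "worker2": "hdfs:x:114:120::/var/lib/hadoop-hdfs:/bin/bash\n"
                   "alice:x:2001:500::/home/alice:/bin/bash\n",
    }
    print(find_uid_mismatches(sample, names={"hdfs", "hive"}))
```

Note that this only flags users that exist on more than one host with different UIDs. A user missing entirely from some host is not reported.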

py-hdfs-mount
hadoofus

This looks like it is just an up-to-date C and Python HDFS API, but perhaps there is a way to use it with FUSE?

Non-candidates: