
Shorten the time it takes to move files from hadoop to dump hosts by Kerberizing/hadooping the dump hosts
Open, High, Public

Description

Context: there are periodic rsyncs from labstore1006/7 that pull analytics public data from stat1007 so that it can be published. The workflow is the following (a rough sketch of the pull follows the list):

  • labstore100X starts an rsync pull from stat1007 via cron
  • the rsync module on stat1007 reads data from /mnt/hdfs, a FUSE mountpoint for HDFS
  • data is read from HDFS and returned through the FUSE layer
  • rsync copies the data from stat1007 to labstore100X
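
For illustration, the cron-driven pull on labstore100X is roughly equivalent to the following Python sketch (the rsync module name, paths, and options are hypothetical; the real job lives in puppet):

```
# Hypothetical sketch of the current cron-driven pull on labstore100X.
# The rsync module name, options and paths are illustrative, not the real config.
import subprocess


def pull_from_stat1007() -> None:
    # stat1007 exposes an rsync module whose root sits under /mnt/hdfs,
    # so every read on the server side goes through the HDFS FUSE layer.
    subprocess.run(
        [
            "rsync", "-rt", "--delete",
            "rsync://stat1007.eqiad.wmnet/hdfs-archive/",          # hypothetical module
            "/srv/dumps/xmldatadumps/public/other/analytics/",     # hypothetical target
        ],
        check=True,
    )


if __name__ == "__main__":
    pull_from_stat1007()
```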

There are some bottlenecks:

  1. using the hdfs fuse mountpoint for a big dataset may cause performance issues like T234160
  2. the fuse mountpoint is really brittle and not performant
  3. the double hop means data takes a very long time to move from HDFS to stat1007 and then to labstore100X

There are also some future things to solve: once Kerberos is enabled, rsync on stat1007 will need to be able to authenticate before pulling data from the fuse mountpoint. Nothing very complicated, but it requires extra config and testing.

A possible solution would be to have the labstore nodes pull data directly from HDFS. There is no stock rsync-like command for HDFS, but the analytics team wrote one that could be taken as a prototype/inspiration.
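
The Analytics tool is not reproduced here; as a rough illustration of the idea (assuming the plain `hdfs dfs` CLI and made-up paths), a direct pull on a labstore host could look something like this Python sketch:

```
# Minimal sketch of a direct HDFS-to-local copy, assuming the Hadoop client
# packages are installed on the labstore host. Both paths are illustrative.
import subprocess

HDFS_SRC = "/wmf/data/archive/mediawiki/history"                       # hypothetical HDFS path
LOCAL_DST = "/srv/dumps/xmldatadumps/public/other/mediawiki_history"   # hypothetical local path


def copy_from_hdfs(hdfs_src: str, local_dst: str) -> None:
    # 'hdfs dfs -get' copies files/directories from HDFS to the local
    # filesystem, bypassing the FUSE mountpoint entirely.
    subprocess.run(["hdfs", "dfs", "-get", hdfs_src, local_dst], check=True)


if __name__ == "__main__":
    copy_from_hdfs(HDFS_SRC, LOCAL_DST)
```

A real replacement would of course need to handle incremental updates and deletions the way rsync does, which is what the Analytics tool mentioned above aims for.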

Security-wise, to make this happen the labstore nodes will need to be able to pull data from Hadoop (so: install the Hadoop client packages, be whitelisted in ferm since they sit outside the Analytics network, etc.), and eventually they will need to be Kerberized (not a big deal, but mentioning it anyway).

Event Timeline

Nuria created this task. Mon, Sep 30, 3:25 PM
Restricted Application added a subscriber: Aklapper. Mon, Sep 30, 3:25 PM
Nuria renamed this task from "shorten the time it takes to move files from hadoop to dump hosts" to "shorten the time it takes to move files from hadoop to dump hosts by Kerberinzing the dump hosts". Mon, Sep 30, 3:26 PM
Nuria renamed this task from "shorten the time it takes to move files from hadoop to dump hosts by Kerberinzing the dump hosts" to "shorten the time it takes to move files from hadoop to dump hosts by Kerberizing the dump hosts". Mon, Sep 30, 3:40 PM
fdans added a project: Analytics-Kanban.
fdans triaged this task as High priority. Mon, Sep 30, 3:51 PM
elukey renamed this task from "shorten the time it takes to move files from hadoop to dump hosts by Kerberizing the dump hosts" to "Shorten the time it takes to move files from hadoop to dump hosts by Kerberizing the dump hosts". Tue, Oct 1, 7:19 AM
elukey updated the task description. (Show Details)
elukey added a subscriber: elukey.

Adding @Bstorm because the labstore servers are WMCS boxes.

How long does data take to show up on the labstore servers now once it's been stored in HDFS? Where is the biggest lag?

elukey added a comment (edited). Tue, Oct 1, 9:08 AM

The main problem is that a ton of data (like the recent mediawiki history dumps) needs to go through the fuse hdfs mountpoint, slowing things down a lot. That mountpoint is not able to cope with such a huge amount of requested data and shouldn't be used for that. Also, shifting data from stat1007 to labstore might be avoided entirely if those hosts could pull directly from HDFS. The mediawiki history dumps rsync started on Friday and has still not finished (IIUC a total of ~400GB) :)

Bstorm added a comment. Wed, Oct 2, 6:16 PM

My main concern would be if we were mounting labstore NFS and moving the data, which would surely cause problems with the current kernel on those servers and the scale of the hardware (people regularly set off alarms by running a "cp" command on the wrong thing on the cloud). If we are talking about running a sync to labstore1006/7 from an hdfs system, that sounds fine to me, but understand that I do not know what client software the labstore would need to do that. I also do not immediately know what the impact to the server would be.

I am interested to know a bit more about the client you'd use to run the sync. As for security, I'll mention that these servers are mounted on Cloud systems read-only, so they are probably the safest Cloud NFS servers to open firewall holes to. I am also curious if anyone has information about possible Kerberos design and such so I can better understand how that could impact the systems in the future.

If we are talking about running a sync to labstore1006/7 from an hdfs system

Yup, that's what we want!

understand that I do not know what client software the labstore would need to do that.

We would take care of it; it is mostly applying the profile::hadoop::common puppet class.

mounting labstore NFS and moving the data
probably the safest Cloud NFS servers to open firewall holes to

We mostly want to more easily copy this public data to these servers so that it can be downloaded via HTTP; for example: https://dumps.wikimedia.org/other/mediawiki_history/readme.html

Since the nodes themselves are in production (right?), adding the HDFS client (and eventually Kerberizing them) should be fine and should not open up any extra access to the Analytics cluster.

I am also curious if anyone has information about possible Kerberos design

Still very much a WIP, but the best Luca has so far is at https://wikitech.wikimedia.org/wiki/User:Elukey/Analytics/Kerberos

elukey added a comment. Wed, Oct 2, 6:21 PM

My main concern would be if we were mounting labstore NFS and moving the data, which would surely cause problems with the current kernel on those servers and the scale of the hardware (people regularly set off alarms by running a "cp" command on the wrong thing on the cloud). If we are talking about running a sync to labstore1006/7 from an hdfs system, that sounds fine to me, but understand that I do not know what client software the labstore would need to do that. I also do not immediately know what the impact to the server would be.
I am interested to know a bit more about the client you'd use to run the sync. As for security, I'll mention that these servers are mounted on Cloud systems read-only, so they are probably the safest Cloud NFS servers to open firewall holes to. I am also curious if anyone has information about possible Kerberos design and such so I can better understand how that could impact the systems in the future.

Thanks for the answer!

So the idea would be to install some Hadoop-related packages (nothing huge, mostly Python- and Java-based tools) on the labstore nodes and allow them to pull data from Analytics. This will be really lightweight and there should be no impact on the host. Basically, the current rsync will be translated into a copy from HDFS.
As for Kerberos, things are still under development, but I have some docs at:

https://wikitech.wikimedia.org/wiki/User:Elukey/Analytics/Kerberos
https://wikitech.wikimedia.org/wiki/User:Elukey/Analytics/Hadoop_testing_cluster

There are two hosts (krb1001 and krb2001), one in each DC, each running a KDC daemon. Clients will (in theory) be able to fail over transparently to a different KDC without any intervention if necessary. We have some basic automation to create principals and keytabs, and we support only one realm for the moment (WIKIMEDIA).

Nuria renamed this task from "Shorten the time it takes to move files from hadoop to dump hosts by Kerberizing the dump hosts" to "Shorten the time it takes to move files from hadoop to dump hosts by Kerberizing/hadooping the dump hosts". Fri, Oct 4, 3:53 AM
elukey moved this task from Backlog to In Progress on the User-Elukey board. Wed, Oct 9, 3:12 PM
elukey moved this task from Kerberos to Waiting for others on the User-Elukey board.

@Bstorm any remaining doubts that we can discuss? :)

Well, so far, we generally coordinate on maintenance, reboots and service downtime with @ArielGlenn to at least some extent. Should I assume that analytics team would be maintaining the state of hadoop client software on the dumps servers, and either way, are there special considerations to be taken when doing maintenance and reboots when it comes to the hadoop and kerberos setup?

Well, so far, we generally coordinate on maintenance, reboots and service downtime with @ArielGlenn to at least some extent. Should I assume that analytics team would be maintaining the state of hadoop client software on the dumps servers, and either way, are there special considerations to be taken when doing maintenance and reboots when it comes to the hadoop and kerberos setup?

Yes, for sure, we'd maintain the Hadoop client, but usually it is just a matter of applying a puppet profile (unless we do a major upgrade, in which case we take care of all the clients). No special considerations for reboots or maintenance: the client running on the labstore nodes will authenticate before starting; that's it, nothing to worry about on your side :)
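
As a hedged Python sketch of what "authenticate before starting" could look like once Kerberos is enabled (the principal, keytab path, and data paths below are made up for illustration):

```
# Sketch of a Kerberized sync job: obtain a ticket from a keytab, then copy.
# Principal, keytab path and HDFS/local paths are hypothetical.
import subprocess

KEYTAB = "/etc/security/keytabs/dumps.keytab"                # hypothetical keytab
PRINCIPAL = "dumps/labstore1006.wikimedia.org@WIKIMEDIA"     # hypothetical principal


def kinit() -> None:
    # Acquire a Kerberos ticket non-interactively from the keytab.
    subprocess.run(["kinit", "-kt", KEYTAB, PRINCIPAL], check=True)


def sync(hdfs_src: str, local_dst: str) -> None:
    kinit()
    # With a valid ticket in the credential cache, the Hadoop client
    # authenticates transparently against the Kerberized cluster.
    subprocess.run(["hdfs", "dfs", "-get", hdfs_src, local_dst], check=True)


if __name__ == "__main__":
    sync("/wmf/data/archive/mediawiki/history",
         "/srv/dumps/xmldatadumps/public/other/mediawiki_history")
```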

Sounds like that takes care of my concerns.

@MoritzMuehlenhoff @ArielGlenn any concerns from your side?

elukey claimed this task. Tue, Oct 22, 7:25 AM
elukey added a subscriber: Milimetric.