Shorten the time it takes to move files from hadoop to dump hosts by Kerberizing/hadooping the dump hosts
Closed, Resolved · Public

Description

Context: there are periodic rsyncs from labstore1006/7 to stat1007 that pull public analytics data to be published. The workflow is the following:

  • labstore100X starts a rsync pull to stat1007 via cron
  • the rsync's module reads data from /mnt/hdfs, a fuse mountpoint for HDFS
  • data is grabbed from HDFS, and returned to the fuse reader
  • rsync moves data from stat1007 to labstore100X

There are some bottlenecks:

  1. using the hdfs fuse mountpoint for a big dataset may cause performance issues like T234160
  2. the fuse mountpoint is really brittle and not performant
  3. data takes ages to move from HDFS to stat1007 to labstore100X

There is also a future problem to solve: once Kerberos is enabled, rsync on stat1007 will need to authenticate before pulling data from the fuse mountpoint. Nothing very complicated, but as things stand it requires extra config and testing.

A possible solution could be to have the labstore nodes pull data directly from HDFS. There are no rsync-like commands available, but the analytics team wrote one that might be taken as a prototype/inspiration.

Security-wise, to make this happen the labstore nodes will need to be able to pull data from Hadoop (so install hadoop client packages, be whitelisted in ferm because they are outside the analytics network, etc.) and eventually they'll need to be kerberized (not a big deal, but mentioning it anyway).

Event Timeline

Restricted Application added a subscriber: Aklapper. Sep 30 2019, 3:25 PM
Nuria renamed this task from shorten the time it takes to move files from hadoop to dump hosts to shorten the time it takes to move files from hadoop to dump hosts by Kerberinzing the dump hosts. Sep 30 2019, 3:26 PM
Nuria renamed this task from shorten the time it takes to move files from hadoop to dump hosts by Kerberinzing the dump hosts to shorten the time it takes to move files from hadoop to dump hosts by Kerberizing the dump hosts. Sep 30 2019, 3:40 PM
fdans added a project: Analytics-Kanban.
fdans triaged this task as High priority. Sep 30 2019, 3:51 PM
elukey renamed this task from shorten the time it takes to move files from hadoop to dump hosts by Kerberizing the dump hosts to Shorten the time it takes to move files from hadoop to dump hosts by Kerberizing the dump hosts. Oct 1 2019, 7:19 AM
elukey updated the task description.
elukey added a subscriber: elukey.

Adding @Bstorm because the labstore servers are WMCS boxes.

How long does data take to show up on the labstore servers now once it's been stored in HDFS? Where is the biggest lag?

elukey added a comment. Edited Oct 1 2019, 9:08 AM

The main problem is that a ton of data (like the recent mediawiki history dumps) needs to go through the fuse hdfs mountpoint, which slows things down a lot. That mountpoint cannot cope with large volumes of requested data and shouldn't be used for that. Also, shifting data from stat1007 to labstore might be avoided if those hosts could pull directly from HDFS. The mediawiki history dumps rsync started on Friday and has yet to finish (IIUC a total of ~400GB) :)

Bstorm added a comment. Oct 2 2019, 6:16 PM

My main concern would be if we were mounting labstore NFS and moving the data, which would surely cause problems with the current kernel on those servers and the scale of the hardware (people regularly set off alarms by running a "cp" command on the wrong thing on the cloud). If we are talking about running a sync to labstore1006/7 from an hdfs system, that sounds fine to me, but understand that I do not know what client software the labstore would need to do that. I also do not immediately know what the impact to the server would be.

I am interested to know a bit more about the client you'd use to run the sync. As for security, I'll mention that these servers are mounted on Cloud systems read-only, so they are probably the safest Cloud NFS servers to open firewall holes to. I am also curious if anyone has information about possible Kerberos design and such so I can better understand how that could impact the systems in the future.

If we are talking about running a sync to labstore1006/7 from an hdfs system

Yup, that's what we want!

understand that I do not know what client software the labstore would need to do that.

We would take care of it; it is mostly applying the profile::hadoop::common puppet class.

mounting labstore NFS and moving the data
probably the safest Cloud NFS servers to open firewall holes to

We mostly want to more easily copy this public data to these servers so they can be downloaded via http, example: https://dumps.wikimedia.org/other/mediawiki_history/readme.html

Since the nodes themselves are in production (right?), adding the HDFS client (and eventually the Kerberization) should be fine and should not open up any extra access to the Analytics cluster.

I am also curious if anyone has information about possible Kerberos design

Still very WIP, but the best Luca has so far is at https://wikitech.wikimedia.org/wiki/User:Elukey/Analytics/Kerberos

elukey added a comment. Oct 2 2019, 6:21 PM

Thanks for the answer!

So the idea would be to install some hadoop-related packages (nothing huge, mostly python and java based tools) on the labstore nodes and allow them to pull data from Analytics. This will be really lightweight and there should be no impact on the host. Basically the current rsync will be translated into a copy from HDFS.
As for Kerberos, things are still under development, but I have some docs in:

https://wikitech.wikimedia.org/wiki/User:Elukey/Analytics/Kerberos
https://wikitech.wikimedia.org/wiki/User:Elukey/Analytics/Hadoop_testing_cluster

There are two hosts (krb1001 and krb2001), one for each DC, each running a KDC daemon. The clients will be able to fail over transparently (in theory) to a different KDC without any intervention if necessary. We have some basic automation to create principals and keytabs, and we support only one realm for the moment (WIKIMEDIA).
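
On the client side, this kind of failover is usually expressed by listing both KDCs for the realm in krb5.conf, so that a client can try another KDC if one does not answer. A minimal sketch, assuming MIT Kerberos clients and using the bare host names mentioned above (the real FQDNs and any other realm settings are not given in this thread); `render_realm` is a hypothetical helper, not part of the actual puppet code:

```python
from typing import List


def render_realm(realm: str, kdcs: List[str]) -> str:
    """Render a krb5.conf [realms] stanza with one `kdc` entry per host."""
    lines = ["[realms]", f"    {realm} = {{"]
    # One `kdc = host` line per KDC; listing several gives clients alternatives.
    lines += [f"        kdc = {host}" for host in kdcs]
    lines.append("    }")
    return "\n".join(lines)


# Host names taken from the comment; fully qualified names are an assumption
# left out on purpose.
stanza = render_realm("WIKIMEDIA", ["krb1001", "krb2001"])
print(stanza)
```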

Nuria renamed this task from Shorten the time it takes to move files from hadoop to dump hosts by Kerberizing the dump hosts to Shorten the time it takes to move files from hadoop to dump hosts by Kerberizing/hadooping the dump hosts. Oct 4 2019, 3:53 AM
elukey moved this task from Backlog to In Progress on the User-Elukey board. Oct 9 2019, 3:12 PM
elukey moved this task from Kerberos to Waiting for others on the User-Elukey board.

@Bstorm any remaining doubts that we can discuss? :)

Well, so far, we generally coordinate on maintenance, reboots and service downtime with @ArielGlenn to at least some extent. Should I assume that analytics team would be maintaining the state of hadoop client software on the dumps servers, and either way, are there special considerations to be taken when doing maintenance and reboots when it comes to the hadoop and kerberos setup?

Yes for sure, we'd maintain the hadoop client, but usually it is just a matter of applying a profile (unless we do a major upgrade, but in that case we take care of all the clients). There are no special considerations for reboots or maintenance: the client running on the labstore nodes will authenticate before starting, and that's it. Nothing to worry about on your side :)
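
To illustrate "authenticate before starting": a minimal sketch, assuming a keytab-based login with the standard `kinit -k -t` invocation followed by the transfer command. The keytab path, the principal name and the job shown here are hypothetical placeholders, not the values used on the labstore hosts:

```python
from typing import List


def kinit_command(keytab: str, principal: str) -> List[str]:
    """argv that obtains a Kerberos ticket from a keytab (no password prompt)."""
    # -k: authenticate with a keytab; -t: path of the keytab file to use.
    return ["kinit", "-k", "-t", keytab, principal]


def authenticated_job(keytab: str, principal: str, job: List[str]) -> List[List[str]]:
    """Commands to run in order: authenticate first, then the actual job."""
    return [kinit_command(keytab, principal), job]


# Hypothetical keytab path, principal and job, for illustration only.
steps = authenticated_job(
    "/etc/security/keytabs/example.keytab",
    "example-user@WIKIMEDIA",
    ["hdfs", "dfs", "-ls", "/"],
)
for argv in steps:
    print(" ".join(argv))
```

Each argv could be handed to `subprocess.run(argv, check=True)` so that a failed authentication stops the job before any transfer is attempted.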

Sounds like that takes care of my concerns.

@MoritzMuehlenhoff @ArielGlenn any concerns from your side?

elukey claimed this task. Oct 22 2019, 7:25 AM
elukey added a subscriber: Milimetric.

@elukey Just want to double-check that the hadoop client won't use much CPU. These boxes both have 2 quad-core Xeon E5-2623s, but that's not tons for a box that does web service, rsyncs and/or nfs service too.

The CPU usage will be minimal. It will mostly just be network just like rsync is.

Rsync can be a big drain both on CPU and memory resources, depending on the size and number of files to be transferred. But if the hadoop client is not, then I'm fine with it.

Ottomata added a comment. Edited Oct 22 2019, 3:27 PM

The hadoop client itself shouldn't cause any extra CPU usage, it is just a dumb file transfer agent. I believe we'll just be doing an HDFS get of a specific directory in HDFS, which will just download the directory. We have some custom code to do an 'rsync' like copy from HDFS, but I'm not sure if we will be using that or not.
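
A rough sketch of what such a plain "HDFS get" could look like, assuming the stock `hdfs dfs -get` shell command is what ends up being used. Both paths are hypothetical examples and `build_hdfs_get` is an illustrative helper, not existing code:

```python
import shlex
from typing import List


def build_hdfs_get(hdfs_dir: str, local_dir: str) -> List[str]:
    """argv for downloading a file or directory tree out of HDFS."""
    # `hdfs dfs -get <src> <localdst>` copies from HDFS to the local filesystem.
    return ["hdfs", "dfs", "-get", hdfs_dir, local_dir]


# Hypothetical source and destination, for illustration only.
cmd = build_hdfs_get("/example/public/dataset", "/srv/example/dataset")
print(shlex.join(cmd))
```

Unlike rsync, a plain get downloads everything on every run, which is why an rsync-like copy is still being considered.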

Thumbs up from me then.

Thanks a lot for all the feedback! As next step I'd propose to file a puppet patch to:

  1. add the hadoop client packages to labstore
  2. move the rsync that pulls from stat1007 to something that pulls directly from HDFS (this one could/should be a separate patch)

We should wait for a final review that everything is ok security-wise, but since it doesn't seem to be a huge problem, I'd start preparing the work. I am out this week so I will not be able to work on it. @Ottomata, if you have time to progress this it would be super, otherwise I'll pick it up on Monday!

Change 545550 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Include hadoop client packages and config on dumps distribution servers

https://gerrit.wikimedia.org/r/545550

Change 545550 merged by Ottomata:
[operations/puppet@production] Include hadoop client packages and config on dumps distribution servers

https://gerrit.wikimedia.org/r/545550

Change 546189 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Allow labstore hosts to contact Hadoop

https://gerrit.wikimedia.org/r/546189

@MoritzMuehlenhoff @ArielGlenn any concerns from your side?

The general approach looks good to me; the rsync.d setup is full of non-ideal corner cases and this is a sensible cleanup. I think this needs to wait on Kerberos being enabled, though.

elukey added a comment. Nov 7 2019, 9:37 AM

After a chat with Moritz and my team, this is what we are planning to do:

  1. Add support for kerberos to labstore nodes, and deploy a Kerberos keytab for the user dumps (or similar)
  2. Research a way to "rsync" files from HDFS to labstore nodes. We have something that may work (as explained in the description) but it needs to be tested.
  3. Modify dumps::web::fetches::stats in puppet accordingly
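
To make step 2 more concrete, here is a generic sketch of the decision an rsync-like copy has to make: compare source and destination listings and transfer only what is missing or changed. It uses a size plus modification time comparison, similar to rsync's default quick check. This is not the hdfs-rsync script from analytics/refinery; all names and the sample listings are hypothetical:

```python
from typing import Dict, List, NamedTuple


class FileInfo(NamedTuple):
    size: int   # bytes
    mtime: int  # seconds since the epoch


class SyncPlan(NamedTuple):
    copy: List[str]    # missing or changed at the destination
    delete: List[str]  # present only at the destination


def plan_sync(source: Dict[str, FileInfo],
              dest: Dict[str, FileInfo],
              delete_extra: bool = False) -> SyncPlan:
    """Decide which relative paths to copy (and optionally delete)."""
    # Copy when the destination lacks the file or its size/mtime differ.
    to_copy = sorted(p for p, info in source.items() if dest.get(p) != info)
    # Only remove destination-only files when explicitly asked to.
    to_delete = sorted(p for p in dest if p not in source) if delete_extra else []
    return SyncPlan(to_copy, to_delete)


# Hypothetical listings: one unchanged file, one changed, one new, one stale.
hdfs_listing = {
    "a.tsv": FileInfo(100, 1000),
    "b.tsv": FileInfo(250, 2000),
    "c.tsv": FileInfo(10, 3000),
}
local_listing = {
    "a.tsv": FileInfo(100, 1000),
    "b.tsv": FileInfo(200, 1500),
    "old.tsv": FileInfo(5, 500),
}
print(plan_sync(hdfs_listing, local_listing, delete_extra=True))
```

Keeping the planning step separate from the actual transfer makes it easy to test and to offer a dry-run mode.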

This would need to be done right after Kerberos is enabled, so it might mean that we'll have to stop rsyncs for a couple of days to adapt everything to the new workflow (this will be communicated extensively if it happens).

it might mean that we'll have to stop rsyncs for a couple of days

Do we need to stop the real rsync jobs? The only one that wasn't working is the mw history rsync, because it is so large.

elukey added a comment. Nov 7 2019, 2:12 PM

Yep, they will not be able to authenticate via rsync; the hdfs mountpoint will return a read error :(

Change 550466 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::dumps::distribution::server: add kerberos

https://gerrit.wikimedia.org/r/550466

Change 550536 had a related patch set uploaded (by Joal; owner: Joal):
[analytics/refinery@master] Add hdfs-rsync script based on Hdfs python lib

https://gerrit.wikimedia.org/r/550536

elukey added a comment. Edited Nov 13 2019, 6:26 PM

@Bstorm summary of the next steps, so you can review and approve or not:

  • The current rsync that pulls from stat1007 to labstore100X will stop working once Kerberos is enabled, since reading the data will require auth, so we decided to pull data directly from Hadoop.
  • There are multiple steps involved to make it happen:

Two things are pending your review/approval:

More info about the latter: it is a repository deployed via scap that contains jars and python modules that we use in our infrastructure. The jars are deployed via scap+git-fat, normally requiring some GBs of disk space. It is also possible to deploy a "lighter" version of Refinery, namely with git-fat disabled, that doesn't "weigh" a lot (on the order of MBs).

Last but not least, all ops related to these new things will be handled by Analytics when needed :) How does the plan sound?

Change 550816 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::dumps::distribution::server: add analytics refinery

https://gerrit.wikimedia.org/r/550816

Change 550841 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[analytics/refinery/scap@master] Rename notebook env to 'thin', add labstore hosts

https://gerrit.wikimedia.org/r/550841

Change 550841 merged by Ottomata:
[analytics/refinery/scap@master] Rename notebook env to 'thin', add labstore hosts

https://gerrit.wikimedia.org/r/550841

@elukey Sounds fine. As long as the software is in the MB range, it shouldn't be an issue. / is surprisingly large on the dumps hosts:
/dev/dm-0 916G 4.4G 865G 1% /
So I'd put it under that filesystem instead of on the dumps one. I believe I already +1'd the patch. I am finally back from travel on Monday in case that helps with any plans here. None of this *sounds* particularly dangerous. If the original rsync is a cron, it'll need cleanup. I haven't checked all the patches closely enough to see if that's a part of it at all. :)

@Bstorm thanks a lot! Yes, the Analytics refinery is a scap repo so it will be deployed under /srv, which as far as I can see is not on a separate partition (so it will go under root, with plenty of space as you pointed out). I was mistaken, the two CRs pending your approval are:

Both of them can be done anytime. I'll need to generate the Kerberos keytabs first, but that will take 2 mins. I can proceed and merge, or wait for you on Monday if we want to be extra safe/sure, no rush!

Change 550536 merged by Mforns:
[analytics/refinery@master] Add hdfs-rsync script based on Hdfs python lib

https://gerrit.wikimedia.org/r/550536

Change 550466 merged by Elukey:
[operations/puppet@production] role::dumps::distribution::server: add kerberos

https://gerrit.wikimedia.org/r/550466

Change 550816 merged by Elukey:
[operations/puppet@production] role::dumps::distribution::server: add analytics refinery

https://gerrit.wikimedia.org/r/550816

I have fixed the refinery "thin" deploy configuration with https://gerrit.wikimedia.org/r/#/c/analytics/refinery/scap/+/553702/, now on labstore nodes we occupy 38MB (instead of 8G).

Next steps:

  • complete the work on hdfs-rsync and deploy to labstore nodes
  • file a code change to update the cron puppet config on labstore nodes (to use hdfs-rsync instead of regular rsync).

Change 556681 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] dumps::web::fetches::stats: move to systemd timers

https://gerrit.wikimedia.org/r/556681

Change 556681 merged by Elukey:
[operations/puppet@production] dumps::web::fetches::stats: move to systemd timers

https://gerrit.wikimedia.org/r/556681

Change 556757 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] dumps::web::fetches::analytics::job: add absolute path to bash

https://gerrit.wikimedia.org/r/556757

Change 556757 merged by Elukey:
[operations/puppet@production] dumps::web::fetches::analytics::job: add absolute path to bash

https://gerrit.wikimedia.org/r/556757

Change 557052 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] [WIP] Deploy analytics/hdfs-tools/deploy to hadoop clients

https://gerrit.wikimedia.org/r/557052

Change 557083 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Move Analytics systemd timers on labstore nodes to local /mnt/hdfs

https://gerrit.wikimedia.org/r/557083

Change 557084 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Enable kerberos on labstore nodes

https://gerrit.wikimedia.org/r/557084

Change 557052 merged by Ottomata:
[operations/puppet@production] Deploy analytics/hdfs-tools/deploy to hadoop clients

https://gerrit.wikimedia.org/r/557052

Change 557120 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[analytics/hdfs-tools/deploy@master] Fix hostnames for labstore scap targets

https://gerrit.wikimedia.org/r/557120

Change 557120 merged by Ottomata:
[analytics/hdfs-tools/deploy@master] Fix hostnames for labstore scap targets

https://gerrit.wikimedia.org/r/557120

Change 557128 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[analytics/hdfs-tools/deploy@master] Fix git_repo in scap.cfg

https://gerrit.wikimedia.org/r/557128

Change 557128 merged by Ottomata:
[analytics/hdfs-tools/deploy@master] Fix git_repo in scap.cfg

https://gerrit.wikimedia.org/r/557128

Change 557132 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[analytics/hdfs-tools/deploy@master] Update scap targets to hosts with profile::analytics::cluster::client

https://gerrit.wikimedia.org/r/557132

Change 557133 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Deploy hdfs-tools to profile::analytics::cluster::client hosts

https://gerrit.wikimedia.org/r/557133

Change 557132 merged by Ottomata:
[analytics/hdfs-tools/deploy@master] Update scap targets to hosts with profile::analytics::cluster::client

https://gerrit.wikimedia.org/r/557132

Change 557133 merged by Ottomata:
[operations/puppet@production] Deploy hdfs-tools to profile::analytics::cluster::client hosts

https://gerrit.wikimedia.org/r/557133

Change 546189 merged by Ottomata:
[operations/puppet@production] Allow labstore hosts to contact Hadoop

https://gerrit.wikimedia.org/r/546189

Change 557083 merged by Elukey:
[operations/puppet@production] Move Analytics systemd timers on labstore nodes to local /mnt/hdfs

https://gerrit.wikimedia.org/r/557083

Change 557084 merged by Elukey:
[operations/puppet@production] Enable kerberos on labstore nodes

https://gerrit.wikimedia.org/r/557084

elukey moved this task from In Progress to Paused on the Analytics-Kanban board. Dec 23 2019, 10:34 AM
elukey moved this task from In Code Review to Done on the Analytics-Kanban board. Jan 17 2020, 2:16 PM
elukey set Final Story Points to 21.
elukey changed Final Story Points from 21 to 13.
elukey moved this task from Stalled to Done on the User-Elukey board. Jan 17 2020, 3:38 PM
Nuria closed this task as Resolved. Feb 27 2020, 7:27 PM