
Shorten the time it takes to move files from hadoop to dump hosts by Kerberizing/hadooping the dump hosts
Open, High, Public

Description

Context: there are periodic rsyncs from labstore1006/7 that pull analytics public data from stat1007 so that it can be published. The workflow is the following (a rough sketch of the pull follows the list):

  • labstore100X starts an rsync pull from stat1007 via cron
  • the rsync module on stat1007 reads data from /mnt/hdfs, a FUSE mountpoint for HDFS
  • data is read from HDFS and returned through the FUSE layer
  • rsync copies the data from stat1007 to labstore100X
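
For illustration, the cron-driven pull on labstore100X is roughly equivalent to the following Python sketch (the rsync module name, paths, and options are hypothetical; the real job lives in puppet):

```
# Hypothetical sketch of the current cron-driven pull on labstore100X.
# The rsync module name, options and paths are illustrative, not the real config.
import subprocess


def pull_from_stat1007() -> None:
    # stat1007 exposes an rsync module whose root sits under /mnt/hdfs,
    # so every read on the server side goes through the HDFS FUSE layer.
    subprocess.run(
        [
            "rsync", "-rt", "--delete",
            "rsync://stat1007.eqiad.wmnet/hdfs-archive/",          # hypothetical module
            "/srv/dumps/xmldatadumps/public/other/analytics/",     # hypothetical target
        ],
        check=True,
    )


if __name__ == "__main__":
    pull_from_stat1007()
```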

There are some bottlenecks:

  1. using the hdfs fuse mountpoint for a big dataset may cause performance issues like T234160
  2. the fuse mountpoint is really brittle and not performant
  3. the double hop means data takes a very long time to move from HDFS to stat1007 and then to labstore100X

There are also some future things to solve: once Kerberos is enabled, rsync on stat1007 will need to be able to authenticate before pulling data from the fuse mountpoint. Nothing very complicated, but it requires extra config and testing.

A possible solution would be to have the labstore nodes pull data directly from HDFS. There is no stock rsync-like command for HDFS, but the analytics team wrote one that could be taken as a prototype/inspiration.
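
The Analytics tool is not reproduced here; as a rough illustration of the idea (assuming the plain `hdfs dfs` CLI and made-up paths), a direct pull on a labstore host could look something like this Python sketch:

```
# Minimal sketch of a direct HDFS-to-local copy, assuming the Hadoop client
# packages are installed on the labstore host. Both paths are illustrative.
import subprocess

HDFS_SRC = "/wmf/data/archive/mediawiki/history"                       # hypothetical HDFS path
LOCAL_DST = "/srv/dumps/xmldatadumps/public/other/mediawiki_history"   # hypothetical local path


def copy_from_hdfs(hdfs_src: str, local_dst: str) -> None:
    # 'hdfs dfs -get' copies files/directories from HDFS to the local
    # filesystem, bypassing the FUSE mountpoint entirely.
    subprocess.run(["hdfs", "dfs", "-get", hdfs_src, local_dst], check=True)


if __name__ == "__main__":
    copy_from_hdfs(HDFS_SRC, LOCAL_DST)
```

A real replacement would of course need to handle incremental updates and deletions the way rsync does, which is what the Analytics tool mentioned above aims for.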

Security-wise, to make this happen the labstore nodes will need to be able to pull data from Hadoop (so: install the Hadoop client packages, be whitelisted in ferm since they sit outside the Analytics network, etc.), and eventually they will need to be Kerberized (not a big deal, but mentioning it anyway).

Event Timeline

Nuria created this task. Mon, Sep 30, 3:25 PM
Restricted Application added a subscriber: Aklapper. Mon, Sep 30, 3:25 PM
Nuria renamed this task from "shorten the time it takes to move files from hadoop to dump hosts" to "shorten the time it takes to move files from hadoop to dump hosts by Kerberinzing the dump hosts". Mon, Sep 30, 3:26 PM
Nuria renamed this task from "shorten the time it takes to move files from hadoop to dump hosts by Kerberinzing the dump hosts" to "shorten the time it takes to move files from hadoop to dump hosts by Kerberizing the dump hosts". Mon, Sep 30, 3:40 PM
fdans added a project: Analytics-Kanban.
fdans triaged this task as High priority. Mon, Sep 30, 3:51 PM
elukey renamed this task from "shorten the time it takes to move files from hadoop to dump hosts by Kerberizing the dump hosts" to "Shorten the time it takes to move files from hadoop to dump hosts by Kerberizing the dump hosts". Tue, Oct 1, 7:19 AM
elukey updated the task description. (Show Details)
elukey added a subscriber: elukey.

Adding @Bstorm because the labstore servers are WMCS boxes.

How long does data take to show up on the labstore servers now once it's been stored in HDFS? Where is the biggest lag?

elukey added a comment (edited). Tue, Oct 1, 9:08 AM

The main problem is that a ton of data (like the recent mediawiki history dumps) needs to go through the fuse hdfs mountpoint, slowing things down a lot. That mountpoint is not able to cope with such a huge amount of requested data and shouldn't be used for that. Also, shifting data from stat1007 to labstore might be avoided entirely if those hosts could pull directly from HDFS. The mediawiki history dumps rsync started on Friday and has still not finished (IIUC a total of ~400GB) :)

Bstorm added a comment. Wed, Oct 2, 6:16 PM

My main concern would be if we were mounting labstore NFS and moving the data, which would surely cause problems with the current kernel on those servers and the scale of the hardware (people regularly set off alarms by running a "cp" command on the wrong thing on the cloud). If we are talking about running a sync to labstore1006/7 from an hdfs system, that sounds fine to me, but understand that I do not know what client software the labstore would need to do that. I also do not immediately know what the impact to the server would be.

I am interested to know a bit more about the client you'd use to run the sync. As for security, I'll mention that these servers are mounted on Cloud systems read-only, so they are probably the safest Cloud NFS servers to open firewall holes to. I am also curious if anyone has information about possible Kerberos design and such so I can better understand how that could impact the systems in the future.

If we are talking about running a sync to labstore1006/7 from an hdfs system

Yup, that's what we want!

understand that I do not know what client software the labstore would need to do that.

We would take care of it; it is mostly applying the profile::hadoop::common puppet class.

mounting labstore NFS and moving the data
probably the safest Cloud NFS servers to open firewall holes to

We mostly want to more easily copy this public data to these servers so that it can be downloaded via HTTP; for example: https://dumps.wikimedia.org/other/mediawiki_history/readme.html

Since the nodes themselves are in production (right?), adding the HDFS client (and eventually Kerberizing them) should be fine and should not open up any extra access to the Analytics cluster.

I am also curious if anyone has information about possible Kerberos design

Still very much a WIP, but the best Luca has so far is at https://wikitech.wikimedia.org/wiki/User:Elukey/Analytics/Kerberos

elukey added a comment. Wed, Oct 2, 6:21 PM

My main concern would be if we were mounting labstore NFS and moving the data, which would surely cause problems with the current kernel on those servers and the scale of the hardware (people regularly set off alarms by running a "cp" command on the wrong thing on the cloud). If we are talking about running a sync to labstore1006/7 from an hdfs system, that sounds fine to me, but understand that I do not know what client software the labstore would need to do that. I also do not immediately know what the impact to the server would be.
I am interested to know a bit more about the client you'd use to run the sync. As for security, I'll mention that these servers are mounted on Cloud systems read-only, so they are probably the safest Cloud NFS servers to open firewall holes to. I am also curious if anyone has information about possible Kerberos design and such so I can better understand how that could impact the systems in the future.

Thanks for the answer!

So the idea would be to install some Hadoop-related packages (nothing huge, mostly Python- and Java-based tools) on the labstore nodes and allow them to pull data from Analytics. This will be really lightweight and there should be no impact on the host. Basically, the current rsync will be translated into a copy from HDFS.
As for Kerberos, things are still under development, but I have some docs at:

https://wikitech.wikimedia.org/wiki/User:Elukey/Analytics/Kerberos
https://wikitech.wikimedia.org/wiki/User:Elukey/Analytics/Hadoop_testing_cluster

There are two hosts (krb1001 and krb2001), one in each DC, each running a KDC daemon. Clients will (in theory) be able to fail over transparently to a different KDC without any intervention if necessary. We have some basic automation to create principals and keytabs, and we support only one realm for the moment (WIKIMEDIA).

Nuria renamed this task from "Shorten the time it takes to move files from hadoop to dump hosts by Kerberizing the dump hosts" to "Shorten the time it takes to move files from hadoop to dump hosts by Kerberizing/hadooping the dump hosts". Fri, Oct 4, 3:53 AM
elukey moved this task from Backlog to In Progress on the User-Elukey board. Wed, Oct 9, 3:12 PM
elukey moved this task from Kerberos to Waiting for others on the User-Elukey board.

@Bstorm any remaining doubts that we can discuss? :)

Well, so far, we generally coordinate on maintenance, reboots and service downtime with @ArielGlenn to at least some extent. Should I assume that analytics team would be maintaining the state of hadoop client software on the dumps servers, and either way, are there special considerations to be taken when doing maintenance and reboots when it comes to the hadoop and kerberos setup?

Well, so far, we generally coordinate on maintenance, reboots and service downtime with @ArielGlenn to at least some extent. Should I assume that analytics team would be maintaining the state of hadoop client software on the dumps servers, and either way, are there special considerations to be taken when doing maintenance and reboots when it comes to the hadoop and kerberos setup?

Yes, for sure, we'd maintain the Hadoop client, but usually it is just a matter of applying a puppet profile (unless we do a major upgrade, in which case we take care of all the clients). No special considerations for reboots or maintenance: the client running on the labstore nodes will authenticate before starting; that's it, nothing to worry about on your side :)
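
As a hedged Python sketch of what "authenticate before starting" could look like once Kerberos is enabled (the principal, keytab path, and data paths below are made up for illustration):

```
# Sketch of a Kerberized sync job: obtain a ticket from a keytab, then copy.
# Principal, keytab path and HDFS/local paths are hypothetical.
import subprocess

KEYTAB = "/etc/security/keytabs/dumps.keytab"                # hypothetical keytab
PRINCIPAL = "dumps/labstore1006.wikimedia.org@WIKIMEDIA"     # hypothetical principal


def kinit() -> None:
    # Acquire a Kerberos ticket non-interactively from the keytab.
    subprocess.run(["kinit", "-kt", KEYTAB, PRINCIPAL], check=True)


def sync(hdfs_src: str, local_dst: str) -> None:
    kinit()
    # With a valid ticket in the credential cache, the Hadoop client
    # authenticates transparently against the Kerberized cluster.
    subprocess.run(["hdfs", "dfs", "-get", hdfs_src, local_dst], check=True)


if __name__ == "__main__":
    sync("/wmf/data/archive/mediawiki/history",
         "/srv/dumps/xmldatadumps/public/other/mediawiki_history")
```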

Sounds like that takes care of my concerns.

@MoritzMuehlenhoff @ArielGlenn any concerns from your side?

elukey claimed this task. Tue, Oct 22, 7:25 AM
elukey added a subscriber: Milimetric.