
Make hadoop cluster able to push to swift
Closed, Resolved · Public

Description

Per our meeting on 2019-03-28, we have decided to explore Swift as the system by which we will put binaries into the prod network. Some challenges:

  • we need a swift client in hadoop
  • we need swift credentials deployed to hadoop
  • likely we need some firewall changes to be able to push binaries

@CDanis has graciously agreed to do the initial work of figuring out the best client for Hadoop to talk to Swift. This task (and maybe other subtasks) is to keep track of that work.

Event Timeline

Milimetric moved this task from Incoming to Machine Learning Platform on the Analytics board.

So it sounds like the firewall work is done (thanks Arzhel!)

Seems like the next thing is to create a Swift container for this usage -- and maybe one just for testing/playground work as well?

And then figure out something that makes sense for getting credentials/secrets to the Hadoop client that Andrew found. (I couldn't tell from a quick glance at the docs where the config file for Hadoop's Swift client is supposed to live.)

I got tied up with goal work and incident response and have only had a little time to spend on this.

The client that @Ottomata found does look like a good one. It should also be relatively straightforward to make a Swift account that is open to all users of the Analytics cluster. (If we need to lock it down to only certain Analytics users, that might be difficult.)

The thing I'm not sure about / could use some help with from @Ottomata or @elukey is how best to get a secret from the private Puppet repo (where Swift credentials live) into whichever Hadoop configuration file is needed (the examples on https://hadoop.apache.org/docs/current/hadoop-openstack/index.html#Configuring don't include advice on which file to put them in). I'm guessing that these settings are probably needed in core-site.xml on the local filesystem of the worker nodes...?

Some quick notes from today's meeting:

  • elukey has a cloud-vps hadoop cluster for testing changes like this (although it is kind of flaky / needs poking/recreation)
  • also a test cluster in production on old out-of-warranty nodes for testing Kerberos changes
    • this sounds especially useful for testing these changes
  • cross-DC replication would be helpful
    • hadoop upload to both codfw & eqiad
    • or turn on container synchronization in swift
    • not sure how 2-way sync works if we ever have writers in codfw
    • content-addressable filenames would probably help eliminate the possibility of weird collisions (see the sketch right after this list)
  • for some users there are concerns about keeping state: marking a certain generation of the data as "consumed" (e.g. for moving into mysql), and marking which generation of the data is "current" (for production serving)
    • some simple, generic design would be good to have for the future
  • re: managing the Swift account(s) and their secrets, it's probably fine to have one Swift account for all Analytics users; TBD how to manage the Swift account key / password in Hadoop config files
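A minimal sketch of the content-addressing idea (hypothetical container and path; the swift CLI is assumed to have credentials via the usual ST_AUTH/ST_USER/ST_KEY environment variables):

# Derive the Swift object name from the file's sha256 digest: two
# writers can then only "collide" when the content is byte-identical,
# in which case the collision is harmless.
f=/srv/artifacts/model.bin
digest="$(sha256sum "$f" | awk '{print $1}')"
swift upload my-container "$f" --object-name "artifacts/${digest}.bin"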

Found some better docs here:
https://docs.openstack.org/sahara/latest/user/hadoop-swift.html

So configs will go in core-site.xml. We can probably do this via hiera alone, using the $core_site_extra_properties param in puppet.
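Once those properties are rendered into core-site.xml, the -D flags become unnecessary and a copy would look roughly like this (the "analytics" service name and container are hypothetical):

hadoop distcp \
hdfs://analytics-hadoop/file/to/copy \
swift://my-container.analytics/file/to/copy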

TBD how to manage the Swift account key / password in Hadoop config files

I believe that we could avoid rendering the sensitive stuff in core-site.xml and pass it when needed on the CLI, e.g.

hadoop distcp -D fs.swift.service.<swift_service_name>.username=<username> -D fs.swift.service.<swift_service_name>.password=<password> \
hdfs://analytics-hadoop/file/to/copy \
swift://<swift_container_name>.<swift_service_name>/file/to/copy

@CDanis, @godog, I've set up a temp test Hadoop cluster in deployment-prep. I'd like to try this out, but am having a little trouble figuring out how to access the swift cluster there, and would love some help. Will ping you on IRC tomorrow. :)

I seem to be able to use hadoop distcp while defining the required properties on the CLI, like:

hadoop distcp \
-Dfs.swift.service.beta.username=XXX \
-Dfs.swift.service.beta.password=XXX \
-Dfs.swift.service.beta.auth.url=http://deployment-ms-fe03.deployment-prep.eqiad.wmflabs/auth/v1.0 \
-Dfs.swift.service.beta.tenant=??? \
/tmp/otto-hadoop0.txt \
swift://beta-hadoop-test.beta/otto-hadoop0.txt

(I took username and password out of /etc/swift/account_AUTH_pagecompilation.env on deployment-ms-fe03).

However, I don't know what .tenant should be, and I don't really know what the container (beta-hadoop-test in the example) should be. I'm getting

Authentication Failure: Authenticate as tenant 'XXX' user 'XXX' with password of length XX  POST http://deployment-ms-fe03.deployment-prep.eqiad.wmflabs/auth/v1.0 => 400 : <html><h1>Bad Request</h1><p>The server could not comply with the request since it is either malformed or otherwise incorrect.</p></html>

@fgiunchedi, can you have a look?

I've run some tests on deployment-hadoop-test-1. I think the problem is that on the swift side we're using the tempauth middleware to handle authentication (GET <auth_url>; username/password are sent in headers, and the auth token also comes back in headers), whereas the swift hadoop client attempts keystone authentication (POST JSON to <auth_url> and get tokens back).
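To illustrate the difference (a hedged sketch; we don't run keystone, so that endpoint is hypothetical):

# tempauth (v1.0): credentials go in the headers of a GET, and the
# token plus storage URL come back in the response headers.
curl -i -H 'X-Auth-User: account:user' -H 'X-Auth-Key: secret' \
http://deployment-ms-fe03.deployment-prep.eqiad.wmflabs/auth/v1.0
# => X-Auth-Token: AUTH_tk... and X-Storage-Url: http://.../v1/AUTH_account

# keystone (v2.0): credentials go in a JSON body POSTed to the tokens
# endpoint, and the token comes back in the JSON response.
curl -s -X POST -H 'Content-Type: application/json' \
-d '{"auth":{"tenantName":"tenant","passwordCredentials":{"username":"user","password":"secret"}}}' \
http://keystone.example.org:5000/v2.0/tokens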

From https://issues.apache.org/jira/browse/HADOOP-10420 it doesn't look like the swift hadoop client supports tempauth (yet), although I haven't checked the actual code. Where could we find it?

Is the tempauth middleware just for beta? I am pretty sure the same commands should work in production too, unless swift there is using something else for auth.

I haven't checked the actual code. Where could we find it?

I think: https://github.com/apache/hadoop/tree/trunk/hadoop-tools/hadoop-openstack

BTW, I just found https://github.com/walmartlabs/hadoop-openstack-swifta, which might be relevant? It also looks like it only supports keystone auth.

Is the tempauth middleware just for beta? I am pretty sure the same commands should work in production too, unless swift there is using something else for auth.

tempauth is used in production too, yeah; the beta and production clusters are set up the same way WRT auth

I haven't checked the actual code. Where could we find it?

I think: https://github.com/apache/hadoop/tree/trunk/hadoop-tools/hadoop-openstack

BTW, I just found https://github.com/walmartlabs/hadoop-openstack-swifta, which might be relevant? It also looks like it only supports keystone auth.

interesting! I'm thinking two possible avenues forward are: implement tempauth in the hadoop swift client (something similar to, or exactly like, HADOOP-10420), or keystone for swift (assuming it could coexist peacefully with tempauth)

I think either of those solutions probably just made this task a lot harder...

A less elegant alternative would be to just write a wrapper that downloaded from HDFS to local filesystem and then uploaded to swift :/
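Roughly (a hedged sketch; the HDFS path, container name, and auth env file are hypothetical):

# pull the data out of HDFS onto local disk...
set -e
tmpdir="$(mktemp -d)"
trap 'rm -rf "$tmpdir"' EXIT
hdfs dfs -get /wmf/data/some/output "$tmpdir/output"
# ...then push it to swift; the CLI reads ST_AUTH/ST_USER/ST_KEY from
# the environment
. /etc/swift/account_AUTH_analytics.env
swift upload some-container "$tmpdir/output"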

A less elegant alternative would be to just write a wrapper that downloaded from HDFS to local filesystem and then uploaded to swift :/

Indeed, not as elegant but IMHO equally workable (and working for sure), e.g. if we can schedule it as a job run by hadoop and not tied to a single machine

Change 511946 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[analytics/refinery@master] [WIP] Oozie utility workflow to upload files to Swift

https://gerrit.wikimedia.org/r/511946

Alright, I've written a bash wrapper to help out with this. I'd do it with just the swift CLI, but we need to be able to source some env vars from another file, which I don't think the Oozie shell action will let us do.

We should be able to include this Oozie (sub)workflow at the end of a data generation job and have it upload a directory from HDFS to Swift. I've tested swift-upload.sh in deployment-prep. I'd like to test it and this Oozie workflow in prod now, but I need to get some Swift creds to try, as well as a test container. @fgiunchedi or @CDanis can y'all help with that?

Alright, I've written a bash wrapper to help out with this. I'd do it with just the swift CLI, but we need to be able to source some env vars from another file, which I don't think the Oozie shell action will let us do.

We should be able to include this Oozie (sub)workflow at the end of a data generation job and have it upload a directory from HDFS to Swift. I've tested swift-upload.sh in deployment-prep. I'd like to test it and this Oozie workflow in prod now, but I need to get some Swift creds to try, as well as a test container. @fgiunchedi or @CDanis can y'all help with that?

For sure! The account part is split between public puppet (e.g. https://gerrit.wikimedia.org/r/c/operations/puppet/+/318148, sans the 'esams' part) and adding the corresponding credentials to private.git (both repos: the actual private private.git and the public labs/private.git, as in https://gerrit.wikimedia.org/r/c/labs/private/+/493076), followed by a rolling restart of the swift proxies.

Once the user is in place, creating the container is as straightforward as uploading a file to it with swift. While we're at it I recommend creating the container with the lowlatency storage policy so that swift will allocate objects on SSDs as opposed to spinning disks, e.g. swift upload --header X-Storage-Policy:lowlatency <container> <object>.
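Spelled out with credentials (a hedged sketch; the env file path and container name are hypothetical):

# the env file exports ST_AUTH / ST_USER / ST_KEY for the swift CLI
. /etc/swift/account_AUTH_analytics.env
# the first upload carrying the policy header creates the container on
# the lowlatency (SSD-backed) policy; a container's policy is fixed at
# creation time
swift upload --header 'X-Storage-Policy:lowlatency' analytics-test README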

Great thanks!

While we're at it I recommend creating the container with the lowlatency storage policy so that swift will allocate objects on SSDs as opposed to spinning disk

Sure...but I don't think low latency / SSDs are really needed for this use case. I can do this if you still prefer!

Great thanks!

While we're at it I recommend creating the container with the lowlatency storage policy so that swift will allocate objects on SSDs as opposed to spinning disk

Sure...but I don't think low latency / SSDs are really needed for this use case. I can do this if you still prefer!

I agree SSDs are probably not needed for this use case; we do have the space provisioned anyway and no SSD users ATM, so either way works for me! I think it'd be nice to try though; it should work transparently.

Change 512183 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[labs/private@master] Add swift analytics_admin dummy account key

https://gerrit.wikimedia.org/r/512183

Change 512184 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Add Swift analytics account with analytics:admin user

https://gerrit.wikimedia.org/r/512184

Change 512183 merged by Ottomata:
[labs/private@master] Add swift analytics_admin dummy account key

https://gerrit.wikimedia.org/r/512183

Change 512184 merged by Ottomata:
[operations/puppet@production] Add Swift analytics account with analytics:admin user

https://gerrit.wikimedia.org/r/512184

Change 512203 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Install python3-swiftclient on analytics cluster nodes

https://gerrit.wikimedia.org/r/512203

Change 512203 merged by Ottomata:
[operations/puppet@production] Install python3-swiftclient on analytics cluster nodes

https://gerrit.wikimedia.org/r/512203

Change 512210 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Include Swift analytics_admin auth .env file in HDFS

https://gerrit.wikimedia.org/r/512210

Change 511946 merged by Ottomata:
[analytics/refinery@master] Oozie utility workflow to upload files to Swift

https://gerrit.wikimedia.org/r/511946

Change 512210 merged by Ottomata:
[operations/puppet@production] Include Swift analytics_admin auth .env file in HDFS

https://gerrit.wikimedia.org/r/512210

Ok! Creds deployed, and oozie job merged. Refinery will be deployed this week and we can try it out!

Ok! Creds deployed, and oozie job merged. Refinery will be deployed this week and we can try it out!

Would this need a change in the Analytics VLAN's firewall to allow Hadoop to contact Swift?

Closing ticket as workflow is deployed and available for people (cc @EBernhardson @bmansurov) to try.

There is still another piece about sending a message so clients know the binary is available, but let's get into that once we have tested the workflow.

Eric needs the analytics-search user to be able to access the swift auth file so his Oozie jobs can upload to swift.

analytics-search is in the analytics-privatedata-users group. @fgiunchedi, @Nuria: should we make the auth file group readable by analytics-privatedata-users? This would allow almost everyone with Hadoop cluster access to upload to swift.
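Concretely, the puppet change would boil down to something like this on the relevant nodes (a hedged sketch; the file path is an assumption):

# make the auth env file group-readable but not world-readable
chgrp analytics-privatedata-users /etc/swift/auth_analytics.env
chmod 640 /etc/swift/auth_analytics.env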

Sounds good to me, note that there are rate limits in place for write operations (modules/swift/templates/proxy-server.conf.erb) in case you run into those. We'll probably need to start on quotas too.

Change 521954 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Allow analytics-privatedata-users group to access swift auth env file

https://gerrit.wikimedia.org/r/521954

Change 521954 merged by Ottomata:
[operations/puppet@production] Allow analytics-privatedata-users group to access swift auth env file

https://gerrit.wikimedia.org/r/521954

Change 522074 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Use 640 mode for swift auth env file

https://gerrit.wikimedia.org/r/522074

Change 522074 merged by Ottomata:
[operations/puppet@production] Use 640 mode for swift auth env file

https://gerrit.wikimedia.org/r/522074

@EBernhardson analytics-search user should now be able to access the auth file

Change 525106 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[analytics/refinery@master] Create container with read access during swift upload

https://gerrit.wikimedia.org/r/525106

Change 525106 merged by Ottomata:
[analytics/refinery@master] Create container with read access during swift upload

https://gerrit.wikimedia.org/r/525106
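For reference, granting read access on a container looks roughly like this (a hedged sketch; the container name and ACL value are assumptions, not necessarily what the refinery change does):

# allow anonymous reads (and listings) on the container
swift post --read-acl '.r:*,.rlistings' some-container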