
Allow all Analytics tools to work with Kerberos auth
Open, High, Public

Description

In T212259 we verified that critical things like oozie, camus and hue could work with Kerberos authentication. This task is meant to track the work to find the configurations needed by the following tools to work with Kerberos:

  • Druid (namely authentication to HDFS)
  • Hue (should be done, but we'd need to verify all use cases from our users)
  • Puppet alarms that require authentication with Hadoop (for example, forcing the standby namenode to fetch the current image from the master, nagios checks, etc.)
  • Beeline (which will replace the Hive CLI, since the latter is not compatible with Kerberos)
  • Spark
  • Hcatalog actions used by the Search team (see T225310)
  • /mnt/hdfs (alternative, if needed: https://github.com/EDS-APHP/py-hdfs-mount)
  • geoip::data::archive cron (user is root, probably will need to be changed)
  • labstore100[6,7] rsyncs

At the end of this task, ideally we should be able to say "we are ready to enable kerberos auth on the Analytics Hadoop cluster".

Event Timeline


Change 519400 merged by Elukey:
[operations/puppet@production] role::druid::test_analytics::worker: add kerberos support

https://gerrit.wikimedia.org/r/519400

Change 519405 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::druid::test_analytics::worker: remove hadoop client config

https://gerrit.wikimedia.org/r/519405

Change 519405 merged by Elukey:
[operations/puppet@production] role::druid::test_analytics::worker: remove hadoop client config

https://gerrit.wikimedia.org/r/519405

Change 519408 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::druid::test_analytics::worker: enable kerberos

https://gerrit.wikimedia.org/r/519408

Change 519408 merged by Elukey:
[operations/puppet@production] role::druid::test_analytics::worker: enable kerberos

https://gerrit.wikimedia.org/r/519408

Change 519422 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] druid: ensure that the druid user is in the pupept catalog

https://gerrit.wikimedia.org/r/519422

Change 519422 merged by Elukey:
[operations/puppet@production] druid: ensure that the druid user is in the pupept catalog

https://gerrit.wikimedia.org/r/519422

Change 519439 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::druid::test_analytics::worker: fix some kerberos parameters

https://gerrit.wikimedia.org/r/519439

Change 519439 merged by Elukey:
[operations/puppet@production] role::druid::test_analytics::worker: fix some kerberos parameters

https://gerrit.wikimedia.org/r/519439

Change 519601 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::hue: add more specific alarms for kerberos

https://gerrit.wikimedia.org/r/519601

Change 519601 merged by Elukey:
[operations/puppet@production] profile::hue: add more specific alarms for kerberos

https://gerrit.wikimedia.org/r/519601

elukey updated the task description. Jun 28 2019, 9:15 AM

Change 519607 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Add missing kerberos config to Hadoop HDFS (test cluster)

https://gerrit.wikimedia.org/r/519607

Change 519607 merged by Elukey:
[operations/puppet@production] Add missing kerberos config to Hadoop HDFS (test cluster)

https://gerrit.wikimedia.org/r/519607

Change 519610 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::hadoop::backup::namenode: move crons to timers

https://gerrit.wikimedia.org/r/519610

Change 519610 merged by Elukey:
[operations/puppet@production] profile::hadoop::backup::namenode: move crons to timers

https://gerrit.wikimedia.org/r/519610

elukey updated the task description. Jun 28 2019, 10:24 AM

Change 519612 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::hadoop::balancer: remove unused kerberos wrapper

https://gerrit.wikimedia.org/r/519612

Change 519612 merged by Elukey:
[operations/puppet@production] profile::hadoop::balancer: remove unused kerberos wrapper

https://gerrit.wikimedia.org/r/519612

elukey moved this task from Next Up to In Progress on the Analytics-Kanban board. Jul 1 2019, 9:58 AM
fdans triaged this task as High priority. Jul 1 2019, 3:41 PM
fdans moved this task from Incoming to Operational Excellence on the Analytics board.

Change 520178 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Assign role::analytics_test_cluster::client to an-tool1006

https://gerrit.wikimedia.org/r/520178

Change 520178 merged by Elukey:
[operations/puppet@production] Assign role::analytics_test_cluster::client to an-tool1006

https://gerrit.wikimedia.org/r/520178

Change 520182 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::analytics_test_cluster::client: add kerberos config

https://gerrit.wikimedia.org/r/520182

Change 520182 merged by Elukey:
[operations/puppet@production] role::analytics_test_cluster::client: add kerberos config

https://gerrit.wikimedia.org/r/520182

Change 520195 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::analytics_test_cluster::client: add spark2 config

https://gerrit.wikimedia.org/r/520195

Change 520195 merged by Elukey:
[operations/puppet@production] role::analytics_test_cluster::client: add spark2 config

https://gerrit.wikimedia.org/r/520195

Change 520391 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::hadoop::master: add Yarn unhealthy workers check

https://gerrit.wikimedia.org/r/520391

Change 520391 merged by Elukey:
[operations/puppet@production] profile::hadoop::master: add Yarn unhealthy workers check

https://gerrit.wikimedia.org/r/520391

Change 520398 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::hadoop: remove unused script

https://gerrit.wikimedia.org/r/520398

Change 520398 merged by Elukey:
[operations/puppet@production] profile::hadoop: remove unused script

https://gerrit.wikimedia.org/r/520398

Change 520442 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::hadoop::master: allow nagios to authenticate as hdfs

https://gerrit.wikimedia.org/r/520442

elukey updated the task description. Jul 3 2019, 2:33 PM
elukey added a comment. Jul 3 2019, 2:35 PM

The /mnt/hdfs mountpoint works with Kerberos, but of course needs the user to be authenticated before reading. There are two use cases:

  1. the nagios check that periodically lists /mnt/hdfs to check the mountpoint's health
  2. the rsyncs on labstore100[6,7]:
MAILTO=ops-dumps@wikimedia.org
41 * * * * bash -c '/usr/bin/rsync -rt --delete --exclude readme.html --chmod=go-w stat1007.eqiad.wmnet::hdfs-archive/mediacounts/ /srv/dumps/xmldatadumps/public/other/mediacounts/'
# Puppet Name: dumps-fetch-pageview
MAILTO=ops-dumps@wikimedia.org
51 * * * * bash -c '/usr/bin/rsync -rt --delete --exclude readme.html --chmod=go-w stat1007.eqiad.wmnet::hdfs-archive/{pageview,projectview}/legacy/hourly/ /srv/dumps/xmldatadumps/public/other/pageviews/'
Use case 2 is more complicated, since both the rsync daemon on stat1007 and the rsync client would probably need to be authenticated; this needs more research. Also, do we still need to rsync that data?
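
For the nagios use case, a minimal sketch of a check that obtains a ticket before listing the mountpoint could look like the following. The keytab path and principal are hypothetical placeholders (nothing in this task names them), and the real check may well be structured differently; only `klist -s` and `kinit -k -t` are standard MIT Kerberos usage.

```shell
#!/bin/bash
# Sketch only: KEYTAB and PRINCIPAL are hypothetical placeholders.
KEYTAB="${KEYTAB:-/etc/security/keytabs/example.keytab}"
PRINCIPAL="${PRINCIPAL:-example/host.example.org@EXAMPLE.ORG}"
MOUNTPOINT="${MOUNTPOINT:-/mnt/hdfs}"

ensure_ticket() {
    # klist -s prints nothing and exits non-zero when no valid ticket is cached.
    if ! klist -s 2>/dev/null; then
        # Fall back to a keytab-based login.
        kinit -k -t "$KEYTAB" "$PRINCIPAL"
    fi
}

check_mountpoint() {
    # Nagios-style return codes: 0 = OK, 2 = CRITICAL.
    if ensure_ticket && ls "$MOUNTPOINT" >/dev/null 2>&1; then
        echo "OK: $MOUNTPOINT is readable"
        return 0
    fi
    echo "CRITICAL: cannot list $MOUNTPOINT"
    return 2
}
```

Calling `check_mountpoint` from the check wrapper would then report CRITICAL both when the mount is broken and when no ticket can be obtained, which is probably the behaviour wanted for an alarm.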

Also, do we still need to rsync that data?

Ya, I believe so: https://dumps.wikimedia.org/other/pageviews/2019/2019-07/

  2. is a more complicated use case since the rsync on stat1007 and the rsync client would probably need to be authenticated?

Maybe we can convert this job to a push instead of a pull and run the rsync from stat1007?
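
As a rough illustration of the push variant, the sketch below prints the rsync commands that stat1007 would run, reusing the flags from the existing pull crons. The source root and the destination rsync module name are assumptions made up for the example; the labstore hosts would need a writable module configured for this to work.

```shell
#!/bin/bash
# Sketch only: SRC_ROOT and DEST_MODULE are hypothetical names, not taken from this task.
SRC_ROOT="${SRC_ROOT:-/mnt/hdfs/example-archive}"
DEST_MODULE="${DEST_MODULE:-example-dumps}"

build_push_cmd() {
    # $1 = directory under SRC_ROOT, $2 = destination host, $3 = directory under the destination module
    local subdir="$1" host="$2" dest="$3"
    # Same rsync flags as the current pull jobs, with source and destination swapped.
    printf '%s\n' "/usr/bin/rsync -rt --delete --exclude readme.html --chmod=go-w ${SRC_ROOT}/${subdir}/ ${host}::${DEST_MODULE}/${dest}/"
}

# Print (not run) one push command per labstore host.
for host in labstore1006 labstore1007; do
    build_push_cmd mediacounts "$host" mediacounts
done
```

The commands are only printed here, so the output can be reviewed before anything is wired into a timer on stat1007, where the job would already run with a valid Kerberos ticket.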

elukey updated the task description. Jul 4 2019, 9:06 AM

Change 520714 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profie::analytics::cluster::client: add kerberos support

https://gerrit.wikimedia.org/r/520714

Change 520442 merged by Elukey:
[operations/puppet@production] profile::hadoop::master: allow nagios to authenticate as hdfs

https://gerrit.wikimedia.org/r/520442

Change 520714 merged by Elukey:
[operations/puppet@production] profie::analytics::cluster::client: add kerberos support

https://gerrit.wikimedia.org/r/520714

Change 520745 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::analytics::cluster::client: enable krb auth in test cluster

https://gerrit.wikimedia.org/r/520745

Change 520745 merged by Elukey:
[operations/puppet@production] profile::analytics::cluster::client: enable krb auth in test cluster

https://gerrit.wikimedia.org/r/520745

elukey moved this task from Backlog to In Progress on the User-Elukey board. Jul 4 2019, 2:51 PM

Change 520775 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] geoip::data::archive: move to kerberos::systemd_timer

https://gerrit.wikimedia.org/r/520775

elukey moved this task from In Progress to Kerberos on the User-Elukey board. Jul 5 2019, 6:56 AM

Change 520989 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] base::monitoring::host: ignore /mnt/hdfs from disk checks

https://gerrit.wikimedia.org/r/520989

Change 520989 merged by Elukey:
[operations/puppet@production] base::monitoring::host: ignore /mnt/hdfs from disk checks

https://gerrit.wikimedia.org/r/520989

Change 521272 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::base: exclude fuse.fuse_dfs from disk space checks

https://gerrit.wikimedia.org/r/521272

Change 521272 merged by Elukey:
[operations/puppet@production] profile::base: exclude fuse.fuse_dfs from disk space checks

https://gerrit.wikimedia.org/r/521272

Change 520775 merged by Elukey:
[operations/puppet@production] geoip::data::archive: move to kerberos::systemd_timer

https://gerrit.wikimedia.org/r/520775

Change 522123 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] geoip::data::archive: remove parameter

https://gerrit.wikimedia.org/r/522123

Change 522123 merged by Elukey:
[operations/puppet@production] geoip::data::archive: remove parameter

https://gerrit.wikimedia.org/r/522123

Change 525824 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::analytics_test_cluster::coordinator: add el refine job

https://gerrit.wikimedia.org/r/525824

Change 525824 merged by Elukey:
[operations/puppet@production] role::analytics_test_cluster::coordinator: add el refine job

https://gerrit.wikimedia.org/r/525824

Change 526697 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::analytics_test_cluster::client: add hive option

https://gerrit.wikimedia.org/r/526697

Change 526697 merged by Elukey:
[operations/puppet@production] role::analytics_test_cluster::client: add hive option

https://gerrit.wikimedia.org/r/526697

Change 526701 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] cdh::hive: allow to render metastore's kerb option on client nodes

https://gerrit.wikimedia.org/r/526701

Change 526701 merged by Elukey:
[operations/puppet@production] cdh::hive: allow to render metastore's kerb option on client nodes

https://gerrit.wikimedia.org/r/526701

Change 526703 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::analytics_test_cluster::client: add hive kerberos options

https://gerrit.wikimedia.org/r/526703

Change 526703 merged by Elukey:
[operations/puppet@production] role::analytics_test_cluster::client: add hive kerberos options

https://gerrit.wikimedia.org/r/526703

Change 526843 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Remove Spark2 sasl config from the Hadoop test cluster

https://gerrit.wikimedia.org/r/526843

Change 526843 merged by Elukey:
[operations/puppet@production] Remove Spark2 sasl config from the Hadoop test cluster

https://gerrit.wikimedia.org/r/526843

elukey added a comment. Thu, Aug 1, 8:18 AM

@Ottomata I think I have fixed the spark2shell --master yarn issue with the last patch. Added the current issues in https://wikitech.wikimedia.org/wiki/User:Elukey/Analytics/Hadoop_testing_cluster#Use_Spark_2

Change 527979 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::analytics::refinery::job::test::refine: fix refine regex

https://gerrit.wikimedia.org/r/527979

Change 527979 merged by Elukey:
[operations/puppet@production] profile::analytics::refinery::job::test::refine: fix refine regex

https://gerrit.wikimedia.org/r/527979

Change 528071 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Enable spark.authenticate in yarn-site.xml on the Hadoop test cluster

https://gerrit.wikimedia.org/r/528071

Change 528071 merged by Elukey:
[operations/puppet@production] Enable spark.authenticate in yarn-site.xml on the Hadoop test cluster

https://gerrit.wikimedia.org/r/528071

Change 528086 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Add more spark security options to yarn-size in Hadoop Test

https://gerrit.wikimedia.org/r/528086

Change 528086 merged by Elukey:
[operations/puppet@production] Add more spark security options to yarn-size in Hadoop Test

https://gerrit.wikimedia.org/r/528086

Change 528400 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::analytics_test_cluster::hadoop::worker: set spark.executorEnv.LD_LIBRARY_PATH

https://gerrit.wikimedia.org/r/528400

Change 528400 merged by Elukey:
[operations/puppet@production] Set spark.executorEnv.LD_LIBRARY_PATH in the Hadoop test cluster's workers

https://gerrit.wikimedia.org/r/528400

Change 528483 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Remove Spark RPC auth/encryption settings from Hadoop test

https://gerrit.wikimedia.org/r/528483

Change 528483 merged by Elukey:
[operations/puppet@production] Remove Spark RPC auth/encryption settings from Hadoop test

https://gerrit.wikimedia.org/r/528483

After some tests to make Refine work with Kerberos in T228291, we decided to leave RPC encryption and authentication disabled for Spark. Spark 2.3.x doesn't work well with local mode and authentication (https://issues.apache.org/jira/browse/SPARK-23476), so we'll enable both features after the Kerberos and Spark 2.4.x deployments.

Change 529322 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Add kerberos support to Analytics Spark Refine

https://gerrit.wikimedia.org/r/529322

Change 529322 merged by Elukey:
[operations/puppet@production] Add kerberos support to Analytics Spark Refine

https://gerrit.wikimedia.org/r/529322

Change 529324 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Add spark config to the Hadoop Test cluster's yarn-site config

https://gerrit.wikimedia.org/r/529324

Change 529324 merged by Elukey:
[operations/puppet@production] Add spark config to the Hadoop Test cluster's yarn-site config

https://gerrit.wikimedia.org/r/529324

Change 529355 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Add Hadoop Native libraries path to spark2 default

https://gerrit.wikimedia.org/r/529355

Change 529355 merged by Elukey:
[operations/puppet@production] Add Hadoop Native libraries path to spark2 default

https://gerrit.wikimedia.org/r/529355

Change 531158 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::analytics_test_cluster::client: include SWAP/notebooks

https://gerrit.wikimedia.org/r/531158

Change 531158 merged by Elukey:
[operations/puppet@production] role::analytics_test_cluster::client: include SWAP/notebooks

https://gerrit.wikimedia.org/r/531158

For reference, awesome testing session done by Joseph: https://wikitech.wikimedia.org/wiki/User:Elukey/Analytics/Hadoop_testing_cluster

The only remaining thing to do is to sort out the issues with the labstore100[6,7] rsyncs. Those hosts pull data via rsync from stat1007's /mnt/hdfs mountpoint, so they will need authentication once Kerberos is enabled. One way to avoid extra complications in the configs could be to have stat1007 push the data to labstore100[6,7], but I am not sure whether that is acceptable. Looping in @ArielGlenn :)

@elukey On our previous server we let people pull from us and it was very difficult to manage upgrades or any sort of maintenance. Somewhere there's a ticket with the awfulness.

I remember talking about this on IRC; one option is, if the specific data to be rsynced is not too big, to just copy it out to a regular filesystem every so often via cron and let the labstore boxes grab it from there. Anything you want to ship to the labstores could get shovelled into the one tree for pickup.
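
A minimal sketch of that copy-out step, assuming it runs on stat1007 under a user that already holds a Kerberos ticket. `HDFS_ROOT` and `STAGING_ROOT` are hypothetical paths chosen for the example; the staging tree would then be what the labstore boxes pull from, with no authentication needed on their side.

```shell
#!/bin/bash
# Sketch only: both paths below are hypothetical placeholders.
HDFS_ROOT="${HDFS_ROOT:-/mnt/hdfs/example-archive}"
STAGING_ROOT="${STAGING_ROOT:-/srv/example-staging}"

stage_tree() {
    # $1 = directory name under HDFS_ROOT to copy into the staging tree
    local name="$1"
    # Reading /mnt/hdfs requires a valid ticket, so bail out early without one.
    if ! klist -s 2>/dev/null; then
        echo "no valid Kerberos ticket, skipping ${name}" >&2
        return 1
    fi
    mkdir -p "${STAGING_ROOT}/${name}"
    # Mirror the HDFS directory onto the regular filesystem.
    rsync -rt --delete "${HDFS_ROOT}/${name}/" "${STAGING_ROOT}/${name}/"
}
```

A periodic job could call `stage_tree mediacounts`, `stage_tree pageview` and so on, so that everything meant for the labstores ends up in the one tree for pickup.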