
Review and improve Oozie authorization permissions
Closed, Resolved (Public)

Description

While working on Hue, I noticed that we don't use the Oozie option to limit the number of people with the admin role. This is what is listed in puppet:

# This is not currently working.  Disabling
# this allows any user to manage any Oozie
# job.  Since access to our cluster is limited,
# this isn't a big deal.  But, we should still
# figure out why this isn't working and
# turn it back on.
# I was not able to kill any oozie jobs
# with this on, even though the
# oozie.service.ProxyUserService.proxyuser.*
# settings look like they are properly configured.

The comment refers to oozie.service.AuthorizationService.authorization.enabled, which is listed in https://oozie.apache.org/docs/4.2.0/AG_Install.html#User_Authorization_Configuration

This means that all users in oozie are admins, so they can kill/restart/etc. any job. I created https://gerrit.wikimedia.org/r/c/operations/puppet/+/626595 and turned the option on for the Test cluster, and I was able to kill/start/stop jobs via Hue's UI (with my username listed as admin). I also tried temporarily removing my user from the admin list and, as expected, I wasn't able to kill jobs running as analytics.

If we want to turn this on (and I am really supportive) we should test other use cases, like when people inside groups like analytics-search, analytics-privatedata, etc. want to kill/start/restart oozie jobs from their username. For example, an analyst/researcher kicks off an oozie coordinator as the system user analytics-privatedata (via kerberos-run-command on a stat100x host) and then wants to kill the same job in Hue (logged in as their own user).

https://oozie.apache.org/docs/4.2.0/AG_Install.html#User_Authorization_Configuration lists the option of using ACLs, but it doesn't explain how. More recent docs, like http://oozie.apache.org/docs/5.2.0/AG_Install.html#Defining_Access_Control_Lists, add more info that we could test.

If I understood it correctly, one could set the group.name= option in the Oozie coordinator/bundle .properties file, listing which groups are allowed to act (stop/kill/etc.) on the job. So if this works, we'd only need to follow up with the owners of non-Analytics-team coordinators/bundles to add the option to their properties file.
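To make the permission model discussed above concrete, here is a minimal Python sketch of how I read the authorization rules in the Oozie docs: with authorization disabled everybody can act on every job, and with it enabled only admins, the job owner, and users matched by the job's ACL can. The names `Job` and `can_modify` are hypothetical, and this is a simplified model for reasoning about test cases, not Oozie's actual implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    """Hypothetical, simplified view of an Oozie job for this sketch."""
    owner: str
    acl: frozenset = frozenset()  # user and group names allowed to act on the job


def can_modify(user, user_groups, job, *, authorization_enabled,
               admin_users=frozenset(), admin_groups=frozenset()):
    """Return True if `user` may kill/start/stop `job` in this simplified model."""
    groups = set(user_groups)
    if not authorization_enabled:
        # Authorization off: every user is effectively an admin.
        return True
    if user in admin_users or admin_groups & groups:
        # Admins may act on any job.
        return True
    if user == job.owner:
        # Users may act on their own jobs.
        return True
    # Otherwise the user, or one of the user's groups, must be in the job ACL.
    return user in job.acl or bool(job.acl & groups)


job = Job(owner="analytics", acl=frozenset({"analytics-privatedata-users"}))
print(can_modify("razzi", {"wikidev"}, job, authorization_enabled=True))
```

The use case described above (a job started as a system user, then killed from a personal account) corresponds to the last branch: it only works if the job ACL names a group the person belongs to.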

This could be a good task for @razzi to understand the beauty of Oozie and Kerberos :)

Event Timeline

@Nuria @Ottomata I think that this could be a good second task for Razzi, since it needs some review of Oozie and how it currently works. Thoughts?

Testing idea: on analytics1030 (test cluster, where oozie runs) we have:

elukey@analytics1030:~$ sudo cat /etc/oozie/conf/adminusers.txt
# Admin Users, one user by line
otto
nuria
milimetric
mforns
fdans
joal
elukey
klausman
razzi

So I think it is sufficient to:

  1. sudo puppet agent --disable "razzi - testing"
  2. Remove either mforns or razzi manually from adminusers.txt and save the file
  3. systemctl restart oozie

At this point a job can be sent to be run as analytics. With the group setting set to say analytics-privatedata-users, the user that got removed in 2) should be able to kill the job via Hue.
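Step 2 above is a manual edit, but it can also be sketched in Python. This assumes the file format shown in the listing (one user per line, lines starting with `#` are comments); the helper names `admin_users` and `remove_admin` are hypothetical, and the sketch works on the file contents as a string rather than touching /etc/oozie/conf/adminusers.txt directly.

```python
def admin_users(contents: str) -> list:
    """Return the user names in adminusers.txt-style text, skipping comments."""
    users = []
    for line in contents.splitlines():
        name = line.strip()
        if name and not name.startswith("#"):
            users.append(name)
    return users


def remove_admin(contents: str, user: str) -> str:
    """Return the text without the line naming `user`; comments are kept."""
    kept = [line for line in contents.splitlines() if line.strip() != user]
    return "\n".join(kept) + "\n"


# Shortened sample of the listing above.
SAMPLE = "# Admin Users, one user by line\notto\nmforns\nrazzi\n"

print(admin_users(remove_admin(SAMPLE, "razzi")))
```

Puppet has to stay disabled while testing (step 1), otherwise the next agent run restores the original file.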

Confirmed with @mforns that adding to the bundle.properties

oozie.job.acl = <group that user belongs to> (in this case wikidev)

allows administering jobs via the UI.
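For job owners who need to add this setting to many properties files, a small helper could do it. This is only a sketch: `set_property` is a hypothetical name, and it handles plain `key=value` lines only, not the full Java properties syntax (line continuations, `:` separators, and so on).

```python
def set_property(properties_text: str, key: str, value: str) -> str:
    """Return properties text with `key` set, replacing an existing line or appending."""
    out, replaced = [], False
    for line in properties_text.splitlines():
        stripped = line.strip()
        is_assignment = "=" in stripped and not stripped.startswith("#")
        if is_assignment and stripped.split("=", 1)[0].strip() == key:
            # Overwrite the existing assignment in place.
            out.append(f"{key}={value}")
            replaced = True
        else:
            out.append(line)
    if not replaced:
        out.append(f"{key}={value}")
    return "\n".join(out) + "\n"


# "example.key" is a placeholder, not a real Oozie property.
print(set_property("example.key=1\n", "oozie.job.acl", "wikidev"))
```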

On analytics1030, where oozie runs in the test cluster, however, not all users are present; they will need to be added to the node that runs oozie for this to work. This is an-coord1001 in the production cluster.

The group that we'd expect, analytics-privatedata-users, is not there either:

razzi@an-coord1001:~$ groups razzi
razzi : wikidev adm ops
razzi@an-coord1001:~$ groups mforns
mforns : wikidev analytics-admins

So we'll need to review the users / groups on an-coord1001 for this to work.
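To review many users at once, the output of `groups <user>` can be parsed and checked for the expected group. A minimal sketch, assuming the `user : group1 group2 ...` output format shown above; `parse_groups_line` and `users_missing_group` are hypothetical helper names.

```python
def parse_groups_line(line: str):
    """Split a `groups <user>` output line into (user, set of group names)."""
    user, _, groups = line.partition(":")
    return user.strip(), set(groups.split())


def users_missing_group(lines, group: str) -> list:
    """Return the users whose group list does not contain `group`."""
    missing = []
    for line in lines:
        user, groups = parse_groups_line(line)
        if group not in groups:
            missing.append(user)
    return missing


# The two lines of output from an-coord1001 quoted above.
OUTPUT = [
    "razzi : wikidev adm ops",
    "mforns : wikidev analytics-admins",
]

print(users_missing_group(OUTPUT, "analytics-privatedata-users"))
```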

@razzi have you tried to add analytics-privatedata-users as oozie.job.acl and see if it works? In theory it should, IIUC the group membership is checked using HDFS (hence the Namenode, so an-master100x, where the groups are deployed).

@elukey we did try that, and it didn't work. It's possible we misconfigured something; could give that another try.

I can't find the docs that pointed to the fact that oozie checks Hadoop perms, so at this point I cannot really support my argument :(

On the hadoop masters we do deploy users without their ssh keys in hiera, see:

admin::groups_no_ssh:
  - analytics-users
  - analytics-privatedata-users
  # elasticsearch::analytics creates the analytics-search user and group
  # that analytics-search-users are allowed to sudo to.  This is used
  # for deploying files to HDFS.
  - analytics-search-users

So we could add this block to the coordinator.yaml config, so that the groups get deployed. Another thing that I noticed is that oozie.service.AuthorizationService.admin.groups can be more flexible than the current solution with the manual admin list. We could set the value in oozie-site.xml to analytics-admins and avoid a manual list of names, but it would make testing more difficult, so we can do it as the last step?
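For reference, this is roughly what the oozie-site.xml entry would look like. The sketch below builds it with Python's standard library so the output is well-formed; it assumes the usual Hadoop-style `<property>` layout with `<name>` and `<value>` children, and `property_element` is a hypothetical helper name. In our setup the file is rendered by puppet, so this is illustrative only.

```python
import xml.etree.ElementTree as ET


def property_element(name: str, value: str) -> ET.Element:
    """Build a Hadoop-style <property> element with <name> and <value> children."""
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = name
    ET.SubElement(prop, "value").text = value
    return prop


snippet = ET.tostring(
    property_element(
        "oozie.service.AuthorizationService.admin.groups",
        "analytics-admins",
    ),
    encoding="unicode",
)
print(snippet)
```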

Change 630218 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::analytics_test_cluster::coordinator: add analytics users without ssh keys

https://gerrit.wikimedia.org/r/630218

@razzi I had to start the decom of the hadoop test cluster sooner, so the whole testing env is now gone, sorry. I think that we can proceed anyway with deploying in production:

  • users/groups deployed on an-coord1001 (like https://gerrit.wikimedia.org/r/630218)
  • puppet change to be able to set oozie.service.AuthorizationService.admin.groups instead of the admin list
  • reach out to all the coordinator users (probably only the discovery team) to update their property files
  • update our docs about this restriction when making oozie coords (and possibly announce it to analytics-announce@)
  • deploy the puppet change for oozie and restart it

Hey @elukey,
On Friday @razzi and I encountered a puppet compiler error when trying to test your puppet change for the test cluster.
Razzi created a task for the error: T263876.
We believe that is unrelated, but didn't want to merge it anyway.

Yep, not related, but +1 on waiting, thanks! The admin module changes are battle tested, so we can deploy users/groups directly on an-coord1001 even if pcc doesn't work now.

@razzi when you have a moment let's create the puppet patches for this, so we can start reviewing the code etc.

Change 631849 had a related patch set uploaded (by Razzi; owner: Razzi):
[operations/puppet@production] oozie: use admin groups to determine admin access

https://gerrit.wikimedia.org/r/631849

Change 630218 abandoned by Elukey:
[operations/puppet@production] role::analytics_test_cluster::coordinator: add analytics users without ssh keys

Reason:

https://gerrit.wikimedia.org/r/630218

Change 631849 merged by Razzi:
[operations/puppet@production] oozie: use admin groups to determine admin access

https://gerrit.wikimedia.org/r/631849

Mentioned in SAL (#wikimedia-analytics) [2020-10-08T17:42:05Z] <razzi> restart oozie server on an-coord1001 for T262660

Mentioned in SAL (#wikimedia-analytics) [2020-10-08T18:08:30Z] <razzi> restart oozie server on an-coord1001 for reverting T262660

@razzi let's use the adminlist txt file for production (an-coord1001), then we'll test later on an-test-coord1001 (test cluster) to see if it works in there or not (it runs Bigtop's oozie version that should be 4.3).

Change 634152 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Enable admin list checks for Oozie in Analytics Hadoop

https://gerrit.wikimedia.org/r/634152

Change 634152 merged by Elukey:
[operations/puppet@production] Enable admin list checks for Oozie in Analytics Hadoop

https://gerrit.wikimedia.org/r/634152

Change 634179 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Enable admin list for Oozie in Analytics Hadoop - second attempt

https://gerrit.wikimedia.org/r/634179

Change 634179 merged by Elukey:
[operations/puppet@production] Enable admin list for Oozie in Analytics Hadoop - second attempt

https://gerrit.wikimedia.org/r/634179

2020-10-15 07:03:20,281  INFO AuthorizationService:520 - SERVER[an-coord1001.eqiad.wmnet] Oozie running with authorization enabled
2020-10-15 07:03:20,282  INFO AuthorizationService:520 - SERVER[an-coord1001.eqiad.wmnet] Admin users will be checked against the 'adminusers.txt' file contents

Oozie on an-coord1001 uses the admin list now.
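After a restart, the same check can be scripted by scanning the Oozie log for the AuthorizationService startup messages. A minimal sketch: the regular expression is inferred only from the two log lines quoted above, so other Oozie versions or log4j layouts may need a different pattern, and the function names are hypothetical.

```python
import re

# Pattern inferred from the two log lines above; an assumption, not a documented format.
LINE_RE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+"
    r"(?P<level>[A-Z]+)\s+(?P<logger>\w+):\d+ - "
    r"SERVER\[(?P<server>[^\]]+)\]\s+(?P<message>.*)$"
)


def authorization_messages(lines) -> list:
    """Return the messages logged by AuthorizationService, in order."""
    messages = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and match.group("logger") == "AuthorizationService":
            messages.append(match.group("message"))
    return messages


def authorization_enabled(lines) -> bool:
    """True if the startup message about enabled authorization is present."""
    return "Oozie running with authorization enabled" in authorization_messages(lines)


SAMPLE = [
    "2020-10-15 07:03:20,281  INFO AuthorizationService:520 - SERVER[an-coord1001.eqiad.wmnet] Oozie running with authorization enabled",
    "2020-10-15 07:03:20,282  INFO AuthorizationService:520 - SERVER[an-coord1001.eqiad.wmnet] Admin users will be checked against the 'adminusers.txt' file contents",
]

print(authorization_enabled(SAMPLE))
```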

Next steps: test the admin list group property on an-test-coord1001 (since it runs oozie 4.3 from Bigtop).

Change 637587 had a related patch set uploaded (by Razzi; owner: Razzi):
[operations/puppet@production] oozie: Add admin groups for authorization

https://gerrit.wikimedia.org/r/637587

Change 637587 merged by Razzi:
[operations/puppet@production] oozie: Add admin groups for authorization

https://gerrit.wikimedia.org/r/637587

Alright, looks like this is working. I created a "hello world" job and restarted oozie:

sudo service oozie restart

I ran the following command as analytics:

sudo -u analytics kerberos-run-command analytics oozie job -config job.properties -run
...
job: 0000000-201105000955761-oozie-oozi-W
razzi@an-test-coord1001:~$ oozie job -kill 0000000-201105000955761-oozie-oozi-W
...
Error: E0508 : E0508: User [razzi] not authorized for WF job [0000000-201105000955761-oozie-oozi-W]

This is expected, since I removed the ops group from the admin groups to test this, and my razzi user is not a member of analytics-admins.

Next steps: reset the permissions on the test cluster to include ops once again, and update the production cluster to use this setting rather than the admin users txt file.

Change 640260 had a related patch set uploaded (by Razzi; owner: Razzi):
[operations/puppet@production] oozie: Use admin groups for permissions

https://gerrit.wikimedia.org/r/640260

Change 643437 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::analytics_test_cluster::coordinator: add ops to oozie admins

https://gerrit.wikimedia.org/r/643437

Change 643437 merged by Elukey:
[operations/puppet@production] role::analytics_test_cluster::coordinator: add ops to oozie admins

https://gerrit.wikimedia.org/r/643437

Change 665352 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::analytics_cluster::coordinator: deploy analytics-product users

https://gerrit.wikimedia.org/r/665352

Change 665352 merged by Elukey:
[operations/puppet@production] role::analytics_cluster::coordinator: deploy analytics-product users

https://gerrit.wikimedia.org/r/665352

Change 640260 abandoned by Razzi:

[operations/puppet@production] oozie: Use admin groups for permissions

Reason:

Important parts done elsewhere, on to airflow

https://gerrit.wikimedia.org/r/640260