
Add samwalton to analytics-privatedata-users
Closed, ResolvedPublic

Description

Requestor provided information and prerequisites

  • Wikitech username: samwalton
  • Preferred shell username: samwalton
  • Email address: swalton@wikimedia.org
  • Ssh public key (must be dedicated key for wmf production): N/A (will use LDAP access only)
  • Requested group membership: analytics-privatedata-users
  • Reason for access: I'd like the ability to run queries in Hue on the mediawiki_history table so that I can work on T271962.
  • Name of approving party (hiring manager for WMF staff): Danny Horn?
  • Requestor -- Please Acknowledge that you have read and signed the L3 Wikimedia Server Access Responsibilities document: Confirmed
  • Requestor -- Please coordinate obtaining a comment of approval on this task from the approving party.

SRE Clinic Duty Confirmation Checklist for Access Requests

This checklist should be used on all access requests to ensure that all steps are covered, including expansion to existing access. Please double check the step has been completed before checking it off.

This section is to be confirmed and completed by a member of the SRE team.

  • - User has signed the L3 Acknowledgement of Wikimedia Server Access Responsibilities Document.
  • - User has a valid NDA on file with WMF legal. (This can be checked by Operations via the NDA tracking sheet & is included in all WMF Staff/Contractor hiring.)
  • - User has provided the following: wikitech username, preferred shell username, email address, and full reasoning for access (including what commands and/or tasks they expect to perform)
  • - User has provided a public SSH key. This SSH key pair should only be used for WMF cluster access and not shared with any other service (this includes not sharing it with WMCS access; no shared keys).
  • - Access request (or expansion) has sign-off from the WMF sponsor/manager (sponsor for volunteers, manager for WMF staff)
  • - Patchset for access request

For additional details regarding access request requirements, please see https://wikitech.wikimedia.org/wiki/Requesting_shell_access

Event Timeline

@Samwalton9: if your hiring manager is @DannyH, he just needs to comment on this task saying that he approves you to access confidential data.

Volans triaged this task as Medium priority. Mar 15 2021, 8:18 PM
Volans updated the task description.

@Ottomata: is this access request actually needed?
According to https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hue a manual sync of the user is enough to access Hue as the LDAP account of the user is already in the wmf group.

@Volans he can already access Hue, but when he tried to query private data (mediawiki_history, for T271962) he got a permissions error.

> @Volans he can already access Hue, but when he tried to query private data (mediawiki_history, for T271962) he got a permissions error.

Specifically, Error while compiling statement: FAILED: SemanticException Unable to determine if hdfs://analytics-hadoop/wmf/data/wmf/mediawiki/history is encrypted: org.apache.hadoop.security.AccessControlException: Permission denied: user=samwalton, access=READ, ...

@nshahquinn-wmf, @Samwalton9: thanks for the additional info. I've spoken with Analytics, and indeed membership in the analytics-privatedata-users group is needed to access some data; proceeding with the request.

I can confirm that:

  • L3 was signed on Sep 15 2016, 17:41.
  • Has a valid NDA as WMF staff, confirmed by an LDAP search on ldap-corp1001.
  • Has provided all the required information in the task description.

@Samwalton9: your Phabricator account is not linked with your LDAP account; you can link it in your settings. See also https://www.mediawiki.org/wiki/Phabricator/Help#Using_your_Wikimedia_developer_account for more information.

@DannyH: as @Samwalton9's manager, your approval here on the task is needed in order to process this request.
@Ottomata: as the approver for the analytics-privatedata-users LDAP group, your approval here on the task is needed in order to proceed with this request.

@nshahquinn-wmf @Samwalton9: if possible, I'd ask you to use PySpark + Hive to explore mediawiki_history. Hue is currently not in great shape after the Hadoop upgrade and we should use it as little as possible :)
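
For reference, a minimal PySpark sketch of that kind of workflow (assumptions, not instructions from this task: PySpark with Hive support is available on the host you use, Kerberos auth is already in place, and the Data Lake table is registered in the Hive metastore as wmf.mediawiki_history):

from pyspark.sql import SparkSession

# Start (or reuse) a Spark session with Hive support so that tables
# registered in the Hive metastore can be queried with Spark SQL.
spark = (
    SparkSession.builder
    .appName("mediawiki_history-exploration")
    .enableHiveSupport()
    .getOrCreate()
)

# Example: a simple count over one snapshot. The database/table name
# wmf.mediawiki_history is assumed here and may differ in your setup.
revisions = spark.sql("""
    SELECT COUNT(*) AS revision_count
    FROM wmf.mediawiki_history
    WHERE event_entity = 'revision'
      AND snapshot = '2021-02'
""")
revisions.show()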

@elukey what exactly will happen if we use Hue too much?

I recommended that Sam use Hue since he just wants to write an SQL query without having to deal with Jupyter, Python, wmfdata, and pandas (and his query will take more than a minute to run). I'm sure he could figure out all that eventually, but it would be a lot of extra work, like trying to use a fireplace instead of an oven to bake bread.

I appreciate the constant work you do keeping all our systems running smoothly, and I will understand if you have to remove Hue because the maintenance burden is too great. But as long as you don't do that, we really want to keep using it because it fills a use case that none of our other tools do.

Perhaps Superset + SQL Lab is a better option? Or does the fact that the query takes longer than a minute mean that it will time out there?

> @elukey what exactly will happen if we use Hue too much?
>
> I recommended that Sam use Hue since he just wants to write an SQL query without having to deal with Jupyter, Python, wmfdata, and pandas (and his query will take more than a minute to run). I'm sure he could figure out all that eventually, but it would be a lot of extra work, like trying to use a fireplace instead of an oven to bake bread.
>
> I appreciate the constant work you do keeping all our systems running smoothly, and I will understand if you have to remove Hue because the maintenance burden is too great. But as long as you don't do that, we really want to keep using it because it fills a use case that none of our other tools do.

Nothing, keep in mind that it is a little brittle at the moment, that's it :)

> Perhaps Superset + SQL Lab is a better option? Or does the fact that the query takes longer than a minute mean that it will time out there?

Joseph brought up the good point that Superset + Presto (without Alluxio caching, etc.) might not be great for this use case: the dataset is around half a terabyte, so moving that much data over the network for each query might be a little overkill (it is not a small dataset).
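
As a rough, purely illustrative sense of scale (the link speed below is an assumption, not a measured figure):

# Back-of-envelope estimate of the network cost of scanning the full
# dataset per query, using assumed numbers rather than measurements.
dataset_bytes = 0.5e12            # ~half a terabyte, per the comment above
link_bytes_per_sec = 10e9 / 8     # assumed 10 Gbit/s network link
minutes = dataset_bytes / link_bytes_per_sec / 60
print(f"~{minutes:.0f} minutes just to move the data once")  # roughly 7 minutes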

Oh, in case needed: approved by me too.

Change 673046 had a related patch set uploaded (by Volans; owner: Volans):
[operations/puppet@production] admin: add Sam Walton as LDAP only account

https://gerrit.wikimedia.org/r/673046

Change 673046 merged by Volans:
[operations/puppet@production] admin: add Sam Walton as LDAP only account

https://gerrit.wikimedia.org/r/673046

@Samwalton9 The patch has been merged. Within ~half an hour Puppet should have run everywhere and applied the changes.
Feel free to resolve this task once you've confirmed that it all works fine.

Thanks @Volans! Looks like I can get further than before, but I now see the following error when attempting to run a query: Error while compiling statement: FAILED: RuntimeException Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.

I assume you need a Kerberos setup, and if you have it, you need to initialize it with "kinit" first.

This might have got lost in all the comments, but to query mediawiki_history Sam will need to be in the analytics-privatedata-users group. Hm, we don't have this case covered in our [[ What access should I request | What access should I request? ]] docs.

I think that Sam will not need SSH for this, so probably this is like https://wikitech.wikimedia.org/wiki/Analytics/Data_access#Dashboards_in_Superset_that_do_access_private_data. I'll add Sam now (sorry @Volans, thanks for the patches so far), and we'll see. If he needs SSH too, we can add that later.

Change 673270 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Add samwalton as a posix user in analytics-privatedata-users, but no ssh

https://gerrit.wikimedia.org/r/673270

Change 673270 merged by Ottomata:
[operations/puppet@production] Add samwalton as a posix user in analytics-privatedata-users, but no ssh

https://gerrit.wikimedia.org/r/673270

org.apache.hive.service.cli.HiveSQLException: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask. Permission denied: user=samwalton, access=WRITE ... :(

access=WRITE ... ?

What is the query you are trying to run?

Samwalton9-WMF claimed this task.

> access=WRITE ... ?
>
> What is the query you are trying to run?

I'm doing a simple test query:

SELECT COUNT(*) FROM mediawiki_history
WHERE event_entity = 'revision'
AND snapshot = '2021-02';

I've just tried it again and it's working as intended, so I guess this was a temporary issue!

Hm, maybe your user had to be created on the host that runs Hue first. I did not manually do that after I merged, so it may have just done that now. Great!