Requesting access to `discovery.processed_external_sparql_query` for AndrewTavis_WMDE
Closed, Resolved · Public · Request

Description

Requestor provided information and prerequisites

This section is to be completed by the individual requesting access.

  • Wikimedia developer account username: Andrew McAllister (WMDE)
  • Email address: andrew.mcallister@wikimedia.de
  • SSH public key (must be a separate key from Wikimedia cloud SSH access):

ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAACAQC0TtVLJ3ZBLmnxOiFawTtOAoitHj9BpLRyXJZRuG6CI+6pco37i1wqHYCwOpc4jZg8fkPEN0k8lWLkFSzqKt/ZIb44o2g6ntKfzTKl9nsq6hbvPhjELxoU1KvnPeolz5e6d7mkL6uZlOS28zw5QutrjGlfFOREAeitMDkDXJ5YcJOq9/KDtePinHfyqb4djHdA5uuFM4ml4oD/u6FSSSRTUTN3CPbdramhRyfkDDZzBz60td8XTPgK+l7xVqQyQYWQZpeVufRVwYCq7uNMrFYuMI2M/Nuux3YUhq/1ats0LJRUJl1TOmY4g5kY1ltK2ARPsbM3MPc1rJAS2pFw4n8lyYgnZAkju7X0/qqsTfzf3jSh2e2tyFUir3pprMN14QA7PZfLlpZXPMtIKqDmUDVU260XiVDzqdFzUQKWYM0/kEeQ+6wiTfOkpZjP0weaJxj55x8WBVX/cKmVXYjSGiXX+SRPfKViFTg00RzpFGULnnC000KC4ThOiypBUjKmK9fuhVL0+RpPS5+SvHkB4ji4/6tKVm0ec1XL73235buKTwXgb8cLoJGtNZu5dQqesnJeaa0FaYqxXlRqId9WHAioD/IQn0B2WloPsjHH9V+09ICRm2qsScyG2uCG0svdcWSe4sVXRY6/49H0lMGQAVnLoNm9s2trBu/TEPGWuxQLtQ== andrew.mcallister@wikimedia.de

The above is the same one I provided in T335437: Requesting access to analytics for AndrewTavis_WMDE. Please let me know if I should make a new one for this specifically!

SRE Clinic Duty Confirmation Checklist for Access Requests

This checklist should be used on all access requests to ensure that all steps are covered, including expansion to existing access. Please double check the step has been completed before checking it off.

This section is to be confirmed and completed by a member of the SRE team.

  • User has signed the L3 Acknowledgement of Wikimedia Server Access Responsibilities Document.
  • User has a valid NDA on file with WMF legal. (All WMF Staff/Contractor hiring are covered by NDA. Other users can be validated via the NDA tracking sheet)
  • User has provided the following: developer account username, email address, and full reasoning for access (including what commands and/or tasks they expect to perform)
  • User has provided a public SSH key. This ssh key pair should only be used for WMF cluster access, and not shared with any other service (this includes not sharing with WMCS access, no shared keys.)
  • The provided SSH key has been confirmed out of band and is verified not being used in WMCS.
  • access request (or expansion) has sign off of WMF sponsor/manager (sponsor for volunteers, manager for wmf staff)
  • access request (or expansion) has sign off of group approver indicated by the approval field in data.yaml

For additional details regarding access request requirements, please see https://wikitech.wikimedia.org/wiki/Requesting_shell_access

Details

  • Title: search: Set group ownership of processed sparql queries
  • Reference: repos/data-engineering/airflow-dags!539
  • Author: ebernhardson
  • Source Branch: work/ebernhardson/sparql-query-access
  • Dest Branch: main

Event Timeline

JMeybohm changed the task status from Open to Stalled. Nov 3 2023, 8:36 AM
JMeybohm moved this task from Untriaged to In Discussion on the SRE-Access-Requests board.
JMeybohm added subscribers: Gehel, mpopov, Ottomata and 2 others.

From the conversation in Slack, it seems unclear how this should be solved. @Ottomata and @mpopov suggested changing the table's ownership to analytics-privatedata-users, which would probably need feedback from @Gehel and probably @EBernhardson.
With all of those new subscribers, I'll set this to Stalled until there is a clear path forward from an access request perspective.

This dataset is derived from event.wdqs_external_sparql_query, which is probably considered PII since it is a direct log of queries issued against the public SPARQL cluster. Mostly that means we shouldn't simply make this dataset world readable. Making the dataset group readable and owned by analytics-privatedata-users seems reasonable. The analytics-search user is a member of the analytics-privatedata-users group, so it should have appropriate permissions to set the ownership. Implementation for this dataset should amount to adding a couple parameters to the relevant airflow dag execution. An alternate solution could be to add AndrewTavis_WMDE to the analytics-search-users group. That would give general access to anything generated by the search team's data pipelines.

Thank you, @EBernhardson! There was a discussion on Slack where the decision was to not give me analytics-search-users access, so if that stands, it sounds like we're looking to make the dataset readable by analytics-privatedata-users. Happy to take any further steps on my end for analytics-search-users access if @Gehel expects that I'll need access to more of the search team's datasets :)

Hey. I'm having a hard time interpreting whether this is still stalled (maybe I'm misinterpreting the discussions or getting mixed messages from Slack vs the discussion here).

So is the way forward to add @AndrewTavis_WMDE to the analytics-search-users group?

So is the way forward to add @AndrewTavis_WMDE to the analytics-search-users group?

This would accomplish the goal of the task, but is probably not the right way to go.

Whoever owns and operates the requested data and the airflow jobs that generate it should do as @EBernhardson suggested:

Implementation for this dataset should amount to adding a couple parameters to the relevant airflow dag execution.

Actually, it might be as simple as doing a big hdfs dfs -chgrp -R analytics-privatedata-users <path to data>? @EBernhardson?
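
For reference, a minimal sketch of that one-off group change, assuming it is run by a user allowed to change the group (such as the dataset owner). The Hadoop CLI spells the subcommand `-chgrp`. The helper names and the `dry_run` flag are hypothetical, and the real dataset path is not given in this task, so the caller has to supply it.

```python
import shlex
import subprocess
from typing import List

# Group name taken from the discussion above.
DEFAULT_GROUP = "analytics-privatedata-users"


def build_chgrp_command(path: str, group: str = DEFAULT_GROUP) -> List[str]:
    """Return the argv for a recursive HDFS group change on `path`."""
    if not path.startswith("/"):
        raise ValueError("expected an absolute HDFS path")
    return ["hdfs", "dfs", "-chgrp", "-R", group, path]


def chgrp(path: str, group: str = DEFAULT_GROUP, dry_run: bool = True) -> str:
    """Print the command; execute it only when dry_run is False."""
    argv = build_chgrp_command(path, group)
    printable = shlex.join(argv)
    print(printable)
    if not dry_run:
        # Raises CalledProcessError if the hdfs command exits non-zero.
        subprocess.run(argv, check=True)
    return printable
```

Keeping `dry_run=True` as the default means the command is only printed until someone has confirmed the path, which seems prudent for a recursive change on a dataset that may contain PII.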

This should now be resolved: existing partitions are owned by analytics-privatedata-users, and new datasets going forward should also receive that group.
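
One way to spot-check this is to inspect the output of `hdfs dfs -ls` for the dataset directory. The sketch below parses a single listing line and checks that the entry carries the expected group, is group readable, and is not world readable. The function names are hypothetical, and the sample line in the usage note is made up for illustration, not taken from the cluster.

```python
from typing import NamedTuple


class HdfsEntry(NamedTuple):
    permissions: str
    owner: str
    group: str
    path: str


def parse_ls_line(line: str) -> HdfsEntry:
    """Parse one line of `hdfs dfs -ls` output."""
    # Columns: permissions, replication, owner, group, size, date, time, path.
    fields = line.split(None, 7)
    if len(fields) != 8:
        raise ValueError(f"unexpected listing line: {line!r}")
    return HdfsEntry(fields[0], fields[2], fields[3], fields[7])


def is_group_only_readable(
    entry: HdfsEntry, group: str = "analytics-privatedata-users"
) -> bool:
    """True if `group` owns the entry and can read it, and others cannot."""
    perms = entry.permissions  # e.g. "drwxr-x---"
    group_can_read = perms[4] == "r"
    world_can_read = perms[7] == "r"
    return entry.group == group and group_can_read and not world_can_read
```

For example, a made-up line such as `drwxr-x---   - analytics-search analytics-privatedata-users 0 2023-11-03 08:36 /hypothetical/path/to/data` would pass the check, while the same line with `drwxr-xr-x` would fail because it is world readable.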

I can confirm that I have access to discovery.processed_external_sparql_query now :) I'll resolve this, but please let me know if it would be best to let you all close these in the future.