
Request access to data for Wikimedia Donation Patterns research
Closed, Resolved · Public

Description

Who
Oliver Keyes

Access Group
analytics-privatedata-users

Why
https://meta.wikimedia.org/wiki/Research:Exploring_Wikimedia_Donation_Patterns

Steps

Ops Clinic Duty Checklist for Access Requests

Most requirements are outlined on https://wikitech.wikimedia.org/wiki/Requesting_shell_access

This checklist should be used on all access requests to ensure that all steps are covered, including expansions of existing access. Please do not check off items on the list below unless you are in Ops and have confirmed the step.

  • - User has signed the L3 Acknowledgement of Wikimedia Server Access Responsibilities Document.
  • - User has a valid NDA on file with WMF legal. (This can be checked by Operations via the NDA tracking sheet & is included in all WMF Staff/Contractor hiring.)
  • - User has provided the following: wikitech username, preferred shell username, email address, and full reasoning for access (including what commands and/or tasks they expect to perform). (DYNKM is the wikitech name; use the email listed on wikitech.)
  • - User has provided a public SSH key. This SSH key pair should only be used for WMF cluster access and not shared with any other service (this includes not sharing it with WMCS access; no shared keys). A key-generation sketch follows this checklist. >>! In T188945#4029062, @DYNKM provided the SSH key.
  • - Access request (or expansion) has sign-off from a WMF sponsor/manager (sponsor for volunteers, manager for WMF staff) (@Capt_Swing made the request).
  • - non-sudo requests: a 3-business-day wait must pass with no objections noted on the task.
  • - sudo requests: all sudo requests require explicit approval during the weekly operations team meeting. No sudo requests will be approved outside of those meetings without the direct override of the Director of Operations.
  • - Patchsets for the access request: 2 patches, the first to add the user, the second to add them to the group. https://gerrit.wikimedia.org/r/#/c/416993/ & https://gerrit.wikimedia.org/r/#/c/416996/
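
For reference, a dedicated key pair as described in the checklist can be generated roughly like this (a sketch only; the file name and comment are placeholders, not required values):

ssh-keygen -t ed25519 -f ~/.ssh/wmf_production -C "wikitech-username"
cat ~/.ssh/wmf_production.pub    # paste this public half into a task comment; never share the private key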

Event Timeline

Capt_Swing created this task.

Done, I *believe* - same dev access name.

Added SRE-Access-Requests. Removed myself as assignee now that the next steps are (I assume) on Ops' side, but I'm not sure who to re-assign to.

@DYNKM I believe you need to do at least 1 more thing: can you upload your public key in a comment? I've never shepherded anyone through this process before, but see the comment thread on T142780 for an idea of what needs to happen next.

That'd be:

ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC28nrvknbeIAlF31jJCw1ucjaToH7dcGtUtWuZM2dsfWRqjYx9xuIByEisePnS1r9EPh0y3swTDn6zBHlWGDkkuzPWkF3SXCWlOqPFI94AQRPyBqyUNecT8Hm3YMkTKFZNM3TKsKd+rrlgBgFU5RDGgkFZkrsEEG1KhOpOc0IbtHlJxxLqjoQZggBCQ7v9TRwnnlZ6WB25/4Fxd7tQLKWdYCbT4iiiSEG4A5A30GIRqE9bSzgMGzH3jjTQabxNlp7O690pUs3iPdbWn1qV/IThGDXKskfwhjbYKW4065+vnI1p8F8qA8Lu8B1N6pji1y407o6zQodSvmRS5TdlrrzF okeyes@uw.edu

Change 416993 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] add shell user dynkm/oliver keyes

https://gerrit.wikimedia.org/r/416993

Change 416996 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] adding dynkm/oliver keyes to analytics-privatedata-users

https://gerrit.wikimedia.org/r/416996

Resuming the old account sounds ideal; thank you!

Change 416993 merged by Vgutierrez:
[operations/puppet@production] Re-add shell user ironholds/oliver keyes

https://gerrit.wikimedia.org/r/416993

Everything is ready; your user will be added to analytics-privatedata-users after it's approved at next Monday's ops meeting.

Vgutierrez changed the task status from Open to Stalled.Mar 15 2018, 9:23 AM
Vgutierrez added a project: Ops-Access-Reviews.

Change 416996 merged by RobH:
[operations/puppet@production] adding ironholds/oliver keyes to analytics-privatedata-users

https://gerrit.wikimedia.org/r/416996

RobH updated the task description. (Show Details)
RobH subscribed.

Access has been merged live.

I suspect I might still be missing an access somewhere; running a hive query produces:

org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby. Visit https://s.apache.org/sbnn-error
	at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:88)
	at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:1835)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1449)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getListingInt(FSNamesystem.java:5073)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getListing(FSNamesystem.java:5060)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getListing(NameNodeRpcServer.java:888)
	at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.getListing(AuthorizationProviderProxyClientProtocol.java:336)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getListing(ClientNamenodeProtocolServerSideTranslatorPB.java:630)
	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2216)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2212)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1796)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2210)

	at org.apache.hadoop.ipc.Client.call(Client.java:1472)
	at org.apache.hadoop.ipc.Client.call(Client.java:1409)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
	at com.sun.proxy.$Proxy16.getListing(Unknown Source)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getListing(ClientNamenodeProtocolTranslatorPB.java:573)
	at sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
	at com.sun.proxy.$Proxy17.getListing(Unknown Source)
	at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:2101)
	at org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.<init>(DistributedFileSystem.java:887)
	at org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.<init>(DistributedFileSystem.java:870)
	at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:815)
	at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:811)
	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
	at org.apache.hadoop.hdfs.DistributedFileSystem.listLocatedStatus(DistributedFileSystem.java:811)
	at org.apache.hadoop.fs.FileSystem.listLocatedStatus(FileSystem.java:1742)
	at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:302)
	at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:265)
	at org.apache.hadoop.hive.shims.Hadoop23Shims$1.listStatus(Hadoop23Shims.java:133)
	at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:217)
	at org.apache.hadoop.mapred.lib.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:75)
	at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileInputFormatShim.getSplits(HadoopShimsSecure.java:308)
	at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getCombineSplits(CombineHiveInputFormat.java:472)
	at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getSplits(CombineHiveInputFormat.java:573)
	at org.apache.hadoop.mapreduce.JobSubmitter.writeOldSplits(JobSubmitter.java:332)
	at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:324)
	at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:200)
	at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1307)
	at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1304)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1796)
	at org.apache.hadoop.mapreduce.Job.submit(Job.java:1304)
	at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:578)
	at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:573)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1796)
	at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:573)
	at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:564)
	at org.apache.hadoop.hive.ql.exec.mr.ExecDriver.execute(ExecDriver.java:418)
	at org.apache.hadoop.hive.ql.exec.mr.MapRedTask.execute(MapRedTask.java:142)
	at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:214)
	at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:100)
	at org.apache.hadoop.hive.ql.exec.TaskRunner.run(TaskRunner.java:80)
Job Submission failed with exception 'org.apache.hadoop.ipc.RemoteException(Operation category READ is not supported in state standby. Visit https://s.apache.org/sbnn-error
	at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:88)
	at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:1835)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1449)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getListingInt(FSNamesystem.java:5073)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getListing(FSNamesystem.java:5060)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getListing(NameNodeRpcServer.java:888)
	at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.getListing(AuthorizationProviderProxyClientProtocol.java:336)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getListing(ClientNamenodeProtocolServerSideTranslatorPB.java:630)
	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2216)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2212)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1796)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2210)
)'
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
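
My guess (unconfirmed): this StandbyException means the client reached the standby NameNode rather than the active one, e.g. around a failover, which would point at a transient HA issue rather than missing group membership. Assuming shell access on a client host, one way to check which NameNode is active is something like:

hdfs haadmin -getServiceState <namenode-id>

(with <namenode-id> taken from the dfs.ha.namenodes.* property in hdfs-site.xml); retrying once the active NameNode is reachable should clear it.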

@Ottomata any thoughts?

Wait, scratch that; made it work!

Nope, happening again. Query draft:

hive -e "
USE wmf;
SELECT
  dt AS timestamp,
  md5(concat_ws('//', client_ip, user_agent)) AS user_identifier,
  normalized_host.project AS project,
  page_id
FROM webrequest INNER JOIN (
  SELECT md5(concat_ws('//', alias1.client_ip, alias1.user_agent)) AS uuid,
  alias1.normalized_host.project AS project
  FROM wmf.webrequest alias1 INNER JOIN ironholds.fundraising_ids alias2
  ON alias1.normalized_host.project == alias2.wiki
  AND alias1.page_id == alias2.pageid
  WHERE
  alias1.is_pageview = TRUE
AND
  alias1.normalized_host.project_class == 'wikipedia'
AND
  alias1.agent_type == 'user'
AND
  alias1.normalized_host.project IN ('en', 'fr')
AND
  alias1.year = 2018
AND
  alias1.month = 01
AND
  alias1.day > 7
AND
  alias1.webrequest_source IN ('misc', 'text')
) alias3
ON md5(concat_ws('//', client_ip, user_agent)) = alias3.uuid
AND normalized_host.project = alias3.project
WHERE
  is_pageview = TRUE
AND
  normalized_host.project_class == 'wikipedia'
AND
  agent_type == 'user'
AND
  normalized_host.project IN ('en', 'fr')
AND
  year = 2018
AND
  month = 01
AND
  day > 7
AND
  webrequest_source IN ('misc', 'text');" > fundraising.tsv
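
(A caveat on this draft, which may be part of the problem: it scans wmf.webrequest twice and joins the two passes on computed md5() values. A possible restructuring, sketched here with a hypothetical staging table name and not something I have actually run, would be to materialize the small inner join first and then join the second pass against that:)

CREATE TABLE ironholds.fundraising_uuids AS
SELECT
  md5(concat_ws('//', wr.client_ip, wr.user_agent)) AS uuid,
  wr.normalized_host.project AS project
FROM wmf.webrequest wr
INNER JOIN ironholds.fundraising_ids ids
  ON (wr.normalized_host.project = ids.wiki AND wr.page_id = ids.pageid)
WHERE wr.is_pageview = TRUE
  AND wr.normalized_host.project_class = 'wikipedia'
  AND wr.agent_type = 'user'
  AND wr.normalized_host.project IN ('en', 'fr')
  AND wr.year = 2018 AND wr.month = 01 AND wr.day > 7
  AND wr.webrequest_source IN ('misc', 'text');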

Ping @DYNKM: this query will not run as written; please work with us to improve it. cc @JAllemandou. The best place is #wikimedia-analytics.

@Nuria sorry for the bother; I worked out an alternate/actually faster way of doing it anyhoo!