Request access to Analytics cluster for Urbanecm
Closed, Declined · Public

Description

Rationale

I'm looking into T231145, and to get more ideas about what the cause might be, I need to run a SQL query. I can do it via Toolforge, but doing so would more or less force me to put some information about the security-protected task on Toolforge, which is an environment that's fundamentally insecure given how Cloud VPS works.

Since I'm a deployer, I technically can run SQL queries against the production databases, but since this is a long-running query, it would slow down the site. Given that I can already run SQL queries against unfiltered databases, this is not a big change permission-wise, and it would be useful in other cases as well.

Also, this is not the only case where I thought "wish I was able to access the analytics cluster". Namely, I work as a community ambassador for the Growth team, and it's sometimes useful for me to know things stored in Hadoop (EventLogging data, to be precise), so that I don't have to bother Growth team members who have access, given I'm skilled enough to run the queries myself. This is a use case I have as a WMF part-time contractor, not in my volunteer capacity, so I'm mentioning it here for completeness and to indicate this likely isn't "onetime stuff".

Approval process

I'm not sure what the approval process for analytics access for volunteers is, but https://wikitech.wikimedia.org/wiki/Production_shell_access says "[Get] at least one comment of support from a Wikimedia Foundation employee, explaining why it is a good idea to accept your request. The comment of support should be from [...] the employee you will be collaborating with if you're not [an employee]". I don't "have" an employee I plan to collaborate with right now, but I started working on the T231145 query after a conversation with @Niharika, and I have collaborated with other staff members as well, so any one of them can probably give the required comment.

The same page also says "The project lead where your access will be granted". Not sure who that would be, so I'm not CCing anyone.

Groups required

Since my use case includes querying the databases and EventLogging data (which is considered private, IIRC), I'll probably need researchers and analytics-privatedata-users. Not sure though, so feel free to correct me.

Notes

I already have production shell access (I'm a deployer), and as such, I already signed an NDA (see T192830 and nda LDAP groups members to verify). Also, for the same reason, my SSH key is already in puppet, and I'm not sure if I need to recite it here.

Event Timeline

Restricted Application added a subscriber: Aklapper.

Hello -- I'm not working with @Urbanecm on this specific project, but he is a paid part-time contractor working with @Trizek-WMF and me on the Growth team. He is under NDA with our team, and has been responsibly working with us for over a year.

I think traditionally what was done for these sorts of things was to just run the query against some codfw replica from production, but this is fine too. I think you just need the researcher group.

I can vouch for @Urbanecm helping out on {T231145}. It would be great if this request came through quickly!

I think you just need the researcher group.

For databases, yes, but the rationale includes EventLogging data, which isn't included in the researcher group. Or do I misunderstand the groups' permissions? (In that case, https://wikitech.wikimedia.org/wiki/Analytics/Data_access should be clarified.)

Certainly historically I worked on EventLogging data with researcher access. It might have changed?

Well, https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging says "Only events in certain whitelisted EventLogging schemas are ingested into MySQL/MariaDB". "MariaDB" is ambiguous, but I believe it's the same MariaDB that researchers have access to. I don't see https://meta.m.wikimedia.org/wiki/Schema_talk:EditorJourney in https://github.com/wikimedia/puppet/blob/production/modules/eventlogging/files/plugins.py (I may have overlooked it), and since https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive says "In order to access Hive, you need an account with production shell access in either the analytics-privatedata-users or the analytics-users user group.", I believe researcher alone doesn't help. I'm not sure if analytics-users would be enough, but given that EditorJourney is certainly not otherwise public, it probably needs analytics-privatedata-users. As said above, this is merely a guess based on the docs, so I can be totally wrong.

Hello -- I'm not working with @Urbanecm on this specific project, but he is a paid part-time contractor working with @Trizek-WMF and me on the Growth team. He is under NDA with our team, and has been responsibly working with us for over a year.

What Marshall says.

jbond triaged this task as Medium priority. · Sep 9 2019, 9:17 AM

@Nuria are you able to approve @Urbanecm's access to researchers and analytics-privatedata-users?

Change 537524 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] admin: add urbanecm to researchers, analytics-privatedata-users

https://gerrit.wikimedia.org/r/537524

Uploaded a patch for this, but it needs approval from @Nuria before moving forward.

@MMiller_WMF and @Urbanecm

I want to clarify that we do not grant access to private data to volunteers (we have limits on how many users we can support, and right now our policy requires a formal research collaboration with the research team to get access). The comments at "https://wikitech.wikimedia.org/wiki/Production_shell_access" refer to shell access rather than data access. This might be unfortunate, but we simply cannot support as many requests as we get in terms of resources and data safety.

So we understand, does @Urbanecm have a formal contract with WMF in place to work on https://phabricator.wikimedia.org/T231145 ?

Also, the ticket referenced has already been closed, right?

@MMiller_WMF and @Urbanecm

I want to clarify that we do not grant access to private data to volunteers (we have limits on how many users we can support, and right now our policy requires a formal research collaboration with the research team to get access). The comments at "https://wikitech.wikimedia.org/wiki/Production_shell_access" refer to shell access rather than data access. This might be unfortunate, but we simply cannot support as many requests as we get in terms of resources and data safety.

Is this mentioned somewhere in the docs? Also, I don't understand what "data safety" means in this context. Does that mean you believe there is a higher risk of leaks from volunteers than from staff?

Also, is this comment related only to analytics-privatedata-users, or also to researchers?

So we understand, does @Urbanecm have a formal contract with WMF in place to work on https://phabricator.wikimedia.org/T231145 ?

No - that would be in my volunteer capacity.

Also, the ticket referenced has already been closed, right?

Yes.

Sorry this is disappointing, but given our very limited resources we really cannot support ad-hoc data access for community members. The best way we have found to have a policy around granting access has to do with employment or active collaborations with the research team. I have added a note to this effect to the wikitech docs. Again, my apologies; in this case the ticket that prompted this request seems to have been resolved.

Sorry this is disappointing, but given our very limited resources we really cannot support ad-hoc data access for community members. The best way we have found to have a policy around granting access has to do with employment or active collaborations with the research team. I have added a note to this effect to the wikitech docs.

I'm sorry, I should've said that explicitly. I meant the analytics-specific page, https://wikitech.wikimedia.org/wiki/Analytics/Data_access. That page says "Otherwise, if you're a volunteer, you'll need to find an employee who will sponsor you through the volunteer NDA process. If you're a researcher, it's possible to be sponsored through a formal collaboration with the Wikimedia Foundation's Research team.", which I interpret as saying that a formal collaboration is just a possibility, not a requirement.

I also think the update at https://wikitech.wikimedia.org/wiki/Production_shell_access should be clarified: most, if not all, shell access levels also grant some access to private data (as in, not otherwise public), so it should be made clear that this is relevant to analytics data access, IMO.

Again, my apologies; in this case the ticket that prompted this request seems to have been resolved.

The ticket where I needed to query an unredacted copy of the databases is indeed resolved, but it was only the most recent case, as of writing, that prompted me to create this request. In the few weeks this request was in the queue, a few other cases appeared. Not all of them have a Phabricator ticket, so sadly, I can't link anything. Obviously, I can work around that by querying live production databases, but that works only for really short queries; otherwise, it'll slow down the site. I thought it would be useful to run the queries elsewhere, so that the production DB shell is used only when absolutely necessary. Since that solution doesn't seem to be liked by your team, could you please suggest an alternative solution?

@Urbanecm I have corrected the information about data access, sorry about that. From what I can tell, the query that led to you requesting access can be run against the labs databases; very little of the info on the analytics replicas is not also present in the labs databases.

Since that solution doesn't seem to be liked by your team, could you please suggest an alternative solution?

Please team up with someone that already has data access. It is unfortunate but we really do not have the ability to support data access for as many requests as we get.

@Urbanecm I have corrected the information about data access, sorry about that. From what I can tell, the query that led to you requesting access can be run against the labs databases; very little of the info on the analytics replicas is not also present in the labs databases.

In that particular case, yes, if we accept the practice of putting confidential data at labs - which probably isn't the best thing to do. Regarding data at labs, I frequently query data of private abusefilters, which is redacted in labs.

Since that solution doesn't seem to be liked by your team, could you please suggest an alternative solution?

Please team up with someone that already has data access. It is unfortunate but we really do not have the ability to support data access for as many requests as we get.

If there's really no way, closing as declined. Just out of curiosity, may I ask what "support" means?

Change 537524 abandoned by Herron:
admin: add urbanecm to researchers, analytics-privatedata-users

Reason:
related task closed as declined

https://gerrit.wikimedia.org/r/537524

Sorry this is disappointing, but given our very limited resources we really cannot support ad-hoc data access for community members. The best way we have found to have a policy around granting access has to do with employment or active collaborations with the research team.

The idea that access for a formal collaboration with an external non-wikimedia group is an acceptable use of resources but access for wikimedians is somehow a significantly bigger problem deserves some scrutiny imo.

The idea that access for a formal collaboration with an external non-wikimedia group is an acceptable use of resources but access for wikimedians is somehow a significantly bigger problem deserves some scrutiny imo.

(my last post in this regard)
Again, our resources are limited, and while this policy might not be perfect, it is certainly clear. The analytics team does not run a platform intended for wide community access; to do so we would need many times the resources we have in terms of people and infrastructure. Sorry this answer is disappointing, but analytics serves the community by making publicly accessible as much data as we can about the movement (and this is a goal towards which we work every day). We do not provide a publicly accessible computation platform. We simply cannot do both.

The analytics team does not run a platform intended for wide community access; to do so we would need many times the resources we have in terms of people and infrastructure. Sorry this answer is disappointing, but analytics serves the community by making publicly accessible as much data as we can about the movement (and this is a goal towards which we work every day). We do not provide a publicly accessible computation platform.

We're not talking about making data publicly accessible or talking about a publicly accessible platform. @Urbanecm is a technical wikimedia community member with an NDA, shell access (deployment access, so much more significant than this), and is being denied access to analytics machines because he's not part of a formal research collaboration? It seems to me that had a foundation staff/contractor member requested access for this purpose it would have been granted, and worse than that - had someone not directly involved in wikimedia, but working in a research collaboration, requested access, that would've been accepted too?

I don't believe any part of production should be off-limits to volunteers in principle. Also, while analytics may be able to block people being added to analytics groups, volunteers could still be added to the ops group (at least one volunteer is in there, others have been in the past, and I hope more will be in future).

The idea that access for a formal collaboration with an external non-wikimedia group is an acceptable use of resources but access for wikimedians is somehow a significantly bigger problem deserves some scrutiny imo.

I don't believe any part of production should be off-limits to volunteers in principle. Also, while analytics may be able to block people being added to analytics groups, volunteers could still be added to the ops group (at least one volunteer is in there, others have been in the past, and I hope more will be in future).

Going to add my 2c from my point of view as an Ops/support person in the Analytics team. We are not, by any means, hiding or shielding any part of production from volunteers based on their status, since we do value their work (as does all of the WMF) and we would love to support all the communities as well as possible. There are resource constraints though, since granting access to Hadoop and all its datasets requires supporting each user in the following:

  1. Introduction about datasets and where to find them
  2. Introduction about how to query the Hadoop cluster efficiently and avoid resource exhaustion (this happens often with tools like Spark or Hive, since it is difficult to apply proper per-user limits to them).
  3. Support data exploration or reporting when needed (special queries, etc..)
  4. Etc..

The above is not a trivial amount of work, and we have a limited number of people working daily to provide this level of support. You may think that not every user will need it, but in our experience most people do. We are extremely happy to help, but we do try to review every access request in depth to see whether it is really needed (independently of the user's working status with the WMF), because we know that we'll have to provide good support. For example, if you check Nuria's responses to access requests, you'll notice that even WMF staff are asked to provide a good motivation to get access to the cluster. The research team is a special case: we know in advance (before the start of the fiscal year) the expected number of people who will require access to data (researchers within the WMF and outside of it), since the research team reviewed and accepted them following a process (even the Research team probably needs to say no to a lot of collaboration requests due to resource constraints, I imagine).

Last but not least: being in the ops group does not entitle you to access all the other groups and systems without regard for the process to follow and/or its limitations. Even if an ops user can technically log in and add herself/himself to any group, there is still a process to follow for systems like Hadoop, for the reasons I wrote above. Having a lot of permissions on an infrastructure requires a good level of judgment about when and where to use them. I also hope that more people from the community will be able to join the ops group and work with the WMF.

@Urbanecm I hope that this task didn't communicate the wrong message to the community; we may be able to offer more support in the future. We are working on providing a good balance to our users and we are still learning the best way to do it.