
Provide raw KaiOSAppFeedback data to Chelsea Riley for analysis
Closed, ResolvedPublic

Description

The Inuka team uses the KaiOSAppFeedback data stream to collect free-text feedback from users of the KaiOS app. This feedback has not yet been analyzed in detail to surface any key points, so @Rileych is planning a side project to do just that.

To do this, she will need a one-time dump of three fields (dt, version, feedback) from the data stream. dt and version are not sensitive, but since feedback contains free text input by users, we cannot rule out the possibility that it contains sensitive information. As a Wikimedia Foundation staff member, she has signed a non-disclosure agreement for sensitive data like this, but does not have access to the Analytics Cluster where the data is stored.

Open questions:

  1. What approval do I need to transfer this data to her?
  2. How can I transfer this data to her? The only method authorized by the data access guidelines is SSH, via the filesystem of an analytics client machine, but that would make it impossible to transfer data to anyone without first putting them through the whole process of setting up production access.

Event Timeline

  1. What approval do I need to transfer this data to her?
  2. How can I transfer this data to her? The only method authorized by the data access guidelines is SSH, via the filesystem of an analytics client machine, but that would make it impossible to transfer data to anyone without first putting them through the whole process of setting up production access.

Can someone from Security or SRE help me answer these questions? 😊

hi @nshahquinn-wmf! I think the best people to answer this question are the Analytics folks, specifically @Ottomata

@nshahquinn-wmf probably the simplest thing to do would be to get her analytics-privatedata-users ssh access. Then we don't have to think about the data privacy issues. Would that be possible?

probably the simplest thing to do would be to get her analytics-privatedata-users ssh access.

+1 to this approach from me (and ostensibly the Security-Team) so as to (hopefully) avoid any extraneous data transfers and repeated requests if this were to become more than a one-time ask. @Dsharpe and @JFishback_WMF might want to weigh in on this as well, but for an NDA'd Foundation employee this should be fine AFAIK, with all of the standard best practices assumed to be followed (e.g. don't post the data publicly anywhere, don't leave it unencrypted on a shared device, etc.)

@nshahquinn-wmf probably the simplest thing to do would be to get her analytics-privatedata-users ssh access. Then we don't have to think about the data privacy issues. Would that be possible?

Well, this is actually what I wanted to avoid. Getting and setting up private data access is a very complicated process, and at this point there's no reason for it to be more than a one-time thing (Chelsea's main role is as a technical program manager, and this is something that just happens to lie at an intersection of the team's curiosity and her side interests).

Moreover, it seems like a greater privacy risk. What's more of a risk: sending a single dataset which might not contain a single piece of sensitive information via some secure non-SSH method? Or giving another person permanent access to everything in the analytics cluster? (Obviously, if Chelsea were going to get heavily involved in analysis, it would be worth the risk of giving her permanent access. But that's not the case.)

As the production access guide says, "All access privileges require a clear, ongoing need for the access. If you have a one-time need for data, request the data from the Analytics team instead." That's what Chelsea is trying to do here.

As the production access guide says, "All access privileges require a clear, ongoing need for the access. If you have a one-time need for data, request the data from the Analytics team instead." That's what Chelsea is trying to do here.

I'd be shocked if this was only a one-time ask, as in my personal experience this is almost never the case with such requests. But if this truly will be, with some extremely high level of confidence, then permanent access wouldn't be necessary. Such access could be time-limited with analytics removing the access after some period of days or weeks. Or the data could be provided by a privileged user if transmitted in a secure way and then properly disposed, with auditable confirmation of said actions.

Moreover, it seems like a greater privacy risk. What's more of a risk: sending a single dataset which might not contain a single piece of sensitive information via some secure non-SSH method? Or giving another person permanent access to everything in the analytics cluster? (Obviously, if Chelsea were going to get heavily involved in analysis, it would be worth the risk of giving her permanent access. But that's not the case.)

Again, the analytics-privatedata access wouldn't have to be permanent. Furthermore, giving access to the data via the standard approach of adding an individual to analytics-privatedata would theoretically provide more focused ownership of risk and increased auditability. Increasing the transfers of data and involving more humans in the process would tend to be the riskier approach both in terms of potential error and mishandling of data. I would also hope that we would trust NDA'd Foundation employees with most private data and to only access the data they need - if this isn't an assumption we can generally make, we likely have far larger problems as an organization.

@nshahquinn-wmf your question raises another - where are you looking to send the data to? If you're sending sensitive data to an insecure laptop, then securing it in transmission is only one facet of the issue.

How many records are we talking about? Presumably, if the intent is that Chelsea will manually review them for insights, it won't be that many. Couldn't someone with ssh access quickly review the data for sensitive information? If it's clean, then we don't have anything to worry about, yes? Since this is just a one-time request, it seems like that course of action would save time over developing a new secure method for transmission/review of the data.
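If someone with access did want a quick spot-check before any transfer, even a crude scan can flag the most obvious kinds of sensitive text. A sketch (the filename and patterns are illustrative assumptions, not from this task; regexes like these only catch email-shaped and phone-shaped strings, so they're a triage aid, not a guarantee of cleanliness):

```shell
# Count lines that look like they contain an email address.
# "feedback.tsv" is a hypothetical export filename.
grep -E -c '[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}' feedback.tsv

# Count lines containing long phone-number-like digit runs.
grep -E -c '[0-9][0-9 ().-]{7,}[0-9]' feedback.tsv
```

Note that `grep -c` exits non-zero when the count is 0, so in a script you'd handle that case explicitly.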

Getting and setting up private data access is a very complicated process

While there is always room for improvement I don't think it's _that_ complicated, given how regularly these happen and us having a rotating clinic duty every week specifically to handle them.

Yes, she would have to create an SSH key, but the rest of it is basically just saying "I need this .. for .." and a manager saying "approved", since this ticket already exists and has the access request tag on it. We can also add an expiry_date that will automatically remind us when it's time to renew or remove it again.

I think it might have been done in less time than the time spent on the discussion to be honest.

Thank you for giving detailed feedback so quickly, everyone! I was expecting that I would have to keep nudging people for answers, but clearly I was wrong 😊

Quick summary: in order to spare Chelsea the big headache of using the command line and to follow the principle of least privilege, I still just want to send her the specific file she needs. But there's no approved method (like, I don't know, Google Drive or sending an encrypted container and providing the password via video call). This kind of situation happens fairly frequently for my team, so it's worth us taking the time to figure out a reasonable process.

Getting and setting up private data access is a very complicated process

While there is always room for improvement I don't think it's _that_ complicated, given how regularly these happen and us having a rotating clinic duty every week specifically to handle them.

Yes, I think you folks do a good job of handling the process! When I said it's very complicated, I was actually talking mostly about the setup the user has to do: generating a key pair, setting up an SSH config file, using the command line to navigate the server and retrieve files, and so on. These can be very difficult things for non-technical folks, so it seems very unreasonable for this to be the only allowed way of transferring sensitive data.

Moreover, this is only the difficulty of getting access to one of the analytics client machines to access a data file that someone like me has already exported. If we're talking about Chelsea querying for the data on her own, the difficulty doubles at least because now she has to contend with writing SQL and navigating our profusion of data access technologies.

I think it might have been done in less time than the time spent on the discussion to be honest.

I definitely sympathize since this has already become a big discussion! But this situation (wanting to transfer a single sensitive data file to folks who have NDAs but aren't technical) comes up fairly frequently for Product Analytics and we really should have a proper process for it.

As the production access guide says, "All access privileges require a clear, ongoing need for the access. If you have a one-time need for data, request the data from the Analytics team instead." That's what Chelsea is trying to do here.

I'd be shocked if this was only a one-time ask, as in my personal experience this is almost never the case with such requests.

I have a lot of personal experience with this too! Helping folks throughout the org access data is one of my main responsibilities, and I definitely get requests that are truly one-time things or rare enough (like once a year) that it's absolutely not worth the requestor doing the work to get their own access. This is highly likely to be a one-time project.

And even if it was going to repeat every six months, I'd still want to just pull the files and send them to Chelsea. Me spending ten minutes every 6 months is better than her spending 5 hours minimum figuring out all the complexities of SSH access and the analytics infrastructure!

But if this truly will be, with some extremely high level of confidence, then permanent access wouldn't be necessary. Such access could be time-limited with analytics removing the access after some period of days or weeks. Or the data could be provided by a privileged user if transmitted in a secure way and then properly disposed, with auditable confirmation of said actions.

That second option (I, a privileged user, providing it to Chelsea) is what I want! But as far as I can tell, there are no officially approved secure ways. On a couple of occasions, I have provided user email addresses (confidential data) to other teams at the Foundation for use in surveys. These specific uses of email addresses were approved by Legal, but there was no discussion of the transmission method and I used Google Sheets without thinking much about it.

If Google Sheets is not appropriate, okay, but there has to be another method. It's not reasonable to expect all these folks to all set up SSH access and learn to use the command line. Maybe sending an encrypted container and giving the password over a video call?
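That "encrypted container with the password over a video call" idea can be sketched with standard GnuPG symmetric encryption (assuming GnuPG is installed; the filenames here are hypothetical). The passphrase is shared out-of-band, never alongside the file:

```shell
# Encrypt the export with a passphrase (gpg prompts for it interactively;
# AES256 is the symmetric cipher). "feedback.csv" is a hypothetical filename.
gpg --symmetric --cipher-algo AES256 --output feedback.csv.gpg feedback.csv

# The recipient decrypts with the same passphrase, shared out-of-band
# (e.g. read aloud on a video call, never in the same email or chat):
gpg --decrypt --output feedback.csv feedback.csv.gpg
```

The recipient only needs GnuPG installed and a single command, with no keypair to manage, which may matter for non-technical colleagues.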

Moreover, it seems like a greater privacy risk. What's more of a risk: sending a single dataset which might not contain a single piece of sensitive information via some secure non-SSH method? Or giving another person permanent access to everything in the analytics cluster? (Obviously, if Chelsea were going to get heavily involved in analysis, it would be worth the risk of giving her permanent access. But that's not the case.)

Again, the analytics-privatedata access wouldn't have to be permanent. Furthermore, giving access to the data via the standard approach of adding an individual to analytics-privatedata would theoretically provide more focused ownership of risk and increased auditability. Increasing the transfers of data and involving more humans in the process would tend to be the riskier approach both in terms of potential error and mishandling of data. I would also hope that we would trust NDA'd Foundation employees with most private data and to only access the data they need - if this isn't an assumption we can generally make, we likely have far larger problems as an organization.

All things being equal, yes, Chelsea accessing the data on her own is better since it's more auditable. But in this case, all things are not equal. One option is giving Chelsea full access to a massive collection of sensitive data (temporarily, sure); the other is sending her a single file.

This doesn't have anything to do with not trusting Chelsea; she's extremely trustworthy and is not going to misuse her access. It's about following the principle of least privilege and reducing the number of hackable endpoints. There's sensitive data I don't have access to (search logs, donor data, content of deleted wiki pages, etc.), despite being a data scientist here for 5 years and very trustworthy (in my biased opinion 😂). That seems completely sensible to me, since I have never yet needed it for my work.

@nshahquinn-wmf your question raises another - where are you looking to send the data to? If you're sending sensitive data to an insecure laptop, then securing it in transmission is only one facet of the issue.

Agreed. I'll make sure that Chelsea has full-disk encryption turned on and understands not to upload it to a cloud service and to delete it as soon as she's done with it.

How many records are we talking about? Presumably, if the intent is that Chelsea will manually review them for insights, it won't be that many. Couldn't someone with ssh access quickly review the data for sensitive information? If it's clean, then we don't have anything to worry about, yes? Since this is just a one-time request, it seems like that course of action would save time over developing a new secure method for transmission/review of the data.

She isn't actually planning to review them manually; she wants to do some quantitative analysis on her laptop, using SPSS or R or something like that (I forget which). There are 34,000 records, so it's too many to review manually.

I would suggest using GPG to send an encrypted file as an email attachment. But then she still has to create a keypair and be comfortable using that.

There is https://flowcrypt.com/ to make that process a lot easier though. Maybe that's an option.
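For context, the keypair workflow that tools like FlowCrypt wrap is plain OpenPGP; a minimal sketch of it with the stock gpg tool (names, addresses, and filenames here are made up for illustration) would be:

```shell
# One-time setup on the recipient's machine: generate a keypair and
# export only the public half (the private key never leaves the machine).
gpg --quick-generate-key "Chelsea Riley <chelsea@example.org>"
gpg --export --armor chelsea@example.org > chelsea.pub.asc

# On the sender's machine: import the recipient's public key and encrypt
# to it. Only the matching private key can decrypt the attachment.
gpg --import chelsea.pub.asc
gpg --encrypt --recipient chelsea@example.org --output feedback.csv.gpg feedback.csv
```

Compared with symmetric encryption, this avoids sharing any passphrase at all, at the cost of the recipient managing a keypair, which is exactly the usability burden under discussion.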

also T269519 might be related

Probably not helpful, but users can now access Superset to query Presto and Druid via SQL without SSH access. They just need to be in the wmf or nda LDAP groups.

On the topic of ssh access, there shouldn't be a "big headache of using the command line" for getting access to the cluster. I don't think anyone here with "Technical" in their role would have a problem doing that, but it wouldn't even be necessary: there shouldn't be a need to use a command line at all. There are graphical tools for creating SSH keys and transferring files via ssh. And if the file to copy were on the bastion host, that would be even easier, as no jumping would be needed.
If getting access is such a big issue (and for multiple people!), that seems a sign that the documentation is in urgent need of improvement. It would be a matter of following a number of steps with screenshots: fill this value here, then click that button, copy the following magical settings into this file.

Most likely, different instructions will be needed for different OS (due to having access to different clients), but there would be just 2-3 of them.

On a couple of occasions, I have provided user email addresses (confidential data) to other teams at the Foundation for use in surveys. These specific uses of email addresses were approved by Legal, but there was no discussion of the transmission method and I used Google Sheets without thinking much about it.

Discussing this further with @JFishback_WMF, it sounds like what we're missing is an actual WMF-Legal-approved policy (or, at the very least, a set of guidelines) to provide consistent, authoritative guidance for these situations. I'm not aware of anything like that currently existing - there are some vaguely-related usage and employment policies on officewiki, but they don't really apply here IMO. Still, this likely isn't something that can be turned around within a day or two, so a specific process may need to be proposed for this task, perhaps by discussing with Chelsea what she's most comfortable with re: various secure transport mechanisms (and maybe ITS as well), and then having WMF-Legal approve or disapprove of that process while the more definitive policy is crafted.

I have a lot of personal experience with this too! Helping folks throughout the org access data is one of my main responsibilities, and I definitely get requests that are truly one-time things or rare enough (like once a year) that it's absolutely not worth the requestor doing the work to get their own access. This is highly likely to be a one-time project.
...
And even if it was going to repeat every six months, I'd still want to just pull the files and send them to Chelsea. Me spending ten minutes every 6 months is better than her spending 5 hours minimum figuring out all the complexities of SSH access and the analytics infrastructure!

Though it wouldn't be a short- or even medium-term solution, perhaps our current processes should be reviewed with an eye towards facilitating quicker, auditable access to various data sets so as to limit or reduce entirely these kinds of transactions.

This doesn't have anything to do with not trusting Chelsea; she's extremely trustworthy and is not going to misuse her access. It's about following the principle of least privilege and reducing the number of hackable endpoints. There's sensitive data I don't have access to (search logs, donor data, content of deleted wiki pages, etc.), despite being a data scientist here for 5 years and very trustworthy (in my biased opinion 😂). That seems completely sensible to me, since I have never yet needed it for my work.

Correct, but the sensitive data you mention above isn't relevant to your job while the data within the analytics cluster is ostensibly relevant to Chelsea's job, but is not accessible in a granular enough way so as to satisfy the principle of least privilege. Unfortunately, IME, we have to make concessions to reality like this all of the time. And while such actions do occasionally introduce risk of one kind, they mitigate risk of another, which reduces the decision down to what is essentially a judgment call.

jcrespo added a subscriber: jcrespo.

Based on the previous comment by @sbassett, it seems the right direction is to contact Legal & IT support for permission/best practices/support on how to do that adequately and securely, if the tools needed for analysis are not available on the cluster. Given that for this specific case this is not a request to share passwords, IPs, or other known very sensitive pieces of data, and the data was (I assume) expected to be handled this way, I think it is a reasonable request to make.

I am removing the SRE-Access-Requests tag (keeping the SRE one) and making myself available (as a member of the Data Persistence team, with experience handling datasets for backups) to help transfer data securely, assuming someone has prepared the above steps.

Further refinements on policies and changes/discussion to improve current procedures should and can occur on separate tickets.

I spoke with a member of WMF-Legal about this issue and the tl;dr is that we do not presently have a sufficiently low risk alternative to ssh for transferring files outside of the analytics cluster. At least not one that isn't equally or more difficult than configuring ssh. As @sbassett postulated above, WMF-Legal would be amenable to an alternative if one is proposed/developed/installed/configured/purchased/whatever, but I think that's not a quick ask. Since it sounds like this is a long-term, ongoing need, @nshahquinn-wmf I would reach out to ITS and/or SRE about a new permanent solution for getting data where you need it to go.

Untagging Security-Team for now, but please feel free to add back if there is something else needed.

Thanks, @sbassett, @JFishback_WMF, and @jcrespo, for the further input! Yes, it sounds like I would need to pursue this primarily with Legal and ITS. One question: if Legal approves a solution, would your teams (Security and SRE) still need to approve it as well, or would the legal approval be sufficient?

Assuming Legal approves and IT helps you on the client side, we (SREs) will be able to help the person with any transfer needed. However, you will want to coordinate with the dataset owner- if it is analytics data you will want to coordinate with them, or if you need database data, with the DBAs, for purely practical reasons.

On the topic of ssh access, there shouldn't be a "big headache of using the command line" for getting access to the cluster. I don't think anyone here with "Technical" in their role would have a problem doing that, but it wouldn't even be necessary: there shouldn't be a need to use a command line at all. There are graphical tools for creating SSH keys and transferring files via ssh. And if the file to copy were on the bastion host, that would be even easier, as no jumping would be needed.
If getting access is such a big issue (and for multiple people!), that seems a sign that the documentation is in urgent need of improvement. It would be a matter of following a number of steps with screenshots: fill this value here, then click that button, copy the following magical settings into this file.

A graphical client for SSH is a pretty good idea, although we would then probably need to get that client reviewed and approved, so it would still take some time.

But I strongly disagree with the idea that anyone "technical" in their role should be expected to deal with the complexity.

First, not all technical people are software engineers. Technical program managers like Chelsea are experts in how to make technical teams and the software development process run smoothly and efficiently. It's not expected that they are experts in software development itself, and they can do their jobs excellently even if they've never touched the command line in their life.

Second, people are busy and don't have the time to be familiar with every software technology or system. I've helped some very competent software engineers get set up to access our production data, and it still takes them some hours to look up the access process, follow it, set up their access, and learn about the wide variety of data access systems we have. If they're only going to need data once in a year, it really isn't worth their expensive time to do all that.

Assuming Legal approves and IT helps you on the client side, we (SREs) will be able to help the person with any transfer needed. However, you will want to coordinate with the dataset owner- if it is analytics data you will want to coordinate with them, or if you need database data, with the DBAs, for purely practical reasons.

That makes sense, but my question was something a bit different: if Legal approves a solution that doesn't require technical help (like me just downloading the data and following a particular procedure to send it securely), do y'all still need to approve it? Or are you comfortable with their determination that the solution is technically secure?

Analytics tools are improving every day. The data engineering team is doing a great job of offering web-based interfaces and APIs for querying datasets, but they are very few people, so improvements and new features can only come so fast. More engineers are needed! Privacy and transparency are things we all want at Wikimedia, but sometimes it is really hard to have both at the same time.

do y'all still need to approve it?

We ask you to loop us in. For production access it is always better to ask first, as we may be able to provide easier ways (e.g. if you will end up downloading using your ssh account, we may be able to put the file in a place where it doesn't affect other services, or suggest other improvements, as well as check that it is later deleted from the temporary location).

Makes sense—thanks, @jcrespo! I'm removing the SRE and Security tags since the questions for those folks have been answered.

(Maybe split this subthread into a new task "Connecting to prod should be easy?")

A graphical client for SSH is a pretty good idea, although we would then probably need to get that client reviewed and approved, so it would still take some time.

Not that it is a bad idea to review the clients, but if there is a whitelist of allowed ssh clients... it doesn't seem to be listed anywhere! No mention on https://wikitech.wikimedia.org/wiki/Production_access, not even on L6

But I strongly disagree with the idea with anyone "technical" in their title should be expected to deal with the complexity.

First, not all technical people are software engineers. Technical program managers like Chelsea are experts in how to make technical teams and the software development process run smoothly and efficiently. It's not expected that they are experts in software development itself, and they can do their jobs excellently even if they've never touched the command line in their life.

No, no, no. I tried precisely _not_ to mean that someone with "technical" in their role would be a "technician" - just that someone in such a role wouldn't be computer-illiterate, but a slightly advanced user. We don't need Aunt Tillie to connect to prod (although she should be able to edit the wikis!), nor does a computer user whose specific domain is graphical editing need to. But even doing it the "hardcore way" with a console client, it requires little more than opening a terminal. Doing your taxes (with a program wizard to guide you) is many orders of magnitude harder. If it takes so much, we are doing something wrong.

Second, people are busy and don't have the time to be familiar with every software technology or system. I've helped some very competent software engineers get set up to access our production data, and it still takes them some hours to look up the access process, follow it, set up their access, and learn about the wide variety of data access systems we have. If they're only going to need data once in a year, it really isn't worth their expensive time to do all that.

Hours? That still seems like too much. Maybe I have a cognitive blindness and don't realize how hard it can be, or I'm missing some critical lengthy step. _I really want to understand the problem here._

As I see it, a _simplified_ process for "downloading a file from a server" would be:

  • Open a terminal
  • Run ssh-keygen -f $HOME/.ssh/prod.key -t ed25519
  • Run cat $HOME/.ssh/prod.key.pub and copy its contents
  • Open a Task requesting access and include the line from the previous step. Sign the appropriate legalpad and request your manager to approve it.

Approval may take several days.

Then, once you have access, you could simply connect to some-server.eqiad.wmnet by typing in a terminal:

  • ssh some-server.eqiad.wmnet

Or for copying an already-prepared file:

  • scp some-server.eqiad.wmnet:file-built-by-neil.tgz .

Noting that, in order to avoid being asked multiple times for the key password, the first time in your session you might need to run this (in some environments it isn't needed):

  • ssh-add $HOME/.ssh/prod.key
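Much of the remaining per-connection friction (usernames, key paths, the bastion jump mentioned earlier) can also be captured once in a config file, after which the ssh and scp commands above need only the hostname. A sketch of such a `$HOME/.ssh/config`, with illustrative hostnames and username (the real Wikimedia bastion and client hostnames may differ):

```text
# Hypothetical ~/.ssh/config; hostnames and the username are examples only.
Host bastion.example.wmnet
    User chelsea
    IdentityFile ~/.ssh/prod.key

Host *.eqiad.wmnet
    User chelsea
    IdentityFile ~/.ssh/prod.key
    # Tunnel through the bastion automatically (OpenSSH 7.3+):
    ProxyJump bastion.example.wmnet
```

With this in place, `scp some-server.eqiad.wmnet:file-built-by-neil.tgz .` handles the jump transparently, which is the kind of "magical settings" file a screenshot guide could walk users through.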

I can understand that the approval process may be considered cumbersome, or that reviewing the documentation and the legal text takes some time. But I don't see it as _technically_ hard. It may still take them too long to e.g. figure out which system holds the piece of data they need, or how to extract it. But finding the data is a slightly different place to get stuck (we may still need a summary of the documentation, some cheatsheets, or perhaps a smaller legal agreement, but we need to know where the problem lies).

Note: While the above should work out of the box on Linux and Mac OS X, and nowadays (after some setup) even on Windows, it will depend in the end on the exact system used. If using a different client, small variations will be needed.

Note 2: This assumes a basic user and is concerned just with setting up access to prod. More advanced users may want more complex config rules, a YubiKey, etc. Those things are out of scope for the syllabus of "Connecting to prod 101".

kzimmerman added a subscriber: kzimmerman.

@Rileych reassigning to you since this depends on your getting private data access; once you've gotten that access, please reassign to Neil so he can pass the data to you!

@Rileych now has access to a snapshot of this data through a different, Legal-approved route.

@nshahquinn-wmf How did you end up resolving this? We should document it somewhere, so we can point to it for future requests like this. I recall you mentioning that this issue comes up not infrequently.