
Request for Private repos to be enabled
Closed, Resolved · Public · 5 Estimated Story Points

Description

Hello! I'm trying to use GitLab more often (instead of GitHub) and as I was creating a new project I noticed that "Other visibility settings have been disabled by the administrator."

Use case: As a data scientist who works with confidential data and performs analyses that are meant for internal use & eyes only, I would like to be able to upload my version controlled work to Wikimedia's GitLab instance without it being publicly accessible.

https://gitlab.wikimedia.org/help/public_access/public_access says this about Private repos:

Private projects can only be cloned and viewed by project members (except for guests).
They appear in the public access directory (/public) for project members only.

This sounds great for my use case as my current options are:

  • For projects on my local machine and projects on our analytics cluster (stat100X hosts): occasionally copying to HDFS as a way to back up
  • For projects on my local machine: using private repositories on GitHub with my individual (non-organization) account

Event Timeline

Just chiming in to +1 this or at least open the discussion. A few miscellaneous thoughts:

  • Many of my personal notebooks that I should keep private don't actually contain highly sensitive data – e.g., as an extreme example, I'm not printing out editor IP addresses as part of any analyses. They generally take the form of aggregate analyses of, for example, top external referrer domains to Wikipedia – technically private data, but data that would almost certainly pass a privacy review if the raw data needed to be released.
  • As a workaround, when I want to share the code, I either have to share the actual location of the notebook on the stat machines or make a copy that I purge of outputs and upload to Github etc.
  • My understanding is that we are hosting the GitLab instance ourselves, which hopefully makes it pretty secure, though I'm not sure whether the only folks who could ever see a private repo are guaranteed to be NDA'd?
  • Perhaps if this is implemented, there is some way to ensure that folks don't accidentally switch a private repo to public (without destroying any history that might still contain sensitive info).
brennen triaged this task as Medium priority.Mar 30 2022, 10:11 PM
brennen set the point value for this task to 5.
brennen subscribed.

Thanks for writing this up.

Just as a general disclaimer for future readers of this comment, no PII, production or production-adjacent credentials, or secrets along the lines of keys or passwords should be stored in repos on our GitLab instance.

Anyway, there's understandable interest here, so this is a discussion we should have. At the very least we should document the reasons for the current policy. We talked about this during the consultation about moving to GitLab. It's mostly:

  1. We don't offer secret projects on Gerrit.
  2. Not offering them reduces risk and saves us answering a lot of questions.
    • Security risk if someone stores secrets
    • Need to police private projects for non-Wikimedia usage, illegal content, or other forms of abuse
    • Potential harm of encouraging closed-source contributions from organizations or other groups

We currently have an exception for some Security workflows, and if I remember right have said that we'd evaluate exceptions on a case-by-case basis.

Some background:

Thank you @Isaac for sharing your perspective and workflows. Thank you @brennen for sharing background info, current policy, and being open to discussion.

Is it possible to enable this for project groups (e.g. repos/product-analytics) where the only members of those groups are Foundation staff or NDA'd? This would rely on teams to self-police, and we already do that.

In the case of Research, I'm not sure if they ever collaborate with people who don't get NDA'd (perhaps @Isaac can confirm), but if that's the case, I can imagine a repos/research-nda project group where the only members are Foundation staff and NDA'd collaborators.

Perhaps if this is implemented, there is some way to ensure that folks don't accidentally switch a private repo to public

This is key, I think. Unlike on GitHub, where you can make a private repo public at any time, I think the safest thing to do here would be to offer the option at repo creation and then lock it in permanently.
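As far as I know, GitLab has no built-in way to lock a project's visibility at creation time, so locking it in would likely need out-of-band enforcement. A rough, untested sketch of what a scheduled check could look like (the admin token and the registry of private-born project IDs are hypothetical):

```python
"""Sketch: force projects that were created private back to private
if someone flips them. Assumes an admin-scoped API token and a
maintained registry of private-born project IDs (both hypothetical)."""
import requests

GITLAB = "https://gitlab.wikimedia.org/api/v4"
HEADERS = {"PRIVATE-TOKEN": "REDACTED"}  # admin token, placeholder

PRIVATE_BORN_PROJECT_IDS = [1234, 5678]  # hypothetical registry

for project_id in PRIVATE_BORN_PROJECT_IDS:
    resp = requests.get(f"{GITLAB}/projects/{project_id}", headers=HEADERS)
    resp.raise_for_status()
    project = resp.json()
    if project["visibility"] != "private":
        # Revert the flip and surface it for human follow-up.
        requests.put(
            f"{GITLAB}/projects/{project_id}",
            headers=HEADERS,
            data={"visibility": "private"},
        ).raise_for_status()
        print(f"Reverted {project['path_with_namespace']} to private")
```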

One more thing I forgot to mention initially: years ago the Fundraising team had to subscribe to a paid GitHub plan so that their organization could have private repos (mainly Jupyter notebooks analyzing A/B tests of fundraising campaigns/banners). It would be nice to be able to use internal services and storage for that instead.

I would like to raise the priority of this request, so that our team can share notebooks with potentially sensitive content. We need to be able to document code and reference historical analysis. Using a private repo would reduce the likelihood of potential leaks of sensitive data when backing up code.

@Jcross, let me know if you or someone from your team needs more information. Thank you!

@EChetty (tagged but currently on leave) has been in touch with @thcipriani about private repos for T316049: Unify all Product Analytics ETL jobs and Tyler said that's definitely possible.

Ideally we would get private repos enabled just for our namespace (https://gitlab.wikimedia.org/repos/product-analytics) so anyone on the team could effortlessly create a private repo within repos/product-analytics at any time. If for some reason it is not possible, it'd be great to at least be able to ask the admins to create private repos in that namespace on our behalf. It'd be no different than the current process for creating repos on Gerrit where you request the repo and then eventually someone processes the request. That would be much less ideal but it would be substantially better than not having private repos at all.
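If the request-then-admin-creates fallback were adopted, the admin side could be a single API call. A minimal sketch, assuming an admin token and a hypothetical numeric ID for the repos/product-analytics group:

```python
"""Sketch: create a private project in a group namespace on a
requester's behalf. The token, group ID, and project name are
all placeholders."""
import requests

GITLAB = "https://gitlab.wikimedia.org/api/v4"
HEADERS = {"PRIVATE-TOKEN": "REDACTED"}  # admin token, placeholder

def create_private_project(name: str, group_id: int) -> dict:
    """Create a private project under the given group."""
    resp = requests.post(
        f"{GITLAB}/projects",
        headers=HEADERS,
        data={"name": name, "namespace_id": group_id, "visibility": "private"},
    )
    resp.raise_for_status()
    return resp.json()

# Hypothetical group ID for repos/product-analytics.
project = create_private_project("movement-metrics-internal", group_id=42)
print(project["web_url"])
```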

I also want it noted that we have a need for this, and that having private repos in our on-prem GitLab instance is a much better alternative to relying on GitHub for this functionality.

@thcipriani: Emil mentioned checking in with you on this in the context of storing executed notebooks in private repos on GitLab for T322533 and said you didn't see why it couldn't be done.

My team routinely needs to share analyses and reports containing sensitive data (mainly geographic) with each other for code reviews. Can we please be allowed to create private repositories in repos/product-analytics? Should I file a separate Phab task for that since this one is about enabling it in general?

There are a lot of good use cases for private repositories:

  • Embargoed information, soon to be public
  • Vandalism-fighting countermeasures whose details, if revealed, might change vandals' tactics
  • Sharing hiring tasks with candidates

And probably many more I can't think of.

But we disallow private repositories currently (on Gerrit and on GitLab T284962) mostly because there are so many ways for them to be made public.

How repos can become public

  • Configuration change – A simple settings change can switch a repository from private to public. Access controls mitigate this, but miscommunication or a confusing UI can still result in a formerly private repo becoming public.
  • Forks – Especially under the GitHub/GitLab model, someone forking a repo and changing the settings on their fork can expose information. This is what happened in the Uber and Equifax data leaks, and to Starbucks.
  • Phishing – Since only one credential stands between bad actors and sensitive data, forge credentials are especially vulnerable to phishing attacks, as happened with Slack and Dropbox.

And there's also the security model of the forge to consider.

A fast-moving upstream can merge a bad change that exposes private repositories.

This happened to GitHub.

And GitLab has had at least 3 incidents matching the search "GitLab+Private repos" in the Common Vulnerabilities and Exposures (CVE) database (1, 2, 3).

This is especially troubling for us considering GitLab offers no advance notice of security releases, meaning each of the above CVEs would have been disclosed while we scrambled to update Wikimedia's GitLab.

Mitigations

Given there is little we can do to defend against some of these (admittedly unlikely, but not unprecedented) ways for code to become public, the best we can do is avoid putting sensitive information into private repositories, especially the kind that would be devastating if it leaked.

Things like:

  • Sensitive passwords/production credentials
  • Personally identifying information (PII)
  • Any sensitive code or information we expect to be permanently private

What's next

Like I said, there are a lot of good use cases for private repositories. So I'd like to have a process that ensures we're allowing folks to use private repositories when they need them, while continuing to disallow private repos for storing sensitive information.

My goals would be:

  1. No free-for-all private repos
  2. Allow users to request private repos through some process (a phab form? Dunno if there's any analogous process that exists that we could copy?)
    1. Ideally the form would require users to acknowledge that any data in a repo is only private-ish
  3. Ensure there's a list of known private repositories that can be reviewed by administrators on some cadence
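For goal 3, the list could come straight from the API: with an admin-scoped token, the /projects endpoint returns private projects too, so the periodic audit could be a small script. A rough sketch (token is a placeholder):

```python
"""Sketch: enumerate every private project for periodic admin review.
Assumes an admin-scoped token so the /projects endpoint returns
private projects as well."""
import requests

GITLAB = "https://gitlab.wikimedia.org/api/v4"
HEADERS = {"PRIVATE-TOKEN": "REDACTED"}  # admin token, placeholder

def private_projects():
    """Yield all private projects, following the paginated API."""
    page = 1
    while True:
        resp = requests.get(
            f"{GITLAB}/projects",
            headers=HEADERS,
            params={"visibility": "private", "per_page": 100, "page": page},
        )
        resp.raise_for_status()
        batch = resp.json()
        if not batch:
            return
        yield from batch
        page += 1

for project in private_projects():
    print(project["path_with_namespace"], project["created_at"])
```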

I chatted with @sbassett, he suggested I loop in Privacy Engineering for some discussion/assistance (👋 @JFishback_WMF ).


Thoughts about any of the above would be welcome ❤

There are a lot of good use cases for private repositories:

  • Embargoed information, soon to be public
  • Vandalism-fighting countermeasures whose details, if revealed, might change vandals' tactics
  • Sharing hiring tasks with candidates

And probably many more I can't think of.

But we disallow private repositories currently (on Gerrit and on GitLab T284962) mostly because there are so many ways for them to be made public.

This analysis skips over the potential negative issues caused by private repositories, which IMO is the main reason that Gerrit never had private repositories. The most obvious is that Wikimedia development defaults to a public and transparent stance, and having private repositories directly works against that.

My goals would be:

  1. No free-for-all private repos

+1

  2. Allow users to request private repos through some process (a phab form? Dunno if there's any analogous process that exists that we could copy?)
    1. Ideally the form would require users to acknowledge that any data in a repo is only private-ish

Rather than inventing a new process, I would propose that we roughly mirror how Phabricator handles private tasks. That is, we have a "security" group where things that are embargoed are temporarily stored, with access mirroring our existing acl*security. Then we'd have a "WMF-NDA" group, analogous to the Phabricator WMF-NDA project – presumably the sensitive geographic data work from T305082#8683787 could be stored there. I think these groups strike a good balance between being private and still allowing a trusted and diverse-ish set of people access.

If someone truly needs a separate ACL, they should have to clear some high bar to justify it, like Space requests.
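For what it's worth, GitLab can model the group-ACL idea directly: a private project can be shared with a group at a fixed access level, so a hypothetical gitlab-wmf-nda group could be granted read-only access to every embargoed project. A sketch, with placeholder IDs:

```python
"""Sketch: grant a trusted group (e.g. a hypothetical gitlab-wmf-nda
group) read-only access to a private project, mirroring Phabricator's
WMF-NDA ACL. The token, project ID, and group ID are placeholders."""
import requests

GITLAB = "https://gitlab.wikimedia.org/api/v4"
HEADERS = {"PRIVATE-TOKEN": "REDACTED"}  # admin token, placeholder

REPORTER = 20  # GitLab's Reporter access level: can read, cannot push

resp = requests.post(
    f"{GITLAB}/projects/1234/share",  # hypothetical private project ID
    headers=HEADERS,
    data={"group_id": 99, "group_access": REPORTER},  # hypothetical group ID
)
resp.raise_for_status()
```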

  3. Ensure there's a list of known private repositories that can be reviewed by administrators on some cadence

I would suggest the list be available to a larger set of people, e.g. the security or WMF-NDA groups. Semi-relatedly, this is probably a good time to start discussing how we add some GitLab admins who are volunteers.

Maybe I'm a bit naive, but what precisely is the space of data analysis that does not contain PII but also cannot be public? Surely a data analysis that does not contain PII can be public? And if it does contain PII, then it wouldn't be allowed under the proposed policy anyway. Seems a bit catch-22 for actual uses here, but maybe I just lack imagination.

More to the point, we're presumably not allowing PII because we think the risk of a data leak is too high. But if this is for non-PII but still sensitive data, isn't the risk the same?

The most obvious is that Wikimedia development defaults to a public and transparent stance, and having private repositories directly works against that.

@Legoktm This view skips over any work that isn't engineers developing open source software, which is not the primary use case I described when I wrote this request. (But in terms of developing software there are actually valid reasons to keep some of it private, and Tyler enumerated them.)

Maybe I'm a bit naive, but what precisely is the space of data analysis that does not contain PII but also cannot be public? Surely a data analysis that does not contain PII can be public? And if it does contain PII, then it wouldn't be allowed under the proposed policy anyway.

If I have a Jupyter notebook that includes analysis of, say, editing or traffic data, and the outputs include countries on the country protection list, and I want somebody on my team to review it for correctness before sharing it with stakeholders (e.g. WMF's Legal team), my best path right now is to download that notebook from the stat box I was working on and then either email it or DM it to them via Slack, which they will then need to download to view. If I find a mistake and want to correct it, rather than pushing the corrected version to GitLab and saying "hey, I just updated it, please refresh if you've already opened it", I have to download it from the stat box again, email/DM it, and they have to download it again. The workflow is, effectively, digital fax.

Thank you for your detailed and well-thought-out response, @thcipriani!

My goals would be:

  1. No free-for-all private repos
  2. Allow users to request private repos through some process (a phab form? Dunno if there's any analogous process that exists that we could copy?)
    1. Ideally the form would require users to acknowledge that any data in a repo is only private-ish
  3. Ensure there's a list of known private repositories that can be reviewed by administrators on some cadence

That sounds very reasonable!

I would suggest the list be available to a larger set of people, e.g. the security or WMF-NDA groups. Semi-relatedly, this is probably a good time to start discussing how we add some GitLab admins who are volunteers.

+1 to both points. (Side question: Are there GitLab admins who are volunteers but haven't signed NDAs?)

(Side question: Are there GitLab admins who are volunteers but haven't signed NDAs?)

There are not. At this time, GitLab admins are all current WMF staff. I agree that it'd be good to onboard some volunteers, but that would be conditioned on NDA in the same way that other similar access is.

My goals would be:

  1. No free-for-all private repos
  2. Allow users to request private repos through some process (a phab form? Dunno if there's any analogous process that exists that we could copy?)
    1. Ideally the form would require users to acknowledge that any data in a repo is only private-ish
  3. Ensure there's a list of known private repositories that can be reviewed by administrators on some cadence

I chatted with @sbassett, he suggested I loop in Privacy Engineering for some discussion/assistance (👋 @JFishback_WMF ).

I'm personally fine with most of this proposal, from a security perspective. In line with point #3 above, I think it would be a good idea to have some automation in place (which should be easier with GitLab's nice CI) to audit private repos and perform various secret checks, etc. on them, perhaps a bit more aggressively than on public repos. I'd also recommend setting some kind of default TTL for private repos, maybe 7 days or so; this would be most applicable to various security patch workflows. And obviously folks should be allowed to override such a configuration provided they have compelling arguments to do so.
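The TTL idea could piggyback on the same kind of audit job. A rough sketch of flagging private projects older than a 7-day default (the token and the notification mechanism are placeholders):

```python
"""Sketch: flag private projects older than a default TTL (7 days,
per the suggestion above) for review, archival, or an extension
request. Token handling and pagination are simplified."""
from datetime import datetime, timedelta, timezone

import requests

GITLAB = "https://gitlab.wikimedia.org/api/v4"
HEADERS = {"PRIVATE-TOKEN": "REDACTED"}  # admin token, placeholder
TTL = timedelta(days=7)

resp = requests.get(
    f"{GITLAB}/projects",
    headers=HEADERS,
    params={"visibility": "private", "per_page": 100},
)
resp.raise_for_status()

now = datetime.now(timezone.utc)
for project in resp.json():
    created = datetime.fromisoformat(project["created_at"].replace("Z", "+00:00"))
    if now - created > TTL:
        # A real job might file a Phab task or ping the project owner here.
        print(f"TTL exceeded: {project['path_with_namespace']} ({created:%Y-%m-%d})")
```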

The most obvious is that Wikimedia development defaults to a public and transparent stance, and having private repositories directly works against that.

@Legoktm This view skips over any work that isn't engineers developing open source software, which is not the primary use case I described when I wrote this request. (But in terms of developing software there are actually valid reasons to keep some of it private, and Tyler enumerated them.)

No it doesn't. To quote from the WMF Guiding Principles: "In general, where possible, we aim to do much of our work in public, rather than in private, typically on public wikis." Note that there's nothing about "our work" being limited to developing software vs research/analytics. That is roughly what I said, that we default to being public and transparent. I didn't say that we shouldn't have private repositories at all, just that we need to be cognizant about the downsides of doing so. Hence my proposal to reuse the Phabricator private task/space framework rather than trying to come up with a new scheme that figures out how to balance the different tradeoffs and values.

Certainly I would much rather have us self-host private repositories than use a proprietary host like GitHub, but since we have the opportunity, it's a good time to ensure we're implementing best practices when setting up a new service.

There are not. At this time, GitLab admins are all current WMF staff. I agree that it'd be good to onboard some volunteers, but that would be conditioned on NDA in the same way that other similar access is.

I've split this thread to T333386: Onboard non-WMF staff as GitLab admins.

my best path right now is to download that notebook from the stat box I was working on and then either email it or DM it to them via Slack, which they will need to download to view.

Excuse me, but isn't copying private data off of stat machines and distributing it via Slack a big NO in the first place? I assumed it was and if this is common practice it seems like an issue. Shouldn't the data sharing happen on the stat machines themselves?

what precisely is the space of data analysis that does not contain PII but also cannot be public? Surely a data analysis that does not contain PII can be public? And if it does contain PII, then it wouldn't be allowed under the proposed policy anyway.

I agree with that; I also still can't imagine data that is _at the same time_ too sensitive to publish but also does not contain private user data. It seems likely to me that this is going to create a lot of grey area, and it's a slippery slope.

Ideally the form would require users to acknowledge that any data in a repo is only private-ish

Similar to above, I have a hard time understanding what "private-ish" would mean.

@Legoktm This view skips over any work that isn't engineers developing open source software, which is not the primary use case I described when I wrote this request. (But in terms of developing software there are actually valid reasons to keep some of it private, and Tyler enumerated them.)

No it doesn't. To quote from the WMF Guiding Principles: "In general, where possible, we aim to do much of our work in public, rather than in private, typically on public wikis." Note that there's nothing about "our work" being limited to developing software vs research/analytics. That is roughly what I said, that we default to being public and transparent. I didn't say that we shouldn't have private repositories at all, just that we need to be cognizant about the downsides of doing so. Hence my proposal to reuse the Phabricator private task/space framework rather than trying to come up with a new scheme that figures out how to balance the different tradeoffs and values.

Certainly I would much rather have us self-host private repositories than use a proprietary host like GitHub, but since we have the opportunity, it's a good time to ensure we're implementing best practices when setting up a new service.

Ah, thank you for elaborating! When you put it that way, we are 100% in agreement. Indeed, my team makes the vast majority of our analyses (code, queries, reports, but not data) openly available on GitHub & GitLab to be as transparent as possible, in accordance with the WMF Guiding Principles :)

my best path right now is to download that notebook from the stat box I was working on and then either email it or DM it to them via Slack, which they will need to download to view.

Excuse me, but isn't copying private data off of stat machines and distributing it via Slack a big NO in the first place? I assumed it was and if this is common practice it seems like an issue. Shouldn't the data sharing happen on the stat machines themselves?

Not exactly. It depends on the specifics of the "private data" (a very broad umbrella term). The biggest concern is personal information (see https://foundation.wikimedia.org/wiki/Policy:Privacy_policy#Definitions) such as users' IP addresses, email addresses, real names, etc., and generally we do not transfer such data from the analytics cluster to our laptops or anywhere else, really (including the Foundation's Google Drive). The only exception I can think of is responding to subpoenas, where we are legally required to do so.

In general, when we do copy private data (over SSH to our encrypted, Foundation-provided laptops), it's data like results of A/B tests we ran (which usually don't contain PII) that we need to analyze locally due to technical limitations of the analytics cluster. This used to be more common because we needed R for statistical analysis of experiments, and R support on the stat boxes was atrocious to the point of being unusable. It is increasingly rare now that support for R has improved substantially in the past year, so we are able to contain most analyses within the stat machines.

As for the copied data on our laptops, we delete it upon publishing the report/results of the analysis, to adhere to whatever the data retention period is (90 days in most cases, but occasionally longer after obtaining prior approval/exemption from Legal).

The other thing is, "private data" can also include data that is, simply, not publicly available – i.e. technically private. For example, if you asked me how many accounts were registered on a specific wiki as a result of a campaign (assuming the wiki had the Campaigns extension enabled), I would need to query the event.serversideaccountcreation table – a private dataset – to calculate this private statistic. And in most (but not necessarily all!) cases you & I would then actually be able to share that count publicly with organizers of that campaign and that wiki's community, but in some cases we might not be able to without also having Legal & Privacy review it first. (By the way, having private repos would facilitate that review process.)

Furthermore, when notebooks are shared/uploaded (in any way, internally or externally), the only data shared is what's printed in the output cells. The raw data queried from the data lake stays in memory in the Jupyter kernel, separate from the notebook itself. So if I retrieve that ServerSideAccountCreation data and then aggregate it and print the total, that's still technically private data, but it doesn't contain any personal information and in many (but not necessarily all) cases does not pose a privacy or security risk. And as a standard/best practice, we scrub outputs that have sensitive information whenever we publish the notebook publicly.
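For the scrubbing step, here is a minimal sketch of clearing all outputs with nbformat before publishing (the filenames are placeholders; `jupyter nbconvert --clear-output` does roughly the same thing):

```python
"""Sketch: strip all code-cell outputs from a notebook before
sharing it publicly. Filenames are placeholders."""
import nbformat

nb = nbformat.read("analysis.ipynb", as_version=4)
for cell in nb.cells:
    if cell.cell_type == "code":
        cell.outputs = []           # drop printed results, tables, plots
        cell.execution_count = None
nbformat.write(nb, "analysis-scrubbed.ipynb")
```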

Finally, since all laptops are required to be fully encrypted and we have an enterprise key-management encryption system in place (meaning communications including confidential, sensitive, or even attorney-client privileged information can take place on channels within our workspace), it is actually OK to share some (but not all!) private data (within reason!!) while adhering to https://meta.wikimedia.org/wiki/Data_retention_guidelines and exercising (a lot of!) caution and discretion.

what precisely is the space of data analysis that does not contain PII but also cannot be public? Surely a data analysis that does not contain PII can be public? And if it does contain PII, then it wouldn't be allowed under the proposed policy anyway

I agree with that; I also still can't imagine data that is _at the same time_ too sensitive to publish but also does not contain private user data. It seems likely to me that this is going to create a lot of grey area, and it's a slippery slope.

Here's a more concrete example which might help: let's say somebody from Legal needed to know about Turkish editors and email addresses for some regulatory compliance reason. I obviously wouldn't DM them a CSV of email addresses, but I could share with them a count of how many registered users edited from Turkey in the last month and had a verified email address on file. I could even share with them the notebook that yielded that count so they could review the query (or have someone else review it) AND have it for future reference – provided I didn't just print the raw data out at any point, or if I did, provided I cleared the output before sharing. The data – in this case a simple count statistic – does not contain PII and, yet, it cannot be made public without a security/privacy risk assessment.

Hi all! I've read this thread and I want to weigh in on this with a perspective from the Privacy Engineering team. I think that there are two primary facts to consider here:

  1. The Product Analytics team (PA) has an organizational mandate from WMF to be doing this work, and they have been doing this work, despite the organizational constraint that they cannot share the outputs of their analyses in the same place as the code that produces those outputs. This is unlikely to change any time soon.
  2. The main issue here is data that is sensitive (i.e. it could potentially be used in harmful ways by a malicious actor) but not confidential (i.e. not certain to be used in harmful ways / not defined as PII in the WMF Privacy Policy) — what @mpopov alludes to with the Turkish editors example above.

With these two facts in mind, I think that a set of self-hosted private repos on Gitlab is the best way to enable PA's work and ameliorate the existing privacy harm of sending sensitive data in Slack or via email. Of course, there need to be mitigations and regulations on the private repos, in line with @sbassett and @thcipriani's suggestions above.

Finally, in addition to these administrative processes, I'm starting to think about a practical rubric of good data handling procedures for the Product Analytics team, specifically covering how we collect/analyze/share data that fall into the following categories:

  • Data that could certainly be used to cause harm
  • Data that could likely be used to cause harm
  • Data that could possibly be used to cause harm
  • Data that is unlikely to be used to cause harm and is private for administrative reasons

Hope this is helpful!

Tyler is going to draft a policy & Phabricator form, so I'm assigning this to him. Security and Legal will review the draft and provide feedback.