Security Concept Review For new CI
Closed, Resolved (Public)

Description

Project Information

  • Name of project: new CI
  • Project home page: https://www.mediawiki.org/wiki/Wikimedia_Release_Engineering_Team/CI_Futures_WG
  • Name of team which owns the project: Release Engineering
  • Primary contact for the project: @LarsWirzenius
  • Target date for deployment: 2020
  • Link to code repository: n/a
  • Is this a brand-new project: yes
  • Has this project ever been reviewed before: no
  • Has any risk assessment (STRIDE, etc.) been performed: in progress, see https://phabricator.wikimedia.org/T240679
  • Is there an existing RFC or has this been presented to the community: kind of, blog posts in Phame
  • Is this project tied to a team quarterly goal: yes
  • Does this project require its own privacy policy: no?

Description of the project and how it will be used

We need to replace the existing CI system at the foundation.
https://www.mediawiki.org/wiki/User:LarsWirzenius/NewCI has our current thinking of what the new system
will look like, except it doesn't include the fact that we'd like to use Argo on Kubernetes.

Description of any sensitive data to be collected or exposed

None, hopefully. But CI will build artifacts for deployments, which means it's an avenue for attack.

Technologies employed

  • Kubernetes
  • Argo
  • Go
  • Gerrit

Dependencies and vendor code

  • some K8s cluster, possibly hosted by a commercial provider
  • Gerrit

Working test environment

We don't have this yet, but we can set something up if need be.



Scoping Question 1: Do you have a final candidate list of new technologies that will be introduced within the new CI/CD and what those technologies will replace within the existing system? It's unclear from the various pieces of documentation where Releng is at in their selection process and we'd like to have this narrowed down to as small a list as possible prior to any review.

We're currently aiming at using Argo (https://argoproj.github.io/) running on Kubernetes, until and unless that turns out to be inadequate or unsuitable. The other two candidates we were considering at the end were Zuul v3 and GitLab CI, but those are not being actively considered at the moment.


Scoping Question 2: Can you clarify the specifics of the testing and staging environments from the image promotion pipeline? Where will these environments exist and who will be the ostensible maintainers of said environments?

I'm afraid testing and staging environments are unclear for now: we don't yet know where and how and by whom, or even if, they will be implemented.


Scoping Question 3: This comment within the task description - some K8s cluster, possibly hosted by a commercial provider - seems to imply the potential for SaaS/PaaS options. Is this still being considered? Can we get a sense of what systems and services would be candidates for such an option?

A commercial K8s provider is definitely being considered. We've mostly been talking about GKE, but haven't formally considered the options. The idea that we wouldn't use WMF K8s came up late in the process, at TechConf.

Event Timeline

Hey @LarsWirzenius - is there a more specific deployment date (or even quarter) for this? Just trying to get a sense of where the Security-Team can prioritize this amongst our reviews backlog. Thanks.

chasemp triaged this task as Medium priority.Jan 7 2020, 6:53 PM
chasemp moved this task from Incoming to Back Orders on the Security-Team board.

The deployment date is currently open, but sooner rather than later. RelEng is currently blocked on this by some discussions on K8s hosting with SRE, and doing a threat model with them.

sbassett removed a project: Security-Team.

First review initially scheduled for 2019-01-10.

Hey @LarsWirzenius

We had a few initial comments and questions:

  1. General Comment 1: Security Previews should never be considered a hard blocker of anything. Apologies if that hasn't been made clear within our documentation. If you/Releng feel you have some specific questions or concerns which are of critical importance and that only the Security-Team can address, feel free to let us know what those are.
  2. Scoping Question 1: Do you have a final candidate list of new technologies that will be introduced within the new CI/CD and what those technologies will replace within the existing system? It's unclear from the various pieces of documentation where Releng is at in their selection process and we'd like to have this narrowed down to as small a list as possible prior to any review.
  3. Scoping Question 2: Can you clarify the specifics of the testing and staging environments from the image promotion pipeline? Where will these environments exist and who will be the ostensible maintainers of said environments?
  4. Scoping Question 3: This comment within the task description - some K8s cluster, possibly hosted by a commercial provider - seems to imply the potential for SaaS/PaaS options. Is this still being considered? Can we get a sense of what systems and services would be candidates for such an option?

1: We're not blocked on the review. We're, separately, working with SRE on a threat model for CI. https://www.mediawiki.org/wiki/User:LarsWirzenius/NewCI/threats is the current status on that.

  2. We're currently aiming at using Argo (https://argoproj.github.io/) running on Kubernetes, until and unless that turns out to be inadequate or unsuitable. The other two candidates we were considering at the end were Zuul v3 and GitLab CI, but those are not being actively considered at the moment.
  3. I'm afraid testing and staging environments are unclear for now: we don't yet know where and how and by whom, or even if, they will be implemented.
  4. A commercial K8s provider is definitely being considered. We've mostly been talking about GKE, but haven't formally considered the options. The idea that we wouldn't use WMF K8s came up late in the process, at TechConf.

We've written https://www.mediawiki.org/wiki/Wikimedia_Release_Engineering_Team/Seakeeper_proposal as a possible description of what the K8s setup would be, but that's not final either. @dduvall has had the lead on that.

My apologies for everything being a question mark.

We (RelEng and SRE) had a second threat modelling meeting today, and the threats page is updated.

The scope of a successful conceptual assessment with associated early risk reasoning is dependent on context. In this case, there are enough moving parts and combined efforts to require some upfront level setting. Some of this is addendum and/or reframing of 2019 TechConf notes.

We know T217325 was filed in relation to the CI Futures WG (CIFWG) formed in Feb of 2019. The CIFWG published a report based on surveys, discussions, interactions, and history. This report had several items described as very hard requirements which were part of the general working requirements collated.

Very Hard Requirements:

  • Must be hostable by the Foundation
  • Must be free software / open source
  • Must support git
  • Must have a version we can easily use for evaluation
  • Must be comprehensible without too much effort
  • Must support self-serve CI

It's unclear to us from the document which stakeholders proposed which hard requirements to move forward, or whether these are security, operational, or administrative in origin.

A candidates page was compiled and seems to be reflected in a sheet and T217325. The sheet is somewhat confusing, as the last standing solutions there are: argo, gitlab, gocd, and spinnaker. These were reduced down to: argo, zuulv3, and gitlab. PoCs were performed for argo, gitlab, and Zuul, after which a vote with these options commenced.


Fast forwarding to the most recent state of things

In the most recent meeting of the CIFWG on 2019-12-11 the notes indicate essential roadblocks in ownership and implementation remain.

  • Staffing, domain knowledge and support concerns raised therein are an issue in conjunction with the as-written Seakeeper proposal.
  • The posed question of 3rd party hosting is in opposition to the first hard requirement for the project. That requirement is, at this time, further reinforced in the Scope of Work section of the architecture document.

These are non-negotiable requirements that must all be fulfilled by our future CI system.

SelfHostable Must be hostable by the Foundation. It’s not acceptable for WMF to rely on an outside service for this, to achieve the security, reliability, and privacy required of Wikimedia.

FreeSoftware Must be free software / open source. “Open core” like GitLab can be good enough, as long as we only need the parts that provide software freedom. This is partly due to the SelfHostable requirement, but also because free software is a form of free knowledge, and it’s a WMF value to prefer open source.


In Dec of 2019, T240679 was created to coordinate threat modeling and a page was created to document. The in-progress threat model has a diagram that indicates portions of the system (build nodes in particular) are to be considered insecure (and so untrusted). This is not compatible with the requirements in the architecture document:

Promotion Must promote (copy) Docker images and other build artifacts from “testing” to “staging” to “production”, rather than rebuilding them, since rebuilding takes time and can fail or produce a different result. Once a binary, Docker image, or other build artifact has been built, exactly that artifact should be tested, and eventually deployed to production.

When combined with our understanding of Artifact store for temporary blobs

Stores build artifacts from build nodes: binaries, Docker images, translation files, etc. Deployments to test environments happen from here. Build logs will be stored here.

This seems to indicate artifacts (including images) are going to be built in an untrusted environment and promoted through to production.

This seems confirmed in the architecture document:

Artifact storage must be secure, as everything that gets deployed to production goes via it.

The threat modeling documentation also seems to require a defined Test environment

Runs the code provided by the developer, in an environment more or less like production, so the developer can test their changes, for when they need more than their personal machines to do that.

But the answered scoping questions above indicate neither a testing nor staging environment are currently planned for:

I'm afraid testing and staging environments are unclear for now: we don't yet know where and how and by whom, or even if, they will be implemented.

The architecture document also requires testing environments, or some environment(s) with similitude to production for an effective pipeline:

The new CI system will need to provide various environments, which are sufficiently production-like for testing or running the software. These environments will include all the components that are needed for simulating production, for the tests in question. The specific components depend on the test, and a mechanism for specifying them will be built detail during the CI implementation phase. Ideally, the main difference between a test environment and production is capacity.

This is further reinforced in the WMF Development Ecosystem and CI internal components section of the architectural document. It's not possible to get a solid overview of the intended trust zones and developer workflow with this ambiguity.



Summation

Required information

  • Are the requirements defined in the earliest phases of the project still remaining? (If not, how were they decided and what has changed?)
  • What is the developer workflow for this system from conception to production? Will there be testing and staging environments and what is the graduation criteria between them?
  • Is testing->production artifact promotion a necessity here and could it be achieved where artifacts are created in a same or more trusted portion of the pipeline before promotion?
  • Is there a relationship to the existing deployment-prep Cloud environment? It's unclear to us from the docs what the integration and path forward is, though there is clear overlap and relation.

threat modeling

  • Consider an RBAC view of end user relationships to the system. Staff, volunteer and trusted volunteer are not defined roles. This would also match and be reflected in the Credentials management and access_control section of the architectural document.
  • What are the threats to the promotion pipeline proposed in regards to build nodes and the security of the system? Any binary artifact must be built in a same or more trusted environment to be promoted.
  • What are the agreed upon components of the system (testing, staging, production)?

3rd party relationships

  • In the 2019-12-11 meeting the possibility of GKE is posed. The WMF has relationships with Rackspace (very small) and AWS (small to modest depending) already. We have three use cases with AWS and possibly more. If we are going to add Google Cloud to this list that conversation should get started sooner than later as we will be splitting what small expertise we have. This is not the first time outside compute and/or resourcing has been a topic. A task for discovery was created previously for cloud services, and (as mentioned) AWS is in use for a few things.
  • The barriers to a 3rd party SaaS, IaaS, PaaS solutions are primarily ideological (potentially), legal, security, and privacy. The Security Team has expertise in-house for evaluating 2 of the 4 and a good working relationship with legal. To this point our understanding of any manifestation of this proposed system has excluded this work due to the first hard requirement established early on in the lifecycle here.

Outcome for Security Team

As of the 2019-12-11 meeting (with its "TODO: eval what externally hosted solutions are available?"), this project has many variables and conceptual possibilities, some of which are mutually exclusive. The threat model is similarly too much of a work in progress ("This is a first preliminary draft. It's meant to provide a basis for discussion. It does not represent any final decisions.") and too inconsistent with input on this task for us to contribute more without clearer parameters.

Additionally, the language that seems most directly to influence the self-hosting requirement:

SelfHostable Must be hostable by the Foundation. It’s not acceptable for WMF to rely on an outside service for this, to achieve the security, reliability, and privacy required of Wikimedia.

is not an outcome of any explicit risk, privacy, legal, or operational assessment we are aware of for this use case. It's worth noting that there is PII involved here in the form of IPs, user-agents, and other details that can identify users and as such the implementation details and privacy of end users must be considered.


TLDR; We believe deciding on final components (incl. environments), ownership and stewardship for components and data locality is required to move forward from here.

The Security-Team would like to express gratitude to release engineering for taking on this monumental task, none of the output here is meant to make light of the work to this point or moving forward. Thank you. We have our own needs and considerations when it comes to a modern pipeline and want to assist in moving this process forward.

(@chasemp @sbassett @Reedy)

References

https://phabricator.wikimedia.org/T238261#5666625
https://www.youtube.com/watch?reload=9&v=M_rxPPLG8pU&start=859

chasemp changed the task status from Open to Stalled.Jan 21 2020, 8:54 PM
chasemp moved this task from In Progress to Waiting on the Security-Team board.

Some clarifications on some details:

This seems to indicate artifacts (including images) are going to be built in an untrusted environment and promoted through to production.

This seems confirmed in the architecture document:

Artifact storage must be secure, as everything that gets deployed to production goes via it.

Yes, that is something that is wrong in the currently uploaded diagram, and should be fixed. I was supposed to add comments on the wiki page but never found the time to do so. Apologies for not getting to it earlier, it would've made your life easier.

The idea is that we have two channels for ci: an "untrusted" one for generic patchsets and a "trusted" one for building whatever is needed after a patch has been merged.

See an updated diagram here https://people.wikimedia.org/~oblivian/ci/ci-model.pdf

But the answered scoping questions above indicate neither a testing nor staging environment are currently planned for:

I assume that refers to our user-area testing environment, AKA "beta".

The kubernetes "staging" cluster should only be used by deployers and is at the moment part of the production environment. It's supposed to be the place where we catch bad releases before we release to production.

[CUT]

  • Consider an RBAC view of end user relationships to the system. Staff, volunteer and trusted volunteer are not defined roles. This would also match and be reflected in the Credentials management and access_control section of the architectural document.

I think a security point of view of this needs to have two roles, more or less:

  • Anyone who can submit a patch to a project (so: anyone with an account on gerrit)
  • Anyone who can merge/deploy (those *need* to coincide) a patch on that project

I don't think staff/volunteer makes sense in this context.
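
As a rough illustration of the two-role view above, here is a minimal Python sketch. The names `Permission`, `SUBMITTER`, `MAINTAINER` and `is_consistent` are hypothetical and do not come from Gerrit or any WMF configuration; the only rule encoded is the one stated above, that merge and deploy rights need to coincide.

```python
from enum import Flag, auto


class Permission(Flag):
    """Hypothetical per-project permissions, for illustration only."""
    NONE = 0
    SUBMIT = auto()   # can upload a patch for review
    MERGE = auto()    # can merge a patch
    DEPLOY = auto()   # can deploy the result


# The two roles proposed above, expressed as permission sets.
SUBMITTER = Permission.SUBMIT
MAINTAINER = Permission.SUBMIT | Permission.MERGE | Permission.DEPLOY


def is_consistent(perms: Permission) -> bool:
    """Return True when merge and deploy rights coincide for this permission set."""
    return bool(perms & Permission.MERGE) == bool(perms & Permission.DEPLOY)
```

A permission set such as `Permission.SUBMIT | Permission.MERGE` would be flagged as inconsistent under this assumption, since it grants merge without deploy.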

  • What are the threats to the promotion pipeline proposed in regards to build nodes and the security of the system? Any binary artifact must be built in a same or more trusted environment to be promoted.

100% agreed, as stated above. The build of the production artifacts should happen in the trusted zone of the pipeline.
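
The "same or more trusted" rule can be sketched as a simple ordering check. This is only an illustration under stated assumptions: the two-level `Trust` scale mirrors the untrusted/trusted channels mentioned earlier in this comment, and `may_promote` is a hypothetical helper, not part of any existing tooling.

```python
from enum import IntEnum


class Trust(IntEnum):
    """Hypothetical trust levels; higher means more trusted."""
    UNTRUSTED = 0  # e.g. builds of unreviewed patchsets
    TRUSTED = 1    # e.g. builds of merged changes


def may_promote(built_in: Trust, required: Trust) -> bool:
    """An artifact may be promoted only if it was built in a zone that is
    at least as trusted as the level the target environment requires."""
    return built_in >= required
```

Under this sketch, an artifact built in the untrusted zone could still be promoted to an environment that only requires `Trust.UNTRUSTED`, but never to one that requires `Trust.TRUSTED`.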

  • In the 2019-12-11 meeting the possibility of GKE is posed. The WMF has relationships with Rackspace (very small) and AWS (small to modest depending) already. We have three use cases with AWS and possibly more. If we are going to add Google Cloud to this list that conversation should get started sooner than later as we will be splitting what small expertise we have. This is not the first time outside compute and/or resourcing has been a topic. A task for discovery was created previously for cloud services, and (as mentioned) AWS is in use for a few things.

GKE is just an example, being AIUI the most complete managed kubernetes environment in one of the major cloud providers. If we want to consolidate to just one cloud provider (which in many ways is a very wrong thing to do, though), it's also acceptable.

I hope this clarifies things a bit.

Firstly, I just want to say thanks to @chasemp, @sbassett, @Reedy for taking the time to review the deep trove of meeting notes, supporting documents, and proposals around this year-long process of planning for the future of CI at WMF. As is often the case when making changes to such a widely utilized part of our developer ecosystem, there are some false starts, missing details, scope creep, revisions (revisions, revisions), and adaptions along the way. Making sense of such a large project at this point in the process is no small feat. So thank you for doing that work and for providing such a thorough and actionable response.

I will respond to some of your feedback inline, but I also want to give a broader view of where our heads are at in Release-Engineering-Team regarding next steps for the project.

TL;DR: We're taking a small step back and:

  1. Narrowing the scope to general purpose CI.
  2. Revising supporting documents to incorporate feedback and reflect the newly narrowed scope.
  3. Centering the proposal document around concrete scenarios.

The longer version:

Narrowing scope

We recognize that some of our documentation and process has conflated the requirements and policy of a general purpose CI system with that of the Deployment Pipeline project or another form of continuous delivery/deployment that we are working towards in the long term. While these systems are highly interrelated, they are also distinct and therefore can (and should) be reasoned about separately, for the sake of clarity in forming security policy, modeling threat, and proposing implementation.

The Deployment Pipeline is another important and ambitious project that will no doubt benefit from the success of this one; it both hinges on the success of a well planned and implemented CI platform, and deserves its own properly scoped process of planning, review, and implementation.

At its outset, this project has been driven by a very real need to replace the aging CI system we run now which handles for the most part general purpose workloads, is critical in supporting the daily work of WMF staff and volunteers, and is composed of unmaintained (some fully deprecated) components. Narrowing scope to accomplish a timely replacement seems self-evidently justifiable.

Revising supporting documents

Both the architecture document and Seakeeper proposal will be revised to reflect our narrowed scope and to incorporate the feedback presented here that remains relevant. In addition, we will work to remove requirements beyond our authority to fulfill, such as the self-hosted requirement. (To be clear, we will still work to influence decisions around hosting in accordance with our free-software and privacy values, but we recognize our inability to enforce such requirements and will not jeopardize the success of this project on that front.)

Centering proposal document around concrete scenarios

Taking the feedback to clarify "final components, [...] ownership and stewardship" to heart, we'll be collaborating with CI stakeholders to collect real-world user scenarios which will better inform the design of security mechanisms and modeling of threat in the Seakeeper proposal. By moving into the concrete as much as is possible at this stage, and with a newly narrowed scope, we're hoping we'll achieve the degree of clarity and detail that is necessary to perform a thorough security review.

In the most recent meeting of the CIFWG on 2019-12-11 the notes indicate essential roadblocks in ownership and implementation remain.

  • Staffing, domain knowledge and support concerns raised therein are an issue in conjunction with the as-written Seakeeper proposal.

What are the roadblocks you're seeing around staffing and ownership? The meeting notes do capture some comments made around the staffing requirements for a CI system in general, but RelEng is already tasked and staffed for fulfilling such requirements, as we do on a day-to-day basis now by maintaining our current CI. The Seakeeper proposal seems in line with our current commitments and staffing in that it puts administration of Argo components squarely with RelEng.

Summation

Required information

  • Are the requirements defined in the earliest phases of the project still remaining? (If not, how were they decided and what has changed?)

Hopefully a revised version of the requirements following descoping will make this clearer.

  • What is the developer workflow for this system from conception to production? Will there be testing and staging environments and what is the graduation criteria between them?

There are too many to enumerate fully, as what we are proposing is a replacement general purpose CI platform. Incorporating real-world scenarios into our proposal for the sake of designing security mechanisms and modeling threat will tease out many of the workflows.

  • Is testing->production artifact promotion a necessity here and could it be achieved where artifacts are created in a same or more trusted portion of the pipeline before promotion?
  • Is there a relationship to the existing deployment-prep Cloud environment? It's unclear to us from the docs what the integration and path forward is, though there is clear overlap and relation.

These are some of the aspects that are beyond the scope of a CI platform and more in the realm of continuous deployment or the Deployment Pipeline project. Hopefully separating the two will help us to reason better about both.

threat modeling

  • Consider an RBAC view of end user relationships to the system. Staff, volunteer and trusted volunteer are not defined roles. This would also match and be reflected in the Credentials management and access_control section of the architectural document.

A big point taken here. We'll work to clarify these roles in our modeling.

  • What are the threats to the promotion pipeline proposed in regards to build nodes and the security of the system? Any binary artifact must be built in a same or more trusted environment to be promoted.
  • What are the agreed upon components of the system (testing, staging, production)?

Again, I think these are CD concerns but we should let these concerns (if only in the abstract) inform general CI platform requirements. For example, a requirement to "schedule Argo Workflows based on properties of the origin event payload to specific k8s nodes and namespaces" speaks to these concerns without coupling the CI platform to any future CD implementation detail.
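
To make the example requirement above a little more concrete, here is a minimal Python sketch of mapping an origin event to a placement. Everything in it is an assumption for illustration: the event is treated as a plain dict with a `type` field, the `change-merged` value is assumed to mark a post-merge event, and the namespace names and the `ci/trust` node label are hypothetical. The result is a plain dict, not a complete Argo or Kubernetes manifest.

```python
def placement_for(event: dict) -> dict:
    """Pick a namespace and node selector from properties of the origin event.

    Assumption: events for merged changes carry type == "change-merged";
    anything else is treated as untrusted.
    """
    trusted = event.get("type") == "change-merged"
    zone = "trusted" if trusted else "untrusted"
    return {
        "namespace": f"ci-{zone}",           # hypothetical namespace naming
        "nodeSelector": {"ci/trust": zone},  # hypothetical node label
    }
```

For instance, `placement_for({"type": "patchset-created"})` would select the untrusted namespace and nodes, so workloads from unreviewed patchsets never share nodes with post-merge builds in this sketch.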

3rd party relationships

  • In the 2019-12-11 meeting the possibility of GKE is posed. The WMF has relationships with Rackspace (very small) and AWS (small to modest depending) already. We have three use cases with AWS and possibly more. If we are going to add Google Cloud to this list that conversation should get started sooner than later as we will be splitting what small expertise we have. This is not the first time outside compute and/or resourcing has been a topic. A task for discovery was created previously for cloud services, and (as mentioned) AWS is in use for a few things.
  • The barriers to a 3rd party SaaS, IaaS, PaaS solutions are primarily ideological (potentially), legal, security, and privacy. The Security Team has expertise in-house for evaluating 2 of the 4 and a good working relationship with legal. To this point our understanding of any manifestation of this proposed system has excluded this work due to the first hard requirement established early on in the lifecycle here.

IMHO, the conversation about third-party CI providers (and even PaaS providers to a great extent) is chiefly an administrative question that needs attention from executive management before we can entertain it as a factor in this project's proposal. The best we can do on this front is to make our CI platform design general enough to translate to many different k8s providers while individually pushing for what we think is in line with our foundational values.

TLDR; We believe deciding on final components (incl. environments), ownership and stewardship for components and data locality is required to move forward from here.

The Security-Team would like to express gratitude to release engineering for taking on this monumental task, none of the output here is meant to make light of the work to this point or moving forward. Thank you. We have our own needs and considerations when it comes to a modern pipeline and want to assist in moving this process forward.

(@chasemp @sbassett @Reedy)

And again, thanks to you all for the thoughtful, thorough, and actionable feedback!

chasemp moved this task from Waiting to Our Part Is Done on the Security-Team board.

Post-All-Hands f2f I think there is quite a bit more clarity here, and my understanding is releng is in discovery mode for K8sAAS offerings.

TL;DR: We're taking a small step back and:

Narrowing the scope to general purpose CI.
Revising supporting documents to incorporate feedback and reflect the newly narrowed scope.
Centering the proposal document around concrete scenarios.

Scott and I talked this morning and our feeling is post all this and with some specificity in regards to a 3rd party partner it will be a whole new deal and we prefer not to pollute this task. So for now, I'm going to resolve this.

Sounds good, @chasemp. Looking forward to giving this another go when we're ready.