
Develop evaluation criteria for comparing Platform as a Service (PaaS) solutions
Closed, Declined · Public

Description

Determine must/should/may evaluation criteria for comparing FLOSS PaaS solutions. These should cover both operational and end user concerns.

Add proposed criteria as comments on this task and then summarize here in the main description as consensus is found.

Announced on labs-l: https://lists.wikimedia.org/pipermail/labs-l/2016-May/004494.html

What's a PaaS and why do I care?

PaaS is an acronym for "Platform as a Service". In this case the platform being described is a software hosting platform and the "as a Service" part means that the platform will be hosted in the Tool Labs project of the Wikimedia Labs public cloud computing offering. The Labs team has already chosen Kubernetes as the next generation replacement for Open Grid Engine to handle the process of running tools and bots on a flexible grid of compute nodes. Using Kubernetes directly is possible, but it provides a really large number of choices for the end user to make.
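
To give a feel for the "large number of choices" involved in using Kubernetes directly, here is a minimal sketch of a one-off job submitted straight against the Kubernetes API with the official Python client. Every name, image, and resource number below is an illustrative placeholder; even this stripped-down version forces the maintainer to decide on restart policy, resource requests, image, and more, which are exactly the decisions a PaaS would make for them.

```
from kubernetes import client, config

config.load_kube_config()  # assumes a working kubeconfig for the cluster

# A one-off job that runs a (hypothetical) bot script once and then exits.
job = client.V1Job(
    metadata=client.V1ObjectMeta(name="example-bot-run"),
    spec=client.V1JobSpec(
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                restart_policy="Never",
                containers=[
                    client.V1Container(
                        name="bot",
                        image="example-registry/python:3",  # placeholder image
                        command=["python3", "bot.py"],
                        resources=client.V1ResourceRequirements(
                            requests={"cpu": "250m", "memory": "256Mi"},
                            limits={"cpu": "500m", "memory": "512Mi"},
                        ),
                    )
                ],
            )
        )
    ),
)

client.BatchV1Api().create_namespaced_job(namespace="tool-example", body=job)
```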

A PaaS would add an opinionated command and control layer on top of Kubernetes and hide most or all of the complexity of the underlying system from the average Tool Labs maintainer. Another way to think of this is that a PaaS is to Kubernetes as jsub is to Open Grid Engine. The jsub program and its siblings jstart and qcronsub form a platform that Tool Labs maintainers have been using since they moved over from the Toolserver. The conventions of where files must be placed on the bastions, where output logs are sent, and how to start and stop grid jobs form a type of PaaS over Open Grid Engine.

jsub has a couple of downsides. First and foremost, it is a homegrown system, which means that no one else helps share the burden of maintaining the software and its documentation. Additionally, it is a "leaky abstraction" in that it allows, and in some cases requires, the user to use Open Grid Engine terminology and command line arguments to accomplish certain tasks. Finally, it is a system that still allows a nearly infinite amount of freedom in how things are done. This can be great for the occasional power user, but it is challenging for the typical beginner because there are no best practices to follow and a large number of choices to make.

A PaaS doesn't need to be inflexible, but ideally it will provide an obvious best way to do most things that lowers the initial learning curve and complexity for a typical small Tool Labs project. For the programming language geeks in the audience, I'm hoping for something closer to Python than Perl. Some tools may end up being more difficult to fit into a given PaaS workflow than others due to the opinions enforced by the PaaS.

Event Timeline

I would also suggest spending some time playing with Heroku / Google App Engine to get a sense of the PaaS developer UX before doing this.

T128158: Tools web interface for tool authors (Brainstorming ticket) includes some discussion of pain points and accidental complexity in the current workflows that may be useful to review. Many of the comments I have left there are focused on the initial Labs & Tool Labs account creation process, which will likely not be covered by the PaaS.

No criteria, but Fedora ships OpenShift Origin with its current beta (Fedora 24). I haven't tested it, but my understanding is that this would allow one to "easily" set up a copy of https://www.openshift.com/. (If something similar exists for Heroku/GAE, pointers appreciated.)

I like that idea from both an administration and a user perspective, because if any problem comes up, there will be more people who share the setup and can thus help (and I assume Red Hat offers consulting services as well).

valhallasw moved this task from Backlog to "Ready to be worked on" on the Toolforge board.

I'm not sure what the explicit criterion is exactly, but the PaaS should have some tooling/integration/support for debug log collection. An ideal solution would also provide tooling for examining logs.

See also T127367: [toolforge,jobs-api,webservice,storage] Provide modern, non-NFS log solution for Toolforge webservices and bots

bd808 renamed this task from "Develop evaluation criteria for comparing PaaS solutions" to "Develop evaluation criteria for comparing Platform as a Service (PaaS) solutions". Jul 25 2016, 11:20 PM

Several users of the current kubernetes solution have asked for the ability to use custom Docker images or at least to be able to install packages via apt-get that are specific to their tool.

Being able to use custom Docker images would be extremely helpful, because it removes the need to compile from source, find binaries, or ask for packages to be installed.

I'll note that starting from https://lists.wikimedia.org/pipermail/labs-l/2016-May/004494.html and related documents, one can click through several layers of links without ever finding a definition of what is being discussed here. The summary of this task is also unclear; one might as well read it as looking for a third-party PaaS on which to host Labs.

Yes, we are currently in the process of deciding if we should move to GCP or AWS (or even Heroku) for tools. Glad to clear that up.

Upstream project oriented criteria:

  • FLOSS project with OSI/FSF/DFSG approved/compatible license
  • Public project roadmaps
  • Active volunteer community
  • Good (relative term I know) end user documentation upstream
  • Good maintainer/installer documentation upstream
  • Documentation also under an open license
  • Release packaging that is compatible with Wikimedia workflows (signed debs or easy to build local packages basically)
  • Responsible vulnerability reporting policy and documented security release workflows

Tenant oriented criteria:

  • Ability to customize container by installing additional applications/libraries (i.e. things that can be installed with apt-get)
  • Repeatable builds/tagging to allow rolling back to a known good state following a failed deployment
  • Easily scriptable workflow for deploying changes
  • Support for short lived/terminating jobs (e.g. monthly bot scripts)
  • Scheduled job submission (e.g. cron; see the CronJob sketch after this list)
  • Any tooling that runs on a client (i.e. not on the Tool Labs servers) must be cross-platform (Windows, Linux, OS X)
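
To make the "short lived jobs" and "scheduled job submission" points concrete, here is a minimal sketch of what a scheduled job looks like when expressed directly against Kubernetes with the official Python client; a PaaS would presumably hide this behind something closer to a crontab entry. The namespace, image, and schedule are illustrative placeholders, and the sketch assumes a recent client where the batch/v1 CronJob API is available.

```
from kubernetes import client, config

config.load_kube_config()  # assumes a working kubeconfig for the cluster

# A CronJob that runs a (hypothetical) monthly bot script.
cron = client.V1CronJob(
    metadata=client.V1ObjectMeta(name="monthly-report"),
    spec=client.V1CronJobSpec(
        schedule="0 3 1 * *",  # 03:00 on the first day of every month
        job_template=client.V1JobTemplateSpec(
            spec=client.V1JobSpec(
                template=client.V1PodTemplateSpec(
                    spec=client.V1PodSpec(
                        restart_policy="Never",
                        containers=[
                            client.V1Container(
                                name="report",
                                image="example-registry/python:3",  # placeholder image
                                command=["python3", "report.py"],
                            )
                        ],
                    )
                )
            )
        ),
    ),
)

client.BatchV1Api().create_namespaced_cron_job(namespace="tool-example", body=cron)
```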

Nice to haves:

  • Log aggregation via CLI and/or web interface
  • Ability to define health checks for a service (e.g. /foo/bar/test should return a 200 OK response in <5s; see the probe sketch after this list)
  • Alerting for failed health checks
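
For reference, the kind of health check described above maps fairly directly onto a Kubernetes HTTP readiness probe. The sketch below uses the official Python client; the path, port, and timings are just the illustrative values from the bullet. Alerting on failures would still need separate tooling, which is why it is listed as its own item.

```
from kubernetes import client

# An HTTP readiness probe: GET /foo/bar/test must return 200 within 5 seconds.
probe = client.V1Probe(
    http_get=client.V1HTTPGetAction(path="/foo/bar/test", port=8000),
    period_seconds=30,    # how often the check runs
    timeout_seconds=5,    # the "<5s" expectation from the bullet above
    failure_threshold=3,  # consecutive failures before the pod is marked unready
)

container = client.V1Container(
    name="webservice",
    image="example-registry/webservice:latest",  # placeholder image
    readiness_probe=probe,
)
```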

From a 'requirements' point of view - that is, things that we need to build/consider in our current environment before we can use a PaaS. This list is non-exhaustive, and not all of these are hard requirements - but all of them require study and consideration.

  1. Support for 'custom' kubernetes setups. Ours is different from, say, GKE in the following ways:
    1. We have Service Accounts disabled. Service accounts are still insecure to use with authentication / authorization schemes we use (one user per namespace). We will allow enabling them when this restriction goes away.
    2. We do not have Cluster DNS, primarily because it relies on (1)
    3. We do not have any Ingress controller setup - instead we do our own custom fake thing. We should replace this with an actual ingress controller at some point soon.
    4. We do not have any useful persistent attachable storage volumes. We 'fake' NFS with hostPath and by mounting NFS on the worker nodes, since our authorization scheme for NFS (from nfs-exportsd) relies on a stable set of IP addresses to provide service to. We do not have Ceph or anything else that could provide attached network storage. (A rough sketch of the hostPath pattern follows this list.)
    5. We have custom extra Admission controllers (See https://wikitech.wikimedia.org/wiki/Tools_Kubernetes#Custom_admission_controllers). They place other restrictions on what pods can run, primarily around what uid a container can run as, what docker registry it can pull images from and what hostPaths it can mount.
    6. We do not have a LoadBalancer (exposes an IP to the external world for a service) either.
  Some of these are fixable and under our control (Ingress, Cluster DNS), some require upstream to catch up (Service Accounts), some are non-negotiable (such as the UID and registry enforcement), and some are under our control but require a lot of work (such as persistent storage).
  2. Requirements for persistent storage. The PaaS machinery itself might require object storage (swift) or block storage (ceph?) to store blobs / config / whatever.
  3. Requirements for docker registry setup. Right now we only have a single, non-redundant docker registry. This is OK for now, since we can rebuild all images quickly. If we cannot (e.g. because we allow arbitrary Docker image building), we'll need to build and scale a docker registry backed by swift.
  4. Base operating system requirements. We'd like to continue using Debian (rather than a 'container native' host). This is primarily a function of 'do we really have the manpower to support a different base host?' but I'm not personally tied to it. However, I'd definitely not want to move to a non-Debian-like OS (e.g. Fedora).
  5. Any other stateful storage requirements - such as a database.
  6. Do any of these components need extra 'real' hardware? Ideally we'll just run everything in VMs in the tools project.
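
As a concrete illustration of the "fake NFS" pattern from point 1.4: the worker node mounts NFS itself, and pods reach it through a hostPath volume. The sketch below uses the official Kubernetes Python client; the paths and names are made up and not the exact Toolforge mounts. A real network-attached persistentVolume (NFS, Ceph, etc.) would replace the hostPath piece, which is exactly the gap described above.

```
from kubernetes import client

# A hostPath volume pointing at a directory the worker node has NFS-mounted.
nfs_volume = client.V1Volume(
    name="tool-home",
    host_path=client.V1HostPathVolumeSource(path="/data/project/example-tool"),
)

container = client.V1Container(
    name="bot",
    image="example-registry/python:3",  # placeholder image
    volume_mounts=[
        client.V1VolumeMount(name="tool-home", mount_path="/data/project/example-tool")
    ],
)

pod_spec = client.V1PodSpec(containers=[container], volumes=[nfs_volume])
```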

Operational concerns - these are things that will make the life of the people who are 'taking care' of the PaaS bearable or hellish.

  1. How many PaaS components are there, and how will they be deployed? Will they be deployed with puppet/something else, or in-cluster? What's the upstream support like if we need to deviate from them for good reasons?
  2. How will upgrades be performed? If the PaaS is decoupled, can the individual components be upgraded one by one, with at least N-1 version compatibility? Or will they need to be upgraded all at the same time, resulting in possible downtime?
  3. What's the failure mode for PaaS component failure? Do user applications get affected, or only new builds / operations on the PaaS itself? In kubernetes, the master failing doesn't directly affect user applications, which means a kubernetes master outage isn't immediately user visible. This is a great feature to have; it makes life a lot less stressful when you get paged.
  4. What's the monitoring situation for the PaaS components? Can they integrate with our current / preferred setup (which I am going to say is 'prometheus') well?
  5. How are the PaaS components configured? Can this configuration be easily maintained with version control?
  6. Do they need to version match with kubernetes components as well?

Community concerns - does the PaaS have a big enough community to sustain itself over time?

  1. Does the community have a Code of Conduct? Are they nice people?
  2. Are there active community support forums with participation by core developers? Both realtime-ish (IRC / Slack) and async (mailing lists / discourse)? How active are they?
  3. Is this a traditional 'open-core' product, where some features will be provided under a non-free license to paying customers only?
  4. Is there a community of people running this outside the commercial clouds (GCP, AWS, Azure)? Are they able to openly talk about it? Are they doing it without an army of consultants?
  5. Is the development dominated by one company, or is it more diverse? Will it still be around in 5 years? Open Grid Engine pretty much died because of this (Oracle...).
  6. How is the documentation? User facing documentation is very important (so we do not have to write it ourselves), but also distinct from admin facing documentation. Are these two separate and well maintained?
  7. What is the community's relationship with kubernetes upstream?

A lot of good thoughts here. I don't think either of the two contenders I know of can satisfy it all :)

An addendum to the community aspect above: what is their relationship to Kubernetes upstream?

@chasemp yup, added :) Yeah, I think we need to distill these into a set of criteria we can evaluate. My feeling a year ago when we started using k8s was that none of the PaaS projects would fit us yet. Not sure how it'll be now.

Integration - we already have a bunch of stuff we use in our infrastructure, so whatever we pick should integrate with it.

  1. Git hosting - does it require its own, or can we plug into phabricator?
  2. Authentication / Authorization - how does this work? Can we plug into LDAP as we have it? Can we force containers launched by a user with a specific UID to run as that UID only? (We already do this for kubernetes namespaces, so this only requires that we be able to assign each tool a 'namespace' and that the PaaS respects it.)
  3. Does this plug into the kubernetes authentication and authorization mechanisms we use? Currently that's ABAC+tokenauth, but it should hopefully move at some point to RBAC+CA certs (a minimal RBAC sketch follows this list).
  4. Does this still allow direct access to kubectl and the kubernetes API? Can you mix them? If so, how? Direct access is a prerequisite I think - the abstraction is going to be leaky anyway, and it'll suck if you can't 'reach in'. Without it, tools like PAWS would also be impossible.
  5. Does this touch puppet at all? This depends on whether it is deployed in-cluster or out of cluster. If out of cluster, are there easy ways to deploy it (debs?)?
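
As a rough illustration of what the per-tool authorization model could look like under RBAC (one namespace per tool, as described in point 2), here is a sketch using the official Kubernetes Python client. The names and rules are placeholders, not the actual Toolforge policy; a RoleBinding (omitted here) would then tie the Role to the tool's user.

```
from kubernetes import client, config

config.load_kube_config()  # assumes a working kubeconfig for the cluster

# A Role scoped to a single (hypothetical) tool namespace, granting the tool
# control over its own workloads and nothing outside its namespace.
role = client.V1Role(
    metadata=client.V1ObjectMeta(name="tool-example-access", namespace="tool-example"),
    rules=[
        client.V1PolicyRule(
            api_groups=["", "apps", "batch"],
            resources=["pods", "pods/log", "deployments", "jobs", "cronjobs"],
            verbs=["get", "list", "watch", "create", "delete"],
        )
    ],
)

client.RbacAuthorizationV1Api().create_namespaced_role(
    namespace="tool-example", body=role
)
```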

I'd like to repeat my statement from the introduction of Kubernetes: To me, it makes much more sense to embrace it instead of using it as something that it was not intended to be used for.

When I go to https://www.openshift.org/, I am practically forced to use Git for source code management. I need to be able to rebuild the application from scratch every time. Every deployment is logged and can be reverted. Secrets can be easily managed as environment variables. This ticks a number of boxes that are currently unticked for vital bots and web applications.
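
The "secrets as environment variables" point is worth illustrating: OpenShift has its own tooling for this, but the underlying Kubernetes mechanism it builds on looks roughly like the sketch below (via the official Python client; the secret and key names are made up).

```
from kubernetes import client

# Expose one key of a (hypothetical) Kubernetes Secret to the container
# as an environment variable at runtime.
env = client.V1EnvVar(
    name="WIKI_BOT_PASSWORD",
    value_from=client.V1EnvVarSource(
        secret_key_ref=client.V1SecretKeySelector(
            name="example-tool-secrets", key="bot-password"
        )
    ),
)

container = client.V1Container(
    name="bot",
    image="example-registry/python:3",  # placeholder image
    env=[env],
)
```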

When there is a problem, I can ask other OpenShift users for help; they don't need to know about service groups (in the Cloud-Services sense) or anything else, because OpenShift follows the paradigm that containers are already isolated, so everything can be root. Users who want to do something truly complicated or awesome with containers can also do so without having to deal with Labs peculiarities unless they need to.

If I wanted to add access to the replica servers, /public/dumps, or some persistent storage, I could package that as cartridges, again allowing users to solve problems outside of these areas with support from everybody else, and only resort to the Cloud-Services team when something Labs-specific is not working.

So, again, I'd like to suggest setting up a PaaS that is organizationally part of Toolforge, but not technically. It should be a separate project, perhaps on separate hardware, where only the best practices for that specific PaaS are allowed. The only connection to the existing project should be that maintainers can redirect their tool's webspace from https://tools.wmflabs.org/$tool/ to https://$tool.tools.wmflabs.org/, to be served by the PaaS.

I'd like to repeat my statement from the introduction of Kubernetes: To me, it makes much more sense to embrace it instead of using it as something that it was not intended to be used for.

This is really my goal as well, I think. I'm personally against making the existing jsub workflow target Kubernetes as a backend. I would much rather spend that energy on getting an evaluation of PaaS options done and the chosen winner deployed. The existing webservice integration made a lot of sense to give us some experience in actually running things on k8s. At the time that work was started, OpenShift and Deis Workflow really didn't exist in a usable state.

We do need to make a plan to sunset OGE in the larger roadmap. That plan needs to include time and support resources to get people migrated to Kubernetes. Ideally that migration would move people directly to the workflow enforced/encouraged by the PaaS, but we really don't know enough today to determine if that is practical. The migration plan, however, is orthogonal to the need to establish comparison criteria for PaaS products. The only point of potential overlap is determining to what extent methods for supporting existing NFS-based workflows are weighted in the PaaS decision. I think we need that evaluation point in the matrix regardless of the weight it is given in the final evaluation.

I pretty much agree with all the things @bd808 said, and most of the things that @scfc said :)

A lot of these questions are answered by adopting an opinionated Kubernetes platform like Rancher. It does take multi-tenancy and multi-cloud very seriously and could be a good platform for deploying something based on Knative later on. It's not as comprehensive as OpenShift on the application layer but I think that's actually a good thing for us because OpenShift might be a bit too enterprise-y or heavyweight for our needs.

I know we're still very early in these discussions, but I wanted to raise awareness of this option. It certainly would let us deploy an open source project with a strong community and decrease our reliance on custom scripts for keeping the whole thing up in the air.

  1. Support for 'custom' kubernetes setups. Ours is different from, say, GKE in the following ways:
    1. We have Service Accounts disabled. Service accounts are still insecure to use with authentication / authorization schemes we use (one user per namespace). We will allow enabling them when this restriction goes away.

Fixed in our new cluster. They are enabled, used and secured via RBAC and PodSecurityPolicy.

  2. We do not have Cluster DNS, primarily because it relies on (1)

Also fixed.

  3. We do not have any Ingress controller setup - instead we do our own custom fake thing. We should replace this with an actual ingress controller at some point soon.

Fixed, but our own custom thing may still be useful. The thing here is: everyone must either enable some custom thing or move to a proprietary offering like GKE/AKS, etc. Weirdly enough, our custom thing may prove to be better in some ways than Ingress if combined with calico very effectively. Even projects like OpenShift and Rancher ultimately fall into the same traps--but once we don't have to support the old cluster, all this may disappear in a poof of DNS and possibly a slightly different ingress 😁

  4. We do not have any useful persistent attachable storage volumes. We 'fake' NFS with hostPath and mounting NFS on the worker nodes, since our authorization scheme for NFS (from nfs-exportsd) relies on a stable set of IP addresses to provide service to. We do not have Ceph or anything else that could provide attached network storage.

We are still doing this because namespaces make such volumes awkward to use. However, the statement above is now partly incorrect: NFS is, in some ways, the standard that persistentVolumes were developed around. The structure of Toolforge around shared NFS is a much bigger hurdle to overcome. With Ceph almost here and the current state of k8s development (the points above were written quite a while ago), if the grid and k8s are separated, we'll be in a much better situation.

  5. We have custom extra Admission controllers (See https://wikitech.wikimedia.org/wiki/Tools_Kubernetes#Custom_admission_controllers). They place other restrictions on what pods can run, primarily around what uid a container can run as, what docker registry it can pull images from and what hostPaths it can mount.

Fixed in T215678: Replace each of the custom controllers with something in a new Toolforge Kubernetes setup

  6. We do not have a LoadBalancer (exposes an IP to the external world for a service) either.

This is fixable using BGP fun in Calico with Ingress or even without it (using dynamicproxy).

I think our new cluster will resolve most of the concerns about ours not being the same as GKE (back then). Now, I daresay, we will have some good advantages over it (and they could still be ported to a public cloud without vendor lock-in).

If we set up a triggerable workflow that can be hit by git hooks of some kind, the rest can be done, even buildpacks. I'm going to suggest that there is no (and likely will not be) off-the-shelf solution for this creature. However, there are platforms such as Kubernetes and all its various critters that can be pieced together carefully to meet all the criteria. We are well on our way now with the new cluster (presuming we can overcome a few hurdles during the migration) as a solid foundation to build upon.
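
To make the "triggerable workflow hit by git hooks" idea a bit more tangible, here is a purely hypothetical sketch of the receiving end: a tiny HTTP endpoint that a git post-receive hook could POST to, which then hands off to whatever build/deploy machinery (buildpacks or otherwise) sits behind it. Nothing here reflects an actual Toolforge service; the names and payload format are invented.

```
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def trigger_build(tool, ref):
    """Placeholder: hand off to the real build/deploy pipeline (buildpacks, etc.)."""
    print(f"would rebuild {tool} at {ref}")


class HookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the (hypothetical) JSON payload sent by the git hook.
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        trigger_build(payload.get("tool", "unknown"), payload.get("ref", "HEAD"))
        self.send_response(202)  # accepted for asynchronous processing
        self.end_headers()


if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8080), HookHandler).serve_forever()
```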

Beyond this update on the current state, I'd like to add that we would do well to be more involved in the decision-making and development around the tools we bring into the PaaS design, including the Kubernetes working groups. We cannot expect any of these projects to conform to our requirements as passive consumers while the large players have such heavy hands in them. We don't have a terribly unique use case at this point for Kubernetes, workflows, gitops, etc., except that it has semi-public access, which means we don't need to be heavily involved--just a bit. Many companies avoid letting their devs directly interact with Kubernetes because they break things and because of the learning curve. I think WMCS (and other Toolforge organizations, if revived) should stay involved in the projects we knit together, since there is often community governance and development going on--unlike in, say, Grid Engine and jsub. I think that's one of the biggest differences. On that last point, I can't really say as much for OpenShift and Rancher, by the way, as good as they are.

I just saw @Bstorm's awesome response to earlier comments, and am very happy with the way things have gone. THANK YOU!

We are still moving towards a solution for T194332: [Epic,builds-api,components-api,webservice,jobs-api] Make Toolforge a proper platform as a service with push-to-deploy and build packs, but it is not taking the form of a deployment of an existing third-party FLOSS product at this time, which makes this task moot.