- Set up a multi-node Kubernetes cluster
- Do things to it!
Description
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | chasemp | T106475 Evaluate a 'cluster solution' for use on Tool Labs
Resolved | | yuvipanda | T107993 Evaluate kubernetes for use on Tool Labs
Event Timeline
I've created the k8s-eval project and am in the process of setting up a 3-node etcd cluster to start with. See https://wikitech.wikimedia.org/wiki/Hiera:K8s-eval (modeled after https://wikitech.wikimedia.org/wiki/Hiera:etcd), but it's failing with:
```
Aug 05 09:15:24 k8s-master-01 etcd[19207]: 2015/08/05 09:15:24 etcd: listening for peers on http://k8s-master-01.k8s-eval.eqiad.wmflabs:2380
Aug 05 09:15:24 k8s-master-01 etcd[19207]: 2015/08/05 09:15:24 etcd: clientTLS: cert = /var/lib/etcd/ssl/certs/cert.pem, key = /var/lib/etcd/ssl/private_keys/server.key, ca =
Aug 05 09:15:24 k8s-master-01 etcd[19207]: 2015/08/05 09:15:24 etcd: listening for client requests on https://k8s-master-01.k8s-eval.eqiad.wmflabs:2379
Aug 05 09:15:24 k8s-master-01 etcd[19207]: 2015/08/05 09:15:24 etcdserver: datadir is valid for the 2.0.1 format
Aug 05 09:15:24 k8s-master-01 etcd[19207]: 2015/08/05 09:15:24 netutil: Resolving k8s-master-01.k8s-eval.eqiad.wmflabs:2380 to 10.68.17.159:2380
Aug 05 09:15:24 k8s-master-01 etcd[19207]: 2015/08/05 09:15:24 netutil: Resolving k8s-master-01.k8s-eval.eqiad.wmflabs:2380 to 10.68.17.159:2380
Aug 05 09:15:24 k8s-master-01 etcd[19207]: 2015/08/05 09:15:24 etcd: stopping listening for client requests on https://k8s-master-01.k8s-eval.eqiad.wmflabs:2379
Aug 05 09:15:24 k8s-master-01 etcd[19207]: 2015/08/05 09:15:24 etcd: stopping listening for peers on http://k8s-master-01.k8s-eval.eqiad.wmflabs:2380
Aug 05 09:15:24 k8s-master-01 etcd[19207]: 2015/08/05 09:15:24 etcd: k8s-master-01 has different advertised URLs in the cluster and advertised peer URLs list
Aug 05 09:15:24 k8s-master-01 systemd[1]: etcd.service: main process exited, code=exited, status=1/FAILURE
```
with the env variables being
```
Environment=ETCD_DATA_DIR=/var/lib/etcd/k8s-etcd
Environment=ETCD_NAME=k8s-master-01
Environment="ETCD_INITIAL_CLUSTER_STATE=new"
Environment="ETCD_INITIAL_CLUSTER=k8s-master-01=https://k8s-master-01.k8s-eval.eqiad.wmflabs:2380"
Environment=ETCD_INITIAL_ADVERTISE_PEER_URLS=http://k8s-master-01.k8s-eval.eqiad.wmflabs:2380
Environment=ETCD_LISTEN_PEER_URLS=http://k8s-master-01.k8s-eval.eqiad.wmflabs:2380
Environment=ETCD_LISTEN_CLIENT_URLS=https://k8s-master-01.k8s-eval.eqiad.wmflabs:2379
Environment=ETCD_ADVERTISE_CLIENT_URLS=https://k8s-master-01.k8s-eval.eqiad.wmflabs:2379
# TLS certs, see https://github.com/coreos/etcd/blob/v2.0.10/Documentation/security.md
# Also note that peer auth is currently broken.
Environment=ETCD_CERT_FILE=/var/lib/etcd/ssl/certs/cert.pem
Environment=ETCD_KEY_FILE=/var/lib/etcd/ssl/private_keys/server.key
ExecStart=/usr/bin/etcd
```
Ah! https://wikitech.wikimedia.org/w/index.php?title=Hiera%3AK8s-eval&type=revision&diff=173110&oldid=173108 made it work. That makes sense: ETCD_INITIAL_CLUSTER pointed at an https:// peer URL while ETCD_INITIAL_ADVERTISE_PEER_URLS used http://, which is exactly the "different advertised URLs" mismatch etcd was complaining about - and TLS for peers is still broken anyway.
Ok, after some false starts there's a 3-node etcd cluster in there now \o/ I've clarified https://wikitech.wikimedia.org/wiki/Etcd a little bit; I should add a bootstrapping section too.
(Docs have been edited with a note of caution + more info.) So that was super simple. Next step is to try to set up flannel.
Ok, so there is now an experimental 'k8s' puppet module, which is applied on all hosts in the k8s-eval project. It has Docker running with a flannel overlay network on all three of them (untested).
Note that you might have to run `brctl delbr docker0` and restart Docker to get it to accept flannel - this is because of missing dependency chains that we ought to fix.
So... for the toollabs webservice use case:
- Code will still be loaded off of NFS
- Each webservice will run in a container in a pod, with a replication controller that keeps it up and a service in front of it (see the sketch after this list)
- We'll provide Precise and Trusty Docker containers that match the environment of our current Trusty and Precise nodes
- Some form of HTTP proxying would be needed.
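To make the pod / replication controller / service bullet concrete, here is a minimal sketch of the v1 API objects a webservice wrapper might create. Everything specific in it (the apiserver address, the image name, the NFS hostPath, the port) is a placeholder, not a decision.

```python
# Sketch only: object shapes follow the Kubernetes v1 API (k8s 1.0);
# the apiserver URL, namespace layout and image name are placeholders.
import requests

APISERVER = "http://k8s-master-01.k8s-eval.eqiad.wmflabs:8080"  # hypothetical
TOOL = "typoscan"  # example tool name

rc = {
    "apiVersion": "v1",
    "kind": "ReplicationController",
    "metadata": {"name": TOOL, "namespace": TOOL},
    "spec": {
        "replicas": 1,
        "selector": {"tool": TOOL},
        "template": {
            "metadata": {"labels": {"tool": TOOL}},
            "spec": {
                "containers": [{
                    "name": "webservice",
                    "image": "tools-webservice-trusty",  # hypothetical image
                    "ports": [{"containerPort": 8000}],
                    # Code still comes off NFS, mounted into the container.
                    "volumeMounts": [{"name": "home",
                                      "mountPath": "/data/project"}],
                }],
                "volumes": [{"name": "home",
                             "hostPath": {"path": "/data/project"}}],
            },
        },
    },
}

svc = {
    "apiVersion": "v1",
    "kind": "Service",
    "metadata": {"name": TOOL, "namespace": TOOL},
    "spec": {"selector": {"tool": TOOL},
             "ports": [{"port": 80, "targetPort": 8000}]},
}

base = "%s/api/v1/namespaces/%s" % (APISERVER, TOOL)
requests.post(base + "/replicationcontrollers", json=rc).raise_for_status()
requests.post(base + "/services", json=svc).raise_for_status()
```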
HTTP proxying should be done with a set of proxying containers (run in a pod, that is); I can work on getting those up if needed.
Proxying between pods is already managed by services, no? What we need is an https terminator / http proxy that maps tools.wmflabs.org/<something> to a service of the same name. Not sure if we want to put the SSL cert in a pod?
I, on the other hand, am quite certain we absolutely should not. :-)
No cert under any of our domains can ever be deployed automatically to instances that may or may not have access by non-NDA people. Means *.wmflabs.org is right out, for instance.
Indeed, so we'll need a proxy / SSL termination instance. It'll probably be a lot simpler than our current setup, of course, since we can just use Kubernetes services as destinations.
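A rough sketch of the lookup such a proxy would do, assuming each tool ends up with a Service named after it in its own namespace (that naming is an assumption, as is the apiserver address):

```python
# Sketch: map a URL prefix like /typoscan to the clusterIP:port of the
# Service named "typoscan" in the namespace "typoscan".
import requests

APISERVER = "http://k8s-master-01.k8s-eval.eqiad.wmflabs:8080"  # hypothetical

def backend_for(tool):
    """Return (ip, port) for the tool's Service, or None if it has none."""
    r = requests.get("%s/api/v1/namespaces/%s/services/%s"
                     % (APISERVER, tool, tool))
    if r.status_code == 404:
        return None
    r.raise_for_status()
    spec = r.json()["spec"]
    return spec["clusterIP"], spec["ports"][0]["port"]
```

The SSL-terminating frontend would sit outside the cluster and only consult this mapping, so the certificate never has to live in a pod.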
As for ACLs to prevent one user from accessing other users' pods and services, we could possibly isolate them in a namespace per user, and write a custom authenticator (https://github.com/kubernetes/kubernetes/blob/master/docs/admin/authentication.md) that uses... identd (which 'works' here because users don't have root on the bastions), and an authorization module that restricts user X to namespace X.
This is possibly a hack :) I'm not sure exactly how we could authenticate better, but there's probably a nicer way.
This would let us directly expose the full Kubernetes API to users for them to do as they want.
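For reference, the identd (RFC 1413) query itself is simple; a sketch of what such an authenticator would have to do, given the client address and the two ports of the TCP connection:

```python
# Minimal RFC 1413 (identd) client: ask the bastion which local user owns
# the connection from client_port (on the bastion) to our server_port.
import socket

def identd_lookup(client_ip, client_port, server_port, timeout=5):
    """Return the username reported by identd on client_ip, or None."""
    s = socket.create_connection((client_ip, 113), timeout=timeout)
    try:
        s.sendall(("%d , %d\r\n" % (client_port, server_port)).encode("ascii"))
        reply = s.recv(1024).decode("ascii", "replace").strip()
    finally:
        s.close()
    # Successful replies look like: "6193, 23 : USERID : UNIX : stjohns"
    parts = [p.strip() for p in reply.split(":")]
    if len(parts) >= 4 and parts[1] == "USERID":
        return parts[3]
    return None
```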
Reading through docs, it looks like we'll also need https://github.com/kubernetes/kubernetes/blob/release-1.0/docs/design/service_accounts.md and https://github.com/kubernetes/kubernetes/blob/release-1.0/docs/design/security_context.md to properly implement our workflow.
Ok, so we can have processes running as a specific uid, which is great. They still run as gid 0, which should hopefully be fixable. We can also use Linux capabilities and SELinux later on if we want to.
So plan:
- Have a ServiceAccount per Tool
- The ServiceAccount will have an associated security context that enforces uid
- Write an authentication plugin using identd
- Write an authorization plugin that makes sure that user X can access only very specific whitelisted things in namespace X (the rule itself is sketched after the next list)
- Write a setup that creates Kubernetes ServiceAccounts and namespaces whenever new tool accounts are created
- Write a webservice wrapper that calls the k8s API. This doesn't need any additional auth/authz steps, since those will be handled on the API side of things.
Ok, so that requires the following missing but planned features:
- Authentication plugins
- More powerful authorization plugins (currently they can't accept / reject based on the actual payload of the request, only based on the resource being requested)
- Being able to associate and enforce security context for all pods / rcs started by a service account
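For what it's worth, the authorization rule the plan needs is tiny once the plugin mechanism exists; stated as a sketch (real Kubernetes authorization plugins are Go code in the apiserver, and the resource whitelist here is purely illustrative):

```python
# Sketch of the per-tool policy: tool X may only touch a whitelisted set of
# resources, and only inside namespace X.
ALLOWED_RESOURCES = {"pods", "replicationcontrollers", "services", "pods/log"}

def authorize(user, verb, namespace, resource):
    """Return True if this tool account may perform the request."""
    if namespace != user:      # tool 'typoscan' lives in namespace 'typoscan'
        return False
    return resource in ALLOWED_RESOURCES
```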
Meh, identd is a no-go since the Authenticator interface uses a Go HTTP Request object that doesn't expose the underlying socket, so I can't actually get the local / remote port pair info needed for identd. So terrible idea dies!!1
Back to square one on authentication :)
Question of understanding: I have only watched a Kubernetes talk once and IIRC the setup was a central file that defined how many containers of which services should be deployed and then it did that.
That would suggest a proxylistener-like service that is called by `webservice start` and amends the container definition to include x containers which run a web server for tool account y on port z, and that on `webservice stop` sets x to 0. Where is the need for auth*? Do the LDAP users not exist in the containers?
Ah, nope. http://kubernetes.io/v1.0/docs/user-guide/overview.html has a nice short overview, but essentially our idea is to have a multi-tenant setup where people can launch their own 'pods' (groups of containers) and use the full-fledged API, with proper authentication and authorization. This is the 'secure' way to do it, since we don't want people messing with other people's pods. It is also the simplest way, since our webservice wrapper can then just call the API and have authentication handled for it, without having to do any security work itself. Depending on the webservice wrapper alone to handle security for us is scary, since then anyone who can find a way to send requests to the API practically has root.
I understand that's your idea, but I don't understand why it's necessary. I think a proxylistener-like service is much easier reviewed for security than a full-fledged auth* plug-in.
Regarding auth*, http://kubernetes.io/v1.0/docs/admin/authentication.html says "Kubernetes uses client certificates, tokens, or http basic auth to authenticate users for API calls." So why not create client certificates for tool accounts (IIUC that they need to use the API in your model) in the same way as ~/replica.my.cnfs are created currently? (I always wanted that for MySQL + PostgreSQL + inter-tool communication because it doesn't require creating ~/postgresql.password, ~/catscan.password, ~/yet-another-service.password, etc.)
Yeah, looks like a Client Certificate generator + putting it on NFS might be the easiest way to go. I was trying to not use it because:
- NFS dependency
- In the past we've had users who chmod their replica.my.cnf to publicly accessible permissions. If they do that with the certificate, the consequences would be rather more disastrous.
But since it's a CA + client certs, this is probably fairly flexible, and we can do things like rotate the certs all the time, etc.
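For the record, minting a per-tool client cert from a project-local CA is only a couple of openssl calls; a sketch (paths, validity period and the CN convention are all made up here):

```python
# Sketch: mint a client certificate for one tool, signed by a project CA.
# Paths, validity and the CN scheme (CN = tool account name) are illustrative.
import subprocess

def make_client_cert(tool, ca_cert="/etc/k8s-ca/ca.pem",
                     ca_key="/etc/k8s-ca/ca-key.pem", days=90):
    key = "/tmp/%s.key" % tool
    csr = "/tmp/%s.csr" % tool
    crt = "/tmp/%s.crt" % tool
    subprocess.check_call(["openssl", "genrsa", "-out", key, "2048"])
    subprocess.check_call(["openssl", "req", "-new", "-key", key,
                           "-subj", "/CN=tools.%s" % tool, "-out", csr])
    subprocess.check_call(["openssl", "x509", "-req", "-in", csr,
                           "-CA", ca_cert, "-CAkey", ca_key,
                           "-CAcreateserial", "-days", str(days), "-out", crt])
    return key, crt
```

A short validity period would also make the "rotate the certs all the time" part cheap.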
You don't need to involve NFS (directly). You can set up the certificate generator on any machine as a HTTP/whatever service, do the auth* via is_in_tools_project($IP) && identd($IP, $port) and then let the "client" write it to the local disk (which will probably be NFS :-)).
But a certificate generator certainly requires a process maintaining a certificate revocation list and distributing it to where it needs to be.
One simple thing that could work right now for you for authentication is to generate a password or a token per user, put it in a file in each user's directory, and also put that same credential in the API server's password and/or token file. And have some periodic process to add new ones for new users.
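Concretely, that is just the apiserver's static token file (--token-auth-file, one "token,user,uid" line per user) plus a copy of the token in the tool's home; a sketch of the periodic generator (all paths are placeholders):

```python
# Sketch of the static-token approach: one random token per tool, written
# both to the tool's home and to the apiserver's --token-auth-file
# (CSV format: token,user,uid). Paths are placeholders.
import csv
import os
import random
import string

TOKEN_FILE = "/etc/kubernetes/tokenauth.csv"   # hypothetical path
HOME_ROOT = "/data/project"                    # tool homes on NFS

def new_token(length=32):
    rng = random.SystemRandom()
    return "".join(rng.choice(string.ascii_letters + string.digits)
                   for _ in range(length))

def ensure_token(tool, uid):
    token = new_token()
    path = os.path.join(HOME_ROOT, tool, ".kube-token")
    with open(path, "w") as f:
        f.write(token + "\n")
    os.chmod(path, 0o400)   # would also need to be chowned to the tool account
    with open(TOKEN_FILE, "a") as f:
        csv.writer(f).writerow([token, "tools.%s" % tool, str(uid)])
    # The apiserver only reads the token file at startup, so the periodic job
    # would batch up new tools and then restart it.
```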
Having a namespace per user is a reasonable authorization solution, assuming users don't need to collaborate with each other.
You would just want the administrator to be able to create and delete namespaces, so you would want some periodic process that runs as the admin to ensure that a namespace for each user exists.
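The namespace half of that periodic process could look roughly like this (apiserver address, CA bundle, admin token and the tool-listing function are all placeholders):

```python
# Sketch: make sure every tool account has a matching namespace.
# list_tools() stands in for whatever enumerates tool accounts (LDAP here).
import requests

APISERVER = "https://k8s-master-01.k8s-eval.eqiad.wmflabs:6443"  # hypothetical
ADMIN_AUTH = {"Authorization": "Bearer ADMIN-TOKEN-HERE"}        # placeholder
CA_BUNDLE = "/etc/k8s-ca/ca.pem"                                 # placeholder

def ensure_namespace(tool):
    url = "%s/api/v1/namespaces/%s" % (APISERVER, tool)
    if requests.get(url, headers=ADMIN_AUTH, verify=CA_BUNDLE).status_code == 200:
        return
    ns = {"apiVersion": "v1", "kind": "Namespace", "metadata": {"name": tool}}
    requests.post("%s/api/v1/namespaces" % APISERVER, json=ns,
                  headers=ADMIN_AUTH, verify=CA_BUNDLE).raise_for_status()

def sync_namespaces(list_tools):
    for tool in list_tools():
        ensure_namespace(tool)
```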
Another approach entirely would be to run OpenStack Keystone as a store of passwords and of definitions of which users are on which projects. This PR, which just merged, adds authentication using OpenStack Keystone: https://github.com/kubernetes/kubernetes/pull/11798. And this issue discusses writing a per-namespace authorizer which uses OpenStack Keystone as a database of which users can use which projects (if users want to share access to each other's projects): https://github.com/kubernetes/kubernetes/issues/11626
Sorry, this is a bit of an active area of development!
Some thoughts about service accounts and security contexts:
I saw in the other issue you said that each user's jobs run as the user that submitted them.
For batch jobs, this does make sense.
For customers that are running long-running services (web services, streaming data processing pipelines, etc.), it may not make sense to run the services with the same identity as the user that created them, since there might be multiple humans who are responsible for creating and updating the service's jobs. So you end up wanting a service account for the identity of the pods.
But you might not need that level of complexity.
I see that you noticed that pods have a "Security Context" which controls what unix uid the pod runs as.
You also figured out that we don't yet have a standard story for how to enforce what values are allowed.
However, you can write an "admission control plugin" to enforce this policy. An admission controller is basically some middleware that runs between authorizing a request and persisting the object that is being created/updated. An admission control plugin for you could do the following: when a request to create a pod arrives with authorized user U, map U to its unix uid, and then write that uid to pod.spec.containers[*].securityContext.runAsUser. That seems sufficiently useful to other people that we'd take that upstream.
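The mutation described above is small; real admission plugins are Go code compiled into the apiserver, so this is only the logic, with a made-up uid lookup:

```python
# Sketch of the admission-control logic: given the already-authenticated user
# U creating a pod, force every container to run as U's unix uid.
def uid_for(user):
    # Placeholder: in Tool Labs this would come from LDAP (tool account uid).
    return {"tools.typoscan": 51234}.get(user)

def admit_pod(user, pod):
    """Mutate (and return) the pod spec so it can only run as user's uid."""
    uid = uid_for(user)
    if uid is None:
        raise ValueError("no unix uid known for %s, rejecting pod" % user)
    for container in pod["spec"].get("containers", []):
        ctx = container.setdefault("securityContext", {})
        ctx["runAsUser"] = uid
    return pod
```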
(@Etune: In Toolforge, all web services (and almost all of the grid jobs) run as so-called tool accounts/service groups (cf. https://wikitech.wikimedia.org/wiki/Special:NovaServiceGroup). I.e., humans log in as, for example, scfc, then sudo to tools.typoscan, then use `webservice start` to start a web service that will run as tools.typoscan.)
Thanks for chiming in, @Etune! I wasn't aware of admission control plugins - that sounds like exactly what we need :)
And as @scfc mentions, in our case we should probably not have any human users and just have one service account per tool.
So, looks like we can get what we want with:
- Each 'tool' corresponds to a service account
- Password / token / cert auth, with autogeneration of credentials and autocreation of service accounts when a tool is created (from the web interface); a sketch of the service-account half follows below
- Use an admission controller to make sure that a tool can access only its own namespace, and that the security context of any pods / rcs a tool creates is restricted to run with the tool's UID / GID
For the token / auth generation, we can use the same mechanism we have for generating mysql credentials.
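The service-account half of the autocreation step (the namespace half was sketched earlier) would then be roughly (URLs, CA bundle and credentials again placeholders):

```python
# Sketch: create a ServiceAccount named after the tool in the tool's namespace.
import requests

APISERVER = "https://k8s-master-01.k8s-eval.eqiad.wmflabs:6443"  # hypothetical
ADMIN_AUTH = {"Authorization": "Bearer ADMIN-TOKEN-HERE"}        # placeholder
CA_BUNDLE = "/etc/k8s-ca/ca.pem"                                 # placeholder

def ensure_service_account(tool):
    sa = {"apiVersion": "v1", "kind": "ServiceAccount",
          "metadata": {"name": tool, "namespace": tool}}
    r = requests.post("%s/api/v1/namespaces/%s/serviceaccounts"
                      % (APISERVER, tool),
                      json=sa, headers=ADMIN_AUTH, verify=CA_BUNDLE)
    if r.status_code != 409:          # 409 = already exists, which is fine
        r.raise_for_status()
```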
Can you say a little bit more about what a "tool" is, or point to docs, so I can make sure I understand your use case?
@yuvipanda Exciting! I'll be interested in how this plays out with my mediawiki-docker project.