Page MenuHomePhabricator

Handle sidecar containers in one-off Kubernetes jobs
Open, MediumPublic

Description

For T341553, we can start up a MediaWiki pod managed as part of a Job (which runs to completion and then terminates) instead of a Deployment (which stays up and serves indefinitely).

At present, those pods never complete. A pod is only Complete when all its containers terminate, and MediaWiki pods have a number of containers in addition to the app: service mesh proxy, Prometheus exporter, mcrouter, etc. After the maintenance script completes, and the "main" app container terminates, those "sidecar" containers (and therefore the pod, and therefore the job) linger forever.

What we need is a way to express that only the app container should count, and the pod should be wrapped up even if sidecar containers are still running -- as long as they're the only containers still running. KEP 753 proposes to add that functionality, but as I write this, it just got approved for Beta targeting Kubernetes v1.29 (expected December 2023) so we can try it when we're upgraded to that point.

In the meantime, there's an open-source controller https://github.com/otherguy/k8s-controller-sidecars that does the same job: you give your pod a pod.kubernetes.io/sidecars annotation listing the sidecar containers, and this controller does the needful. (Long run, we might have an interesting time keeping that annotation up-to-date if new sidecars are added and removed, but we can think about that later.)

On a review (there's not that much code there) it looks like it does what it says on the tin, so we can give it a try in prod and see if it solves the job lifecycle problem. I'll template that annotation behind the same values flag that we're using to switch from Deployment to Job, so it won't have any effect on the normal MW deployment's lifecycle.

Event Timeline

RLazarus triaged this task as Medium priority.Oct 5 2023, 10:37 PM

k8s-controller-sidecars uses exec to send SIGTERM to PID 1 of the containers (it executes "sh -c 'kill -s TERM 1'"). So this ultimately requires all pods to have a Debian base (which they probably have). I noticed that the example manifest in the repo also requests delete permissions for pods - I'm not sure why that should be required and I could not find code actually using it.

Another possible solution (depending on the actual implementation I guess) could be to enable sharing of the PID namespace for the mediawiki job pods. That way the "pause" process will become PID 1 and you should be able to SIGTERM the job has completed to tear down the whole pod.

Both good points, thanks.

Another fun fact is that the maximum length for a Kubernetes label is 63 characters, meaning this solution is only viable with fewer sidecar containers, shorter container names, or both. Current structure of the release name is mwscript-[username]-[script name]-[random 8 chars] which means our sidecars value would be

mediawiki-mwscript-rzl-version-xwyavd0b-httpd,
mediawiki-mwscript-rzl-version-xwyavd0b-httpd-exporter,
mediawiki-mwscript-rzl-version-xwyavd0b-php-fpm-exporter,
mediawiki-mwscript-rzl-version-xwyavd0b-mcrouter,
mediawiki-mwscript-rzl-version-xwyavd0b-mcrouter-exporter,
mediawiki-mwscript-rzl-version-xwyavd0b-tls-proxy,
mediawiki-mwscript-rzl-version-xwyavd0b-rsyslog

(newlines added for readability) which is 363 characters. So that won't work. Even if we changed our naming convention to stop repeating the pod name in the container name, it'd be

httpd,httpd-exporter,php-fpm-exporter,mcrouter,mcrouter-exporter,tls-proxy,rsyslog

which is 82 characters and doesn't fit either.

But we can fork the controller to use a different marker -- my current thought is to add a label like sidecar: true to each container, rather than putting it on the pod. Then it's a relatively minor adjustment to the controller's logic in handler.go. (It's also much more like the KEP 753 proposal, where sidecar behavior is configured on the container.)

Unfortunately ther is no such thing as an container label. Why not just got he other way round and use the label to specify the "primary container" of a Pod (considering all others sidecars)?

Doh. Okay, that's a good reason not to go that route. :) I'll give the "primary container" label a try, thanks.

@jijiki pointed out we already have a CronJob running with a mesh sidecar in production. That is from the tegola chart and it uses a wrapper script to ensure envoy is up before the job starts and terminates it after (https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/tegola/+/refs/heads/wmf/v0.19.x/scripts/pregenerate-maps-tiles.sh). So that chart could benefit from a proper solution as well.

Yep, sounds like a similar fit.

I guess I didn't hit Submit on the last update -- it's not labels, it's annotations (which makes more sense anyway -- I even remember double-checking for that reason, but either there's a doc error somewhere or I just whiffed it).

Annotations don't have the length problem, so the sidecars controller should work fine as it is. I'm working on debianizing and re-dockerizing, so that we can give it a try, but if it works for mwscript it'll work for tegola too.

I still wonder why upstream choose to list all the sidecars instead of just declaring the "primary container". Not sure if it's worth investigating in a patch though.

Change 972535 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/docker-images/production-images@master] k8s-controller-sidecars: Initial release

https://gerrit.wikimedia.org/r/972535

Change 972535 merged by RLazarus:

[operations/docker-images/production-images@master] k8s-controller-sidecars: Initial release

https://gerrit.wikimedia.org/r/972535

@JMeybohm Can you help with an RBAC issue?

I'm adapting the manifest from upstream, my current attempt is here:

I get this from helmfile apply:

Error: rendered manifests contain a resource that already exists. Unable to continue with install: could not get information about the resource Role "sidecar-job-controller" in namespace "mw-script": roles.rbac.authorization.k8s.io "sidecar-job-controller" is forbidden: User "mw-script-deploy" cannot get resource "roles" in API group "rbac.authorization.k8s.io" in the namespace "mw-script"

So, first question, I switched from upstream's ClusterRole to using a Role on the mw-script namespace, on the assumption that the reduced scope would make this easier. Was that right, or should we just do this cluster-wide like upstream?

(Eventually we'll want this functionality to be available for everyone's Jobs and CronJobs -- I don't have strong feelings about whether that's managed by a controller instance per namespace, or a shared one with cluster-wide permissions. And either way it's okay to experiment with a solution for mw-script first and then generalize it once it's working.)

And broadly, what's the proper way to go about applying the role config?

The error arises because we don't allow regular deployers to create RBAC objects in the cluster. The solution depends a bit on what if we want to limit the access scope of the controller to particular namespaces or not. Personally I would say that we should come up with a limited access approach, e.g. we would need to be able to enable the controller on a per namespace level. For that to work you need to:

  • Change the Role back to a ClusterRole (it won't grant and additional permissions, that is actually done by the (Cluster)RoleBinding. Role vs. ClusterRole just defined where the Role lives (namespaced vs. global). As we need the exact same Role in each namespace we want to enable the controller in, it's fine (preferred even) to just have to define the Role once (as ClusterRole).
  • Deploy the Deployment and ServiceAccount to a dedicated "system namespace" (see helmfile.d/admin_ng/values/main.yaml#L15, documentation at helmfile.d/admin_ng/values/common.yaml#L4)
  • Extend the dreaded helmfile.d/admin_ng/helmfile_namespaces.yaml inventing a boolean flag to enable the controller per namespace. When that is true for a namespace, deploy a RoleBinding object like:
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: sidecar-job-controller
  namespace: {{ $namespace }}
subjects:
  - kind: ServiceAccount
    name: sidecar-job-controller
    namespace: <THE DEDICATED CONTROLLER NAMESPACE>
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: sidecar-job-controller

All the above will take place in admin_ng, so you will deploy with admin permissions instead of a deploy user (meaning you will be able to create RBAC objects).

I'm not 100% sure this works though as the controller itself tried to register watchers for all namespaces (https://github.com/otherguy/k8s-controller-sidecars/blob/main/main.go#L46-L61). I'm hoping it won't error out but continue with those objects/watchers it is granted access to - but that you'd have to check.

Change 981703 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/deployment-charts@master] admin_ng: Add namespace and ClusterRole for Job sidecar controller

https://gerrit.wikimedia.org/r/981703

Change 981704 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/deployment-charts@master] admin_ng: Switch on enableJobSidecarController for mw-script

https://gerrit.wikimedia.org/r/981704

Super helpful explanation, thank you! https://gerrit.wikimedia.org/r/981703 should do the above, and https://gerrit.wikimedia.org/r/981704 adds the binding for the mw-script namespace. I'll deploy those like wikitech:Kubernetes/Add_a_new_service#Deploy_changes_to_helmfile.d/admin_ng and that will let me finish experimenting with the helm charts for both the sidecar controller and the actual jobs.

I'm not 100% sure this works though as the controller itself tried to register watchers for all namespaces (https://github.com/otherguy/k8s-controller-sidecars/blob/main/main.go#L46-L61). I'm hoping it won't error out but continue with those objects/watchers it is granted access to - but that you'd have to check.

Yeah, good point. Fortunately it looks like a pretty straightforward Go patch to add a --namespace flag if need be, and pipe it through to the API call.

Yeah, good point. Fortunately it looks like a pretty straightforward Go patch to add a --namespace flag if need be, and pipe it through to the API call.

Maybe call it --namespaces and we can use it in any namespace where it might be needed - I reckon this technique will be useful for all charts that need a Job object.

I reckon this technique will be useful for all charts that need a Job object.

Agreed. Per discussion up-thread though, the plan is to run an instance per namespace, for least-privilege reasons. Obviously we'll make sure it's easy to switch it on, but each controller will only be looking at a single namespace.

Change 981703 merged by jenkins-bot:

[operations/deployment-charts@master] admin_ng: Add namespace and ClusterRole for Job sidecar controller

https://gerrit.wikimedia.org/r/981703

Change 981704 merged by jenkins-bot:

[operations/deployment-charts@master] admin_ng: Switch on enableJobSidecarController for mw-script

https://gerrit.wikimedia.org/r/981704

Change 982497 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/deployment-charts@master] admin_ng: Add the sidecar-job-controller ServiceAccount

https://gerrit.wikimedia.org/r/982497

Change 982497 merged by jenkins-bot:

[operations/deployment-charts@master] admin_ng: Add the sidecar-job-controller ServiceAccount

https://gerrit.wikimedia.org/r/982497

Yeah, as foretold:

E1214 22:34:14.253396       1 reflector.go:148] /go/gitlab.wikimedia.org/repos/sre/k8s-controller-sidecars/k8s-controller-sidecars/controller.go:37: Failed to watch *v1.Pod: failed to list *v1.Pod: pods is forbidden: User "system:serviceaccount:mw-script:sidecar-job-controller" cannot list resource "pods" in API group "" at the cluster scope

I'll make that change on our fork of the controller. There's a yak under here someplace, I'm sure of it.

Bummer...the change you proposed would require us to deploy one sidecar-controller per namespace (probably this is the yak you are looking for :-)) - which I don't think is ideal in terms of resource usage and deployment complexity.

There are probably two other options:

  1. Extent the controller to allow to run one controller per namespace, this is potentially a bit of work and has performance drawbacks as it's way more expensive then just one controller loop/cache/etc.
  1. Allow the controller get,list,watch for pods cluster wide to satisfy the requirements of a watcher for all namespaces but only allow pod/exec on a per namespace level. We could then provide a list of "enabled namespaces" to a central controller instance that drops events for objects not in the list of namespaces.

Oh, I misunderstood what you meant by "enable the controller on a per namespace level" above! I thought deploying one instance per namespace was what you had in mind.

Okay yes, in that case I'm with you. I think your option #2 is probably the way to go, I'll rework the patch.

Just to save a review roundtrip -- what does the correct RBAC setup look like in that case? Do we split the ClusterRole into two, one for pods with a corresponding ClusterRoleBinding in helmfile_rbac.yaml and one for pods/exec with a corresponding RoleBinding templated in helmfile_namespaces.yaml?

Change 983963 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/deployment-charts@master] admin_ng: Split the sidecar-job-controller role into two

https://gerrit.wikimedia.org/r/983963

Oh, I misunderstood what you meant by "enable the controller on a per namespace level" above! I thought deploying one instance per namespace was what you had in mind.

Sorry, I probably wasn't super clear about that.

Okay, let me know if https://gerrit.wikimedia.org/r/983963 plus the most recent iteration of https://gitlab.wikimedia.org/repos/sre/k8s-controller-sidecars/-/merge_requests/1 is what you had in mind...

Yes, exactly what I had in mind. I left some nits in the changes.

Change 983963 merged by jenkins-bot:

[operations/deployment-charts@master] admin_ng: Split the sidecar-job-controller role into two

https://gerrit.wikimedia.org/r/983963

Change 988847 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/deployment-charts@master] Helm chart for k8s-controller-sidecars

https://gerrit.wikimedia.org/r/988847

Change 988848 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/deployment-charts@master] admin_ng: Install k8s-controller-sidecars

https://gerrit.wikimedia.org/r/988848

Change 988847 merged by jenkins-bot:

[operations/deployment-charts@master] Helm chart for k8s-controller-sidecars

https://gerrit.wikimedia.org/r/988847

Change 988848 merged by jenkins-bot:

[operations/deployment-charts@master] admin_ng: Install k8s-controller-sidecars

https://gerrit.wikimedia.org/r/988848

Change 1005212 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/deployment-charts@master] k8s-controller-sidecars: Add missing namespaces

https://gerrit.wikimedia.org/r/1005212

Change 1005212 merged by jenkins-bot:

[operations/deployment-charts@master] k8s-controller-sidecars: Add missing namespace

https://gerrit.wikimedia.org/r/1005212

Change 1005219 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/deployment-charts@master] k8s-controller-sidecars: Bump the pod's memory

https://gerrit.wikimedia.org/r/1005219

Change 1005219 merged by jenkins-bot:

[operations/deployment-charts@master] k8s-controller-sidecars: Bump the pod's memory

https://gerrit.wikimedia.org/r/1005219

Change 1006606 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/deployment-charts@master] k8s-controller-sidecars: Add the other missing namespace

https://gerrit.wikimedia.org/r/1006606

Change 1006606 merged by jenkins-bot:

[operations/deployment-charts@master] k8s-controller-sidecars: Add the other missing namespace

https://gerrit.wikimedia.org/r/1006606

sidecar-controller regularly gets OOMKilled (Logstash, Grafana) despite it being configured to monitor only the mw-script namespace.

Since it's not in active use yet, I'm not pushing a change to ramp up its memory usage, but we should probably have a look at why it's spiking memory so much watching only one quasi-inactive namespace.

Thanks. At present the controller monitors all namespaces, but ignores pods other than in mw-script. So if I were estimating memory usage I'd base it on the total number of pod events in the cluster, not just in the namespace.

I still don't honestly expect it to need as much as it's got, but the OOMs do seem to roughly line up with spikes in pod creation and deletion. The right thing to do is probably filter the watch call, so the controller never finds out about pods it doesn't care about, instead of filtering them out after being notified. That's an easy code change for monitoring a single namespace, but I have to dig a little to see if it's still easy for monitoring several (which we don't do now but will eventually).

If that doesn't take care of it, we can still just double the memory, it isn't taking much.

(Also FWIW, this does now have real users.)

[...] The right thing to do is probably filter the watch call, so the controller never finds out about pods it doesn't care about, instead of filtering them out after being notified. That's an easy code change for monitoring a single namespace, but I have to dig a little to see if it's still easy for monitoring several (which we don't do now but will eventually).

Indeed, now I recall. It's unfortunately not possible to filter a watch call server side. IIRC we decided to do the watch on all Pods because it will only require one stream from the apiserver instead of N (one per namespace).

Change #1024362 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] sidecar-controller: Bump memory on wikikube

https://gerrit.wikimedia.org/r/1024362

I've uploaded a patch to bump the memory limit to 1G, since I've seen it spike up to 980M.

Change #1024362 merged by jenkins-bot:

[operations/deployment-charts@master] sidecar-controller: Bump memory on wikikube

https://gerrit.wikimedia.org/r/1024362