
Create a helm chart for the cloudnativepg postgresql operator
Open, In Progress, High, Public

Description

We will need to deploy the cloudnativepg postgresql operator to the dse-k8s cluster as part of our project to T362788: Migrate Airflow to the dse-k8s cluster

In order to achieve this, we will need a helm chart for the operator.

The chosen operator per T362999 is cloudnativepg

The upstream project provides an official chart, which we could choose to evaluate: https://github.com/cloudnative-pg/charts/tree/main/charts/cloudnative-pg and use either as inspiration or to incorporate as-is.

They also provide a number of sample charts with which they aim to demonstrate some of the features such as continuous backup and point-in-time recovery: https://github.com/cloudnative-pg/charts/tree/main/charts/cluster
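
For context, a CloudNativePG database cluster is declared through the operator's Cluster custom resource. A rough sketch of what one looks like (names, sizes and the backup destination are placeholders, and field details should be checked against the upstream documentation):

# Rough sketch of a Cluster resource; all values here are placeholders,
# not a proposed configuration for Airflow.
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: airflow-example
spec:
  instances: 3
  storage:
    size: 20Gi
  backup:
    barmanObjectStore:
      destinationPath: s3://example-bucket/backups/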

Event Timeline

Gehel triaged this task as High priority. May 17 2024, 12:44 PM
Gehel moved this task from Incoming to Scratch on the Data-Platform-SRE board.
Gehel moved this task from Scratch to 2024.05.06 - 2024.05.26 on the Data-Platform-SRE board.

I've started to have a look at https://github.com/cloudnative-pg/charts/blob/main/charts/cloudnative-pg/templates. The required RBAC list is ... intimidating.

A good half has to do with specific PG resources, but some of the RBAC, such as the ability to delete PVCs, Services and Secrets, while understandable, might raise a few eyebrows, especially because these permissions are attached to ClusterRole resources, and not just Roles.

Yes, understood. We had a similar issue with the spark-operator chart when we installed that, although in that case we used the scaffold (because there wasn't really a policy allowing upstream charts at all at the time).
See, for example, similarly scary RBAC: https://github.com/wikimedia/operations-deployment-charts/blob/master/charts/spark-operator/templates/rbac-spark-operator-elevated.yml#L18-L20

We might want to check whether the features that need certain RBAC entries can be disabled with options. For example, some operators can install their own CRDs if they have the rights to do so, but we might wish to disable these rights and install the CRDs ourselves. I'm happy to look through the list with you, if it helps.

I have opened an upstream chart review request and communicated to that effect in the k8s SIG IRC channel.

brouberol changed the task status from Open to In Progress.Fri, May 31, 7:50 AM

Change #1037731 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] Import the upstream cloudnative-pg chart for inspection

https://gerrit.wikimedia.org/r/1037731

Change #1037732 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] Disable chart dependency as we're not leveraging grafana dashboard creation

https://gerrit.wikimedia.org/r/1037732

Change #1037733 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] Add upstream version annotation

https://gerrit.wikimedia.org/r/1037733

Change #1037734 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] Enable cloudnative-pg-operator on the dse-k8s-eqiad k8s cluster

https://gerrit.wikimedia.org/r/1037734

Change #1037735 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] Add values overrides for the cloudnative-pg-operator chart

https://gerrit.wikimedia.org/r/1037735

Thanks for this, and thanks for documenting the selection process in T362999. It's probably worth updating the summary of that task with a quick note about the conclusion and the chosen solution.

Notes

I use the following emoji notation on the headings of the various sections:

status      emoji
undecided   ❓
OK          ✅
No          ❌

Overall comments

Overall, the architecture of the operator seems well thought out and nicely documented (including warnings and important notes), alongside explanations of Kubernetes architecture and recommendations on how to properly structure a cluster.

They also support DR (disaster recovery), but we aren't going to be able to use such a thing in dse-k8s. It's nice that it is there, though.

One important question here. They make sure to point out repeatedly that sharing storage is discouraged and apparently insist on local PVCs. Just to have it clear: you aren't intending to run this over Ceph, right?

CustomResourceDefinitions ❓

  • The CRDs alone are a 14.5k-line file (the overall file is ~900kB). While huge, it isn't that surprising (Calico's CRDs, by comparison, are at 4k lines). They are trying to target multiple platforms and allow working with a ton of configuration tunables, across multiple operating-system families, on almost 40-year-old software. Thankfully it's if-guarded. I would suggest we never turn that on, but rather ship the CRDs in their own chart (following the exact same pattern we have with Calico, cfssl-issuer, knative etc.); see the sketch after this list. This is method #2 from chart_best_practices/custom_resource_definitions/ and avoids the caveats of method #1.
  • I purposefully didn't review 14.5k lines, but a cursory look suggests they are trying to configure things like sysctl settings et al., making me a bit wary, so I put a question mark here. I don't have a proposal though, and even if we reviewed the 14.5k lines of CRDs, we probably couldn't do much about it.
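
For reference, a minimal sketch of the if-guard pattern mentioned above, as it would appear in a chart template; the value name (crds.create) and the abridged schema are assumptions to show the shape, not the upstream chart's exact code:

{{- /* Hypothetical if-guard: only render the CRDs when explicitly enabled.
       The value name is assumed, not taken from the upstream chart. */}}
{{- if .Values.crds.create }}
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: clusters.postgresql.cnpg.io
spec:
  group: postgresql.cnpg.io
  # ... ~14.5k lines of schema omitted ...
{{- end }}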

_helpers.tpl ✅

They are essentially recreating what we have in our _helpers. Not surprising at all. The serviceaccount creation helper is toggled via values.yaml, which is a plus and shows some attention to detail (it's only helpers, they could have gotten away without the toggle).
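
As an illustration of that pattern (not the upstream chart's exact code; the value keys and helper name are assumptions), a serviceaccount toggle typically looks like this:

# values.yaml (assumed keys)
serviceAccount:
  create: true
  name: ""

# templates/serviceaccount.yaml
{{- if .Values.serviceAccount.create }}
apiVersion: v1
kind: ServiceAccount
metadata:
  name: {{ include "cloudnative-pg.serviceAccountName" . }}   # assumed helper name
{{- end }}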

configmaps/secrets ✅

Pretty basic YAML structure, with 2 toggles: 1 to decide whether to create the configuration at all, and 1 to decide whether it should be populated as a Secret or not. The data is minimal; I see 10 toggleables in https://cloudnative-pg.io/documentation/current/operator_conf/#available-options.
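
A minimal sketch of what those two toggles plus the data could look like in values.yaml; the key names (config.create, config.secret) are assumptions and may not match the upstream chart exactly:

# Assumed key names for the two toggles described above.
config:
  create: true    # whether to render the operator configuration at all
  secret: false   # false -> ConfigMap, true -> Secret
  data:
    # example option from the operator_conf documentation linked above
    INHERITED_ANNOTATIONS: ""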

There is 1 slightly weird monitoring configmap that just gets data from values.yaml straight as YAML. In fact that structure is 385/555 lines of values.yaml. It feels a bit weird to expose all of this as-is in values.yaml. Do we expect to be modifying it much?

monitoring ✅

There's a monitoring subchart that creates Grafana dashboards. This ain't gonna work here, but it's behind a condition. I see you have a patch to remove it; I don't think it's necessary given the condition. Leave it set to false and document why we set it to false.
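
For context, a subchart condition of that kind looks roughly like this in Chart.yaml and values.yaml; the dependency name, repository and condition path here are assumptions, not the upstream chart's actual ones:

# Chart.yaml (assumed names and placeholder repository)
dependencies:
  - name: grafana-dashboard
    version: "x.y.z"
    repository: "https://example.invalid/charts"
    condition: monitoring.grafanaDashboard.create

# values.yaml
monitoring:
  grafanaDashboard:
    create: false   # no Grafana dashboard provisioning available on dse-k8s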

deployment ✅

Nothing weird here, but 2 questions. One: what is the scratch-data emptyDir that is used by the controller for? Two: the manager appears to have leader-election capabilities. Do we expect to need this?
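
To illustrate what the two questions refer to, here is the general shape of such a deployment (a sketch only; the mount path and flag name are assumptions, not the chart's exact manifest):

# General shape of a controller deployment with a scratch emptyDir and
# leader election; mount path and flag name are assumed for illustration.
spec:
  containers:
    - name: manager
      args:
        - --leader-elect        # typical controller-runtime flag; assumed here
      volumeMounts:
        - name: scratch-data
          mountPath: /controller   # assumed path
  volumes:
    - name: scratch-data
      emptyDir: {}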

mutating/validating web hooks ✅

Nothing to note here. There are 3 mutating ones (2 for backups, 1 for cluster management) and 4 validating ones (2 for backups, 1 for cluster management, 1 for pool management).

podmonitor ❌

This ain't gonna work. It relies on a monitoring.coreos.com/v1 CRD that they don't ship. It's if-guarded (which is good), but we probably don't want resources in our charts that use CRDs we don't ship. It would cause confusion to future readers. I'd suggest removing it from the chart.

service ✅

Nothing to note here; it's the service that is used to address the webhook.

RBAC ❌

And we reach the expected difficult part. This ships:

  • 1 ClusterRole that is named with a helper and suffixed with -edit. It defines rules for doing pretty much CUD (but not R) actions on all resources that use the CRDs shipped with the chart. Those rules are also aggregated into the standard edit and admin roles in the cluster. There is no binding for this; the intent apparently is to add those rights to the standard roles.
  • 1 ClusterRole that is the read-only complement of the one above, suffixed -view. Aggregated again into the edit, admin and view roles in the cluster. This last part means that standard deployers can view the configuration of postgresql clusters. Again, no binding that I can see; the intent once more is that the standard roles get the rights.
  • 1 ServiceAccount (what the operator will run with presumably)
  • 1 ClusterRoleBinding that binds the serviceaccount mentioned above to a ClusterRole named via a helper (named same as the ones above but no suffix)
  • That 1 ClusterRole that ... can do preeeety much anything.
    • It can fetch/delete/modify configmaps AND secrets from all namespaces,
    • It can watch/list/get all namespaces
    • It can watch/list/get all nodes
    • Can create/delete/list/get/get status of all PVCs across all namespaces
    • Can create/delete/list/get all pods across all namespaces
    • Exec into any pod in any namespace (what on earth does it even need this for?)
    • It can create/list/get/watch all serviceaccounts across all namespaces
    • Do anything to all services across the cluster
    • monitor CRDs
    • monitor validating and mutating webhooks
    • Anything to any job across all namespaces
    • Mess with pod disruption budgets everywhere
    • AND ... the best part. It can create/delete/list/watch/update RoleBindings and Roles across the cluster. Privilege escalation as a service...

Thankfully, not shipping those rules is a toggle. But also, I am not even sure I want these to be in the repo. They will end up confusing someone (most probably me) at some point in the future.

My suggestion is to not ship those RBAC rules in a cluster that is meant to house any other workload next to the postgres databases. Instead, cut them out from the chart, modify them appropriately to limit them to the namespaces you wish and ship them as a separate thing.
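
To make that suggestion concrete, a minimal sketch of what a namespace-scoped replacement could look like; the namespace, names and rule list are illustrative assumptions, not the final set of permissions:

# Illustrative only: a namespaced Role/RoleBinding for the operator
# serviceaccount instead of the chart's ClusterRole. All names and the
# rule list are assumptions to show the shape, not the final permissions.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: cloudnative-pg-operator
  namespace: postgresql-airflow          # one Role per watched namespace
rules:
  - apiGroups: [""]
    resources: ["pods", "services", "persistentvolumeclaims", "configmaps", "secrets"]
    verbs: ["create", "delete", "get", "list", "watch", "patch", "update"]
  - apiGroups: ["postgresql.cnpg.io"]
    resources: ["clusters", "backups", "poolers"]
    verbs: ["create", "delete", "get", "list", "watch", "patch", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: cloudnative-pg-operator
  namespace: postgresql-airflow
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: cloudnative-pg-operator
subjects:
  - kind: ServiceAccount
    name: cloudnative-pg-operator
    namespace: cloudnative-pg-operator   # assumed operator namespace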

Change #1049084 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] cloudnative-pg: create charts only containing the CRDs

https://gerrit.wikimedia.org/r/1049084

Change #1049085 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] cloudnative-pg: disable RBAC management within the chart

https://gerrit.wikimedia.org/r/1049085

Change #1049086 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] cloudnative-pg: move queries to configmap

https://gerrit.wikimedia.org/r/1049086

Change #1049087 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] cloudnative-pg: set image values

https://gerrit.wikimedia.org/r/1049087

Change #1049109 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] cloudnative-pg: allow the specification of watched namespaces

https://gerrit.wikimedia.org/r/1049109

@akosiaris I was reading the operator code and found out that you can specify which namespace it should watch and create resources into.

I've created a small patch to allow the watched namespaces to be listed in the operator configuration values. This way, I think we should be able to live without any ClusterRole and ClusterRoleBinding, because we'd be able to create Role resources tied to the operator service account and the namespaces in which we'd deploy PG clusters.
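
For illustration, the operator's documented WATCH_NAMESPACE option would end up in the rendered operator configuration roughly like this; the values key feeding it and the ConfigMap name are assumptions, with the exact wiring defined by the patch above:

# values.yaml (assumed key name; namespaces are placeholders)
config:
  data:
    WATCH_NAMESPACE: "postgresql-airflow,postgresql-test"

# rendered operator configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: cnpg-controller-manager-config    # assumed name
data:
  WATCH_NAMESPACE: "postgresql-airflow,postgresql-test"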

Change #1049114 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] cloudnative-pg: add CI fixtures

https://gerrit.wikimedia.org/r/1049114

Change #1037735 abandoned by Brouberol:

[operations/deployment-charts@master] Add values overrides for the cloudnative-pg-operator chart

Reason:

https://gerrit.wikimedia.org/r/1037735

Change #1037732 abandoned by Brouberol:

[operations/deployment-charts@master] Disable chart dependency as we're not leveraging grafana dashboard creation

Reason:

https://gerrit.wikimedia.org/r/1037732

I had a second look at the RBAC, and some non-namespaced resources would still require a ClusterRole and associated binding:

  • node (RO)
  • namespace (RO)
  • mutatingwebhookconfigurations (RW)
  • validatingwebhookconfigurations (RW)
  • customresourcedefinitions (RW)
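
A minimal sketch of what that reduced ClusterRole could look like, with the resource list taken from the bullets above (verbs and names are assumptions, not the final ruleset):

# Illustrative only: a ClusterRole restricted to the non-namespaced
# resources listed above. Verbs mirror the RO/RW notes and are assumptions.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cloudnative-pg-operator-cluster-scoped
rules:
  - apiGroups: [""]
    resources: ["nodes", "namespaces"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["admissionregistration.k8s.io"]
    resources: ["mutatingwebhookconfigurations", "validatingwebhookconfigurations"]
    verbs: ["get", "list", "watch", "patch", "update"]
  - apiGroups: ["apiextensions.k8s.io"]
    resources: ["customresourcedefinitions"]
    verbs: ["get", "list", "watch", "patch", "update"]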

The rest of the permissions can be namespace-scoped. Within these ns-scoped permissions, we see the following block:

- apiGroups:
  - rbac.authorization.k8s.io
  resources:
  - rolebindings
  verbs:
  - create
  - get
  - list
  - patch
  - update
  - watch
- apiGroups:
  - rbac.authorization.k8s.io
  resources:
  - roles
  verbs:
  - create
  - get
  - list
  - patch
  - update
  - watch

In the current configuration, these roles/rolebindings would only be scoped to the watched namespaces. However, I'd like to know whether we could do without them.
To do that, I went to the operator code.

What I found was the following Go function:

// CreateRole create a role with the permissions needed by the instance manager
func CreateRole(cluster apiv1.Cluster, backupOrigin *apiv1.Backup) rbacv1.Role {
	rules := []rbacv1.PolicyRule{
		{
			APIGroups: []string{
				"",
			},
			Resources: []string{
				"configmaps",
			},
			Verbs: []string{
				"get",
				"watch",
			},
			ResourceNames: getInvolvedConfigMapNames(cluster),
		},
		{
			APIGroups: []string{
				"",
			},
			Resources: []string{
				"secrets",
			},
			Verbs: []string{
				"get",
				"watch",
			},
			ResourceNames: getInvolvedSecretNames(cluster, backupOrigin),
		},
		{
			APIGroups: []string{
				"postgresql.cnpg.io",
			},
			Resources: []string{
				"clusters",
			},
			Verbs: []string{
				"get",
				"list",
				"watch",
			},
			ResourceNames: []string{
				cluster.Name,
			},
		},
		{
			APIGroups: []string{
				"postgresql.cnpg.io",
			},
			Resources: []string{
				"clusters/status",
			},
			Verbs: []string{
				"get",
				"patch",
				"update",
				"watch",
			},
			ResourceNames: []string{
				cluster.Name,
			},
		},
		{
			APIGroups: []string{
				"postgresql.cnpg.io",
			},
			Resources: []string{
				"backups",
			},
			Verbs: []string{
				"list",
				"get",
				"delete",
			},
		},
		{
			APIGroups: []string{
				"postgresql.cnpg.io",
			},
			Resources: []string{
				"backups/status",
			},
			Verbs: []string{
				"get",
				"patch",
				"update",
			},
		},
		{
			APIGroups: []string{
				"",
			},
			Resources: []string{
				"events",
			},
			Verbs: []string{
				"create",
				"patch",
			},
		},
	}

	return rbacv1.Role{
		ObjectMeta: metav1.ObjectMeta{
			Namespace: cluster.Namespace,
			Name:      cluster.Name,
		},
		Rules: rules,
	}
}

The configmaps (RO), secrets (RO), clusters (RO) and clusters status (RW) permissions are all scoped to the resources of the cluster itself. The backups (RW), backups status (RW) and events (W) are not.

As the code does not fence the creation/patching of such roles behind an if condition, I would tend to say that these permissions, scoped to both fixed namespaces and cluster resources, are acceptable, but I'm happy to hear all your thoughts.