$ tofu apply

OpenTofu used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols:
  + create

OpenTofu will perform the following actions:

  # local_file.wikikube_config will be created
  + resource "local_file" "wikikube_config" {
      + content              = (known after apply)
      + content_base64sha256 = (known after apply)
      + content_base64sha512 = (known after apply)
      + content_md5          = (known after apply)
      + content_sha1         = (known after apply)
      + content_sha256       = (known after apply)
      + content_sha512       = (known after apply)
      + directory_permission = "0777"
      + file_permission      = "0777"
      + filename             = "wikikube.config"
      + id                   = (known after apply)
    }

  # openstack_containerinfra_cluster_v1.wikikube will be created
  + resource "openstack_containerinfra_cluster_v1" "wikikube" {
      + api_address         = (known after apply)
      + cluster_template_id = (known after apply)
      + coe_version         = (known after apply)
      + container_version   = (known after apply)
      + create_timeout      = (known after apply)
      + created_at          = (known after apply)
      + discovery_url       = (known after apply)
      + docker_volume_size  = (known after apply)
      + fixed_network       = (known after apply)
      + fixed_subnet        = (known after apply)
      + flavor              = (known after apply)
      + floating_ip_enabled = (known after apply)
      + id                  = (known after apply)
      + keypair             = (known after apply)
      + kubeconfig          = (sensitive value)
      + labels              = (known after apply)
      + master_addresses    = (known after apply)
      + master_count        = 1
      + master_flavor       = (known after apply)
      + name                = "wikikube"
      + node_addresses      = (known after apply)
      + node_count          = 1
      + project_id          = (known after apply)
      + region              = (known after apply)
      + stack_id            = (known after apply)
      + updated_at          = (known after apply)
      + user_id             = (known after apply)
    }

  # openstack_containerinfra_clustertemplate_v1.wikikube_template will be created
  + resource "openstack_containerinfra_clustertemplate_v1" "wikikube_template" {
      + cluster_distro        = (known after apply)
      + coe                   = "kubernetes"
      + created_at            = (known after apply)
      + dns_nameserver        = "8.8.8.8"
      + docker_storage_driver = "overlay2"
      + docker_volume_size    = 80
      + external_network_id   = "wan-transport-eqiad"
      + fixed_network         = "lan-flat-cloudinstances2b"
      + fixed_subnet          = "cloud-instances2-b-eqiad"
      + flavor                = "g4.cores8.ram32.disk20"
      + floating_ip_enabled   = false
      + id                    = (known after apply)
      + image                 = "Fedora-CoreOS-38"
      + labels                = {
          + "cloud_provider_enabled"    = "true"
          + "container_runtime"         = "containerd"
          + "containerd_tarball_sha256" = "1d86b534c7bba51b78a7eeb1b67dd2ac6c0edeb01c034cc5f590d5ccd824b416"
          + "containerd_version"        = "1.6.20"
          + "hyperkube_prefix"          = "docker.io/rancher/"
          + "kube_tag"                  = "v1.26.8-rancher1"
        }
      + master_flavor         = "g4.cores2.ram4.disk20"
      + name                  = "wikikube"
      + network_driver        = "flannel"
      + project_id            = (known after apply)
      + region                = (known after apply)
      + server_type           = (known after apply)
      + updated_at            = (known after apply)
      + user_id               = (known after apply)
    }

Plan: 3 to add, 0 to change, 0 to destroy.

Do you want to perform these actions?
  OpenTofu will perform the actions described above.
  Only 'yes' will be accepted to approve.

  Enter a value: yes

openstack_containerinfra_clustertemplate_v1.wikikube_template: Creating...
openstack_containerinfra_clustertemplate_v1.wikikube_template: Creation complete after 5s [id=fb4b8790-5c0c-4092-bfad-3b7a48de09f0]
openstack_containerinfra_cluster_v1.wikikube: Creating...
openstack_containerinfra_cluster_v1.wikikube: Still creating... [10s elapsed]
openstack_containerinfra_cluster_v1.wikikube: Still creating... [20s elapsed]
openstack_containerinfra_cluster_v1.wikikube: Still creating... [30s elapsed]
openstack_containerinfra_cluster_v1.wikikube: Still creating... [40s elapsed]
openstack_containerinfra_cluster_v1.wikikube: Still creating... [50s elapsed]
openstack_containerinfra_cluster_v1.wikikube: Still creating... [1m0s elapsed]
╷
│ Error: Error waiting for openstack_containerinfra_cluster_v1 d50c7032-5a62-4084-a27c-35f93653c746 to become ready: openstack_containerinfra_cluster_v1 is in an error state: Failed to create trustee or trust for Cluster: d50c7032-5a62-4084-a27c-35f93653c746
│
│   with openstack_containerinfra_cluster_v1.wikikube,
│   on magnum.tf line 1, in resource "openstack_containerinfra_cluster_v1" "wikikube":
│    1: resource "openstack_containerinfra_cluster_v1" "wikikube" {
│
╵
Description
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Open | | None | T53494 Use Beta cluster as a true canary for code deployments (epic) |
| Declined | | None | T87220 Minimize infrastructure differences between Beta Cluster and production |
| Open | | None | T276650 Re-consider setting up a Kubernetes cluster on the Beta cluster |
| Resolved | Spike | bd808 | T372498 Figure out how to provision a Kubernetes cluster using Magnum and OpenTofu |
| Resolved | BUG REPORT | bd808 | T372365 OpenTofu fails to provision a Magnum managed k8s cluster in deployment-prep |
Event Timeline
I'm not quite sure how to start troubleshooting this at the moment... I guess I will start by digging around in logs.
https://gitlab.wikimedia.org/bd808/deployment-prep-opentofu is the tofu config I was trying to apply. The only thing not committed to that repo is a secrets.auto.tfvars file with os_application_credential_id and os_application_credential_secret values taken from application credentials for my user in the deployment-prep project.
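For anyone reproducing this, the uncommitted file could be stubbed out roughly as follows. This is a minimal sketch: the two variable names are the ones mentioned above, while the helper name `write_secrets_template` and the placeholder values are made up for illustration.

```shell
# Write a placeholder secrets.auto.tfvars into the given directory.
# The variable names come from the comment above; the helper name and
# the placeholder values are illustrative assumptions.
write_secrets_template() {
  dir="${1:-.}"
  target="$dir/secrets.auto.tfvars"
  # Never clobber a file that may already hold real secrets.
  if [ -e "$target" ]; then
    echo "refusing to overwrite $target" >&2
    return 1
  fi
  (
    # Owner-only permissions, set before any secret is written.
    umask 077
    cat > "$target" <<'EOF'
os_application_credential_id     = "REPLACE_WITH_CREDENTIAL_ID"
os_application_credential_secret = "REPLACE_WITH_CREDENTIAL_SECRET"
EOF
  )
}

# Example (the path is hypothetical):
#   write_secrets_template ./deployment-prep-opentofu
```

The placeholders then need to be replaced by hand with the real credential values, and the file should stay out of version control, as it does in the repo above.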
Things are well and truly broken now:
$ tofu plan

Planning failed. OpenTofu encountered an error while generating this plan.

╷
│ Error: Error building kubeconfig for openstack_containerinfra_cluster_v1 d50c7032-5a62-4084-a27c-35f93653c746: Error getting certificate authority: Resource not found: [GET https://openstack.eqiad1.wikimediacloud.org:29511/v1/certificates/d50c7032-5a62-4084-a27c-35f93653c746], error message: {"errors": [{"request_id": "", "code": "client", "status": 404, "title": "A key pair None could not be found", "detail": "A key pair None could not be found.", "links": []}]}
│
│   with openstack_containerinfra_cluster_v1.wikikube,
│   on magnum.tf line 1, in resource "openstack_containerinfra_cluster_v1" "wikikube":
│    1: resource "openstack_containerinfra_cluster_v1" "wikikube" {
│
╵
The missing d50c7032-5a62-4084-a27c-35f93653c746 cert belongs to the same cluster that failed in the original tofu apply. If I remove the failed "openstack_containerinfra_cluster_v1" resource from my local state file, then I can run tofu plan again.
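Rather than editing the state file by hand, the same removal can probably be done with the `tofu state` subcommands. This is a sketch, not necessarily what was done here; the helper name `forget_resource` is hypothetical, and the default address is the one shown in the error output.

```shell
# Drop one resource from the local OpenTofu state. This only edits the
# state; the real cluster in Magnum is left untouched.
forget_resource() {
  addr="${1:-openstack_containerinfra_cluster_v1.wikikube}"
  # Only call `state rm` when the address is actually tracked.
  if tofu state list | grep -Fxq "$addr"; then
    tofu state rm "$addr"
  else
    echo "$addr is not in the state file" >&2
    return 1
  fi
}

# Example, run from the directory holding the tofu config:
#   forget_resource
#   tofu plan
```

Because this only forgets the resource locally, the failed cluster still exists on the OpenStack side and has to be deleted separately.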
tofu destroy gets stuck because it fails to run DELETE https://openstack.eqiad1.wikimediacloud.org:29511/v1/clustertemplates/fb4b8790-5c0c-4092-bfad-3b7a48de09f0 (the cluster template created by the original tofu apply).
Manual cleanup of tofu failure:
$ sudo wmcs-openstack coe cluster list
+--------------------------------------+----------------+---------+------------+--------------+-----------------+---------------+
| uuid                                 | name           | keypair | node_count | master_count | status          | health_status |
+--------------------------------------+----------------+---------+------------+--------------+-----------------+---------------+
| fb7dd564-72d7-44ef-b892-c4439bb76429 | procbot-k8s-b  | None    |          1 |            1 | CREATE_COMPLETE | UNKNOWN       |
| 6d15f459-c71f-4c63-9044-b56cd7cbaef8 | quarry-124     | None    |          2 |            1 | CREATE_COMPLETE | UNKNOWN       |
| 73b020ee-1695-4ca5-93bf-21a96be5ad4b | paws-127       | None    |          5 |            1 | CREATE_COMPLETE | UNKNOWN       |
| c731028d-70be-48ee-a5fa-1881602c60ef | superset-126-2 | None    |          2 |            1 | CREATE_COMPLETE | UNKNOWN       |
| c754199b-7d97-4af6-ab5a-c3520d74ddb5 | superset-127   | None    |          2 |            1 | CREATE_COMPLETE | UNKNOWN       |
| d50c7032-5a62-4084-a27c-35f93653c746 | wikikube       | None    |          1 |            1 | CREATE_FAILED   | None          |
+--------------------------------------+----------------+---------+------------+--------------+-----------------+---------------+
$ sudo wmcs-openstack coe cluster delete d50c7032-5a62-4084-a27c-35f93653c746
Request to delete cluster d50c7032-5a62-4084-a27c-35f93653c746 has been accepted.
$ sudo wmcs-openstack coe cluster template list
+--------------------------------------+------------------------+------+
| uuid                                 | name                   | tags |
+--------------------------------------+------------------------+------+
| b6df2852-eb2e-4d17-a93f-973a4ccabd64 | paws-k8s21             | None |
| 79ab0387-13e3-4f19-a277-a0376ca675b7 | paws-k8s22             | None |
| 15edb490-b952-4eb7-bbad-3b95326a9f91 | paws-k8s23             | None |
| ff8e7dc6-dd1f-44a1-8d8b-4767f6c4eed3 | procbot-k8s-b-template | None |
| 4c9c21a4-3fc7-4472-a2e7-3e4c22bef47e | tf-infra-test-123      | None |
| 1b997ebc-5024-4e92-a5c2-d33a8a3b5a46 | tf-infra-test-123      | None |
| 3578c3ab-bde0-49d5-b06c-0cdea28c389b | quarry-124             | None |
| cbe58a34-a936-4c5c-ad83-92d4d2eb02ba | paws-127               | None |
| 8c056e56-3e85-45d1-bf5b-8c70ee8690fd | superset-126-1         | None |
| 54ea3f2f-1217-4426-a121-b54e9709307d | superset-126-1         | None |
| f5f5dfdf-2dba-40de-9c47-6d62ff587bb5 | superset-126-2         | None |
| fbe308ef-691a-4b6e-bfe7-2f984b878c92 | superset-127           | None |
| fb4b8790-5c0c-4092-bfad-3b7a48de09f0 | wikikube               | None |
+--------------------------------------+------------------------+------+
$ sudo wmcs-openstack coe cluster template delete fb4b8790-5c0c-4092-bfad-3b7a48de09f0
Request to delete cluster template fb4b8790-5c0c-4092-bfad-3b7a48de09f0 has been accepted.
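If this cleanup has to be repeated, the failed clusters can be picked out of the `coe cluster list` table with a small filter instead of by eye. A rough sketch, assuming the column layout shown in the listing above (status in the sixth column); the function name `failed_cluster_uuids` is made up.

```shell
# Read a `coe cluster list` table on stdin and print the uuid of every
# cluster whose status column is CREATE_FAILED.
failed_cluster_uuids() {
  awk -F'|' '
    # Data and header rows have at least 8 pipe-separated fields;
    # the +----+ border rows have only one and are skipped.
    NF >= 8 {
      uuid = $2; status = $7
      gsub(/^[ \t]+|[ \t]+$/, "", uuid)    # trim cell padding
      gsub(/^[ \t]+|[ \t]+$/, "", status)
      if (status == "CREATE_FAILED") print uuid
    }
  '
}

# Example:
#   sudo wmcs-openstack coe cluster list | failed_cluster_uuids
```

Printing the uuids and deleting them as a separate, deliberate step keeps a typo from removing a healthy cluster such as paws-127.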
T332194: Cannot create magnum cluster looks to have been the same general problem ("Failed to create trustee or trust for Cluster"). Per T332194#8710538 I think I need to try changing to a new credential with "Unrestricted (dangerous)" permissions.
Apply complete! Resources: 3 added, 0 changed, 0 destroyed.
The need for "Unrestricted (dangerous)" permission on the application credentials is now documented at https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Magnum#Provisioning_with_OpenTofu
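For reference, an unrestricted application credential can also be created from the command line. This sketch assumes the plain `openstack` client is available and already authenticated as your own user in the target project, which this thread does not confirm; the helper name and the default credential name `tofu-magnum` are placeholders.

```shell
# Create an application credential that is allowed to create trusts,
# which Magnum needs when building a cluster. --unrestricted is the CLI
# counterpart of the "Unrestricted (dangerous)" option.
create_magnum_credential() {
  name="${1:-tofu-magnum}"   # placeholder credential name
  openstack application credential create --unrestricted \
    --description "OpenTofu Magnum provisioning" "$name"
}

# Example:
#   create_magnum_credential wikikube-tofu
```

The id and secret of the new credential would then replace the os_application_credential_id and os_application_credential_secret values in secrets.auto.tfvars. Since an unrestricted credential can itself create further credentials and trusts, it should be kept private and deleted once it is no longer needed.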