
OpenTofu fails to provision a Magnum managed k8s cluster in deployment-prep
Closed, Resolved | Public | BUG REPORT

Description

$ tofu apply

OpenTofu used the selected providers to generate the following execution plan.
Resource actions are indicated with the following symbols:
  + create

OpenTofu will perform the following actions:

  # local_file.wikikube_config will be created
  + resource "local_file" "wikikube_config" {
      + content              = (known after apply)
      + content_base64sha256 = (known after apply)
      + content_base64sha512 = (known after apply)
      + content_md5          = (known after apply)
      + content_sha1         = (known after apply)
      + content_sha256       = (known after apply)
      + content_sha512       = (known after apply)
      + directory_permission = "0777"
      + file_permission      = "0777"
      + filename             = "wikikube.config"
      + id                   = (known after apply)
    }

  # openstack_containerinfra_cluster_v1.wikikube will be created
  + resource "openstack_containerinfra_cluster_v1" "wikikube" {
      + api_address         = (known after apply)
      + cluster_template_id = (known after apply)
      + coe_version         = (known after apply)
      + container_version   = (known after apply)
      + create_timeout      = (known after apply)
      + created_at          = (known after apply)
      + discovery_url       = (known after apply)
      + docker_volume_size  = (known after apply)
      + fixed_network       = (known after apply)
      + fixed_subnet        = (known after apply)
      + flavor              = (known after apply)
      + floating_ip_enabled = (known after apply)
      + id                  = (known after apply)
      + keypair             = (known after apply)
      + kubeconfig          = (sensitive value)
      + labels              = (known after apply)
      + master_addresses    = (known after apply)
      + master_count        = 1
      + master_flavor       = (known after apply)
      + name                = "wikikube"
      + node_addresses      = (known after apply)
      + node_count          = 1
      + project_id          = (known after apply)
      + region              = (known after apply)
      + stack_id            = (known after apply)
      + updated_at          = (known after apply)
      + user_id             = (known after apply)
    }

  # openstack_containerinfra_clustertemplate_v1.wikikube_template will be created
  + resource "openstack_containerinfra_clustertemplate_v1" "wikikube_template" {
      + cluster_distro        = (known after apply)
      + coe                   = "kubernetes"
      + created_at            = (known after apply)
      + dns_nameserver        = "8.8.8.8"
      + docker_storage_driver = "overlay2"
      + docker_volume_size    = 80
      + external_network_id   = "wan-transport-eqiad"
      + fixed_network         = "lan-flat-cloudinstances2b"
      + fixed_subnet          = "cloud-instances2-b-eqiad"
      + flavor                = "g4.cores8.ram32.disk20"
      + floating_ip_enabled   = false
      + id                    = (known after apply)
      + image                 = "Fedora-CoreOS-38"
      + labels                = {
          + "cloud_provider_enabled"    = "true"
          + "container_runtime"         = "containerd"
          + "containerd_tarball_sha256" = "1d86b534c7bba51b78a7eeb1b67dd2ac6c0edeb01c034cc5f590d5ccd824b416"
          + "containerd_version"        = "1.6.20"
          + "hyperkube_prefix"          = "docker.io/rancher/"
          + "kube_tag"                  = "v1.26.8-rancher1"
        }
      + master_flavor         = "g4.cores2.ram4.disk20"
      + name                  = "wikikube"
      + network_driver        = "flannel"
      + project_id            = (known after apply)
      + region                = (known after apply)
      + server_type           = (known after apply)
      + updated_at            = (known after apply)
      + user_id               = (known after apply)
    }

Plan: 3 to add, 0 to change, 0 to destroy.

Do you want to perform these actions?
  OpenTofu will perform the actions described above.
  Only 'yes' will be accepted to approve.

  Enter a value: yes

openstack_containerinfra_clustertemplate_v1.wikikube_template: Creating...
openstack_containerinfra_clustertemplate_v1.wikikube_template: Creation complete after 5s [id=fb4b8790-5c0c-4092-bfad-3b7a48de09f0]
openstack_containerinfra_cluster_v1.wikikube: Creating...
openstack_containerinfra_cluster_v1.wikikube: Still creating... [10s elapsed]
openstack_containerinfra_cluster_v1.wikikube: Still creating... [20s elapsed]
openstack_containerinfra_cluster_v1.wikikube: Still creating... [30s elapsed]
openstack_containerinfra_cluster_v1.wikikube: Still creating... [40s elapsed]
openstack_containerinfra_cluster_v1.wikikube: Still creating... [50s elapsed]
openstack_containerinfra_cluster_v1.wikikube: Still creating... [1m0s elapsed]

│ Error: Error waiting for openstack_containerinfra_cluster_v1 d50c7032-5a62-4084-a27c-35f93653c746 to become ready: openstack_containerinfra_cluster_v1 is in an error state: Failed to create trustee or trust for Cluster: d50c7032-5a62-4084-a27c-35f93653c746

│   with openstack_containerinfra_cluster_v1.wikikube,
│   on magnum.tf line 1, in resource "openstack_containerinfra_cluster_v1" "wikikube":
│    1: resource "openstack_containerinfra_cluster_v1" "wikikube" {

Event Timeline

I'm not quite sure how to start troubleshooting this at the moment... I guess I will start by digging around in logs.

https://gitlab.wikimedia.org/bd808/deployment-prep-opentofu is the tofu config I was trying to apply. The only thing not committed to that repo is a secrets.auto.tfvars file with os_application_credential_id and os_application_credential_secret values taken from application credentials for my user in the deployment-prep project.
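
For anyone reproducing this setup, the uncommitted file could be stubbed out roughly as follows. This is a sketch under assumptions: the helper name write_secrets_template and the placeholder values are hypothetical, and only the two variable names come from the description above. OpenTofu loads any *.auto.tfvars file in the working directory automatically, so no extra flag is needed.

```shell
# Hypothetical helper: write a placeholder secrets.auto.tfvars into the
# given directory (default: current directory).
# Refuses to overwrite an existing file so real credentials are never lost.
write_secrets_template() {
  target="${1:-.}/secrets.auto.tfvars"
  if [ -e "$target" ]; then
    echo "refusing to overwrite $target" >&2
    return 1
  fi
  (
    umask 077  # the file holds a secret; keep it readable by the owner only
    cat > "$target" <<'EOF'
os_application_credential_id     = "REPLACE_WITH_CREDENTIAL_ID"
os_application_credential_secret = "REPLACE_WITH_CREDENTIAL_SECRET"
EOF
  )
}
```

Calling write_secrets_template with the path to the checkout and then replacing the two placeholders with the real credential values should be enough for tofu plan to pick them up.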

Things are well and truly broken now:

$ tofu plan
Planning failed. OpenTofu encountered an error while generating this plan.


│ Error: Error building kubeconfig for openstack_containerinfra_cluster_v1 d50c7032-5a62-4084-a27c-35f93653c746: Error getting certificate authority: Resource not found: [GET https://openstack.eqiad1.wikimediacloud.org:29511/v1/certificates/d50c7032-5a62-4084-a27c-35f93653c746], error message: {"errors": [{"request_id": "", "code": "client", "status": 404, "title": "A key pair None could not be found", "detail": "A key pair None could not be found.", "links": []}]}

│   with openstack_containerinfra_cluster_v1.wikikube,
│   on magnum.tf line 1, in resource "openstack_containerinfra_cluster_v1" "wikikube":
│    1: resource "openstack_containerinfra_cluster_v1" "wikikube" {

The missing certificate belongs to cluster d50c7032-5a62-4084-a27c-35f93653c746, the same cluster that failed to provision in the original tofu apply. If I remove the failed "openstack_containerinfra_cluster_v1" resource from my local state file, I can run tofu plan again.
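
The state cleanup described above can be sketched with the tofu state subcommands. This is not necessarily the exact sequence that was run here: the run helper and the DRY_RUN variable are hypothetical additions so that the commands are only printed unless you opt in, and the resource address is the one shown in the error output.

```shell
# Hypothetical wrapper: print each command by default,
# execute it only when DRY_RUN=0 is set in the environment.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Address of the resource that failed, taken from the error message.
FAILED_ADDR='openstack_containerinfra_cluster_v1.wikikube'

run tofu state list              # confirm the failed resource is tracked
run tofu state rm "$FAILED_ADDR" # forget it locally; nothing is deleted in OpenStack
run tofu plan                    # planning no longer tries to refresh the broken cluster
```

Note that tofu state rm only edits the state; the failed cluster still exists in Magnum and has to be cleaned up separately.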

tofu destroy gets stuck because it fails to run DELETE https://openstack.eqiad1.wikimediacloud.org:29511/v1/clustertemplates/fb4b8790-5c0c-4092-bfad-3b7a48de09f0 (the cluster template that the failed cluster was created from, not a cert).

Manual cleanup of tofu failure:

$ sudo wmcs-openstack coe cluster list
+--------------------------------------+----------------+---------+------------+--------------+-----------------+---------------+
| uuid                                 | name           | keypair | node_count | master_count | status          | health_status |
+--------------------------------------+----------------+---------+------------+--------------+-----------------+---------------+
| fb7dd564-72d7-44ef-b892-c4439bb76429 | procbot-k8s-b  | None    |          1 |            1 | CREATE_COMPLETE | UNKNOWN       |
| 6d15f459-c71f-4c63-9044-b56cd7cbaef8 | quarry-124     | None    |          2 |            1 | CREATE_COMPLETE | UNKNOWN       |
| 73b020ee-1695-4ca5-93bf-21a96be5ad4b | paws-127       | None    |          5 |            1 | CREATE_COMPLETE | UNKNOWN       |
| c731028d-70be-48ee-a5fa-1881602c60ef | superset-126-2 | None    |          2 |            1 | CREATE_COMPLETE | UNKNOWN       |
| c754199b-7d97-4af6-ab5a-c3520d74ddb5 | superset-127   | None    |          2 |            1 | CREATE_COMPLETE | UNKNOWN       |
| d50c7032-5a62-4084-a27c-35f93653c746 | wikikube       | None    |          1 |            1 | CREATE_FAILED   | None          |
+--------------------------------------+----------------+---------+------------+--------------+-----------------+---------------+
$ sudo wmcs-openstack coe cluster delete d50c7032-5a62-4084-a27c-35f93653c746
Request to delete cluster d50c7032-5a62-4084-a27c-35f93653c746 has been accepted.
$ sudo wmcs-openstack coe cluster template list
+--------------------------------------+------------------------+------+
| uuid                                 | name                   | tags |
+--------------------------------------+------------------------+------+
| b6df2852-eb2e-4d17-a93f-973a4ccabd64 | paws-k8s21             | None |
| 79ab0387-13e3-4f19-a277-a0376ca675b7 | paws-k8s22             | None |
| 15edb490-b952-4eb7-bbad-3b95326a9f91 | paws-k8s23             | None |
| ff8e7dc6-dd1f-44a1-8d8b-4767f6c4eed3 | procbot-k8s-b-template | None |
| 4c9c21a4-3fc7-4472-a2e7-3e4c22bef47e | tf-infra-test-123      | None |
| 1b997ebc-5024-4e92-a5c2-d33a8a3b5a46 | tf-infra-test-123      | None |
| 3578c3ab-bde0-49d5-b06c-0cdea28c389b | quarry-124             | None |
| cbe58a34-a936-4c5c-ad83-92d4d2eb02ba | paws-127               | None |
| 8c056e56-3e85-45d1-bf5b-8c70ee8690fd | superset-126-1         | None |
| 54ea3f2f-1217-4426-a121-b54e9709307d | superset-126-1         | None |
| f5f5dfdf-2dba-40de-9c47-6d62ff587bb5 | superset-126-2         | None |
| fbe308ef-691a-4b6e-bfe7-2f984b878c92 | superset-127           | None |
| fb4b8790-5c0c-4092-bfad-3b7a48de09f0 | wikikube               | None |
+--------------------------------------+------------------------+------+
$ sudo wmcs-openstack coe cluster template delete fb4b8790-5c0c-4092-bfad-3b7a48de09f0
Request to delete cluster template fb4b8790-5c0c-4092-bfad-3b7a48de09f0 has been accepted.
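
Picking the failed cluster out of the list by eye works for a handful of clusters; a small filter can do the same. This is a sketch only: the failed_clusters function is hypothetical, and it assumes wmcs-openstack passes the standard OpenStack client formatter options (-f value -c uuid -c status) through unchanged, which was not verified here.

```shell
# Hypothetical filter: print the UUID of every cluster whose status is
# CREATE_FAILED. Expects one "uuid status" pair per line on stdin.
failed_clusters() {
  awk '$2 == "CREATE_FAILED" { print $1 }'
}

# Demonstration with two rows copied from the cluster list above.
printf '%s\n' \
  'fb7dd564-72d7-44ef-b892-c4439bb76429 CREATE_COMPLETE' \
  'd50c7032-5a62-4084-a27c-35f93653c746 CREATE_FAILED' |
  failed_clusters
# prints d50c7032-5a62-4084-a27c-35f93653c746

# Against the live API this would be (assumed flags, see above):
#   sudo wmcs-openstack coe cluster list -f value -c uuid -c status | failed_clusters
```

The resulting UUIDs can then be passed to coe cluster delete as shown in the manual cleanup.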

T332194 ("Cannot create magnum cluster") looks to have been the same general problem ("Failed to create trustee or trust for Cluster"). Per T332194#8710538, I think I need to try switching to a new application credential with "Unrestricted (dangerous)" permissions.

Apply complete! Resources: 3 added, 0 changed, 0 destroyed.

The need for "Unrestricted (dangerous)" permission on the application credentials is now documented at https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Magnum#Provisioning_with_OpenTofu