
Provisioning of Kubernetes cluster via Magnum stopped working around time of OpenStack upgrade
Closed, Resolved · Public · BUG REPORT

Description

I have been working on T372498: Figure out how to provision a Kubernetes cluster using Magnum and OpenTofu for a week and a half or so. The project is using tofu to trigger OpenStack Magnum building a Kubernetes cluster. As reported in T372498#10065541 this automation was working at least on 2024-08-14.

On 2024-08-22 I decided to do a full test again by destroying the existing 2-node cluster and attempting to create a new one. Unfortunately, every attempt to build a new cluster has failed. The tofu apply workflow creates the cluster template and then the cluster resource. The Magnum automation then starts creating the instances that will make up the cluster. I can see in Horizon that the "master" node for the cluster is created, but the Magnum automation never recognizes that the instance creation completed, and eventually it fails the cluster build step.

Using tofu destroy to clean up after the failed cluster build also requires multiple attempts. Again via Horizon I can see the "master" instance change state (this time being deleted instead of created), but Magnum never seems to get the notification that the delete succeeded and instead eventually times out waiting for it. A subsequent tofu destroy succeeds, apparently because the system recognizes that the instance is already gone before completing the cluster and template object removal.

The GitLab CI pipeline at https://gitlab.wikimedia.org/bd808/deployment-prep-opentofu/-/jobs/350333 shows a typical misbehavior matching the above description. When running under GitLab the tofu apply just keeps going until the job's time limit kills it. When run locally things seem to time out sooner with opentofu itself giving up. This is likely just some behavioral difference triggered by the TF_IN_AUTOMATION=1 envvar in the GitLab run.

I don't have any hard proof that this change in behavior was caused by T369044: Upgrade cloud-vps openstack to version 'Caracal', but the timing is suspicious.

Event Timeline

@Andrew, any idea about where I should start looking for hints about what might be going wrong here?

https://docs.openstack.org/magnum/2024.1/admin/troubleshooting-guide.html#heat-stacks

$ openstack stack list | grep beta
| e93cb3ea-9a9e-47e8-a435-a475706bf948 | beta-v126-zzqihxfjnjjb                | CREATE_FAILED | 2024-08-23T18:03:44Z | None         |
$ openstack stack resource list beta-v126-zzqihxfjnjjb | grep FAILED
| kube_masters                            | dc43e1f5-2f7b-4e69-bc2b-537561edba46 | OS::Heat::ResourceGroup                                                            | CREATE_FAILED   | 2024-08-23T18:04:19Z |
$ openstack stack resource show beta-v126-zzqihxfjnjjb kube_masters
+------------------------+-----------------------------------------------------+
| Field                  | Value                                               |
+------------------------+-----------------------------------------------------+
| attributes             | {'removed_rsrc_list': [], 'refs_map': None, 'refs': |
|                        | None, 'attributes': None}                           |
| creation_time          | 2024-08-23T18:04:19Z                                |
| description            |                                                     |
| links                  | [{'href': 'https://openstack.eqiad1.wikimediacloud. |
|                        | org:28004/v1/deployment-prep/stacks/beta-v126-      |
|                        | zzqihxfjnjjb/e93cb3ea-9a9e-47e8-a435-               |
|                        | a475706bf948/resources/kube_masters', 'rel':        |
|                        | 'self'}, {'href': 'https://openstack.eqiad1.wikimed |
|                        | iacloud.org:28004/v1/deployment-prep/stacks/beta-   |
|                        | v126-zzqihxfjnjjb/e93cb3ea-9a9e-47e8-a435-          |
|                        | a475706bf948', 'rel': 'stack'}, {'href': 'https://o |
|                        | penstack.eqiad1.wikimediacloud.org:28004/v1/deploym |
|                        | ent-prep/stacks/beta-v126-zzqihxfjnjjb-kube_masters |
|                        | -3dcad4ye6nm6/dc43e1f5-2f7b-4e69-bc2b-              |
|                        | 537561edba46', 'rel': 'nested'}]                    |
| logical_resource_id    | kube_masters                                        |
| physical_resource_id   | dc43e1f5-2f7b-4e69-bc2b-537561edba46                |
| required_by            | ['kube_cluster_config', 'api_address_lb_switch',    |
|                        | 'kube_cluster_deploy', 'etcd_address_lb_switch']    |
| resource_name          | kube_masters                                        |
| resource_status        | CREATE_FAILED                                       |
| resource_status_reason | Error: resources.kube_masters.resources[0].resource |
|                        | s.master_config_deployment: Deployment to server    |
|                        | failed: deploy_status_code: Deployment exited with  |
|                        | non-zero status code: 1                             |
| resource_type          | OS::Heat::ResourceGroup                             |
| updated_time           | 2024-08-23T18:04:19Z                                |
+------------------------+-----------------------------------------------------+
bd808 renamed this task from Provisioning of Kubernetes cluster via Magnum stopped working around time time of OpenStack upgrade to Provisioning of Kubernetes cluster via Magnum stopped working around time of OpenStack upgrade. Aug 23 2024, 9:34 PM

Deep in the guts of the heat agent, it tries to fetch JSON containing a cert and fails to parse what it gets:

# Send csr to Magnum to have it signed
csr_req=$(python -c "import json; fp = open('${_CSR}'); print(json.dumps({'ca_cert_type': '$_CA_CERT_TYPE', 'cluster_uuid': '$CLUSTER_UUID', 'csr': fp.read()})); fp.close()")
curl $VERIFY_CA -X POST \
    -H "X-Auth-Token: $USER_TOKEN" \
    -H "OpenStack-API-Version: container-infra latest" \
    -H "Content-Type: application/json" \
    -d "$csr_req" \
    $MAGNUM_URL/certificates | python -c 'import sys, json; print(json.load(sys.stdin)["pem"])' > ${_CERT}

Which produces output like...

[2024-08-24 01:15:15,362] (heat-config) [DEBUG] + csr_req='{"ca_cert_type": "kubelet", "cluster_uuid": "e599ba8f-5b9b-4976-ae95-ab74faaf5f25", "csr": "-----BEGIN CERTIFICATE REQUEST-----\n<blob omitted>\n-----END CERTIFICATE REQUEST-----\n"}'
[2024-08-24 01:15:15,363] (heat-config) [DEBUG] + python -c 'import sys, json; print(json.load(sys.stdin)["pem"])'
[2024-08-24 01:15:15,363] (heat-config) [DEBUG] + curl -X POST -H 'X-Auth-Token: <blob omitted>' -H 'OpenStack-API-Version: container-infra latest' -H 'Content-Type: application/json' -d '{"ca_cert_type": "kubelet", "cluster_uuid": "e599ba8f-5b9b-4976-ae95-ab74faaf5f25", "csr": "-----BEGIN CERTIFICATE REQUEST-----\n<blob omitted>\n-----END CERTIFICATE REQUEST-----\n"}' https://openstack.eqiad1.wikimediacloud.org:29511/v1/certificates
[2024-08-24 01:15:15,374] (heat-config) [DEBUG] % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
[2024-08-24 01:15:15,375] (heat-config) [DEBUG] Dload  Upload   Total   Spent    Left  Speed
100  2201    0   257  100  1944      4     31  0:01:02  0:01:00  0:00:02    57     0      0 --:--:-- --:--:-- --:--:--     0
[2024-08-24 01:16:16,190] (heat-config) [DEBUG] Traceback (most recent call last):
[2024-08-24 01:16:16,191] (heat-config) [DEBUG] File "<string>", line 1, in <module>
[2024-08-24 01:16:16,196] (heat-config) [DEBUG] KeyError: 'pem'
[2024-08-24 01:16:16,212] (heat-config) [ERROR] Error running /var/lib/heat-config/heat-config-script/9f420156-7c6a-460e-bf15-3d08597dfc9a. [1]

Testing by hand, that curl works some of the time and fails some of the time. So maybe we're just seeing a partial service outage plus a very touchy agent.
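The fragile part of the agent script is that it pipes curl straight into `json.load(sys.stdin)["pem"]`, so any error body becomes a bare KeyError and the whole deployment exits 1 on the first flaky response. As a hedged sketch (not the actual heat-agent code; the function name is mine and the HTTP POST is injected as a callable so the sketch stays network-free), a more defensive client would validate the response and retry:

```python
import json
import time


def fetch_signed_cert(post_csr, payload, retries=5, delay=2.0, sleep=time.sleep):
    """Submit a CSR payload and return the signed PEM, retrying on
    transient failures instead of dying on the first bad response.

    `post_csr` is any callable that performs the HTTP POST and returns
    the response body as a string."""
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            body = post_csr(payload)
            doc = json.loads(body)
            if "pem" not in doc:
                # Magnum answered with JSON, but not a cert -- e.g. the
                # {"message": "Server-side error: ..."} timeout body.
                raise KeyError(doc.get("message", "no 'pem' in response"))
            return doc["pem"]
        except (ValueError, KeyError) as exc:
            last_error = exc
            sleep(delay * attempt)  # linear backoff between attempts
    raise RuntimeError(
        f"cert signing failed after {retries} attempts: {last_error}")
```

With something like this, a partial outage would only slow the deployment down instead of failing the whole Heat stack.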

On the server side, a failed curl looks like this:

{"message": "Server-side error: \"Timed out waiting for a reply to message ID 643b87cfd20e40b7bf615bda77a84255\". Detail: \nTraceback (most recent call last):\n\n  File \"/usr/lib/python3/dist-packages/oslo_messaging/_drivers/amqpdriver.py\", line 500, in get\n    return self._queues[msg_id].get(block=True, timeout=timeout)\n           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n\n  File \"/usr/lib/python3.11/queue.py\", line 179, in get\n    raise Empty\n\n_queue.Empty\n\n\nDuring handling of the above exception, another exception occurred:\n\n\nTraceback (most recent call last):\n\n  File \"/usr/lib/python3/dist-packages/wsmeext/pecan.py\", line 82, in callfunction\n    result = f(self, *args, **kwargs)\n             ^^^^^^^^^^^^^^^^^^^^^^^^\n\n  File \"/usr/lib/python3/dist-packages/magnum/api/controllers/v1/certificate.py\", line 177, in post\n    new_cert = pecan.request.rpcapi.sign_certificate(cluster,\n               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n\n  File \"/usr/lib/python3/dist-packages/osprofiler/profiler.py\", line 159, in wrapper\n    result = f(*args, **kwargs)\n             ^^^^^^^^^^^^^^^^^^\n\n  File \"/usr/lib/python3/dist-packages/magnum/conductor/api.py\", line 126, in sign_certificate\n    return self._call('sign_certificate', cluster=cluster,\n           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n\n  File \"/usr/lib/python3/dist-packages/magnum/common/rpc_service.py\", line 79, in _call\n    return self._client.call(self._context, method, *args, **kwargs)\n           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n\n  File \"/usr/lib/python3/dist-packages/oslo_messaging/rpc/client.py\", line 546, in call\n    return self.prepare().call(ctxt, method, **kwargs)\n           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n\n  File \"/usr/lib/python3/dist-packages/oslo_messaging/rpc/client.py\", line 190, in call\n    result = self.transport._send(\n             ^^^^^^^^^^^^^^^^^^^^^\n\n  File 
\"/usr/lib/python3/dist-packages/oslo_messaging/transport.py\", line 123, in _send\n    return self._driver.send(target, ctxt, message,\n           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n\n  File \"/usr/lib/python3/dist-packages/oslo_messaging/_drivers/amqpdriver.py\", line 782, in send\n    return self._send(target, ctxt, message, wait_for_reply, timeout,\n           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n\n  File \"/usr/lib/python3/dist-packages/oslo_messaging/_drivers/amqpdriver.py\", line 771, in _send\n    result = self._waiter.wait(msg_id, timeout,\n             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n\n  File \"/usr/lib/python3/dist-packages/oslo_messaging/_drivers/amqpdriver.py\", line 636, in wait\n    message = self.waiters.get(msg_id, timeout=timeout)\n              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n\n  File \"/usr/lib/python3/dist-packages/oslo_messaging/_drivers/amqpdriver.py\", line 502, in get\n    raise oslo_messaging.MessagingTimeout(\n\noslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID 643b87cfd20e40b7bf615bda77a84255\n", "asctime": "2024-08-24 02:09:28", "name": "wsme.api", "msg": "Server-side error: \"Timed out waiting for a reply to message ID 643b87cfd20e40b7bf615bda77a84255\". 
Detail: \nTraceback (most recent call last):\n\n  File \"/usr/lib/python3/dist-packages/oslo_messaging/_drivers/amqpdriver.py\", line 500, in get\n    return self._queues[msg_id].get(block=True, timeout=timeout)\n           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n\n  File \"/usr/lib/python3.11/queue.py\", line 179, in get\n    raise Empty\n\n_queue.Empty\n\n\nDuring handling of the above exception, another exception occurred:\n\n\nTraceback (most recent call last):\n\n  File \"/usr/lib/python3/dist-packages/wsmeext/pecan.py\", line 82, in callfunction\n    result = f(self, *args, **kwargs)\n             ^^^^^^^^^^^^^^^^^^^^^^^^\n\n  File \"/usr/lib/python3/dist-packages/magnum/api/controllers/v1/certificate.py\", line 177, in post\n    new_cert = pecan.request.rpcapi.sign_certificate(cluster,\n               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n\n  File \"/usr/lib/python3/dist-packages/osprofiler/profiler.py\", line 159, in wrapper\n    result = f(*args, **kwargs)\n             ^^^^^^^^^^^^^^^^^^\n\n  File \"/usr/lib/python3/dist-packages/magnum/conductor/api.py\", line 126, in sign_certificate\n    return self._call('sign_certificate', cluster=cluster,\n           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n\n  File \"/usr/lib/python3/dist-packages/magnum/common/rpc_service.py\", line 79, in _call\n    return self._client.call(self._context, method, *args, **kwargs)\n           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n\n  File \"/usr/lib/python3/dist-packages/oslo_messaging/rpc/client.py\", line 546, in call\n    return self.prepare().call(ctxt, method, **kwargs)\n           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n\n  File \"/usr/lib/python3/dist-packages/oslo_messaging/rpc/client.py\", line 190, in call\n    result = self.transport._send(\n             ^^^^^^^^^^^^^^^^^^^^^\n\n  File \"/usr/lib/python3/dist-packages/oslo_messaging/transport.py\", line 123, in _send\n    return self._driver.send(target, ctxt, 
message,\n           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n\n  File \"/usr/lib/python3/dist-packages/oslo_messaging/_drivers/amqpdriver.py\", line 782, in send\n    return self._send(target, ctxt, message, wait_for_reply, timeout,\n           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n\n  File \"/usr/lib/python3/dist-packages/oslo_messaging/_drivers/amqpdriver.py\", line 771, in _send\n    result = self._waiter.wait(msg_id, timeout,\n             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n\n  File \"/usr/lib/python3/dist-packages/oslo_messaging/_drivers/amqpdriver.py\", line 636, in wait\n    message = self.waiters.get(msg_id, timeout=timeout)\n              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n\n  File \"/usr/lib/python3/dist-packages/oslo_messaging/_drivers/amqpdriver.py\", line 502, in get\n    raise oslo_messaging.MessagingTimeout(\n\noslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID 643b87cfd20e40b7bf615bda77a84255\n", "args": [], "levelname": "ERROR", "levelno": 40, "pathname": "/usr/lib/python3/dist-packages/wsme/api.py", "filename": "api.py", "module": "api", "lineno": 230, "funcname": "format_exception", "created": 1724465368.8644986, "msecs": 864.0, "relative_created": 24315838.299036026, "thread": 140292791449280, "thread_name": "uWSGIWorker6Core29", "process_name": "MainProcess", "process": 683173, "traceback": null, "hostname": "cloudcontrol1007", "error_summary": "oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID 643b87cfd20e40b7bf615bda77a84255", "context": {"user_name": null, "project_name": null, "domain_name": null, "user_domain_name": null, "project_domain_name": null, "user": null, "project_id": null, "system_scope": null, "project": null, "domain": null, "user_domain": null, "project_domain": null, "is_admin": true, "read_only": false, "show_deleted": false, "auth_token": null, "request_id": "req-47c93fc8-599d-4a98-8bf8-aba6dee55169", 
"global_request_id": null, "resource_uuid": null, "roles": [], "user_identity": "- - - - -", "is_admin_project": true, "auth_url": null, "user_domain_id": null, "user_id": null, "trust_id": null, "password": null, "all_tenants": false}, "extra": {}}
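The traceback bottoms out in oslo.messaging's reply-wait loop: the API process puts the sign_certificate request on the conductor's RPC queue and then blocks on a per-message reply queue; if a wedged magnum-conductor never consumes the request and answers, the `queue.Empty` is converted into a `MessagingTimeout`, which is what gets serialized into the JSON above. A toy reproduction of that pattern (the class and function here are illustrative stand-ins, not the real amqpdriver):

```python
import queue


class MessagingTimeout(Exception):
    """Illustrative stand-in for oslo_messaging.exceptions.MessagingTimeout."""


def wait_for_reply(reply_queue, msg_id, timeout):
    # Mimics the amqpdriver wait: block on the reply queue until the
    # conductor answers; if nothing arrives before the timeout,
    # queue.Empty is re-raised as a MessagingTimeout naming the message.
    try:
        return reply_queue.get(block=True, timeout=timeout)
    except queue.Empty:
        raise MessagingTimeout(
            "Timed out waiting for a reply to message ID %s" % msg_id)
```

This is why the client-side symptom is a 60-second hang followed by an error body with no "pem" key: the request was accepted, but no reply ever came back.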

So... time for me to do a rabbitmq overhaul :(

I didn't do a rabbitmq overhaul. This is silly, but I restarted all the magnum-conductor agents and the cert signing became reliable. And, as far as I can tell, that was it -- tofu can now create a k8s cluster.

So the conductor was wedged in a way that was invisible to monitoring, I guess.

In codfw1dev, on the other hand, everything is failing because magnum thinks that the fedora base images are invalid. Maybe true! In the meantime, bd808 can you retest?
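If a wedged conductor is invisible to monitoring, one option would be an end-to-end probe: periodically POST a throwaway CSR to /v1/certificates (the same call the heat agent makes) and alert when the MessagingTimeout body comes back instead of a cert. A sketch of the classification half, under the assumption that the error body looks like the one captured above (this is a hypothetical check, not an existing Magnum health endpoint):

```python
import json


def classify_cert_probe(body):
    """Classify the response body of a periodic POST /v1/certificates probe.

    'healthy' -- a signed cert came back
    'wedged'  -- the MessagingTimeout error JSON seen above, i.e. the
                 conductor is not consuming RPC requests
    'error'   -- anything else (bad JSON, other server errors)
    """
    try:
        doc = json.loads(body)
    except ValueError:
        return "error"
    if "pem" in doc:
        return "healthy"
    if "Timed out waiting for a reply" in doc.get("message", ""):
        return "wedged"
    return "error"
```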

Change #1065934 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Openstack magnum: override default db settings to increase connection count

https://gerrit.wikimedia.org/r/1065934

The patch above delays, but does not resolve, the problem. It's a connection leak of some sort, affecting many users: https://bugs.launchpad.net/magnum/+bug/2067345
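For reference, the kind of override that patch presumably made: oslo.db exposes pool-sizing knobs in magnum.conf's [database] section, so the leak can be papered over by letting the pool grow larger before it is exhausted. The option names are standard oslo.db settings, but the values here are illustrative, not what the patch used:

```ini
[database]
# Larger SQLAlchemy pool, so leaked connections take longer to exhaust it.
max_pool_size = 50
# Extra connections allowed beyond the pool before callers block.
max_overflow = 100
# Seconds to wait for a pooled connection before giving up.
pool_timeout = 60
# Recycle connections periodically so leaked ones are eventually reclaimed.
connection_recycle_time = 600
```

As the abandon message below says, this only buys time; the leak itself needed the sqlalchemy-usage fix.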

Change #1065988 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Magnum: hack in fixes to sqlalchemy use

https://gerrit.wikimedia.org/r/1065988

Change #1065934 abandoned by Andrew Bogott:

[operations/puppet@production] Openstack magnum: override default db settings to increase connection count

Reason:

This delays but does not resolve the problem.

https://gerrit.wikimedia.org/r/1065934

Change #1065988 merged by Andrew Bogott:

[operations/puppet@production] Magnum: hack in fixes to sqlalchemy use

https://gerrit.wikimedia.org/r/1065988

I've applied the (messy) patch suggested as part of the upstream bug. It seems to be helping.

In the meantime, bd808 can you retest?

https://gitlab.wikimedia.org/bd808/deployment-prep-opentofu/-/jobs/351886 succeeded! The tofu apply step took about 21 minutes, but I think that was about how long things were taking before too.

2024-08-26T17:00:48.526566Z 01O Apply complete! Resources: 3 added, 0 changed, 0 destroyed.
2024-08-26T17:00:48.526569Z 01O 
2024-08-26T17:00:48.736691Z 00O section_end:1724691648:step_script
2024-08-26T17:00:48.736701Z 00O+section_start:1724691648:cleanup_file_variables
2024-08-26T17:00:48.738336Z 00O+Cleaning up project directory and file based variables
2024-08-26T17:00:49.676208Z 00O section_end:1724691649:cleanup_file_variables
2024-08-26T17:00:49.676228Z 00O+
2024-08-26T17:00:49.705707Z 00O Job succeeded

Thanks for digging into this @Andrew.

Looks like some of the images for the 1.26 deploy of Magnum k8s vanished from upstream, so tofu sees the deploy as good, but kube-system has a few image pull failures. 1.27 seems to work now. There is a little note in T373360 and a little discussion in T373093 on the issue. At any rate, you should probably update. Quarry is already updated to 1.27, so its code is good to replicate from. PAWS has a PR open for updating, but it hasn't gone out yet: https://github.com/toolforge/paws/pull/451