Prove that gitops automation works by provisioning a tiny Kubernetes cluster (1 control node, 1 worker node) and a small bastion server in the zuul project using OpenTofu.
Description
Details
| Title | Reference | Author | Source Branch | Dest Branch |
|---|---|---|---|---|
| haproxy: Add module to provision HAProxy service | repos/releng/zuul/tofu-provisioning!16 | bd808 | work/bd808/haproxy | main |
| puppetserver: Manage Project Puppet settings | repos/releng/zuul/tofu-provisioning!13 | bd808 | work/bd808/project-puppet | main |
| puppetserver: Provision a project local puppetserver | repos/releng/zuul/tofu-provisioning!11 | bd808 | work/bd808/puppetserver | main |
| bastion: base64 encode ssh private key | repos/releng/zuul/tofu-provisioning!9 | bd808 | work/bd808/ssh-provision | main |
| tofu: add a bastion and a web proxy | repos/releng/zuul/tofu-provisioning!7 | bd808 | work/bd808/bastion | main |
Event Timeline
The Kubernetes cluster is stuck on T396935: Magnum created instances failing to talk to OpenStack user_data service at the moment.
I got to a new failure point.
$ sudo wmcs-openstack stack resource list b899cc99-bec8-47d7-8eb5-b5f31027bfb4
+---------------+----------------------+---------------+-----------------+------------------+
| resource_name | physical_resource_id | resource_type | resource_status | updated_time     |
+---------------+----------------------+---------------+-----------------+------------------+
| secgroup_rule | 15266470-1905-4113-  | OS::Neutron:: | CREATE_COMPLETE | 2025-06-         |
| _tcp_kube_min | 98b7-3ad6cbc9703d    | SecurityGroup |                 | 16T17:39:57Z     |
| ion_pods_cidr |                      | Rule          |                 |                  |
| secgroup_rule | 3ab08684-8ae6-48c3-  | OS::Neutron:: | CREATE_COMPLETE | 2025-06-         |
| _udp_kube_min | 879b-eaa3d6ba612a    | SecurityGroup |                 | 16T17:39:57Z     |
| ion           |                      | Rule          |                 |                  |
| kube_minions  | 0919918e-08ae-4744-  | OS::Heat::Res | CREATE_FAILED   | 2025-06-         |
|               | ab81-ae8e0394cf6c    | ourceGroup    |                 | 16T17:39:57Z     |
| etcd_address_ | b5f90da8-dcd6-4779-  | Magnum::ApiGa | CREATE_COMPLETE | 2025-06-         |
| lb_switch     | 9756-4e80999e823e    | tewaySwitcher |                 | 16T17:39:57Z     |
| worker_nodes_ | 71f657f7-6df5-49f4-  | OS::Nova::Ser | CREATE_COMPLETE | 2025-06-         |
| server_group  | 9cd0-3a53af27943d    | verGroup      |                 | 16T17:39:57Z     |
| secgroup_rule | faa9d7e8-c494-4fc8-  | OS::Neutron:: | CREATE_COMPLETE | 2025-06-         |
| _udp_kube_min | 826a-8e8027d56160    | SecurityGroup |                 | 16T17:39:57Z     |
| ion_pods_cidr |                      | Rule          |                 |                  |
| secgroup_rule | 0c5a48d5-6128-44fa-  | OS::Neutron:: | CREATE_COMPLETE | 2025-06-         |
| _tcp_kube_min | 86d9-05c7f3dbc0a4    | SecurityGroup |                 | 16T17:39:57Z     |
| ion           |                      | Rule          |                 |                  |
| secgroup_kube | b8956d31-5f4a-46f6-  | OS::Neutron:: | CREATE_COMPLETE | 2025-06-         |
| _minion       | 901d-589123c95937    | SecurityGroup |                 | 16T17:39:57Z     |
| api_address_f | e34ab906-af20-4097-  | Magnum::Float | CREATE_COMPLETE | 2025-06-         |
| loating_switc | 8b0d-aa2e23528f75    | ingIPAddressS |                 | 16T17:39:57Z     |
| h             |                      | witcher       |                 |                  |
| api_address_l | 3819d25d-d49e-4dca-  | Magnum::ApiGa | CREATE_COMPLETE | 2025-06-         |
| b_switch      | 9393-bc0e14948d9e    | tewaySwitcher |                 | 16T17:39:57Z     |
| kube_cluster_ | c4bd20d0-c59c-4e26-  | OS::Heat::Sof | CREATE_COMPLETE | 2025-06-         |
| deploy        | bb87-d70a7d9ffa8c    | twareDeployme |                 | 16T17:39:57Z     |
|               |                      | nt            |                 |                  |
| kube_cluster_ | 0b727aeb-6926-4c8c-  | OS::Heat::Sof | CREATE_COMPLETE | 2025-06-         |
| config        | 96da-5ecb5f46851d    | twareConfig   |                 | 16T17:39:57Z     |
| kube_masters  | 60860d10-59ec-4ba6-  | OS::Heat::Res | CREATE_COMPLETE | 2025-06-         |
|               | 8244-8fcafdab9a10    | ourceGroup    |                 | 16T17:39:57Z     |
| etcd_lb       | c4bfc4a4-12a6-4557-  | file:///usr/l | CREATE_COMPLETE | 2025-06-         |
|               | b631-35b7bde3bb7b    | ib/python3/di |                 | 16T17:39:57Z     |
|               |                      | st-packages/m |                 |                  |
|               |                      | agnum/drivers |                 |                  |
|               |                      | /common/templ |                 |                  |
|               |                      | ates/lb_etcd. |                 |                  |
|               |                      | yaml          |                 |                  |
| master_nodes_ | 5b0df656-6dd6-4d62-  | OS::Nova::Ser | CREATE_COMPLETE | 2025-06-         |
| server_group  | b685-df576f91ee56    | verGroup      |                 | 16T17:39:57Z     |
| secgroup_kube | b5a70905-c778-40de-  | OS::Neutron:: | CREATE_COMPLETE | 2025-06-         |
| _master       | b29d-a5c9561ea14c    | SecurityGroup |                 | 16T17:39:57Z     |
| api_lb        | 91c0516f-642b-439e-  | file:///usr/l | CREATE_COMPLETE | 2025-06-         |
|               | b30e-1f691f032a9c    | ib/python3/di |                 | 16T17:39:57Z     |
|               |                      | st-packages/m |                 |                  |
|               |                      | agnum/drivers |                 |                  |
|               |                      | /common/templ |                 |                  |
|               |                      | ates/lb_api.y |                 |                  |
|               |                      | aml           |                 |                  |
| network       | 7b30dafe-41ef-4b41-  | file:///usr/l | CREATE_COMPLETE | 2025-06-         |
|               | 9470-0c59149082ac    | ib/python3/di |                 | 16T17:39:57Z     |
|               |                      | st-packages/m |                 |                  |
|               |                      | agnum/drivers |                 |                  |
|               |                      | /common/templ |                 |                  |
|               |                      | ates/network. |                 |                  |
|               |                      | yaml          |                 |                  |
+---------------+----------------------+---------------+-----------------+------------------+
$ sudo wmcs-openstack stack resource show b899cc99-bec8-47d7-8eb5-b5f31027bfb4 kube_minions
+------------------------+-----------------------------------------------------+
| Field                  | Value                                               |
+------------------------+-----------------------------------------------------+
| updated_time           | 2025-06-16T17:39:57Z                                |
| creation_time          | 2025-06-16T17:39:57Z                                |
| logical_resource_id    | kube_minions                                        |
| resource_name          | kube_minions                                        |
| physical_resource_id   | 0919918e-08ae-4744-ab81-ae8e0394cf6c                |
| resource_status        | CREATE_FAILED                                       |
| resource_status_reason | OverLimit: resources.kube_minions.resources[0].reso |
|                        | urces.docker_volume:                                |
|                        | VolumeSizeExceedsAvailableQuota: Requested volume   |
|                        | or snapshot exceeds allowed gigabytes quota.        |
|                        | Requested 80G, quota is 80G and 80G has been        |
|                        | consumed. (HTTP 413) (Request-ID:                   |
|                        | req-633dd966-395c-4489-bdc3-29600b34e0ff)           |
| resource_type          | OS::Heat::ResourceGroup                             |
| links                  | [{'href': 'https://openstack.eqiad1.wikimediacloud. |
|                        | org:28004/v1/c26d9d326bdf464fa1025939ded7e5a2/stack |
|                        | s/zuul-k8s-v127-t4s2nsgmy6at/b899cc99-bec8-47d7-    |
|                        | 8eb5-b5f31027bfb4/resources/kube_minions', 'rel':   |
|                        | 'self'}, {'href': 'https://openstack.eqiad1.wikimed |
|                        | iacloud.org:28004/v1/c26d9d326bdf464fa1025939ded7e5 |
|                        | a2/stacks/zuul-k8s-v127-t4s2nsgmy6at/b899cc99-bec8- |
|                        | 47d7-8eb5-b5f31027bfb4', 'rel': 'stack'}, {'href':  |
|                        | 'https://openstack.eqiad1.wikimediacloud.org:28004/ |
|                        | v1/admin/stacks/zuul-k8s-v127-t4s2nsgmy6at-kube_min |
|                        | ions-h4k2oqa5ppnd/0919918e-08ae-4744-ab81-          |
|                        | ae8e0394cf6c', 'rel': 'nested'}]                    |
| required_by            | []                                                  |
| description            |                                                     |
| attributes             | {'refs': None, 'refs_map': None, 'attributes':      |
|                        | None, 'removed_rsrc_list': []}                      |
+------------------------+-----------------------------------------------------+
Looks like the default template wants us to have 80G of volume quota per Kubernetes node.
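The arithmetic behind that failure can be sketched in a few lines of bash. This is only an illustration of the quota check, not how Cinder or Magnum implement it. The function names are hypothetical, and the reading that the control node's volume already consumed the first 80G is an assumption drawn from the error text.

```shell
# Sketch of the quota arithmetic. All sizes are in gigabytes.
# Function names are hypothetical and exist only for this example.

# volume_fits REQUESTED QUOTA CONSUMED
# Succeeds only if the request fits inside the remaining quota.
volume_fits() {
  local requested=$1 quota=$2 consumed=$3
  (( consumed + requested <= quota ))
}

# required_quota NODES PER_NODE
# Prints the total volume quota needed when every node gets its own volume.
required_quota() {
  local nodes=$1 per_node=$2
  echo $(( nodes * per_node ))
}

# Numbers from the error message: 80G requested, 80G quota, 80G consumed.
if volume_fits 80 80 80; then
  echo "worker volume fits"
else
  echo "worker volume exceeds quota"
fi

# One control node plus one worker node at 80G each.
required_quota 2 80
```

With these inputs the check fails and the two-node total comes to 160, so either the project quota has to grow or the per-node volume size has to shrink.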
I got past the quota problem temporarily by adjusting the Magnum template. The 2 node test cluster almost provisioned this time. Things blew up when OpenTofu was trying to create a kubeconfig for the new cluster:
│ Error: Error building kubeconfig for openstack_containerinfra_cluster_v1 4eab7c1e-bad5-44c8-8178-1ca256a81588: Error getting certificate authority: Expected HTTP response code [200] when accessing [GET https://openstack.eqiad1.wikimediacloud.org:29511/v1/certificates/4eab7c1e-bad5-44c8-8178-1ca256a81588], but got 406 instead: {"errors": [{"request_id": "", "code": "", "status": 406, "title": "Not Acceptable", "detail": "Invalid service type for OpenStack-API-Version header", "links": []}]}
│
│ with openstack_containerinfra_cluster_v1.k8s_v127,
│ on magnum.tf line 8, in resource "openstack_containerinfra_cluster_v1" "k8s_v127":
│ 8: resource "openstack_containerinfra_cluster_v1" "k8s_v127" {
bd808 opened https://gitlab.wikimedia.org/repos/releng/zuul/tofu-provisioning/-/merge_requests/6
Changes implemented while working on getting tofu apply to work
bd808 merged https://gitlab.wikimedia.org/repos/releng/zuul/tofu-provisioning/-/merge_requests/6
Changes implemented while working on getting tofu apply to work
bd808 opened https://gitlab.wikimedia.org/repos/releng/zuul/tofu-provisioning/-/merge_requests/7
tofu: add a bastion and a web proxy
bd808 merged https://gitlab.wikimedia.org/repos/releng/zuul/tofu-provisioning/-/merge_requests/7
tofu: add a bastion and a web proxy
The project now has:
- 1 Kubernetes control plane instance
- 1 Kubernetes worker instance
- 1 web proxy exposing the Kubernetes control plane at https://zuul-k8s.wmcloud.org
- 1 bastion host
I think the main thing left to figure out here is how to provision kubeconfig credentials on the bastion host to make things easier to debug in the cluster.
bd808 merged https://gitlab.wikimedia.org/repos/releng/zuul/tofu-provisioning/-/merge_requests/8
Provision files on bastion via ssh publisher
Note to self:
[23:59] < bd808> oh poop. now I have gitlab CI constraints to work around :/ masked and protected variables cannot contain whitespace and I have an ssh private key to get into the pipeline runtime :/
[23:59] <thcipriani> base64
[00:00] < bd808> yeah, that will work. I just need to adjust some things.
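A rough sketch of the base64 workaround, assuming bash and a `base64` that accepts `-d` for decoding (as GNU coreutils does). The function names and the sample key text are hypothetical placeholders. This is not the code from the merge request that followed.

```shell
# Sketch only: function names and the sample key are made up for illustration.

# encode_key: read a multi-line secret on stdin and print it as one base64
# line with no whitespace, which is the shape a masked CI variable needs.
encode_key() {
  base64 | tr -d '\n'
}

# decode_key: reverse the encoding inside the pipeline job.
decode_key() {
  base64 -d
}

# A dummy multi-line value standing in for an ssh private key.
sample_key=$'-----BEGIN EXAMPLE KEY-----\nnot a real key\n-----END EXAMPLE KEY-----'

encoded=$(printf '%s\n' "$sample_key" | encode_key)
decoded=$(printf '%s' "$encoded" | decode_key)

if [ "$decoded" = "$sample_key" ]; then
  echo "round trip ok"
fi
```

In a pipeline the encoded string would be stored as the variable value and decoded into a file with restrictive permissions before ssh uses it.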
bd808 opened https://gitlab.wikimedia.org/repos/releng/zuul/tofu-provisioning/-/merge_requests/9
bastion: base64 encode ssh private key
bd808 merged https://gitlab.wikimedia.org/repos/releng/zuul/tofu-provisioning/-/merge_requests/9
bastion: base64 encode ssh private key
@taavi suggested that if the cluster used IPv6 addresses it would be possible to talk to it from the production network without using the Cloud VPS https proxy service. I have been attempting to provision a Magnum cluster with IPv6, but I fear that it is not currently possible. The problem appears to be that the Magnum template needs both the fixed network and fixed subnet to attach instances to. When I set these to VXLAN/IPv6-dualstack and vxlan-dualstack-ipv6 the instances do not seem to be able to connect to the OpenStack user_data service at http://169.254.169.254/openstack/latest/user_data. I haven't been able to find a way to pass both the IPv6 and IPv4 subnets to the template.
Maybe the next best thing would be to implement a proxy service (HAProxy?) to sit between an IPv4 k8s cluster and the prod network? I do like the idea of avoiding the shared https proxy by using IPv6 addressing instead.
If Magnum doesn't support dual-stack clusters, then I would consider that a bug that should be fixed separately.
Ok. This looks like the thing that I need to do next. Of course things are not as simple as "just do that with tofu". There is a good pattern to follow in https://gitlab.wikimedia.org/repos/cloud/metricsinfra/tofu-provisioning, but to apply it in the zuul project I am also going to need to introduce a project local puppetserver so I can do ops/puppet.git related work without getting blocked on upstream merges. So the next block of work here is something like:
- add tofu to provision a puppetserver
- add tofu to make instances use the local puppetserver
- write a profile::zuul::haproxy module (bikeshedding will probably happen on the namespace there)
- add tofu to make an IPv6 addressable haproxy instance using the new Puppet manifest
- profit!
I filed T397994: [tofu-cloudvps] Document using `cloudvps_puppet_project` to manage project-wide and instance specific puppet classes and hiera settings. Until there is a fix for that, the Project Puppet settings will need to be managed via Hiera.
bd808 opened https://gitlab.wikimedia.org/repos/releng/zuul/tofu-provisioning/-/merge_requests/11
puppetserver: Provision a project local puppetserver
bd808 merged https://gitlab.wikimedia.org/repos/releng/zuul/tofu-provisioning/-/merge_requests/11
puppetserver: Provision a project local puppetserver
bd808 opened https://gitlab.wikimedia.org/repos/releng/zuul/tofu-provisioning/-/merge_requests/13
puppetserver: Manage Project Puppet settings
bd808 merged https://gitlab.wikimedia.org/repos/releng/zuul/tofu-provisioning/-/merge_requests/13
puppetserver: Manage Project Puppet settings
Change #1166006 had a related patch set uploaded (by BryanDavis; author: Bryan Davis):
[operations/puppet@production] zuul: Add profile::zuul::haproxy for Cloud VPS project
bd808 opened https://gitlab.wikimedia.org/repos/releng/zuul/tofu-provisioning/-/merge_requests/16
haproxy: Add module to provision HAProxy service
Change #1166263 had a related patch set uploaded (by BryanDavis; author: Bryan Davis):
[operations/puppet@production] gitlab: Allow WMCS runners to talk to puppet-enc.cloudinfra
bd808 merged https://gitlab.wikimedia.org/repos/releng/zuul/tofu-provisioning/-/merge_requests/16
haproxy: Add module to provision HAProxy service
Progress!
$ ping6 -c3 k8s-api.svc.zuul.eqiad1.wikimedia.cloud
PING6(56=40+8+8 bytes) 2001:470:b:530:24da:9d14:4e1d:40e0 --> 2a02:ec80:a000:1::2e8
16 bytes from 2a02:ec80:a000:1::2e8, icmp_seq=0 hlim=54 time=79.734 ms
16 bytes from 2a02:ec80:a000:1::2e8, icmp_seq=1 hlim=54 time=76.439 ms
16 bytes from 2a02:ec80:a000:1::2e8, icmp_seq=2 hlim=54 time=77.780 ms

--- k8s-api.svc.zuul.eqiad1.wikimedia.cloud ping6 statistics ---
3 packets transmitted, 3 packets received, 0.0% packet loss
round-trip min/avg/max/std-dev = 76.439/77.984/79.734/1.353 ms
$ curl -6k https://k8s-api.svc.zuul.eqiad1.wikimedia.cloud:6443
{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {},
  "status": "Failure",
  "message": "forbidden: User \"system:anonymous\" cannot get path \"/\"",
  "reason": "Forbidden",
  "details": {},
  "code": 403
}
These tests were run from my local laptop. The k8s-api.svc.zuul.eqiad1.wikimedia.cloud service name points at an HAProxy instance operating in Layer 4 mode. That reverse proxy connects via leastconn balancing to the active Magnum-managed Kubernetes cluster master nodes. I haven't found a way to add additional SANs to the X.509 certificate that is generated for the service, so client validation of the certificate isn't really possible at this point.
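For readers unfamiliar with leastconn, the selection rule can be illustrated with a toy bash function. This is a simplified model of the idea (send the next connection to the backend with the fewest active connections), not HAProxy's implementation. The backend names and connection counts are invented.

```shell
# Toy model of leastconn selection. Backend names and counts are invented.

# pick_leastconn: read "name active_connections" pairs on stdin, one per
# line, and print the name of the backend with the fewest connections.
# On a tie the first backend listed wins.
pick_leastconn() {
  awk 'NR == 1 || $2 < min { min = $2; name = $1 } END { print name }'
}

printf '%s\n' 'master-0 12' 'master-1 3' 'master-2 7' | pick_leastconn
```

Here the function prints `master-1`, the backend with only 3 active connections. With a single control plane node, as in this test cluster, the choice is trivial, but the same rule applies once more masters are added.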
Change #1166263 merged by Dzahn:
[operations/puppet@production] gitlab: Allow WMCS runners to talk to puppet-enc.cloudinfra
ferm restarted on all wmcs runners and verified they have the iptables rule now for enc-1.cloudinfra.eqiad1.wikimedia.cloud tcp dpt:https.
bd808 opened https://gitlab.wikimedia.org/repos/releng/zuul/tofu-provisioning/-/merge_requests/17
kubernetes: build a v1.28 cluster
bd808 merged https://gitlab.wikimedia.org/repos/releng/zuul/tofu-provisioning/-/merge_requests/17
kubernetes: build a v1.28 cluster
Change #1166006 merged by Dzahn:
[operations/puppet@production] zuul: Add profile::zuul::haproxy for Cloud VPS project
New upstream bug reports for https://github.com/terraform-provider-openstack/terraform-provider-openstack:
- Update of openstack_containerinfra_cluster_v1 fails with unexpected state 'CREATE_COMPLETE' #1939
- Update of openstack_containerinfra_cluster_v1 that reduces worker count fails #1941
I have proposed a patch for the first one that seems to work locally. The second will require more research. A quick look at the code doesn't immediately explain how the Horizon dashboard requests a downscale differently.
TODO: request the config changes needed from WMCS to allow the zuul project to use more Ceph IOPS like T406271: Grant gitlab-runners-staging access to fast-iops volume type and a 4xiops instance flavor.