
[tofu-infra] "tofu plan" failing in codfw
Closed, Resolved · Public · BUG REPORT

Description

Steps to replicate the issue (include links if applicable):

$ ssh cloudcumin1001
fnegri@cloudcumin1001:~$ sudo cookbook wmcs.openstack.tofu

What happens?:

Planning failed. OpenTofu encountered an error while generating this plan.

╷
│ Error: Error retrieving openstack_dns_zone_v2 7a2210a4-3a55-48d8-9713-0256d7d9bc1b: Expected HTTP response code [200] when accessing [GET https://openstack.codfw1dev.wikimediacloud.org:29001/v2/zones/7a2210a4-3a55-48d8-9713-0256d7d9bc1b], but got 504 instead: {"code": 504, "type": "timeout", "request_id": "req-1306d84a-e117-4bc1-ba3b-5441e5730424"}
│
│   with module.project["cloudinfra-codfw1dev"].openstack_dns_zone_v2.zone["1.0.0.0.0.0.1.a.0.8.c.e.2.0.a.2.ip6.arpa."],
│   on modules/project/dns.tf line 21, in resource "openstack_dns_zone_v2" "zone":
│   21: resource "openstack_dns_zone_v2" "zone" {
│
╵
╷
│ Error: Error retrieving openstack_dns_zone_v2 4c754100-1790-4858-a583-9de93c9e8b3d: Expected HTTP response code [200] when accessing [GET https://openstack.codfw1dev.wikimediacloud.org:29001/v2/zones/4c754100-1790-4858-a583-9de93c9e8b3d], but got 504 instead: {"code": 504, "type": "timeout", "request_id": "req-1285f282-c142-435d-959d-467757f87652"}
│
│   with module.project["cloudinfra-codfw1dev"].openstack_dns_zone_v2.zone["codfw1dev.wikimedia.cloud."],
│   on modules/project/dns.tf line 21, in resource "openstack_dns_zone_v2" "zone":
│   21: resource "openstack_dns_zone_v2" "zone" {
│
╵
╷
│ Error: Error retrieving openstack_dns_zone_v2 1d558d0d-999c-4547-ad9a-e6bcdf125f4e: Expected HTTP response code [200] when accessing [GET https://openstack.codfw1dev.wikimediacloud.org:29001/v2/zones/1d558d0d-999c-4547-ad9a-e6bcdf125f4e], but got 504 instead: {"code": 504, "type": "timeout", "request_id": "req-620ca6cb-69e2-42be-bc99-d385ecfe44c7"}
│
│   with module.project["cloudinfra-codfw1dev"].openstack_dns_zone_v2.zone["svc.codfw1dev.wikimedia.cloud."],
│   on modules/project/dns.tf line 21, in resource "openstack_dns_zone_v2" "zone":
│   21: resource "openstack_dns_zone_v2" "zone" {

EDIT: This is now failing with a different error (see comments below):

root@cloudcontrol2005-dev:/srv/tofu-infra# TF_LOG=1 tofu init
[...]
2026-01-07T09:56:14.586Z [DEBUG] backend-s3: HTTP Response Received: aws.region=codfw1dev-r aws.s3.bucket=admin:tofu-state aws.s3.key=repos/cloud/cloud-vps/tofu-infra rpc.method=HeadObject rpc.service=S3 rpc.system=aws-api tf_aws.sdk=aws-sdk-go-v2 tf_aws.signing_region="" http.response.header.date="Wed, 07 Jan 2026 09:56:14 GMT" http.response.body="" http.duration=62 http.response.header.x_amz_request_id=tx00000b360a5a3ceb59297-00695e2dbe-123016c-default http.response.header.content_type=application/xml http.response.header.content_security_policy="default-src; font-src 'self'; img-src 'self' data:; style-src 'self' 'unsafe-inline'" http.status_code=403 http.response_content_length=210 http.response.header.accept_ranges=bytes
2026-01-07T09:56:14.587Z [DEBUG] backend-s3: request failed with unretryable error https response error StatusCode: 403, RequestID: tx00000b360a5a3ceb59297-00695e2dbe-123016c-default, HostID: , api error Forbidden: Forbidden: aws.region=codfw1dev-r aws.s3.bucket=admin:tofu-state aws.s3.key=repos/cloud/cloud-vps/tofu-infra rpc.method=HeadObject rpc.service=S3 rpc.system=aws-api tf_aws.sdk=aws-sdk-go-v2
Error refreshing state: operation error S3: HeadObject, https response error StatusCode: 403, RequestID: tx00000b360a5a3ceb59297-00695e2dbe-123016c-default, HostID: , api error Forbidden: Forbidden

Event Timeline

Restricted Application added a subscriber: Aklapper.
dcaro triaged this task as Medium priority. Nov 19 2025, 8:34 AM

It's now failing with a different error: a 403 when accessing the state file via S3 HeadObject:

Error refreshing state: operation error S3: HeadObject, https response error StatusCode: 403, RequestID: tx000004ec40b36966978a4-0069246be7-b2b35c-default, HostID: , api error Forbidden: Forbidden

I checked on cloudcontrol2005-dev and the credentials in /etc/tofu.env are correct. I can use them with awscli, but listing the file randomly fails with "argument of type 'NoneType' is not iterable":

root@cloudcontrol2005-dev:/srv/tofu-infra# aws s3 ls s3://tofu-state/repos/cloud/cloud-vps/tofu-infra
2025-11-10 15:46:41     198443 tofu-infra
root@cloudcontrol2005-dev:/srv/tofu-infra# aws s3 ls s3://tofu-state/repos/cloud/cloud-vps/tofu-infra
2025-11-10 15:46:41     198443 tofu-infra
root@cloudcontrol2005-dev:/srv/tofu-infra# aws s3 ls s3://tofu-state/repos/cloud/cloud-vps/tofu-infra

argument of type 'NoneType' is not iterable
root@cloudcontrol2005-dev:/srv/tofu-infra# aws s3 ls s3://tofu-state/repos/cloud/cloud-vps/tofu-infra
2025-11-10 15:46:41     198443 tofu-infra
root@cloudcontrol2005-dev:/srv/tofu-infra# aws s3 ls s3://tofu-state/repos/cloud/cloud-vps/tofu-infra
2025-11-10 15:46:41     198443 tofu-infra
root@cloudcontrol2005-dev:/srv/tofu-infra# aws s3 ls s3://tofu-state/repos/cloud/cloud-vps/tofu-infra

argument of type 'NoneType' is not iterable
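To put a number on "randomly fails", the same listing can be repeated in a loop and the outcomes tallied. This is only a rough sketch: `classify` and `probe` are hypothetical helper names, the run count is arbitrary, and it assumes the error text shows up on stdout or stderr (the exit code awscli uses for this failure was not checked).

```python
import subprocess
from collections import Counter

# Same command as in the transcript above; credentials are assumed to be
# loaded in the environment already (e.g. from /etc/tofu.env).
CMD = ["aws", "s3", "ls", "s3://tofu-state/repos/cloud/cloud-vps/tofu-infra"]


def classify(returncode, output):
    """Label one run as 'nonetype', 'ok' or 'other' (hypothetical labels)."""
    if "NoneType" in output:
        return "nonetype"
    return "ok" if returncode == 0 else "other"


def probe(runs=20, cmd=CMD, runner=subprocess.run):
    """Run the command `runs` times and count each kind of outcome."""
    counts = Counter()
    for _ in range(runs):
        result = runner(cmd, capture_output=True, text=True)
        counts[classify(result.returncode, result.stdout + result.stderr)] += 1
    return counts
```

Calling `probe()` on the affected host returns a `Counter` with the number of runs per label, which makes it easier to compare failure rates between test sessions.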
fnegri renamed this task from [tofu-infra] tofu failing to retrieve DNS zones on codfw to [tofu-infra] "tofu plan" failing in codfw. Nov 24 2025, 2:59 PM

Tests suggest that the 'NoneType' error happens 100% of the time when cloudcontrol2005-dev is the radosgw backend, and 0% of the time with either other cloudcontrol as the backend.

Change #1211782 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] codfw1dev cloudlb: try 'source' balance method

https://gerrit.wikimedia.org/r/1211782

Change #1211782 merged by Andrew Bogott:

[operations/puppet@production] codfw1dev cloudlb: try 'source' balance method

https://gerrit.wikimedia.org/r/1211782

> Tests suggest that the 'NoneType' error happens 100% of the time when cloudcontrol2005-dev is the radosgw backend, and 0% of the time with either other cloudcontrol as the backend.

I no longer think this is true; new tests today show intermittent failures on different API servers. So now I think the problem is somewhere in the Ceph layer.

This is probably unrelated, but it /is/ a concern with Ceph and Trixie (right now the Ceph hosts themselves are running Bookworm, but the radosgw is on Trixie):

https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/MIAKAJH7V2U7KWXU4LPZSEZOJTBYRJ6H/

> This is probably unrelated, but

The fact that eqiad1 is now running all the same versions without issue makes it even less likely that this is the result of that upstream issue on Trixie.

fnegri raised the priority of this task from Medium to High. Jan 7 2026, 10:02 AM

Raising to "High" as this is preventing us from using tofu in codfw. It is also causing an error when running the cookbook wmcs.vps.create_project, which tries to run tofu on both eqiad and codfw. We could change the cookbook to only run tofu in eqiad, but I think that would be an ugly workaround.

> I can use them with awscli but listing the file randomly fails with argument of type 'NoneType' is not iterable:

I tried this again, and it's now always failing with the same error:

root@cloudcontrol2005-dev:/srv/tofu-infra# source /etc/tofu.env
root@cloudcontrol2005-dev:/srv/tofu-infra# aws s3 ls s3://tofu-state/repos/cloud/cloud-vps/tofu-infra

argument of type 'NoneType' is not iterable
fnegri updated the task description.

The Swift API is working fine. I can run this and get the file in both deployments:

openstack object save tofu-state repos/cloud/cloud-vps/tofu-infra --os-cloud tofu --file foo.txt

So, the file exists at least!

I think this is the real error message, and the NoneType exception happens because the client is unable to parse it:

2026-01-09 21:44:22,782 - MainThread - botocore.parsers - DEBUG - Response body:
b'<?xml version="1.0" encoding="UTF-8"?><Error><Code>AccessDenied</Code><Message></Message><RequestId>tx00000fdc426d9ba989f73-00696176b6-1230195-default</RequestId><HostId>1230195-default-default</HostId></Error>'
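One plausible way an empty `<Message>` element turns into the NoneType exception can be sketched in a few lines. This is not the actual awscli/botocore code path, which was not inspected here; `parse_error`, `mentions` and `mentions_safely` are hypothetical helpers written only to illustrate the idea.

```python
import xml.etree.ElementTree as ET

# Response body copied from the botocore debug log above.
BODY = (
    b'<?xml version="1.0" encoding="UTF-8"?><Error><Code>AccessDenied</Code>'
    b'<Message></Message>'
    b'<RequestId>tx00000fdc426d9ba989f73-00696176b6-1230195-default</RequestId>'
    b'<HostId>1230195-default-default</HostId></Error>'
)


def parse_error(body):
    """Flatten the <Error> children into a dict of tag -> text."""
    root = ET.fromstring(body)
    return {child.tag: child.text for child in root}


def mentions(error, needle):
    """Hypothetical consumer that assumes Message is always a string."""
    return needle in error["Message"]  # TypeError when Message is None


def mentions_safely(error, needle):
    """Same check, but treating a missing message as an empty string."""
    return needle in (error["Message"] or "")


error = parse_error(BODY)
# An empty element has no text, so ElementTree reports it as None.
assert error["Code"] == "AccessDenied"
assert error["Message"] is None
```

With this parsed dict, `mentions(error, "expired")` raises a `TypeError` about `'NoneType'`, while `mentions_safely(error, "expired")` simply returns `False`. Either way, the underlying failure is the `AccessDenied` code, not the parsing problem.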

That error appears even when radosgw is turned off entirely? That suggests that the problem is in the catalog lookup.

At least part of this was related to the 'swift' service user having different role memberships on the 'service' project. In eqiad1 it had:

  • member
  • admin
  • keystonevalidate
  • reader

And in codfw1dev it had:

  • keystonevalidate
  • reader

That should never have worked. Radosgw uses the swift service user to validate tokens, and the 's3tokens_validate' rule requires either the 'service' or 'admin' role. I've now updated role assignments to correspond between eqiad1 and codfw1dev and I'm getting further.
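The mismatch can be written down as a small set comparison. This is a sketch based only on the role lists and the rule as described above; `can_validate_s3_tokens` is a hypothetical helper, not how Keystone actually evaluates its policy.

```python
# Roles of the 'swift' service user on the 'service' project, as listed above
# (codfw1dev values are from before the fix).
EQIAD1_ROLES = {"member", "admin", "keystonevalidate", "reader"}
CODFW1DEV_ROLES = {"keystonevalidate", "reader"}

# Per the comment above, 's3tokens_validate' needs one of these roles.
REQUIRED_ANY = {"service", "admin"}


def can_validate_s3_tokens(roles):
    """Return True if at least one of the required roles is present."""
    return bool(set(roles) & REQUIRED_ANY)


# Roles present in eqiad1 but absent in codfw1dev.
missing_in_codfw1dev = EQIAD1_ROLES - CODFW1DEV_ROLES

assert can_validate_s3_tokens(EQIAD1_ROLES)
assert not can_validate_s3_tokens(CODFW1DEV_ROLES)
assert missing_in_codfw1dev == {"member", "admin"}
```

Of the two missing roles, only 'admin' matters for the rule as stated; 'member' is part of the difference but would not satisfy it on its own.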

This doesn't really explain the intermittent success earlier, or the change from intermittent failure to constant failure.

fnegri changed the task status from Open to In Progress. Jan 12 2026, 9:39 AM
fnegri assigned this task to Andrew.

@Andrew assigning to you as you're working on it :)

I'm worried that there's another bug hiding behind this one, but for now everything seems to be working.