tofu-provisioning: Failed to install provider
Closed, Resolved · Public · BUG REPORT

Description

The GitLab pipelines for the tofu-provisioning repo frequently fail with:

│ Error: Failed to install provider
│ 
│ Error while installing terraform-provider-openstack/openstack v3.0.0: could
│ not query provider registry for
│ registry.opentofu.org/terraform-provider-openstack/openstack: failed to
│ retrieve authentication checksums for provider: the request failed after 2
│ attempts, please try again later: Get
│ "https://release-assets.githubusercontent.com/github-production-release-asset/93446101/fd7160c0-4575-4746-b957-b15f98397df3?sp=r&sv=2018-11-09&sr=b&spr=https&se=2025-09-26T14%3A25%3A51Z&rscd=attachment%3B+filename%3Dterraform-provider-openstack_3.0.0_SHA256SUMS&rsct=application%2Foctet-stream&skoid=96c2d410-5711-43a1-aedd-ab1947aa7ab0&sktid=398a6654-997b-47e9-b12b-9515b896b4de&skt=2025-09-26T13%3A25%3A42Z&ske=2025-09-26T14%3A25%3A51Z&sks=b&skv=2018-11-09&sig=Q7l7AqeOG2UDISGpbVW7vpWyzadIgI6P%2B5Bf1%2Bpd0lA%3D&jwt=eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmVsZWFzZS1hc3NldHMuZ2l0aHVidXNlcmNvbnRlbnQuY29tIiwia2V5Ijoia2V5MSIsImV4cCI6MTc1ODg5NDgwMSwibmJmIjoxNzU4ODk0NTAxLCJwYXRoIjoicmVsZWFzZWFzc2V0cHJvZHVjdGlvbi5ibG9iLmNvcmUud2luZG93cy5uZXQifQ.1WbH0zUKfAkpHLmSyA7Z-pVFFARvOsyfG59SqQ8H3GA&response-content-disposition=attachment%3B%20filename%3Dterraform-provider-openstack_3.0.0_SHA256SUMS&response-content-type=application%2Foctet-stream":
│ net/http: request canceled while waiting for connection (Client.Timeout
│ exceeded while awaiting headers)

A first attempted fix was T403028 (toolforge tofu-provisioning: Cache terraform-provider-openstack binary somewhere), but it did not work because caches are not currently working in our GitLab instance (see T365772: Configure cache store for Gitlab WMCS runners).

An upstream discussion about this issue suggests the root cause could be rate limiting from GitHub:
https://github.com/orgs/community/discussions/8535

Event Timeline

Restricted Application added a subscriber: Aklapper.
fnegri triaged this task as High priority. Sep 26 2025, 2:03 PM

Maybe this new OpenTofu feature could help (haven't read the details yet):

OCI Registry Support - Distribute providers and modules through container registries, perfect for air-gapped environments

I found some interesting things:

  • I can reproduce the failure with curl from the GitLab worker VM (runner-1033.gitlab-runners.eqiad1.wikimedia.cloud), running curl inside our custom docker image docker-registry.svc.toolforge.org/tofu-provisioning:20250512
  • I can NOT reproduce the failure with curl running in the same worker, but outside of Docker
  • I can NOT reproduce the failure with curl running in the same worker, inside a vanilla debian image

With our custom image:

root@runner-1033:~# docker run --rm -it docker-registry.svc.toolforge.org/tofu-provisioning:20250512 bash
root@310addea10a5:/# for i in {1..20}; do curl 'https://release-assets.githubusercontent.com/github-production-release-asset/93446101/fd7160c0-4575-4746-b957-b15f98397df3?sp=r&sv=2018-11-09&sr=b&spr=https&se=2025-10-07T10%3A41%3A38Z&rscd=attachment%3B+filename%3Dterraform-provider-openstack_3.0.0_SHA256SUMS&rsct=application%2Foctet-stream&skoid=96c2d410-5711-43a1-aedd-ab1947aa7ab0&sktid=398a6654-997b-47e9-b12b-9515b896b4de&skt=2025-10-07T09%3A41%3A13Z&ske=2025-10-07T10%3A41%3A38Z&sks=b&skv=2018-11-09&sig=ZGn1MR60oJbZWP2J08f5ArPuaPs9FR%2F0K%2B3VW7Ad2YI%3D&jwt=eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmVsZWFzZS1hc3NldHMuZ2l0aHVidXNlcmNvbnRlbnQuY29tIiwia2V5Ijoia2V5MSIsImV4cCI6MTc1OTgzMTQwMywibmJmIjoxNzU5ODMxMTAzLCJwYXRoIjoicmVsZWFzZWFzc2V0cHJvZHVjdGlvbi5ibG9iLmNvcmUud2luZG93cy5uZXQifQ.OEgxlMlY3vloB8lqP1RFqxQiv0VMsLrw9R8u4szMH_4&response-content-disposition=attachment%3B%20filename%3Dterraform-provider-openstack_3.0.0_SHA256SUMS&response-content-type=application%2Foctet-stream' -s -o /dev/null --connect-timeout 1 && echo SUCCESS || echo FAILURE; done
FAILURE
FAILURE
SUCCESS
FAILURE
SUCCESS
SUCCESS
FAILURE
SUCCESS
FAILURE
FAILURE
SUCCESS
FAILURE
SUCCESS
SUCCESS
SUCCESS
SUCCESS
SUCCESS
SUCCESS
SUCCESS
FAILURE

Outside of Docker:

root@runner-1033:~# for i in {1..20}; do curl 'https://release-assets.githubusercontent.com/github-production-release-asset/93446101/fd7160c0-4575-4746-b957-b15f98397df3?sp=r&sv=2018-11-09&sr=b&spr=https&se=2025-10-07T10%3A41%3A38Z&rscd=attachment%3B+filename%3Dterraform-provider-openstack_3.0.0_SHA256SUMS&rsct=application%2Foctet-stream&skoid=96c2d410-5711-43a1-aedd-ab1947aa7ab0&sktid=398a6654-997b-47e9-b12b-9515b896b4de&skt=2025-10-07T09%3A41%3A13Z&ske=2025-10-07T10%3A41%3A38Z&sks=b&skv=2018-11-09&sig=ZGn1MR60oJbZWP2J08f5ArPuaPs9FR%2F0K%2B3VW7Ad2YI%3D&jwt=eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmVsZWFzZS1hc3NldHMuZ2l0aHVidXNlcmNvbnRlbnQuY29tIiwia2V5Ijoia2V5MSIsImV4cCI6MTc1OTgzMTQwMywibmJmIjoxNzU5ODMxMTAzLCJwYXRoIjoicmVsZWFzZWFzc2V0cHJvZHVjdGlvbi5ibG9iLmNvcmUud2luZG93cy5uZXQifQ.OEgxlMlY3vloB8lqP1RFqxQiv0VMsLrw9R8u4szMH_4&response-content-disposition=attachment%3B%20filename%3Dterraform-provider-openstack_3.0.0_SHA256SUMS&response-content-type=application%2Foctet-stream' -s -o /dev/null --connect-timeout 1 && echo SUCCESS || echo FAILURE; done
SUCCESS
SUCCESS
SUCCESS
SUCCESS
SUCCESS
SUCCESS
SUCCESS
SUCCESS
SUCCESS
SUCCESS
SUCCESS
SUCCESS
SUCCESS
SUCCESS
SUCCESS
SUCCESS
SUCCESS
SUCCESS
SUCCESS
SUCCESS

In a vanilla debian image:

root@runner-1033:~# docker run --rm -it debian
root@06b7b08748af:/# apt update && apt install curl
[...]
root@4facf8c3ce6a:/# for i in {1..20}; do curl 'https://release-assets.githubusercontent.com/github-production-release-asset/93446101/fd7160c0-4575-4746-b957-b15f98397df3?sp=r&sv=2018-11-09&sr=b&spr=https&se=2025-10-07T10%3A41%3A38Z&rscd=attachment%3B+filename%3Dterraform-provider-openstack_3.0.0_SHA256SUMS&rsct=application%2Foctet-stream&skoid=96c2d410-5711-43a1-aedd-ab1947aa7ab0&sktid=398a6654-997b-47e9-b12b-9515b896b4de&skt=2025-10-07T09%3A41%3A13Z&ske=2025-10-07T10%3A41%3A38Z&sks=b&skv=2018-11-09&sig=ZGn1MR60oJbZWP2J08f5ArPuaPs9FR%2F0K%2B3VW7Ad2YI%3D&jwt=eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmVsZWFzZS1hc3NldHMuZ2l0aHVidXNlcmNvbnRlbnQuY29tIiwia2V5Ijoia2V5MSIsImV4cCI6MTc1OTgzMTQwMywibmJmIjoxNzU5ODMxMTAzLCJwYXRoIjoicmVsZWFzZWFzc2V0cHJvZHVjdGlvbi5ibG9iLmNvcmUud2luZG93cy5uZXQifQ.OEgxlMlY3vloB8lqP1RFqxQiv0VMsLrw9R8u4szMH_4&response-content-disposition=attachment%3B%20filename%3Dterraform-provider-openstack_3.0.0_SHA256SUMS&response-content-type=application%2Foctet-stream' -s -o /dev/null --connect-timeout 1 && echo SUCCESS || echo FAILURE; done
SUCCESS
SUCCESS
SUCCESS
SUCCESS
SUCCESS
SUCCESS
SUCCESS
SUCCESS
SUCCESS
SUCCESS
SUCCESS
SUCCESS
SUCCESS
SUCCESS
SUCCESS
SUCCESS
SUCCESS
SUCCESS
SUCCESS
SUCCESS

I can also reproduce the error using docker-registry.wikimedia.org/bookworm or debian:bookworm.

I can NOT reproduce it using docker-registry.wikimedia.org/trixie, so moving the image to Trixie seems to be the way to fix this. But I don't understand why the problem is not present outside of Docker, given that the worker VM is running Bookworm.

Change #1194167 had a related patch set uploaded (by FNegri; author: FNegri):

[operations/puppet@production] aptrepo: Add tofu package to trixie

https://gerrit.wikimedia.org/r/1194167

fnegri changed the task status from Open to In Progress. Oct 7 2025, 11:05 AM
fnegri claimed this task.

Change #1194167 merged by FNegri:

[operations/puppet@production] aptrepo: Add tofu package to trixie

https://gerrit.wikimedia.org/r/1194167

fnegri removed fnegri as the assignee of this task. Edited Oct 7 2025, 5:08 PM

I merged the patch above, but reprepro is failing due to a problem with another package:

root@apt1002:~# reprepro --component thirdparty/tofu checkupdate trixie-wikimedia
aptmethod error receiving 'https://deb.nodesource.com/node_22.x/dists/trixie/InRelease':
'404  Not Found [IP: 2606:4700:10::ac42:96a9 443]'
aptmethod error receiving 'https://deb.nodesource.com/node_22.x/dists/trixie/Release':
'404  Not Found [IP: 2606:4700:10::ac42:96a9 443]'
aptmethod error receiving 'https://deb.nodesource.com/node_22.x/dists/trixie/Release.gpg':
'404  Not Found [IP: 2606:4700:10::ac42:96a9 443]'
There have been errors!

I'm on PTO from tomorrow, so I'm unassigning myself from this task in case somebody else wants to pick it up while I'm away.

Mentioned in SAL (#wikimedia-operations) [2025-10-07T17:26:42Z] <taavi> taavi@apt1002 ~ $ sudo -i reprepro -C thirdparty/tofu update trixie-wikimedia # T405742

I pushed a new Trixie-based image: docker-registry.svc.toolforge.org/tofu-provisioning:20251014b

Things look better, but I'm still getting some failures, for example this job:

│ Error: Failed to install provider
│ 
│ Error while installing terraform-provider-openstack/openstack v3.3.2: could
│ not query provider registry for
│ registry.opentofu.org/terraform-provider-openstack/openstack: failed to
│ retrieve authentication checksums for provider: request failed after 2
│ attempts: Get
│ "https://release-assets.githubusercontent.com/github-production-release-asset/93446101/8c6e6894-d8ab-4d2b-aa9e-a29460edc89c?sp=r&sv=2018-11-09&sr=b&spr=https&se=2025-10-14T13%3A05%3A13Z&rscd=attachment%3B+filename%3Dterraform-provider-openstack_3.3.2_SHA256SUMS&rsct=application%2Foctet-stream&skoid=96c2d410-5711-43a1-aedd-ab1947aa7ab0&sktid=398a6654-997b-47e9-b12b-9515b896b4de&skt=2025-10-14T12%3A04%3A17Z&ske=2025-10-14T13%3A05%3A13Z&sks=b&skv=2018-11-09&sig=8BKUqRTj6tuG5FWZtY4ZR3ebkdw167Kdm5FnL0pyg%2B0%3D&jwt=eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmVsZWFzZS1hc3NldHMuZ2l0aHVidXNlcmNvbnRlbnQuY29tIiwia2V5Ijoia2V5MSIsImV4cCI6MTc2MDQ0NDY5MCwibmJmIjoxNzYwNDQ0MzkwLCJwYXRoIjoicmVsZWFzZWFzc2V0cHJvZHVjdGlvbi5ibG9iLmNvcmUud2luZG93cy5uZXQifQ.Ft-E8PUItFHD04nHZSmSTFh3pOBL_VjYkok5b8j8908&response-content-disposition=attachment%3B%20filename%3Dterraform-provider-openstack_3.3.2_SHA256SUMS&response-content-type=application%2Foctet-stream":
│ net/http: request canceled while waiting for connection (Client.Timeout
│ exceeded while awaiting headers)
╵

I can reproduce the failures using curl in the new image, and also using curl in the base image (docker-registry.wikimedia.org/trixie).

There is a clear difference between the old and the new image though:

root@runner-1033:~# docker run --rm -it docker-registry.svc.toolforge.org/tofu-provisioning:20250512 bash
root@d8de7bcb964e:/# TESTURL='...'
root@d8de7bcb964e:/# for i in {1..50}; do curl $TESTURL -s -o /dev/null --connect-timeout 1 && echo -n '.' || echo -n 'F'; done; echo ''
...F......F....FFF.F.FF....F.......F.....F.....F.F

root@runner-1033:~# docker run --rm -it docker-registry.svc.toolforge.org/tofu-provisioning:20251014b bash
root@030bb1020b54:/# TESTURL='...'
root@030bb1020b54:/# for i in {1..50}; do curl $TESTURL -s -o /dev/null --connect-timeout 1 && echo -n '.' || echo -n 'F'; done; echo ''
.F................................................

I found something interesting: adding --net=host fixes the issue, regardless of which docker image I'm using.

Minimal reproduction scenario:

root@runner-1033:~# docker run --rm -it docker-registry.svc.toolforge.org/tofu-provisioning:20250512 bash
root@69ec99fd53f5:/# TESTURL='https://github.com/terraform-provider-openstack/terraform-provider-openstack/releases/download/v3.3.2/terraform-provider-openstack_3.3.2_SHA256SUMS'
root@69ec99fd53f5:/# for i in {1..50}; do curl $TESTURL -s -o /dev/null -L --connect-timeout 1 && echo -n '.' || echo -n 'F'; done; echo ''
F..F.F.FFFFF.FF.FF..F....FFF.....FFF........F....F

root@runner-1033:~# docker run --rm -it --net=host docker-registry.svc.toolforge.org/tofu-provisioning:20250512 bash
root@runner-1033:/# TESTURL='https://github.com/terraform-provider-openstack/terraform-provider-openstack/releases/download/v3.3.2/terraform-provider-openstack_3.3.2_SHA256SUMS'
root@runner-1033:/# for i in {1..50}; do curl $TESTURL -s -o /dev/null -L --connect-timeout 1 && echo -n '.' || echo -n 'F'; done; echo ''
..................................................
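
The probe loops used throughout this task can be wrapped in a small helper that also prints a failure count, which makes runs easier to compare. This is only a sketch: the function name `probe` and its output format are hypothetical and not part of the original investigation. It takes the command to run as arguments, so it works with the same curl invocation shown above.

```shell
# probe ATTEMPTS CMD [ARGS...]
# Runs CMD ATTEMPTS times, printing '.' for each success and 'F' for each
# failure, followed by a summary line with the failure count.
# The body runs in a subshell so the helper variables do not leak.
probe() (
  attempts=$1
  shift
  failures=0
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      printf '.'
    else
      printf 'F'
      failures=$((failures + 1))
    fi
    i=$((i + 1))
  done
  printf '\n%d/%d failed\n' "$failures" "$attempts"
)
```

Example usage inside a container: `probe 50 curl -s -o /dev/null -L --connect-timeout 1 "$TESTURL"`.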

After more testing, it seems to be a good ol' MTU issue:

root@runner-1033:~# docker network create --opt com.docker.network.driver.mtu=1450 test1
root@runner-1033:~# docker run --rm -it --net=test1 docker-registry.svc.toolforge.org/tofu-provisioning:20250512 bash
root@3916009a2d00:/# cat /sys/class/net/eth0/mtu
1450
root@2d57f7308f73:/# TESTURL='https://github.com/terraform-provider-openstack/terraform-provider-openstack/releases/download/v3.3.2/terraform-provider-openstack_3.3.2_SHA256SUMS'
root@2d57f7308f73:/# for i in {1..50}; do curl $TESTURL -s -o /dev/null -L --connect-timeout 1 && echo -n '.' || echo -n 'F'; done; echo ''
..................................................

root@runner-1033:~# docker network create test2
root@runner-1033:~# docker run --rm -it --net=test2 docker-registry.svc.toolforge.org/tofu-provisioning:20250512 bash
root@fad6ff64ff80:/# cat /sys/class/net/eth0/mtu
1500
root@fad6ff64ff80:/# TESTURL='https://github.com/terraform-provider-openstack/terraform-provider-openstack/releases/download/v3.3.2/terraform-provider-openstack_3.3.2_SHA256SUMS'
root@fad6ff64ff80:/# for i in {1..50}; do curl $TESTURL -s -o /dev/null -L --connect-timeout 1 && echo -n '.' || echo -n 'F'; done; echo ''
...FF.F.F..F.FF.F...F.F....F.F...F.F...F.FFF.....F

Change #1196493 had a related patch set uploaded (by FNegri; author: FNegri):

[operations/puppet@production] hiera: gitlab::runner::docker set MTU to 1450

https://gerrit.wikimedia.org/r/1196493

Awesome find!
What's the MTU of the VM itself? Is there a discrepancy?

Yep, for some reason ens3 has an MTU of 1450:

root@runner-1033:~# ip a | grep mtu
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
2: ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc fq_codel state UP group default qlen 1000
3: docker0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default 
4: br-53f860614683: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default 
6: veth6392f54@if5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker0 state UP group default 
62770: veth0bdddb7@if62769: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br-53f860614683 state UP group default
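
A quick way to spot this kind of discrepancy on other hosts is to list every interface whose MTU is larger than the uplink's. The sketch below is an illustration only: the function name `mtu_mismatches` is hypothetical, and it assumes input in the one-line-per-interface form shown above (as produced by `ip a | grep mtu` or `ip -o link`).

```shell
# mtu_mismatches UPLINK_MTU
# Reads interface lines on stdin and prints "name mtu" for every interface
# (loopback excluded) whose MTU is larger than UPLINK_MTU.
mtu_mismatches() {
  awk -v max="$1" '
    {
      name = $2
      sub(/:$/, "", name)    # drop the trailing colon
      sub(/@.*/, "", name)   # drop the "@ifN" peer suffix of veth devices
      if (name == "lo") { next }
      for (i = 3; i < NF; i++) {
        if ($i == "mtu" && $(i + 1) + 0 > max + 0) { print name, $(i + 1) }
      }
    }'
}
```

Example usage on a runner, taking ens3 as the uplink: `ip -o link | mtu_mismatches "$(cat /sys/class/net/ens3/mtu)"`.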

I suspect it might be related to the VXLAN overlay network in the current Neutron setup.
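
For context, 1450 is the value you get when VXLAN encapsulation over an IPv4 underlay is subtracted from a standard 1500-byte MTU. The sketch below only works through that arithmetic. Whether this is really why ens3 has 1450 on Cloud VPS is not confirmed in this task, and the function name `vxlan_guest_mtu` is hypothetical.

```shell
# vxlan_guest_mtu UNDERLAY_MTU
# Prints the largest guest MTU that fits inside UNDERLAY_MTU once the
# VXLAN-over-IPv4 encapsulation headers are added.
vxlan_guest_mtu() {
  underlay_mtu=$1
  outer_ipv4=20      # outer IPv4 header
  outer_udp=8        # outer UDP header
  vxlan_header=8     # VXLAN header
  inner_ethernet=14  # encapsulated Ethernet header of the guest frame
  echo $((underlay_mtu - outer_ipv4 - outer_udp - vxlan_header - inner_ethernet))
}
```

With an underlay MTU of 1500, `vxlan_guest_mtu 1500` prints 1450.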

Not all Cloud VPS VMs have the MTU set to 1450. From a quick search, it looks like it's set to that value for all VMs with both an IPv4 and an IPv6 address. In VMs that only have IPv4, MTU is set to 1500.

Change #1196493 merged by FNegri:

[operations/puppet@production] hiera: gitlab::runner::docker set MTU to 1450

https://gerrit.wikimedia.org/r/1196493

Change #1196929 had a related patch set uploaded (by FNegri; author: FNegri):

[operations/puppet@production] docker::network allow custom MTU value

https://gerrit.wikimedia.org/r/1196929

Change #1196929 merged by FNegri:

[operations/puppet@production] docker::network allow custom MTU value

https://gerrit.wikimedia.org/r/1196929

The last patch I merged today is working as expected, but the gitlab-runner Docker network needs to be recreated to get the new MTU setting.

I just did it on runner-1033.gitlab-runners.eqiad1.wikimedia.cloud as a test, and I will now do the same on all runner-XXXX hosts:

# run-puppet-agent
# systemctl restart docker.service # to read the new config file
# systemctl stop buildkitd.service # this is using the gitlab-runner network
# docker stop {container_id} # if there are other containers running in the host using the gitlab-runner network
# docker network rm gitlab-runner
# run-puppet-agent # this will recreate the network and restart buildkitd
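
After the network has been recreated, the MTU option can be checked with `docker network inspect`. This is a minimal sketch, not part of the procedure above: the function names are hypothetical, and it assumes the Puppet change sets the same `com.docker.network.driver.mtu` option that was used for the test1 network earlier.

```shell
# network_mtu NAME
# Prints the MTU option configured on a Docker network. Empty output means
# the option is unset and Docker's default applies.
network_mtu() {
  docker network inspect \
    --format '{{ index .Options "com.docker.network.driver.mtu" }}' "$1"
}

# check_network_mtu NAME EXPECTED
# Succeeds only if the network reports EXPECTED as its MTU option.
check_network_mtu() {
  _actual=$(network_mtu "$1") || return 1
  if [ "$_actual" = "$2" ]; then
    echo "OK: $1 mtu=$_actual"
  else
    echo "MISMATCH: $1 mtu=${_actual:-unset}, expected $2"
    return 1
  fi
}
```

Example usage on a runner: `check_network_mtu gitlab-runner 1450`. As in the tests above, reading /sys/class/net/eth0/mtu inside a container attached to the network confirms the effective value.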

Mentioned in SAL (#wikimedia-cloud) [2025-10-20T10:46:59Z] <dhinus> cumin "O{project:gitlab-runners name:runner-}" "systemctl restart docker.service" T405742

Mentioned in SAL (#wikimedia-cloud) [2025-10-20T10:48:52Z] <dhinus> cumin "O{project:gitlab-runners name:runner-}" "docker network rm gitlab-runner" T405742

fnegri moved this task from In progress to Done on the cloud-services-team (FY2025/2026-Q1-Q2) board.

All runner-XXXX hosts in the gitlab-runners project have the new settings, and a new gitlab-runner Docker network.

I'm no longer seeing failures in the GitLab pipelines for tofu-provisioning, so I will finally mark this task as Resolved. 🎉

Change #1214507 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] devtools hiera: set gitlab::runner::docker set MTU to 1450

https://gerrit.wikimedia.org/r/1214507

Change #1214507 abandoned by Jelto:

[operations/puppet@production] devtools hiera: set gitlab::runner::docker set MTU to 1450

Reason:

not needed anymore

https://gerrit.wikimedia.org/r/1214507

Change #1214513 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gitlab-runners hiera: remove custom MTU

https://gerrit.wikimedia.org/r/1214513

Change #1214513 merged by Jelto:

[operations/puppet@production] gitlab-runners hiera: remove custom MTU

https://gerrit.wikimedia.org/r/1214513