
Mirrored Docker Hub images are not working in GitLab pipelines
Closed, Resolved · Public · BUG REPORT

Description

Steps to replicate the issue:

  • Run a GitLab pipeline that works with Docker Hub images

What happens?:
The pipeline fails with an error like:

ERROR: Job failed: prepare environment: waiting for pod running: pulling image "docker-hub-mirror.cloud.releng.team/library/python:3.11": image pull failed: Back-off pulling image "docker-hub-mirror.cloud.releng.team/library/python:3.11". Check https://docs.gitlab.com/runner/shells/index.html#shell-profile-loading for more information

Details

Other Assignee
jnuche

Event Timeline

https://gitlab.wikimedia.org/repos/mediawiki/services/ipoid/-/jobs/342362 is an example pipeline failure.

Trying to pull the image locally reveals it's getting a 500:

$ podman pull docker-hub-mirror.cloud.releng.team/library/python:3.11
Trying to pull docker-hub-mirror.cloud.releng.team/library/python:3.11...
Error: initializing source docker://docker-hub-mirror.cloud.releng.team/library/python:3.11: reading manifest 3.11 in docker-hub-mirror.cloud.releng.team/library/python: received unexpected HTTP status: 500 Internal Server Error
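
Since the pull fails at the manifest step, the registry's v2 manifest endpoint can also be hit directly (it is the same URI that shows up in the registry logs below). A minimal check, assuming the mirror serves the standard Docker Registry HTTP API over HTTPS:

$ curl -sS -o /dev/null -w '%{http_code}\n' \
    -H 'Accept: application/vnd.docker.distribution.manifest.v2+json' \
    https://docker-hub-mirror.cloud.releng.team/v2/library/python/manifests/3.11
500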

Can confirm, here's what I see in the logs:

time="2024-08-15T14:40:45.958387552Z" level=error msg="response completed with error" err.code=unknown err.detail="filesystem: mkdir /var/lib/registry/docker/registry
/v2/repositories/library/python/_manifests/tags/3.11/index/sha256/a23661e4d5dacf56028a800d3af100397a99b120d0f0de5892db61437fd9eb6c: no space left on device" err.messa
ge="unknown error" go.version=go1.20.8 http.request.host="127.0.0.1:5000" http.request.id=ae07e2e6-ebd4-403b-af60-9015f485a41f http.request.method=GET http.request.remoteaddr=X.X.X.X http.request.uri="/v2/library/python/manifests/3.11" http.request.useragent="containers/5.32.0 (github.com/containers/image)" http.response.contenttype="application/json; charset=utf-8" http.response.duration=162.041392ms http.response.status=500 http.response.written=303 vars.name="library/python" vars.referen
ce=3.11
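
A quick way to spot this class of failure is to grep the registry pod's logs for the underlying error (same pod and namespace as the commands below):

$ kubectl --namespace gitlab-runner logs docker-hub-mirror-0 | grep 'no space left on device'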

And, sure enough, this volume is full:

$ kubectl --namespace gitlab-runner exec docker-hub-mirror-0 -- df -h | grep registry
                          9.7G      9.7G         0 100% /var/lib/registry
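
For completeness, the persistent volume claim backing /var/lib/registry can be inspected as well; the exact PVC name isn't listed here, so list the claims and read it off the pod spec:

$ kubectl --namespace gitlab-runner get pvc
$ kubectl --namespace gitlab-runner describe pod docker-hub-mirror-0 | grep -i -A1 'claimname'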
thcipriani assigned this task to dduvall.
thcipriani updated Other Assignee, added: jnuche.
thcipriani added subscribers: jnuche, dduvall.

Should be fixed now and I see successful reads and writes in the logs.
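
One way to double-check, reusing the commands from earlier in this task:

$ podman pull docker-hub-mirror.cloud.releng.team/library/python:3.11
$ kubectl --namespace gitlab-runner exec docker-hub-mirror-0 -- df -h /var/lib/registry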


More details for the curious (and future me).

A full persistent volume attached to the docker-hub-mirror caused the registry to fail image pulls with 500 errors.

Verified this with:

kubectl --namespace gitlab-runner logs docker-hub-mirror-0
kubectl --namespace gitlab-runner exec docker-hub-mirror-0 -- df -h | grep registry

After re-running Terraform with --replace module.docker-hub-mirror.helm_release.docker-hub-mirror, the docker-hub-mirror got back to normal.

tl;dr:

terraform force-unlock -f <lock-id> # ← not required in the normal case
terraform plan --replace module.docker-hub-mirror.helm_release.docker-hub-mirror
terraform apply --replace module.docker-hub-mirror.helm_release.docker-hub-mirror

Of course, figuring out those commands was not a linear process. Here is what @jnuche and @dduvall went through.

Terraform is run from a GitLab pipeline. On the initial pipeline run, Terraform did not detect that the docker-hub-mirror deployment needed any changes, so the plan stage was a no-op and the pipeline was cancelled.

@dduvall added a way to pass a --replace flag to Terraform in the job to force a redeployment.
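
Roughly, the job wiring looks like the sketch below; the TF_REPLACE_TARGET variable name and the surrounding script are hypothetical, not the actual implementation:

# Hypothetical sketch: let a manual pipeline run force replacement of a
# single resource by setting TF_REPLACE_TARGET (assumed variable name).
if [ -n "${TF_REPLACE_TARGET:-}" ]; then
  terraform plan --replace="$TF_REPLACE_TARGET" -out=tfplan
else
  terraform plan -out=tfplan
fi
terraform apply tfplan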

But cancelling the initial run left Terraform locked:

│ Error: Error releasing the state lock
│ 
│ Error message: Unexpected HTTP response code 401
│ 
│ Terraform acquires a lock when accessing your state to prevent others
│ running Terraform to potentially modify the state at the same time. An
│ error occurred while releasing this lock. This could mean that the lock
│ did or did not release properly. If the lock didn't release properly,
│ Terraform may not be able to run future commands since it'll appear as if
│ the lock is held.

So Dan added a way to run terraform force-unlock <lock-id> for a list of lock-ids.
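
Roughly, that looks like the sketch below; TF_LOCK_IDS is an assumed, space-separated CI variable, not the actual variable name:

# Hypothetical sketch: clear any stale state locks before planning.
for lock_id in ${TF_LOCK_IDS:-}; do
  terraform force-unlock -force "$lock_id"
done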

After that, a rerun got everything back to normal.