
GitLab CI jobs failing with "You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit"
Open, Low, Public

Description

Pulling docker image rust:latest ...
WARNING: Failed to pull image with policy "always": Error response from daemon: toomanyrequests: You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit (manager.go:237:0s)
ERROR: Job failed: failed to pull image "rust:latest" with specified policies [always]: Error response from daemon: toomanyrequests: You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit (manager.go:237:0s)

All of the jobs in the pipeline failed, except one: https://gitlab.wikimedia.org/repos/mwbot-rs/mwbot/-/pipelines/10975 - I've never seen this before, so I'm wondering if something changed or if we need better image caching or mirroring to avoid this in the future as usage increases.

Event Timeline

I also started getting this more on my runners for mwcli today.
I had implemented a pull-through Docker mirror just for mwcli usage before, but now even that is running into problems.

Generally speaking, I think this is what we want to aim for: https://docs.gitlab.com/ee/user/packages/dependency_proxy/
Or a shared pull-through mirror for the runners.
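
For illustration, here is roughly what using the dependency proxy could look like in a job's .gitlab-ci.yml, assuming the proxy were enabled for the group; the job name and script are made up, and CI_DEPENDENCY_PROXY_GROUP_IMAGE_PREFIX is GitLab's predefined variable for the proxy's image prefix:

# Sketch only: assumes the GitLab dependency proxy is enabled for the group.
# The job name and script are illustrative.
build:
  image: ${CI_DEPENDENCY_PROXY_GROUP_IMAGE_PREFIX}/rust:latest
  script:
    - cargo build --release

Pulls would then be cached on the GitLab side instead of each one counting against Docker Hub's per-IP limit.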

Another example https://gitlab.wikimedia.org/repos/releng/cli/-/jobs/59272

Running with gitlab-runner 15.7.2 (0e7679e6)
  on runner-1030.gitlab-runners.eqiad1.wikimedia.cloud m4MQFjvT
Preparing the "docker" executor
00:05
Using Docker executor with image golang:1.18 ...
Pulling docker image golang:1.18 ...
WARNING: Failed to pull image with policy "always": Error response from daemon: toomanyrequests: You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit (manager.go:237:0s)
ERROR: Job failed: failed to pull image "golang:1.18" with specified policies [always]: Error response from daemon: toomanyrequests: You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit (manager.go:237:0s)

These limits reset after 6 hours, so at some point this will become unblocked, until it happens again.

See also https://docs.gitlab.com/runner/executors/docker.html#set-the-if-not-present-pull-policy

My mwcli runners make use of pull_policy = ["always", "if-not-present"]
This means that when a fresh pull is not possible, for example due to a rate limit, a local copy can still be used if one exists.
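
For reference, a minimal sketch of how that looks in a runner's config.toml; everything except the pull_policy line is illustrative:

# Minimal sketch of the relevant part of a runner's config.toml.
[[runners]]
  executor = "docker"
  [runners.docker]
    image = "alpine:latest"   # hypothetical default image
    # Try a fresh pull first; if that fails (e.g. rate limited),
    # fall back to a locally cached copy of the image.
    pull_policy = ["always", "if-not-present"]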

I believe "always" is currently used by default for the shared runners.
As a result, only 100 image pulls from Docker Hub can happen in a 6-hour period for the shared runners (maybe shared across all of them? I'm not sure).

Addshore triaged this task as High priority. Feb 9 2023, 1:12 PM
Addshore added a project: mwcli.

Looks like this is still happening to me the following day.
I guess this is blocking all merges in mwcli, mwbot-rs, and anyone else that uses Docker Hub images in CI?

I tried to use pull_policy: if-not-present in my jobs to work around the issue, but...

ERROR: Preparation failed: failed to pull image 'docker-registry.wikimedia.org/golang1.18:1.18-1-20230129': pull_policy ([IfNotPresent]) defined in GitLab pipeline config is not one of the allowed_pull_policies ([])

https://gitlab.wikimedia.org/repos/releng/cli/-/merge_requests/292
https://gitlab.wikimedia.org/repos/releng/cli/-/jobs/60318
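
For context, the kind of job-level override being attempted looks roughly like this in .gitlab-ci.yml (the job name and script line are illustrative; the image and policy are the ones from the failing job above):

test:
  image:
    name: golang:1.18
    # Rejected by the shared runners because their allowed_pull_policies is empty:
    pull_policy: if-not-present
  script:
    - go test ./...

For the runner to accept that, its config.toml would also need the policy listed under [runners.docker], e.g. allowed_pull_policies = ["always", "if-not-present"].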

As I also have 2 custom runners for my project, for now I have disabled the shared and project runners and marked my 2 tagged runners as accepting untagged jobs.
So for now I am mostly unblocked by this, but things in my CI will be running slowly!

mwbot-rs CI is still busted and I'm not sure how to fix it :/

I think the options are

  • The shared runners need some tweaks to fix this.
  • You log in to Docker Hub as part of your CI so you have your own personalized limit
  • You create your own runners that are already authenticated
  • I could add you on to one of the mwcli runners to keep your project moving.

Change 888828 had a related patch set uploaded (by Legoktm; author: Legoktm):

[operations/puppet@production] gitlab_runner: Set pull_policy = ["always", "if-not-present"]

https://gerrit.wikimedia.org/r/888828

I think the options are

  • The shared runners need some tweaks to fix this.

OK, submitted a patch for the pull_policy as you suggested. This seems like a bandaid to me though, the dependency proxy you linked seems like a much better long-term solution.

  • You log in to Docker Hub as part of your CI so you have your own personalized limit

Not sure how this would work, since pulling the image is the first thing that happens?

  • You create your own runners that are already authenticated
  • I could add you on to one of the mwcli runners to keep your project moving.

Appreciate the offer, if this goes on much longer I'll take you up on it. I'm also a bit confused why other GitLab users aren't running into this? Maybe they're all just using docker-registry.wikimedia.org images, or just not complaining. 🤔

Yeah, you'll only get this issue if you use Docker Hub images.
I think there needs to be a pull-through proxy for all GitLab runners, which is then also authenticated with some WMF Docker Hub account.
That'd mostly fix this unless we start using a very diverse set of images from Docker Hub.
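
As a sketch of that idea: each runner host's Docker daemon could be pointed at a shared pull-through cache via registry-mirrors in /etc/docker/daemon.json. The mirror hostname below is made up:

{
  "registry-mirrors": ["https://docker-mirror.example.wmcloud.org"]
}

The mirror itself (e.g. a registry:2 instance run as a pull-through cache) would hold the Docker Hub credentials, so individual jobs and hosts never see them.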

Change 888828 merged by Jelto:

[operations/puppet@production] gitlab_runner: Set pull_policy = ["always", "if-not-present"] on WMCS runners

https://gerrit.wikimedia.org/r/888828

I made these comments on the wrong task. Copying here before removing there.

This is pretty much the same problem that T106452: Composer activity from Cloud VPS hosts can be rate limited by GitHub was about. Can we instead/also set up default credentials like we did in Jenkins to fix the GitHub rate limit problem for pulling Composer packages? ... which reminds me... we are going to have that problem again soon enough in GitLab if we don't somehow propagate the fix to the new CI platform.

I have a Public Repo Read-only token for use with docker login that I would like to share as soon as we figure out how to share such things in the GitLab world. Per https://docs.gitlab.com/ee/ci/secrets/ it looks like a HashiCorp Vault service is needed.

https://docs.gitlab.com/ee/ci/docker/using_docker_images.html#configure-a-job says that setting a CI/CD variable in the GitLab UI named DOCKER_AUTH_CONFIG with a value that is a .docker/config.json file's "auths" member should configure the runner to pull using that authentication data. I think we could simply set this globally and all repos would inherit it. I have a user and token pair that are only allowed to make read-only actions on public repos as mentioned in T329216#8620201 that can be used for this. The worst thing that could happen is the account being disabled by Docker.
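
For reference, the value of that DOCKER_AUTH_CONFIG variable would look something like the following, per the linked docs; the auth field is base64("username:token") for the read-only account (placeholder shown, not a real credential):

{
  "auths": {
    "https://index.docker.io/v1/": {
      "auth": "<base64 of username:token>"
    }
  }
}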

The worst thing that could happen is the account being disabled by Docker.

Sadly though according to upstream docs this would only get us an extra 100 pulls per 6 hours. If we wanted more (and we very likely would) we would need to start paying Docker.com for access. :/

FWIW this helped a bit in that some jobs now run, but a majority of mwbot-rs jobs are still failing, I guess because those runners don't have the rust image pulled yet. Seems like the caching proxy is really the only option.

For starters, perhaps we can create an account on docker.io for Release Engineering and generate a public puller token to use on WMCS runners. According to the GitLab CI runner docs, the docker executor should respect the ~/.docker/config.json for the runner's user. In this scenario, the creds would not be leaked to jobs.
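
A sketch of how that could be seeded on a runner host; the account name is hypothetical, and this assumes the credentials land in the home directory of whichever user the runner pulls as:

# As the user the gitlab-runner service runs as (often root);
# "releng-public-pull" is a hypothetical account name.
docker login --username releng-public-pull
# Paste the read-only access token at the password prompt; docker login
# stores it in that user's ~/.docker/config.json, which the Docker
# executor should then use for image pulls.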

I ran a few tests, and it seems that the Docker registry returns a docker-ratelimit-source header in the response, which seems to indicate which entity the rate limit is attached to for a given request. When making a request anonymously, the header value is my IP. When making a request using either of two tokens generated for a single account, the header value is the same UUID. This suggests that rate limiting is not double counted for IPs so long as pull requests are auth'd, and that the auth'd rate limit is associated with accounts, not access tokens.
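
One way to reproduce this kind of check, following Docker's documented procedure with the special ratelimitpreview/test image (add credentials to the token request for the authenticated case):

# Fetch an anonymous pull token for the rate-limit test image
# (add --user "username:token" to this curl for the authenticated case):
TOKEN=$(curl -s "https://auth.docker.io/token?service=registry.docker.io&scope=repository:ratelimitpreview/test:pull" | jq -r .token)
# HEAD request against the manifest; inspect the rate-limit headers.
curl -sI -H "Authorization: Bearer $TOKEN" \
  "https://registry-1.docker.io/v2/ratelimitpreview/test/manifests/latest" \
  | grep -iE 'ratelimit-limit|ratelimit-remaining|docker-ratelimit-source'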

So if we create an account (or a small pool of accounts) for WMCS runners and configure the creds on the puppetmaster (we might need to refactor some puppet to have configured creds written to ~/.docker/config.json), we can at the very least avoid incrementing usage of the IP based limit which should mitigate impact across VPS users. This could be done in addition to configuration of a registry proxy.

This hasn't been an issue for a few months now in mwbot-rs so untagging - thanks for all the work so far on making this better :)

thcipriani lowered the priority of this task from High to Low. Aug 16 2023, 4:33 PM
thcipriani added a subscriber: thcipriani.

Problem should be fixed on the Digital Ocean K8s runners. It can still happen on the WMCS runners. Maybe the solution is to ramp up use of the K8s runners, but we're not planning on that at this moment.