
DNS/networking not working on Trusted Runners
Closed, Resolved · Public

Description

With the new network setup for buildkitd on Trusted Runners (791655), CI jobs now have networking issues. One example job on a Trusted Runner:

https://gitlab.wikimedia.org/repos/releng/gitlab-runner-test/-/jobs/21405

Failed with:

fatal: unable to access 'https://gitlab.wikimedia.org/repos/releng/gitlab-runner-test.git/': Could not resolve host: gitlab.wikimedia.org

I guess the new bridge network needs some more configuration to allow outgoing traffic (http, dns, ...).

Event Timeline

Oy. I'll have a look. @Dzahn do you know if this has anything to do with the new ferm rules?

I don't have access, so someone else will have to take a closer look.

> I don't have access, so someone else will have to take a closer look.

See T308350: Access to trusted gitlab runners for gitlab-roots (or appropriate similar group).

Based on that, it seems like members of contint-roots should be in gitlab-roots, and that works on the production GitLab hosts, but I'm guessing gitlab-roots doesn't itself grant access to the runners. (Unless this is somehow a side effect of this issue?)

It's not a side effect of the issue; it's that we applied gitlab-roots to the gitlab servers but so far not to the gitlab_runner role.

https://gerrit.wikimedia.org/r/c/operations/puppet/+/809018

What's interesting here is that the error is "Could not resolve host: gitlab.wikimedia.org" and that this came very close in time to T311290 "DNS CI is broken".

Yea, I could still reproduce the error though. Just clicked on rebuild and it still fails like before.

"When you deploy a container on your network, if it cannot find a DNS server defined in /etc/resolv.conf, by default it will take on the DNS configured for the host machine. "

This is not the case here. Our host machine uses:

nameserver 10.3.0.1

while the nameserver inside the buildkitd container is set to 127.0.0.11

buildkit@64bf26ae4edb:/$ cat /etc/resolv.conf 
search eqiad.wmnet
nameserver 127.0.0.11

In this context see:

https://github.com/moby/moby/issues/28188#issuecomment-266150831

https://www.techrepublic.com/article/how-to-define-dns-in-docker-containers/

possible solution A: "change the resolver IP in docker-machine VM's resolv.conf from 10.0.2.3 to 8.8.8.8 or any other external IP."

This does not go with the "docker network create" command; it would have to be added as --dns to wherever the "docker run" command starts the buildkit container.
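For illustration, a hedged sketch of what that would look like (container name, image, and --dns value here are assumptions, not the actual runner configuration):

docker run -d --name buildkitd \
    --network gitlab-runner \
    --dns 10.3.0.1 \
    moby/buildkit:latest

With a user-defined network, --dns sets the upstream server that Docker's embedded resolver forwards to, instead of the host's /etc/resolv.conf.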

possible solution B: "specify the subnet explicitly for the overlay network; docker network create -d overlay --subnet 192.168.10.0/24 ov1"

In puppet/modules/docker/manifests/network.pp we have:

 /usr/bin/docker network create \
                --driver='${driver}' \
                --subnet='${subnet}' \
..

and then we have:

hieradata/role/common/gitlab_runner.yaml:profile::gitlab::runner::docker_subnet: '172.20.0.0/16'

but that's not the overlay network.
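With the hiera value above substituted into that template, the rendered command would be roughly the following (network name is illustrative; the driver is bridge, per the description above):

/usr/bin/docker network create \
    --driver='bridge' \
    --subnet='172.20.0.0/16' \
    gitlab-runner

i.e. a bridge network rather than the overlay network from solution B.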

This was all from gitlab-runner1002 in eqiad.

On a codfw runner (gitlab-runner2004, the one used for the original test here), the DNS server is set to:

10.3.0.1


I then did the following test:

  • ssh to gitlab-runner2004.codfw.wmnet
  • docker ps to find the buildkitd container, then docker exec -u root -it adf4 /bin/bash to get into the container as root
  • sed -i 's/10.3.0.1/208.80.154.238/g' /etc/resolv.conf to try the IP of ns0.wikimedia.org instead
  • restart the build job

This changed the error from the simple "Could not resolve host: gitlab.wikimedia.org" to:

WARNING: Failed to pull image with policy "always": Error response from daemon: Get "https://docker-registry.wikimedia.org/v2/": dial tcp: lookup docker-registry.wikimedia.org on 208.80.154.238:53: no such host (manager.go:203:0s)
ERROR: Job failed: failed to pull image "docker-registry.wikimedia.org/buster:20220109" with specified policies [always]: Error response from daemon: Get "https://docker-registry.wikimedia.org/v2/": dial tcp: lookup docker-registry.wikimedia.org on 208.80.154.238:53: no such host (manager.go:203:0s)

Then tried to use 1.1.1.1 (Cloudflare):

ERROR: Job failed: failed to pull image "docker-registry.wikimedia.org/buster:20220109" with specified policies [always]: Error response from daemon: Get "https://docker-registry.wikimedia.org/v2/": dial tcp: lookup docker-registry.wikimedia.org on 1.1.1.1:53: read udp 10.192.48.71:46370->1.1.1.1:53: i/o timeout (manager.go:203:24s)

Change 809085 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] Allow DNS requests from GitLab runner containers

https://gerrit.wikimedia.org/r/809085
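(Not the content of the patch itself, just a hedged way to verify on a runner host, once something like it lands, that DNS from the runners' docker subnet is allowed and resolving; the subnet is the docker_subnet from hiera and the container ID is a placeholder:)

# show any firewall rules that mention the runners' docker subnet
iptables-save | grep '172.20.'

# try a lookup from inside a running container on that network
docker ps                        # pick a container ID
docker exec -it <container-id> getent hosts gitlab.wikimedia.org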

"When you deploy a container on your network, if it cannot find a DNS server defined in /etc/resolv.conf, by default it will take on the DNS configured for the host machine. "

This is not the case here. Our host machine uses:

nameserver 10.3.0.1

while the nameserver inside the buildkitd container is set to 127.0.0.11

buildkit@64bf26ae4edb:/$ cat /etc/resolv.conf 
search eqiad.wmnet
nameserver 127.0.0.11

That's expected when using a custom docker network.

See https://docs.docker.com/config/containers/container-networking/#dns-services

> whereas containers that use a custom network use Docker’s embedded DNS server, which forwards external DNS lookups to the DNS servers configured on the host
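A minimal sketch of that behavior (the network name is an example; the image and host resolver are the ones mentioned in this task):

# on a custom network, containers get Docker's embedded DNS proxy
docker network create demo-net
docker run --rm --network demo-net docker-registry.wikimedia.org/buster:20220109 cat /etc/resolv.conf
# -> nameserver 127.0.0.11   (embedded DNS, forwarding lookups to the host's resolvers)

# on the default bridge network, the host's resolv.conf is copied into the container
docker run --rm docker-registry.wikimedia.org/buster:20220109 cat /etc/resolv.conf
# -> nameserver 10.3.0.1     (roughly, on these hosts)

Those forwarded lookups are exactly the traffic the ferm change above needs to allow.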

Change 809085 merged by Dzahn:

[operations/puppet@production] Allow DNS requests from GitLab runner containers

https://gerrit.wikimedia.org/r/809085

A big thank you to @taavi for providing https://gerrit.wikimedia.org/r/c/operations/puppet/+/809085.

I compiled and deployed that, watched it write the ferm change to allow DNS requests from containers, and fixed /etc/resolv.conf on the runner2004 host machine. Then I restarted docker to recreate the buildkit container, clicked the rebuild button in the web UI, and now:

Running with gitlab-runner 14.10.1 (f761588f)
  on gitlab-runner2004.codfw.wmnet JH3rkqws
Preparing the "docker" executor 00:02
Using Docker executor with image docker-registry.wikimedia.org/buster:20220109 ...
Pulling docker image docker-registry.wikimedia.org/buster:20220109 ....
...
Reinitialized existing Git repository in /builds/repos/releng/gitlab-runner-test/.git/
Checking out 68a12710 as main...
Skipping Git submodules setup
Executing "step_script" stage of the job script 00:01
Using docker image ...with digest docker-registry.wikimedia.org/buster...
$ echo "Compiling the code..."
Compiling the code...
$ echo "Compile complete."
Compile complete.
Cleaning up project directory and file based variables 00:01
Job succeeded

Testing DNS resolution inside the container:

root@gitlab-runner2004:/home/dzahn# docker exec -it -u root 22c009621ded bash
...
root@22c009621ded:/# nslookup gitlab.wikimedia.org
bash: nslookup: command not found
..
root@22c009621ded:/# apt-get install bind9-dnsutils
E: Unable to locate package bind9-dnsutils
...
root@22c009621ded:/# apt-get update
root@22c009621ded:/# apt-get install bind9-dnsutils
..
root@22c009621ded:/# host gitlab.wikimedia.org
gitlab.wikimedia.org has address 208.80.154.145
gitlab.wikimedia.org has IPv6 address 2620:0:861:2:208:80:154:145

...
root@gitlab-runner2004:/home/dzahn# systemctl restart docker  (to clean up)
Dzahn claimed this task.

I ran into this issue again today while attempting to build an image via buildkitd.

See https://gitlab.wikimedia.org/repos/releng/gitlab-runner-test/-/jobs/21691

[...]
Running on runner-jh3rkqws-project-182-concurrent-0 via gitlab-runner2004...
Getting source from Git repository
Fetching changes with git depth set to 50...
Reinitialized existing Git repository in /builds/repos/releng/gitlab-runner-test/.git/
fatal: unable to access 'https://gitlab.wikimedia.org/repos/releng/gitlab-runner-test.git/': Could not resolve host: gitlab.wikimedia.org
Cleaning up project directory and file based variables
ERROR: Job failed: exit code 1

Change 809650 had a related patch set uploaded (by Dduvall; author: Dduvall):

[operations/puppet@production] gitlab_runner: Allow internal docker DNS traffic

https://gerrit.wikimedia.org/r/809650

Change 809650 merged by Dzahn:

[operations/puppet@production] gitlab_runner: Allow internal docker DNS traffic

https://gerrit.wikimedia.org/r/809650

Deployed the latest change, restarted docker on gitlab-runner2004, restarted the job, and got a "Job succeeded".

(It also succeeded for me yesterday and then broke again, but this still seems to have done it: rebuilding only worked after deploying the change and restarting docker.)

Mentioned in SAL (#wikimedia-operations) [2022-06-29T20:21:10Z] <mutante> restarting docker on all 6 gitlab-runners via cumin T311241

Currently the issue here is not DNS anymore, but rather: 'This job is stuck because you don't have any active runners online or available with any of these tags assigned to them: protected', even though we see the de-registered and re-registered runner tagged as protected.

> Currently the issue here is not DNS anymore, but rather: 'This job is stuck because you don't have any active runners online or available with any of these tags assigned to them: protected', even though we see the de-registered and re-registered runner tagged as protected.

When GitLab Runners are de-registered and re-registered, all additionally assigned projects are lost, so the Runner only has access to the group/project that the registration token belongs to. For the Trusted Runners this means they have access to repos/releng/gitlab-trusted-runner/ after re-registering; gitlab-runner-test is not assigned to the re-registered Trusted Runner automatically.

So the script in repos/releng/gitlab-trusted-runner/ has to be executed again after re-registering a Runner. This is not ideal, especially because there is no CI for that currently.

I will try to create a CI pipeline for that. Maybe it's also possible to create a scheduled CI job which adds the Runners automatically at a fixed interval.

This is somehow related to the work in T311746, which may refactor the config and registration workflow for Runners.

I ran the script in repos/releng/gitlab-trusted-runner/ manually:

$ add-project.py apply

gitlab-runner1002.eqiad.wmnet:
+project with id 182 repos / releng / Gitlab Runner Test
+project with id 75 Jelto / test-project

gitlab-runner1003.eqiad.wmnet:
+project with id 182 repos / releng / Gitlab Runner Test
+project with id 75 Jelto / test-project

gitlab-runner1004.eqiad.wmnet:
+project with id 182 repos / releng / Gitlab Runner Test
+project with id 75 Jelto / test-project

gitlab-runner2002.codfw.wmnet:
+project with id 182 repos / releng / Gitlab Runner Test
+project with id 75 Jelto / test-project

gitlab-runner2003.codfw.wmnet:
+project with id 182 repos / releng / Gitlab Runner Test
+project with id 75 Jelto / test-project

gitlab-runner2004.codfw.wmnet:
+project with id 182 repos / releng / Gitlab Runner Test
+project with id 75 Jelto / test-project

Trusted Runners should be available again to repos / releng / Gitlab Runner Test.

Thanks for explaining that, @Jelto ! I was pulling my hair out the other day trying to troubleshoot.

Would it be possible to move that script and project list into puppet and have it apply changes to other attributes as well? The problem I'm running into with T311746: Changes to modules/gitlab_runner/templates/config-template.toml.erb have no effect on existing runners is that some attributes are only changeable via (de/re)-registration (tag-list, run-untagged, locked, access-level), and based on what you've described, re-registration seems to break things (even if only temporarily).

However, if we can refactor the puppet code to apply changes through a combination of direct runner configuration (removing the use of config templates) and applying that script, we should be able to cover changes to nearly everything without (de/re)-registering. (The only outlier at that point would be the default docker image which doesn't seem re-configurable.)
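For reference, the registration-time attributes mentioned above map to flags of gitlab-runner register; a hedged sketch, with URL, token, image, and network values as placeholders rather than the actual puppet-managed invocation:

gitlab-runner register --non-interactive \
    --url https://gitlab.wikimedia.org/ \
    --registration-token "$REGISTRATION_TOKEN" \
    --executor docker \
    --docker-image docker-registry.wikimedia.org/buster:20220109 \
    --docker-network-mode gitlab-runner \
    --tag-list protected \
    --run-untagged=false \
    --locked=true \
    --access-level ref_protected

Most other settings live in config.toml and can be changed without re-registering, which is what makes the refactoring above attractive.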

> (The only outlier at that point would be the default docker image which doesn't seem re-configurable.)

Nevermind. That is in the runner config.

Change 812264 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gitlab_runner: Allow DNS requests from GitLab runner containers in WMCS

https://gerrit.wikimedia.org/r/812264

I'm seeing DNS issues for jobs on Trusted Runners again (example: https://gitlab.wikimedia.org/repos/releng/gitlab-trusted-runner/-/jobs), with a similar error message:

fatal: unable to access 'https://gitlab.wikimedia.org/repos/releng/gitlab-trusted-runner.git/': Could not resolve host: gitlab.wikimedia.org

Mock-up Trusted Runners on the test instance see the same error too. I created a patch similar to the one for the production Trusted Runners. This is also blocking the Security Readiness Review, T304514.

Would it be possible to use the docker default network again? I'm not sure about the actual benefits here in using a dedicated network.

> Deployed the latest change, restarted docker on gitlab-runner2004, restarted the job, and got a "Job succeeded".
>
> (It also succeeded for me yesterday and then broke again, but this still seems to have done it: rebuilding only worked after deploying the change and restarting docker.)

I've done some more tests on one of the Trusted Runners, gitlab-runner1004. CI jobs on Trusted Runners seem to work after restarting docker manually; they stop working after the next puppet run.
Puppet shows a change when docker was restarted:

Notice: /Stage[main]/Ferm/Service[ferm]/ensure: ensure changed 'stopped' to 'running' (corrective)
Info: /Stage[main]/Ferm/Service[ferm]: Unscheduling refresh on Service[ferm]

The related DOCKER* firewall chains and rules differ between the restart and the puppet run.
I'll try to upload a patch to fix the firewall configuration somewhere in gitlab_runner::firewall and profile::gitlab::runner::allowed_services.
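(A hedged way to capture the difference described above; the exact puppet invocation may differ on these hosts:)

iptables-save > /tmp/rules-after-docker-restart
puppet agent --test
iptables-save > /tmp/rules-after-puppet-run
diff /tmp/rules-after-docker-restart /tmp/rules-after-puppet-run | grep -i docker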

> [...]
> Would it be possible to use the docker default network again? I'm not sure about the actual benefits here in using a dedicated network.

@dduvall What do you think about this? What was the reason to introduce a custom docker network here?

Jelto triaged this task as High priority. Jul 19 2022, 1:33 PM

Change 812264 merged by Jelto:

[operations/puppet@production] gitlab_runner: Allow DNS requests from GitLab runner containers in WMCS

https://gerrit.wikimedia.org/r/812264

DNS issues on Trusted Runners should be fixed now. Mocked/test Trusted Runners in WMCS devtools still have DNS issues.

When buildkitd was added to the Trusted Runners, the network_mode parameter was set in the config-template.toml (see). As discussed multiple times (see also T311746), the Runners have to be de-registered and re-registered when changes to the template happen. The Trusted Runners were missing the network_mode setting and were using the wrong default network. I de-registered and re-registered all Runners and DNS issues are gone.
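(A hedged way to confirm the setting on a runner host; the config path is the gitlab-runner default and the network name is an assumption:)

sudo grep -B3 'network_mode' /etc/gitlab-runner/config.toml
# roughly:
#   [runners.docker]
#     ...
#     network_mode = "gitlab-runner"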

De-registering and re-registering the Trusted Runners in WMCS devtools did not fix the DNS issues. Even with network_mode set and DNS allowed in ferm, the same error appears: Could not resolve host: gitlab.devtools.wmcloud.org. The behavior of the WMCS Trusted Runners is also a bit different: they do not work even when docker was restarted recently and puppet has not yet updated the ferm rules. I guess the WMCS Runners use a slightly different configuration and network setup. I'll try to find the issue as soon as possible, because this is blocking T304514.

> [...]
> Would it be possible to use the docker default network again? I'm not sure about the actual benefits here in using a dedicated network.
>
> @dduvall What do you think about this? What was the reason to introduce a custom docker network here?

The primary reason for the custom docker network is to have buildkitd be discoverable by name from within executor containers (i.e. jobs). Docker requires a custom network to be used in order to provide container/network DNS via its proxy at 127.0.0.11 which it configures as the default resolver for all containers on a given custom network.

If we can provide a consistent method of discovery for buildkitd another way, moving back to the default network is fine by me. I just didn't see another (better) way to do that.
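To make the discovery part concrete, a hedged sketch (network, container, and image names are illustrative, not the exact runner setup):

docker network create gitlab-runner
docker run -d --name buildkitd --privileged --network gitlab-runner moby/buildkit:latest

# jobs on the same custom network can reach buildkitd by name via the embedded DNS
docker run --rm --network gitlab-runner docker-registry.wikimedia.org/buster:20220109 \
    getent hosts buildkitd

On the default bridge network that name lookup fails, which is why simply moving back would break the buildkitd setup.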

Change 816133 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gitlab_runner: add workaround for DNS issues in WMCS, fix images

https://gerrit.wikimedia.org/r/816133

Change 816133 merged by Jelto:

[operations/puppet@production] gitlab_runner: add workaround for DNS issues in WMCS, fix images

https://gerrit.wikimedia.org/r/816133

Jelto lowered the priority of this task from High to Medium. Jul 22 2022, 12:10 PM

> The primary reason for the custom docker network is to have buildkitd be discoverable by name from within executor containers (i.e. jobs). Docker requires a custom network to be used in order to provide container/network DNS via its proxy at 127.0.0.11 which it configures as the default resolver for all containers on a given custom network.
>
> If we can provide a consistent method of discovery for buildkitd another way, moving back to the default network is fine by me. I just didn't see another (better) way to do that.

Okay thanks for the context, that makes sense! I think we should be fine with using a custom network. No rollback is needed. Trusted Runners work fine now after de-registering and re-registering.

For WMCS I was not able to find a proper fix. Containers in the gitlab-runner docker network were not able to make DNS requests. I guess there is something different about the network stack, IP ranges, and docker config, but I did not find anything here. As a workaround (see change above) I moved the containers back to the default bridge network, to unblock T304514. However, this will break communication to buildkitd on the Trusted Test Runner in WMCS, so we should find a proper fix for that, especially to restore parity with production.

DNS is actually working now. We will follow up with a separate task to implement firewall rules and talk about egress.