GitLab Runners in WMCS are offline
Closed, Resolved (Public)

Description

GitLab Runners in WMCS are showing as offline in the GitLab UI and are also unreachable via SSH. Timing indicates that this is related to T342621: eqiad1: cloudlb: transition DNS clients (VMs) to the new BGP-based recursor VIP.

Firewall rules have been updated in https://gerrit.wikimedia.org/r/c/operations/puppet/+/956463 but applying them on the hosts is failing.

Event Timeline

Change 956784 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gitlab_runner: change docker_subnet in WMCS

https://gerrit.wikimedia.org/r/956784

runner-1021.gitlab-runners.eqiad1.wikimedia.cloud
runner-1022.gitlab-runners.eqiad1.wikimedia.cloud
runner-1023.gitlab-runners.eqiad1.wikimedia.cloud
runner-1024.gitlab-runners.eqiad1.wikimedia.cloud
runner-1025.gitlab-runners.eqiad1.wikimedia.cloud
runner-1026.gitlab-runners.eqiad1.wikimedia.cloud
runner-1027.gitlab-runners.eqiad1.wikimedia.cloud
runner-1028.gitlab-runners.eqiad1.wikimedia.cloud
runner-1029.gitlab-runners.eqiad1.wikimedia.cloud
runner-1030.gitlab-runners.eqiad1.wikimedia.cloud
gitlab-runner-1002.devtools.eqiad1.wikimedia.cloud
gitlab-runner-1003.devtools.eqiad1.wikimedia.cloud

Change 956784 merged by Jelto:

[operations/puppet@production] gitlab_runner: change docker_subnet in WMCS

https://gerrit.wikimedia.org/r/956784

I hadn't realised we had a potential clash here. Unsure exactly what the answer is.

Assuming the affected machines running docker containers are VMs on 172.16.0.0/24, you can potentially add a workaround to improve the situation until Jelto's patch above is rolled out and working everywhere: add static routes for the unreachable IPs via the gateway, for example

ip route add 172.20.255.1/32 via 172.16.0.1

The more-specific mask on that route would take precedence over the range assigned to the local docker0 bridge and traffic should get to the affected (non-docker) hosts using 172.20.x.x addressing.
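
To illustrate why the /32 wins, here is a minimal bash sketch of longest-prefix matching. It assumes the docker bridge holds 172.20.0.0/16 (as the `ip route delete 172.20.0.0/16` used later in this task suggests). The function names `ip_to_int` and `longest_match` are made up for this example; the kernel does the real selection, which can be checked on a live host with `ip route get 172.20.255.1`.

```shell
#!/usr/bin/env bash
# Illustration only: mimic the routing table's longest-prefix-match rule.

# Convert a dotted-quad IPv4 address to an integer.
ip_to_int() {
  local IFS=.
  local -a o
  read -r -a o <<< "$1"
  echo "$(( (o[0] << 24) | (o[1] << 16) | (o[2] << 8) | o[3] ))"
}

# Print the most specific prefix among "$2..." that contains address "$1".
longest_match() {
  local dest best="" best_len=-1 prefix net len mask
  dest=$(ip_to_int "$1"); shift
  for prefix in "$@"; do
    net=$(ip_to_int "${prefix%/*}")
    len=${prefix#*/}
    mask=$(( (0xFFFFFFFF << (32 - len)) & 0xFFFFFFFF ))
    if (( (dest & mask) == (net & mask) && len > best_len )); then
      best=$prefix
      best_len=$len
    fi
  done
  echo "$best"
}

# Assumed docker bridge range vs. the static host route from the workaround.
longest_match 172.20.255.1 172.20.0.0/16 172.20.255.1/32
# prints 172.20.255.1/32
```

Any other 172.20.x.x address without its own static route would still match only the /16 and stay on the docker bridge, so the workaround needs one route per unreachable IP.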

aborrero claimed this task.

Solved with:

user@laptop:~$ ssh -o StrictHostKeyChecking=no root@gitlab-runner-1003.devtools.eqiad1.wikimedia.cloud "ip route delete 172.20.0.0/16 ; run-puppet-agent"

on the bunch of affected hosts.
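
Running that across hosts could be scripted roughly as below. This is a sketch, not the loop that was actually used: the host list is taken from the paste earlier in this task (whether every one of them needed the fix is an assumption), and the `DRY_RUN` switch and `for_each_host` name are invented for the example. By default it only prints the commands.

```shell
#!/usr/bin/env bash
# Hosts from the paste earlier in this task, written with brace expansion.
hosts=(
  runner-10{21..30}.gitlab-runners.eqiad1.wikimedia.cloud
  gitlab-runner-100{2,3}.devtools.eqiad1.wikimedia.cloud
)

# The fix quoted above: drop the stale docker route, then re-run puppet.
fix_cmd='ip route delete 172.20.0.0/16 ; run-puppet-agent'

# Hypothetical helper: DRY_RUN=1 (the default) prints, DRY_RUN=0 executes.
for_each_host() {
  local host
  for host in "${hosts[@]}"; do
    if [[ "${DRY_RUN:-1}" == 1 ]]; then
      printf 'ssh root@%s "%s"\n' "$host" "$fix_cmd"
    else
      ssh -o StrictHostKeyChecking=no "root@${host}" "$fix_cmd"
    fi
  done
}

for_each_host
```

Reviewing the dry-run output first, then re-running with `DRY_RUN=0`, keeps the destructive `ip route delete` from being fired at an unintended host.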

Thanks @aborrero for fixing all WMCS runners!

In addition to the workaround, it was necessary to delete and re-create the gitlab-runner docker network. The following command was used:

systemctl stop docker-resource-monitor.service ; systemctl stop buildkitd.service ; docker network rm gitlab-runner ; run-puppet-agent

The network looks good on all WMCS runners now:

docker network inspect gitlab-runner