
wikikube-worker2080.codfw.wmnet can't auth to registry
Closed, Resolved | Public | PRODUCTION ERROR

Description

Error
Pulling 'docker-registry.discovery.wmnet/restricted/mediawiki-multiversion:2024-09-04-074735-publish'...
Error response from daemon: unauthorized: authentication required
Impact

This node could not start any pod from restricted images.

Notes

I'm creating this ticket to document why it was failing, as it is not obvious.

Cause

During the renumber/rename/reimage campaign in T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets, a host with an out-of-date iDRAC, mw2260, was in the process of being renamed to wikikube-worker2079. The puppet changes were merged, but the sre.hosts.rename cookbook failed and the DNS change was never made.

This caused puppet to fail on the registry servers with the following error:

Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Function Call, DNS lookup failed for wikikube-worker2079.codfw.wmnet Resolv::DNS::Resource::IN::A (file: /etc/puppet/modules/docker_registry_ha/manifests/web.pp, line: 91, column: 77) on node registry2004.codfw.wmnet

The reason this made wikikube-worker2080.codfw.wmnet fail to auth (and would have done the same for every node reimaged afterwards) is the auth mechanism of the docker-registry. In order to authenticate, a host must first be denied access. This is done in part by building a deny list of hosts. The puppet failure above meant the deny list stopped being updated, causing newly reimaged nodes to fail authentication.
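To illustrate the failure mode, here is a minimal sketch in Python. It is not the actual Puppet code in docker_registry_ha/manifests/web.pp; it only assumes the list is built by resolving every host name, so that one unresolvable name aborts the whole build and the previous list stays in place. The function names, the fake resolver and the 192.0.2.x addresses are all hypothetical.

```python
import socket
from typing import Callable, Iterable, List


def resolve_ipv4(hostname: str) -> str:
    """Resolve a hostname to one IPv4 address (raises socket.gaierror on failure)."""
    return socket.gethostbyname(hostname)


def build_deny_list(
    hostnames: Iterable[str],
    resolver: Callable[[str], str] = resolve_ipv4,
) -> List[str]:
    """Resolve every host; a single failed lookup aborts the whole build.

    This mirrors (as an assumption) how a catalog compilation fails as a
    whole: no partial list is produced, so the previous list stays in use.
    """
    addresses = []
    for name in hostnames:
        try:
            addresses.append(resolver(name))
        except OSError as exc:  # socket.gaierror is a subclass of OSError
            raise RuntimeError(f"DNS lookup failed for {name}") from exc
    return addresses


# Hypothetical DNS state: the rename of wikikube-worker2079 never reached DNS.
fake_dns = {"wikikube-worker2080.codfw.wmnet": "192.0.2.80"}


def fake_resolver(name: str) -> str:
    """Stand-in resolver so the sketch runs without network access."""
    try:
        return fake_dns[name]
    except KeyError:
        raise socket.gaierror(f"no A record for {name}")
```

Under these assumptions, calling build_deny_list with both wikikube-worker2079.codfw.wmnet and wikikube-worker2080.codfw.wmnet raises on the first name, so the entry for wikikube-worker2080 is never added even though its own DNS record is fine. Removing the unresolvable host from the input, as the patches in the timeline below do on the Puppet side, lets the build succeed again.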

Event Timeline

Clement_Goubert changed the task status from Open to In Progress.
Clement_Goubert triaged this task as High priority.

Change #1070544 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] kubernetes: Set wikikube-worker2079 insetup

https://gerrit.wikimedia.org/r/1070544

Change #1070544 merged by Clément Goubert:

[operations/puppet@production] kubernetes: Set wikikube-worker2079 insetup

https://gerrit.wikimedia.org/r/1070544

Change #1070547 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] kubernetes: Remove wikikube-worker2079 references

https://gerrit.wikimedia.org/r/1070547

Change #1070547 merged by Clément Goubert:

[operations/puppet@production] kubernetes: Remove wikikube-worker2079 references

https://gerrit.wikimedia.org/r/1070547

Pulling restricted images now works from wikikube-worker2080, resolving.

Change #1070610 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] Remove wikikube-worker2088.codfw.wmnet from wikikube

https://gerrit.wikimedia.org/r/1070610

Change #1070610 abandoned by Clément Goubert:

[operations/puppet@production] Remove wikikube-worker2088.codfw.wmnet from wikikube

Reason:

Fixed by making the rename work in Ia820a7f8bff039a7e095a1369848695f7a76db9d

https://gerrit.wikimedia.org/r/1070610