Error
Pulling 'docker-registry.discovery.wmnet/restricted/mediawiki-multiversion:2024-09-04-074735-publish'... Error response from daemon: unauthorized: authentication required
Impact
The affected node, wikikube-worker2080.codfw.wmnet, could not start any pod that uses restricted images.
Notes
I'm creating this ticket to document why it was failing, as it is not obvious.
Cause
During the renumber/rename/reimage campaign in T372878 (Re-IP wikikube servers in codfw row A/B moving to per-rack subnets), mw2260, a host with an out-of-date iDRAC, was in the process of being renamed to wikikube-worker2079. The puppet changes were merged, but the sre.hosts.rename cookbook failed and the DNS change was never made.
This caused puppet to fail on the registry servers with the following error:
Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Function Call, DNS lookup failed for wikikube-worker2079.codfw.wmnet Resolv::DNS::Resource::IN::A (file: /etc/puppet/modules/docker_registry_ha/manifests/web.pp, line: 91, column: 77) on node registry2004.codfw.wmnet
wikikube-worker2080.codfw.wmnet failed to authenticate (as would every node reimaged after it) because of how the docker-registry authenticates clients. To authenticate, a host must first be allowed access. This is done in part by building an allow list of hosts, which puppet generates by resolving each host name in DNS. The puppet failure above meant the allow list stopped being updated on the registry servers, so newly reimaged nodes were never added to it and failed authentication.
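The failure mode can be illustrated with a minimal Python sketch. This is not the actual puppet code in docker_registry_ha; the function names, the `LookupFailed` exception, the wikikube-worker2078 host, and the 192.0.2.x addresses are all hypothetical and only serve the illustration. It assumes the host list is rebuilt from DNS in one pass, so a single unresolvable name aborts the whole refresh and leaves the previous list in place.

```python
from typing import Callable, Dict, Iterable, List


class LookupFailed(Exception):
    """Hypothetical error standing in for a failed DNS A-record lookup."""


def make_resolver(records: Dict[str, str]) -> Callable[[str], str]:
    """Return a fake resolver backed by a dict, so no network is needed."""
    def resolve(name: str) -> str:
        if name not in records:
            raise LookupFailed(f"DNS lookup failed for {name}")
        return records[name]
    return resolve


def build_host_list(hostnames: Iterable[str],
                    resolve: Callable[[str], str]) -> List[str]:
    """Resolve every host; one failed lookup aborts the whole build."""
    return [resolve(name) for name in hostnames]


def refresh(current: List[str], hostnames: Iterable[str],
            resolve: Callable[[str], str]) -> List[str]:
    """Mimic a failed catalog run: on error, the old list stays in place."""
    try:
        return build_host_list(hostnames, resolve)
    except LookupFailed:
        return current


# Illustrative data only: addresses are from the documentation range.
hosts = [
    "wikikube-worker2078.codfw.wmnet",  # hypothetical, already known
    "wikikube-worker2079.codfw.wmnet",  # renamed in puppet, no DNS record
    "wikikube-worker2080.codfw.wmnet",  # newly reimaged node
]
dns = {
    "wikikube-worker2078.codfw.wmnet": "192.0.2.78",
    "wikikube-worker2080.codfw.wmnet": "192.0.2.80",
}
stale = ["192.0.2.78"]

# The missing record for 2079 blocks the update, so 2080 is never added.
blocked = refresh(stale, hosts, make_resolver(dns))

# Once the DNS record exists, the refresh succeeds and includes 2080.
dns["wikikube-worker2079.codfw.wmnet"] = "192.0.2.79"
fixed = refresh(stale, hosts, make_resolver(dns))
```

In this sketch `blocked` is still the stale list, which mirrors why a node unrelated to the failed rename was the one that could not pull images.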