
deployment-prep is broken following 2025-02-06 WMCS reboots
Closed, Resolved (Public)

Description

Keyholder needs to be re-armed, following https://wikitech.wikimedia.org/wiki/Keyholder, because scap jobs are failing (see https://integration.wikimedia.org/ci/job/beta-scap-sync-world/192680/console) with:

15:06:06 15:06:06 sudo -u mwdeploy -n -- /usr/bin/rsync -l deployment-deploy04.deployment-prep.eqiad1.wikimedia.cloud::common/wikiversions*.{json,php} /srv/mediawiki (ran as mwdeploy@deployment-mediawiki13.deployment-prep.eqiad1.wikimedia.cloud) returned [255]: Load key "/etc/keyholder.d/mwdeploy": Permission denied
15:06:06 mwdeploy@deployment-mediawiki13.deployment-prep.eqiad1.wikimedia.cloud: Permission denied (publickey).

and so on for each host.

If I try to do anything, I get:

There seems to be a problem with your login session; this action has been canceled as a precaution against session hijacking. Please resubmit the form.

Can someone please check that things work, and fix or start whatever needs doing?
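For anyone triaging a similar failure, the per-host "Permission denied (publickey)" lines in the scap console output can be collected into a list of affected hosts. The following is only a minimal sketch: the function name `hosts_denied` and the regular expression are illustrative assumptions, not part of scap or Keyholder, and the pattern is based solely on the two log lines quoted above.

```python
import re

# Assumed pattern, derived from the ssh error line quoted in this task:
#   "<user>@<host>: Permission denied (publickey)."
DENIED_RE = re.compile(
    r"(?P<user>[\w.-]+)@(?P<host>[\w.-]+): Permission denied \(publickey\)"
)


def hosts_denied(log_text):
    """Return the sorted, de-duplicated hosts that rejected publickey auth."""
    return sorted({m.group("host") for m in DENIED_RE.finditer(log_text)})


if __name__ == "__main__":
    # Sample line copied from the console output in the description.
    sample = (
        "15:06:06 mwdeploy@deployment-mediawiki13.deployment-prep.eqiad1."
        "wikimedia.cloud: Permission denied (publickey).\n"
    )
    for host in hosts_denied(sample):
        print(host)
```

If every target host shows up in the result, the problem is more likely on the deploy host (for example an unarmed Keyholder agent) than on any individual target.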

Event Timeline

RhinosF1 triaged this task as High priority. Feb 6 2025, 2:42 PM

15:14:33 <@andrewbogott> RhinosF1: ok, I can't arm keyholder on that host with the scap password or the deploy-service password or the mwdeploy password

15:31:32 <wmf-insecte> Yippee, build fixed!
15:31:33 <wmf-insecte> Project beta-update-databases-eqiad build #82385: FIXED in 11 min: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/82385/

Sessions are now the only thing I can see that is obviously broken.

Puppet is broken in a way I haven't seen before on deployment-sessionstore06.deployment-prep.eqiad1.wikimedia.cloud:

root@deployment-sessionstore06:~# puppet agent -tv
Info: Using environment 'production'
Error: Connection to https://deployment-puppetserver-1.deployment-prep.eqiad1.wikimedia.cloud:8140/puppet/v3 failed, trying next route: Request to https://deployment-puppetserver-1.deployment-prep.eqiad1.wikimedia.cloud:8140/puppet/v3 failed after 0.001 seconds: Failed to open TCP connection to deployment-puppetserver-1.deployment-prep.eqiad1.wikimedia.cloud:8140 (getaddrinfo: Temporary failure in name resolution)
Wrapped exception:
Failed to open TCP connection to deployment-puppetserver-1.deployment-prep.eqiad1.wikimedia.cloud:8140 (getaddrinfo: Temporary failure in name resolution)
Error: No more routes to fileserver
Info: Loading facts
Error: Facter: Error while resolving custom fact fact='ipaddress', resolution='<anonymous>': no implicit conversion of nil into String
Error: Connection to https://deployment-puppetserver-1.deployment-prep.eqiad1.wikimedia.cloud:8140/puppet/v3 failed, trying next route: Request to https://deployment-puppetserver-1.deployment-prep.eqiad1.wikimedia.cloud:8140/puppet/v3 failed after 0.001 seconds: Failed to open TCP connection to deployment-puppetserver-1.deployment-prep.eqiad1.wikimedia.cloud:8140 (getaddrinfo: Temporary failure in name resolution)
Wrapped exception:
Failed to open TCP connection to deployment-puppetserver-1.deployment-prep.eqiad1.wikimedia.cloud:8140 (getaddrinfo: Temporary failure in name resolution)
Error: Could not retrieve catalog from remote server: No more routes to puppet
Warning: Not using cache on failed catalog
Error: Could not retrieve catalog; skipping run
Error: Connection to https://deployment-puppetserver-1.deployment-prep.eqiad1.wikimedia.cloud:8140/puppet/v3 failed, trying next route: Request to https://deployment-puppetserver-1.deployment-prep.eqiad1.wikimedia.cloud:8140/puppet/v3 failed after 0.0 seconds: Failed to open TCP connection to deployment-puppetserver-1.deployment-prep.eqiad1.wikimedia.cloud:8140 (getaddrinfo: Temporary failure in name resolution)
Wrapped exception:
Failed to open TCP connection to deployment-puppetserver-1.deployment-prep.eqiad1.wikimedia.cloud:8140 (getaddrinfo: Temporary failure in name resolution)
Error: Could not send report: No more routes to report

This is what I see in objectcache.log.

deployment-mediawiki13 enwiki 1.44.0-alpha objectcache WARNING: Error fetching URL "http://sessionstore.svc.deployment-prep.eqiad1.wikimedia.cloud:8080/sessions/v1/enwiki%3AMWSession%3A<etc>": (curl error: 7) Couldn't connect to server
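curl error 7 means the TCP connection itself failed (curl reports name-resolution failures separately, as error 6), so from deployment-mediawiki13 the name resolved but nothing accepted the connection. A rough way to tell those two cases apart from any host is sketched below. The function name `check_tcp` and its return strings are made up for illustration; the host and port are taken from the log line above.

```python
import socket


def check_tcp(host, port, timeout=3.0):
    """Classify a TCP check as 'ok', 'dns_failure' or 'connect_failure'."""
    try:
        addrinfo = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        # The name could not be resolved at all.
        return "dns_failure"
    for family, socktype, proto, _canonname, sockaddr in addrinfo:
        try:
            with socket.socket(family, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(sockaddr)
                return "ok"
        except OSError:
            # Refused, unreachable or timed out: try the next address.
            continue
    return "connect_failure"


if __name__ == "__main__":
    # Host and port as they appear in the objectcache.log entry.
    print(check_tcp("sessionstore.svc.deployment-prep.eqiad1.wikimedia.cloud", 8080))
```

Run from a MediaWiki host, a result of `connect_failure` would be consistent with the backend being down or unreachable rather than with a DNS problem on the client side.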

> Puppet is broken in a way I haven't seen before on deployment-sessionstore06.deployment-prep.eqiad1.wikimedia.cloud:

The instance is missing a default route for some reason and so can't resolve any DNS names.
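A missing default route can be confirmed without any DNS or network access by reading the kernel's IPv4 routing table from `/proc/net/route`. The sketch below is one possible way to do that, under a few assumptions: the helper names `default_routes` and `hex_to_ipv4` are illustrative, the host is Linux, and it is little-endian (as x86-64 is), which affects how the hex addresses in that file are decoded.

```python
import socket
import struct

RTF_UP = 0x1  # "route is usable" flag, as defined in linux/route.h


def hex_to_ipv4(hex_addr):
    """Decode an address from /proc/net/route (assumes a little-endian host)."""
    return socket.inet_ntoa(struct.pack("<I", int(hex_addr, 16)))


def default_routes(route_table_text):
    """Return (iface, gateway) for each usable IPv4 default route."""
    routes = []
    lines = route_table_text.strip().splitlines()
    for line in lines[1:]:  # the first line is the column header
        fields = line.split()
        if len(fields) < 8:
            continue
        iface, dest, gateway, flags, mask = (
            fields[0], fields[1], fields[2], fields[3], fields[7]
        )
        # A default route has destination 0.0.0.0 and netmask 0.0.0.0.
        if dest == "00000000" and mask == "00000000" and int(flags, 16) & RTF_UP:
            routes.append((iface, hex_to_ipv4(gateway)))
    return routes


if __name__ == "__main__":
    try:
        with open("/proc/net/route") as fh:
            found = default_routes(fh.read())
    except OSError:
        print("could not read /proc/net/route (not a Linux host?)")
    else:
        print(found if found else "no IPv4 default route found")
```

An empty result on the affected instance would match the diagnosis above: without a default route, the resolver cannot reach its nameservers, which produces the "Temporary failure in name resolution" errors in the Puppet run.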

Mentioned in SAL (#wikimedia-releng) [2025-02-06T16:20:44Z] <bd808> Rebooted deployment-sessionstore06 (T385803)

bd808 claimed this task.

Things appear to be working now. The networking issues on deployment-sessionstore06 were "fixed" by the additional reboot.