
PuppetFailure - releases2003
Closed, ResolvedPublic

Description

Common information

  • alertname: PuppetFailure
  • cluster: misc
  • instance: releases2003:9100
  • job: node
  • prometheus: ops
  • severity: critical
  • site: codfw
  • source: prometheus
  • team: collaboration-services

Firing alerts


Event Timeline

LSobanski renamed this task from PuppetFailure to PuppetFailure - releases2003.Apr 14 2025, 7:32 AM
LSobanski added subscribers: Arnoldokoth, LSobanski.

@Arnoldokoth - related to the upgrade?

@LSobanski Yes and no. The direct cause seems to be how the Puppet code is currently structured, but the problem was only discovered after re-imaging the server.

Change #1135994 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] jenkins: fix puppet error, systemd override requires systemd service

https://gerrit.wikimedia.org/r/1135994

Change #1136039 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] ci: switch jenkins deployment method on contint to scap

https://gerrit.wikimedia.org/r/1136039

Change #1136039 abandoned by Dzahn:

[operations/puppet@production] ci: switch jenkins deployment method on contint to scap

https://gerrit.wikimedia.org/r/1136039

Change #1135796 had a related patch set uploaded (by Dzahn; author: AOkoth):

[operations/puppet@production] releases: invert use_scap3_deployment for jenkins

https://gerrit.wikimedia.org/r/1135796

Change #1135796 abandoned by Dzahn:

[operations/puppet@production] releases: invert use_scap3_deployment for jenkins

https://gerrit.wikimedia.org/r/1135796

Mentioned in SAL (#wikimedia-operations) [2025-04-14T22:34:54Z] <mutante> deploy1003 - scap install-world -l release2003.codfw.wmnet T391590

Per the comments on the Gerrit patch https://gerrit.wikimedia.org/r/c/operations/puppet/+/1135994, I tried to run scap to get the Debian package installed.

(On releases hosts, scap is supposed to install a Debian package and puppet will NOT manage the jenkins service. I personally see that as an anti-pattern, but the goal here is just to move on and get this unblocked.)

[deploy1003:~] $ scap install-world -l releases2003.codfw.wmnet
..
(to install scap on the newly reimaged host; this part looks like it worked)

..
[releases2003:~] $ scap pull
22:35:01 Copying from deployment.codfw.wmnet:/srv/mediawiki-staging to releases2003.codfw.wmnet:/srv/mediawiki
22:35:01 Started rsync common
sudo: unknown user mwdeploy
...

The pull command only works for MediaWiki, not for other software.

..

[deploy1003:~] $ scap deploy -v -l releases2003.codfw.wmnet
Deployment configuration not found.
For `scap deploy` to work, the current directory must be the top level of a git repo containing a scap/scap.cfg file.
..

[deploy1003:~] $ file /srv/deployment/jenkins
/srv/deployment/jenkins: cannot open `/srv/deployment/jenkins' (No such file or directory)
..

The docs at https://wikitech.wikimedia.org/wiki/Scap#scap_deploy don't say where, other than under /srv/deployment, the repo should be expected.
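The error message above names two requirements: the current directory must be the top level of a git repo, and it must contain scap/scap.cfg. A minimal Python sketch of that pre-flight check follows. The function name `scap_deploy_problems` is hypothetical and this is not scap's own logic, only a restatement of the two conditions from the message.

```python
from pathlib import Path


def scap_deploy_problems(repo_dir):
    """Return a list of reasons why `scap deploy` would likely reject repo_dir.

    Hypothetical helper: it only mirrors the two conditions named in the
    error message (top level of a git repo, scap/scap.cfg present).
    """
    repo = Path(repo_dir)
    if not repo.is_dir():
        return [f"{repo} does not exist or is not a directory"]

    problems = []
    # A normal checkout has a .git directory; worktrees and submodules
    # use a .git file instead, so only existence is checked here.
    if not (repo / ".git").exists():
        problems.append(f"{repo} is not the top level of a git repo")
    if not (repo / "scap" / "scap.cfg").is_file():
        problems.append(f"{repo} has no scap/scap.cfg")
    return problems
```

Running it against /srv/deployment/jenkins on deploy1003 would have reported the missing directory before any deploy was attempted.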

Mentioned in SAL (#wikimedia-operations) [2025-04-15T15:16:24Z] <dzahn@deploy1003> Started deploy [releng/jenkins-deploy@c274545] (releasing): T391590

Mentioned in SAL (#wikimedia-operations) [2025-04-15T15:17:08Z] <dzahn@deploy1003> Finished deploy [releng/jenkins-deploy@c274545] (releasing): T391590 (duration: 01m 14s)

[deploy1003:/srv/deployment/releng/jenkins-deploy] $ scap deploy --environment releasing -f -l releases2003.codfw.wmnet
Log message (press enter for none): T391590
15:16:24 Started deploy [releng/jenkins-deploy@c274545] (releasing)
15:16:24 Deploying Rev: HEAD = c274545bcee8c2bd79c6928dbb77820c62c35541
15:16:24 Started deploy [releng/jenkins-deploy@c274545] (releasing): T391590
15:16:24 
== DEFAULT ==
:* releases2003.codfw.wmnet
15:16:27 releng/jenkins-deploy: fetch stage(s): 100% (in-flight: 0; ok: 1; fail: 0; left: 0) |
15:16:29 releng/jenkins-deploy: config_deploy stage(s): 100% (in-flight: 0; ok: 1; fail: 0; left: 0) |
15:16:32 ['/usr/bin/scap', 'deploy-local', '-v', '--repo', 'releng/jenkins-deploy', '--force', '-g', 'default', 'promote', '--refresh-config'] (ran as deploy-jenkins@releases2003.codfw.wmnet) returned [1]: Registering scripts in directory '/srv/deployment/releng/jenkins-deploy-cache/revs/c274545bcee8c2bd79c6928dbb77820c62c35541/scap/scripts'
registered script? '/srv/deployment/releng/jenkins-deploy-cache/revs/c274545bcee8c2bd79c6928dbb77820c62c35541/scap/scripts/generate_casc_jobs.sh' True
registered script? '/srv/deployment/releng/jenkins-deploy-cache/revs/c274545bcee8c2bd79c6928dbb77820c62c35541/scap/scripts/update_jenkins.sh' True
Linking config files at: /srv/deployment/releng/jenkins-deploy-cache/revs/c274545bcee8c2bd79c6928dbb77820c62c35541/.git/config-files
Registering scripts in directory '/srv/deployment/releng/jenkins-deploy-cache/revs/c274545bcee8c2bd79c6928dbb77820c62c35541/scap/scripts'
registered script? '/srv/deployment/releng/jenkins-deploy-cache/revs/c274545bcee8c2bd79c6928dbb77820c62c35541/scap/scripts/generate_casc_jobs.sh' True
registered script? '/srv/deployment/releng/jenkins-deploy-cache/revs/c274545bcee8c2bd79c6928dbb77820c62c35541/scap/scripts/update_jenkins.sh' True
Executing check 'update_jenkins'
Check 'update_jenkins' failed: Reading package lists...
Building dependency tree...
Reading state information...
Package jenkins is not available, but is referred to by another package.
This may mean that the package is missing, has been obsoleted, or
is only available from another source

E: Package 'jenkins' has no installation candidate


15:16:32 releng/jenkins-deploy: promote and handle_service stage(s): 100% (in-flight: 0; ok: 0; fail: 1; left: 0) |
15:16:32 1 targets had deploy errors
15:16:32 1 targets failed
15:16:32 1 of 1 default targets failed, exceeding limit
Rollback all deployed groups? [Y/n]: Y
15:17:05 
== DEFAULT ==
:* releases2003.codfw.wmnet
15:17:08 ['/usr/bin/scap', 'deploy-local', '-v', '--repo', 'releng/jenkins-deploy', '--force', '-g', 'default', 'rollback', '--refresh-config'] (ran as deploy-jenkins@releases2003.codfw.wmnet) returned [1]: Registering scripts in directory '/srv/deployment/releng/jenkins-deploy-cache/revs/c274545bcee8c2bd79c6928dbb77820c62c35541/scap/scripts'
registered script? '/srv/deployment/releng/jenkins-deploy-cache/revs/c274545bcee8c2bd79c6928dbb77820c62c35541/scap/scripts/generate_casc_jobs.sh' True
registered script? '/srv/deployment/releng/jenkins-deploy-cache/revs/c274545bcee8c2bd79c6928dbb77820c62c35541/scap/scripts/update_jenkins.sh' True
Unhandled error:
deploy-local failed: <RuntimeError> there is no previous revision to rollback to (scap version: 4.153.0) (duration: 00m 00s)

15:17:08 releng/jenkins-deploy: rollback stage(s): 100% (in-flight: 0; ok: 0; fail: 1; left: 0) |
15:17:08 1 targets had deploy errors
15:17:08 1 targets failed
15:17:08 default deploy successful
15:17:08 Finished deploy [releng/jenkins-deploy@c274545] (releasing): T391590 (duration: 01m 14s)
15:17:08 Finished deploy [releng/jenkins-deploy@c274545] (releasing) (duration: 00m 43s)
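In the transcript above, the line "default deploy successful" is printed even though the promote and rollback stages each report fail: 1, so the per-stage counters are the more reliable signal. A small Python sketch for pulling failed stages out of such a log follows. The function name `failed_stages` is hypothetical, and the regular expression assumes the progress-line format exactly as it appears in this paste.

```python
import re

# Matches progress lines as they appear in the transcript, e.g.
# "15:16:32 releng/jenkins-deploy: promote and handle_service stage(s): 100% (in-flight: 0; ok: 0; fail: 1; left: 0) |"
STAGE_LINE = re.compile(
    r"^\d{2}:\d{2}:\d{2} \S+: (?P<stage>.+?) stage\(s\): \d+% "
    r"\(in-flight: \d+; ok: \d+; fail: (?P<fail>\d+); left: \d+\)",
    re.MULTILINE,
)


def failed_stages(log_text):
    """Return (stage name, failure count) for every stage with fail > 0."""
    return [
        (match.group("stage"), int(match.group("fail")))
        for match in STAGE_LINE.finditer(log_text)
        if int(match.group("fail")) > 0
    ]
```

Applied to this deploy, it would list the promote and handle_service stage and the rollback stage, each with one failed target.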

Change #1136765 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] jenkins: ensure systemd service dir exists before override

https://gerrit.wikimedia.org/r/1136765

Change #1135994 abandoned by Dzahn:

[operations/puppet@production] jenkins: fix puppet error, systemd override requires systemd service

Reason:

continue at https://gerrit.wikimedia.org/r/c/operations/puppet/+/1136765

https://gerrit.wikimedia.org/r/1135994

Change #1136765 merged by Dzahn:

[operations/puppet@production] jenkins: ensure systemd service dir exists before override

https://gerrit.wikimedia.org/r/1136765

The puppet failure is resolved.

But jenkins still needs to be scap-deployed to unblock upgrading releases2003.

@Arnoldokoth The issue you originally ran into on the first puppet run should not happen again. To test, I deleted the directory that had been created by hand and ran puppet again; it re-created the directory.
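The actual fix is the Puppet change 1136765, which is not reproduced here. The ordering idea behind it (make sure the drop-in directory exists before writing the override into it) can be sketched in Python. The function name and parameters below are hypothetical; the `<unit>.d` directory layout is the standard systemd drop-in convention.

```python
import os


def write_systemd_override(systemd_root, unit, filename, content):
    """Write a systemd drop-in override, creating its directory first.

    Hypothetical illustration of the ordering only; the real fix lives
    in Puppet. systemd_root would typically be /etc/systemd/system.
    """
    dropin_dir = os.path.join(systemd_root, f"{unit}.d")
    # Idempotent: succeeds whether or not the directory already exists,
    # so a freshly reimaged host and an existing host behave the same.
    os.makedirs(dropin_dir, exist_ok=True)
    path = os.path.join(dropin_dir, filename)
    with open(path, "w") as handle:
        handle.write(content)
    return path
```

Deleting the directory and calling the function again re-creates it, which mirrors the manual test described above.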

This does not solve the "jenkins deployed by scap" issue, but since this ticket was about the puppet error, it's resolved.

To avoid spamming the umbrella task T384959, can we reuse that task for the releases* upgrade or alternatively file a dedicated task? There is a long tail of issues related to OS upgrades :)

From P75040 above:

E: Package 'jenkins' has no installation candidate

That is because the upstream Jenkins package is only imported into bullseye-wikimedia. reprepro needs some configuration, and the upgrade guide needs a refresh: https://wikitech.wikimedia.org/wiki/Jenkins#Upgrading
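One way to confirm this on a target host before deploying is to look at the Candidate line of `apt-cache policy jenkins`. A minimal Python sketch follows; the function names are hypothetical, and the parser assumes the usual "Candidate: <version>" line, with "(none)" when no configured source offers the package.

```python
import re
import subprocess


def candidate_version(policy_output):
    """Return the Candidate version from `apt-cache policy` output, or None."""
    match = re.search(r"^\s*Candidate:\s*(\S+)", policy_output, re.MULTILINE)
    if match is None or match.group(1) == "(none)":
        return None
    return match.group(1)


def installable(package):
    """Run apt-cache policy for a package and report whether it has a candidate."""
    result = subprocess.run(
        ["apt-cache", "policy", package],
        capture_output=True,
        text=True,
        check=False,
    )
    return candidate_version(result.stdout) is not None
```

On releases2003, `installable("jenkins")` would be expected to return False until the package is imported for the new distribution.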

> To avoid spamming the umbrella task T384959, can we reuse that task for the releases* upgrade or alternatively file a dedicated task? There is a long tail of issues related to OS upgrades :)

Agreed. I made T392127