Cloud VPS: 2024-10-22 cloud-wide puppet problem related to java update
Closed, Resolved (Public)

Description

Today there was a cloud-wide puppet problem related to puppet-enc:

aborrero@bastion-restricted-eqiad1-3:~$ sudo run-puppet-agent
Info: Using environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Failed when searching for node bastion-restricted-eqiad1-3.bastion.eqiad1.wikimedia.cloud: Exception while executing '/usr/local/bin/puppet-enc': Cannot run program "/usr/local/bin/puppet-enc" (in directory "."): error=0, Failed to exec spawn helper: pid: 872033, exit value: 1
Warning: Not using cache on failed catalog
Error: Could not retrieve catalog; skipping run

Event Timeline

aborrero changed the task status from Open to In Progress. (Oct 22 2024, 8:47 AM)
aborrero triaged this task as Unbreak Now! priority.
aborrero moved this task from Backlog to Doing on the User-aborrero board.

The puppet-enc API itself seems to be up and running:

aborrero@cloudinfra-cloudvps-puppetserver-1:~$ curl https://puppet-enc.cloudinfra.wmcloud.org/v1/cloudinfra/node/cloudinfra-cloudvps-puppetserver-1
hiera:
  acmechief_host: cloudinfra-acme-chief-02.cloudinfra.eqiad1.wikimedia.cloud
  profile::puppet::agent::dns_alt_names:
  - puppet
  - puppetmaster.cloudinfra.wmflabs.org
  profile::puppet::agent::force_puppet7: true
  profile::puppetserver::autosign: /usr/local/sbin/validatelabsfqdn.py
  profile::puppetserver::git::repos:
    labs/private:
      branch: master
      hooks:
        post-checkout: puppet:///modules/profile/puppetserver/git/operations/hooks/deploy-code.sh
        post-commit: puppet:///modules/profile/puppetserver/git/operations/hooks/deploy-code.sh
        post-merge: puppet:///modules/profile/puppetserver/git/operations/hooks/deploy-code.sh
        pre-commit: puppet:///modules/puppetmaster/git/private/pre-commit
        pre-merge: puppet:///modules/puppetmaster/git/private/pre-merge
        pre-rebase: puppet:///modules/puppetmaster/git/private/pre-rebase
      link: /etc/puppet/private
    operations/puppet:
      branch: production
      hooks:
        post-checkout: puppet:///modules/profile/puppetserver/git/operations/hooks/deploy-code.sh
        post-commit: puppet:///modules/profile/puppetserver/git/operations/hooks/deploy-code.sh
        post-merge: puppet:///modules/profile/puppetserver/git/operations/hooks/deploy-code.sh
        pre-commit: puppet:///modules/puppetmaster/git/pre-commit
        pre-merge: puppet:///modules/puppetmaster/git/pre-merge
        pre-rebase: puppet:///modules/puppetmaster/git/pre-rebase
  profile::puppetserver::java_max_mem: 26g
  profile::puppetserver::server_certname: puppetmaster.cloudinfra.wmflabs.org
  puppetmaster: cloudinfra-internal-puppetserver-1.cloudinfra.eqiad1.wikimedia.cloud
roles:
- role::puppetserver::cloud_vps_global

The puppetserver service seems to have some errors:

aborrero@cloudinfra-cloudvps-puppetserver-1:~$ sudo journalctl -u puppetserver --since today
[...]
Oct 22 06:57:59 cloudinfra-cloudvps-puppetserver-1 java[866423]: jspawnhelper version 17.0.13+11-Debian-2deb12u1
Oct 22 06:57:59 cloudinfra-cloudvps-puppetserver-1 java[866423]: This command is not for general use and should only be run as the result of a call to
Oct 22 06:57:59 cloudinfra-cloudvps-puppetserver-1 java[866423]: ProcessBuilder.start() or Runtime.exec() in a java application
Oct 22 06:58:10 cloudinfra-cloudvps-puppetserver-1 java[866426]: Incorrect number of arguments: 2
[...]

Mentioned in SAL (#wikimedia-cloud) [2024-10-22T09:04:58Z] <arturo> restart puppetserver service in T377803

Mentioned in SAL (#wikimedia-cloud) [2024-10-22T09:05:53Z] <arturo> restart puppetserver service for T377803

aborrero lowered the priority of this task from Unbreak Now! to High. (Oct 22 2024, 9:07 AM)

Restarting the puppetserver.service unit on the affected puppetserver VM seems to fix the problem. Lowering priority.

There was an unattended java upgrade today:

aborrero@cloudinfra-cloudvps-puppetserver-1:~$ sudo tail /var/log/apt/history.log
[...]
Start-Date: 2024-10-22  06:56:59
Commandline: /usr/bin/unattended-upgrade
Upgrade: openjdk-17-jre-headless:amd64 (17.0.12+7-2~deb12u1, 17.0.13+11-2~deb12u1)
End-Date: 2024-10-22  06:57:02

aborrero@tools-puppetserver-01:~$ sudo tail /var/log/apt/history.log
[...]
Start-Date: 2024-10-22  06:54:32
Commandline: /usr/bin/unattended-upgrade
Upgrade: openjdk-17-jre-headless:amd64 (17.0.12+7-2~deb12u1, 17.0.13+11-2~deb12u1)
End-Date: 2024-10-22  06:54:36

This is likely the root cause of the problem.
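
The same check can be scripted across several puppetserver VMs. The following is a minimal sketch, assuming the stanza layout shown in the excerpts above (blank-line separated stanzas, `Upgrade:` entries of the form `name:arch (old, new)`). The function name `find_upgrades` and the `SAMPLE` text are illustrative and not part of any existing tooling.

```python
import re
from datetime import datetime

# Matches one "name:arch (old, new)" entry on an apt history "Upgrade:" line.
UPGRADE_ENTRY = re.compile(r"(\S+?):(\S+) \(([^,]+), ([^)]+)\)")


def find_upgrades(history_text, package_prefix):
    """Return (end_time, package, old_version, new_version) for matching upgrades."""
    found = []
    # Stanzas in history.log are separated by blank lines.
    for stanza in history_text.strip().split("\n\n"):
        fields = {}
        for line in stanza.splitlines():
            key, sep, value = line.partition(": ")
            if sep:
                fields[key] = value.strip()
        if "Upgrade" not in fields or "End-Date" not in fields:
            continue
        # Date and time are separated by two spaces; normalise before parsing.
        end_text = " ".join(fields["End-Date"].split())
        end_time = datetime.strptime(end_text, "%Y-%m-%d %H:%M:%S")
        for name, _arch, old, new in UPGRADE_ENTRY.findall(fields["Upgrade"]):
            if name.startswith(package_prefix):
                found.append((end_time, name, old, new))
    return found


# Stanza copied from the cloudinfra-cloudvps-puppetserver-1 excerpt above.
SAMPLE = """Start-Date: 2024-10-22  06:56:59
Commandline: /usr/bin/unattended-upgrade
Upgrade: openjdk-17-jre-headless:amd64 (17.0.12+7-2~deb12u1, 17.0.13+11-2~deb12u1)
End-Date: 2024-10-22  06:57:02
"""

for end_time, name, old, new in find_upgrades(SAMPLE, "openjdk-"):
    print(f"{end_time.isoformat()} {name}: {old} -> {new}")
```

In practice the text would come from reading `/var/log/apt/history.log` on each host; rotated and compressed history files are not handled here.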

aborrero renamed this task from Cloud VPS: cloud-wide puppet problem related to puppet-enc 2024-10-22 to Cloud VPS: 2024-10-22 cloud-wide puppet problem related to java update. (Oct 22 2024, 9:22 AM)

Chat on the #wikimedia-sre IRC channel:

11:25 <elukey> arturo: o/ yes we are aware of this issue, when openjdk is installed we need to immediately restart puppetserver :(
11:26 <elukey> IIUC a restart fixed it right?
11:26 <arturo> elukey: apparently yes
11:26 <moritzm> yeah, you should exempt openjdk-17 from unattended-upgrades for the puppetserver role
11:26 <moritzm> puppetserver needs an immediate restart after the upgrade
11:27 <moritzm> I think it's related to jruby, jruby-compiled artefacts generated by two different JREs don't mix
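
The rule from the chat (puppetserver must be restarted once openjdk has been upgraded) reduces to a timestamp comparison: a JVM started before the upgrade finished is still running the old JRE. A minimal sketch follows, assuming both timestamps are in the same timezone. The function name `restart_needed` and the "started before" time are hypothetical; the other two timestamps are taken from the apt history excerpt and the SAL entry in this task.

```python
from datetime import datetime


def restart_needed(service_started, upgrade_finished):
    """True when the service was started before the JRE upgrade finished."""
    return service_started < upgrade_finished


# End-Date of the openjdk upgrade on cloudinfra-cloudvps-puppetserver-1.
upgrade_finished = datetime(2024, 10, 22, 6, 57, 2)

# Hypothetical start time before the incident (not recorded in this task).
started_before = datetime(2024, 10, 21, 12, 0, 0)
# Time of the manual restart logged in SAL.
started_after = datetime(2024, 10, 22, 9, 5, 53)

print(restart_needed(started_before, upgrade_finished))
print(restart_needed(started_after, upgrade_finished))
```

A check like this only detects the stale state after the fact. The apt pin suggested above avoids it by keeping unattended-upgrades from replacing the JRE underneath a running puppetserver.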

Change #1082201 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] wmcs: puppetserver: introduce apt pin for openjdk

https://gerrit.wikimedia.org/r/1082201

Change #1082201 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] wmcs: puppetserver: introduce apt pin for openjdk

https://gerrit.wikimedia.org/r/1082201

aborrero claimed this task.