
Update devtools project puppetmaster
Closed, Resolved · Public

Description

All of Cloud VPS is being upgraded to Puppet 7 with new puppet infrastructure. Each project puppetmaster needs to be replaced with a version 7 puppetserver, and the project's VMs then upgraded to Puppet 7.

Your project contains the following v5 puppetmaster:

puppetmaster-1001.devtools.eqiad1.wikimedia.cloud

Please take a moment to consider whether or not you still need this project puppetmaster. If you do, migrate with the following steps. Do not hesitate to ask for help from @Andrew or @taavi on IRC if you run into trouble.

In order to migrate:

  1. Make sure you have available quota to create a new g3.cores1.ram2.disk20 VM. If you need more space please open a quota ticket.
  2. Create a 5GB cinder volume (named <projectname>-puppetserver or similar) and mount it as /srv on the existing puppetmaster (see the consolidated sketch at the end of this description). Then on the existing puppetmaster:
$ sudo cp -a /var/lib/git /srv
$ sudo mkdir /srv/puppet
$ sudo cp -a /var/lib/puppet/server /srv/puppet
  3. Unmount and detach the cinder volume
  4. Create a new VM for the v7 puppet server, using a flavor with at least 2GB of RAM and Debian Bookworm
  5. Mount the previously-created cinder volume at /srv on the new server
  6. Make the new VM a puppetserver by following the directions at https://wikitech.wikimedia.org/wiki/Help:Project_puppetserver#Step_1:_Setup_a_puppetserver.

Puppet classes:

role::puppetserver::cloud_vps_project

hiera:

profile::puppet::agent::force_puppet7: true
puppetmaster: puppet
  7. Adjust ownership on the new puppetserver:
$ sudo chown -R gitpuppet /srv/git; sudo chgrp -R gitpuppet /srv/git
$ sudo chown -R puppet /srv/puppet; sudo chgrp -R puppet /srv/puppet
$ sudo run-puppet-agent; sudo run-puppet-agent
$ sudo systemctl restart puppetserver
$ sudo puppetserver-deploy-code 
  8. Assuming that puppet is now running cleanly on the new puppetserver, move existing VMs to the new host with the hiera setting:
puppetmaster: <new puppetserver fqdn>
  9. Finally, update clients of the new puppetserver with the hiera setting:
profile::puppet::agent::force_puppet7: true

Debian Buster hosts will complain about not being able to install Puppet 7, but the warning is harmless for now.
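
For reference, here is a consolidated sketch of steps 2–3 as run on the existing puppetmaster. This is illustrative only: the device name /dev/sdb and the ext4 filesystem are assumptions, so check lsblk for the actual device after attaching the volume.

$ lsblk                                          # find the newly attached volume (assumed /dev/sdb below)
$ sudo mkfs.ext4 /dev/sdb                        # format the fresh volume
$ sudo mount /dev/sdb /srv                       # mount it at /srv
$ sudo cp -a /var/lib/git /srv                   # copy the puppet/private git clones
$ sudo mkdir /srv/puppet
$ sudo cp -a /var/lib/puppet/server /srv/puppet  # copy server state, including the ssl/CA dir
$ sudo umount /srv                               # then detach the volume in Horizon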

Event Timeline

brennen added a project: User-brennen.
brennen subscribed.

Tentatively: I can take a crack at this.

Andrew updated the task description.

Thank you for looking at this, @brennen. It's appreciated.

I made a separate ticket for the other buster machines in devtools that aren't the puppetmaster but should also be replaced.

I configured a new puppetmaster-1003 through to step 7 above.

Ran into some issues (detailed below) switching phabricator-bullseye over, and then thought to check whether the existing puppetmaster-1001 is standalone. Indeed it is, and it seems to use devtools-puppetdb1001.devtools.eqiad1.wikimedia.cloud. I can dig further into this next week, but I'm guessing I'm going to have some stupid questions to ask.


Switching phabricator-bullseye to this puppetmaster:

brennen@phabricator-bullseye:~$ sudo run-puppet-agent
Info: Using configured environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Retrieving locales
Info: Loading facts
Info: Caching catalog for phabricator-bullseye.devtools.eqiad1.wikimedia.cloud
Info: Applying configuration version '(8127cb8f7f) Cole White - opensearch: ensure cluster_wide curator job absent'
Notice: /Stage[main]/Puppet::Agent/Concat[/etc/puppet/puppet.conf]/File[/etc/puppet/puppet.conf]/content: 
--- /etc/puppet/puppet.conf     2023-11-21 21:07:37.616972595 +0000
+++ /tmp/puppet-file20240328-3215344-865r1j     2024-03-28 23:21:11.410794474 +0000
@@ -11,8 +11,8 @@
 factpath = $vardir/lib/facter
 
 [agent]
-server = puppetmaster-1001.devtools.eqiad1.wikimedia.cloud
-ca_server = puppetmaster-1001.devtools.eqiad1.wikimedia.cloud
+server = puppetmaster-1003.devtools.eqiad1.wikimedia.cloud
+ca_server = puppetmaster-1003.devtools.eqiad1.wikimedia.cloud
 daemonize = false
 http_connect_timeout = 60
 http_read_timeout = 960

Info: Computing checksum on file /etc/puppet/puppet.conf
Info: /Stage[main]/Puppet::Agent/Concat[/etc/puppet/puppet.conf]/File[/etc/puppet/puppet.conf]: Filebucketed /etc/puppet/puppet.conf to puppet with sum 85cc255649c4aed52416d023a7752fb0
Notice: /Stage[main]/Puppet::Agent/Concat[/etc/puppet/puppet.conf]/File[/etc/puppet/puppet.conf]/content: content changed '{md5}85cc255649c4aed52416d023a7752fb0' to '{md5}8b314ab36b7cbf93ac7a2aa36d6fcd7a'
Notice: Applied catalog in 7.25 seconds

However, on a second run:

...
Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Function Call, DNS lookup failed for ns-recursor0.openstack.eqiad1.wikimediacloud.org Resolv::DNS::Resource::IN::A (file: /srv/puppet_code/environments/production/manifests/realm.pp, line: 77, column: 9) on node phabricator-bullseye.devtools.eqiad1.wikimedia.cloud
Warning: Not using cache on failed catalog
Error: Could not retrieve catalog; skipping run

Errors persist after switching back to the old puppetmaster-1001.

It sounds like the clone of puppet.git in /srv/git/operations/puppet is outdated. Can you check that the production branch is checked out and is up-to-date with the latest commits in Gerrit?

Yeah, it looks like /srv/git/operations/puppet was checked out to an old testing branch from Gerrit. That's now fixed, although it doesn't have any effect on the error on phabricator-bullseye.
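
For anyone hitting the same thing, checking and fixing the checkout looks roughly like this (a sketch; it assumes origin tracks the Gerrit production branch, and puppetserver-deploy-code is the deploy script mentioned in the task description):

$ cd /srv/git/operations/puppet
$ sudo -u gitpuppet git status          # should be on 'production' and clean
$ sudo -u gitpuppet git checkout production
$ sudo -u gitpuppet git pull --ff-only  # bring the branch up to date with Gerrit
$ sudo puppetserver-deploy-code         # redeploy the corrected checkout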

That did the trick. Second run:

brennen@phabricator-bullseye:/etc/puppet$ sudo run-puppet-agent
Info: Using configured environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Retrieving locales
Info: Loading facts
Info: Caching catalog for phabricator-bullseye.devtools.eqiad1.wikimedia.cloud
Info: Applying configuration version '(8127cb8f7f) Cole White - opensearch: ensure cluster_wide curator job absent'
Notice: Applied catalog in 6.21 seconds
Error 500 on SERVER: Server Error: Could not find class role::puppetserver::standalone for puppetmaster-1003.devtools.eqiad1.wikimedia.cloud on node puppetmaster-1003.devtools.eqiad1.wikimedia.cloud

I'm not sure this is the cause of the problem, but is there any reason to have your new puppetserver manage itself rather than use the central puppetserver? My instructions up top say to add the hiera setting 'puppetmaster: puppet', which would avoid weird chicken-and-egg issues.


As the task description says, the role name for a Puppet 7 per-project server is role::puppetserver::cloud_vps_project.

Mentioned in SAL (#wikimedia-cloud) [2024-04-11T18:23:00Z] <mutante> - shutting down puppetmaster-1001 on buster - should now be replaced by puppetmaster-1003 on bookworm (thanks brennen) T360964 T360470

brennen moved this task from Doing to Done or Declined on the User-brennen board.

I changed puppetmaster-1003 to role::puppetserver::cloud_vps_project, and puppet runs now seem to be working there. @Dzahn is updating other boxen in the project to use the new puppetmaster. We'll consider this one resolved.

Change #1019109 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] cloud/devtools: switch default puppetmaster from 1001 to 1003

https://gerrit.wikimedia.org/r/1019109

Change #1019109 merged by Dzahn:

[operations/puppet@production] cloud/devtools: switch default puppetmaster from 1001 to 1003

https://gerrit.wikimedia.org/r/1019109

I went through the other instances in this project and switched them to the new puppetmaster via:

  • default hiera setting in repo
  • web hiera setting on instance level
  • manual edit of the puppet agent config to point to the new server and fix puppet runs (see the sketch below)
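
The manual agent-side edit is just the server/ca_server switch shown in the earlier puppet.conf diff, followed by an agent run; roughly:

$ sudo sed -i 's/puppetmaster-1001/puppetmaster-1003/g' /etc/puppet/puppet.conf
$ sudo run-puppet-agent    # the run then re-renders puppet.conf from the new hiera defaults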

One issue: on deploy-1004 (which itself is on buster!) we are getting "Failed to generate additional resources using 'eval_generate': Error 500 on SERVER: Server Error: Not authorized to call search on /file_metadata/volatile/GeoIP" when pointing it to the new master.

Switching it back to the old puppetmaster-1001 makes the issue go away. But since we changed the default server, a successful puppet run will switch it back to the new master.

Also, this is probably because buster doesn't mix with Puppet 7:

Notice: puppet7 is not available on buster.  forcing this is likely going to cause issue.
Notice: /Stage[main]/Profile::Puppet::Agent/Notify[puppet7 is not available on buster.  forcing this is likely going to cause issue.]/message: defined 'message' as "\u{1B}[31mpuppet7 is not available on buster.  forcing this is likely going to cause issue.\u{1B}[0m"

It's not really a buster thing -- the puppet code for geoip is entirely different in the puppetserver manifests vs. the old puppetmaster manifests.

I wouldn't expect geoip to work on cloud-vps anyway, but on deployment-prep I fixed that issue with the following hack:

commit 8b6b54fe43b77628f439850514ef97b93d487fd4 (HEAD -> production, tag: snapshot-202404122043)
Author: root <root@deployment-puppetserver-1.deployment-prep.eqiad1.wikimedia.cloud>
Date:   Thu Mar 14 20:05:11 2024 +0000

    [BETA HACK] Changes to profile::puppetserver::volatile
    
    This allows a stubbed-out version of this class to apply on a
    deployment-prep puppetserver. Probably better than hunting down
    every use of this elsewhere.

diff --git a/modules/profile/manifests/puppetserver/volatile.pp b/modules/profile/manifests/puppetserver/volatile.pp
index 87f8434487f..dc3c6cba8d9 100644
--- a/modules/profile/manifests/puppetserver/volatile.pp
+++ b/modules/profile/manifests/puppetserver/volatile.pp
@@ -15,18 +15,13 @@ class profile::puppetserver::volatile (
     # Should be defined in the private repo.
     Hash[String, Any]         $ip_reputation_config  = lookup('profile::puppetserver::volatile::ip_reputation_config'),
     Array[String]             $ip_reputation_proxies = lookup('profile::puppetserver::volatile::ip_reputation_proxies'),
-
+    Hash[String, Any]         $extra_mounts  = lookup('profile::puppetserver::extra_mounts'),
 ){
-    include profile::puppetserver
-    unless $profile::puppetserver::extra_mounts.has_key('volatile') {
+    unless $extra_mounts.has_key('volatile') {
         fail("Must define a volatile entry in profile::puppetserver::extra_mounts to use ${title}")
     }
     include profile::puppetserver::git
-    unless $profile::puppetserver::git::repos.has_key('private') {
-        fail("Must define a private entry in profile::puppetserver::git::repos to use ${title}")
-    }
-    $private_repo_path = "${profile::puppetserver::git::basedir}/private"
-    $base_path            = $profile::puppetserver::extra_mounts['volatile']
+    $base_path            = $extra_mounts['volatile']
     $geoip_destdir        = "${base_path}/GeoIP"
     $geoip_destdir_ipinfo = "${base_path}/GeoIPInfo"
 
@@ -44,15 +39,6 @@ class profile::puppetserver::volatile (
     # Needed by update-netboot-image
     ensure_packages('pax')
 
-    class { 'external_clouds_vendors':
-        user         => 'root',
-        manage_user  => false,
-        outfile      => "${base_path}/external_cloud_vendors/public_clouds.json",
-        # TODO: when puppet 7 production set to $profile::puppetserver::enable_ca
-        conftool     => false,
-        http_proxy   => $http_proxy,
-        private_repo => $private_repo_path,
-    }
     class { 'ip_reputation_vendors':
         ensure         => stdlib::ensure(!$ip_reputation_proxies.empty()),
         user           => 'root',

I've no doubt that there's a better solution but it probably requires a realm check.

Puppet runs on some machines which use the new puppetmaster in devtools fail; here's an example from gitlab-runner-1002.devtools.eqiad1.wikimedia.cloud:

gitlab-runner-1002:~$ sudo run-puppet-agent
Info: Using environment 'production'
Error: Connection to https://puppetmaster-1003.devtools.eqiad1.wikimedia.cloud:8140/puppet/v3 failed, trying next route: Request to https://puppetmaster-1003.devtools.eqiad1.wikimedia.cloud:8140/puppet/v3 failed after 3.069 seconds: Failed to open TCP connection to puppetmaster-1003.devtools.eqiad1.wikimedia.cloud:8140 (No route to host - connect(2) for "puppetmaster-1003.devtools.eqiad1.wikimedia.cloud" port 8140)
Wrapped exception:
Failed to open TCP connection to puppetmaster-1003.devtools.eqiad1.wikimedia.cloud:8140 (No route to host - connect(2) for "puppetmaster-1003.devtools.eqiad1.wikimedia.cloud" port 8140)
Error: No more routes to fileserver
Info: Loading facts
Error: Connection to https://puppetmaster-1003.devtools.eqiad1.wikimedia.cloud:8140/puppet/v3 failed, trying next route: Request to https://puppetmaster-1003.devtools.eqiad1.wikimedia.cloud:8140/puppet/v3 failed after 2.498 seconds: Failed to open TCP connection to puppetmaster-1003.devtools.eqiad1.wikimedia.cloud:8140 (No route to host - connect(2) for "puppetmaster-1003.devtools.eqiad1.wikimedia.cloud" port 8140)
Wrapped exception:
Failed to open TCP connection to puppetmaster-1003.devtools.eqiad1.wikimedia.cloud:8140 (No route to host - connect(2) for "puppetmaster-1003.devtools.eqiad1.wikimedia.cloud" port 8140)
Error: Could not retrieve catalog from remote server: No more routes to puppet
Warning: Not using cache on failed catalog
Error: Could not retrieve catalog; skipping run
Error: Connection to https://puppetmaster-1003.devtools.eqiad1.wikimedia.cloud:8140/puppet/v3 failed, trying next route: Request to https://puppetmaster-1003.devtools.eqiad1.wikimedia.cloud:8140/puppet/v3 failed after 3.063 seconds: Failed to open TCP connection to puppetmaster-1003.devtools.eqiad1.wikimedia.cloud:8140 (No route to host - connect(2) for "puppetmaster-1003.devtools.eqiad1.wikimedia.cloud" port 8140)
Wrapped exception:
Failed to open TCP connection to puppetmaster-1003.devtools.eqiad1.wikimedia.cloud:8140 (No route to host - connect(2) for "puppetmaster-1003.devtools.eqiad1.wikimedia.cloud" port 8140)
Error: Could not send report: No more routes to report

See active alerts in devtools: https://alerts.wikimedia.org/?q=%40state%3Dactive&q=alertname%3DPuppetAgentNoResources&q=project%3Ddevtools

That's new. It worked before, and "no route to host" sounds like the machine is down or a networking problem.
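
A quick way to distinguish a network problem from a dead puppetserver process, from any agent (a sketch; assumes nc is installed):

$ ping -c 1 puppetmaster-1003.devtools.eqiad1.wikimedia.cloud     # is the host reachable at all?
$ nc -vz puppetmaster-1003.devtools.eqiad1.wikimedia.cloud 8140   # is anything listening on the puppet port?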

The latest puppetserver code is prone to gobbling RAM; I'd check for OOM messages and see about using profile::puppetserver::java_max_mem.
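
Checking for OOM kills on the puppetserver host is straightforward; a sketch (the 2g value below is only an example, not a recommendation):

$ sudo dmesg -T | grep -i 'out of memory'
$ sudo journalctl -k --since "2 days ago" | grep -i 'killed process'
# if the JVM is getting killed, cap the heap via hiera on the puppetserver, e.g.:
#   profile::puppetserver::java_max_mem: 2g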

Also, just so I'm clear... your new puppet7 server is still named 'puppetmaster<something>' rather than 'puppetserver<something>', is that right? (In all other projects the new servers don't use the 'puppetmaster' name, so I can tell the difference between new and old.)


Yeah, I think based on the "update puppetmaster" wording of this task I just incremented the naming of the old box (puppetmaster-1001 → puppetmaster-1003; I broke something or other along the way and had to try twice).

Currently can't ssh to puppetmaster-1003 from outside either, also with "no route to host". I'll try rebooting it via Horizon.

root@puppetmaster-1003:~# [  899.382711] /dev/sdb: Can't open blockdev

"soft reboot" resurrected it, can ssh to it again.

Running puppet on the puppetmaster shows:

Error: Cannot create /srv/puppet/server; parent directory /srv/puppet does not exist

but the puppet run finishes nevertheless.
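
Worth confirming after a reboot like this that the cinder volume actually re-attached and is mounted where puppetserver expects it (a sketch):

$ lsblk                       # does the volume device show up at all?
$ findmnt /srv                # is it mounted?
$ ls -ld /srv/puppet/server   # the directory the error complains about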

Unfortunately this hasn't fixed the issue for agents yet, for example gitlab-runner-1002 now says:

Failed to open TCP connection to puppetmaster-1003.devtools.eqiad1.wikimedia.cloud:8140 (Connection refused

This is because the service hasn't started due to:

Warning: /Stage[main]/Puppetserver/Service[puppetserver]: Skipping because of failed dependencies

also:

Error: '/usr/local/bin/puppetserver-deploy-code' returned 1 instead of one of [0]
Error: /Stage[main]/Profile::Puppetserver::Git/Exec[puppetserver-deploy-code]/returns: change from 'notrun' to ['0'] failed: '/usr/local/bin/puppetserver-deploy-code' returned 1 instead of one of [0] (corrective)
Exception in thread "main" java.lang.IllegalStateException: Unable to borrow JRubyInstance from pool
	at puppetlabs.services.jruby_pool_manager.impl.jruby_internal$eval25425$borrow_from_pool_BANG__STAR___25430$fn__25431.invoke(jruby_internal.clj:313)
	at puppetlabs.services.jruby_pool_manager.impl.jruby_internal$eval25425$borrow_from_pool_BANG__STAR___25430.invoke(jruby_internal.clj:300)
	at puppetlabs.services.jruby_pool_manager.impl.jruby_internal$eval25472$borrow_from_pool_with_timeout__25477$fn__25478.invoke(jruby_internal.clj:348)
	at puppetlabs.services.jruby_pool_manager.impl.jruby_internal$eval25472$borrow_from_pool_with_timeout__25477.invoke(jruby_internal.clj:337)
	at puppetlabs.services.jruby_pool_manager.impl.instance_pool$eval28315$fn__28328.invoke(instance_pool.clj:48)
	at puppetlabs.services.protocols.jruby_pool$eval26341$fn__26375$G__26318__26382.invoke(jruby_pool.clj:3)
	at puppetlabs.services.jruby_pool_manager.jruby_core$eval26891$borrow_from_pool_with_timeout__26896$fn__26897.invoke(jruby_core.clj:222)
	at puppetlabs.services.jruby_pool_manager.jruby_core$eval26891$borrow_from_pool_with_timeout__26896.invoke(jruby_core.clj:209)
	at puppetlabs.services.config.puppet_server_config_core$eval37399$get_puppet_config__37404$fn__37405$fn__37406.invoke(puppet_server_config_core.clj:107)
	at puppetlabs.services.config.puppet_server_config_core$eval37399$get_puppet_config__37404$fn__37405.invoke(puppet_server_config_core.clj:107)
	at puppetlabs.services.config.puppet_server_config_core$eval37399$get_puppet_config__37404.invoke(puppet_server_config_core.clj:102)
	at puppetlabs.services.config.puppet_server_config_service$reify__37434$service_fnk__5716__auto___positional$reify__37445.init(puppet_server_config_service.clj:25)
	at puppetlabs.trapperkeeper.services$eval5514$fn__5515$G__5502__5518.invoke(services.clj:9)
	at puppetlabs.trapperkeeper.services$eval5514$fn__5515$G__5501__5522.invoke(services.clj:9)
	at puppetlabs.trapperkeeper.internal$eval16416$run_lifecycle_fn_BANG___16423$fn__16424.invoke(internal.clj:196)
	at puppetlabs.trapperkeeper.internal$eval16416$run_lifecycle_fn_BANG___16423.invoke(internal.clj:179)
	at puppetlabs.trapperkeeper.internal$eval16445$run_lifecycle_fns__16450$fn__16451.invoke(internal.clj:229)
	at puppetlabs.trapperkeeper.internal$eval16445$run_lifecycle_fns__16450.invoke(internal.clj:206)
	at puppetlabs.trapperkeeper.internal$eval17087$build_app_STAR___17096$fn$reify__17108.init(internal.clj:614)
	at puppetlabs.trapperkeeper.internal$eval17137$boot_services_for_app_STAR__STAR___17144$fn__17145$fn__17147.invoke(internal.clj:648)
	at puppetlabs.trapperkeeper.internal$eval17137$boot_services_for_app_STAR__STAR___17144$fn__17145.invoke(internal.clj:647)
	at puppetlabs.trapperkeeper.internal$eval17137$boot_services_for_app_STAR__STAR___17144.invoke(internal.clj:641)
	at clojure.core$partial$fn__5910.invoke(core.clj:2647)
	at puppetlabs.trapperkeeper.internal$eval16490$initialize_lifecycle_worker__16501$fn__16502$fn__16665$state_machine__13652__auto____16690$fn__16693.invoke(internal.clj:249)
	at puppetlabs.trapperkeeper.internal$eval16490$initialize_lifecycle_worker__16501$fn__16502$fn__16665$state_machine__13652__auto____16690.invoke(internal.clj:249)
	at clojure.core.async.impl.ioc_macros$run_state_machine.invokeStatic(ioc_macros.clj:978)
	at clojure.core.async.impl.ioc_macros$run_state_machine.invoke(ioc_macros.clj:977)
	at clojure.core.async.impl.ioc_macros$run_state_machine_wrapped.invokeStatic(ioc_macros.clj:982)
	at clojure.core.async.impl.ioc_macros$run_state_machine_wrapped.invoke(ioc_macros.clj:980)
	at clojure.core.async$ioc_alts_BANG_$fn__13899.invoke(async.clj:421)
	at clojure.core.async$do_alts$fn__13830$fn__13833.invoke(async.clj:288)
	at clojure.core.async.impl.channels.ManyToManyChannel$fn__7557$fn__7558.invoke(channels.clj:99)
	at clojure.lang.AFn.run(AFn.java:22)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at clojure.core.async.impl.concurrent$counted_thread_factory$reify__7426$fn__7427.invoke(concurrent.clj:29)
	at clojure.lang.AFn.run(AFn.java:22)
	at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: org.jruby.embed.InvokeFailedException: org.jruby.exceptions.RuntimeError: (RuntimeError) Got 1 failure(s) while initializing: File[/srv/puppet/server/ssl]: change from 'absent' to 'directory' failed: Cannot create /srv/puppet/server/ssl; parent directory /srv/puppet/server does not exist
	at org.jruby.embed.internal.EmbedRubyObjectAdapterImpl.doInvokeMethod(EmbedRubyObjectAdapterImpl.java:253)
	at org.jruby.embed.internal.EmbedRubyObjectAdapterImpl.callMethod(EmbedRubyObjectAdapterImpl.java:162)
	at org.jruby.embed.ScriptingContainer.callMethod(ScriptingContainer.java:1464)
	at com.puppetlabs.jruby_utils.jruby.InternalScriptingContainer.callMethodWithArgArray(InternalScriptingContainer.java:43)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:568)
	at clojure.lang.Reflector.invokeMatchingMethod(Reflector.java:167)
	at clojure.lang.Reflector.invokeInstanceMethod(Reflector.java:102)
	at puppetlabs.services.jruby.jruby_puppet_core$eval27353$get_initialize_pool_instance_fn__27358$fn__27359$fn__27360.invoke(jruby_puppet_core.clj:141)
	at puppetlabs.services.jruby_pool_manager.impl.jruby_internal$eval25225$create_pool_instance_BANG___25234$fn__25237.invoke(jruby_internal.clj:256)
	at puppetlabs.services.jruby_pool_manager.impl.jruby_internal$eval25225$create_pool_instance_BANG___25234.invoke(jruby_internal.clj:225)
	at puppetlabs.services.jruby_pool_manager.impl.jruby_agents$eval25651$add_instance__25656$fn__25660.invoke(jruby_agents.clj:52)
	at puppetlabs.services.jruby_pool_manager.impl.jruby_agents$eval25651$add_instance__25656.invoke(jruby_agents.clj:47)
	at puppetlabs.services.jruby_pool_manager.impl.jruby_agents$eval25678$prime_pool_BANG___25683$fn__25687.invoke(jruby_agents.clj:76)
	at puppetlabs.services.jruby_pool_manager.impl.jruby_agents$eval25678$prime_pool_BANG___25683.invoke(jruby_agents.clj:61)
	at puppetlabs.services.jruby_pool_manager.impl.instance_pool$eval28315$fn__28324$fn__28325.invoke(instance_pool.clj:16)
	at puppetlabs.trapperkeeper.internal$shutdown_on_error_STAR_.invokeStatic(internal.clj:403)
	at puppetlabs.trapperkeeper.internal$shutdown_on_error_STAR_.invoke(internal.clj:378)
	at puppetlabs.trapperkeeper.internal$shutdown_on_error_STAR_.invokeStatic(internal.clj:388)
	at puppetlabs.trapperkeeper.internal$shutdown_on_error_STAR_.invoke(internal.clj:378)
	at puppetlabs.trapperkeeper.internal$eval16950$shutdown_service__16955$fn$reify__16957$service_fnk__5716__auto___positional$reify__16962.shutdown_on_error(internal.clj:448)
	at puppetlabs.trapperkeeper.internal$eval16874$fn__16892$G__16866__16900.invoke(internal.clj:411)
	at puppetlabs.trapperkeeper.internal$eval16874$fn__16892$G__16865__16909.invoke(internal.clj:411)
	at clojure.core$partial$fn__5908.invoke(core.clj:2642)
	at clojure.core$partial$fn__5908.invoke(core.clj:2641)
	at puppetlabs.services.jruby_pool_manager.impl.jruby_agents$eval25625$send_agent__25630$fn__25631$agent_fn__25632.invoke(jruby_agents.clj:41)
	at clojure.core$binding_conveyor_fn$fn__5823.invoke(core.clj:2050)
	at clojure.lang.AFn.applyToHelper(AFn.java:154)
	at clojure.lang.RestFn.applyTo(RestFn.java:132)
	at clojure.lang.Agent$Action.doRun(Agent.java:114)
	at clojure.lang.Agent$Action.run(Agent.java:163)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	... 1 more
Caused by: org.jruby.exceptions.RuntimeError: (RuntimeError) Got 1 failure(s) while initializing: File[/srv/puppet/server/ssl]: change from 'absent' to 'directory' failed: Cannot create /srv/puppet/server/ssl; parent directory /srv/puppet/server does not exist
	at RUBY.use(/usr/lib/ruby/vendor_ruby/puppet/settings.rb:1140)
	at RUBY.apply(/usr/lib/ruby/vendor_ruby/puppet/resource/catalog.rb:248)
	at RUBY.use(/usr/lib/ruby/vendor_ruby/puppet/settings.rb:1130)
	at RUBY.initialize_puppet(uri:classloader:/puppetserver-lib/puppet/server/puppet_config.rb:91)
	at RUBY.initialize(uri:classloader:/puppetserver-lib/puppet/server/master.rb:39)
	at org.jruby.RubyClass.new(org/jruby/RubyClass.java:911)
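
When the service is skipped because of failed dependencies like this, the unit state and journal usually show the underlying error directly; a sketch using standard systemd tooling:

$ sudo systemctl status puppetserver
$ sudo journalctl -u puppetserver --since today | tail -n 50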

well, "parent directory /srv/puppet/server does not exist" sounds at least fixable, but wouldn't this affect ALL puppet servers once they get rebooted?

# @param ssldir_on_srv used on cloud-vps; it allows storing certs on a detachable volume
if $ssldir_on_srv {                      
    $ssl_dir = '/srv/puppet/server/ssl'  
} else {                                 
    $ssl_dir = $separate_ssldir.bool2str('/var/lib/puppet/server/ssl', '/var/lib/puppet/ssl')
}
hieradata/common/profile/puppetserver.yaml:profile::puppetserver::ssldir_on_srv: false
hieradata/cloud.yaml:profile::puppetserver::ssldir_on_srv: true

^^ (ssldir_on_srv is only true for cloud-vps puppetservers, so production puppetservers keep their ssl dir under /var/lib/puppet and are unaffected)

Mentioned in SAL (#wikimedia-cloud) [2024-04-15T17:52:19Z] <mutante> - added profile::labs::cindermount::srv to puppetmaster-1003 in horizon to get missing cinder volume - T360470

After the above created /srv/puppet, the service could start again and puppet runs on clients work again (with the exception of deploy* and the old puppetmaster).

puppetmaster-1003 is down again :/ tried to soft reboot it...

edit: working again after reboot

(Screenshot attached: Screenshot from 2024-04-17 07-37-40.png)

^ Afraid it's not stable yet; seems down again. Soft rebooted it.

Mentioned in SAL (#wikimedia-cloud) [2024-04-17T21:18:38Z] <mutante> - resizing puppetmaster-1003 from g3.cores1.ram2.disk20 to g3.cores2.ram4.disk20 - T360470

Things have been working better since we gave it more resources. Closing again for now.

Mentioned in SAL (#wikimedia-cloud) [2024-04-24T19:41:21Z] <mutante> deleting instance puppetmaster-1001 that was > 4 years old, on buster and I had shutdown a couple days ago. replaced by puppetmaster-1003 (bookworm, puppetserver) T360964 T360470

Mentioned in SAL (#wikimedia-cloud) [2024-05-02T23:52:24Z] <mutante> switching puppetmaster for deploy-1006 back to local project puppetmaster; rm -rf /var/lib/puppet/ssl that still referred to puppetmaster-1001, signing new request on puppetmaster-1003 T360470 T363415
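
The re-keying mentioned in that SAL entry follows the usual agent/CA dance; a sketch (puppetserver ca is the stock Puppet 7 CA CLI, and the certname is the agent's FQDN):

# on the agent (deploy-1006):
$ sudo rm -rf /var/lib/puppet/ssl
$ sudo run-puppet-agent                        # generates and submits a new certificate request
# on puppetmaster-1003:
$ sudo puppetserver ca list                    # show pending requests
$ sudo puppetserver ca sign --certname deploy-1006.devtools.eqiad1.wikimedia.cloud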

One note: the puppet sync had been broken since the end of March due to permission issues on the directory. Some folders were owned by root and not gitpuppet (in both the labs private and puppet repos). I fixed the permissions using chown gitpuppet:gitpuppet -R /srv/git/operations/puppet/. Puppet is happy again on the devtools puppet server. Thanks to @taavi again for the help.

Make sure to use the correct user in the future, like sudo -u gitpuppet git status.

@Jelto and @taavi: Actually, we saw a little while ago that the puppetmaster was not in sync, but hesitated to just git reset --hard it. Thanks for this; the outcome fixed the GitLab login and also this general issue with the new puppetmaster. Note, though, that all the non-gitlab instances in devtools now use the central puppetmaster rather than the local one.