Description

<_joe_> Wmflib::Service has been added today
<_joe_> ok the original error was
<_joe_> Jan 20 06:42:30 deployment-cache-upload05 puppet-agent[8798]: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Evaluation Error: Error while evaluating a Function Call, Failed to parse template varnish/upload-common.inc.vcl.erb:
<_joe_> Jan 20 06:42:30 deployment-cache-upload05 puppet-agent[8798]: Filepath: /etc/puppet/modules/varnish/templates/upload-common.inc.vcl.erb
<_joe_> Jan 20 06:42:30 deployment-cache-upload05 puppet-agent[8798]: Line: 42
<_joe_> Jan 20 06:42:30 deployment-cache-upload05 puppet-agent[8798]: Detail: key not found: "upload_domain"
<_joe_> so Reedy I can't really wrap my head around why that wouldn't work
<_joe_> ok, found out I think
<_joe_> it's a bug in puppet 4.x
<Reedy> nice
<_joe_> yes confirmed
<_joe_> so I guess you need to open a task about upgrading puppet version in deployment-prep
Related Objects
Mentioned In
- T248041: puppetdb on deployment-puppetdb03 keeps getting OOMKilled
- T153163: Set up and use exported resources for Tool Labs's shared knowledge
- T244472: Stream a subset of mediawiki apache logs to logstash
- T244586: Restbase routing down on beta, 2020-02-07
- T244074: Host key verification failed x 2
- T243881: en.wikipedia.beta.wmflabs.org not accessible in Firefox, due to something with OCSP
- T243355: puppet panel: Can't add new prefixes

Mentioned Here
- T223971: Old cloudvirt (with Intel Xeon) are half the speed of newer ones (Intel Sky Lake)
- T229441: CloudVPS: codfw1dev: missing bits
- rCLIPec5f9c1645d2: Horizon auto commit for user krenair
Event Timeline
FWIW:
maurelio@deployment-deploy01:~$ sudo puppet --version
4.8.2
To which version do we need to upgrade?
I'm guessing this is just us needing to get a new puppetmaster with buster instead of stretch.
>>! In T243226#5817998, @Krenair wrote:
> I'm guessing this is just us needing to get a new puppetmaster with buster instead of stretch.
I added the puppet-master version 5 packages to the puppet5 component, so you should be able to upgrade the puppetmaster with the instructions I have added to wikitech. However, the bigger problem is the puppetdb server. This will need to be rebuilt on buster, as it is not a simple task to backport the puppetdb packages. The procedure I would advise is:
- build new puppetdb server on buster
- update hiera config to include the following which will ensure that the new puppetdb server gets populated with the required metadata
profile::puppetmaster::common::puppetdb_hosts:
  - $fqdn_current_puppetdb
  - $fqdn_new_puppetdb
profile::puppetmaster::common::command_broadcast: true
- puppet agent now needs to run on all clients so the new DB can build up its collection of exported resources; I would run it at least twice, just in case there are some complex dependencies
- either run it manually or wait an hour or two
- swap the order of profile::puppetmaster::common::puppetdb_hosts: so the new puppetdb is first (see the staged sketch after this list)
- validate everything is working
- remove the old puppetdb from profile::puppetmaster::common::puppetdb_hosts: and remove profile::puppetmaster::common::command_broadcast: true
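To make the sequence concrete, here is a sketch of the hiera value at each stage (same placeholder hostnames as in the steps above):

# stage 1 - both DBs, old one first, so the new DB gets populated
profile::puppetmaster::common::puppetdb_hosts:
  - $fqdn_current_puppetdb
  - $fqdn_new_puppetdb
profile::puppetmaster::common::command_broadcast: true

# stage 2 - swap the order so the new DB is queried first
profile::puppetmaster::common::puppetdb_hosts:
  - $fqdn_new_puppetdb
  - $fqdn_current_puppetdb
profile::puppetmaster::common::command_broadcast: true

# stage 3 - once validated, drop the old DB and the broadcast flag
profile::puppetmaster::common::puppetdb_hosts:
  - $fqdn_new_puppetdb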
I'm not an admin in the project but happy for you to add me if you need any help. It's quite possible I have missed a step, as I'm largely going from memory.
Mentioned in SAL (#wikimedia-releng) [2020-01-21T22:10:51Z] <Krenair> deployment-prep added jbond as admin, T243226
- made deployment-puppetdb03, did the usual puppet client setup
- Turns out we use role::puppetmaster::standalone which it seems doesn't support command_broadcast
done anyway :)
Change 566380 had a related patch set uploaded (by Alex Monk; owner: Alex Monk):
[operations/puppet@production] role::puppetmaster::standalone: Add support for multiple PuppetDB hosts
Well, my new puppetdb instance seems to not be working very well yet:
root@deployment-puppetmaster03:/var/lib/git/operations/puppet(production u+14)# puppet agent -tv
Info: Using configured environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: [503 Service Unavailable] PuppetDB is currently down. Try again later.
Warning: Not using cache on failed catalog
Error: Could not retrieve catalog; skipping run
root@deployment-puppetmaster03:/var/lib/git/operations/puppet(production u+14)#
The change applied earlier on the puppetmaster was:
Notice: /Stage[main]/Puppetmaster::Puppetdb::Client/File[/etc/puppet/puppetdb.conf]/content:
--- /etc/puppet/puppetdb.conf	2018-05-27 03:14:06.246340384 +0000
+++ /tmp/puppet-file20200121-15085-n4buga	2020-01-21 23:11:14.919611883 +0000
@@ -1,3 +1,4 @@
 [main]
-server_urls = https://deployment-puppetdb02.deployment-prep.eqiad.wmflabs:443
+server_urls = https://deployment-puppetdb02.deployment-prep.eqiad.wmflabs:443,https://deployment-puppetdb03.deployment-prep.eqiad.wmflabs:443
+command_broadcast = true
 soft_write_failure = false
Am going to remove the new host from the puppetdb_hosts list in hieradata on the puppetmaster and manually revert this change to get things working again.
rCLIPec5f9c1645d2eadc8db259755bf163c69e0409d6
Reloaded apache2 on deployment-puppetmaster03
Change 566500 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] role::puppetmaster::standalone: support multiple puppetdb servers
Hi Alex,
I noticed that the postgres database was missing the puppetdb user; however, a simple puppet run on the puppetmaster fixed the problem. I did restart the postgres server first, as I had seen issues in the past with it only binding to localhost, so this could have helped. Anyway, after the puppet run puppetdb started correctly, and applying a change similar to the one you highlighted above seems to work.
Puppet run output:

root@deployment-puppetdb03:~# sudo puppet agent -t
Warning: Downgrading to PSON for future requests
Info: Using configured environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
Info: Caching catalog for deployment-puppetdb03.deployment-prep.eqiad.wmflabs
Info: Applying configuration version '(2460e5f31d) root - role::puppetmaster::standalone: Add support for multiple PuppetDB hosts'
Notice: The LDAP client stack for this host is: sssd/sudo
Notice: /Stage[main]/Profile::Ldap::Client::Labs/Notify[LDAP client stack]/message: defined 'message' as 'The LDAP client stack for this host is: sssd/sudo'
Notice: /Stage[main]/Puppetmaster::Puppetdb::Database/Postgresql::User[puppetdb@localhost]/Exec[create_user-puppetdb@localhost]/returns: executed successfully
Info: /Stage[main]/Puppetmaster::Puppetdb::Database/Postgresql::User[puppetdb@localhost]/Exec[create_user-puppetdb@localhost]: Scheduling refresh of Exec[pass_set-puppetdb@localhost]
Notice: /Stage[main]/Puppetmaster::Puppetdb::Database/Postgresql::User[puppetdb@localhost]/Exec[pass_set-puppetdb@localhost]/returns: executed successfully
Notice: /Stage[main]/Puppetmaster::Puppetdb::Database/Postgresql::User[puppetdb@localhost]/Exec[pass_set-puppetdb@localhost]: Triggered 'refresh' from 1 event
Notice: /Stage[main]/Puppetmaster::Puppetdb::Database/Postgresql::User[prometheus@localhost]/Exec[create_user-prometheus@localhost]/returns: executed successfully
Notice: /Stage[main]/Puppetmaster::Puppetdb::Database/Postgresql::Db[puppetdb]/Exec[create_postgres_db_puppetdb]/returns: executed successfully
Notice: /Stage[main]/Puppetmaster::Puppetdb::Database/Exec[create_tgrm_extension]/returns: executed successfully
Notice: Applied catalog in 6.41 seconds
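For reference, a quick way to check the localhost-binding issue mentioned above (a sketch; 5432 is the default postgres port and the service name assumes Debian's packaging):

# show which addresses postgres is listening on
sudo ss -tlnp | grep 5432
# restart postgres, then let a puppet run recreate any missing roles
sudo systemctl restart postgresql
sudo puppet agent -t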
Change 566500 merged by Jbond:
[operations/puppet@production] role::puppetmaster::standalone: support multiple puppetdb servers
Hi Alex,
I have pushed the change to allow multiple puppetdbs with command_broadcast and updated the node metadata in Horizon (I removed some redundant keys as well). You can check how populated the new DB is with the following command:
- check how many nodes have checked in; once the output of this matches on both nodes, it should be safe to make the switch

curl -X POST http://localhost:8080/pdb/query/v4/nodes | jq .[].certname
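To compare totals rather than eyeballing the lists, counting the entries works too (a sketch against the same endpoint):

# number of certnames registered in this puppetdb
curl -s -X POST http://localhost:8080/pdb/query/v4/nodes | jq length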
Simply running puppet on the puppetmaster/puppetdb hosts was one of the things that was broken... I bet restarting postgres was what actually fixed things.
alex@alex-laptop:~$ ssh deployment-puppetdb02 curl -s -X POST http://localhost:8080/pdb/query/v4/nodes | jq -r .[].certname | sort -d > puppetdb02
alex@alex-laptop:~$ ssh deployment-puppetdb03 curl -s -X POST http://localhost:8080/pdb/query/v4/nodes | jq -r .[].certname | sort -d > puppetdb03
alex@alex-laptop:~$ diff puppetdb0{2,3}
26d25
< deployment-eventgate-1.deployment-prep.eqiad.wmflabs
43,45d41
< deployment-logstash03.deployment-prep.eqiad.wmflabs
< deployment-logstash2.deployment-prep.eqiad.wmflabs
< deployment-maps04.deployment-prep.eqiad.wmflabs
48d43
< deployment-mediawiki-07.deployment-prep.eqiad.wmflabs
50d44
< deployment-mediawiki-jhuneidi.deployment-prep.eqiad.wmflabs
64d57
< deployment-pdfrender02.deployment-prep.eqiad.wmflabs
69d61
< deployment-puppetmaster03.deployment-prep.eqiad.wmflabs
75d66
< deployment-schema-1.deployment-prep.eqiad.wmflabs
78,79d68
< deployment-sessionstore01.deployment-prep.eqiad.wmflabs
< deployment-sessionstore02.deployment-prep.eqiad.wmflabs
81d69
< deployment-snapshot01.deployment-prep.eqiad.wmflabs
deployment-eventgate-1.deployment-prep.eqiad.wmflabs - no longer exists, someone probably forgot to deactivate node when deleting
deployment-logstash03.deployment-prep.eqiad.wmflabs - exists but out of disk space so no puppet runs
deployment-logstash2.deployment-prep.eqiad.wmflabs - exists but out of disk space so no puppet runs (not planning to fix this instance anymore, I made 03 to replace it)
deployment-maps04.deployment-prep.eqiad.wmflabs - no longer exists, someone probably forgot to deactivate node when deleting
deployment-mediawiki-07.deployment-prep.eqiad.wmflabs - Notice: Skipping run of Puppet configuration client; administratively disabled (Reason: 'effie'); The last Puppet run was at Mon Jan 20 10:56:33 UTC 2020 (3515 minutes ago). Puppet is disabled. effie @jijiki
deployment-mediawiki-jhuneidi.deployment-prep.eqiad.wmflabs - no longer exists, someone probably forgot to deactivate node when deleting
deployment-pdfrender02.deployment-prep.eqiad.wmflabs - no longer exists, someone probably forgot to deactivate node when deleting
deployment-puppetmaster03.deployment-prep.eqiad.wmflabs - Notice: Skipping run of Puppet configuration client; administratively disabled (Reason: 'test multiple puppetdb backends - jbond');
deployment-schema-1.deployment-prep.eqiad.wmflabs - no longer exists, someone probably forgot to deactivate node when deleting
deployment-sessionstore01.deployment-prep.eqiad.wmflabs - no longer exists, someone probably forgot to deactivate node when deleting
deployment-sessionstore02.deployment-prep.eqiad.wmflabs - no longer exists, someone probably forgot to deactivate node when deleting
deployment-snapshot01.deployment-prep.eqiad.wmflabs - Resource type not found: Wmflib::Service at /etc/puppet/modules/wmflib/functions/service/fetch.pp:4:51
New diff after deactivating deceased things and enabling puppet on the puppetmaster:
alex@alex-laptop:~$ diff puppetdb0{2,3}
42,43d41
< deployment-logstash03.deployment-prep.eqiad.wmflabs
< deployment-logstash2.deployment-prep.eqiad.wmflabs
46d43
< deployment-mediawiki-07.deployment-prep.eqiad.wmflabs
74d70
< deployment-snapshot01.deployment-prep.eqiad.wmflabs
Change 566380 abandoned by Alex Monk:
role::puppetmaster::standalone: Add support for multiple PuppetDB hosts
Reason:
Redone in I3a2cb763, for some reason.
logstash03 seems to be making good use of the 20G root volume to store... logs :(
root@deployment-logstash03:~# du -hsx /var/log/* | grep -E '^([0-9]*M|[0-9\.]*G)'
5.8G	/var/log/daemon.log
230M	/var/log/daemon.log.1
211M	/var/log/elasticsearch
1.9G	/var/log/kafka
2.2G	/var/log/logstash
3.4G	/var/log/syslog
1.7G	/var/log/syslog.1
37M	/var/log/syslog.4.gz
18M	/var/log/syslog.5.gz
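A sketch of one way to claw back space on a host like this (destructive, in that the truncated logs are lost):

# truncate the largest live logs in place; daemons keep their open handles
sudo truncate -s 0 /var/log/daemon.log /var/log/syslog
# drop the already-rotated copies
sudo rm /var/log/daemon.log.1 /var/log/syslog.1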
As for snapshot01 - that one is erroring about Wmflib::Service. Is this something that'll get resolved by the puppet upgrade? Should we proceed without that one?
> As for snapshot01 - that one is erroring about Wmflib::Service. Is this something that'll get resolved by the puppet upgrade? Should we proceed without that one?
Yes, I believe this is a bug in puppet 4.8, so this can be ignored.
I'm not sure what did it but logstash2 has shown up. I made a little space on logstash03 and that's on the list now too. So that just leaves us with mediawiki-07 - need to find out why puppet is disabled and see if we can get it running again, if only temporarily.
@jijiki: Hi, deployment-mediawiki-07.deployment-prep.eqiad.wmflabs has had puppet disabled since approx Mon Jan 20 10:56:33 UTC 2020 (12417 minutes ago) with the comment 'effie' - assuming that's you, do you still need that? I would like to enable puppet so we can complete some puppet infrastructure changes in the project.
Swapped the order of the list in hieradata for deployment-puppetmaster03. Once I'm happy it's working I'll remove the old puppetdb from the list and disable command_broadcast.
On the subject of deployment-mediawiki-07: I plan to re-enable puppet on Wednesday 5th unless we can find out why it was disabled.
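For anyone following along, the disable reason can be inspected (and puppet re-enabled) roughly like this; the lock file path assumes Puppet's default statedir on these hosts:

# show who disabled the agent and why
cat /var/lib/puppet/state/agent_disabled.lock
# re-enable and kick off a test run
sudo puppet agent --enable
sudo puppet agent -tv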
@Krenair I have been experimenting with this host; feel free to enable puppet if needed, sorry for the inconvenience :)
No problem, just trying to avoid overwriting others' work. Am also aware (including having done it myself a few times) that sometimes people just forget to re-enable it when done. When I do it I'll post a copy of the diff to avoid losing anything.
@Krenair thank you, I already have a backup of the configuration, so there is no need to let this disrupt you any further
@jbond: Okay so I think our puppetdb stuff is migrated now. I just removed the old server from the list, removed command broadcasting, and puppet still seems to be working. Next step is to upgrade our puppetmaster?
Just done the puppetmaster upgrade and found puppet no longer runs on the puppetmaster:

Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: undefined method `key_attributes' for nil:NilClass
Is it possibly due to our lack of an updated puppetdb-termini package?
root@deployment-puppetmaster03:~# apt-cache policy puppetdb-termini
puppetdb-termini:
  Installed: 4.4.0-1~wmf2
  Candidate: 4.4.0-1~wmf2
  Version table:
 *** 4.4.0-1~wmf2 1001
       1001 http://apt.wikimedia.org/wikimedia stretch-wikimedia/component/puppetdb4 amd64 Packages
        100 /var/lib/dpkg/status
     4.4.0-1~wmf1 1001
       1001 http://apt.wikimedia.org/wikimedia stretch-wikimedia/main amd64 Packages
@Krenair I think the package needed is puppet-terminus-puppetdb, which is provided by the puppetdb-6.2.0 source package. I have looked at building this, but it seems to have a lot of Clojure-based dependencies, making it difficult to build for stretch. I think the simplest solution would be to build a new buster puppetmaster04 and migrate to that. I'm not too sure on that infrastructure so wouldn't want to make the changes myself, however I'm happy to help; feel free to ping me on IRC (jbond42).
Last night I tried moving cache-text05 to use the new puppetmaster. It doesn't seem to work just yet, as for some reason it (the puppetmaster?) is attempting to connect to puppet:8140 instead of our configured puppetdb host. I think in labs that name always points to the central puppetmaster, which we don't use in deployment-prep.
I have performed the following actions:
- copied the CA from deployment-puppetmaster03 to deployment-puppetmaster04
- on deployment-puppetmaster04, renamed deployment-puppetmaster03 to puppetca under /var/lib/puppet/server/ssl/
- regenerated the deployment-cache-text05 certificate
With this setup the puppetmasters can have their own client/agent certificate under /var/lib/puppet/ssl/ and use the CA certificates under /var/lib/puppet/server/ssl/certs/ for signing. The CA certificate will continue to have issuer=CN = Puppet CA: deployment-puppetmaster03.deployment-prep.eqiad.wmflabs, however this is not a big problem as it will just be used for signing. This is similar to production, which still has issuer=CN = Puppet CA: palladium.eqiad.wmnet. With this configuration you should be able to just point hosts at the new puppetmaster, i.e. add the following to /etc/puppet/puppet.conf, and things should just work with no need to delete and re-sign SSL certificates:
[agent]
server = deployment-puppetmaster04.deployment-prep.eqiad.wmflabs
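A quick way to confirm the issuer situation described above (a sketch; the CA filename under server/ssl/certs/ is an assumption):

# issuer of the signing CA on the master
sudo openssl x509 -in /var/lib/puppet/server/ssl/certs/ca.pem -noout -issuer
# issuer of this host's own agent certificate
sudo openssl x509 -in /var/lib/puppet/ssl/certs/$(hostname -f).pem -noout -issuer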
I tested a puppet run on deployment-cache-text05; however, that is now failing with the following unrelated error:
Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Evaluation Error: Error while evaluating a Function Call, Could not find template 'varnish/misc-common.inc.vcl.erb' (file: /etc/puppet/modules/varnish/manifests/wikimedia_vcl.pp, line: 31, column: 28) (file: /etc/puppet/modules/varnish/manifests/instance.pp, line: 38) on node deployment-cache-text05.deployment-prep.eqiad.wmflabs
This seems to be caused by the deployment-puppetmasters having a different checkout of the puppet repo, missing a commit which removes varnish/misc-common.inc.vcl.erb.
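A sketch of how one might compare the checkouts (repo path as used earlier in this task; the template path follows the standard module layout):

cd /var/lib/git/operations/puppet
# current HEAD of this master's checkout
git log --oneline -1
# recent commits touching the missing template
git log --oneline -3 -- modules/varnish/templates/misc-common.inc.vcl.erb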
That is what I was hoping to avoid, but thanks for getting it working.
Nah it's got the commit, it's just another hieradata difference. I'm behind on the whack-a-mole.
Got puppet running on -cache-text05, whole beta cluster broke, fixed acme-chief and ATS, going to sleep.
puppetdb on deployment-puppetdb03 was killed by kernel OOM at Feb 7 09:50:29, per syslog. I just now ran systemctl start puppetdb on that host, to fix puppet issues in beta.
Not sure if the OOM is a known issue related to this task. As a follow-up, we might want to add a Restart=... line to the systemd unit file to make this self-healing next time this happens.
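One way to do that without touching the packaged unit is a systemd drop-in (a sketch; the exact Restart values are assumptions about what we'd want):

sudo systemctl edit puppetdb
# then add in the editor:
#   [Service]
#   Restart=on-failure
#   RestartSec=60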
Thanks dpifke. I've seen it do that before too. We probably need to tune it a bit - I recall puppetdb hosts in particular have a hiera setting relating to memory usage, though this one should be similar in size to puppetdb02. They're m1.small so probably not going to get the same resources ops would give the prod puppetdb VMs.
On a separate note, I have moved all the deployment-prep hosts I knew were failing due to this error over to the new puppetmaster, and will continue to move the rest.
I've moved the remaining instances over to using the new puppetmaster. Puppet does appear to be struggling on deployment-mwmaint01 and deployment-puppetmaster03 though - not sure why.
So before closing this task and removing puppetmaster03, someone should address:
- puppetdb03 memory usage
- puppetmaster04 disk usage
- puppet run time on mwmaint01/puppetmaster03
Turning off debug logging on puppetmaster04 (I set logdest = /dev/null in /etc/puppet/puppet.conf) has helped with the disk usage and sluggishness issues. But sadly puppet runs are currently failing with:
deployment-puppetmaster04:~# puppet agent -t
Info: Using configured environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Retrieving locales
Info: Loading facts
Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Function Call, DNS lookup failed for Resolv::DNS::Resource::IN::A (file: /etc/puppet/manifests/realm.pp, line: 64, column: 9) on node deployment-puppetmaster04.deployment-prep.eqiad.wmflabs
Warning: Not using cache on failed catalog
Error: Could not retrieve catalog; skipping run
That last issue (the resolution failure) was a side-effect of work I was doing for T229441. That issue is resolved, but now the failure is:
Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Function Call, Failed to parse template ssh/known_hosts.erb:
Filepath: /var/lib/git/operations/puppet/modules/puppetdbquery/lib/puppetdb/connection.rb
Line: 68
Detail: PuppetDB query error: [500] Server Error, query: ["and",["=","type","Sshkey"],["~","title",".*"],["=","exported",true]]
(file: /etc/puppet/modules/ssh/manifests/client.pp, line: 8, column: 24) on node deployment-puppetmaster04.deployment-prep.eqiad.wmflabs
No idea if that's related to my work or not.
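For what it's worth, the failing query can be replayed straight against puppetdb to separate it from the puppetmaster (a sketch; the query body is copied from the error above, the endpoint style from earlier in this task):

curl -s -X POST http://localhost:8080/pdb/query/v4/resources \
  -H 'Content-Type: application/json' \
  -d '{"query": ["and", ["=", "type", "Sshkey"], ["=", "exported", true]]}' | jq length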
So this error was the memory usage problem on puppetdb03 I mentioned above - puppetdb won't work without postgresql, which couldn't start because it wants more memory than the system has. I changed role::puppetmaster::puppetdb::shared_buffers from 768MB to 600MB.
After doing this we found everything was very slow when the puppetmaster was running on cloudvirt1013 but OK when moved to cloudvirt1017. Either the host was under particular load (it shouldn't have been; there wasn't a lot else running there) or this is another case of T223971.
Thanks to Andrew it seems to be running well now. I've copied across /var/lib/puppet/volatile to sort out a lot of swift/GeoIP failures.
Also copied /etc/conftool-state/mediawiki.yaml to sort out mediawiki::state for mwmaint01
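For the record, those two copies amount to something like this (a sketch; the rsync flags and pull direction are assumptions):

# run from deployment-puppetmaster04
sudo rsync -a deployment-puppetmaster03.deployment-prep.eqiad.wmflabs:/var/lib/puppet/volatile/ /var/lib/puppet/volatile/
sudo rsync -a deployment-puppetmaster03.deployment-prep.eqiad.wmflabs:/etc/conftool-state/mediawiki.yaml /etc/conftool-state/mediawiki.yaml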
I've also taken /root and /home and put them at deployment-puppetmaster04:/root/deployment-puppetmaster03-{homes,root}.tar.gz in case anyone needs anything
Mentioned in SAL (#wikimedia-releng) [2020-02-15T20:25:06Z] <Krenair> T243226 Shut off deployment-puppetdb02 and deployment-puppetmaster03 - will leave for a week before deletion