
Puppet agent failing on deploy-1004.devtools.eqiad1.wikimedia.cloud
Closed, Resolved (Public)

Description

$ sudo run-puppet-agent
Info: Using configured environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Retrieving locales
Info: Loading facts
Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Evaluation Error: Unknown function: 'puppetdb_query'. (file: /etc/puppet/modules/wmflib/functions/resource/hosts.pp, line: 33, column: 5) on node deploy-1004.devtools.eqiad1.wikimedia.cloud
Warning: Not using cache on failed catalog
Error: Could not retrieve catalog; skipping run

puppetdb_query is supposed to be built-in so this is confusing.

The instance Puppet configuration has a single class applied: role::deployment_server

Event Timeline

Thanks for filing it. The instance deploy-1004.devtools.eqiad1.wikimedia.cloud is using the Puppet server puppetmaster.cloudinfra.wmflabs.org.

From the puppet log files, it last worked on Apr 28 11:33 and failed on Apr 28 12:03. From the review data in operations/puppet:

git fetch origin refs/notes/review:refs/notes/review
git log --notes=review

Then we can search the commits against the review field Submitted-at, which gives me a few commits. One of them, by @jbond and @Volans, mentions a puppetdb query: e127f7988894f4aa41493cde6e7069cf19d5488a / https://gerrit.wikimedia.org/r/c/operations/puppet/+/787436

commit e127f7988894f4aa41493cde6e7069cf19d5488a
Author: John Bond <jbond@wikimedia.org>
Date:   Thu Apr 28 11:50:57 2022 +0200

    P:cumin::master: Add documentation and fix minor lint issue
    
    This also moves to using wmflib::role_hosts instead of the puppetdbquery
    functions
    
    Hosts: P:cumin::master
    Bug: T306830
    Change-Id: I510b70521e23f94992c9da02b1639dd1eedb51a9

Notes (review):
    Verified+2: jenkins-bot
    Code-Review+1: Volans <rcoccioli@wikimedia.org>
    Code-Review+2: Jbond <jbond@wikimedia.org>
    Submitted-by: Jbond <jbond@wikimedia.org>
    Submitted-at: Thu, 28 Apr 2022 10:20:11 +0000
    Reviewed-on: https://gerrit.wikimedia.org/r/c/operations/puppet/+/787436
    Project: operations/puppet
    Branch: refs/heads/production
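
The Submitted-at search can be narrowed to the window between the last good run and the first failed one. A minimal sketch, assuming the agent log times and the Submitted-at notes are in the same timezone and fall on the same day; the helper name submitted_between is made up for illustration, and times are compared as plain HH:MM:SS strings:

```shell
# Read `git log --notes=review` output on stdin and print "<sha> <time>"
# for every commit whose Submitted-at note falls inside the window.
#   $1 = date as written in the notes, e.g. "28 Apr 2022"
#   $2 = start time, $3 = end time (HH:MM:SS, inclusive)
# submitted_between is a hypothetical helper, not an existing tool.
submitted_between() {
  awk -v day="$1" -v from="$2" -v to="$3" '
    /^commit / { sha = $2 }
    $1 == "Submitted-at:" {
      # Fields: Submitted-at: Thu, 28 Apr 2022 11:39:38 +0000
      d = $3 " " $4 " " $5
      t = $6
      if (d == day && t >= from && t <= to) print sha, t
    }
  '
}
```

Usage, from a checkout of operations/puppet with the review notes fetched:

git log --notes=review | submitted_between "28 Apr 2022" 11:33:00 12:03:00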

I don't see how it is related :\

I guess we would need some trace from the cloud service Puppet master? :-\

That is from another commit submitted slightly before: https://gerrit.wikimedia.org/r/c/operations/puppet/+/771441 (Add scap targets as a dsh group). Since the instance has the deployment server role, the new code is called to generate a scap_targets dsh group using:

(wmflib::class_hosts('mediawiki::scap') + wmflib::resource_hosts('scap::target')).sort.unique

These wmflib functions end up calling puppetdb_query(), which is not available on the cloudinfra Puppet master.
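
To see which wmflib functions reach puppetdb_query(), one option is to search a checkout of operations/puppet. A minimal sketch; the helper name find_puppetdb_callers is made up for illustration, and it matches plain text only, so a mention inside a comment also counts:

```shell
# List the .pp files under a directory that mention puppetdb_query.
# find_puppetdb_callers is a hypothetical helper name, not part of any repo.
find_puppetdb_callers() {
  grep -rlF --include='*.pp' 'puppetdb_query' "$1" | sort
}
```

For example, find_puppetdb_callers modules/wmflib/functions should list functions/resource/hosts.pp, the file named in the agent error.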

commit 2b6e3f0c988591b70d72ab809cde637cf3a95b24
Author: John Bond <jbond@wikimedia.org>
Date:   Wed Mar 16 21:01:00 2022 +0100

    P:scap::dsh: Add scap targets as a dsh group
    
    Hosts: P:scap::dsh
    Bug: T303559
    Change-Id: I264ece635aaf59790d6381c26dde7f29f40a107a

Notes (review):
    Verified+2: jenkins-bot
    Verified+1: Majavah <hi@taavi.wtf>
    Code-Review+2: Jbond <jbond@wikimedia.org>
    Code-Review+1: Majavah <hi@taavi.wtf>
    Code-Review+1: Ahmon Dancy <adancy@wikimedia.org>
    Submitted-by: Jbond <jbond@wikimedia.org>
    Submitted-at: Thu, 28 Apr 2022 11:39:38 +0000
    Reviewed-on: https://gerrit.wikimedia.org/r/c/operations/puppet/+/771441
    Project: operations/puppet
    Branch: refs/heads/production

We might want to add a PuppetDB to the devtools project :-\

Documentation to create a PuppetDB is https://wikitech.wikimedia.org/wiki/Help:Standalone_puppetmaster/PuppetDB . We need a quota increase for the devtools project in order to be able to create the new instance: T311302

Change 808236 had a related patch set uploaded (by Jbond; author: jbond):

[operations/puppet@production] wmflib: Add check for storconfig to puppetdb functions

https://gerrit.wikimedia.org/r/808236

@hashar this is correct: since https://gerrit.wikimedia.org/r/c/operations/puppet/+/771441, the deployment server relies on puppetdb. I tried to create a quick patch to work around this, however 1) I'm not confident it works as planned and 2) I'm not sure it's the best thing to do. Instead, I think the best thing to do is create a standalone puppetmaster and puppetdb (or use pontoon) to more accurately replicate the production environment. From your last comment I'm guessing you came to a similar conclusion; let me know if you need any help.

Change 808236 abandoned by Jbond:

[operations/puppet@production] wmflib: Add check for storconfig to puppetdb functions

Reason:

https://gerrit.wikimedia.org/r/808236

Thanks @jbond ! I went to configure a puppet db server for the devtools project :]

Puppet fails on puppet-db-1001 with "certificate revoked", which I found out comes from the Puppet master trying to reach back to https://puppet-db-1001.devtools.eqiad1.wikimedia.cloud:443//pdb/cmd/v1 and that cert is somehow invalid.

Because we use a local puppet master I had to remove the puppet-db-1001 certificate a few times, and on the puppet master I removed it with puppet cert clean <fqdn of puppetdb>. That apparently caused the cert to be added to a revocation list, which seems to be /var/lib/puppet/server/ssl/ca/inventory.txt. I have manually deleted the old entries.
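
Before deleting anything by hand, it can help to list every inventory entry issued for the host. A minimal sketch, assuming each line of inventory.txt has the form "serial not_before not_after /CN=<fqdn>" with "#" header lines; the helper name inventory_entries_for is made up for illustration. As far as I know, revocations themselves are recorded in the CA's CRL file rather than in the inventory, so this only shows which certificates were issued:

```shell
# Print every inventory line whose subject (last field) is /CN=<fqdn>.
# Reads inventory.txt content on stdin and skips '#' header lines.
# inventory_entries_for is a hypothetical helper; the line format is assumed.
inventory_entries_for() {
  awk -v cn="/CN=$1" '!/^#/ && $NF == cn'
}
```

Usage on the puppet master:

sudo cat /var/lib/puppet/server/ssl/ca/inventory.txt | inventory_entries_for puppet-db-1001.devtools.eqiad1.wikimedia.cloud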

Running puppet on puppet-db-1001, I get a warning:

Warning: Error 500 on SERVER:
Server Error: Could not retrieve facts for puppet-db-1001.devtools.eqiad1.wikimedia.cloud:
Failed to find facts from PuppetDB at puppet:8140: SSL_connect returned=1 errno=0 state=error: certificate verify failed (certificate revoked):
[certificate revoked for /CN=puppet-db-1001.devtools.eqiad1.wikimedia.cloud]

Looks like the puppet agent attempts to retrieve facts from puppet:8140 which is the cloudinfra puppet master. So some configuration is missing.

hashar assigned this task to jbond.

@jbond offered help and rebuilt the whole mess I had created. It is fixed and puppet now works on deploy-1004. Thank you!