
Fix the Alert hosts Puppet catalogue to be compatible with Puppet 7
Closed, ResolvedPublic

Description

The Alert hosts currently use Puppet 5.
After the Bookworm upgrade we need to upgrade the catalogue to be compatible with Puppet 7.

Event Timeline

I had a look, and what is failing is the naggen2 calls on lines 12 and 21 of the icinga::naggen class. These were moved to the Puppet servers in https://gerrit.wikimedia.org/r/c/operations/puppet/+/991361, but don't appear to work fully yet. If I run the command manually on puppetserver1001, the naggen command itself fails to query puppetdb:

jmm@puppetserver1001:~$ /usr/local/bin/naggen2 --type hosts
jmm@puppetserver1001:~$ echo $?
30
jmm@puppetserver1001:~$ /usr/local/bin/naggen2 --debug --type hosts
naggen2: INFO - Generating output for resource hosts
naggen2: DEBUG - Loading configfile /etc/puppet/puppetdb.conf
urllib3.connectionpool: DEBUG - Starting new HTTPS connection (1): puppetdb1003.eqiad.wmnet:443
urllib3.connectionpool: DEBUG - https://puppetdb1003.eqiad.wmnet:443 "GET /pdb/query/v4/resources/Nagios_host?query=%5B%22and%22%2C+++++++++++++++++++++++++%5B%22%3D%22%2C+%5B%22parameter%22%2C+%22ensure%22%5D%2C+%22present%22%5D%2C+++++++++++++++++++++++++%5B%22%3D%22%2C+%22exported%22%2C+true%5D+++++++++++++++++++++%5D HTTP/1.1" 400 237
naggen2: ERROR - Could not generate output for resource hosts
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/requests/models.py", line 971, in json
    return complexjson.loads(self.text, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/naggen2", line 153, in render
    for entity in self._query(definition['puppet_resource_type']):
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/bin/naggen2", line 190, in _query
    return resources_raw.json()
           ^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/requests/models.py", line 975, in json
    raise RequestsJSONDecodeError(e.msg, e.doc, e.pos)
requests.exceptions.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
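
The second traceback is raised because .json() is called on a response that is not JSON (the HTTP 400 above), which hides the real error. A minimal sketch of checking the status before decoding, so the server's message is surfaced instead; the function and exception names are hypothetical and this is not the actual naggen2 code:

```python
import json


class PuppetDBQueryError(Exception):
    """Hypothetical error type for a rejected or malformed PuppetDB reply."""


def parse_puppetdb_response(status_code, body):
    """Return the decoded JSON payload, or raise a readable error.

    status_code and body are the HTTP status and the response text,
    however the caller obtained them (e.g. from requests).
    """
    if status_code != 200:
        # Error pages from the proxy are HTML, so collapse whitespace and
        # keep a short excerpt rather than trying to decode them as JSON.
        excerpt = " ".join(body.split())[:200]
        raise PuppetDBQueryError(
            f"PuppetDB query failed with HTTP {status_code}: {excerpt}"
        )
    try:
        return json.loads(body)
    except json.JSONDecodeError as exc:
        raise PuppetDBQueryError(f"PuppetDB returned non-JSON data: {exc}") from exc
```

With a check like this, the error output would carry the status code and the start of the error page rather than "Expecting value: line 1 column 1 (char 0)".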

Is the 400 because of a missing cert? From a cumin host I get:

$ curl -G "https://puppetdb1003.eqiad.wmnet/pdb/query/v4/resources/Nagios_host" --data-urlencode 'query=["and", ["=", ["parameter", "ensure"], "present"], ["=", "exported", true]]'
<html>
<head><title>400 No required SSL certificate was sent</title></head>
<body>
<center><h1>400 Bad Request</h1></center>
<center>No required SSL certificate was sent</center>
<hr><center>nginx/1.22.1</center>
</body>
</html>
cmooney triaged this task as Medium priority. Feb 28 2024, 10:56 AM

I'm wondering how this worked in the first place. _Is_ naggen currently actually working, and this seems unrelated to Puppet 7? naggen.py simply uses requests, so it is doing the equivalent of Riccardo's curl command from above, and I get the same error when I run that from e.g. puppetserver1002. That is expected, since the nginx site for puppetdb has ssl_verify_client enabled.

For other tooling which queries puppetdb we use pypuppetdb and the microservice typically.

One other option would be to move the generators to the puppetdb hosts and have them simply query puppetdb on localhost.

But is this task still valid? The alert hosts were migrated to bookworm this week and puppet is running fine there.

andrea.denisse changed the task status from Open to In Progress. Mar 26 2024, 4:37 PM

Change #1003527 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] alert: Update hiera entries for alert2001 to use Puppet 7

https://gerrit.wikimedia.org/r/1003527

Change #1003531 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] alert: Update hiera entries for alert1001 to use Puppet 7

https://gerrit.wikimedia.org/r/1003531

Change #1003527 merged by Andrea Denisse:

[operations/puppet@production] alert: Update hiera entries for alert2001 to use Puppet 7

https://gerrit.wikimedia.org/r/1003527

Hi Infrastructure Foundations Team,

We're currently facing a challenge with Puppet on alert2001. It has come to our attention that Puppet 7 hosts require an SSL certificate to communicate with the Puppet master and access Puppet's database. Despite our attempts to troubleshoot, we're at a standstill on how to proceed effectively.

Could you please provide your insights on the best way to address the SSL certificate requirement? Furthermore, any direct support you could offer in resolving this issue would be greatly appreciated.

Relevant output:

denisse@cumin2002:~$ sudo cumin 'alert2*' 'run-puppet-agent'
1 hosts will be targeted:
alert2001.wikimedia.org
OK to proceed on 1 hosts? Enter the number of affected hosts to confirm or "q" to quit: 1
----- OUTPUT of 'run-puppet-agent' -----
Info: Using environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Function Call, Failed to execute generator /usr/local/bin/naggen2: Execution of '/usr/local/bin/naggen2 --type hosts' returned 30:  (file: /srv/puppet_code/environments/production/modules/icinga/manifests/naggen.pp, line: 12, column: 18) on node alert2001.wikimedia.org
Warning: Not using cache on failed catalog
Error: Could not retrieve catalog; skipping run
================
PASS |          |   0% (0/1) [00:17<?, ?hosts/s]
FAIL |██████████| 100% (1/1) [00:17<00:00, 17.77s/hosts]
100.0% (1/1) of nodes failed to execute command 'run-puppet-agent': alert2001.wikimedia.org
0.0% (0/1) success ratio (< 100.0% threshold) for command: 'run-puppet-agent'. Aborting.
0.0% (0/1) success ratio (< 100.0% threshold) of nodes successfully executed all commands. Aborting.
andrea.denisse changed the task status from In Progress to Stalled. Mar 27 2024, 4:52 PM

The premise seems to mix different things. PuppetDB is a completely separate service from the PuppetMaster/PuppetServer ones and runs on its own hosts. Are you saying that naggen fails to connect to PuppetDB?

What kind of queries does it do? Could it use the public proxy that is exposed in the intranet? If so, connect to puppetdb-api.discovery.wmnet:8090

From what I understand, the Puppet run is failing on the alert2001 host because the naggen2 command tries to query puppetdb and fails due to a missing (or maybe incorrect) SSL certificate, which Puppet 7 hosts require to contact the Puppetmaster and query its DB.

Apologies if the premise is not clear, I'm still trying to understand the root cause of the issue.

Yes, but which endpoint is it trying to connect to? Please try to use puppetdb-api.discovery.wmnet:8090 and let us know if that works or not (that's a proxy that allows only some queries and not others, so it might need tweaking based on which queries naggen does).

I've been working on debugging this too, here's my understanding:

  • naggen2 is used to generate icinga configuration from nagios_host and nagios_service exported resources; it runs as a generator on the puppet master/server
  • naggen2 reads /etc/puppet/puppetdb.conf to discover the puppetdb url
  • on puppetmaster naggen2 works because the puppetdb url points to port 8443, where SSL client cert validation is optional
  • this is not the case on puppetserver, thus naggen2 can't query puppetdb
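
To check which endpoint a given server would use, the URL discovery step can be sketched as below. This assumes puppetdb.conf follows the usual INI layout with a [main] section and a comma-separated server_urls key; the helper names are hypothetical and this has not been checked against what naggen2 actually does:

```python
import configparser
import urllib.parse


def puppetdb_urls(conf_text):
    """Return the PuppetDB base URLs listed in puppetdb.conf contents.

    Assumes an INI file with a [main] section and a comma-separated
    server_urls key.
    """
    # Interpolation is disabled so a literal '%' in a URL is not misread.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    raw = parser.get("main", "server_urls")
    return [url.strip() for url in raw.split(",") if url.strip()]


def port_of(url):
    """Return the explicit port of a URL, or None if it has none."""
    return urllib.parse.urlsplit(url).port
```

Reading /etc/puppet/puppetdb.conf on a puppetmaster and on a puppetserver and comparing port_of() for each URL would show whether the configured endpoint is the one that requires a client certificate.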

It queries for all exported Nagios_host and Nagios_service resources. I tried adapting it to use puppetdb-api, though AFAICS the microservice only returns certname? naggen2 also looks for other resource parameters like contact_groups, title, etc.

Sorry, ignore my previous comments, there was some misunderstanding:

  • naggen runs on the puppetmaster/puppetserver server side, not on the agent side
  • it needs all the info from the resources and hence can't use the proxy
  • it should just work using the puppet client certificates (you might need to expose them, not sure)

Something like this works when run as root:

curl --cert $(puppet config print hostcert) --key $(puppet config print hostprivkey) --cacert $(puppet config print cacert) https://puppetdb1003.eqiad.wmnet/pdb/query/v4/....

Sorry for the confusion
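
For reference, a rough Python equivalent of the curl command above, as a sketch only. It uses the standard library rather than requests to stay self-contained, the function names are hypothetical, and the three file paths are assumed to be whatever `puppet config print hostcert`, `hostprivkey` and `cacert` return on the host. It is not the change that was merged.

```python
import json
import ssl
import urllib.parse
import urllib.request


def build_query_url(base_url, resource_type, query):
    """Build a /pdb/query/v4/resources URL with a URL-encoded JSON query."""
    encoded = urllib.parse.urlencode({"query": json.dumps(query)})
    return f"{base_url}/pdb/query/v4/resources/{resource_type}?{encoded}"


def client_cert_context(hostcert, hostprivkey, cacert):
    """Create a TLS context that presents the Puppet client certificate."""
    # Verify the server against the Puppet CA only.
    context = ssl.create_default_context(cafile=cacert)
    # Present the host certificate and key, like curl --cert/--key.
    context.load_cert_chain(certfile=hostcert, keyfile=hostprivkey)
    return context


def query_resources(base_url, resource_type, query, context):
    """Run the query and return the decoded JSON list of resources."""
    url = build_query_url(base_url, resource_type, query)
    with urllib.request.urlopen(url, context=context) as response:
        return json.load(response)
```

The query used earlier in this task would be passed as ["and", ["=", ["parameter", "ensure"], "present"], ["=", "exported", True]] with resource type Nagios_host. Reading the private key is the reason this needs root, or needs the certificates exposed to whichever user runs the generator.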

Change #1015326 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] puppetserver: use client certs for naggen2 puppetdb

https://gerrit.wikimedia.org/r/1015326

Change #1015326 merged by Filippo Giunchedi:

[operations/puppet@production] puppetserver: use client certs for naggen2 puppetdb

https://gerrit.wikimedia.org/r/1015326

Change #1015511 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] puppetserver: use client certs for naggen2/puppetdb

https://gerrit.wikimedia.org/r/1015511

Change #1015511 merged by Filippo Giunchedi:

[operations/puppet@production] puppetserver: use client certs for naggen2/puppetdb

https://gerrit.wikimedia.org/r/1015511

Change #1003531 merged by Filippo Giunchedi:

[operations/puppet@production] alert: Update hiera entries for alert1001 to use Puppet 7

https://gerrit.wikimedia.org/r/1003531

I pushed forward with this to be in a stable/known state ASAP, i.e. alert1001 and alert2001 are both on Puppet 7 now and catalogs compile successfully.