The Alert hosts currently use Puppet 5.
After the Bookworm upgrade we need to upgrade the catalogue to be compatible with Puppet 7.
Description
Details
Status | Subtype | Assigned | Task
---|---|---|---
Open | | dcaro | T313444 Streamline WMCS Alerting and Paging
Resolved | | dcaro | T320973 [wmcs][alerting] Allow silencing alerts metricsinfra alerts on alerts.wikimedia.org
Resolved | | andrea.denisse | T333615 Upgrade alert* hosts to Bookworm
Resolved | | andrea.denisse | T358506 Fix the Alert hosts Puppet catalogue to be compatible with Puppet 7
Event Timeline
I had a look, and what is failing are the naggen2 calls in the icinga::naggen class, on lines 12 and 21. These were moved to the Puppet servers in https://gerrit.wikimedia.org/r/c/operations/puppet/+/991361, but don't appear to work fully yet. If I run the command manually on puppetserver1001, the naggen command itself fails to query puppetdb:
jmm@puppetserver1001:~$ /usr/local/bin/naggen2 --type hosts
jmm@puppetserver1001:~$ echo $?
30
jmm@puppetserver1001:~$ /usr/local/bin/naggen2 --debug --type hosts
naggen2: INFO - Generating output for resource hosts
naggen2: DEBUG - Loading configfile /etc/puppet/puppetdb.conf
urllib3.connectionpool: DEBUG - Starting new HTTPS connection (1): puppetdb1003.eqiad.wmnet:443
urllib3.connectionpool: DEBUG - https://puppetdb1003.eqiad.wmnet:443 "GET /pdb/query/v4/resources/Nagios_host?query=%5B%22and%22%2C+++++++++++++++++++++++++%5B%22%3D%22%2C+%5B%22parameter%22%2C+%22ensure%22%5D%2C+%22present%22%5D%2C+++++++++++++++++++++++++%5B%22%3D%22%2C+%22exported%22%2C+true%5D+++++++++++++++++++++%5D HTTP/1.1" 400 237
naggen2: ERROR - Could not generate output for resource hosts
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/requests/models.py", line 971, in json
    return complexjson.loads(self.text, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/naggen2", line 153, in render
    for entity in self._query(definition['puppet_resource_type']):
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/bin/naggen2", line 190, in _query
    return resources_raw.json()
           ^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/requests/models.py", line 975, in json
    raise RequestsJSONDecodeError(e.msg, e.doc, e.pos)
requests.exceptions.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
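The traceback above hides the real problem: the server answered 400 with a non-JSON body, and the failure only surfaces later as a JSON decode error. A minimal sketch of checking the status before decoding is below. The helper name `decode_resources` and the `PuppetDBError` class are hypothetical and are not part of naggen2; the sketch takes a status code and body as plain values so it does not depend on any HTTP library.

```python
import json


class PuppetDBError(Exception):
    """Hypothetical error type for a failed PuppetDB query."""


def decode_resources(status, body):
    """Decode a PuppetDB response body, failing early on non-200 replies.

    `status` is the HTTP status code and `body` the response text.
    """
    if status != 200:
        # Report the first non-empty line of the body instead of
        # letting json.loads() choke on an HTML error page.
        lines = [line for line in body.splitlines() if line.strip()]
        hint = lines[0].strip() if lines else "(empty body)"
        raise PuppetDBError(f"PuppetDB returned HTTP {status}: {hint}")
    return json.loads(body)
```

With a check like this, the 400 in the debug log would be reported as an HTTP error rather than as "Expecting value: line 1 column 1 (char 0)".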
Is the 400 because of a missing cert? From a cumin host I get:
$ curl -G "https://puppetdb1003.eqiad.wmnet/pdb/query/v4/resources/Nagios_host" --data-urlencode 'query=["and", ["=", ["parameter", "ensure"], "present"], ["=", "exported", true]]'
<html>
<head><title>400 No required SSL certificate was sent</title></head>
<body>
<center><h1>400 Bad Request</h1></center>
<center>No required SSL certificate was sent</center>
<hr><center>nginx/1.22.1</center>
</body>
</html>
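For reference, the same query URL can be built in Python with only the standard library. This is a rough sketch, not the code naggen2 actually uses; the function name `build_resources_url` is hypothetical, while the endpoint path and the query itself are taken from the curl command above.

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit

# Query from the curl command: exported resources with ensure=present.
EXPORTED_PRESENT = [
    "and",
    ["=", ["parameter", "ensure"], "present"],
    ["=", "exported", True],
]


def build_resources_url(base_url, resource_type):
    """Return the /pdb/query/v4/resources URL for one resource type."""
    params = urlencode({"query": json.dumps(EXPORTED_PRESENT)})
    return f"{base_url}/pdb/query/v4/resources/{resource_type}?{params}"


url = build_resources_url("https://puppetdb1003.eqiad.wmnet", "Nagios_host")

# Round-trip check: decoding the query string gives back the same query.
parts = urlsplit(url)
decoded = json.loads(parse_qs(parts.query)["query"][0])
```

Building the URL is not the failing step here: nginx rejects the request before PuppetDB ever evaluates the query.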
I'm wondering how this worked in the first place. _Is_ naggen currently actually working? This seems unrelated to Puppet 7. naggen.py simply uses requests, so it does the equivalent of Riccardo's curl command above, and I get the same error when I run it from e.g. puppetserver1002, which is expected since the nginx site for puppetdb has ssl_verify_client enabled.
For other tooling which queries puppetdb we use pypuppetdb and the microservice typically.
One other option would be to move the generators to the puppetdb hosts and have them simply query puppetdb on localhost.
But is this task still valid? The alert hosts were migrated to Bookworm this week and Puppet is running fine there.
Change #1003527 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):
[operations/puppet@production] alert: Update hiera entries for alert2001 to use Puppet 7
Change #1003531 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):
[operations/puppet@production] alert: Update hiera entries for alert1001 to use Puppet 7
Change #1003527 merged by Andrea Denisse:
[operations/puppet@production] alert: Update hiera entries for alert2001 to use Puppet 7
Hi Infrastructure Foundations Team,
We're currently facing a challenge with Puppet on alert2001. It has come to our attention that Puppet 7 hosts require an SSL certificate to communicate with the Puppet master and access Puppet's database. Despite our attempts to troubleshoot, we're at a standstill on how to proceed.
Could you please provide your insights on the best way to address the SSL certificate requirement? Furthermore, any direct support you could offer in resolving this issue would be greatly appreciated.
Relevant output:
denisse@cumin2002:~$ sudo cumin 'alert2*' 'run-puppet-agent'
1 hosts will be targeted:
alert2001.wikimedia.org
OK to proceed on 1 hosts? Enter the number of affected hosts to confirm or "q" to quit: 1
----- OUTPUT of 'run-puppet-agent' -----
Info: Using environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Function Call, Failed to execute generator /usr/local/bin/naggen2: Execution of '/usr/local/bin/naggen2 --type hosts' returned 30: (file: /srv/puppet_code/environments/production/modules/icinga/manifests/naggen.pp, line: 12, column: 18) on node alert2001.wikimedia.org
Warning: Not using cache on failed catalog
Error: Could not retrieve catalog; skipping run
================
PASS | | 0% (0/1) [00:17<?, ?hosts/s]
FAIL |█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:17<00:00, 17.77s/hosts]
100.0% (1/1) of nodes failed to execute command 'run-puppet-agent': alert2001.wikimedia.org
0.0% (0/1) success ratio (< 100.0% threshold) for command: 'run-puppet-agent'. Aborting.
0.0% (0/1) success ratio (< 100.0% threshold) of nodes successfully executed all commands. Aborting.
The premise seems to mix different things. PuppetDB is a totally separate service from the PuppetMaster/PuppetServer ones and runs on its own hosts. Are you saying that naggen fails to connect to PuppetDB?
What kind of queries does it do? Could it use the public proxy that is exposed on the intranet? If so, connect to puppetdb-api.discovery.wmnet:8090
From what I understand, the Puppet run is failing on the alert2001 host because the naggen2 command tries to query puppetdb unsuccessfully, due to a missing (maybe incorrect) SSL certificate required on Puppet 7 hosts to contact the Puppetmaster and query its DB.
Apologies if the premise is not clear, I'm still trying to understand the root cause of the issue.
Yes, but which endpoint is it trying to connect to? Please try to use puppetdb-api.discovery.wmnet:8090 and let us know if that works or not (that's a proxy that allows only some queries and not others, so it might need tweaking based on which queries naggen does).
I've been working on debugging this too, here's my understanding:
- naggen2 is used to generate icinga configuration for nagios_host and nagios_service exported resources, and runs as a generator on the puppet master/server
- naggen2 reads /etc/puppet/puppetdb.conf to discover the puppetdb url
- on puppetmaster naggen2 works because the puppetdb url points to port 8443, where ssl cert validation is optional
- this is not the case on puppetserver, thus naggen2 can't query puppetdb
It queries for all exported Nagios_host and Nagios_service resources. I tried adapting it to use puppetdb-api, though AFAICS the microservice only returns certname? naggen2 will look for other resource parameters like contact_groups, title, etc.
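The URL discovery step in the list above can be sketched roughly as follows. This is not the naggen2 source: it assumes puppetdb.conf uses the stock PuppetDB terminus layout (a `[main]` section with a `server_urls` key), and the function name and sample hostname are hypothetical. Only port 8443 comes from this thread.

```python
import configparser


def puppetdb_url_from_conf(conf_text):
    """Return the first PuppetDB URL listed in puppetdb.conf content.

    Assumes an INI file with [main] and server_urls; server_urls may
    hold several comma-separated URLs, of which the first is used.
    """
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    urls = parser.get("main", "server_urls")
    return urls.split(",")[0].strip()


# Hypothetical file content, for illustration only.
SAMPLE_CONF = """\
[main]
server_urls = https://puppetdb.example.org:8443
"""

url = puppetdb_url_from_conf(SAMPLE_CONF)
```

Since the URL comes from this file, whichever port it names decides whether the client certificate is optional or required for the generator.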
Sorry, ignore my previous comments, there was some misunderstanding:
- naggen runs on the puppetmaster/puppetserver server side, not on the agent side
- it needs all the info from the resources and hence can't use the proxy
- it should just work using the puppet client certificates (you might need to expose them, not sure)
Something like this from root works:
curl --cert $(puppet config print hostcert) --key $(puppet config print hostprivkey) --cacert $(puppet config print cacert) https://puppetdb1003.eqiad.wmnet/pdb/query/v4/....
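A rough Python equivalent of the curl command above, using only the standard library, might look like this. It is a sketch under assumptions rather than the change that was eventually merged: the function names are hypothetical, and it shells out to `puppet config print` for the three paths exactly as the curl example does, so it needs to run as a user that can read the host private key.

```python
import json
import ssl
import subprocess
import urllib.request


def puppet_setting(name, run=subprocess.check_output):
    """Return the value of `puppet config print <name>`.

    `run` is injectable so the lookup can be exercised without Puppet.
    """
    return run(["puppet", "config", "print", name], text=True).strip()


def client_context(hostcert, hostprivkey, cacert):
    """Build a TLS context that verifies the server against the Puppet CA
    and presents the host's Puppet client certificate."""
    ctx = ssl.create_default_context(cafile=cacert)
    ctx.load_cert_chain(certfile=hostcert, keyfile=hostprivkey)
    return ctx


def query_puppetdb(url, run=subprocess.check_output):
    """GET a PuppetDB URL with the Puppet client certificate, return JSON."""
    names = ("hostcert", "hostprivkey", "cacert")
    ctx = client_context(*(puppet_setting(name, run) for name in names))
    with urllib.request.urlopen(url, context=ctx) as resp:
        return json.load(resp)
```

naggen2 itself uses requests, where the same idea would be passing the certificate and key pair plus the CA path to the request.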
Sorry for the confusion
Change #1015326 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):
[operations/puppet@production] puppetserver: use client certs for naggen2 puppetdb
Change #1015326 merged by Filippo Giunchedi:
[operations/puppet@production] puppetserver: use client certs for naggen2 puppetdb
Change #1015511 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):
[operations/puppet@production] puppetserver: use client certs for naggen2/puppetdb
Change #1015511 merged by Filippo Giunchedi:
[operations/puppet@production] puppetserver: use client certs for naggen2/puppetdb
Change #1003531 merged by Filippo Giunchedi:
[operations/puppet@production] alert: Update hiera entries for alert1001 to use Puppet 7
I pushed forward with this to be in a stable/known state ASAP, i.e. alert1001 and alert2001 are both on Puppet 7 now and catalogs compile successfully.