
No Puppet resources found on instance deployment-etcd05 on project deployment-prep
Closed, Resolved · Public

Description

Common information

  • summary: No Puppet resources found on instance deployment-etcd05 on project deployment-prep
  • alertname: PuppetAgentNoResources
  • instance: deployment-etcd05
  • job: node
  • project: deployment-prep
  • severity: warning


Event Timeline

Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Evaluation Error: Error while evaluating a Function Call, Invalid Prometheus instance 'ops'. Must be one of:  (file: /srv/puppet_code/environments/production/modules/prometheus/manifests/alert/rule.pp, line: 44, column: 9) (file: /srv/puppet_code/environments/production/modules/nrpe/manifests/monitor_service.pp, line: 129) on node deployment-etcd05.deployment-prep.eqiad1.wikimedia.cloud
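The empty list after "Must be one of:" in that error suggests the validation in rule.pp checks the requested instance against the keys of the configured prometheus::instances hash, which at that point resolved to an empty hash. A minimal Python sketch of that failure mode (the function name and message format are illustrative, not the actual Puppet code):

```python
# Sketch (hypothetical names) of the membership check that rule.pp
# appears to perform: the requested instance must be a key of the
# configured prometheus::instances hash. With the hash stubbed out
# as {}, every lookup fails and the "Must be one of:" list is empty.
def validate_instance(instance: str, configured: dict) -> None:
    if instance not in configured:
        raise ValueError(
            f"Invalid Prometheus instance '{instance}'. "
            f"Must be one of: {' '.join(sorted(configured))}"
        )

try:
    validate_instance("ops", {})  # empty hash, as on deployment-etcd05
except ValueError as err:
    print(err)  # Invalid Prometheus instance 'ops'. Must be one of: 
```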
bd808 changed the task status from Open to In Progress. Aug 20 2025, 5:11 PM
bd808 claimed this task.
bd808 triaged this task as High priority.
bd808 moved this task from To Triage to Puppet errors on the Beta-Cluster-Infrastructure board.
bd808 added a subscriber: tappof.

Potentially related to T395446: Evaluate which solution we could adopt as a drop-in replacement for NRPE (and start prototyping). These were merged today:

The new prometheus::instances hieradata that @tappof added to ops/puppet.git:hieradata/cloud.yaml is being shadowed by stub data that I put into cloud/instance-puppet.git:deployment-prep/deployment-etcd.yaml back in May.
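Hiera resolves a key from the highest-priority layer of the hierarchy that defines it, so the empty hash in the instance-puppet prefix data wins over the populated hash in cloud.yaml. A rough Python sketch of that first-match precedence (layer contents here are illustrative, not the real hieradata):

```python
# Sketch (not Puppet's actual hiera implementation) of first-match
# lookup precedence: cloud/instance-puppet prefix data is consulted
# before ops/puppet's cloud.yaml, so the {} stub shadows the real hash.
HIERARCHY = [
    # Higher priority first.
    {"prometheus::instances": {}},  # instance-puppet: deployment-etcd.yaml (stub)
    {"prometheus::instances": {"cloud": {"port": 9900}}},  # ops/puppet: cloud.yaml (values hypothetical)
]

def lookup(key):
    for layer in HIERARCHY:
        if key in layer:
            return layer[key]  # first layer that defines the key wins
    raise KeyError(key)

print(lookup("prometheus::instances"))  # {}  <- the stub shadows the real data
```

Removing the stub layer (as done in the change below) lets the lookup fall through to the populated hash.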

https://gerrit.wikimedia.org/r/plugins/gitiles/cloud/instance-puppet/+/eb6726025a34b769bbe63dfd32fb1f126204a171%5E%21/#F0

diff --git a/deployment-prep/deployment-etcd.yaml b/deployment-prep/deployment-etcd.yaml
index 9b36737..7f834b3 100644
--- a/deployment-prep/deployment-etcd.yaml
+++ b/deployment-prep/deployment-etcd.yaml

@@ -6,6 +6,7 @@
 profile::etcd::v3::max_latency: 100
 profile::etcd::v3::use_client_certs: false
 profile::etcd::v3::use_pki_certs: true
+prometheus::instances: {}
 prometheus::instances_defaults:
   hosts: null
   k8s_cluster_name: null

Let's start by removing that stubbed out data.

https://gerrit.wikimedia.org/r/plugins/gitiles/cloud/instance-puppet/+/f21538d1f32640d19240a4f92b1626250b60f563%5E%21/#F0

diff --git a/deployment-prep/deployment-etcd.yaml b/deployment-prep/deployment-etcd.yaml
index 7f834b3..9b36737 100644
--- a/deployment-prep/deployment-etcd.yaml
+++ b/deployment-prep/deployment-etcd.yaml

@@ -6,7 +6,6 @@
 profile::etcd::v3::max_latency: 100
 profile::etcd::v3::use_client_certs: false
 profile::etcd::v3::use_pki_certs: true
-prometheus::instances: {}
 prometheus::instances_defaults:
   hosts: null
   k8s_cluster_name: null
bd808@deployment-etcd05:~$ sudo -i puppet agent -tv
Info: Using environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
Info: Caching catalog for deployment-etcd05.deployment-prep.eqiad1.wikimedia.cloud
Info: Applying configuration version '(d438a3c77f) gitpuppet - mediawiki: Remove unused wikidata.org vhost and fix beta redirect'
Notice: /Stage[main]/Base::Kernel/Nrpe::Monitor_service[cpu_microcode_status]/File[/usr/local/bin/nrpe2nodexp]/content:
--- /usr/local/bin/nrpe2nodexp  2025-07-30 09:02:36.853226945 +0000
+++ /tmp/puppet-file20250820-315811-12qq255     2025-08-20 17:13:35.713064973 +0000
@@ -41,6 +41,29 @@
     CRITICAL = 2
     UNKNOWN = 3

+    def get_severity(self, page: bool = False) -> str:
+        """
+        Maps Nagios exit values to the Prometheus severity label
+
+        Args:
+            page (bool): if CRITICAL should be interpreted as a paging alert
+
+        Returns:
+            status = WARNING  -> severity = warning
+            status = CRITICAL -> severity = critical|page
+                                 depending on the --page flag
+            all other values  -> severity = info
+        """
+        if StatusCodes.WARNING.value == self.value:
+            return self.name.lower()
+        elif StatusCodes.CRITICAL.value == self.value:
+            if page:
+                return "page"
+            else:
+                return self.name.lower()
+        else:
+            return "info"
+

 class ECSFormatter(logging.Formatter):
     """
@@ -280,7 +303,9 @@
         cache_dir: Path,
         check_command: str,
         executor: DeferredProcess,
+        alert_rule_hash: str,
         perf_data: bool,
+        page: bool,
         timeout: int,
     ) -> None:
         """
@@ -292,13 +317,17 @@
                                  used to identify the configuration file.
             executor (DeferredProcess): The Popen placeholder responsible
                                         for executing the configured command.
+            alert_rule_hash (str): Used to uniquely identify the corresponding alert rule.
             perf_data (bool): True if performance data parsing is enabled.
+            page (bool): if CRITICAL should be interpreted as a paging alert
             timeout (int): The execution timeout, in seconds.
         """
         self._cache_dir = cache_dir
         self._check_command = check_command
         self._executor = executor
+        self._alert_rule_hash = alert_rule_hash
         self._perf_data = perf_data
+        self._page = page
         self._timeout = timeout

     def store_log(self, executable: str, returncode: int, log: str) -> None:
@@ -370,7 +399,13 @@

         for status_label, status_code in StatusCodes.__members__.items():
             metric.add_metric(
-                [self._check_command, status_label], 1 if status_code == code else 0
+                [
+                    self._check_command,
+                    self._alert_rule_hash,
+                    status_label,
+                    StatusCodes(status_code).get_severity(page=self._page),
+                ],
+                1 if status_code == code else 0,
             )

         if self._perf_data:
@@ -392,6 +427,7 @@
                 perfdata_metric.add_metric(
                     [
                         self._check_command,
+                        self._alert_rule_hash,
                         m.label,
                         m.get("uom", "undef"),
                         m.get("warn", "undef"),
@@ -410,7 +446,7 @@
         metric = GaugeMetricFamily(
             "nagios_nrpe_check_result",
             "Result of Nagios-style check script execution",
-            labels=["check_name", "status"],
+            labels=["check_name", "alert_rule_hash", "status", "severity"],
         )

         perfdata = GaugeMetricFamily(
@@ -418,6 +454,7 @@
             "PerfData generated by Nagios-style check script execution",
             labels=[
                 "check_name",
+                "alert_rule_hash",
                 "entity",
                 "unit",
                 "warn",
@@ -430,12 +467,15 @@
         error = GaugeMetricFamily(
             "nagios_nrpe_check_config_error",
             "True if errors occurred during check initialization.",
-            labels=["check_name"],
+            labels=["check_name", "alert_rule_hash"],
         )

         if self._executor is not None:
             self.execute_and_parse(metric, perfdata)
-        error.add_metric([self._check_command], 1 if self._executor is None else 0)
+        error.add_metric(
+            [self._check_command, self._alert_rule_hash],
+            1 if self._executor is None else 0,
+        )

         if len(metric.samples) > 0:
             yield metric
@@ -446,6 +486,11 @@

 @click.command()
 @click.option("--check-command", required=True, help="The name of the check command")
+@click.option(
+    "--alert-rule-hash",
+    required=True,
+    help="Used to uniquely identify the corresponding alert rule.",
+)
 @click.option("--timeout", default=10, help="check command timeout in seconds")
 @click.option(
     "--perf-data",
@@ -454,6 +499,12 @@
     help="Parse perfdata as defined in https://nagios-plugins.org/doc/guidelines.html#AEN200",
 )
 @click.option(
+    "--page",
+    is_flag=True,
+    default=False,
+    help="if CRITICAL should be interpreted as a paging alert",
+)
+@click.option(
     "--cleanup", is_flag=True, default=True, help="Clean up on uncaught exceptions"
 )
 @click.option(
@@ -462,8 +513,10 @@
 @click.option("--debug", is_flag=True, default=False, help="Print verbose debug ouput")
 def main(
     check_command: str,
+    alert_rule_hash: str,
     timeout: int,
     perf_data: bool,
+    page: bool,
     cleanup: bool,
     dry_run: bool,
     debug: bool,
@@ -481,7 +534,9 @@
             NODE_EXPORTER_DIR,
             check_command,
             nagiosconfig.get_executor(),
+            alert_rule_hash,
             perf_data,
+            page,
             timeout,
         )
         registry.register(nagioscollector)

Notice: /Stage[main]/Base::Kernel/Nrpe::Monitor_service[cpu_microcode_status]/File[/usr/local/bin/nrpe2nodexp]/content: content changed '{sha256}fc7174eeec75650760e61773db915be1534f26d28286c1d0b741cde83a021dd5' to '{sha256}f0a3f51db7e152358d7907d46e008fee4959033a96fa59fde7af9def4590962f'
Notice: Applied catalog in 5.57 seconds
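For reference, the get_severity mapping added to nrpe2nodexp in the diff above can be exercised on its own; this is a condensed, runnable version of the same logic (identity comparisons replace the `.value` comparisons, behavior unchanged):

```python
from enum import Enum

class StatusCodes(Enum):
    OK = 0
    WARNING = 1
    CRITICAL = 2
    UNKNOWN = 3

    def get_severity(self, page: bool = False) -> str:
        # WARNING -> "warning"; CRITICAL -> "critical" or "page"
        # depending on the --page flag; all other values -> "info".
        if self is StatusCodes.WARNING:
            return self.name.lower()
        if self is StatusCodes.CRITICAL:
            return "page" if page else self.name.lower()
        return "info"

print(StatusCodes.WARNING.get_severity())            # warning
print(StatusCodes.CRITICAL.get_severity())           # critical
print(StatusCodes.CRITICAL.get_severity(page=True))  # page
print(StatusCodes.OK.get_severity())                 # info
```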

Huh. I left this open because in the past I have seen the @wmcs-alerts bot open a new task if the original is already closed when the alert resolves upstream. This time that didn't happen?