deployment-etcd05: Function lookup() did not find a value for the name 'prometheus::instances_defaults'
Closed, Resolved, Public

Description

Forked from T393855: deployment-etcd05.deployment-prep.eqiad1.wikimedia.cloud has been failing to run Puppet since its last successful run:

The last Puppet run was at Wed Apr 16 12:32:36 UTC 2025 (37120 minutes ago).

Each run fails with:

May 12 07:02:42 deployment-etcd05 puppet-agent[165659]: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Function lookup() did not find a value for the name 'prometheus::instances_defaults' on node deployment-etcd05.deployment-prep.eqiad1.wikimedia.cloud

Even though hieradata/common/prometheus.yaml has:

prometheus::instances_defaults:
  retention_time: 4032h
  retention_size: ~
  thanos_upload: true
  k8s_cluster_name: ~
  hosts: ~
  provision_lv_size: '50g'

I cannot figure out what is wrong in Puppet. On the Puppet server, a single change was applied after the last working run:

$ git cherry -v snapshot-202504161216 snapshot-202504161258|egrep '^\+'
+ 09b591c8145b013acd71f00e1fca7bb8e982c53e etcd: replace prometheus_all_nodes

That is https://gerrit.wikimedia.org/r/c/operations/puppet/+/1129177 (etcd: replace prometheus_all_nodes). It at least touches both etcd and Prometheus, but I cannot see how it broke Puppet or how it relates to a missing prometheus::instances_defaults.

Event Timeline

hieradata/common/prometheus.yaml

Loading that file relies on the wmflib::expand_path backend which is not in use in Cloud VPS.

I'm OK with moving prometheus::instances_defaults elsewhere in hieradata where it would be compatible with Cloud VPS (not sure where). Alternatively, could we add hieradata/common to the Cloud VPS (or beta's puppetserver) Hiera configuration?
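
For illustration, here is a minimal Python model of why the key can exist in the repository and still not be found. It is only a sketch: the hierarchy lists, the `hieradata/cloud.yaml` path, and the trimmed file contents are hypothetical stand-ins, not the real production or Cloud VPS Hiera configuration. The point is that a first-match lookup consults only the files named in the node's hierarchy.

```python
from typing import Any

# Data files that exist in the repository (hypothetical, trimmed contents).
FILES_ON_DISK: dict[str, dict[str, Any]] = {
    "hieradata/common/prometheus.yaml": {
        "prometheus::instances_defaults": {"retention_time": "4032h"},
    },
    "hieradata/cloud.yaml": {},  # hypothetical Cloud VPS layer without the key
}


def lookup(key: str, hierarchy: list[str]) -> Any:
    """Return the first value found for key, searching only files in hierarchy."""
    for path in hierarchy:
        data = FILES_ON_DISK.get(path, {})
        if key in data:
            return data[key]
    raise KeyError(f"Function lookup() did not find a value for the name '{key}'")


# Hypothetical hierarchies: one reaches hieradata/common/, the other does not.
production = ["hieradata/common/prometheus.yaml"]
cloud_vps = ["hieradata/cloud.yaml"]

print(lookup("prometheus::instances_defaults", production))
try:
    lookup("prometheus::instances_defaults", cloud_vps)
except KeyError as err:
    print(err.args[0])
```

Under these assumptions, either proposal above (moving the key into a layer Cloud VPS reads, or adding hieradata/common to its hierarchy) makes the second lookup succeed.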

Change #1144589 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Add prometheus::instance_defaults to deployment-prep's common settings

https://gerrit.wikimedia.org/r/1144589

The patch would set:

prometheus::instances_defaults:
  retention_time: 4032h
  retention_size: ~
  thanos_upload: true
  k8s_cluster_name: ~
  hosts: ~
  provision_lv_size: '50g'

But then the Puppet error becomes "did not find a value for the name 'prometheus::instances'", so there is still some mismatch.

The prometheus::instances: settings seem to also be in common/prometheus.yaml and apparently describe various Prometheus clusters.
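
Since catalog compilation stops at the first failed lookup, fixing one key only reveals the next missing one. A small sketch of checking all required keys in one pass, assuming (this is an assumption, not confirmed from the Puppet code) that the catalog looks up both keys without a default; `REQUIRED_KEYS` and `missing_keys` are hypothetical names:

```python
# Keys named in the two error messages seen so far.
REQUIRED_KEYS = ["prometheus::instances_defaults", "prometheus::instances"]


def missing_keys(hiera_data: dict, required: list[str]) -> list[str]:
    """List every required key absent from the merged Hiera data, in order."""
    return [key for key in required if key not in hiera_data]


# Hypothetical merged data for the node at three points in time.
before_patch: dict = {}
after_patch = {"prometheus::instances_defaults": {"retention_time": "4032h"}}
after_both = {**after_patch, "prometheus::instances": {}}

for label, data in [
    ("before patch", before_patch),
    ("defaults only", after_patch),
    ("both keys", after_both),
]:
    print(label, missing_keys(data, REQUIRED_KEYS))
```

An empty `prometheus::instances` hash counts as present here, because the check is on the key's existence rather than on its value.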

Mentioned in SAL (#wikimedia-releng) [2025-05-12T15:21:59Z] <bd808> Added prometheus::instances and prometheus::instances_defaults hiera settings to "deployment-etcd" Prefix Puppet via Horizon (T393866)

Applied config:

prometheus::instances: {}
prometheus::instances_defaults:
  hosts: null
  k8s_cluster_name: null
  provision_lv_size: 50g
  retention_size: null
  retention_time: 4032h
  thanos_upload: true

Mentioned in SAL (#wikimedia-releng) [2025-05-12T15:28:03Z] <bd808> Forced puppet run on deployment-etcd02.deployment-prep.eqiad1.wikimedia.cloud to fix Puppet run (T393866)

Mentioned in SAL (#wikimedia-releng) [2025-05-12T15:28:08Z] <bd808> Forced puppet run on deployment-etcd05.deployment-prep.eqiad1.wikimedia.cloud to fix Puppet run (T393866)

Change #1144589 abandoned by Elukey:

[operations/puppet@production] Add prometheus::instance_defaults to cloud's common settings

https://gerrit.wikimedia.org/r/1144589

hashar assigned this task to bd808.

This was solved by @bd808, who applied the settings via Horizon (T393866#10812296). That fixed the Puppet run, which in turn unbroke the beta cluster (T393855), the reason I filed these tasks.