Puppet agent failure detected on instance deployment-cache-text08 in project deployment-prep
Closed, Resolved | Public | Bug Report

Description

Common information

  • summary: Puppet agent failure detected on instance deployment-cache-text08 in project deployment-prep
  • alertname: PuppetAgentFailure
  • instance: deployment-cache-text08
  • job: node
  • project: deployment-prep
  • severity: warning

Firing alerts


  • summary: Puppet agent failure detected on instance deployment-cache-text08 in project deployment-prep
  • alertname: PuppetAgentFailure
  • instance: deployment-cache-text08
  • job: node
  • project: deployment-prep
  • severity: warning

Event Timeline

Looks like it ain't me this time.

krinkle@deployment-cache-text08:~$ sudo run-puppet-agent
Info: Using environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
Info: Caching catalog for deployment-cache-text08.deployment-prep.eqiad1.wikimedia.cloud
Info: Applying configuration version '(8f2f111dea) gitpuppet - varnish: Implement new direct routing for mobile views'
Notice: /Stage[main]/Prometheus::Varnishkafka_exporter/Service[prometheus-varnishkafka-exporter]/ensure: ensure changed 'stopped' to 'running' (corrective)
Info: /Stage[main]/Prometheus::Varnishkafka_exporter/Service[prometheus-varnishkafka-exporter]: Unscheduling refresh on Service[prometheus-varnishkafka-exporter]
Error: /Stage[main]/Profile::Cache::Haproxy/File[/usr/share/GeoIP/datacenter.mmdb]: Could not evaluate: Could not retrieve information from environment production source(s) puppet:///volatile/datacenter_vendors/datacenter.mmdb
Info: Stage[main]: Unscheduling all events on Stage[main]
Notice: Applied catalog in 17.47 seconds
root@deployment-puppetserver-1:/srv/puppet_fileserver/volatile# systemctl status dump_datacenter_ranges --no-pager --full
× dump_datacenter_ranges.service - Job to update known datacenter database
     Loaded: loaded (/lib/systemd/system/dump_datacenter_ranges.service; static)
     Active: failed (Result: exit-code) since Wed 2025-09-03 00:00:08 UTC; 17h ago
TriggeredBy: ● dump_datacenter_ranges.timer
       Docs: https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
    Process: 253431 ExecStart=/usr/local/bin/fetch-datacenter-vendors -t /tmp/dch_latest.json -o /srv/puppet_fileserver/volatile/datacenter_vendors/datacenter.mmdb (code=exited, status=255/EXCEPTION)
   Main PID: 253431 (code=exited, status=255/EXCEPTION)
        CPU: 734ms

Sep 03 00:00:06 deployment-puppetserver-1 systemd[1]: Starting dump_datacenter_ranges.service - Job to update known datacenter database...
Sep 03 00:00:08 deployment-puppetserver-1 fetch-datacenter-vendors[253431]: Please add access token to environment variable SPUR_TOKEN
Sep 03 00:00:08 deployment-puppetserver-1 systemd[1]: dump_datacenter_ranges.service: Main process exited, code=exited, status=255/EXCEPTION
Sep 03 00:00:08 deployment-puppetserver-1 systemd[1]: dump_datacenter_ranges.service: Failed with result 'exit-code'.
Sep 03 00:00:08 deployment-puppetserver-1 systemd[1]: Failed to start dump_datacenter_ranges.service - Job to update known datacenter database.
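The journal lines above show the fetch job exiting because SPUR_TOKEN is not set in its environment. As a rough sketch of how one might pull that detail out of a unit's journal without reading it by eye, the helper below prints any variable name mentioned in a message of that shape. The function name `missing_token_vars` is made up for illustration, and the message wording is assumed to match the line shown above exactly.

```shell
#!/bin/sh
# Hypothetical helper, not part of any existing tooling.
# Reads log lines on stdin and prints the name of every environment
# variable requested by a message of the form
#   "Please add access token to environment variable NAME"
# one name per line, de-duplicated.
missing_token_vars() {
    sed -n 's/.*Please add access token to environment variable \([A-Za-z_][A-Za-z0-9_]*\).*/\1/p' \
        | sort -u
}
```

It could be fed from the journal, for example `journalctl -u dump_datacenter_ranges.service --no-pager | missing_token_vars`. No output means no such message was found in the lines read.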

@SLyngshede-WMF, @Vgutierrez: For https://gerrit.wikimedia.org/r/c/operations/puppet/+/1181090 and https://gerrit.wikimedia.org/r/c/operations/puppet/+/1184037, we will need to carve out exceptions for the Beta cluster. Should we introduce a new conditional for such private data, or should we re-use the existing ones that we are extending in other ways, such as use_etcd_req_filters and geoip_fetch_private (and move the relevant code for spur.us under one of them)?

bd808 changed the subtype of this task from "Task" to "Bug Report". Sep 3 2025, 8:42 PM
bd808 moved this task from To Triage to Puppet errors on the Beta-Cluster-Infrastructure board.

Change #1184646 had a related patch set uploaded (by Slyngshede; author: Slyngshede):

[operations/puppet@production] P:puppetserver::volatile avoid loading Spur data on certain host

https://gerrit.wikimedia.org/r/1184646

Change #1184646 merged by Slyngshede:

[operations/puppet@production] P:puppetserver::volatile avoid loading Spur data on certain host

https://gerrit.wikimedia.org/r/1184646

SLyngshede-WMF claimed this task.
SLyngshede-WMF triaged this task as High priority.
bd808@deployment-cache-text08.deployment-prep.eqiad1:/etc/haproxy/conf.d$ sudo -i puppet agent -tv
Info: Using environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
Info: Caching catalog for deployment-cache-text08.deployment-prep.eqiad1.wikimedia.cloud
Info: Applying configuration version '(f110740a81) gitpuppet - varnish: Enable unified routing on test.wikidata, wikitech, officewiki'
Notice: /Stage[main]/Prometheus::Varnishkafka_exporter/Service[prometheus-varnishkafka-exporter]/ensure: ensure changed 'stopped' to 'running' (corrective)
Info: /Stage[main]/Prometheus::Varnishkafka_exporter/Service[prometheus-varnishkafka-exporter]: Unscheduling refresh on Service[prometheus-varnishkafka-exporter]
Error: /Stage[main]/Profile::Cache::Haproxy/File[/usr/share/GeoIP/datacenter.mmdb]: Could not evaluate: Could not retrieve information from environment production source(s) puppet:///volatile/datacenter_vendors/datacenter.mmdb
Info: Stage[main]: Unscheduling all events on Stage[main]
Notice: Applied catalog in 15.68 seconds

This is still happening because the guard condition added in https://gerrit.wikimedia.org/r/c/operations/puppet/+/1184646/2/modules/profile/manifests/cache/haproxy.pp is enabled in Beta Cluster just as it is in production.

It is reasonable to assume that @SLyngshede-WMF thought he was copying a good pattern. Unfortunately, T403105: Remove need for manually applied MaxMind data hacks on Beta Cluster cache servers already existed for the MaxMind data, which uses the same guard.

Change #1184966 had a related patch set uploaded (by Slyngshede; author: Slyngshede):

[operations/puppet@production] P:cache::haproxy allow datacenter information to be disabled

https://gerrit.wikimedia.org/r/1184966

Change #1184966 merged by Slyngshede:

[operations/puppet@production] P:cache::haproxy allow datacenter information to be disabled

https://gerrit.wikimedia.org/r/1184966

@bd808 We've added a special clause just for the datacenter database. We can use the same variable to prevent the datacenter checks from happening elsewhere in the future.

puppet is happier in deployment-cache-text08 but not 100%:

Sep  5 10:22:01 deployment-cache-text08 puppet-agent[1978923]: (/Stage[main]/Profile::Cache::Haproxy/File[/usr/share/GeoIP/datacenter.mmdb]) Could not evaluate: Could not retrieve information from environment production source(s) puppet:///volatile/datacenter_vendors/datacenter.mmdb

It looks like we need an if guard, or a mock of that mmdb file on the puppetservers.
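Either way, it helps to confirm whether the file is actually present on the puppetserver before picking a fix. A minimal sketch, assuming the `volatile` fileserver mount is backed by /srv/puppet_fileserver/volatile (inferred from the shell prompt and the `-o` path in the systemd unit quoted earlier in this task, not checked against the fileserver config); both function names are hypothetical:

```shell
#!/bin/sh
# Hypothetical helpers, not part of any existing tooling.

# Map a puppet:///volatile/... URI to a local path.
# $1: the URI as it appears in the agent error
# $2: directory assumed to back the "volatile" mount (optional)
volatile_path_for() {
    uri=$1
    root=${2:-/srv/puppet_fileserver/volatile}
    case "$uri" in
        puppet:///volatile/*)
            printf '%s/%s\n' "${root%/}" "${uri#puppet:///volatile/}"
            ;;
        *)
            echo "not a volatile URI: $uri" >&2
            return 1
            ;;
    esac
}

# Exit 0 if the file behind the URI exists, 1 if it is missing,
# 2 if the URI is not a volatile one.
volatile_source_exists() {
    path=$(volatile_path_for "$1" "$2") || return 2
    [ -f "$path" ]
}
```

Run on the puppetserver, `volatile_source_exists puppet:///volatile/datacenter_vendors/datacenter.mmdb && echo present || echo missing` would show whether agents can be served the file at all. It says nothing about whether the file is a valid mmdb.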

Change #1185067 had a related patch set uploaded (by Slyngshede; author: Slyngshede):

[operations/puppet@production] P:cache:haproxy guard datacenter database with if statement

https://gerrit.wikimedia.org/r/1185067

Change #1185067 merged by Slyngshede:

[operations/puppet@production] P:cache:haproxy prevent download of datacenter.mmdb

https://gerrit.wikimedia.org/r/1185067

Change #1185090 had a related patch set uploaded (by Slyngshede; author: Slyngshede):

[operations/puppet@production] P:cache::haproxy guard datacenter database with if

https://gerrit.wikimedia.org/r/1185090

Change #1185090 merged by Slyngshede:

[operations/puppet@production] P:cache::haproxy guard datacenter database with if

https://gerrit.wikimedia.org/r/1185090

Thanks everyone, for working so quickly to get this fixed!

puppet is now happy on deployment-cache-text08.