Puppet failure on deploy-1004.devtools.eqiad1.wikimedia.cloud
Closed, Resolved (Public)

Description

Puppet fails on deploy-1004.devtools.eqiad1.wikimedia.cloud. From the email we receive:

ERR: Could not retrieve catalog from remote server: Error 500 on SERVER:
Server Error: Evaluation Error:
Error while evaluating a Resource Statement, Evaluation Error: 
Error while evaluating a Function Call, Class[Profile::Mediawiki::Deployment::Server]: parameter 'statsd' expects a String value, got Undef (file: /etc/puppet/modules/role/manifests/deployment_server.pp, line: 15, column: 5)
on node deploy-1004.devtools.eqiad1.wikimedia.cloud
WARNING: Not using cache on failed catalog
ERR: Could not retrieve catalog; skipping run

Connecting to the instance gives the last time Puppet ran:

The last Puppet run was at Mon Sep 25 11:44:19 UTC 2023 (25607 minutes ago).

Event Timeline

The instance has:

Classes

role::deployment_server

Hiera configuration

puppetmaster: puppetmaster-1001.devtools.eqiad1.wikimedia.cloud

profile::mediawiki::deployment::server has required a statsd parameter for a long time.

Looks like statsd was removed by @taavi via https://gerrit.wikimedia.org/r/c/operations/puppet/+/960576 for T326266: Remove the WMCS statsd/Graphite service:

--- a/hieradata/cloud.yaml
+++ b/hieradata/cloud.yaml
+statsd: ~

--- a/hieradata/cloud/eqiad1.yaml
+++ b/hieradata/cloud/eqiad1.yaml
-# Labs statsd instance
-statsd: cloudmetrics1003.eqiad.wmnet:8125
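To illustrate why this change produces the "expects a String value, got Undef" error: in YAML, `~` is null, so Hiera still finds the `statsd` key but returns undef for it. A minimal sketch, assuming the profile declares its parameter roughly as below (the class name `example::needs_statsd` is hypothetical and the real profile's declaration may differ):

```puppet
# Hypothetical stand-in for the real profile, for illustration only.
class example::needs_statsd (
    # With "statsd: ~" in hieradata/cloud.yaml, lookup('statsd') returns undef,
    # which does not match the String type, so catalog compilation fails.
    String $statsd = lookup('statsd'),
) {
    notice("statsd endpoint: ${statsd}")
}

include example::needs_statsd
```

With the previous value `cloudmetrics1003.eqiad.wmnet:8125` the lookup returned a string and the type check passed.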

The statsd parameter of profile::mediawiki::deployment::server is passed to:

modules/profile/manifests/kubernetes/deployment_server/mediawiki/config.pp
modules/mediawiki/manifests/web/yaml_defs.pp
modules/profile/templates/mediawiki/error-params.php.erb
modules/profile/files/mediawiki/php/php7-fatal-error.php

The last of these increments the MediaWiki.errors.fatal counter (which I guess might still be required for deployment-prep / Beta-Cluster-Infrastructure, but that is another topic).

For devtools we don't use MediaWiki at all, so I guess statsd could be made optional. Beyond that, we would most probably need a new role that sets up a scap deployment server without all the production MediaWiki deployment parts.
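Making statsd optional could look roughly like the sketch below. This is only an illustration under assumptions: the class name `example::deployment_server` is hypothetical, and the real profile passes the value on to several templates and classes that would each need the same guard.

```puppet
# Hypothetical sketch, not the actual profile::mediawiki::deployment::server.
class example::deployment_server (
    # Optional[String] accepts either a string or undef, so "statsd: ~"
    # (or a missing key, thanks to default_value) no longer breaks compilation.
    Optional[String] $statsd = lookup('statsd', {'default_value' => undef}),
) {
    if $statsd {
        # Only configure fatal error reporting when an endpoint is set.
        notice("fatal errors would be reported to ${statsd}")
    } else {
        notice('statsd not configured, skipping MediaWiki fatal error reporting')
    }
}
```

The trade-off is that every consumer of the parameter has to tolerate undef, which is part of why a separate, MediaWiki-free role may be the cleaner option.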

The deployment_server role is:

modules/role/manifests/deployment_server.pp
# MediaWiki Deployment Server (prod). This role DOES NOT include the kubernetes stuff.
class role::deployment_server {

    system::role { 'deployment_server':
        description => 'Deployment server for MediaWiki and related code',
    }

    # standards
    include profile::base::production
    include profile::firewall
    include profile::backup::host
    backup::set {'home': }

    # webserver, scap deployment tool with SSH agent, rsync
    include profile::mediawiki::deployment::server
    include profile::scap::dsh
    include profile::keyholder::server

    # memcached-related
    include profile::mediawiki::mcrouter_wancache

    # client to fetch configuration data
    include profile::conftool::client

    # MediaWiki release uploads to releases servers
    include profile::releases::mediawiki::private
    include profile::releases::mediawiki::security

    # tool to test webserver config changes
    include profile::httpbb

    # proxy for connection to other servers
    include profile::services_proxy::envoy

    # Scap relies on pulling Docker images in order to self-update
    include profile::docker::engine
    include profile::docker::prune_old_images
}

I guess the issue is that scap was first intended for MediaWiki, so the Puppet manifests are tied to MediaWiki deployment. We later extended scap to deploy other services (known as scap v3), which is its sole use case on devtools.

In an ideal world, the basic scap v3 setup would be extracted out of profile::mediawiki::deployment::server, and that profile would only include the MediaWiki bits. That requires a fairly large refactoring in Puppet :-\

Mentioned in SAL (#wikimedia-releng) [2023-10-13T06:53:45Z] <hashar> devtools: set in hiera statsd: 127.0.0.1:8125 to fix Puppet on deploy-1004.devtools.eqiad1.wikimedia.cloud # T348830

hashar claimed this task.

Since we don't need MediaWiki fatal error reporting or statsd on the devtools deployment server, I set a dummy value, statsd: 127.0.0.1:8125, in Hiera. It satisfies the Puppet class requirement and fixed Puppet on the host.

I am happily ignoring:

  • Beta-Cluster-Infrastructure no longer having statsd metrics
  • whether MediaWiki.errors.fatal statsd counter still has any use or whether it could be decommissioned