Puppet failure on deploy-1002.devtools.eqiad1.wikimedia.cloud due to missing profile::kubernetes::deployment_server::user_defaults
Closed, Resolved. Public.

Description

deploy-1002 is a Deployment server for MediaWiki and related code (deployment_server)
The last Puppet run was at Tue Sep 28 09:45:35 UTC 2021 (35943 minutes ago).

Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Function lookup() did not find a value for the name 'profile::kubernetes::deployment_server::user_defaults' (file: /etc/puppet/modules/profile/manifests/kubernetes/deployment_server.pp, line: 2) on node deploy-1002.devtools.eqiad.wmflabs

The hiera key profile::kubernetes::deployment_server::user_defaults has been added by https://gerrit.wikimedia.org/r/c/operations/puppet/+/723419 . We currently have:

hieradata/common/profile/kubernetes/deployment_server.yaml:profile::kubernetes::deployment_server::user_defaults:
hieradata/common/profile/kubernetes/deployment_server.yaml-  owner: mwdeploy
hieradata/common/profile/kubernetes/deployment_server.yaml-  group: wikidev
hieradata/common/profile/kubernetes/deployment_server.yaml-  mode: "0640"
--
hieradata/role/common/ci/master.yaml:profile::kubernetes::deployment_server::user_defaults:
hieradata/role/common/ci/master.yaml-  owner: jenkins-slave
hieradata/role/common/ci/master.yaml-  group: contint-admins
hieradata/role/common/ci/master.yaml-  mode: "0440"
--
hieradata/role/common/releases.yaml:profile::kubernetes::deployment_server::user_defaults:
hieradata/role/common/releases.yaml-  group: wikidev
hieradata/role/common/releases.yaml-  mode: "0640"
hieradata/role/common/releases.yaml-  owner: jenkins-slave
--
hieradata/role/common/releases.yaml:profile::kubernetes::deployment_server::user_defaults:
hieradata/role/common/releases.yaml-  owner: jenkins-slave
hieradata/role/common/releases.yaml-  group: wikidev
hieradata/role/common/releases.yaml-  mode: "0640"

I have no idea what sane values should be used for a WMCS instance. Maybe that got solved on Beta-Cluster-Infrastructure via Horizon?
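The error above is what a first-match lookup produces when no layer of the hierarchy defines the key. A minimal Python sketch of that behaviour follows, assuming a simplified model in which each layer is a plain dictionary searched in order. The `lookup` helper and the layer contents are illustrative only, not Puppet's actual implementation, and the values are copied from the hieradata/common entry quoted above rather than being a recommendation for WMCS.

```python
# Simplified model of a first-match, Hiera-style lookup (illustration only).
# Real Hiera also supports merge strategies, defaults, and interpolation;
# none of that is modelled here.

KEY = "profile::kubernetes::deployment_server::user_defaults"


def lookup(key, layers):
    """Return the value from the first layer that defines `key`.

    Raises KeyError when no layer defines it, which roughly corresponds to
    the "did not find a value for the name" evaluation error.
    """
    for layer in layers:
        if key in layer:
            return layer[key]
    raise KeyError(f"lookup() did not find a value for the name '{key}'")


# Hypothetical hierarchy for an instance where no layer defines the key.
instance_layers = [{}, {}]

# The same hierarchy once one layer provides the key.
fixed_layers = [
    {KEY: {"owner": "mwdeploy", "group": "wikidev", "mode": "0640"}},
    {},
]
```

With `instance_layers` the call `lookup(KEY, instance_layers)` raises `KeyError`; with `fixed_layers` it returns the dictionary of defaults.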

Event Timeline

hashar renamed this task from Puppet failure on deploy-1002.devtools.eqiad1.wikimedia.cloud to Puppet failure on deploy-1002.devtools.eqiad1.wikimedia.cloud due to missing profile::kubernetes::deployment_server::user_defaults. Oct 23 2021, 8:57 AM

Beta-Cluster-Infrastructure has it set via Horizon:

profile::kubernetes::deployment_server::user_defaults:
  group: wikidev
  mode: '0640'
  owner: mwdeploy

I have added the same bits to the devtools project. It now fails further down when invoking kafka_config:

Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Evaluation Error: Error while evaluating a Function Call, undefined method `[]' for nil:NilClass (file: /etc/puppet/modules/profile/manifests/kubernetes/deployment_server/mediawiki.pp, line: 44, column: 21) on node deploy-1002.devtools.eqiad.wmflabs
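An error of the form undefined method `[]' for nil:NilClass typically means that code indexed into a value that was never set. The Python analogue below is a rough sketch of that failure mode. The helper names and the `kafka_clusters` and `brokers` key names are hypothetical and are not taken from the Puppet repository.

```python
def cluster_brokers(hiera, cluster_name):
    """Naive version: indexes into whatever the lookup returned."""
    clusters = hiera.get("kafka_clusters")  # hypothetical key; None when unset
    # Raises TypeError when `clusters` is None, the Python counterpart of
    # Ruby's "undefined method `[]' for nil:NilClass".
    return clusters[cluster_name]["brokers"]


def cluster_brokers_checked(hiera, cluster_name):
    """Same lookup, but reports which key is missing before indexing."""
    clusters = hiera.get("kafka_clusters")
    if clusters is None:
        raise KeyError("missing Hiera key: kafka_clusters")
    if cluster_name not in clusters:
        raise KeyError(f"no Kafka cluster named {cluster_name!r}")
    return clusters[cluster_name]["brokers"]
```

The checked variant fails with a message that names the missing key, which is easier to act on than the bare nil error.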

Change 737997 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] cloud/devtools: sync Horizon Hiera with repo Hiera

https://gerrit.wikimedia.org/r/737997

Change 737997 merged by Dzahn:

[operations/puppet@production] cloud/devtools: sync Horizon Hiera with repo Hiera

https://gerrit.wikimedia.org/r/737997

Change 738002 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] cloud/devtools: fix puppet run on deploy-1002, add missing kafka/zookeeper keys

https://gerrit.wikimedia.org/r/738002

Change 738002 merged by Dzahn:

[operations/puppet@production] cloud/devtools: fix puppet run on deploy-1002, add missing kafka/zookeeper keys

https://gerrit.wikimedia.org/r/738002

> Beta-Cluster-Infrastructure has it set via Horizon:

One reason to create a separate project for this was to avoid repeating what we ended up doing in deployment_prep, with Hiera data in multiple places. Everything here used to be in the repo on purpose. Some values added in the web UI were duplicates, others were not. I am trying to clean that up again, sync the repo with reality, and keep Horizon clean.

The Puppet run finished on deploy-1002 again for the first time in a while and caught up applying a bunch of things.

Notice: Applied catalog in 13.95 seconds

However, host names in deployment-prep have disappeared in the meantime, so ferm still has issues setting up iptables properly.

It's hard to keep a deployment server in Cloud VPS up and running when things change in both prod and beta, and either can break it in other projects.

Change 738005 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] cloud/devtools: fix resolv.conf search path (wmflabs->wikimedia.cloud)

https://gerrit.wikimedia.org/r/738005

Change 738005 merged by Dzahn:

[operations/puppet@production] cloud/devtools: fix resolv.conf search path (wmflabs->wikimedia.cloud)

https://gerrit.wikimedia.org/r/738005

Dzahn claimed this task.

Horizon Hiera empty, repo Hiera adjusted. Puppet fixed.

The ferm rules could not come up because the search path in resolv.conf was still wmflabs and needed to be updated to wikimedia.cloud.

dzahn@deploy-1002:~$ sudo puppet agent -tv
Info: Using configured environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Retrieving locales
Info: Loading facts
Info: Caching catalog for deploy-1002.devtools.eqiad.wmflabs
Info: Applying configuration version '(cb1c074bae) Dzahn - cloud/devtools: fix resolv.conf search path (wmflabs->wikimedia.cloud)'
Notice: The LDAP client stack for this host is: classic/sudoldap
Notice: /Stage[main]/Profile::Ldap::Client::Labs/Notify[LDAP client stack]/message: defined 'message' as 'The LDAP client stack for this host is: classic/sudoldap'
Notice: Applied catalog in 14.01 seconds
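
The resolv.conf problem described above can be checked with a small script. This is a minimal sketch, assuming the file content is passed in as text. The helper names and the sample search domains are illustrative and were not copied from the host.

```python
def search_domains(resolv_conf_text):
    """Return the domains listed on the last `search` line, if any.

    Per resolv.conf(5), only the last `search` directive takes effect.
    """
    domains = []
    for line in resolv_conf_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped[0] in "#;":
            continue  # skip blank lines and comments
        fields = stripped.split()
        if fields[0] == "search":
            domains = fields[1:]
    return domains


def uses_legacy_domain(resolv_conf_text, legacy_suffix="wmflabs"):
    """True if any search domain is, or ends with, the legacy suffix."""
    return any(
        d == legacy_suffix or d.endswith("." + legacy_suffix)
        for d in search_domains(resolv_conf_text)
    )


# Illustrative file contents before and after the fix (assumed, not real).
before = "search devtools.eqiad.wmflabs\nnameserver 192.0.2.1\n"
after = "search devtools.eqiad1.wikimedia.cloud\nnameserver 192.0.2.1\n"
```

On a real host the text could be read with `open("/etc/resolv.conf").read()` and passed to `uses_legacy_domain`.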

Thanks for the missing Kafka / Zookeeper settings, I could not figure them out \o/