
Hosts in puppet with $cluster missing from wikimedia_clusters
Closed, Resolved, Public

Description

Spotted this while investigating something else: it looks like at least the acme-chief hosts are not in Prometheus. The bigger issue is that it is now possible to silently set $cluster to an invalid value (i.e. one not in wikimedia_clusters).
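
A minimal sketch of the kind of sanity check that would catch this, written in Python purely for illustration (it is not the Puppet code from the patches on this task). The function name `check_cluster`, the `EXAMPLE_CLUSTERS` data, and the assumption that wikimedia_clusters is a mapping keyed by cluster name are all hypothetical; only the error wording is taken from the puppet agent output quoted in this task.

```python
# Hypothetical stand-in for the wikimedia_clusters hiera data.
# Assumption: it is a mapping keyed by cluster name.
EXAMPLE_CLUSTERS = {
    "acmechief": {"description": "hypothetical entry"},
}


def check_cluster(cluster, wikimedia_clusters):
    """Return cluster if it is a known cluster name, otherwise raise.

    Failing loudly here is the point: an unknown value should not be
    accepted silently.
    """
    if cluster not in wikimedia_clusters:
        raise ValueError(
            f"Cluster {cluster} not defined in wikimedia_clusters"
        )
    return cluster


if __name__ == "__main__":
    for name in ("acmechief", "misc"):
        try:
            check_cluster(name, EXAMPLE_CLUSTERS)
            print(f"{name}: ok")
        except ValueError as err:
            print(f"{name}: {err}")
```

Note that a check like this is only as good as the data it validates against: every environment that compiles catalogs needs a complete wikimedia_clusters, otherwise valid hosts start failing.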

Event Timeline

Change 539927 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: add acmechief cluster

https://gerrit.wikimedia.org/r/539927

Change 539927 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: add acmechief cluster

https://gerrit.wikimedia.org/r/539927

Change 539934 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] WIP profile: sanity checks for cluster

https://gerrit.wikimedia.org/r/539934

fgiunchedi renamed this task from "acme-chief hosts not in Prometheus" to "Hosts in puppet with $cluster missing from wikimedia_clusters". Oct 1 2019, 6:57 PM
fgiunchedi updated the task description.
jijiki triaged this task as Medium priority. Oct 14 2019, 2:10 PM

Change 539934 merged by Filippo Giunchedi:
[operations/puppet@production] profile: sanity checks for cluster

https://gerrit.wikimedia.org/r/539934

Change 544155 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: fix cluster inconsistencies

https://gerrit.wikimedia.org/r/544155

Change 544155 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: fix cluster inconsistencies

https://gerrit.wikimedia.org/r/544155

In CloudVPS, every VM I could check has this puppet agent error now:

aborrero@cloud-cumin-01:~$ sudo puppet agent -tv
Info: Using configured environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Function Call, Cluster misc not defined in wikimedia_clusters at /etc/puppet/modules/profile/manifests/base.pp:49:9 on node cloud-cumin-01.cloudinfra.eqiad.wmflabs
Warning: Not using cache on failed catalog
Error: Could not retrieve catalog; skipping run

Looks like this is causing a basically vanilla VM on CloudVPS to fail to run puppet (and therefore to fail to properly initialise ssh keys etc.).

wikidata-icinga.wikidata-dev.eqiad.wmflabs

rc.local[410]: Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Function Call, Cluster misc not defined in wikimedia_clusters at /etc/puppet/modules/profile/manifests/base.pp:49:9 on node wikidata-icinga.wikidata-dev.eqiad.wmflabs

Looks like it's common across cloudVPS

@aborrero on IRC:
<arturo> tarrow: this seems to be a wide-spread issue

Change 544166 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: fix wikimedia_clusters for wmcs

https://gerrit.wikimedia.org/r/544166

Change 544166 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: fix wikimedia_clusters for wmcs

https://gerrit.wikimedia.org/r/544166

fgiunchedi claimed this task.

Excellent, @aborrero! All looking good, boldly resolving.