
Hosts distribution across puppetmasters
Open, Medium, Public

Description

While debugging puppet or puppetdb issues it would be useful to know easily which hosts are talking to which puppetmaster. That is not straightforward, because the puppet configuration just has puppet as the hostname, and its resolution depends entirely on the order of the search parameter in /etc/resolv.conf.

So for example on ulsfo, running ping -c1 puppet | head -n1 we get:

(6) bast4003.wikimedia.org,dns[4001-4002].wikimedia.org,doh[4001-4002].wikimedia.org,install4001.wikimedia.org
----- OUTPUT of 'ping -c1 puppet | head -n1' -----
PING puppet(puppetmaster1001.eqiad.wmnet (2620:0:861:102:10:64:16:73)) 56 data bytes
===== NODE GROUP =====
(25) cp[4021-4032].ulsfo.wmnet,durum[4001-4002].ulsfo.wmnet,ganeti[4001-4004].ulsfo.wmnet,lvs[4005-4007].ulsfo.wmnet,ncredir[4001-4002].ulsfo.wmnet,netflow4001.ulsfo.wmnet,prometheus4001.ulsfo.wmnet
----- OUTPUT of 'ping -c1 puppet | head -n1' -----
PING puppet(puppetmaster2001.codfw.wmnet (2620:0:860:101:10:192:0:27)) 56 data bytes

And looking at their /etc/resolv.conf:

===== NODE GROUP =====
(1) prometheus4001.ulsfo.wmnet
----- OUTPUT of 'grep search /etc/resolv.conf' -----
search ulsfo.wmnet wikimedia.org eqiad.wmnet codfw.wmnet esams.wmnet eqsin.wmnet drmrs.wmnet
===== NODE GROUP =====
(1) bast4003.wikimedia.org
----- OUTPUT of 'grep search /etc/resolv.conf' -----
search wikimedia.org eqiad.wmnet codfw.wmnet esams.wmnet ulsfo.wmnet eqsin.wmnet
===== NODE GROUP =====
(5) dns[4001-4002].wikimedia.org,doh[4001-4002].wikimedia.org,install4001.wikimedia.org
----- OUTPUT of 'grep search /etc/resolv.conf' -----
search wikimedia.org
===== NODE GROUP =====
(24) cp[4021-4032].ulsfo.wmnet,durum[4001-4002].ulsfo.wmnet,ganeti[4001-4004].ulsfo.wmnet,lvs[4005-4007].ulsfo.wmnet,ncredir[4001-4002].ulsfo.wmnet,netflow4001.ulsfo.wmnet
----- OUTPUT of 'grep search /etc/resolv.conf' -----
search ulsfo.wmnet
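
To make the dependency on the search order concrete, here is a minimal Python sketch of how a short name like puppet is expanded from a search line. The function name candidate_names is hypothetical, and the model is deliberately simplified: it ignores ndots and the final attempt at the bare name, and it does not check which of the candidates actually exist in DNS.

```python
def candidate_names(search_line: str, name: str = "puppet") -> list[str]:
    """Return the FQDNs tried, in order, for a short name.

    Simplification (assumption): only the `search` list is modelled;
    `ndots` and the final lookup of the bare name are ignored.
    """
    fields = search_line.split()
    if not fields or fields[0] != "search":
        raise ValueError("expected a resolv.conf 'search' line")
    return [f"{name}.{domain}" for domain in fields[1:]]


# Search lines copied from the grep output above.
for line in (
    "search ulsfo.wmnet",
    "search wikimedia.org",
    "search wikimedia.org eqiad.wmnet codfw.wmnet esams.wmnet ulsfo.wmnet eqsin.wmnet",
):
    # The first candidate is the one tried first by the resolver.
    print(candidate_names(line)[0])
```

The first candidate that resolves wins, so two hosts in the same site with a different first search domain can end up on different puppetmasters.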

I'm wondering if instead we should be more explicit and populate puppet.conf explicitly with the FQDN of the assigned puppetmaster directly from puppet.
The generic puppet hostname might still be useful in some cases, like for the first puppet run after a reimage.
Thoughts?


Event Timeline

Volans triaged this task as Medium priority. Sep 22 2021, 7:26 AM
Volans created this task.
Restricted Application added a subscriber: Aklapper.
Volans renamed this task from Host distribution across puppetmasters to Hosts distribution across puppetmasters. Sep 22 2021, 7:32 AM

its resolution depends entirely on the order of the search

Seems like hosts don't have $site.wmnet as the first entry in the search path; this is probably an easy fix. But at the same time I wonder if DNS discovery addresses *without* LVS might be a better option?

I'm wondering if instead we should be more explicit and populate puppet.conf explicitly with the FQDN of the assigned puppetmaster directly from puppet.

I think moving to this method would add complexity and make it more difficult to fail over. There is a, likely minor, issue that the certificate at puppet currently only has a puppet CN. But the more practical issue is that we can currently fail over the primary puppet server for any site within 300 seconds (the DNS TTL). If we change to managing this via puppet, we would need to wait a minimum of 30 minutes, and we could end up with some split-brain issues, e.g.:

  • sretest1001 is down (unrelated)
  • we make a change so puppetmaster2001 is the primary for all sites
  • wait 30-60 minutes
  • shutdown puppetmaster1001
  • sretest1001 comes back online

At this point sretest1001 will still have the old puppet config pointing to puppetmaster1001, which is now down, so it can't reach puppet and is unable to fix itself. In short, I don't think we should have the failover method for the puppetmaster dependent on puppet.
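
The scenario above can be sketched as a small worst-case check. This is only an illustration under stated assumptions: the function name picks_up_new_master is hypothetical, the 30 minute figure from the comment is treated as the puppet run interval, and a host is assumed to need one full interval of uptime inside the change window to be sure of applying the new config.

```python
RUN_INTERVAL = 30 * 60  # seconds; the 30 minute figure quoted above (assumed run interval)


def picks_up_new_master(back_online_at: float, change_at: float,
                        shutdown_at: float,
                        run_interval: float = RUN_INTERVAL) -> bool:
    """Worst-case check: is the host guaranteed one puppet run between the
    config change and the shutdown of the old puppetmaster?

    All times are seconds from an arbitrary start; use back_online_at=0
    for a host that was never down.
    """
    usable_from = max(change_at, back_online_at)
    return shutdown_at - usable_from >= run_interval


change_at = 0
shutdown_at = 60 * 60  # old puppetmaster shut down after waiting 60 minutes

# A healthy host has the whole window to run puppet.
print(picks_up_new_master(0, change_at, shutdown_at))
# A host that comes back only after the shutdown keeps the stale config.
print(picks_up_new_master(70 * 60, change_at, shutdown_at))
```

With DNS-based failover the equivalent bound is the record TTL (300 seconds in the comment above), and a host that was down during the change simply resolves the new target when it comes back.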

its resolution depends entirely on the order of the search

Seems like hosts don't have $site.wmnet as the first entry in the search path; this is probably an easy fix. But at the same time I wonder if DNS discovery addresses *without* LVS might be a better option?

Yes, *.wikimedia.org hosts have wikimedia.org as their first entry, and we have puppet.wikimedia.org as a CNAME too.
A discovery record might be an option too here, I agree.

I'm wondering if instead we should be more explicit and populate puppet.conf explicitly with the FQDN of the assigned puppetmaster directly from puppet.

In short, I don't think we should have the failover method for the puppetmaster dependent on puppet.

I totally agree, I wasn't suggesting using puppet as the failover mechanism. I meant having the FQDN of the CNAMEs there, which can still be failed over as needed via DNS.

Basically, instead of puppet, have puppet.eqiad.wmnet, which might point to a different host, even in a different datacenter, for temporary failover purposes.

Basically, instead of puppet, have puppet.eqiad.wmnet, which might point to a different host, even in a different datacenter, for temporary failover purposes.

Ahh, I see what you mean, and agree that would be a good improvement.
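
As a rough illustration of the agreed direction, the snippet below renders a puppet.conf fragment whose server setting is a per-site CNAME rather than the bare puppet name. Everything beyond puppet.eqiad.wmnet is an assumption: the helper render_agent_conf is hypothetical, the puppet.<site>.wmnet pattern is extrapolated from the single example in the thread, and placing the setting in the [agent] section is just one possible choice.

```python
import configparser
import io


def render_agent_conf(site: str) -> str:
    """Render a minimal puppet.conf fragment pointing at a per-site CNAME.

    Assumption: the CNAME follows the pattern puppet.<site>.wmnet, as in
    the puppet.eqiad.wmnet example from the discussion.
    """
    parser = configparser.ConfigParser()
    parser["agent"] = {"server": f"puppet.{site}.wmnet"}
    buf = io.StringIO()
    parser.write(buf)
    return buf.getvalue()


print(render_agent_conf("eqiad"))
```

Because the value is still a CNAME, failover stays a DNS change; the config only makes explicit which site's record a host follows, independent of its search order.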