Provide second acmechief server configured for Puppet 7 in eqiad
Closed, Resolved (Public)

Description

In T349915, acmechief2002 was configured to use the Puppet 7 environment. Hosts and roles that have been migrated to Puppet 7 currently point at this server by setting "acmechief_host: acmechief2002.codfw.wmnet" in hieradata/hosts/foo.yaml or hieradata/role/common/foo.yaml.

Currently we have 933 hosts running Puppet 7 and 1254 on Puppet 5. With some bigger clusters up for migration, we'll soon reach the point where we have more hosts on Puppet 7 than on Puppet 5.

As such, we should set up a second Puppet 7-enabled acmechief server in eqiad as well. If codfw (or even just the codfw Ganeti cluster) is unavailable, we'd then at least be able to fail over to the eqiad host by submitting a Puppet change that moves all Hiera settings from acmechief2002 to acmechief100x.
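
The failover itself would be a reviewed change to operations/puppet. As a rough sketch of what that change amounts to, assuming the setting always appears as a plain `acmechief_host: <fqdn>` line in YAML files under hieradata/, the rewrite could look something like this. The helper name `repoint_acmechief` is hypothetical and not part of any existing tooling.

```python
import re
from pathlib import Path

# Hiera key named in the task description.
KEY = "acmechief_host"


def repoint_acmechief(hieradata_root: Path, old_host: str, new_host: str) -> list[Path]:
    """Rewrite `acmechief_host: <old_host>` lines to point at <new_host>.

    Hypothetical helper: it only touches lines whose key is exactly
    `acmechief_host` and whose value is exactly `old_host` (unquoted).
    Returns the list of files that were modified.
    """
    pattern = re.compile(
        rf"^([ \t]*{re.escape(KEY)}:[ \t]*){re.escape(old_host)}[ \t]*$",
        re.MULTILINE,
    )
    changed = []
    for path in sorted(hieradata_root.rglob("*.yaml")):
        text = path.read_text()
        # Keep the original indentation and key, swap only the value.
        new_text = pattern.sub(lambda m: m.group(1) + new_host, text)
        if new_text != text:
            path.write_text(new_text)
            changed.append(path)
    return changed
```

For example, `repoint_acmechief(Path("hieradata"), "acmechief2002.codfw.wmnet", "acmechief1002.eqiad.wmnet")` would return the files it touched, which could then be reviewed as a normal patch. Matching on the full key and value, rather than a blanket search-and-replace of the hostname, avoids rewriting unrelated settings that happen to mention acmechief2002.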

Event Timeline

KOfori added subscribers: BCornwall, KOfori.

Hi @BCornwall, can you take care of this?

@KOfori I can, but I want to point out that, unless I'm mistaken, the number of hosts that actually use acme-chief is much smaller than the numbers put forth:

$ sudo -i cumin 'R:acme_chief::cert' 'puppet --version'

shows 78 clients on 7 and 117 still on 5.
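
For anyone repeating the count, the tally boils down to grouping hosts by the major version they report. A minimal sketch, assuming the per-host version strings have already been collected into a dict (the function name, host names and version strings below are illustrative, not real output):

```python
from collections import Counter


def tally_major_versions(versions_by_host: dict[str, str]) -> Counter:
    """Count hosts per Puppet major version.

    `versions_by_host` maps a hostname to the string printed by
    `puppet --version` on that host (hypothetical input format).
    """
    return Counter(
        version.strip().split(".")[0] for version in versions_by_host.values()
    )


# Illustrative data only.
sample = {
    "host-a.example": "7.23.0",
    "host-b.example": "5.5.22",
    "host-c.example": "7.23.0\n",
}
print(tally_major_versions(sample))
```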

@Vgutierrez, I'd love your opinion on whether we should wait for some time before going forward with this.

I don't think load is the problem, as our puppetization doesn't balance Puppet API requests between different acme-chief hosts. But as @MoritzMuehlenhoff mentions in the description: if we have some kind of incident in codfw, Puppet 7 hosts lose the ability to fetch/refresh acme-chief TLS material.

Given that we have some acme-chief clients running Buster (alert[1001,2001].wikimedia.org, apt[1001,2001].wikimedia.org, archiva1002.wikimedia.org, lists1001.wikimedia.org, seaborgium.wikimedia.org, serpens.wikimedia.org), it looks like this shared Puppet 5 / Puppet 7 scenario will be around for some time, and we need to provide a reliable setup for Puppet 7 clients during this transition.

Exactly. lists1001 in particular will be a complex update that is unlikely to be completed within the coming 1-2 months. Ideally we just build out the acmechief Puppet 7/Bookworm setup in parallel, mirroring the Puppet 5/Bullseye setup, and once all acme-chief clients using Puppet 5 are updated we decommission the old Bullseye hosts?

Change 982926 had a related patch set uploaded (by BCornwall; author: BCornwall):

[operations/puppet@production] Create hieradata for host acmechief1002

https://gerrit.wikimedia.org/r/982926

Change 982926 merged by BCornwall:

[operations/puppet@production] Create hieradata for host acmechief1002

https://gerrit.wikimedia.org/r/982926

BCornwall changed the task status from Open to In Progress. (Dec 14 2023, 7:50 PM)
BCornwall triaged this task as High priority.
BCornwall moved this task from Ready for work to Traffic team actively servicing on the Traffic board.

Change 983276 had a related patch set uploaded (by BCornwall; author: BCornwall):

[operations/puppet@production] site.pp: Add acmechief1002

https://gerrit.wikimedia.org/r/983276

Change 983276 merged by BCornwall:

[operations/puppet@production] site.pp: Add acmechief1002

https://gerrit.wikimedia.org/r/983276

Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host acmechief1002.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host acmechief1002.eqiad.wmnet with OS bookworm completed:

  • acmechief1002 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Set boot media to disk
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202312142302_brett_788018_acmechief1002.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

acmechief1002 has been deployed with Puppet 7. It's now available for a switchover should the need arise.

There's one active alert, is that known/expected?

FILE_AGE CRITICAL: File not found - /var/lib/acme-chief/certs/.rsync.status
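
The alert is a file-age check on the rsync status file. A minimal sketch of that kind of check, assuming Nagios-style exit codes (0 OK, 1 WARNING, 2 CRITICAL); the function name and the thresholds are hypothetical, and only the "File not found" message is taken from the alert above:

```python
import time
from pathlib import Path


def check_file_age(path, warn_seconds, crit_seconds, now=None):
    """Return (exit_code, message) based on how old `path` is.

    Hypothetical re-implementation for illustration; the production
    check and its thresholds may differ.
    """
    now = time.time() if now is None else now
    p = Path(path)
    if not p.exists():
        # A missing file is treated as critical, as in the alert above.
        return 2, f"FILE_AGE CRITICAL: File not found - {path}"
    age = int(now - p.stat().st_mtime)
    if age > crit_seconds:
        return 2, f"CRITICAL: {path} is {age} seconds old"
    if age > warn_seconds:
        return 1, f"WARNING: {path} is {age} seconds old"
    return 0, f"OK: {path} is {age} seconds old"
```

On a freshly imaged host that has never received a sync, the status file does not exist yet, so a check like this reports CRITICAL regardless of the thresholds.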

Per @BCornwall comment:

acmechief1002 has been deployed with Puppet 7. It's now available for a switchover should the need arise.

I'm guessing it's definitely not expected or known.

Change 983675 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] hiera: Add acmechief1002 to the list of acme-chief passive hosts

https://gerrit.wikimedia.org/r/983675

Mentioned in SAL (#wikimedia-operations) [2023-12-18T09:10:05Z] <vgutierrez> vgutierrez@acmechief1002:~$ sudo -i keyholder arm - T352242

Change 983675 merged by Vgutierrez:

[operations/puppet@production] hiera: Add acmechief1002 to the list of acme-chief passive hosts

https://gerrit.wikimedia.org/r/983675

@MoritzMuehlenhoff Sorry about that, acmechief1002 is now ready for service :)