compile/diff catalogs between puppetdb v2 (production) and puppetdb v4
Closed, Resolved (Public)

Description

Compile catalogs for all hosts using both puppetdb v2 and puppetdb v4 backends, then diff them to ensure no unexpected changes will happen after upgrading.

Event Timeline

herron triaged this task as Medium priority. Feb 28 2018, 8:11 PM
herron created this task.

While the catalog-compiler (T187258) has been useful to test compilation under the new version of puppetdb, I haven't found a straightforward way with this tool to compile and diff catalogs using one puppetdb/terminus for "production" and another for "change".

Also complicating this is the puppetdbquery module: our current version does not support a puppetdb4 backend, and newer versions are not backwards compatible with puppetdb v2.

So I'll try setting up a separate puppet master with puppetdb v4 and puppetdbquery 3.0.1. Then, from a VM, use octocatalog-diff to bulk compile/diff catalogs between the production puppet masters and a puppetdb4 master.

Change 415382 had a related patch set uploaded (by Herron; owner: Herron):
[operations/dns@master] add forward/reverse records for new ganeti VM elnath

https://gerrit.wikimedia.org/r/415382

Change 415382 merged by Herron:
[operations/dns@master] add forward/reverse records for new ganeti VM elnath

https://gerrit.wikimedia.org/r/415382

Change 415452 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] install_server: add dhcp and netboot entries for ganeti VM elnath

https://gerrit.wikimedia.org/r/415452

Change 415452 merged by Herron:
[operations/puppet@production] add dhcp netboot and site.pp entries for ganeti VM elnath

https://gerrit.wikimedia.org/r/415452

Why not use a separate environment for the newer puppetdbquery version?

That could allow you to do the testing without having to fiddle with the sources.

The newer version of puppetdbquery requires newer puppetdb which in turn requires the newer puppetdb-terminus package to be installed on the master. The newer versions of both puppetdb-terminus (now called termini) and puppetdbquery are not backwards compatible with puppetdb2 afaict

Wow they really messed up the migration path, didn't they?

It's probably a good idea then to think of a transition procedure now, as it's going to be tricky.

Octocatalog-diff is set up on elnath.codfw.wmnet with a local /etc/puppet/auth.conf hack in place on the 3 eqiad puppet masters to allow elnath to fetch catalogs for other nodes (in order to diff them). It's simply the line allow elnath.codfw.wmnet at the bottom of the section beginning with path ~ ^/puppet/v3/catalog/([^/]+)$

The simplest way to diff is by using the wrapper script in /root. Currently it compares puppetmaster1001.eqiad.wmnet (production) to puppetmaster.test.eqiad.wmnet (rhodium with stretch/hiera 3). It can be run like so:

elnath:~# bash octocatalog-diff.sh bast1001.wikimedia.org | grep INFO
I, [2018-03-02T00:11:56.955485 #10386]  INFO -- : Catalogs compiled for bast1001.wikimedia.org
I, [2018-03-02T00:11:57.972950 #10386]  INFO -- : Diffs computed for bast1001.wikimedia.org
I, [2018-03-02T00:11:57.972963 #10386]  INFO -- : No differences

Remove the grep for full debug output

Options can be changed by editing this script, or editing /root/.octocatalog-diff.cfg.rb. Settings that are less likely to change are set in the dotfile.

A bulk compile/diff of all hosts is running now in a screen session with output logged to /root/log/prod_vs_rhodium-20180301.log. This run should give us a clearer picture of which hosts have issues with hiera 3 on stretch.

Important to note -- bulk compilation can be taxing on the puppet master infrastructure, so there is a long 30s sleep between diff commands in the loop.
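A throttled bulk loop of the kind described above might look roughly like the following. This is only a sketch, not the actual loop running on elnath: the function name bulk_diff, the hosts-file format, and the argument order are all assumptions made for illustration. Only the wrapper invocation (bash octocatalog-diff.sh HOST) and the 30s pause come from this task.

```shell
# Hypothetical helper: run a diff command once per host, pausing between hosts
# so the puppet masters are not asked to compile catalogs back to back.
#
# usage: bulk_diff HOSTS_FILE [PAUSE_SECONDS] [DIFF_COMMAND...]
#   HOSTS_FILE     one FQDN per line; blank lines and lines starting with '#' are skipped
#   PAUSE_SECONDS  delay after each host (defaults to 30)
#   DIFF_COMMAND   command that receives the host as its last argument
#                  (defaults to: bash octocatalog-diff.sh)
bulk_diff() {
    hosts_file=$1
    pause=${2:-30}
    if [ ! -r "$hosts_file" ]; then
        echo "bulk_diff: cannot read hosts file '$hosts_file'" >&2
        return 2
    fi
    # Drop the first two arguments; whatever remains is the diff command.
    if [ "$#" -ge 2 ]; then shift 2; else shift "$#"; fi
    [ "$#" -gt 0 ] || set -- bash octocatalog-diff.sh

    failures=0
    while IFS= read -r host; do
        case $host in ''|'#'*) continue ;; esac
        echo "=== $host ==="
        # stdin is redirected so the command cannot swallow the rest of the host list.
        "$@" "$host" </dev/null || failures=$((failures + 1))
        sleep "$pause"
    done < "$hosts_file"

    # Succeed only if every invocation exited 0.
    [ "$failures" -eq 0 ]
}
```

It could be called as bulk_diff hosts.txt 30 inside the screen session, with output redirected to a log file. What a non-zero exit from the diff command means (compile failure versus differences found) depends on the command being wrapped, so the function only counts such exits and does not interpret them.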

Change 415813 had a related patch set uploaded (by Volans; owner: Volans):
[operations/puppet@production] Puppet: temporary allow elnath to retrieve catalogs

https://gerrit.wikimedia.org/r/415813

Change 415813 merged by Volans:
[operations/puppet@production] Puppet: temporary allow elnath to retrieve catalogs

https://gerrit.wikimedia.org/r/415813

Change 417376 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] change rhodium to use puppetdb4 server puppetdb1001

https://gerrit.wikimedia.org/r/417376

Change 417376 merged by Herron:
[operations/puppet@production] change rhodium to use puppetdb4 server puppetdb1001

https://gerrit.wikimedia.org/r/417376

Rhodium (depooled) is now using the puppetdb v4 backend puppetdb1001.

Since the puppetdbquery version in production is incompatible with puppetdb v4, and the new version is not backward compatible with v2, I've pointed /etc/puppet/modules at a copy of https://gerrit.wikimedia.org/r/#/c/410050/ and stopped the puppet agent on rhodium. Interested to find a better way to handle this so the patch is applied as the production branch is updated. Maybe some modification to puppet-merge on rhodium?

A bulk octocatalog-diff has been started to compare production to rhodium. It will take at least two passes to gather useful diffs since the first pass is populating the new (empty) puppetdb.
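After a pass completes, the bulk log can be reduced to the list of hosts that still differ. The sketch below assumes the log contains INFO lines in the same format as the bast1001 example earlier in this task ("Diffs computed for HOST", followed by "No differences" when the catalogs match). The function name summarize_diff_log is hypothetical. How octocatalog-diff reports hosts that do differ is not shown in this task, so the script simply treats the absence of "No differences" as a difference.

```shell
# Hypothetical helper: list hosts whose "Diffs computed" line was not followed
# by "No differences" before the next host started, then print totals.
#
# usage: summarize_diff_log LOGFILE
summarize_diff_log() {
    awk '
        /INFO -- : Diffs computed for / {
            # A new host begins; if the previous one never reported
            # "No differences", count it as changed.
            if (pending) { print "DIFF " host; changed++ }
            host = $NF
            pending = 1
            next
        }
        /INFO -- : No differences/ {
            if (pending) { clean++; pending = 0 }
            next
        }
        END {
            if (pending) { print "DIFF " host; changed++ }
            printf "clean=%d changed=%d\n", clean, changed
        }
    ' "$1"
}
```

Because the first pass runs against an empty puppetdb, comparing the summaries of two consecutive passes is likely more informative than reading either one alone.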

rhodium is (or was, after Filippo's change https://gerrit.wikimedia.org/r/420667) a production puppet master in a bad state: local changes to real.pp made a commit to that file break the merging process, causing subsequent commits to fail (not only on that server but on all codfw ones, as merging follows the 1001 -> 1002 -> rhodium -> 2001 -> 2002 order). As mentioned below, it has been depooled from the trigger, so it is stale now (but that is better than all servers being stale).

We need to make sure that regular merges can continue even if a "fork" is happening there. We should remember to revert the above patch when the workflow is not obstructed.

Also, apparently puppet has been disabled for a long time, maybe it should be run from time to time?

I'll clarify what https://gerrit.wikimedia.org/r/420667 did: namely, it removed rhodium from getting puppet-merge updates, though rhodium already wasn't serving puppet traffic (i.e. offline: true in hieradata).

There are different issues at play here, as you pointed out. One is that IMO puppet-merge should have failed much earlier when presented with an unclean git working directory like rhodium had, not only when git merge --ff-only would cause a merge conflict. Alternatively, we could alert on unclean git working copies. Also, in cases like this it is simple enough to "retry" a failed puppet-merge by passing the sha1 intended for merge (which is what puppet-merge does on non-local machines).

The other issue is long-disabled puppet agent in general, which of course we should minimize as much as possible.
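The first point above (refusing to merge into an unclean working copy, or alerting on one) could be prototyped with a small pre-flight check like the one below. This is a sketch of the idea only and is not part of puppet-merge. The function name require_clean_worktree is hypothetical, and ignoring untracked files is an assumption about what should count as "unclean".

```shell
# Hypothetical pre-flight check: fail if the git working copy has local
# modifications to tracked files (staged or unstaged).
#
# usage: require_clean_worktree [REPO_DIR]
# returns: 0 if clean, 1 if dirty, 2 if git itself failed (e.g. not a repository)
require_clean_worktree() {
    repo=${1:-.}
    # --porcelain gives stable, script-friendly output; empty output means clean.
    dirty=$(git -C "$repo" status --porcelain --untracked-files=no) || return 2
    if [ -n "$dirty" ]; then
        echo "refusing to merge: uncommitted local changes in $repo:" >&2
        echo "$dirty" >&2
        return 1
    fi
}
```

The same function could back a monitoring check, since its exit status distinguishes a dirty tree from a failure to run git at all.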

The puppet agent on rhodium has been re-enabled, so we should be in a good place to revert https://gerrit.wikimedia.org/r/420667 and prepare rhodium to be re-pooled (when eqiad puppet masters are re-pooled)

Resolving since the original task (run catalog diffs in production) has been completed

@herron This task is done, is elnath still used for anything? Otherwise can you decom the Ganeti VM?

Change 579034 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] puppetmaster: stop allowing elnath

https://gerrit.wikimedia.org/r/579034

Change 579034 merged by Jbond:
[operations/puppet@production] puppetmaster: stop allowing elnath

https://gerrit.wikimedia.org/r/579034

Mentioned in SAL (#wikimedia-operations) [2020-03-18T19:21:28Z] <mutante> shutting down (decom cookbook) elnath.codfw.wmnet (T188544)

cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: elnath.codfw.wmnet

  • elnath.codfw.wmnet (PASS)
    • Downtimed host on Icinga
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster ganeti01.svc.codfw.wmnet to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed

Change 581060 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/dns@master] remove elnath.codfw.wmnet

https://gerrit.wikimedia.org/r/581060

Change 581060 merged by Dzahn:
[operations/dns@master] remove elnath.codfw.wmnet

https://gerrit.wikimedia.org/r/581060