
Failover puppet ca service from eqiad to codfw
Closed, Resolved, Public

Description

To rebuild puppetmaster1001 with stretch, we first need to fail over the puppet CA service to puppetmaster2001. This task tracks the preparation for that failover.

Puppet CA failover process for review

  1. Disable puppet across the fleet
    1. neodymium:~$ sudo cumin -p 95 -b 100 '*' "disable-puppet 'temporarily disabled for puppet ca relocation - T189891 - godog'"
  2. Ensure rsync/git (ca, private and volatile) destinations are up to date on puppetmaster2001
    1. /var/lib/puppet/server/ssl/ca
    2. /var/lib/puppet/volatile
    3. /srv/private/
  3. Make backup copies of puppetmaster[12]001:/var/lib/puppet to neodymium/sarin
  4. Merge change updating puppetmaster::ca_server: puppetmaster2001.codfw.wmnet in hiera (https://gerrit.wikimedia.org/r/c/420721/) in order to...
    1. Repoint puppet agents ca_server to puppetmaster2001.codfw.wmnet
    2. Repoint apache frontend proxypass entries to puppetmaster2001.codfw.wmnet
    3. Reverse the direction of the puppetmaster rsync to puppetmaster2001 -> puppetmaster1001
  5. Enable and run puppet on puppetmaster1001
  6. Enable and run puppet on puppetmaster2001
  7. Enable and run puppet on a few canary hosts (puppet agents)
  8. Enable and force puppet agent run across fleet
    1. open a screen/tmux on neodymium or sarin and run:
    2. sudo cumin -p 70 -b 15 '*' "run-puppet-agent -q -e 'temporarily disabled for puppet ca relocation - T189891 - godog'"
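
Step 2 above asks that the three destinations be up to date before the switch. One way to check this is to compare checksum manifests taken on each host. The following is only a sketch: the function names tree_manifest and manifests_match are hypothetical helpers, not part of any existing tooling, and it assumes GNU find, sort, xargs and sha256sum are available and that it is run as a user able to read the files.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not existing tooling) for comparing directory trees.

# Print a sorted "checksum  ./relative/path" manifest for every regular
# file under the directory given as $1.
tree_manifest() {
    local dir="$1"
    (cd "$dir" && find . -type f -print0 | sort -z | xargs -0 -r sha256sum)
}

# Return 0 when two manifest files produced by tree_manifest are identical,
# non-zero otherwise.
manifests_match() {
    diff -q "$1" "$2" >/dev/null
}
```

For example, one could run tree_manifest /var/lib/puppet/server/ssl/ca on puppetmaster1001 and on puppetmaster2001, save each output to a file, copy both to one host and pass them to manifests_match. The same applies to /var/lib/puppet/volatile and /srv/private/.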

Event Timeline

herron triaged this task as Medium priority. Mar 16 2018, 4:59 PM
herron created this task.
This comment was removed by herron.

Change 420705 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] puppetmaster: lock commits on /srv/private on non-master hosts

https://gerrit.wikimedia.org/r/420705

Change 420721 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: use puppetmaster2001 as ca_server

https://gerrit.wikimedia.org/r/420721

Mentioned in SAL (#wikimedia-operations) [2018-03-22T12:12:56Z] <godog> stopping puppet fleetwide for ca migration - T189891

Change 420721 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: use puppetmaster2001 as ca_server

https://gerrit.wikimedia.org/r/420721

Mentioned in SAL (#wikimedia-operations) [2018-03-22T12:20:39Z] <godog> running puppet on puppetmaster[21]001 - T189891

Mentioned in SAL (#wikimedia-operations) [2018-03-22T13:23:27Z] <godog> reenabling puppet fleetwide to enable CA switch - T189891

Mentioned in SAL (#wikimedia-operations) [2018-03-22T15:23:26Z] <ottomata> ran puppet-merge on puppetmaster2001, got ssh: connect to host puppetmaster1001.eqiad.wmnet port 22: Connection timed out, hope all is ok. T189891

Problems discovered today during the ca switchover:

  1. File ownership during the rsync of ca/volatile is apparently applied by numeric uid, not by user name, so some files ended up owned by nagios and puppet-master refused to start.
  2. We also need to rsync the "puppet" keypair used by the server, namely /var/lib/puppet/server/ssl/private_keys/puppet.pem and /var/lib/puppet/server/ssl/certs/puppet.pem, for puppet-server to start properly. The current theory is that this keypair was autogenerated once the master was in ca mode and thus did not match what the server had cached from puppetmaster1001.
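
Both problems can be checked for before starting the master. The sketch below is illustrative only: list_foreign_owned and keypair_matches are hypothetical helper names, the expected owner is passed in by the caller (the user name "puppet" in the usage note is an assumption), and the keypair check assumes the openssl command-line tool is installed.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not existing tooling) for post-rsync sanity checks.

# List every file or directory under $1 that is NOT owned by user $2.
# Empty output means ownership is as expected.
list_foreign_owned() {
    local dir="$1" user="$2"
    find "$dir" ! -user "$user" -print
}

# Return 0 when the certificate ($1) and the private key ($2) carry the
# same public key, 1 when they differ, 2 when openssl cannot read a file.
keypair_matches() {
    local cert="$1" key="$2" cert_pub key_pub
    cert_pub=$(openssl x509 -in "$cert" -noout -pubkey) || return 2
    key_pub=$(openssl pkey -in "$key" -pubout) || return 2
    [ "$cert_pub" = "$key_pub" ]
}
```

For example, list_foreign_owned /var/lib/puppet/server/ssl/ca puppet would print any files left with the wrong owner (assuming the service user is named puppet), and keypair_matches /var/lib/puppet/server/ssl/certs/puppet.pem /var/lib/puppet/server/ssl/private_keys/puppet.pem would confirm that the two halves of the keypair belong together.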

Change 421839 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/dns@master] Remove non-authoritative SRV puppet records

https://gerrit.wikimedia.org/r/421839

Change 421842 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] puppetmaster: install keypair for 'puppet' when running as CA

https://gerrit.wikimedia.org/r/421842

Change 421839 merged by Filippo Giunchedi:
[operations/dns@master] Remove non-authoritative SRV puppet records

https://gerrit.wikimedia.org/r/421839

Change 421917 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] Revert "hieradata: use puppetmaster2001 as ca_server"

https://gerrit.wikimedia.org/r/421917

Change 421842 merged by Filippo Giunchedi:
[operations/puppet@production] puppetmaster: install keypair for 'puppet' when running as CA

https://gerrit.wikimedia.org/r/421842

Mentioned in SAL (#wikimedia-operations) [2018-03-27T15:10:12Z] <godog> stop puppet fleetwide for CA failover - T189891

Change 421917 merged by Filippo Giunchedi:
[operations/puppet@production] Revert "hieradata: use puppetmaster2001 as ca_server"

https://gerrit.wikimedia.org/r/421917

Mentioned in SAL (#wikimedia-operations) [2018-03-27T15:23:29Z] <godog> reenable puppet fleetwide for CA failover - T189891

fgiunchedi claimed this task.

This is complete. Added documentation to https://wikitech.wikimedia.org/wiki/Puppet#Puppet_CA

Change 420705 abandoned by Filippo Giunchedi:
puppetmaster: lock commits on /srv/private on non-master hosts

Reason:
puppet-merge now locks

https://gerrit.wikimedia.org/r/420705