Details
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | jbond | T184561 Modernize Puppet Configuration Management (2017-18 Q3 Goal)
Resolved | | fgiunchedi | T184562 Upgrade Puppet Master Infrastructure to Debian Stretch
Resolved | | fgiunchedi | T188623 Upgrade hiera to stretch (version 3)
Resolved | | fgiunchedi | T185215 Puppet compiler failure to lookup some keys
Resolved | | fgiunchedi | T189891 Failover puppet ca service from eqiad to codfw
Event Timeline
Change 414675 abandoned by Filippo Giunchedi:
WIP ruby-mysql2
Reason:
Not needed, will upload ruby-mysql to stretch-wikimedia instead
Mentioned in SAL (#wikimedia-operations) [2018-02-27T13:22:37Z] <godog> upload ruby-mysql 2.9.1-1~bpo9+1 to stretch-wikimedia - T184562
Change 391336 abandoned by Paladox:
puppetmaster: Use ruby-mysql2 over ruby-mysql and migrate servermon to it
Mentioned in SAL (#wikimedia-operations) [2018-02-27T17:14:36Z] <godog> upload puppetdb 2.3.8-1~wmf1+stretch to stretch-wikimedia - T184562
Change 415244 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] aptrepo: add puppetdb4 component
Change 415244 merged by Filippo Giunchedi:
[operations/puppet@production] aptrepo: add puppetdb4 component
rhodium with puppetdb-terminus from puppetdb 2.3 works as expected. The only initialization I had to do was to update /srv/private with actual contents instead of waiting for a commit on private.git.
Change 415299 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: repool rhodium
Change 415299 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: repool rhodium
Change 415316 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] Reinstall puppetmaster1002 with stretch
Change 415316 merged by Filippo Giunchedi:
[operations/puppet@production] Reinstall puppetmaster1002 with stretch
Change 415327 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] puppetmaster: naggen2 depends on python-requests
Change 415335 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] puppetmaster: capture warnings in logging for naggen2
Change 415327 merged by Filippo Giunchedi:
[operations/puppet@production] puppetmaster: naggen2 depends on python-requests
Change 415341 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: depool rhodium, bring back puppetmaster1002
Change 415335 merged by Filippo Giunchedi:
[operations/puppet@production] puppetmaster: capture warnings in logging for naggen2
Change 415341 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: depool rhodium, bring back puppetmaster1002
Change 419173 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] puppet: depool and reinstall puppetmaster2002 with stretch
Change 419173 merged by Filippo Giunchedi:
[operations/puppet@production] puppet: depool and reinstall puppetmaster2002 with stretch
Change 419455 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] Add puppetmaster2002 back, offline
Change 419455 merged by Filippo Giunchedi:
[operations/puppet@production] Add puppetmaster2002 back, offline
Change 419689 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: pool puppetmaster2002
Change 419689 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: pool puppetmaster2002
puppetmaster2002 was repooled today and is working as intended. puppetdb on nihal had a spike in commands processed while compilations were happening on puppetmaster2002 and "recovered" after about half an hour.
Change 419704 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: depool puppetmaster1002 for stretch reimage
Change 419704 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: depool puppetmaster1002 for stretch reimage
Change 419758 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: add puppetmaster1002 back, offline
Change 419758 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: add puppetmaster1002 back, offline
Change 419764 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: repool puppetmaster1002
Change 419764 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: repool puppetmaster1002
Change 419767 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] utils: fetch puppet ca server from agent config
Change 419767 abandoned by Filippo Giunchedi:
utils: fetch puppet ca server from agent config
Reason:
Script is meant to be run on a local checkout
Change 419774 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/dns@master] Depool codfw puppetmaster
Change 419781 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hiera: use puppet.codfw.wmnet alias for labtestpuppetmaster
Change 419794 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] install_server: use stretch for puppetmaster2001
Change 419795 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] cache: depool puppetmaster2001 from config-master.w.o
Change 419781 merged by Andrew Bogott:
[operations/puppet@production] hiera: use puppet.codfw.wmnet alias for labtestpuppetmaster
Change 419802 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/dns@master] lower TTL for puppetmaster-related CNAMEs
Change 419802 merged by Filippo Giunchedi:
[operations/dns@master] lower TTL for puppetmaster-related CNAMEs
Change 419794 merged by Filippo Giunchedi:
[operations/puppet@production] install_server: use stretch for puppetmaster2001
On Monday 19th I'll reinstall puppetmaster2001 with stretch, using the following procedure:
- Depool puppetmaster2001 via dns, from config-master and its "puppetmaster frontend" role: https://gerrit.wikimedia.org/r/c/419795/ https://gerrit.wikimedia.org/r/c/419774/
- Verify that traffic has been drained and puppetmasters in eqiad can cope with the additional load (>= 30 min), fail back if not
- Reimage puppetmaster2001 with stretch via wmf-auto-reimage-host, taking care of the first puppet run too
- Synchronize /srv/private from puppetmaster1001 with su -c "export GIT_SSH=/srv/private/.git/ssh_wrapper.sh ; git push ssh://puppetmaster2001.codfw.wmnet/srv/private master" gitpuppet
- Force-run rsync crons for volatile and ca on puppetmaster2001: /usr/bin/rsync -avz --delete puppetmaster1001.eqiad.wmnet::puppet_volatile /var/lib/puppet/volatile and /usr/bin/rsync -avz --delete puppetmaster1001.eqiad.wmnet::puppet_ca /var/lib/puppet/server/ssl/ca
- Verify puppet agent can run using the new frontend on a test host using https://wikitech.wikimedia.org/wiki/Puppet#force_puppet_agent_to_use_a_specific_puppetmaster
- Repool a small site first (e.g. ulsfo) in dns and verify all is well https://gerrit.wikimedia.org/r/c/420003/
- Repool remaining sites, eqsin and codfw https://gerrit.wikimedia.org/r/c/420004/ and https://gerrit.wikimedia.org/r/c/420005/
- Verify /srv/config-master is getting updated
- Repool config-master in varnish
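The two synchronization steps above can be sketched as a dry-run helper that only prints the commands, so they can be reviewed before being run by hand. The command lines are the ones quoted in the procedure; the function names private_push_cmd and rsync_cmds are hypothetical, and the sketch assumes the push is run on the source host and the rsyncs on the reimaged host.

```shell
#!/bin/sh
# Dry-run sketch: these functions only print commands, they never execute them.
# Function names are hypothetical; the command lines come from the procedure.

# Print the /srv/private push command, to be run on the source puppetmaster.
# $1: FQDN of the reimaged (destination) puppetmaster
private_push_cmd() {
    printf '%s\n' "su -c \"export GIT_SSH=/srv/private/.git/ssh_wrapper.sh ; git push ssh://${1}/srv/private master\" gitpuppet"
}

# Print the rsync commands, to be run on the reimaged puppetmaster.
# $1: FQDN of the source puppetmaster
rsync_cmds() {
    printf '%s\n' "/usr/bin/rsync -avz --delete ${1}::puppet_volatile /var/lib/puppet/volatile"
    printf '%s\n' "/usr/bin/rsync -avz --delete ${1}::puppet_ca /var/lib/puppet/server/ssl/ca"
}

# Example for the codfw reimage described above.
private_push_cmd puppetmaster2001.codfw.wmnet
rsync_cmds puppetmaster1001.eqiad.wmnet
```

Swapping the two hostnames gives the commands for the opposite direction.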
Change 419774 merged by Filippo Giunchedi:
[operations/dns@master] Depool codfw puppetmaster
Mentioned in SAL (#wikimedia-operations) [2018-03-19T09:10:16Z] <godog> depool codfw puppetmaster - T184562
Change 419795 merged by Filippo Giunchedi:
[operations/puppet@production] cache: depool puppetmaster2001 from config-master.w.o
Mentioned in SAL (#wikimedia-operations) [2018-03-19T09:27:03Z] <godog> reimage puppetmaster2001 with stretch - T184562
puppetmaster2001 was reimaged with stretch and traffic was moved back as planned. Notes from the process:
- The procedure should include removing the puppet master from the list of workers so puppet-merge doesn't attempt to sync to it while the reimage is ongoing
- apache2 wouldn't start, complaining that /var/lib/puppet/server/ssl/certs/ca.pem was missing. I manually copied it from /var/lib/puppet/ssl/certs/ca.pem
- There's an apache warning AH00548: NameVirtualHost has no effect and will be removed in the next release /etc/apache2/conf-enabled/50-puppetmaster-ports.conf:5
- After the reimage we should run systemctl reset-failed puppet-master, since the unit is disabled and we're not running puppet master as a separate process.
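The two manual fixes in the notes above (copying ca.pem and resetting the failed unit) could be captured in a small helper. This is a minimal sketch, not the fix that was eventually merged: the run and post_reimage_fixups names and the DRY_RUN switch are hypothetical, and the default paths are the ones mentioned in the notes.

```shell
#!/bin/sh
# Sketch of the post-reimage fixups. DRY_RUN (hypothetical switch) defaults
# to 1, in which case commands are printed instead of executed.

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# $1: CA certificate path expected by apache2 (optional)
# $2: CA certificate path of the local puppet agent (optional)
post_reimage_fixups() {
    server_ca="${1:-/var/lib/puppet/server/ssl/certs/ca.pem}"
    agent_ca="${2:-/var/lib/puppet/ssl/certs/ca.pem}"

    # apache2 refuses to start without the server-side copy of ca.pem.
    if [ ! -e "$server_ca" ]; then
        run cp "$agent_ca" "$server_ca"
    fi

    # The puppet-master unit is disabled, so clear its failed state.
    run systemctl reset-failed puppet-master
}
```

Calling post_reimage_fixups with DRY_RUN=0 as root on the reimaged host would apply the changes; with the default it only lists them.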
Change 420351 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] puppetmaster: disable puppet-master service
Change 420351 merged by Filippo Giunchedi:
[operations/puppet@production] puppetmaster: disable puppet-master service
Change 420733 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/dns@master] Depool eqiad puppetmaster
Change 420734 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/dns@master] Move config-master to codfw
Change 420744 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] cache: depool puppetmaster1001 from config-master.w.o
Change 421031 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] install_server: reinstall puppetmaster1001 with stretch
Change 421031 merged by Filippo Giunchedi:
[operations/puppet@production] install_server: reinstall puppetmaster1001 with stretch
Tomorrow we're going to reinstall puppetmaster1001; puppet traffic is already pointed away from it. After the CA/private failover is completed (T189891: Failover puppet ca service from eqiad to codfw) these are the remaining steps:
- Move config-master away from eqiad (varnish + dns): https://gerrit.wikimedia.org/r/420744 https://gerrit.wikimedia.org/r/c/420734/
- Reimage puppetmaster1001 with stretch via wmf-auto-reimage-host, taking care of the first puppet run too
- Synchronize /srv/private from puppetmaster2001 with su -c "export GIT_SSH=/srv/private/.git/ssh_wrapper.sh ; git push ssh://puppetmaster1001.eqiad.wmnet/srv/private master" gitpuppet
- Force-run rsync crons for volatile and ca on puppetmaster1001: /usr/bin/rsync -avz --delete puppetmaster2001.codfw.wmnet::puppet_volatile /var/lib/puppet/volatile and /usr/bin/rsync -avz --delete puppetmaster2001.codfw.wmnet::puppet_ca /var/lib/puppet/server/ssl/ca
- Verify puppet agent can run using the new frontend on a test host using https://wikitech.wikimedia.org/wiki/Puppet#force_puppet_agent_to_use_a_specific_puppetmaster
- Repool esams first in dns and verify all is well https://gerrit.wikimedia.org/r/c/421060/
- Repool eqiad and wikimedia.org in dns and verify all is well https://gerrit.wikimedia.org/r/c/421061/
- Verify /srv/config-master is getting updated on puppetmaster1001
- Repool config-master in varnish and dns by reverting https://gerrit.wikimedia.org/r/420744 https://gerrit.wikimedia.org/r/c/420734/
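For the "verify puppet agent can run using the new frontend" step, the wikitech page linked above is the authoritative procedure. As a rough sketch only, assuming the standard puppet agent options --test, --noop and --server, the command for a test host could be generated like this (verify_agent_cmd is a hypothetical name):

```shell
#!/bin/sh
# Print a no-op agent run against a specific frontend; nothing is executed.
# $1: FQDN of the puppetmaster frontend to test against
verify_agent_cmd() {
    printf '%s\n' "puppet agent --test --noop --server ${1}"
}

# Example: test the freshly reimaged eqiad frontend.
verify_agent_cmd puppetmaster1001.eqiad.wmnet
```

The --noop flag keeps the test run from applying changes on the test host, so a failure points at the frontend rather than at the catalog.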
Change 420733 abandoned by Filippo Giunchedi:
Depool eqiad puppetmaster
Reason:
Not needed
Change 421060 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/dns@master] wmnet: point esams puppet to eqiad
Change 421061 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/dns@master] Point wikimedia.org and eqiad puppet to eqiad
Change 420734 merged by Filippo Giunchedi:
[operations/dns@master] Move config-master to codfw
Change 420744 merged by Filippo Giunchedi:
[operations/puppet@production] cache: depool puppetmaster1001 from config-master.w.o
Mentioned in SAL (#wikimedia-operations) [2018-03-22T14:00:09Z] <godog> reimage puppetmaster1001 - T184562
Change 421317 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: take out puppetmaster1001 as frontend
Change 421317 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: take out puppetmaster1001 as frontend
Reimaging puppetmaster1001 isn't going according to plan: eno1 is seemingly brought up and gets a DHCP lease, then is brought down again. Without a default gateway the subnet-specific network preseed file isn't loaded, which leads to a debconf question about the network mask. See also the logs at https://phabricator.wikimedia.org/P6885
The reimage problem on puppetmaster1001 was solved by reverting https://gerrit.wikimedia.org/r/#/c/421279/ which had inadvertently commented out a large portion of the preseed.cfg
Proceeding with wmf-auto-reimage-host on puppetmaster1001 now
Change 421918 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/dns@master] Revert "Move config-master to codfw"
Change 421919 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] Revert "cache: depool puppetmaster1001 from config-master.w.o"
Change 421918 merged by Filippo Giunchedi:
[operations/dns@master] Revert "Move config-master to codfw"
Change 421919 merged by Filippo Giunchedi:
[operations/puppet@production] Revert "cache: depool puppetmaster1001 from config-master.w.o"
Change 421060 merged by Filippo Giunchedi:
[operations/dns@master] wmnet: point esams puppet to eqiad
Change 421061 merged by Filippo Giunchedi:
[operations/dns@master] Point wikimedia.org and eqiad puppet to eqiad
This is completed. I added documentation on pooling/depooling frontends and backends at https://wikitech.wikimedia.org/wiki/Puppet#Operations