Upgrade Puppet Master Infrastructure to Debian Stretch
Closed, Resolved · Public

Event Timeline

Change 414978 merged by Filippo Giunchedi:
[operations/puppet@production] puppetmaster: document 'offline' worker option

https://gerrit.wikimedia.org/r/414978

Change 414979 merged by Filippo Giunchedi:
[operations/puppet@production] puppetmaster: add rhodium, depooled

https://gerrit.wikimedia.org/r/414979

Change 414675 abandoned by Filippo Giunchedi:
WIP ruby-mysql2

Reason:
Not needed, will upload ruby-mysql to stretch-wikimedia instead

https://gerrit.wikimedia.org/r/414675

Mentioned in SAL (#wikimedia-operations) [2018-02-27T13:22:37Z] <godog> upload ruby-mysql 2.9.1-1~bpo9+1 to stretch-wikimedia - T184562

Change 391336 abandoned by Paladox:
puppetmaster: Use ruby-mysql2 over ruby-mysql and migrate servermon to it

https://gerrit.wikimedia.org/r/391336

Mentioned in SAL (#wikimedia-operations) [2018-02-27T17:14:36Z] <godog> upload puppetdb 2.3.8-1~wmf1+stretch to stretch-wikimedia - T184562

puppetdb 4 was in stretch-wikimedia, but it now seems to be puppetdb 2.

Change 415244 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] aptrepo: add puppetdb4 component

https://gerrit.wikimedia.org/r/415244

Change 415244 merged by Filippo Giunchedi:
[operations/puppet@production] aptrepo: add puppetdb4 component

https://gerrit.wikimedia.org/r/415244

rhodium with puppetdb-terminus from puppetdb 2.3 works as expected. The only initialization I had to do was to update /srv/private with actual contents instead of waiting for a commit on private.git.

Change 415299 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: repool rhodium

https://gerrit.wikimedia.org/r/415299

Change 415299 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: repool rhodium

https://gerrit.wikimedia.org/r/415299

Change 415316 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] Reinstall puppetmaster1002 with stretch

https://gerrit.wikimedia.org/r/415316

Change 415316 merged by Filippo Giunchedi:
[operations/puppet@production] Reinstall puppetmaster1002 with stretch

https://gerrit.wikimedia.org/r/415316

Change 415327 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] puppetmaster: naggen2 depends on python-requests

https://gerrit.wikimedia.org/r/415327

Change 415335 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] puppetmaster: capture warnings in logging for naggen2

https://gerrit.wikimedia.org/r/415335

Change 415327 merged by Filippo Giunchedi:
[operations/puppet@production] puppetmaster: naggen2 depends on python-requests

https://gerrit.wikimedia.org/r/415327

Change 415341 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: depool rhodium, bring back puppetmaster1002

https://gerrit.wikimedia.org/r/415341

Change 415335 merged by Filippo Giunchedi:
[operations/puppet@production] puppetmaster: capture warnings in logging for naggen2

https://gerrit.wikimedia.org/r/415335

Change 415341 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: depool rhodium, bring back puppetmaster1002

https://gerrit.wikimedia.org/r/415341

herron added a subscriber: Andrew. · Mar 6 2018, 6:41 PM

Change 419173 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] puppet: depool and reinstall puppetmaster2002 with stretch

https://gerrit.wikimedia.org/r/419173

Change 419173 merged by Filippo Giunchedi:
[operations/puppet@production] puppet: depool and reinstall puppetmaster2002 with stretch

https://gerrit.wikimedia.org/r/419173

Change 419455 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] Add puppetmaster2002 back, offline

https://gerrit.wikimedia.org/r/419455

Change 419455 merged by Filippo Giunchedi:
[operations/puppet@production] Add puppetmaster2002 back, offline

https://gerrit.wikimedia.org/r/419455

Change 419689 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: pool puppetmaster2002

https://gerrit.wikimedia.org/r/419689

Change 419689 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: pool puppetmaster2002

https://gerrit.wikimedia.org/r/419689

puppetmaster2002 was repooled today and is working as intended. puppetdb on nihal had a spike in commands processed while compilations were happening on puppetmaster2002 and "recovered" after about half an hour.

Change 419704 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: depool puppetmaster1002 for stretch reimage

https://gerrit.wikimedia.org/r/419704

Change 419704 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: depool puppetmaster1002 for stretch reimage

https://gerrit.wikimedia.org/r/419704

Change 419758 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: add puppetmaster1002 back, offline

https://gerrit.wikimedia.org/r/419758

Change 419758 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: add puppetmaster1002 back, offline

https://gerrit.wikimedia.org/r/419758

Change 419764 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: repool puppetmaster1002

https://gerrit.wikimedia.org/r/419764

Change 419764 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: repool puppetmaster1002

https://gerrit.wikimedia.org/r/419764

Change 419767 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] utils: fetch puppet ca server from agent config

https://gerrit.wikimedia.org/r/419767

Change 419767 abandoned by Filippo Giunchedi:
utils: fetch puppet ca server from agent config

Reason:
Script is meant to be run on a local checkout

https://gerrit.wikimedia.org/r/419767

Change 419774 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/dns@master] Depool codfw puppetmaster

https://gerrit.wikimedia.org/r/419774

Change 419781 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hiera: use puppet.codfw.wmnet alias for labtestpuppetmaster

https://gerrit.wikimedia.org/r/419781

Change 419794 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] install_server: use stretch for puppetmaster2001

https://gerrit.wikimedia.org/r/419794

Change 419795 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] cache: depool puppetmaster2001 from config-master.w.o

https://gerrit.wikimedia.org/r/419795

Change 419781 merged by Andrew Bogott:
[operations/puppet@production] hiera: use puppet.codfw.wmnet alias for labtestpuppetmaster

https://gerrit.wikimedia.org/r/419781

Change 419802 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/dns@master] lower TTL for puppetmaster-related CNAMEs

https://gerrit.wikimedia.org/r/419802

Change 419802 merged by Filippo Giunchedi:
[operations/dns@master] lower TTL for puppetmaster-related CNAMEs

https://gerrit.wikimedia.org/r/419802

Change 419794 merged by Filippo Giunchedi:
[operations/puppet@production] install_server: use stretch for puppetmaster2001

https://gerrit.wikimedia.org/r/419794

fgiunchedi added a comment. · Edited · Mar 16 2018, 9:19 AM

On Monday 19th I'll reinstall puppetmaster2001 with stretch, using the following procedure:

  1. Depool puppetmaster2001 via dns, from config-master and its "puppetmaster frontend" role: https://gerrit.wikimedia.org/r/c/419795/ https://gerrit.wikimedia.org/r/c/419774/
  2. Verify that traffic has been drained and puppetmasters in eqiad can cope with the additional load (>= 30 min), fail back if not
  3. Reimage puppetmaster2001 with stretch via wmf-auto-reimage-host, taking care of the first puppet run too
  4. Synchronize /srv/private from puppetmaster1001 with su -c "export GIT_SSH=/srv/private/.git/ssh_wrapper.sh ; git push ssh://puppetmaster2001.codfw.wmnet/srv/private master" gitpuppet
  5. Force-run rsync crons for volatile and ca on puppetmaster2001: /usr/bin/rsync -avz --delete puppetmaster1001.eqiad.wmnet::puppet_volatile /var/lib/puppet/volatile and /usr/bin/rsync -avz --delete puppetmaster1001.eqiad.wmnet::puppet_ca /var/lib/puppet/server/ssl/ca
  6. Verify puppet agent can run using the new frontend on a test host using https://wikitech.wikimedia.org/wiki/Puppet#force_puppet_agent_to_use_a_specific_puppetmaster
  7. Repool a small site first (e.g. ulsfo) in dns and verify all is well https://gerrit.wikimedia.org/r/c/420003/
  8. Repool remaining sites, eqsin and codfw https://gerrit.wikimedia.org/r/c/420004/ and https://gerrit.wikimedia.org/r/c/420005/
  9. Verify /srv/config-master is getting updated
  10. Repool config-master in varnish
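
Steps 4 and 5 above could be wrapped in a small dry-run helper along these lines. This is only a sketch: the git push and rsync commands are the ones quoted in the procedure, while the run helper, the DRY_RUN variable and the function names are illustrative assumptions and not part of any existing tooling.

```shell
#!/bin/sh
# Sketch of steps 4 and 5. run(), DRY_RUN and the function names are
# hypothetical; only the wrapped commands come from the procedure.

# Print the command instead of executing it unless DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# Step 4, run on the source puppetmaster: push /srv/private to the
# freshly reimaged host given as $1.
push_private() {
    dest="$1"
    run su -c "export GIT_SSH=/srv/private/.git/ssh_wrapper.sh ; git push ssh://${dest}/srv/private master" gitpuppet
}

# Step 5, run on the reimaged host: pull volatile data and the CA
# from the source puppetmaster given as $1.
sync_from() {
    src="$1"
    run /usr/bin/rsync -avz --delete "${src}::puppet_volatile" /var/lib/puppet/volatile
    run /usr/bin/rsync -avz --delete "${src}::puppet_ca" /var/lib/puppet/server/ssl/ca
}
```

With the default DRY_RUN=1 the functions only print what they would run, e.g. `push_private puppetmaster2001.codfw.wmnet` on puppetmaster1001 and `sync_from puppetmaster1001.eqiad.wmnet` on puppetmaster2001. Setting DRY_RUN=0 executes the commands for real.
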
Joe added a subscriber: Joe. · Mar 16 2018, 10:40 AM

The plan looks fine to me!

Change 419774 merged by Filippo Giunchedi:
[operations/dns@master] Depool codfw puppetmaster

https://gerrit.wikimedia.org/r/419774

Mentioned in SAL (#wikimedia-operations) [2018-03-19T09:10:16Z] <godog> depool codfw puppetmaster - T184562

Change 419795 merged by Filippo Giunchedi:
[operations/puppet@production] cache: depool puppetmaster2001 from config-master.w.o

https://gerrit.wikimedia.org/r/419795

Mentioned in SAL (#wikimedia-operations) [2018-03-19T09:27:03Z] <godog> reimage puppetmaster2001 with stretch - T184562

puppetmaster2001 was reimaged with stretch and traffic moved back as planned, notes from the process:

  1. The procedure should include removing the puppet master from the list of workers so puppet-merge doesn't attempt to sync to it while the reimage is ongoing
  2. apache2 won't start, lamenting that /var/lib/puppet/server/ssl/certs/ca.pem is missing. I manually copied it from /var/lib/puppet/ssl/certs/ca.pem
  3. There's an apache warning, "AH00548: NameVirtualHost has no effect and will be removed in the next release", raised at /etc/apache2/conf-enabled/50-puppetmaster-ports.conf:5
  4. After reimage we should systemctl reset-failed puppet-master since the unit is disabled and we're not running puppet master as a separate process.
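
Notes 2 and 4 could be folded into a post-reimage fixup along these lines. This is a rough sketch rather than the change that was eventually merged: the paths and the systemctl invocation come from the notes above, while the run helper, DRY_RUN and the function names are hypothetical.

```shell
#!/bin/sh
# Sketch of the manual fixups from notes 2 and 4. run(), DRY_RUN and the
# function names are illustrative assumptions.

# Print the command instead of executing it unless DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# Note 2: apache2 refuses to start without the server-side ca.pem, so copy
# it from the agent's ssl directory when it is missing.
restore_ca_cert() {
    src="${1:-/var/lib/puppet/ssl/certs/ca.pem}"
    dst="${2:-/var/lib/puppet/server/ssl/certs/ca.pem}"
    if [ -e "$dst" ]; then
        echo "ca.pem already present at $dst"
    else
        run cp "$src" "$dst"
    fi
}

# Note 4: clear the failed state left behind by the disabled
# puppet-master unit.
clear_failed_unit() {
    run systemctl reset-failed puppet-master
}
```

Calling `restore_ca_cert` and then `clear_failed_unit` with DRY_RUN=1 prints the two commands without touching the host.
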

Change 420351 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] puppetmaster: disable puppet-master service

https://gerrit.wikimedia.org/r/420351

Change 420351 merged by Filippo Giunchedi:
[operations/puppet@production] puppetmaster: disable puppet-master service

https://gerrit.wikimedia.org/r/420351

Change 420733 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/dns@master] Depool eqiad puppetmaster

https://gerrit.wikimedia.org/r/420733

Change 420734 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/dns@master] Move config-master to codfw

https://gerrit.wikimedia.org/r/420734

Change 420744 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] cache: depool puppetmaster1001 from config-master.w.o

https://gerrit.wikimedia.org/r/420744

Change 421031 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] install_server: reinstall puppetmaster1001 with stretch

https://gerrit.wikimedia.org/r/421031

Change 421031 merged by Filippo Giunchedi:
[operations/puppet@production] install_server: reinstall puppetmaster1001 with stretch

https://gerrit.wikimedia.org/r/421031

fgiunchedi added a comment. · Edited · Mar 21 2018, 4:16 PM

Tomorrow we're going to reinstall puppetmaster1001; puppet traffic is already pointed away from it. After the CA/private failover is completed (T189891: Failover puppet ca service from eqiad to codfw) these are the remaining steps:

  1. Move config-master away from eqiad (varnish + dns): https://gerrit.wikimedia.org/r/420744 https://gerrit.wikimedia.org/r/c/420734/
  2. Reimage puppetmaster1001 with stretch via wmf-auto-reimage-host, taking care of the first puppet run too
  3. Synchronize /srv/private from puppetmaster2001 with su -c "export GIT_SSH=/srv/private/.git/ssh_wrapper.sh ; git push ssh://puppetmaster1001.eqiad.wmnet/srv/private master" gitpuppet
  4. Force-run rsync crons for volatile and ca on puppetmaster1001: /usr/bin/rsync -avz --delete puppetmaster2001.codfw.wmnet::puppet_volatile /var/lib/puppet/volatile and /usr/bin/rsync -avz --delete puppetmaster2001.codfw.wmnet::puppet_ca /var/lib/puppet/server/ssl/ca
  5. Verify puppet agent can run using the new frontend on a test host using https://wikitech.wikimedia.org/wiki/Puppet#force_puppet_agent_to_use_a_specific_puppetmaster
  6. Repool esams first in dns and verify all is well https://gerrit.wikimedia.org/r/c/421060/
  7. Repool eqiad and wikimedia.org in dns and verify all is well https://gerrit.wikimedia.org/r/c/421061/
  8. Verify /srv/config-master is getting updated on puppetmaster1001
  9. Repool config-master in varnish and dns by reverting https://gerrit.wikimedia.org/r/420744 https://gerrit.wikimedia.org/r/c/420734/
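
For the "verify all is well" part of steps 6 and 7, one possible check is to confirm where a per-site puppet alias currently points. The sketch below assumes the alias is a CNAME that resolves directly to the frontend's hostname; the function names are hypothetical, and only puppet.codfw.wmnet is an alias name taken from this task.

```shell
#!/bin/sh
# Sketch of a DNS check for the repool steps. Function names are
# illustrative; the CNAME layout is an assumption.

# Print the CNAME target of the alias given as $1. dig prints the target
# with a trailing dot, which is stripped here.
cname_target() {
    dig +short "$1" CNAME | sed 's/\.$//'
}

# Succeed when alias $1 currently points at the expected frontend $2.
alias_points_to() {
    alias_name="$1"
    expected="$2"
    [ "$(cname_target "$alias_name")" = "$expected" ]
}
```

For example, `alias_points_to puppet.codfw.wmnet puppetmaster2001.codfw.wmnet && echo ok` would print ok only if that alias resolves to that host. Resolvers may keep serving the old target until the record's TTL expires.
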

Change 420733 abandoned by Filippo Giunchedi:
Depool eqiad puppetmaster

Reason:
Not needed

https://gerrit.wikimedia.org/r/420733

Change 421060 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/dns@master] wmnet: point esams puppet to eqiad

https://gerrit.wikimedia.org/r/421060

Change 421061 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/dns@master] Point wikimedia.org and eqiad puppet to eqiad

https://gerrit.wikimedia.org/r/421061

Change 420734 merged by Filippo Giunchedi:
[operations/dns@master] Move config-master to codfw

https://gerrit.wikimedia.org/r/420734

Change 420744 merged by Filippo Giunchedi:
[operations/puppet@production] cache: depool puppetmaster1001 from config-master.w.o

https://gerrit.wikimedia.org/r/420744

Mentioned in SAL (#wikimedia-operations) [2018-03-22T14:00:09Z] <godog> reimage puppetmaster1001 - T184562

Change 421317 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: take out puppetmaster1001 as frontend

https://gerrit.wikimedia.org/r/421317

Change 421317 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: take out puppetmaster1001 as frontend

https://gerrit.wikimedia.org/r/421317

Reimaging puppetmaster1001 isn't going according to plan: eno1 is seemingly brought up and gets a DHCP lease, then is brought down again. Without a default gateway the subnet-specific network preseed file isn't loaded, leading to a debconf question about the network mask. See also the logs at https://phabricator.wikimedia.org/P6885

The reimage problem on puppetmaster1001 was solved by reverting https://gerrit.wikimedia.org/r/#/c/421279/ which had inadvertently commented out a large portion of the preseed.cfg

Proceeding with wmf-auto-reimage-host on puppetmaster1001 now

Change 421918 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/dns@master] Revert "Move config-master to codfw"

https://gerrit.wikimedia.org/r/421918

Change 421919 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] Revert "cache: depool puppetmaster1001 from config-master.w.o"

https://gerrit.wikimedia.org/r/421919

Change 421918 merged by Filippo Giunchedi:
[operations/dns@master] Revert "Move config-master to codfw"

https://gerrit.wikimedia.org/r/421918

Change 421919 merged by Filippo Giunchedi:
[operations/puppet@production] Revert "cache: depool puppetmaster1001 from config-master.w.o"

https://gerrit.wikimedia.org/r/421919

Change 421060 merged by Filippo Giunchedi:
[operations/dns@master] wmnet: point esams puppet to eqiad

https://gerrit.wikimedia.org/r/421060

Change 421061 merged by Filippo Giunchedi:
[operations/dns@master] Point wikimedia.org and eqiad puppet to eqiad

https://gerrit.wikimedia.org/r/421061

fgiunchedi closed this task as Resolved. · Mar 28 2018, 1:59 PM

This is completed. I added documentation on pooling/depooling frontends and backends at https://wikitech.wikimedia.org/wiki/Puppet#Operations

238482n375 set Security to Software security bug. · Jun 15 2018, 8:07 AM
238482n375 changed the visibility from "Public (No Login Required)" to "Custom Policy".
This comment was removed by Volans.
Restricted Application added a project: Security. · Jun 15 2018, 9:06 AM