
Upgrade install servers to Buster
Closed, Resolved | Public

Description

These act as TFTP/DHCP/boot servers, but also host apt.wikimedia.org

  • assign new IP addresses for replacement servers (private IPs, unlike before!)
  • create Ganeti VMs
  • install OS (buster) on new VMs
  • create new puppet role for installserver minus APT repo (role::installserver::light)
  • create new puppet role for APT repo minus DHCP/TFTP (role::apt_repo)
  • apply both puppet roles and check/fix all errors (add buster support)
  • rsync APT repo data (done from install1002 to apt1001 and apt2001, added to cron jobs)
  • switch DNS CNAME for webproxy / switch squids to buster
  • switch DHCP_relay config on routers
  • switch TFTP server IPs in DHCP config
  • switch apt.wikimedia.org to the new backend apt1001 in DNS after testing
  • install1002.wikimedia.org replaced with install1003.eqiad.wmnet in all config files
  • install2002.wikimedia.org replaced with install2003.codfw.wmnet in all config files
  • remove old servers from DNS and decom fully with cookbook

Details

Project                  Branch      Lines +/-
operations/puppet        production  +0 -5
operations/dns           master      +0 -8
operations/puppet        production  +0 -5
operations/dns           master      +2 -0
operations/puppet        production  +6 -6
operations/puppet        production  +1 -2
operations/puppet        production  +1 -1
operations/puppet        production  +44 -0
operations/puppet        production  +5 -0
operations/puppet        production  +13 -9
operations/dns           master      +1 -1
operations/puppet        production  +35 -0
operations/puppet        production  +1 -3
operations/puppet        production  +1 -1
operations/puppet        production  +2 -0
operations/puppet        production  +5 -18
operations/puppet        production  +6 -0
operations/puppet        production  +3 -1
operations/puppet        production  +9 -3
operations/puppet        production  +26 -26
operations/puppet        production  +1 -1
operations/puppet        production  +1 -1
operations/homer/public  master      +47 -47
operations/puppet        production  +2 -1
operations/puppet        production  +7 -3
operations/puppet        production  +0 -1
operations/dns           master      +5 -5
operations/puppet        production  +4 -4
operations/puppet        production  +2 -2
operations/puppet        production  +4 -4
operations/puppet        production  +1 -1
operations/dns           master      +8 -0
operations/dns           master      +0 -8
operations/puppet        production  +14 -1
operations/puppet        production  +0 -0
operations/puppet        production  +12 -6
operations/puppet        production  +19 -8
operations/puppet        production  +7 -1
operations/puppet        production  +2 -0
operations/puppet        production  +5 -1
operations/puppet        production  +12 -3
operations/dns           master      +4 -4
operations/puppet        production  +1 -0
operations/puppet        production  +0 -3
operations/puppet        production  +4 -0
operations/puppet        production  +1 -1
operations/puppet        production  +10 -0
operations/puppet        production  +5 -0
operations/puppet        production  +9 -6
operations/puppet        production  +16 -11
operations/puppet        production  +5 -5
operations/puppet        production  +11 -0
operations/puppet        production  +8 -8
operations/dns           master      +1 -1
operations/puppet        production  +5 -0
operations/puppet        production  +0 -2
operations/puppet        production  +15 -0
operations/dns           master      +8 -0
operations/dns           master      +8 -0
operations/puppet        production  +1 -1
operations/puppet        production  +9 -0

Event Timeline

There are a very large number of changes, so older changes are hidden.

Change 572394 abandoned by Dzahn:
site: add installserver::light role on new install servers

Reason:
rebased to nothing as it was done in https://gerrit.wikimedia.org/r/c/operations/puppet/+/576460

https://gerrit.wikimedia.org/r/572394

Change 576479 merged by Dzahn:
[operations/puppet@production] installserver: ensure interface for DHCP server is configured

https://gerrit.wikimedia.org/r/576479

Change 577640 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/dns@master] remove install1003/2003 again to recreate with public IPs

https://gerrit.wikimedia.org/r/577640

Change 577641 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] site: switch install1003/2003 to public IPs

https://gerrit.wikimedia.org/r/577641

cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: install1003.eqiad.wmnet

  • install1003.eqiad.wmnet (PASS)
    • Downtimed host on Icinga
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster ganeti01.svc.eqiad.wmnet to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed

cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: install2003.codfw.wmnet

  • install2003.codfw.wmnet (FAIL)
    • Host steps raised exception: Empty Management Password

ERROR: some step on some host failed, check the bolded items above

cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: install2003.codfw.wmnet

  • install2003.codfw.wmnet (PASS)
    • Downtimed host on Icinga
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster ganeti01.svc.codfw.wmnet to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed

Change 577640 merged by Dzahn:
[operations/dns@master] remove install1003/2003 again to recreate with public IPs

https://gerrit.wikimedia.org/r/577640

Change 577649 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/dns@master] re-add install1002/install2003 with public IPs

https://gerrit.wikimedia.org/r/577649

Change 577649 merged by Dzahn:
[operations/dns@master] re-add install1002/install2003 with public IPs

https://gerrit.wikimedia.org/r/577649

install1003.eqiad.wmnet and install2003.codfw.wmnet have been removed entirely with the decom cookbook.

DNS has been adjusted and then VMs with identical specs but public IPs have been created with the makevm cookbook.

The new hosts are install1003.wikimedia.org and install2003.wikimedia.org.

Change 577663 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] DHCP: switch install1003/2003 to public IP, new VMs, update MACs

https://gerrit.wikimedia.org/r/577663

Change 577641 merged by Dzahn:
[operations/puppet@production] site: switch install1003/2003 to public IPs

https://gerrit.wikimedia.org/r/577641

Change 577663 merged by Dzahn:
[operations/puppet@production] DHCP: switch install1003/2003 to public IP, new VMs, update MACs

https://gerrit.wikimedia.org/r/577663

@Muehlenhoff @ayounsi

Tested squid on re-installed install1003.wikimedia.org:

[planet1001:~] $ http_proxy="http://install1003.wikimedia.org:8080" HTTPS_PROXY="http://install1003.wikimedia.org:8080" curl foobar.org

It works for me.

The request also shows up in squid/access.log, and the prometheus squid exporter has been added.
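
For repeating this kind of proxy check from a script, a minimal sketch follows. It assumes Python 3 with only the standard library; the helper names `build_proxy_opener` and `fetch_status` are hypothetical and not part of any Wikimedia tooling. The proxy URL and test site are the ones used in the manual test above.

```python
import urllib.request


def build_proxy_opener(proxy_url):
    """Return an opener that sends HTTP and HTTPS requests through proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


def fetch_status(opener, url, timeout=10):
    """Fetch url through the opener and return the HTTP status code."""
    with opener.open(url, timeout=timeout) as resp:
        return resp.status


# Example (needs network access to the proxy, so it is left commented out):
#   opener = build_proxy_opener("http://install1003.wikimedia.org:8080")
#   print(fetch_status(opener, "http://foobar.org"))
```

Passing the proxy explicitly instead of relying on environment variables keeps the check independent of whatever `http_proxy` settings the calling shell happens to have.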

Change 569687 abandoned by Dzahn:
switch install_server and failover in hieradata/common.yaml

Reason:
wrong, this is only used by aptrepo which is now a separate role, will rename this variable

https://gerrit.wikimedia.org/r/569687

Change 577701 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] hiera/aptrepo: rename install_server variables to aptrepo_server

https://gerrit.wikimedia.org/r/577701

Change 577701 merged by Dzahn:
[operations/puppet@production] hiera/aptrepo: rename install_server variables to aptrepo_server

https://gerrit.wikimedia.org/r/577701

Change 569680 merged by Dzahn:
[operations/dns@master] switch webproxy CNAMEs to new install servers

https://gerrit.wikimedia.org/r/569680

Change 579106 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] installserver: allow configuring squid as running/stopped in Hiera

https://gerrit.wikimedia.org/r/579106

Change 579373 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] squid: remove obsolete hierarchy_stoplist config directive

https://gerrit.wikimedia.org/r/579373

Change 579373 merged by Dzahn:
[operations/puppet@production] squid: remove obsolete hierarchy_stoplist config directive

https://gerrit.wikimedia.org/r/579373

Change 579106 merged by Dzahn:
[operations/puppet@production] installserver: allow configuring squid as absent in Hiera

https://gerrit.wikimedia.org/r/579106

Mentioned in SAL (#wikimedia-operations) [2020-03-13T17:21:12Z] <mutante> removed squid from install1002/install2002 (formerly webproxy.(eqiad|codfw).wmnet until 2 days ago, replaced by install1003/install2003) T224576

Change 579610 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] installserver: disable squid monitoring if service is absent

https://gerrit.wikimedia.org/r/579610

Change 579610 merged by Dzahn:
[operations/puppet@production] installserver: disable squid monitoring if service is absent

https://gerrit.wikimedia.org/r/579610

When building a docker container on contint1001.wikimedia.org with docker-pkg, pip gets a proxy timeout error when using http://webproxy.eqiad.wmnet:8080.

I have manually switched to the codfw one (http://webproxy.codfw.wmnet:8080) and it worked fine.

So I guess install1003.wikimedia.org has an issue of some sort?

Change 584963 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/homer/public@master] replace DCHP relays with new installservers

https://gerrit.wikimedia.org/r/584963

Change 584963 merged by jenkins-bot:
[operations/homer/public@master] replace DCHP relays with new installservers

https://gerrit.wikimedia.org/r/584963

Mentioned in SAL (#wikimedia-operations) [2020-04-01T09:34:33Z] <mutante> install_servers: DHCP_relay in routers and TFTP server in DHCP server config have been switched from install1002/2002 to install1003/2003 - doing a test install, but if any issues report on T224576

Mentioned in SAL (#wikimedia-operations) [2020-04-01T09:35:02Z] <XioNoX> Update install servers IPs (dhcp helpers + firewall rules) - T224576

Change 585194 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] start DHCP service on new installserver::light servers

https://gerrit.wikimedia.org/r/585194

Change 585194 merged by Dzahn:
[operations/puppet@production] start DHCP service on new installserver::light servers

https://gerrit.wikimedia.org/r/585194

Change 585199 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] stop DHCP service on jessie install servers

https://gerrit.wikimedia.org/r/585199

Change 585199 merged by Dzahn:
[operations/puppet@production] stop DHCP service on jessie install servers

https://gerrit.wikimedia.org/r/585199

Change 569684 merged by Dzahn:
[operations/puppet@production] install_server: switch TFTP servers in DHCP to new install servers

https://gerrit.wikimedia.org/r/569684

Change 585221 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] installserver: allow stopping tftp service with parameter

https://gerrit.wikimedia.org/r/585221

Change 585221 merged by Dzahn:
[operations/puppet@production] installserver: stop tftp service on old install servers

https://gerrit.wikimedia.org/r/585221

Change 585233 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] installserver/tftp: add missing ensure_service parameter

https://gerrit.wikimedia.org/r/585233

Change 585233 merged by Dzahn:
[operations/puppet@production] installserver/tftp: add missing ensure_service parameter

https://gerrit.wikimedia.org/r/585233

Change 585238 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] installserver/tftp: if service is ensured stop, do not monitor it

https://gerrit.wikimedia.org/r/585238

Change 585238 merged by Dzahn:
[operations/puppet@production] tftp: if service is told to be stopped, do not monitor it

https://gerrit.wikimedia.org/r/585238

Change 569686 merged by Dzahn:
[operations/puppet@production] DHCP: remove old install servers and use new servers as next-server

https://gerrit.wikimedia.org/r/569686

Change 585245 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] hiera/apt.wikimedia.org: switch from install1002 to apt1001

https://gerrit.wikimedia.org/r/585245

Change 585370 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] tftp,bastionhost: Add missing ensure_service hiera key

https://gerrit.wikimedia.org/r/585370

Change 585370 merged by Vgutierrez:
[operations/puppet@production] tftp,bastionhost: Add missing ensure_service hiera key

https://gerrit.wikimedia.org/r/585370

Change 585485 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] adjust my .bash_profile to set base dir for reprepro on apt*

https://gerrit.wikimedia.org/r/585485

Change 585485 merged by Dzahn:
[operations/puppet@production] adjust my .bash_profile to set base dir for reprepro on apt*

https://gerrit.wikimedia.org/r/585485

Change 585528 had a related patch set uploaded (by Muehlenhoff; owner: Muehlenhoff):
[operations/puppet@production] Add s-nail to send mails after package imports

https://gerrit.wikimedia.org/r/585528

Change 585528 merged by Muehlenhoff:
[operations/puppet@production] Add s-nail to send mails after package imports

https://gerrit.wikimedia.org/r/585528

I successfully tested imports and removals of a dummy build (hello) and fixed up the sending of status mails.

I've also synced the public repo keys from install1002 to apt1001. "reprepro checkup" against external repos now also works fine, and I could successfully update the external Tor repo for stretch.

Change 585245 merged by Dzahn:
[operations/puppet@production] hiera/apt.wikimedia.org: switch from install1002 to apt1001

https://gerrit.wikimedia.org/r/585245

Change 585722 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] aptrepo: add MOTD warning to not use old install servers for APT

https://gerrit.wikimedia.org/r/585722

Change 585723 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] aptrepo: remove reprepro package on inactive servers

https://gerrit.wikimedia.org/r/585723

Change 585722 merged by Dzahn:
[operations/puppet@production] aptrepo: add MOTD warning to not use old install servers for APT

https://gerrit.wikimedia.org/r/585722

Change 575404 merged by Dzahn:
[operations/dns@master] switch apt.wikimedia.org from install1002 to apt1001

https://gerrit.wikimedia.org/r/575404

Change 585723 abandoned by Dzahn:
aptrepo: remove reprepro package on inactive servers

https://gerrit.wikimedia.org/r/585723

Change 585765 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] installserver: rsync home dir data for migration

https://gerrit.wikimedia.org/r/585765

Change 585765 merged by Dzahn:
[operations/puppet@production] installserver: rsync home dir data for migration

https://gerrit.wikimedia.org/r/585765

Change 585789 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] installserver::migration: fix rsync dir name

https://gerrit.wikimedia.org/r/585789

Change 585789 merged by Dzahn:
[operations/puppet@production] installserver::migration: fix rsync dir name

https://gerrit.wikimedia.org/r/585789

There seems to be a problem with PXE installs. Today I tried to live-hack a partman recipe on install1003/2003, but it didn't work, and after a while I found the following in the squid logs:

GET http://apt.wikimedia.org/autoinstall/partman/custom/kafka-jumbo.cfg

After live-hacking the recipe on apt1001 I was finally able to see results. I also noticed that /srv/autoinstall on apt1001 does not seem to be managed by puppet. A quick grep in the puppet repo shows stuff like:

preseed/url=http://apt.wikimedia.org/autoinstall/preseed.cfg

 d-i	preseed/early_command	string	wget -O /tmp/early_command http://apt.wikimedia.org/autoinstall/scripts/early_command.sh && sh /tmp/early_command

etc.

We should review all occurrences of apt.wikimedia.org in puppet and see if they need to be migrated to another DNS/alias.
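
A review like this could start from a list of every matching line. The following is a minimal sketch, assuming a local checkout of the puppet repository and Python 3; the function name `find_occurrences` is hypothetical, and the walk simply skips `.git` and unreadable files.

```python
import os


def find_occurrences(root, needle="apt.wikimedia.org"):
    """Yield (path, line_number, line) for each line under root containing needle."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Do not descend into version control metadata.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8", errors="replace") as fh:
                    for lineno, line in enumerate(fh, start=1):
                        if needle in line:
                            yield path, lineno, line.rstrip("\n")
            except OSError:
                # Unreadable file (permissions, broken symlink): skip it.
                continue
```

Narrowing the needle to `apt.wikimedia.org/autoinstall` would separate the installer URLs from plain APT repository references, which do not need to move.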

I have also run into issues with autoinstall/ not being managed by puppet :-(

root@install1002:/srv/autoinstall# grep -iR "no-srv-format.cfg" * | grep pc2007
netboot.cfg:        db1[01][0-9][0-9]|dbstore100[3-5]|db2[01][0-9][0-9]|es101[1-9]|es201[1-9]|pc100[7-9]|pc2007|pc2009|pc[12]010|labsdb1009|labsdb101[0-2]|dbprov[12]00[12]) echo partman/custom/no-srv-format.cfg ;; \
preseed.cfg:        db1[01][0-9][0-9]|dbstore100[3-5]|db2[01][0-9][0-9]|es101[1-9]|es201[1-9]|pc100[7-9]|pc2007|pc2009|pc[12]010|labsdb1009|labsdb101[0-2]|dbprov[12]00[12]) echo partman/custom/no-srv-format.cfg ;; \
root@install1002:/srv/autoinstall#
root@apt1001:/srv/autoinstall# grep -iR "no-srv-format.cfg" * | grep pc2007
root@apt1001:/srv/autoinstall#
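
To check which partman recipe a given host would get from a copy of netboot.cfg, without eyeballing grep output, a rough sketch follows. It assumes each relevant line has the `pattern|pattern) echo recipe ;;` shape seen above and approximates shell `case` matching with Python's `fnmatch`, which is adequate for simple bracket patterns like these but is not a full shell parser. The name `recipe_for` is hypothetical.

```python
import fnmatch
import re

# Matches lines of the form "<pat>|<pat>|...) echo <recipe> ;;"
CASE_LINE = re.compile(r"^\s*(?P<patterns>[^)]+)\)\s*echo\s+(?P<recipe>\S+)\s*;;")


def recipe_for(hostname, lines):
    """Return the recipe of the first case line matching hostname, else None."""
    for line in lines:
        match = CASE_LINE.match(line)
        if not match:
            continue
        for pattern in match.group("patterns").split("|"):
            if fnmatch.fnmatchcase(hostname, pattern.strip()):
                return match.group("recipe")
    return None
```

Feed it the raw lines of the file (not grep output, which prefixes each line with the filename). Running it against the copies on both hosts would show the kind of divergence seen above, where one copy has an entry for pc2007 and the other does not.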

Change 587698 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/dns@master] add install.wikimedia.org CNAME to install1003

https://gerrit.wikimedia.org/r/587698

Change 587699 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] install_server: replace apt.wm.org with install.wm.org in autoinstall URLs

https://gerrit.wikimedia.org/r/587699

Change 587701 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] installserver: include preseed in apt role, not in light role

https://gerrit.wikimedia.org/r/587701

Change 587701 merged by Dzahn:
[operations/puppet@production] installserver: include preseed in apt role, not in light role

https://gerrit.wikimedia.org/r/587701

Change 587699 abandoned by Dzahn:
install_server: replace apt.wm.org with install.wm.org in autoinstall URLs

Reason:
preferring other solution https://gerrit.wikimedia.org/r/c/operations/puppet/+/587701

https://gerrit.wikimedia.org/r/587699

Change 587698 abandoned by Dzahn:
add install.wikimedia.org CNAME to install1003

Reason:
https://gerrit.wikimedia.org/r/c/operations/puppet/+/587701

https://gerrit.wikimedia.org/r/587698

Looks like the puppet issue isn't yet fixed, no? (I thought it was, so commenting here just to make sure)

I pushed a change to netboot.cfg which does show on a puppet run on install1002 but not on install1003.

@Marostegui Please go to apt1001.wikimedia.org and you should see your change there. The "preseed" profile which includes /srv/autoinstall files is applied on the same hosts that have the APT repo and the webserver. install1003 is only TFTP/DHCP/proxy but not a webserver.

Yeah, it was applied there indeed. Thanks for the clarification!

Change 592662 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] apt_repo: stop including migration rsync class

https://gerrit.wikimedia.org/r/592662

Change 592662 merged by Dzahn:
[operations/puppet@production] apt_repo: stop including migration rsync class

https://gerrit.wikimedia.org/r/592662

Change 569683 merged by Dzahn:
[operations/dns@master] remove install1002/install2002

https://gerrit.wikimedia.org/r/569683

Dzahn updated the task description.

Change 598071 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] Remove install[12]002 site.pp entries

https://gerrit.wikimedia.org/r/598071

Change 598071 merged by Alexandros Kosiaris:
[operations/puppet@production] Remove install[12]002 site.pp entries

https://gerrit.wikimedia.org/r/598071