
Upgrade install servers to Buster
Open, Medium, Public

Description

The install servers act as TFTP/DHCP/boot servers, but also host apt.wikimedia.org.

  • assign new IP addresses for replacement servers (private IPs, unlike before!)
  • create Ganeti VMs
  • install OS (buster) on new VMs
  • create new puppet role for installserver minus APT repo (role::installserver::light)
  • create new puppet role for APT repo minus DHCP/TFTP (role::apt_repo)
  • apply both puppet roles and check/fix all errors (add buster support)
  • rsync APT repo data (done from install1002 to apt1001 and apt2001, added to cron jobs)
  • switch DNS CNAME for webproxy / switch squids to buster
  • switch DHCP_relay config on routers
  • switch TFTP server IPs in DHCP config
  • switch apt.wikimedia.org to the new backend apt1001 in DNS after making some tests
  • install1002.wikimedia.org replaced with install1003.eqiad.wmnet in all config files
  • install2002.wikimedia.org replaced with install2003.codfw.wmnet in all config files
  • remove old servers from DNS and decom fully with cookbook
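The hostname replacement steps in the checklist above lend themselves to a quick scan for leftover references. The following is a minimal sketch, not the tooling used for this task: the helper name `find_stale_hostnames` and the idea of scanning a local checkout of the config repositories are assumptions made for illustration.

```python
import os
import re

# Old install server names taken from the checklist above.
OLD_HOSTS = ("install1002.wikimedia.org", "install2002.wikimedia.org")


def find_stale_hostnames(root, old_hosts=OLD_HOSTS):
    """Yield (path, line_number, line) for each line that still names an old host.

    Hypothetical helper: walks a local checkout under `root` and reports
    every line containing one of `old_hosts`.
    """
    pattern = re.compile("|".join(re.escape(h) for h in old_hosts))
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip VCS metadata directories.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8", errors="replace") as handle:
                    for number, line in enumerate(handle, start=1):
                        if pattern.search(line):
                            yield path, number, line.rstrip("\n")
            except OSError:
                # Unreadable files (broken symlinks, permissions) are skipped.
                continue
```

Running `list(find_stale_hostnames("/path/to/checkout"))` (the path is a placeholder) would list any file that still points at the old servers; an empty result suggests the replacement is complete for that checkout.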

Details

Related Gerrit Patches:
[operations/puppet@production] installserver::migration: fix rsync dir name
[operations/puppet@production] installserver: rsync home dir data for migration
[operations/puppet@production] Add s-nail to send mails after package imports
[operations/puppet@production] aptrepo: remove reprepro package on inactive servers
[operations/dns@master] switch apt.wikimedia.org from install1002 to apt1001
[operations/puppet@production] aptrepo: add MOTD warning to not use old install servers for APT
[operations/puppet@production] hiera/apt.wikimedia.org: switch from install1002 to apt1001
[operations/puppet@production] adjust my .bash_profile to set base dir for reprepro on apt*
[operations/puppet@production] tftp,bastionhost: Add missing ensure_service hiera key
[operations/puppet@production] DHCP: remove old install servers and use new servers as next-server
[operations/puppet@production] tftp: if service is told to be stopped, do not monitor it
[operations/puppet@production] installserver/tftp: add missing ensure_service parameter
[operations/puppet@production] installserver: stop tftp service on old install servers
[operations/puppet@production] install_server: switch TFTP servers in DHCP to new install servers
[operations/puppet@production] stop DHCP service on jessie install servers
[operations/puppet@production] start DHCP service on new installserver::light servers
[operations/homer/public@master] replace DCHP relays with new installservers
[operations/puppet@production] installserver: disable squid monitoring if service is absent
[operations/puppet@production] installserver: allow configuring squid as absent in Hiera
[operations/puppet@production] squid: remove obsolete hierarchy_stoplist config directive
[operations/dns@master] switch webproxy CNAMEs to new install servers
[operations/puppet@production] hiera/aptrepo: rename install_server variables to aptrepo_server
[operations/puppet@production] switch install_server and failover in hieradata/common.yaml
[operations/puppet@production] DHCP: switch install1003/2003 to public IP, new VMs, update MACs
[operations/puppet@production] site: switch install1003/2003 to public IPs
[operations/dns@master] re-add install1002/install2003 with public IPs
[operations/dns@master] remove install1003/2003 again to recreate with public IPs
[operations/puppet@production] installserver: ensure interface for DHCP server is configured
[operations/puppet@production] site: add installserver::light role on new install servers
[operations/puppet@production] installserver: allow stopping DHCP server in Hiera, apply new role
[operations/puppet@production] installserver/apt: allow setting gpg_user different from reprepro user
[operations/puppet@production] installserver/apt_repo: add homedir parameter, move dirs to Hiera
[operations/puppet@production] aptrepo: puppetize gpg sec and pub keys for apt.wm.org
[operations/puppet@production] aptrepo: puppetize REPREPRO_BASE_DIR env variable
[operations/puppet@production] install_server: allow rsyncing from active to replacement servers
[operations/dns@master] fix IP address for apt2001.wikimedia.org
[operations/puppet@production] add apt1001 to install_servers failover for rsyncing APT repo data
[operations/puppet@production] apt: remove (duplicate) OCSP stapling config and RSA cert
[operations/puppet@production] aptrepo: add support for buster, install python-apt package
[operations/puppet@production] site: add apt[12]001.wikimedia.org with role::apt_repo
[operations/puppet@production] DHCP: add apt1002 and apt2001
[operations/puppet@production] site: add apt1001/apt2001 with spare role
[operations/puppet@production] aptrepo/install: move https monitoring to aptrepo profile
[operations/puppet@production] installserver: allow for multiple failover servers at once
[operations/puppet@production] wmflib: replace install1002 with bast1002 in ipresolve_spec
[operations/puppet@production] introduce new role to install nginx and APT repo without DHCP/TFTP
[operations/puppet@production] installserver: create new role without HTTP/APT, rename existing role
[operations/dns@master] switch apt.wikimedia.org from install1002 to install1003
[operations/puppet@production] site: add new install servers with private IP, spare role
[operations/puppet@production] install: remove next-server for new install servers for OS install
[operations/puppet@production] DHCP: add install1003/install2003, using current install servers
[operations/dns@master] add private IPs for new install servers
[operations/dns@master] add IP addresses for new install servers on buster
[operations/puppet@production] install_server: add install[12]003 to partman recipe regex
[operations/dns@master] remove install1002/install2002
[operations/puppet@production] Create separate role for repository servers

Event Timeline

There are a very large number of changes, so older changes are hidden.

Mentioned in SAL (#wikimedia-operations) [2020-02-26T13:45:59Z] <XioNoX> ganeti2001:~$ sudo gnt-instance shutdown apt2001.wikimedia.org - T224576

Dzahn added a comment. Feb 26 2020, 3:10 PM

Thanks @ayounsi. For unknown reasons I could never get a console on this server, so I did not know whether it was actually installing an OS. (The exact same thing in eqiad works fine.) It does have a puppet role applied though.

Change 575066 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] add apt1001 to install_servers failover for rsyncing APT repo data

https://gerrit.wikimedia.org/r/575066

Change 575066 merged by Dzahn:
[operations/puppet@production] add apt1001 to install_servers failover for rsyncing APT repo data

https://gerrit.wikimedia.org/r/575066

Change 575318 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/dns@master] fix IP address for apt2001.wikimedia.org

https://gerrit.wikimedia.org/r/575318

Change 575318 merged by Dzahn:
[operations/dns@master] fix IP address for apt2001.wikimedia.org

https://gerrit.wikimedia.org/r/575318

Change 569691 abandoned by Dzahn:
install_server: allow rsyncing from active to replacement servers

Reason:
superseded by https://gerrit.wikimedia.org/r/c/operations/puppet/+/574088

https://gerrit.wikimedia.org/r/569691

Change 572394 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] site: add installserver::light role on new install servers

https://gerrit.wikimedia.org/r/572394

Change 575404 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/dns@master] switch apt.wikimedia.org from install1002 to apt1001

https://gerrit.wikimedia.org/r/575404

Dzahn updated the task description. Feb 28 2020, 10:00 PM

Change 575638 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] aptrepo: puppetize REPREPRO_BASE_DIR env variable

https://gerrit.wikimedia.org/r/575638

Change 575638 merged by Dzahn:
[operations/puppet@production] aptrepo: puppetize REPREPRO_BASE_DIR env variable

https://gerrit.wikimedia.org/r/575638

Change 576143 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] aptrepo: puppetize gpg sec and pub keys for apt.wm.org

https://gerrit.wikimedia.org/r/576143

Change 576143 merged by Dzahn:
[operations/puppet@production] aptrepo: puppetize gpg sec and pub keys for apt.wm.org

https://gerrit.wikimedia.org/r/576143

Change 576160 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] installserver/apt_repo: add homedir parameter, move dirs to Hiera

https://gerrit.wikimedia.org/r/576160

Change 576160 merged by Dzahn:
[operations/puppet@production] installserver/apt_repo: add homedir parameter, move dirs to Hiera

https://gerrit.wikimedia.org/r/576160

Change 576164 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] installserver/apt: allow setting gpg_user different from reprepro user

https://gerrit.wikimedia.org/r/576164

Change 576164 merged by Dzahn:
[operations/puppet@production] installserver/apt: allow setting gpg_user different from reprepro user

https://gerrit.wikimedia.org/r/576164

Change 576460 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] installserver: allow stopping the DHCP server via Hiera

https://gerrit.wikimedia.org/r/576460

Change 576460 merged by Dzahn:
[operations/puppet@production] installserver: allow stopping DHCP server in Hiera, apply new role

https://gerrit.wikimedia.org/r/576460

Change 576479 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] installserver: add parameter for DHCP interface

https://gerrit.wikimedia.org/r/576479

Change 572394 abandoned by Dzahn:
site: add installserver::light role on new install servers

Reason:
rebased to nothing as it was done in https://gerrit.wikimedia.org/r/c/operations/puppet/+/576460

https://gerrit.wikimedia.org/r/572394

Change 576479 merged by Dzahn:
[operations/puppet@production] installserver: ensure interface for DHCP server is configured

https://gerrit.wikimedia.org/r/576479

Change 577640 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/dns@master] remove install1003/2003 again to recreate with public IPs

https://gerrit.wikimedia.org/r/577640

Change 577641 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] site: switch install1003/2003 to public IPs

https://gerrit.wikimedia.org/r/577641

cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: install1003.eqiad.wmnet

  • install1003.eqiad.wmnet (PASS)
    • Downtimed host on Icinga
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster ganeti01.svc.eqiad.wmnet to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed

cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: install2003.codfw.wmnet

  • install2003.codfw.wmnet (FAIL)
    • Host steps raised exception: Empty Management Password

ERROR: some step on some host failed, check the bolded items above

cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: install2003.codfw.wmnet

  • install2003.codfw.wmnet (PASS)
    • Downtimed host on Icinga
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster ganeti01.svc.codfw.wmnet to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed

Change 577640 merged by Dzahn:
[operations/dns@master] remove install1003/2003 again to recreate with public IPs

https://gerrit.wikimedia.org/r/577640

Change 577649 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/dns@master] re-add install1002/install2003 with public IPs

https://gerrit.wikimedia.org/r/577649

Change 577649 merged by Dzahn:
[operations/dns@master] re-add install1002/install2003 with public IPs

https://gerrit.wikimedia.org/r/577649

Dzahn added a comment. Mar 6 2020, 7:55 PM

install1003.eqiad.wmnet and install2003.codfw.wmnet have been removed entirely with the decom cookbook.

DNS has been adjusted and then VMs with identical specs but public IPs have been created with the makevm cookbook.

The new hosts are install1003.wikimedia.org and install2003.wikimedia.org.
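After recreating hosts with a different address type, it can help to confirm what a name actually resolves to. The following is a rough sketch under stated assumptions: `classify_addresses` is a hypothetical helper, and "private" here means whatever Python's `ipaddress` module reports via `is_private`, which may not match site-specific network definitions.

```python
import ipaddress
import socket


def classify_addresses(hostname, resolve=None):
    """Return {address: "private" | "public"} for every address of `hostname`.

    `resolve` is an injectable lookup (hostname -> list of IP strings) so the
    function can be exercised without DNS; by default it uses getaddrinfo.
    """
    if resolve is None:
        def resolve(name):
            infos = socket.getaddrinfo(name, None)
            # getaddrinfo returns one entry per socket type; de-duplicate.
            return sorted({info[4][0] for info in infos})

    result = {}
    for address in resolve(hostname):
        ip = ipaddress.ip_address(address)
        result[address] = "private" if ip.is_private else "public"
    return result
```

Calling `classify_addresses("install1003.wikimedia.org")` from a host with working DNS would be expected to report only public addresses after the change described above; that expectation comes from the comment, not from a run of this code.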

Change 577663 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] DHCP: switch install1003/2003 to public IP, new VMs, update MACs

https://gerrit.wikimedia.org/r/577663

Change 577641 merged by Dzahn:
[operations/puppet@production] site: switch install1003/2003 to public IPs

https://gerrit.wikimedia.org/r/577641

Change 577663 merged by Dzahn:
[operations/puppet@production] DHCP: switch install1003/2003 to public IP, new VMs, update MACs

https://gerrit.wikimedia.org/r/577663

Dzahn added a comment. Mar 7 2020, 12:02 AM

@Muehlenhoff @ayounsi

Tested squid on re-installed install1003.wikimedia.org:

[planet1001:~] $ http_proxy="http://install1003.wikimedia.org:8080" HTTPS_PROXY="http://install1003.wikimedia.org:8080" curl foobar.org

This works for me.

The request also shows up in squid/access.log, and the prometheus squid exporter has been added.
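The same kind of proxy check can be scripted. This is a minimal sketch using only the Python standard library; the function names `proxy_map` and `fetch_status_via_proxy` are made up for illustration, and the target URL in the usage note is a placeholder.

```python
import urllib.request


def proxy_map(proxy_url):
    """Map both URL schemes to the same forward proxy."""
    return {"http": proxy_url, "https": proxy_url}


def fetch_status_via_proxy(url, proxy_url, timeout=10):
    """Return the HTTP status code for `url` fetched through `proxy_url`.

    Raises urllib.error.URLError (an OSError subclass) if the proxy or the
    target cannot be reached within `timeout` seconds.
    """
    handler = urllib.request.ProxyHandler(proxy_map(proxy_url))
    opener = urllib.request.build_opener(handler)
    with opener.open(url, timeout=timeout) as response:
        return response.status
```

For example, `fetch_status_via_proxy("http://example.org/", "http://install1003.wikimedia.org:8080")` would exercise the squid instance tested above, assuming the calling host is allowed to use it.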

Change 569687 abandoned by Dzahn:
switch install_server and failover in hieradata/common.yaml

Reason:
wrong, this is only used by aptrepo which is now a separate role, will rename this variable

https://gerrit.wikimedia.org/r/569687

Change 577701 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] hiera/aptrepo: rename install_server variables to aptrepo_server

https://gerrit.wikimedia.org/r/577701

Change 577701 merged by Dzahn:
[operations/puppet@production] hiera/aptrepo: rename install_server variables to aptrepo_server

https://gerrit.wikimedia.org/r/577701

Change 569680 merged by Dzahn:
[operations/dns@master] switch webproxy CNAMEs to new install servers

https://gerrit.wikimedia.org/r/569680

Change 579106 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] installserver: allow configuring squid as running/stopped in Hiera

https://gerrit.wikimedia.org/r/579106

Change 579373 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] squid: remove obsolete hierarchy_stoplist config directive

https://gerrit.wikimedia.org/r/579373

Change 579373 merged by Dzahn:
[operations/puppet@production] squid: remove obsolete hierarchy_stoplist config directive

https://gerrit.wikimedia.org/r/579373

Change 579106 merged by Dzahn:
[operations/puppet@production] installserver: allow configuring squid as absent in Hiera

https://gerrit.wikimedia.org/r/579106

Mentioned in SAL (#wikimedia-operations) [2020-03-13T17:21:12Z] <mutante> removed squid from install1002/install2002 (formerly webproxy.(eqiad|codfw).wmnet until 2 days ago, replaced by install1003/install2003) T224576

Change 579610 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] installserver: disable squid monitoring if service is absent

https://gerrit.wikimedia.org/r/579610

Change 579610 merged by Dzahn:
[operations/puppet@production] installserver: disable squid monitoring if service is absent

https://gerrit.wikimedia.org/r/579610

Dzahn updated the task description. Fri, Mar 13, 10:59 PM
hashar added a subscriber: hashar. Mon, Mar 16, 2:42 PM

When building a docker container on contint1001.wikimedia.org with docker-pkg, pip gets a proxy timeout error when using http://webproxy.eqiad.wmnet:8080.

I have manually switched to the codfw one (http://webproxy.codfw.wmnet:8080) and it worked fine.

So I guess install1003.wikimedia.org has an issue of some sort?

I opened T247759 to track this issue.
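The manual switch to the other proxy can be expressed as a small fallback loop. This is only a sketch of the workaround, not a fix for the underlying issue: `first_working_proxy` is a hypothetical helper, and `probe` stands for whatever check the caller considers sufficient (for example a short HTTP request through the proxy).

```python
def first_working_proxy(proxies, probe):
    """Return the first proxy URL for which `probe(proxy)` succeeds, else None.

    `probe` is any callable that raises OSError on failure; that covers
    timeouts and urllib.error.URLError, which are OSError subclasses.
    """
    for proxy in proxies:
        try:
            probe(proxy)
        except OSError:
            # This proxy is unreachable or timed out; try the next one.
            continue
        return proxy
    return None
```

With the two proxies named above passed in order, a timeout on the eqiad proxy would make the function return the codfw one.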

Change 584963 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/homer/public@master] replace DCHP relays with new installservers

https://gerrit.wikimedia.org/r/584963

Dzahn updated the task description. Tue, Mar 31, 4:03 PM
hashar removed a subscriber: hashar. Wed, Apr 1, 7:19 AM

Change 584963 merged by jenkins-bot:
[operations/homer/public@master] replace DCHP relays with new installservers

https://gerrit.wikimedia.org/r/584963

Mentioned in SAL (#wikimedia-operations) [2020-04-01T09:34:33Z] <mutante> install_servers: DHCP_relay in routers and TFTP server in DHCP server config have been switched from install1002/2002 to install1003/2003 - doing a test install, but if any issues report on T224576

Mentioned in SAL (#wikimedia-operations) [2020-04-01T09:35:02Z] <XioNoX> Update install servers IPs (dhcp helpers + firewall rules) - T224576

Change 585194 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] start DHCP service on new installserver::light servers

https://gerrit.wikimedia.org/r/585194

Change 585194 merged by Dzahn:
[operations/puppet@production] start DHCP service on new installserver::light servers

https://gerrit.wikimedia.org/r/585194

Change 585199 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] stop DHCP service on jessie install servers

https://gerrit.wikimedia.org/r/585199

Change 585199 merged by Dzahn:
[operations/puppet@production] stop DHCP service on jessie install servers

https://gerrit.wikimedia.org/r/585199

Dzahn updated the task description. Wed, Apr 1, 1:09 PM

Change 569684 merged by Dzahn:
[operations/puppet@production] install_server: switch TFTP servers in DHCP to new install servers

https://gerrit.wikimedia.org/r/569684

Change 585221 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] installserver: allow stopping tftp service with parameter

https://gerrit.wikimedia.org/r/585221

Change 585221 merged by Dzahn:
[operations/puppet@production] installserver: stop tftp service on old install servers

https://gerrit.wikimedia.org/r/585221

Change 585233 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] installserver/tftp: add missing ensure_service parameter

https://gerrit.wikimedia.org/r/585233

Change 585233 merged by Dzahn:
[operations/puppet@production] installserver/tftp: add missing ensure_service parameter

https://gerrit.wikimedia.org/r/585233

Change 585238 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] installserver/tftp: if service is ensured stop, do not monitor it

https://gerrit.wikimedia.org/r/585238

Change 585238 merged by Dzahn:
[operations/puppet@production] tftp: if service is told to be stopped, do not monitor it

https://gerrit.wikimedia.org/r/585238

Change 569686 merged by Dzahn:
[operations/puppet@production] DHCP: remove old install servers and use new servers as next-server

https://gerrit.wikimedia.org/r/569686

Change 585245 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] hiera/apt.wikimedia.org: switch from install1002 to apt1001

https://gerrit.wikimedia.org/r/585245

Change 585370 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] tftp,bastionhost: Add missing ensure_service hiera key

https://gerrit.wikimedia.org/r/585370

Change 585370 merged by Vgutierrez:
[operations/puppet@production] tftp,bastionhost: Add missing ensure_service hiera key

https://gerrit.wikimedia.org/r/585370

Change 585485 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] adjust my .bash_profile to set base dir for reprepro on apt*

https://gerrit.wikimedia.org/r/585485

Change 585485 merged by Dzahn:
[operations/puppet@production] adjust my .bash_profile to set base dir for reprepro on apt*

https://gerrit.wikimedia.org/r/585485

Change 585528 had a related patch set uploaded (by Muehlenhoff; owner: Muehlenhoff):
[operations/puppet@production] Add s-nail to send mails after package imports

https://gerrit.wikimedia.org/r/585528

Change 585528 merged by Muehlenhoff:
[operations/puppet@production] Add s-nail to send mails after package imports

https://gerrit.wikimedia.org/r/585528

I successfully tested imports and removals of a dummy build (hello) and fixed up the sending of status mails.

I've also synced the public repo keys from install1002 to apt1001. "reprepro checkup" against external repos now also works fine, and I could successfully update the external Tor repo for stretch.

ayounsi removed a subscriber: ayounsi. Fri, Apr 3, 7:06 AM

Change 585245 merged by Dzahn:
[operations/puppet@production] hiera/apt.wikimedia.org: switch from install1002 to apt1001

https://gerrit.wikimedia.org/r/585245

Change 585722 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] aptrepo: add MOTD warning to not use old install servers for APT

https://gerrit.wikimedia.org/r/585722

Change 585723 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] aptrepo: remove reprepro package on inactive servers

https://gerrit.wikimedia.org/r/585723

Change 585722 merged by Dzahn:
[operations/puppet@production] aptrepo: add MOTD warning to not use old install servers for APT

https://gerrit.wikimedia.org/r/585722

Change 575404 merged by Dzahn:
[operations/dns@master] switch apt.wikimedia.org from install1002 to apt1001

https://gerrit.wikimedia.org/r/575404

Change 585723 abandoned by Dzahn:
aptrepo: remove reprepro package on inactive servers

https://gerrit.wikimedia.org/r/585723

Dzahn updated the task description. Fri, Apr 3, 12:08 PM

Change 585765 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] installserver: rsync home dir data for migration

https://gerrit.wikimedia.org/r/585765

Change 585765 merged by Dzahn:
[operations/puppet@production] installserver: rsync home dir data for migration

https://gerrit.wikimedia.org/r/585765

Change 585789 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] installserver::migration: fix rsync dir name

https://gerrit.wikimedia.org/r/585789

Change 585789 merged by Dzahn:
[operations/puppet@production] installserver::migration: fix rsync dir name

https://gerrit.wikimedia.org/r/585789