In T190568 both phab1001 and phab2001 have been reimaged with buster.
Currently the temporary server phab1003 is the production Phabricator server and runs stretch.
Set a maintenance window and switch over from phab1003 to phab1001.
After a little while shut down phab1003 and decom it / give it back to dcops.
https://etherpad.wikimedia.org/p/Phabricator-migration-20191203
https://phabricator.wikimedia.org/T238956
switch prod Phabricator from phab1003 to phab1001
2019-12-03
branches:
phab-buster (the actual switch)
https://gerrit.wikimedia.org/r/q/topic:%22phab-buster%22+(status:open%20OR%20status:merged)
phab1003-decom: (later)
https://gerrit.wikimedia.org/r/q/topic:%22phab1003-decom%22+(status:open%20OR%20status:merged)
phab-buster:
PREPARE:
site/phabricator: apply phab role on phab1001 - https://gerrit.wikimedia.org/r/c/operations/puppet/+/536712 (MERGED)
phabricator: support buster with PHP 7.3 packages - https://gerrit.wikimedia.org/r/c/operations/puppet/+/541666 (MERGED)
phabricator::httpd: support stretch/buster with/without php-fpm - https://gerrit.wikimedia.org/r/c/operations/puppet/+/541930 (MERGED)
phabricator: install s-nail instead of heirloom-mailx on buster - https://gerrit.wikimedia.org/r/c/operations/puppet/+/541967 (MERGED)
phabricator: install s-nail instead of heirloom-mailx on any distro - https://gerrit.wikimedia.org/r/c/operations/puppet/+/542191 (MERGED)
log downtime in icinga - DONE - scheduled downtime for host and all services on phab1003 for the next 2 days. MORE IS NEEDED: the PAGING service https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=1&host=phabricator.wikimedia.org (also downtimed)
stop phd: run puppet agent --disable "https://phabricator.wikimedia.org/T238956", then systemctl stop phd, on both phab1001 and phab1003
change the "phabricator_failover_server" in Hiera common.yaml to the new server to allow it to rsync repo data from active server https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/554390
rsync /srv/repos from phab1003 to phab1001 IN PROGRESS
SWITCH:
- rsync /srv/repos from phab1003 to phab1001 again for good measure, run it with --delete as well and ensure both sides have the same size
- verify code in /srv/phab is up to date and both servers are on the same git tag
- dumps/phabricator: switch dumps host from phab1003 to phab1001 - https://gerrit.wikimedia.org/r/c/operations/puppet/+/552593 (ON DUMPS SERVERS)
- put phab on phab1003 in maintenance mode
- phabricator: switch "phabricator_server" from phab1003 to phab1001 - https://gerrit.wikimedia.org/r/c/operations/puppet/+/554397 (ON PHAB HOST), influences rsync ferm rules
- phabricator: switch "active server" from phab1003 to phab1001 - https://gerrit.wikimedia.org/r/c/operations/puppet/+/552591 (on PHAB HOST), influences icinga monitoring, dumps enabled, aphlict unless otherwise disabled, rsyncd setup
- switch discovery record for phabricator to 1001 for ATS - https://gerrit.wikimedia.org/r/c/operations/dns/+/552598 (DNS, ATS, the switch of https backend)
- varnish: switch phabricator backend to phab1001 - https://gerrit.wikimedia.org/r/c/operations/puppet/+/552595 (varnish, probably not used, now ATS)
- phabricator: switch mail destination to phab1001 - https://gerrit.wikimedia.org/r/c/operations/puppet/+/552597 (ON MAIL SERVER)
- depool phab1003-vcs in confctl
- phabricator/conftool: switch phab-vcs (git-ssh) service to phab1001 - https://gerrit.wikimedia.org/r/c/operations/puppet/+/552589 (in CONFTOOL, git-ssh requires conftool change, which changes pybal config, may cause Icinga alerts)
- Phabricator: install python3-pygments instead of python-pygments - https://gerrit.wikimedia.org/r/#/c/554401/
- remove Icinga downtime for "phabricator.wikimedia.org" meta service, keep phab1003 downtimes extended where needed
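The "ensure both sides have the same size" check after the final rsync could be sketched as below. This is a hypothetical helper, not part of the actual migration tooling: the function name tree_summary is made up for illustration. The idea is to run it on each host against /srv/repos and compare the two results by hand.

```python
import os


def tree_summary(root):
    """Return (file_count, total_bytes) for regular files under root.

    Hypothetical helper: run on both hosts and compare the results.
    """
    count = 0
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks so only real file content is counted.
            if os.path.islink(path):
                continue
            count += 1
            total += os.path.getsize(path)
    return count, total
```

Calling tree_summary("/srv/repos") on phab1003 and on phab1001 should give the same pair if the rsync with --delete left both sides identical. Matching counts and sizes are only a quick sanity check, not proof that the contents match.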
phab1003-decom:
DECOM:
phabricator: remove phab1003 from list of phab servers - https://gerrit.wikimedia.org/r/c/operations/puppet/+/552592
remove service IPs and IPv6 for phab1003 - https://gerrit.wikimedia.org/r/c/operations/dns/+/552599
remove production IPs for phab1003 - https://gerrit.wikimedia.org/r/c/operations/dns/+/552601
mariadb: remove grants for users on phab1003 - https://gerrit.wikimedia.org/r/c/operations/puppet/+/552607
site: turn phab1003 into a spare::system - https://gerrit.wikimedia.org/r/c/operations/puppet/+/552603
mtail: stop using phab1003 for tests, use phab1001 - https://gerrit.wikimedia.org/r/c/operations/puppet/+/552604 <--- part of SWITCH ?? | No, but replaced by https://gerrit.wikimedia.org/r/c/operations/puppet/+/554403
install_server: remove phab1003 - https://gerrit.wikimedia.org/r/c/operations/puppet/+/552609
ABANDON?!
phabricator-new: https://gerrit.wikimedia.org/r/c/operations/puppet/+/551286 (abandoned)
phabricator-new: https://gerrit.wikimedia.org/r/c/operations/dns/+/551284 (abandoned)
DAY 2 (We have to switch back again to phab1003, change a BIOS setting on phab1001, reimage phab1001 and finally switch over again :/)
SIMPLIFY SETUP FOR MIGRATION (rsync, hiera, puppet)
https://gerrit.wikimedia.org/r/c/operations/puppet/+/554610 (merged)
https://gerrit.wikimedia.org/r/c/operations/puppet/+/554628 (merged)
https://gerrit.wikimedia.org/r/c/operations/puppet/+/554643 (merged)
SWITCH BACK to phab1003:
- schedule icinga downtimes
- switch "phabricator_server" for firewall rules for http/smtp and enable dumps https://gerrit.wikimedia.org/r/c/operations/puppet/+/554644
- revert: discovery record (DNS, ATS) https://gerrit.wikimedia.org/r/c/operations/dns/+/554589
- revert: dumps sync source https://gerrit.wikimedia.org/r/c/operations/puppet/+/554592
- revert: varnish https://gerrit.wikimedia.org/r/c/operations/puppet/+/554590
- revert: mail routing https://gerrit.wikimedia.org/r/c/operations/puppet/+/554591
- change BIOS settings for ATA mode to AHCI on phab1001
- reimage phab1001
- rsync /srv/repos: pull from phab1003 onto phab1001 with --delete
- reboot phab1001 to clear "microcode vulns not fixed" Icinga alert
- rsync again
- verify code in /srv/phab is up to date and both servers are on the same git tag
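The "same git tag" verification could be sketched roughly as follows. It assumes the checkout to compare is a git working copy somewhere under /srv/phab (the exact repository paths are not given in these notes) and that git is installed. The function names current_tag and same_tag are hypothetical.

```python
import subprocess


def current_tag(repo_path):
    """Return the tag pointing exactly at HEAD, or None if there is none."""
    result = subprocess.run(
        ["git", "-C", repo_path, "describe", "--tags", "--exact-match"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        # Not a git repo, or HEAD is not exactly on a tag.
        return None
    return result.stdout.strip()


def same_tag(tag_a, tag_b):
    """True only if both hosts reported a tag and the tags are equal."""
    return tag_a is not None and tag_a == tag_b
```

Run current_tag on each server for the same repository path, then compare the two values with same_tag. A None on either side is treated as a mismatch, so an untagged checkout is not silently accepted.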
FINAL SWITCH BACK TO PHAB1001
- schedule downtimes in Icinga
- rsync /srv/repos one more time
- revert: switch "phabricator_server" https://gerrit.wikimedia.org/r/c/operations/puppet/+/554660
- revert:revert: discovery record https://gerrit.wikimedia.org/r/c/operations/dns/+/554661
- revert:revert: dumps sync source https://gerrit.wikimedia.org/r/c/operations/puppet/+/554657
- revert:revert: varnish https://gerrit.wikimedia.org/r/c/operations/puppet/+/554659
- revert:revert: mail routing https://gerrit.wikimedia.org/r/c/operations/puppet/+/554658
- restart ssh-phab service to make it listen on IPv6 (it wasn't, because puppet starts the service before it adds the v6 IP to the interface); that cleared Icinga alerts
- delete stale confd files on puppetmaster to clear more Icinga alerts about confd template compilation failing (because reimage script crashed so it did not get to delete them)
- check status of git-ssh (which wasn't switched twice) https://gerrit.wikimedia.org/r/c/operations/puppet/+/554957 (!), remove the lo:LVS IPs (v4 and v6!) from the interface on phab1003, restart ssh-phab
- make phd run on correct server https://gerrit.wikimedia.org/r/c/operations/puppet/+/554960 to avoid breakage of repos
- check Icinga for any alerts and remove downtimes (don't forget the "phabricator" meta virtual host)
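The ssh-phab restart above was needed because the service was not listening on IPv6. A quick reachability probe for such cases might look like the sketch below. The function name listens_on is hypothetical, and the host name and port to probe are left to the caller because they are not stated in these notes.

```python
import socket


def listens_on(host, port, family=socket.AF_INET6, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds over the given family."""
    try:
        infos = socket.getaddrinfo(host, port, family, socket.SOCK_STREAM)
    except socket.gaierror:
        # No address of the requested family for this host.
        return False
    for af, socktype, proto, _canonname, sockaddr in infos:
        try:
            with socket.socket(af, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(sockaddr)
                return True
        except OSError:
            # Refused, timed out, or address family unavailable: try the next one.
            continue
    return False
```

Calling it once with family=socket.AF_INET and once with the default socket.AF_INET6 shows whether only the IPv6 side is failing, which is the symptom described in the restart step.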