Paste P40847

phabricator migration plan - revised - v2

Authored by Dzahn on Nov 23 2022, 9:06 PM.
[x] announce migration window to: ops list, wikitech-l list, Slack
[x] schedule downtime for phab1001 and all services on it via cookbook:
[cumin2002:~] $ sudo cookbook sre.hosts.downtime -D 14 -r 'T322250' phab1001.eqiad.wmnet
[x] confirm downtime is active in Icinga web UI (https://icinga.wikimedia.org)
[x] disable puppet on phab1001: sudo disable-puppet 'T280597'
[x] stop Apache, PHP-FPM and phd on phab1001
[phab1001:~] sudo systemctl stop apache2
[phab1001:~] sudo systemctl stop php7.3-fpm
[phab1001:~] sudo systemctl stop phd
[x] confirm there are no more PHP processes running
[phab1001:~] sudo ps aux | grep php
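The `ps aux | grep php` check above usually also lists the grep process itself, which is easy to misread as a leftover PHP process. A minimal sketch of a stricter check follows, assuming a Linux host with /proc mounted. The function name `count_procs` is hypothetical and not part of the original plan. It counts processes whose command name contains a pattern, so it should print 0 once Apache, PHP-FPM and phd are stopped.

```shell
# Hypothetical helper: count running processes whose command name
# (as exposed in /proc/<pid>/comm) contains the given pattern.
count_procs() {
    local pattern="$1"
    local count=0
    local comm_file
    for comm_file in /proc/[0-9]*/comm; do
        # A process can exit between the glob and the read, so ignore errors.
        if grep -q -- "$pattern" "$comm_file" 2>/dev/null; then
            count=$((count + 1))
        fi
    done
    echo "$count"
}

# Example: expect 0 after stopping apache2, php7.3-fpm and phd.
# count_procs php
```

On hosts with procps installed, `pgrep -a php` gives a similar answer in one command.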
[x] rsync /srv/repos diff by pulling on phab1004 from phab1001:
[phab1004:/] (as root) rsync -avp --bwlimit=2m --delete rsync://phab1001.eqiad.wmnet/srv-repos/ /srv/repos/
[x] check on phab1004 whether any files under /srv/repos are owned by UID 497 (vcs). if so, give them to user phd
[phab1004:/] find /srv/repos -uid 497
[phab1004:/] find /srv/repos -uid 497 -exec chown phd {} \;
- find proved far too slow on a fresh rsync of the repos data. We used chown -R phd:phd instead, accepting that everything is now phd:phd rather than a mix of phd:phd and phd:www-data
[x] check on phab1004 whether any files under /srv/repos are owned by GID 498 (aphlict). if so, give them to group phd
[phab1004:/] find /srv/repos -gid 498
[phab1004:/] find /srv/repos -gid 498 -exec chgrp phd {} \;
- find proved far too slow on a fresh rsync of the repos data. We used chown -R phd:phd instead, accepting that everything is now phd:phd rather than a mix of phd:phd and phd:www-data
[x] check on phab1004 if any files under /srv/repos are owned by a user that is NOT phd
[phab1004:/] find /srv/repos ! -user phd
[x] expect this to show the PHEX repo but nothing else. decide what to do with PHEX (root-owned)
- Decision here: only some files in it were root-owned, which was most likely an artifact of a manual operation on phab1001
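The three find checks above (UID 497, GID 498, not owned by phd) each walk the whole tree. A sketch that answers all of them in a single traversal is shown below, assuming GNU find (needed for -printf). The function name `ownership_summary` is hypothetical and was not part of the original plan.

```shell
# Hypothetical helper: print "<count> <user>:<group>" for every
# owner/group combination found under a directory, most common first.
# Assumes GNU find, which provides -printf. Owners without a passwd
# entry (such as a stale UID 497) show up as their numeric ID.
ownership_summary() {
    local dir="$1"
    find "$dir" -printf '%u:%g\n' | sort | uniq -c | sort -rn |
        awk '{ print $1, $2 }'
}

# Example: on phab1004 the goal would be a single "phd:phd" line.
# ownership_summary /srv/repos
```

Any line other than phd:phd in the output points at files that still need a chown or chgrp.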
[x] output the full tree of /srv/repos and compare number of directories / files between both servers
[phab1001:/] tree -upfg /srv/repos > /root/repos-tree (this file will be just under 500MB of text)
[phab1001:/] tail /root/repos-tree
[phab1004:/] tree -upfg /srv/repos > /root/repos-tree
[phab1004:/] tail /root/repos-tree
[] optional, if not yet satisfied: copy the result file from the old server to the new server (scp -3 ...) and run an actual diff between them
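Comparing two listings of roughly 500MB each by eye is tedious, so a plain count per server may be a useful first check before running a full diff. The following is a sketch only. The function name `count_tree` is hypothetical.

```shell
# Hypothetical helper: print the number of directories and regular files
# under a path (the path itself counts as one directory).
# Names containing newlines would inflate the counts.
count_tree() {
    local dir="$1"
    local dirs files
    dirs=$(find "$dir" -type d | wc -l)
    files=$(find "$dir" -type f | wc -l)
    printf 'dirs=%d files=%d\n' "$dirs" "$files"
}

# Example: run on both servers and compare the two output lines.
# count_tree /srv/repos
```

Matching counts do not prove the trees are identical, but a mismatch is a quick signal that the rsync needs another look.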
[x] set mysql ports for master and slave, specifically for eqiad (currently this happens in codfw but not in common hiera)
merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/859145 and run-puppet-agent, then check what happens on phab1004
[x] merge re-revert of the phabricator server name in common Hiera, run puppet, watch the changes on phab1004 and phab2002
https://gerrit.wikimedia.org/r/c/operations/puppet/+/860031
[x] run a scap deploy to phab1004
(insert command, deployment server name)
[x] enable phd service on phab1004
merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/859628 and run-puppet-agent
[x] wait a couple of minutes and check that phd is still running (how long?)
(if puppet kills it for any reason, that will happen on every puppet run...)
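The step above leaves open how long to wait. One option is to poll the unit repeatedly over a window long enough to span at least one puppet run. A minimal sketch, assuming phd is a systemd unit as in the stop commands earlier. The function name `service_stays_active` and the default of 6 checks 30 seconds apart are assumptions for illustration, not values from the plan.

```shell
# Hypothetical helper: check a systemd unit several times in a row.
# Usage: service_stays_active UNIT [CHECKS] [SLEEP_SECONDS]
service_stays_active() {
    local unit="$1"
    local checks="${2:-6}"
    local pause="${3:-30}"
    local i=0
    while [ "$i" -lt "$checks" ]; do
        # is-active returns non-zero for any state other than "active".
        if ! systemctl is-active --quiet "$unit"; then
            echo "$unit was not active at check $((i + 1))" >&2
            return 1
        fi
        i=$((i + 1))
        sleep "$pause"
    done
    echo "$unit was active for all $checks checks"
}

# Example: watch phd for about 30 minutes (60 checks, 30 seconds apart).
# service_stays_active phd 60 30
```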
[x] merge re-revert of the DNS/SPF change
https://gerrit.wikimedia.org/r/c/operations/dns/+/860032 and run "authdns-update" on ns0.wikimedia.org, which syncs the change to the other DNS servers
[x] wait about a minute and optionally use "dig phabricator.discovery.wmnet @ns0.wikimedia.org" to see it change from an alias for phab1001 to an alias for phab1004
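Instead of re-running dig by hand, the wait can be scripted. The sketch below polls until the name resolves to the expected alias. It assumes the record is a CNAME (the step calls it an alias). The function name `wait_for_cname` and the retry defaults are hypothetical.

```shell
# Hypothetical helper: poll a DNS server until NAME is a CNAME for EXPECTED.
# Usage: wait_for_cname NAME EXPECTED SERVER [TRIES] [SLEEP_SECONDS]
wait_for_cname() {
    local name="$1"
    local expected="$2"
    local server="$3"
    local tries="${4:-12}"
    local pause="${5:-5}"
    local i=0
    local answer=""
    while [ "$i" -lt "$tries" ]; do
        # dig +short prints the target with a trailing dot; strip it.
        answer=$(dig +short "$name" CNAME "@$server" | head -n 1)
        answer="${answer%.}"
        if [ "$answer" = "$expected" ]; then
            echo "ok: $name -> $answer"
            return 0
        fi
        i=$((i + 1))
        sleep "$pause"
    done
    echo "timeout: $name -> ${answer:-no answer}" >&2
    return 1
}

# Example, using the names from the step above:
# wait_for_cname phabricator.discovery.wmnet phab1004.eqiad.wmnet ns0.wikimedia.org
```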
[x] informational: dumps don't need to switch, they are already on phab1004, this has happened before
[x] informational: stats emails don't need to switch, they are already on phab1004, this has happened before
testing
[x] check https://phabricator.wikimedia.org works, watch out for yellow exclamation marks / warnings for admins
[x] test aphlict works by moving something on a workboard while someone else watches
[x] test if a ticket update shows up on IRC
[x] test if email from a ticket update arrives (test with a user who has email notifications enabled)
[x] check phabricator logs for exceptions (that aren't usual noise)
(insert command / paths)
[x] test if CI works / "recheck" on a change in Gerrit
finalizing
[] merge patch to disable phd (and apache and php-fpm) on phab1001?
[x] verify proper monitoring downtime on phab1001
[x] reply to list emails and Slack that the migration completed successfully, link to ticket in case they see any issues
[x] publish fingerprints on wikitech page
after the migration is done and a grace period has passed (how long?):
[x] double check which settings can move to common Hiera, remove setting from hosts files in Hiera
[] merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/824412 and check puppet run
[] remove phab1001 from mysql grants, coordinate with DBA on merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/858419
[x] create decom ticket for phab1001 - https://phabricator.wikimedia.org/T323418
[x] remove production puppet role from phab1001, merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/824804
[x] run decom cookbook from a cumin host on phab1001
[cumin2002:~] $ sudo cookbook sre.hosts.decommission phab1001.eqiad.wmnet -t T323418
[x] remove phab1001 from site.pp https://gerrit.wikimedia.org/r/c/operations/puppet/+/858421
[x] check all the SRE boxes on decom ticket, assign to dcops in eqiad, add dcops tag
[x] resolve https://phabricator.wikimedia.org/T280597
[x] set OKR to 100% in Betterworks, profit