
move phabricator to new hardware generation
Closed, Resolved, Public

Description

Our Phabricator servers are getting old and will be routinely replaced.

T279176 is hardware refresh for phab1001

T280540 is to rack phab1002 (eqiad)

T279177 is hardware refresh for phab2001

T280544 is to rack phab2002 (codfw)

T323418 is to decom phab1001 (eqiad)

New hosts have been added to "insetup" role in https://gerrit.wikimedia.org/r/c/operations/puppet/+/681197 and will get DHCP and OS install in the tasks linked above.

Then they will be handed over to serviceops to move into production.


migration plan to switch from phab1001 to phab1004 in production:

  • announce downtime window of 30 min (November 21, 2022, 2pm PST, 10pm GMT, 2200 UTC)
  • stop Phabricator and everything on phab1001 (kill PHP processes, make sure it's dead)
  • merge puppet change in hieradata/common.yaml (https://gerrit.wikimedia.org/r/c/operations/puppet/+/858397)
  • run puppet on phab1004, see it change database config to use m3-master and port 3306 (instead of m3-slave and 3323) in the file /etc/phabricator/config.yaml
  • test phab web UI on phab1004 somehow before DNS is switched (ssh tunnel, foxy proxy)
  • if it looks good, merge DNS change for both discovery and SPF record (https://gerrit.wikimedia.org/r/c/operations/dns/+/858409), otherwise revert
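
The puppet step in the plan above can be spot-checked with a small shell helper. This is only a sketch: it assumes the database host name and port appear literally somewhere in /etc/phabricator/config.yaml and deliberately assumes nothing about the key names in that file. The function name `db_config_points_at_master` is hypothetical, and a plain string match like this could be fooled by an unrelated value that happens to contain "3323".

```shell
# Hypothetical helper: succeeds only if the config file mentions the master
# host and port and no longer mentions the replica host or its port.
# It matches plain strings and assumes nothing about the YAML key names.
db_config_points_at_master() {
    config_file=$1
    grep -q 'm3-master' "$config_file" || return 1
    grep -q '3306' "$config_file" || return 1
    if grep -q -e 'm3-slave' -e '3323' "$config_file"; then
        return 1
    fi
    return 0
}

# Example (run on phab1004 after the puppet run):
#   db_config_points_at_master /etc/phabricator/config.yaml && echo "config looks switched"
```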

topic branch: https://gerrit.wikimedia.org/r/q/topic:phab_migration

[x] announce migration window to: ops list, wikitech-l list, Slack
[x] schedule downtime for phab1001 and all services on it, via cookbook:
  [cumin2002:~] $ sudo cookbook sre.hosts.downtime -D 14 -r 'T322250' phab1001.eqiad.wmnet
[x] confirm downtime is active in Icinga web UI (https://icinga.wikimedia.org)
[x] disable puppet on phab1001: sudo disable-puppet 'T280597'
[x] stop Apache, PHP-FPM and phd on phab1001
  [phab1001:~] sudo systemctl stop apache2
  [phab1001:~] sudo systemctl stop php7.3-fpm
  [phab1001:~] sudo systemctl stop phd
[x] confirm there are no more PHP processes running
  [phab1001:~] sudo ps aux | grep php
[x] rsync /srv/repos diff by pulling on phab1004 from phab1001:
  [phab1004:/] (as root) rsync -avp --bwlimit=2m --delete rsync://phab1001.eqiad.wmnet/srv-repos/ /srv/repos/
[x] check on phab1004 if any files under /srv/repos are owned by UID 497 (vcs). if so, give them to user phd
  [phab1004:/] find /srv/repos -uid 497
  [phab1004:/] find /srv/repos -uid 497 -exec chown phd {} \;
  - find proved far too slow on a fresh rsync of the repos data. We used chown -R phd:phd instead, accepting that everything is phd:phd and not some mix of phd:phd and phd:www-data
[x] check on phab1004 if any files under /srv/repos are owned by GID 498 (aphlict). if so, give them to group phd
  [phab1004:/] find /srv/repos -gid 498
  [phab1004:/] find /srv/repos -gid 498 -exec chgrp phd {} \;
  - find proved far too slow on a fresh rsync of the repos data. We used chown -R phd:phd instead, accepting that everything is phd:phd and not some mix of phd:phd and phd:www-data
[x] check on phab1004 if any files under /srv/repos are owned by a user that is NOT phd
  [phab1004:/] find /srv/repos ! -user phd
[x] expect this to show the PHEX repo but nothing else. decide what to do with PHEX (root-owned)
  - Decision here: Only some stuff under here was root-owned, which seems likely to have been an artifact of some manual operation on phab1001
[x] output the full tree of /srv/repos and compare number of directories / files between both servers
  [phab1001:/] tree -upfg > /root/repos-tree (this file will be just under 500MB of text)
  [phab1001:/] tail /root/repos-tree
  [phab1004:/] tree -upfg > /root/repos-tree
  [phab1004:/] tail /root/repos-tree
[] optional: if not satisfied yet: copy result file from old server to new server (scp -3 ...) and run an actual diff between them
[x] set mysql ports for master and slave, specifically for eqiad (currently this happens in codfw but not in common hiera)
  merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/859145, run-puppet-agent, check what happens on phab1004
[x] merge re-revert of the phabricator server name in common Hiera, run puppet, watch the changes on phab1004 and phab2002
  https://gerrit.wikimedia.org/r/c/operations/puppet/+/860031
[x] run a scap deploy to phab1004
  (insert command, deployment server name)
[x] enable phd service on phab1004
  merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/859628 and run-puppet-agent
[x] wait a couple minutes and check phd is still running (how long?)
  (if killed by puppet for any reason, it'll be every puppet run...)
[x] merge re-revert of the DNS/SPF change
  https://gerrit.wikimedia.org/r/c/operations/dns/+/860032 and run "authdns-update" on ns0.wikimedia.org, syncs to other DNS servers
[x] wait about a minute and optionally use "dig phabricator.discovery.wmnet @ns0.wikimedia.org" to see it change from an alias for phab1001 to an alias for phab1004
[x] informational: dumps don't need to switch, they are already on phab1004, this has happened before
[x] informational: stats emails don't need to switch, they are already on phab1004, this has happened before

testing

[x] check https://phabricator.wikimedia.org works, watch out for yellow exclamation marks / warnings for admins
[x] test aphlict works by moving something on a workboard while someone else watches
[x] test if a ticket update shows up on IRC
[x] test if email from a ticket update arrives (by a user who has email notifications)
[x] check phabricator logs for exceptions (that aren't usual noise)
  (insert command / paths)
[x] test if CI works / "recheck" on a change in Gerrit

finalizing

[] merge patch to disable phd (and apache and php-fpm) on phab1001?
[x] verify proper monitoring downtime on phab1001
[x] reply to list emails and Slack that migration is done successfully, link to ticket in case they see any issues
[x] publish fingerprints on wikitech page

after migration is done and grace period (how long?):

[x] double check which settings can move to common Hiera, remove setting from hosts files in Hiera
[] merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/824412 and check puppet run
[] remove phab1001 from mysql grants, coordinate with DBA on merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/858419
[x] create decom ticket for phab1001 - https://phabricator.wikimedia.org/T323418
[x] remove production puppet role from phab1001, merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/824804
[x] run decom cookbook from a cumin host on phab1001
  [cumin2002:~] $ sudo cookbook sre.hosts.decommission phab1001.eqiad.wmnet -t T323418
[x] remove phab1001 from site.pp https://gerrit.wikimedia.org/r/c/operations/puppet/+/858421
[x] check all the SRE boxes on decom ticket, assign to dcops in eqiad, add dcops tag
[x] resolve https://phabricator.wikimedia.org/T280597
[x] set OKR to 100% in Betterworks, profit
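
For the "wait a couple minutes and check phd is still running" step, a polling loop is one option. This is a minimal sketch, not what was actually run: the helper name `still_running` is hypothetical, and the number of checks and the interval are guesses, since the plan itself leaves "how long?" open.

```shell
# Hypothetical helper: run a check command repeatedly and fail as soon as it
# fails once. Arguments: number of checks, seconds between checks, then the
# check command itself.
still_running() {
    checks=$1
    interval=$2
    shift 2
    i=0
    while [ "$i" -lt "$checks" ]; do
        "$@" || return 1
        sleep "$interval"
        i=$((i + 1))
    done
    return 0
}

# Example (values are illustrative, not from the plan):
#   still_running 5 60 systemctl is-active --quiet phd
```

To catch the "killed by puppet on every run" case, the total time covered would need to be longer than one puppet run interval.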


Event Timeline


Change 852965 had a related patch set uploaded (by Brennen Bearnes; author: Brennen Bearnes):

[phabricator/deployment@wmf/stable] scap targets: add phab1004.eqiad.wmnet

https://gerrit.wikimedia.org/r/852965

Change 852965 merged by Brennen Bearnes:

[phabricator/deployment@wmf/stable] scap targets: add phab1004.eqiad.wmnet

https://gerrit.wikimedia.org/r/852965

Change 853010 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/dns@master] remove phab1001-aphlict.eqiad.wmnet

https://gerrit.wikimedia.org/r/853010

Change 853010 merged by Dzahn:

[operations/dns@master] remove phab1001-aphlict.eqiad.wmnet

https://gerrit.wikimedia.org/r/853010

Mentioned in SAL (#wikimedia-operations) [2022-11-07T21:26:40Z] <mutante> DNS - removing phab1001-aphlict.eqiad.wmnet - should have no effect because we use aphlict.discovery.wmnet - but if it does, then it's Phabricator realtime notifications - T280597

mysql privileges have now been granted / fixed. T315713#8388257

We have what we need.. EXCEPT... and this largely explains why we could never connect to a database from codfw, _regardless_ of the grants:

If you connect to the readonly-DB, m3-slave, you need to use a different port, -P 3323 instead of the default 3306, which you have to use when connecting to m3-master.

We have always just changed the host name and our puppet code knows parameters, db_host, db_user, db_name, db_pass... but we never had an option or perceived need to change the port.

So.. we have to add that now and create a db_port parameter and set it based on "active_server" / "phabricator_server" which already changes the db_host name.
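
The host-to-port rule described above can be sketched in shell. The real fix was made in puppet, so this only illustrates the logic; the function name `db_port_for_host` is hypothetical, and the only facts taken from the text are the two port numbers (3323 for m3-slave, 3306 for m3-master).

```shell
# Hypothetical helper: pick the MySQL port from the database host name.
# Port numbers are the ones given above: 3323 for m3-slave, 3306 otherwise.
db_port_for_host() {
    case "$1" in
        m3-slave*) echo 3323 ;;
        *)         echo 3306 ;;
    esac
}

# Example: print the port for a replica host (prints 3323).
db_port_for_host "m3-slave.codfw.wmnet"
```

A manual connection test could then use something like `mysql -h "$host" -P "$(db_port_for_host "$host")"`, so the port follows the host name instead of being hardcoded.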

Change 856013 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] phabricator: add parameter for mysql port, set it to 3323 if using slave

https://gerrit.wikimedia.org/r/856013

Change 856013 merged by Dzahn:

[operations/puppet@production] phabricator: add parameter for mysql port, set it to 3323 if using slave

https://gerrit.wikimedia.org/r/856013

after https://gerrit.wikimedia.org/r/c/operations/puppet/+/856013 set the mysql_port to 3323 we still can NOT start the phd service on say phab2002.

Manual debugging with:

sudo -u phd /srv/phab/phabricator/bin/phd start

shows it is still the DB connection:

[2022-11-14 21:39:45] PHLOG: 'Retrying database connection to "m3-slave.codfw.wmnet" after connection failure (attempt 2; "AphrontConnectionQueryException"; error #2002): Attempt to connect to phabricatorphd@m3-slave.codfw.wmnet failed with error #2002: Connection timed out' at [/srv/deployment/phabricator/deployment-cache/revs/3137c9217337d2b6fd1312f328f7e366abbf0d75/phabricator/src/infrastructure/storage/connection/mysql/AphrontBaseMySQLDatabaseConnection.php:136]

phabricatorphd@m3-slave.codfw.wmnet failed with error #2002: Connection timed out

Change 857079 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] phabricator: enable dumping on phab1004

https://gerrit.wikimedia.org/r/857079

Change 857081 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] phabricator: use systemd::sysuser for phd user, also on phab1004

https://gerrit.wikimedia.org/r/857081

Change 857081 merged by Dzahn:

[operations/puppet@production] phabricator: use systemd::sysuser for phd user, also on phab1004

https://gerrit.wikimedia.org/r/857081

Change 857734 had a related patch set uploaded (by Brennen Bearnes; author: Brennen Bearnes):

[phabricator/deployment@wmf/stable] local settings: add mysql.port

https://gerrit.wikimedia.org/r/857734

Change 857736 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] phabricator: pass missing mysql.port paramater to local settings

https://gerrit.wikimedia.org/r/857736

Change 857736 merged by Dzahn:

[operations/puppet@production] phabricator: pass missing mysql.port paramater to local settings

https://gerrit.wikimedia.org/r/857736

Change 857734 merged by Brennen Bearnes:

[phabricator/deployment@wmf/stable] local settings: add mysql.port

https://gerrit.wikimedia.org/r/857734

Change 858397 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] hieradata: switch active Phabricator server to phab1004

https://gerrit.wikimedia.org/r/858397

We have also been hoping that we get a readonly Phabricator out of this in the inactive DC but this is now sourced out to T323312.

Change 858409 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/dns@master] phabricator: switch from phab1001 to phab1004, discovery and SPF

https://gerrit.wikimedia.org/r/858409

Change 858412 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/dns@master] update SPF record for phabricator.wikimedia.org, phab2001->phab2002

https://gerrit.wikimedia.org/r/858412

Change 858412 merged by Dzahn:

[operations/dns@master] update SPF record for phabricator.wikimedia.org, phab2001->phab2002

https://gerrit.wikimedia.org/r/858412

Change 824412 had a related patch set uploaded (by Dzahn; author: jbond):

[operations/puppet@production] O:phabricator: move common settings to role hiera

https://gerrit.wikimedia.org/r/824412

Change 858432 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] phabricator: switch logmail mails from phab1001 to phab1004

https://gerrit.wikimedia.org/r/858432

Change 858432 merged by Dzahn:

[operations/puppet@production] phabricator: switch logmail mails from phab1001 to phab1004

https://gerrit.wikimedia.org/r/858432

Change 857079 merged by Dzahn:

[operations/puppet@production] phabricator: enable dumping on phab1004

https://gerrit.wikimedia.org/r/857079

Change 858656 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] phabricator: remove hardcoded ports, use parameters in my.cnf for admins

https://gerrit.wikimedia.org/r/858656

Change 852259 merged by Dzahn:

[operations/puppet@production] dumps/distribution: move hardcoded host names to parameters

https://gerrit.wikimedia.org/r/852259

Change 858663 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] phabricator: stop creating public dump on phab1001

https://gerrit.wikimedia.org/r/858663

Change 824805 merged by Dzahn:

[operations/puppet@production] dumps/phabricator: switch phab dumps host from phab1001 to phab1004

https://gerrit.wikimedia.org/r/824805

Mentioned in SAL (#wikimedia-operations) [2022-11-18T23:13:38Z] <mutante> clouddumps1001 - manually ran /usr/local/bin/dump-fetch-phabdumps.sh and confirmed fetching works from new phab host phab1004 after gerrit:824805 T280597

Change 858663 merged by Dzahn:

[operations/puppet@production] phabricator: stop creating public dump on phab1001

https://gerrit.wikimedia.org/r/858663

Change 858662 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] dumps: remove phab1001 from rsync clients

https://gerrit.wikimedia.org/r/858662

Change 858656 merged by Dzahn:

[operations/puppet@production] phabricator: remove hardcoded ports, use parameters in my.cnf for admins

https://gerrit.wikimedia.org/r/858656

Change 858662 merged by Dzahn:

[operations/puppet@production] dumps: remove phab1001 from rsync clients

https://gerrit.wikimedia.org/r/858662

Mentioned in SAL (#wikimedia-operations) [2022-11-21T22:21:27Z] <dzahn@cumin2002> START - Cookbook sre.hosts.downtime for 2:00:00 on phab1001.eqiad.wmnet with reason: T280597

Mentioned in SAL (#wikimedia-operations) [2022-11-21T22:21:42Z] <dzahn@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on phab1001.eqiad.wmnet with reason: T280597

Mentioned in SAL (#wikimedia-operations) [2022-11-21T22:21:55Z] <brennen> downtiming and disabling phab1001 in preparation for migration to phab1004 (T280597)

Change 859147 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] phabricator: enable vcs on phab1004

https://gerrit.wikimedia.org/r/859147

Change 859147 merged by Dzahn:

[operations/puppet@production] phabricator: enable vcs on phab1004

https://gerrit.wikimedia.org/r/859147

Mentioned in SAL (#wikimedia-operations) [2022-11-22T01:25:03Z] <brennen> reverting to phab1001; short phabricator downtime incoming while DNS changes are made (T280597)

Mentioned in SAL (#wikimedia-operations) [2022-11-22T22:17:03Z] <mutante> phab1004 - rsyncing /srv/repos from phab1001 with 2Mbit bwlimit - pulling - rsync -avp --bwlimit=2m --delete rsync://phab1001.eqiad.wmnet/srv-repos/ /srv/repos/ - T280597

Mentioned in SAL (#wikimedia-operations) [2022-11-22T22:34:31Z] <mutante> phabricator: on phab1001 user 'phd' is UID 497, on pahb1004 user 'phd' is UID 920 (this is desired and a fix!) - but also..because uid 497 was now free.. it became the UID of user 'vcs' on phab1004 while on phab1001 user 'vcs' is uid 498. so we use "find /srv/repos -uid 497 -exec chown phd {} \;" to give files owned by 497 to phd. T280597

@brennen @thcipriani I checked everything again with the UID changes and rsync.

First I ran a fresh rsync pulling from phab1001 to phab1004, all of /srv/repos, with --delete as logged above.

Then made this table to show the transitions:

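| host-old | user | uid | gid | groups | host-new | user | uid | gid | groups |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| phab1001 | apache2modsec | 496 | 497 | 497 (apache2modsec) | phab1004 | apache2modsec | n/a | n/a | n/a |
| phab1001 | phd | 497 | 498 | 498 (phd) | phab1004 | phd | 920 | 920 | 920 (phd) |
| phab1001 | vcs | 498 | 498 | 498 (phd) | phab1004 | vcs | 497 | 497 | 920 (phd) |
| phab1001 | phab-deploy | 499 | 499 | 499 (phab-deploy) | phab1004 | phab-deploy | 499 | 499 | 499 (phab-deploy) |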

notes:

  • 500 and up is where shell users start
  • system users should NOT use low UIDs like on phab1001, they should all be created with systemd-sysuser and get a high 9xx UID
  • "phd" on phab1004 having UID 920 with matching GID 920 is the correct way and it was created by systemd-sysuser.
  • Once phab1001 is gone we should not have these problems anymore because finally UIDs stay the same globally. So no need to worry about changing rsync behaviour in regards to the numeric IDs.

So what you can see above is:

  • apache2modsec does not exist on phab1004. it does on phab1001 and historically used to get the 496/497 pair
  • phd was created after that by puppet and used to get the next free ones, the 497/498 pair
  • vcs then got 498, and is a member of the group phd, which is now confusingly "498:498" even though user and group are not the same
  • phab-deploy finally got 499

but, that's all wrong nowadays.

None of these users should use low UIDs like that.

Also we never want to have this issue again that UIDs change between servers.

So we fixed all this for the phd user, which is why it's a "nice" 920:920 on the new host.

We had also used "find .. -exec chown" to make sure all files that used to be owned by phd are still owned by phd.

But we did not do the same for the "vcs" user. And we did not account for it also getting a new UID number, because 497 became free after we stopped using it for phd.

So the fix here is to run more of those commands like this:

find /srv/repos -uid 497 -exec chown phd {} \;

which currently runs to give all the files owned by "vcs" to "phd" on phab1004.

If we do this carefully for all the transitions below and do NOT use just -R recursive we should be fine.

We need to add puppet fixes to also have "vcs" created by systemd-sysuser (and not sure about phab-deploy, it seems fine but it's also a system user with a 'bad' UID).

Once phab1001 is gone and we have done the above we will only use 920:920 / 921:921 or so and they will finally stay the same on all servers globally.

So there is no need to worry about rsync using numeric IDs or not.
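
Since starting one chown per file is slow on a tree this size, a variant that batches paths with `-exec ... {} +` may help. This is a sketch under assumptions: the helper names `count_owned_by_uid` and `reown_by_uid` are hypothetical, and whether batching is fast enough on the real /srv/repos data was not tested here.

```shell
# Hypothetical helpers for the UID cleanup described above.

# Count entries under a directory that are owned by a numeric UID.
count_owned_by_uid() {
    find "$1" -uid "$2" | wc -l
}

# Give everything owned by a numeric UID to a new owner. "{} +" passes many
# paths to each chown call instead of starting one process per file.
reown_by_uid() {
    find "$1" -uid "$2" -exec chown "$3" {} +
}

# Example (needs root; values from the discussion above):
#   count_owned_by_uid /srv/repos 497
#   reown_by_uid /srv/repos 497 phd
```

Unlike a blanket recursive chown, this only touches files that match the old numeric UID, so any intentional ownership differences elsewhere in the tree are left alone.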

Change 859145 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] phabricator: set mysql master port for eqiad

https://gerrit.wikimedia.org/r/859145

Change 859628 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] phabricator: let phd run on phab1004

https://gerrit.wikimedia.org/r/859628

Change 859631 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] phabricator: move some more settings from host file to common

https://gerrit.wikimedia.org/r/859631

files by user 497 should be owned by phd:

find /srv/repos -uid 497 -exec chown phd {} \;

files by group 498 should have group phd:

find /srv/repos -gid 498 -exec chgrp phd {} \;

After these commands have run we can verify by either:

tree -upfg (and grep -v for lines without phd and/or without www-data)

We can also use the output of tree -upfg and diff it on both machines.

Or we can just use find and negate user and/or gid:

find /srv/repos ! -user phd

This shows on _both_ servers that there is only one repo that has root-owned files, and it's PHEX.

So only under /srv/repos/PHEX some files have weird permissions but this is the case on the origin server, before rsync, as well.
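
Because ownership is expected to differ between the two hosts, a full diff of the two tree listings would be noisy. A lighter check is to compare only the final summary line that tree prints ("N directories, M files"). The sketch below assumes both listings have already been copied to one machine; the helper name `same_tree_summary` and the file names in the example are hypothetical.

```shell
# Hypothetical helper: compare only the last line of two saved "tree"
# listings (the "N directories, M files" summary), ignoring any ownership
# differences in the body of the listings.
same_tree_summary() {
    [ "$(tail -n 1 "$1")" = "$(tail -n 1 "$2")" ]
}

# Example (file names are illustrative):
#   same_tree_summary repos-tree.phab1001 repos-tree.phab1004 && echo "counts match"
```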

root@phab1001:/# find /srv/repos ! -user phd
/srv/repos/PHEX/objects/1b/cac80416fa72fc977ff5202b475a96b8d64b30
/srv/repos/PHEX/objects/a6/5bf552a15cf6bc744978eced8d7e7fd7c8124a
/srv/repos/PHEX/objects/78/7ecb5eba924bbc4ec6db3294c29a220130016c
/srv/repos/PHEX/objects/04/0d2e3b91b2710ef83fccf73654618e8b259a3c
/srv/repos/PHEX/objects/e0/75ce74ec752e7f564f38d182ecfd739af17dab
/srv/repos/PHEX/objects/8d/ea79dcb6cb922adec4b6241825fd2eadc02522
/srv/repos/PHEX/objects/cc/3c04457cb90ebd15a079b987f16c65ce52aa19
/srv/repos/PHEX/objects/08/46ff4b8ad566d836f2202260f39bdc8fa598a9
/srv/repos/PHEX/objects/fb/6b0ccb36349949e6434d4c9f5c38d8007921bf
/srv/repos/PHEX/objects/17/563164cfe8441d6c203540ae796fb59714fbf6
/srv/repos/PHEX/objects/47/53f61bfcc840925a2f15edbb4b73c0fb6a566f
/srv/repos/PHEX/objects/c2/2c04c8a5e842a5c0fdfb020cddc3418f82c3d6
/srv/repos/PHEX/objects/fe/3c359efb09b0008a21c99c1cdf42930363d95e
/srv/repos/PHEX/objects/d3/bc472b2c9f7b4f1c5610c3caca391dbd404748
/srv/repos/PHEX/objects/95/d9a265d12483b03b5264c5c83f3d150f84f7e5
/srv/repos/PHEX/objects/bf/abc05599455b0e33b4400feb4207d00f1749e7
/srv/repos/PHEX/objects/6c/f7de432bda3a2070a22a59dd7f192a63083139
/srv/repos/PHEX/objects/7c/6de9ad4daab84d1b4434eec768630e29460c63
/srv/repos/PHEX/objects/94/26180b65b94e252365bd86e66115f9ab4c9ec2
/srv/repos/PHEX/objects/94/29927ec1b15de217b0dc29e01372728dc149c6
/srv/repos/PHEX/objects/c0/30e214a39ef4614fa2867a4bd45893b3bf0abc
root@phab1001:/#

created migration plan v2 (this is the checklist in the task description).

Mentioned in SAL (#wikimedia-operations) [2022-11-28T22:00:01Z] <brennen> phabricator: phab1001 -> phab1004 migration starting soon; downtime expected (T280597)

Change 861489 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] phabricator: quote mysql port numbers

https://gerrit.wikimedia.org/r/861489

Change 861489 merged by Dzahn:

[operations/puppet@production] phabricator: quote mysql port numbers

https://gerrit.wikimedia.org/r/861489

Change 861490 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] phabricator: change db ports to strings in tools class

https://gerrit.wikimedia.org/r/861490

Change 861490 merged by Dzahn:

[operations/puppet@production] phabricator: change db ports to strings in tools class

https://gerrit.wikimedia.org/r/861490

Change 861491 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] phabricator: switch mysql slave port for logmail to string

https://gerrit.wikimedia.org/r/861491

Change 861491 merged by Dzahn:

[operations/puppet@production] phabricator: switch mysql slave port for logmail to string

https://gerrit.wikimedia.org/r/861491

Mentioned in SAL (#wikimedia-operations) [2022-11-28T23:22:26Z] <brennen@deploy1002> Started deploy [phabricator/deployment@f68dc24]: deploy config changes for mysql-port-as-string (T280597)

Mentioned in SAL (#wikimedia-operations) [2022-11-28T23:23:21Z] <brennen@deploy1002> Finished deploy [phabricator/deployment@f68dc24]: deploy config changes for mysql-port-as-string (T280597) (duration: 00m 55s)

phabricator switched to phab1004

This has happened. It will just stay open until the old server is actually decom'ed. This will continue after a grace period of a couple days.

and (less important) https://wikitech.wikimedia.org/wiki/Phab1004 ? :) TIA

Do we still need server pages on wikitech? It's a bit of a carry over from the days before current tools such as netbox and puppet

@taavi: Uh, neat. I wasn't aware! I was looking for the ED25519 fingerprint because of the local ssh terminal output

The authenticity of host 'phab1004 (<no hostip for proxy command>)' can't be established.
ED25519 key fingerprint is SHA256:4o62NwzZw98w4u/DF1vkgMM58LOKQ/U0Ne0XwNiW70Y.

and that doesn't seem to be on those two pages linked above...

> Do we still need server pages on wikitech? It's a bit of a carry over from the days before current tools such as netbox and puppet

Personally I think they are still useful for the (few) servers where we have SSH shell users outside of SRE, mainly deploy*, mwmaint* and this case here but I wouldn't bother with them for clusters like mw*. So basically "if it has admin groups besides root".

May I ask for https://wikitech.wikimedia.org/wiki/Help:SSH_Fingerprints/phab1004.eqiad.wmnet

done. https://wikitech.wikimedia.org/wiki/Phab1004

Change 824804 merged by Dzahn:

[operations/puppet@production] phabricator: remove production role from phab1001

https://gerrit.wikimedia.org/r/824804

Mentioned in SAL (#wikimedia-operations) [2022-12-05T19:57:54Z] <mutante> phab1004 (prod) - removing phab1001 from firewall rules, rsync config | phab1001 (formerly prod) - removing prod role T323418 T280597

Change 824412 merged by Dzahn:

[operations/puppet@production] O:phabricator: move host based settings to role hiere

https://gerrit.wikimedia.org/r/824412

Mentioned in SAL (#wikimedia-operations) [2022-12-05T21:55:14Z] <mutante> deleting special DNS entries for "phab10010-vcs.eqiad.wmnet", IPv4 and IPv6 (Role: VIP), from netbox - T280597

Change 859631 abandoned by Dzahn:

[operations/puppet@production] phabricator: move some more settings from host file to common

Reason:

now duplicate of https://gerrit.wikimedia.org/r/c/operations/puppet/+/824412

https://gerrit.wikimedia.org/r/859631

Change 865208 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] phabricator: rm code from before system user was created with systemd

https://gerrit.wikimedia.org/r/865208

Change 865208 merged by Dzahn:

[operations/puppet@production] phabricator: rm code from before system user was created with systemd

https://gerrit.wikimedia.org/r/865208

Related commit b3d114e1 pushed by brennen (author: Brennen Bearnes):

[ repos/phabricator/deployment@wmf/stable ] scap targets: add phab1004.eqiad.wmnet

Related commit 1fa5fd74 pushed by brennen (author: Brennen Bearnes):

[ repos/phabricator/deployment@wmf/stable ] local settings: add mysql.port