Setup rsync for phab data on disk
Closed, Resolved | Public | 3 Estimated Story Points

Description

Differential repos and home directories (must. save. @mmodell's bash history.) on phabricator need to be rsynced to our replacement hardware

Steps

  • Determine what data needs to be synced (/srv/repos, /srv/dumps, /home, others?) and what users own the data (phd/www-data)
  • rsync::quickdatacopy used in a (migration) puppet role
  • Ensure firewall rules exist for rsync between phab{1,2}00{1,2}
  • Check privileges for UID matching across hosts for rsync (if needed) OR ensure users have the same UID on both hosts (new UID 920 on new hosts, properly reserved but has to change)
  • Run a sync pre-maintenance window (in progress)
  • Ensure there's a step in the migration for running rsync during the maintenance window

Acceptance criteria

  • Identical data exists on the new hosts owned by the correct user

Event Timeline

@thcipriani if that particular bash history is critical, should we persist it somewhere more resilient?

In the short-term it will be. My hope is it will be obsolete Soon™. I believe preserving home directories for the move should be sufficient.

Earlier this week I mentioned an rsync issue where user and group names get mixed up. That caused us a lot of havoc when switching the contint machines a couple of years ago and, I think, when we moved Gerrit. We had to find/chown pretty much all files.

I finally found the comment I made at the time T224591#6124875 with the fix being:

update rsyncd.conf, which has "use chroot" set; that defaults to forcing numeric ids and thus prevents the name/id mapping from occurring.

I think that comes from the quickdatacopy Puppet class being applied and the rsync module defaulting to use chroot. Since rsync runs in a chroot, it does not have access to /etc/passwd and only transfers based on UID. So it should run outside of a chroot (and we should ensure numeric ids is not used).

Or as Daniel said: ensure all owners of files that are to be synced have the same UID and GID on the source and target hosts.
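To check that second option before syncing, the passwd entries of both hosts can be compared. A minimal sketch, assuming the output of "getent passwd" from each host has been saved to a local file; the helper names ids_for and same_ids are made up for illustration:

```shell
# ids_for USER PASSWD_FILE
# Print "uid:gid" for USER from a passwd-format file (fields are
# name:password:uid:gid:...). Prints nothing if USER is not present.
ids_for() {
  awk -F: -v u="$1" '$1 == u { print $3 ":" $4 }' "$2"
}

# same_ids USER FILE_A FILE_B
# Succeed only if USER exists in both files with an identical uid:gid.
same_ids() {
  a=$(ids_for "$1" "$2")
  b=$(ids_for "$1" "$3")
  [ -n "$a" ] && [ "$a" = "$b" ]
}

# Example (file names are placeholders):
#   same_ids phd passwd.source passwd.target || echo "phd differs"
```

If same_ids fails for an owner of synced files, either align the IDs first or plan for a chown after the transfer.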

thcipriani set the point value for this task to 3.

(must. save. @mmodell's bash history.)

I made a phab1001-home-twentyafterfour.tar.gz of the entire home directory and then copied it over to /home/dzahn and /home/thcipriani on people1003.

people* hosts have a backup::set for home, so that ends up in Bacula soonish.

phab* hosts have /srv/repos in Bacula but not home dirs or other paths.

from syncing data last time back in 2019

https://gerrit.wikimedia.org/r/c/operations/puppet/+/554628

@Dzahn how far have you progressed? @jnuche and I are pairing on sprint work tomorrow and I thought to work on this task together, but I don't want to duplicate work.

@dduvall You guys could double check if you think we need anything _in addition_ to /srv/repos. Because that is what we did last time and what we have in backups. Besides that you can leave the rest to me. I have not uploaded the patches yet but already know what they will look like.

So I knew either we already had this code from last time or we had applied it in a special "migration" puppet role to then remove it again after migration was complete. I checked puppet code and I found it is already there except we don't use the "rsync::quickdatacopy" abstraction here that we use in lots of other places including mwmaint, releases etc. and is mentioned in the ticket description.

For Phabricator we are using rsync::server::modules directly. There are 2 of them, one for the aforementioned /srv/repos from other Phabricator servers and then one for /srv/dumps from dumps servers.

The setup is always that any _other_ servers (can) pull from the one source of truth. Not the other way around, no "push to new servers", only "pull from current server".

checking the firewall rules. an iptables -L on phab1001 shows:

ACCEPT     tcp  --  phab1001.eqiad.wmnet  anywhere             tcp dpt:rsync
ACCEPT     tcp  --  phab2001.codfw.wmnet  anywhere             tcp dpt:rsync
ACCEPT     tcp  --  phab2002.codfw.wmnet  anywhere             tcp dpt:rsync
ACCEPT     tcp  --  labstore1006.wikimedia.org  anywhere             tcp dpt:rsync
ACCEPT     tcp  --  labstore1007.wikimedia.org  anywhere             tcp dpt:rsync

so the current AND new codfw server and the dumps server are allowed on the rsync port.
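The same check can be scripted. A rough sketch, assuming the "iptables -L" output format shown above (target in column 1, source host in column 4); the function name is hypothetical:

```shell
# rsync_allowed_hosts
# Read `iptables -L` output on stdin and print the source hosts that are
# ACCEPTed on the rsync port (shown by iptables as "dpt:rsync").
rsync_allowed_hosts() {
  awk '$1 == "ACCEPT" && /dpt:rsync/ { print $4 }'
}

# Example (needs root):
#   iptables -L | rsync_allowed_hosts
```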

Then the next step is rsyncd itself and hosts allowed in it.

/etc/rsync.d/frag-srv-repos is the relevant snippet for /srv/repos and contains:

hosts allow = phab1001.eqiad.wmnet phab2001.codfw.wmnet phab2002.codfw.wmnet localhost

path            = /srv/repos
read only       = yes
write only      = no

^ all of the allowed hosts can pull from phab1001, none of them can write to it.

This means we should start with pulling _on phab2002 from phab1001_ and replace phab2001.

No puppet change needed..should just work.
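For reference, a pull on the new host could look roughly like the sketch below. The module name "srv-repos" in the example is a guess based on the snippet file name (check the rsyncd configuration on phab1001 for the real one), and the helper, whose name is made up, only prints the command so it can be reviewed before anything is run:

```shell
# pull_cmd SOURCE_HOST MODULE DEST_DIR
# Print (do not run) an rsync command that pulls MODULE from the rsync
# daemon on SOURCE_HOST into DEST_DIR. The trailing slashes make rsync
# copy the contents of the module into DEST_DIR itself.
pull_cmd() {
  src_host=$1
  module=$2
  dest=${3%/}   # drop one trailing slash so we do not emit "//"
  printf 'rsync -av --bwlimit=1000 rsync://%s/%s/ %s/\n' \
    "$src_host" "$module" "$dest"
}

# Example, run on phab2002 (module name is an assumption):
#   pull_cmd phab1001.eqiad.wmnet srv-repos /srv/repos
```

Adding --dry-run to the printed command is a cheap way to see what would be transferred first.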

regarding the UIDs and possible privilege issues after rsync:

  • the owner of directories and files under /srv/repos is phd:www-data or phd:phd on phab1001
  • the owner of directories and files under /srv/repos is apache2modsec:www-data or apache2modsec:apache2modsec on phab2001
  • the UID of user phd is uid=497(phd) gid=498(phd) groups=498(phd) on both phab1001 and phab2001, no mismatch here but see above
  • phab2002 does not have a phd user yet since that will only be created once we can apply the phabricator puppet role

Change 817811 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] phabricator: add phabricator-roots on new phabricator hardware

https://gerrit.wikimedia.org/r/817811

Dzahn changed the task status from Open to In Progress. Jul 27 2022, 3:59 PM

Change 818183 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] admin: add gerrit access groups to gerrit migration role

https://gerrit.wikimedia.org/r/818183

^ sorry, I mixed up tickets for phabricator and gerrit because we are working on both at the same time. Did we also have one for rsync and gerrit?

Change 818183 merged by Dzahn:

[operations/puppet@production] admin: add gerrit access groups to gerrit migration role

https://gerrit.wikimedia.org/r/818183

Change 817811 merged by Dzahn:

[operations/puppet@production] phabricator: add phabricator-roots on new phabricator hardware

https://gerrit.wikimedia.org/r/817811

members of the shell admin group "phabricator-roots" (@20after4, @thcipriani, @brennen, @jnuche, @hashar, @demon, @dancy, @dduvall)

You just got new ssh access to new replacement hardware hosts:

  • phab1004.eqiad.wmnet
  • phab2002.codfw.wmnet

Dzahn triaged this task as High priority. Jul 29 2022, 6:36 PM

With the changes above (T313360#8105754) it's now possible to rsync from old to new phab hosts.

I will next do a "pre" rsync of /srv/repos from phab1001 to phab1004 (and from phab2001 to phab2002? or rather from phab1001 to both new hosts? also see T313360#8106017)

Question for you guys would be if I'm missing anything. So far we are talking about /srv/repos and /srv/dumps (T313360#8105645) both for what is in backups and what we are copying for the migration. That's also what was done last time since it was already puppetized. Do we need to add anything though?

Change 818513 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] phabricator: use migration role for pre-syncing data

https://gerrit.wikimedia.org/r/818513

Change 818513 merged by Dzahn:

[operations/puppet@production] phabricator: use migration role for pre-syncing data

https://gerrit.wikimedia.org/r/818513

Mentioned in SAL (#wikimedia-operations) [2022-07-29T22:01:12Z] <mutante> phab1001 - rsync -avp --bwlimit=1000 /srv/repos/ rsync://phab1004.eqiad.wmnet/phabricator-srv-repos (running slowly inside a screen session as root) (T313360, T280597)

Mentioned in SAL (#wikimedia-operations) [2022-08-01T20:57:51Z] <mutante> phab1001 - rsyncing repo data /srv/repos/ to phab2002 (in addition to phab1004 previously) T313360

Mentioned in SAL (#wikimedia-operations) [2022-08-15T22:33:51Z] <mutante> rsyncing /srv/repos and /srv/dumps from phab1001 to phab2002 before applying prod puppet role (T313360)

@thcipriani So far I am expecting to copy /srv/repos and /srv/dumps from old to new phab servers. (and it has already happened from 1001 -> 1004 and from 1001 -> 2002) with /home/ being a "maybe" still.

It would be nice to get a couple of other eyes on this, to check whether there is more than that we should copy.

Other things in /srv on phab1001 are:

871M	/srv/deployment
1.6G	/srv/dumps
4.0K	/srv/git.wikimedia.org
16K	/srv/lost+found
0	/srv/phab
57G	/srv/repos

/srv/repos is also 57G on phab1004 and phab2002 already

/srv/dumps is also 1.6G on phab1004 and phab2002 already
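Matching du totals is only a coarse check. A sketch of a stricter one, assuming GNU find (for -printf) on both hosts: build a sorted manifest of path, owner:group names, mode and file size on each side, then diff the two. The function name manifest is hypothetical. Owner names rather than numeric IDs are compared on purpose, since numeric IDs may differ between hosts.

```shell
# manifest DIR
# Print one tab-separated line per entry under DIR:
#   relative-path  owner:group  octal-mode  size
# Size is only recorded for regular files ("-" otherwise), because
# directory sizes depend on the filesystem. Output is sorted bytewise so
# that manifests from two hosts can be compared with diff.
manifest() {
  (
    cd "$1" &&
      find . \( -type f -printf '%P\t%u:%g\t%m\t%s\n' \) \
        -o -printf '%P\t%u:%g\t%m\t-\n'
  ) | LC_ALL=C sort
}

# Example: run on each host, copy the files together, then diff them:
#   manifest /srv/repos > "/tmp/repos.$(hostname).manifest"
```

This does not read file contents, so it is cheap even on 57G, but it will not notice a file whose bytes changed while its size stayed the same.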

regarding the UIDs.. user 'phd' has a reserved UID of 498. per docs (https://wikitech.wikimedia.org/wiki/UID) it's supposed to be 498:498.

But .. it's not in existing prod. It is:

phab1001: 497:498
phab2001: 497:498
phab2002: 498:498
phab1004: n/a, not yet created

So basically it's wrong right now and correct on the new server but that means _not the same_.

Mentioned in SAL (#wikimedia-operations) [2022-08-16T23:37:44Z] <mutante> phab2002 - chown -R phd:www-data /srv/repos/ (because of UID mismatch) T313360

Mentioned in SAL (#wikimedia-operations) [2022-08-16T23:44:17Z] <mutante> phab1001 - repeated rsync of /srv/repos to phab2002, then chown -R phd /srv/repos/ (without setting the group) - this way UID is fixed and privs match exactly phab1001 - T313360

Change 823765 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] phabricator::migration: add phd user with sysmted::sysuser

https://gerrit.wikimedia.org/r/823765

Change 823767 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] phabricator: replace user{} with systemd::sysuser for daemon user

https://gerrit.wikimedia.org/r/823767

After review comments on https://gerrit.wikimedia.org/r/c/operations/puppet/+/823765 it turns out we should use a new UID in the range over 900.

So what we are going to do now is change it to 920 on new servers! (and not change it on old servers)

https://gerrit.wikimedia.org/r/c/operations/puppet/+/823765

edited here and in admin.yaml

https://wikitech.wikimedia.org/w/index.php?title=UID&type=revision&diff=2004642&oldid=2004195

Change 823765 merged by Dzahn:

[operations/puppet@production] phabricator::migration: add phd with systemd::sysuser, reserve UID 920

https://gerrit.wikimedia.org/r/823765

Mentioned in SAL (#wikimedia-operations) [2022-08-17T23:23:17Z] <mutante> phab2002 - chmod -R phd /srv/repos | find /srv/repos/ -gid 498 -exec chown phd:phd {} \; T313360

Change 823767 merged by Dzahn:

[operations/puppet@production] phabricator: replace user{} with systemd::sysuser, only on new hosts

https://gerrit.wikimedia.org/r/823767

Dzahn updated the task description.
Dzahn updated the task description.

Change 824782 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] phabricator: avoid duplicate declaration mixing group{} and systemd::sysuser

https://gerrit.wikimedia.org/r/824782

Change 824782 merged by Dzahn:

[operations/puppet@production] phabricator: avoid duplicate declaration mixing group{} and systemd::sysuser

https://gerrit.wikimedia.org/r/824782

Change 824796 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] phabricator: don't use systemd::sysuser on phab2002 for now

https://gerrit.wikimedia.org/r/824796

Change 824796 merged by Dzahn:

[operations/puppet@production] phabricator: don't use systemd::sysuser on phab2002 for now

https://gerrit.wikimedia.org/r/824796

after we removed the LVS (git-ssh) setup part, worked on reserving a proper UID for the phd systemuser, using the new systemd::sysuser instead of user:: and some follow-up issues.. we could apply the phabricator role to phab2002

BUT.. it still led to a new problem.. which actually broke sshd config on the server. This is also related to the fact that phab servers have (soon had) 2 separate SSH servers, one for git-ssh. Then the main sshd gets reconfigured to only listen on its specific IP and not the service IP.. but the IP is in Hiera and was applied to all of codfw.. so if you have more than one host in codfw you get the IP for 2001 on 2002.. it can't apply it.. still tries to restart sshd.. fails.. sshd is down... icinga starts prod alerts about systemd status.. then about sshd being down...
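A pre-flight check along these lines might have caught this before sshd was restarted. It is only a sketch: it assumes plain "ListenAddress <ip>" lines without a port suffix, the function names are hypothetical, and the list of local addresses is passed in as a file with one address per line.

```shell
# listen_addrs SSHD_CONFIG
# Print the value of every ListenAddress line (keyword is case-insensitive).
listen_addrs() {
  awk 'tolower($1) == "listenaddress" { print $2 }' "$1"
}

# unassigned_listen_addrs SSHD_CONFIG LOCAL_ADDRS_FILE
# Print each ListenAddress that does not appear in LOCAL_ADDRS_FILE,
# i.e. an address sshd would be asked to bind that this host does not have.
unassigned_listen_addrs() {
  listen_addrs "$1" | while read -r addr; do
    grep -qxF -- "$addr" "$2" || printf '%s\n' "$addr"
  done
}

# Example (the temp file path is a placeholder):
#   ip -o addr show | awk '{ print $4 }' | cut -d/ -f1 > /tmp/local-addrs
#   unassigned_listen_addrs /etc/ssh/sshd_config /tmp/local-addrs
```

Any output means the config names an address the host does not own. Note that "sshd -t" alone would not flag this, since it only validates the configuration file and keys.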

I still had a session open so I could stop puppet, fix sshd_config, restart ssh...

23:03 <+icinga-wm> RECOVERY - SSH on phab2002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) 
                   https://wikitech.wikimedia.org/wiki/SSH/monitoring
23:04 <+icinga-wm> RECOVERY - Check systemd state on phab2002 is OK: OK - running: The system is fully operational 
                   https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state

Change 824797 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] phabricator: fix sshd listen address for phab codfw

https://gerrit.wikimedia.org/r/824797

Change 824797 merged by Dzahn:

[operations/puppet@production] phabricator: fix sshd listen address for phab codfw

https://gerrit.wikimedia.org/r/824797

Change 826915 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] admin: add reserved gid 920 for phd, phabricator user

https://gerrit.wikimedia.org/r/826915

Change 826915 merged by Dzahn:

[operations/puppet@production] admin: add reserved gid 920 for phd, phabricator user

https://gerrit.wikimedia.org/r/826915

Mentioned in SAL (#wikimedia-operations) [2022-10-13T19:38:54Z] <mutante> rsyncing /srv/repos from phab1001 to 3 other phab servers (with bw limit) - T313360

Change 842873 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] phabricator: rename rsync module for dumps

https://gerrit.wikimedia.org/r/842873

Change 842875 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] phabricator: move list of dumps rsync clients to parameter and Hiera

https://gerrit.wikimedia.org/r/842875

Change 842878 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] phabricator: use anchor/alias to add phab servers to dump clients list

https://gerrit.wikimedia.org/r/842878

Change 842873 merged by Dzahn:

[operations/puppet@production] phabricator: rename rsync module for dumps

https://gerrit.wikimedia.org/r/842873

Change 842875 merged by Dzahn:

[operations/puppet@production] phabricator: move list of dumps rsync clients to parameter and Hiera

https://gerrit.wikimedia.org/r/842875

Change 844048 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] phabricator: temp add other phab hosts to dump client hosts

https://gerrit.wikimedia.org/r/844048

Change 844048 merged by Dzahn:

[operations/puppet@production] phabricator: temp add other phab hosts to dump client hosts

https://gerrit.wikimedia.org/r/844048

Mentioned in SAL (#wikimedia-operations) [2022-10-18T18:58:48Z] <mutante> rsyncing phab dump file - pull from phab1000 to all other hosts T313360

Change 844057 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] phabricator: create /srv/homes and allow rsyncing it

https://gerrit.wikimedia.org/r/844057

Change 844057 merged by Dzahn:

[operations/puppet@production] phabricator: create /srv/homes and allow rsyncing it

https://gerrit.wikimedia.org/r/844057

Mentioned in SAL (#wikimedia-operations) [2022-10-18T20:50:01Z] <mutante> phabricator - on new machines, find / -uid 497 -exec chown phd {}\; to fix privileges. (and then the same for -gid 498) The user phd used to be 497:498 (pid:gid) on old hosts but has been replaced with proper systemd system user using 920:920 T313360

@thcipriani @LSobanski status update:

  • /srv/repos has been synced from phab1001 to phab1004 and phab2002
  • /srv/dumps has been synced from phab1001 to phab1004 and phab2002
  • home directories on both phab1001 and phab2001 have been tarballed and copied over to both phab1004 and phab2002 in /srv/homes; root can read them, but users cannot read each other's
  • phd (920:920) is using a proper ID number, same for uid and gid and is properly created by modern systemd::sysuser class
  • all files previously owned by "uid 497 (phd)" have been switched to be owned by "uid 920 (phd)"
  • all files previously owned by "gid 498 (phd)" have been switched to be group-owned by "gid 920 (phd)"
  • this was done in 2 steps, NOT with recursive chown/chgrp. Instead it was done once for UID and once for GID. So the state has been exactly copied from phab1001 for any combination of "phd:www-data" etc when it comes to ownership of files under /srv/repos
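
The two-step approach can be sketched as below. This is an illustration under assumptions, not the exact commands that were run: the function names are made up, -xdev is added to stay on one filesystem, and chown/chgrp -h is used so that symlinks themselves are changed instead of their targets.

```shell
# list_owned_by ROOT KIND ID
# Print paths under ROOT whose numeric uid (KIND=uid) or gid (KIND=gid)
# equals ID. Useful as a dry run before changing anything.
list_owned_by() {
  root=$1 kind=$2 id=$3
  case $kind in
    uid) find "$root" -xdev -uid "$id" -print ;;
    gid) find "$root" -xdev -gid "$id" -print ;;
    *) echo "kind must be uid or gid" >&2; return 2 ;;
  esac
}

# remap_ids ROOT OLD_UID OLD_GID NEW_USER NEW_GROUP
# Step 1 changes only the owner of files owned by OLD_UID.
# Step 2 changes only the group of files whose group is OLD_GID.
# A file owned by OLD_UID with some other group (e.g. www-data) keeps
# that group, which a recursive "chown -R user:group" would overwrite.
remap_ids() {
  root=$1 old_uid=$2 old_gid=$3 new_user=$4 new_group=$5
  find "$root" -xdev -uid "$old_uid" -exec chown -h "$new_user" {} +
  find "$root" -xdev -gid "$old_gid" -exec chgrp -h "$new_group" {} +
}

# Example (as root):
#   list_owned_by /srv/repos uid 497 | head
#   remap_ids /srv/repos 497 498 phd phd
```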

@thcipriani Now all that's left is the "Ensure there's a step in the migration for running rsync during the maintenance window". I am just not sure anymore if we already started a doc for that elsewhere.

\o/

I don't think we have started a doc with a step-by-step for the maintenance switchover window. I was just talking with @brennen about what's left to be done here. It sounds like we should set up a time to talk about and prep the actual switchover, is that right?

@thcipriani Yes, next thing that is needed here is a deployment of phabricator to host phab1004. It looks like it has already happened on phab2002 but not on 1004 yet.

Dzahn updated the task description.

new etherpad started, rsync commands added:

https://etherpad.wikimedia.org/p/Phabricator-migration-2022

this resolves this subtask

Change 842878 abandoned by Dzahn:

[operations/puppet@production] phabricator: use anchor/alias to add phab servers to dump clients list

Reason:

already added the list "manually"

https://gerrit.wikimedia.org/r/842878

Mentioned in SAL (#wikimedia-releng) [2022-10-28T06:23:33Z] <hashar> devtools: set profile::phabricator::main::dumps_rsync_clients: [] project wide to fix up Puppet. Settings got moved to a role ( https://gerrit.wikimedia.org/r/c/operations/puppet/+/842875 | T313360 )

Change 850542 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] devtools: add profile::phabricator::main::dumps_rsync_clients: []

https://gerrit.wikimedia.org/r/850542

Change 850542 merged by Dzahn:

[operations/puppet@production] devtools: add profile::phabricator::main::dumps_rsync_clients: []

https://gerrit.wikimedia.org/r/850542

@hashar thanks for this fix! my bad. added in repo!