
Setup rsync for phab data on disk
Open, In Progress, High · Public · 3 Estimated Story Points

Description

Differential repos and home directories (must. save. @mmodell's bash history.) on phabricator need to be rsynced to our replacement hardware

Steps

  • Determine what data needs to be synced (/srv/repos, /srv/dumps, /home, others?) and what users own the data (phd/www-data)
  • rsync::quickdatacopy used in a (migration) puppet role
  • Ensure firewall rules exist for rsync between phab{1,2}00{1,2}
  • Check privileges for UID matching across hosts for rsync (if needed) OR ensure users have the same UID on both hosts (new UID 920 on new hosts, properly reserved but has to change)
  • Run a sync pre-maintenance window (in progress)
  • Ensure there's a step in the migration for running rsync during the maintenance window

Acceptance criteria

  • Identical data exists on the new hosts owned by the correct user

Event Timeline

@thcipriani if that particular bash history is critical, should we persist it somewhere more resilient?

In the short-term it will be. My hope is it will be obsolete Soon™. I believe preserving home directories for the move should be sufficient.

Earlier this week I mentioned an rsync issue where user and group names get mixed up. That caused us a lot of havoc when we switched the contint machines a couple of years ago and, I think, when we moved Gerrit. We had to find/chown pretty much all files.

I finally found the comment I made at the time (T224591#6124875), with the fix being:

update rsyncd.conf, which has "use chroot" set; that defaults to forcing numeric ids and thus prevents the name/id mapping from occurring.

I think that comes from the quickdatacopy Puppet class being applied and the rsync module defaulting to use chroot. Since rsync runs in a chroot, it does not have access to /etc/passwd and only transfers based on UID. So it should run outside of a chroot (and we should ensure numeric ids is not used).

Or as Daniel said: ensure all owners of files that are to be synced have the same UID and GID on the source and target hosts.
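A minimal sketch of how the "same UID and GID on both hosts" check could be scripted. It assumes the passwd entries from each host have first been saved to files (for example with `getent passwd phd`); the function names and file names are hypothetical and not part of any existing tooling.

```shell
# Sketch: compare a user's numeric uid:gid between two passwd-format files,
# e.g. `getent passwd phd` output collected on the source and target hosts.

# Print "uid:gid" for user $1 as found in passwd-format file $2.
uid_gid_of() {
    awk -F: -v u="$1" '$1 == u { print $3 ":" $4; exit }' "$2"
}

# Succeed only if user $1 exists in both files ($2 and $3) with identical uid:gid.
same_ids() {
    src_ids=$(uid_gid_of "$1" "$2")
    dst_ids=$(uid_gid_of "$1" "$3")
    [ -n "$src_ids" ] && [ "$src_ids" = "$dst_ids" ]
}
```

Usage would look like `same_ids phd old-host.passwd new-host.passwd || echo "phd differs"`. If the IDs differ, a numeric-ID transfer leaves files owned by the wrong account on the target.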

thcipriani set the point value for this task to 3.

(must. save. @mmodell's bash history.)

I made a phab1001-home-twentyafterfour.tar.gz of the entire home directory and then copied it over to /home/dzahn and /home/thcipriani on people1003.

people* hosts have a backup::set for home, so that ends up in Bacula soonish.

phab* hosts have /srv/repos in Bacula but not home dirs or other paths

from syncing data last time back in 2019

https://gerrit.wikimedia.org/r/c/operations/puppet/+/554628

@Dzahn how far have you progressed? @jnuche and I are pairing on sprint work tomorrow and I thought to work on this task together, but I don't want to duplicate work.

@dduvall You guys could double check if you think we need anything _in addition_ to /srv/repos. Because that is what we did last time and what we have in backups. Besides that you can leave the rest to me. I have not uploaded the patches yet but already know what they will look like.

So I knew that either we already had this code from last time, or we had applied it in a special "migration" puppet role and removed it again after the migration was complete. I checked the puppet code and found it is already there, except that here we don't use the "rsync::quickdatacopy" abstraction, which we use in lots of other places (including mwmaint and releases) and which is mentioned in the ticket description.

For Phabricator we are using rsync::server::modules directly. There are 2 of them, one for the aforementioned /srv/repos from other Phabricator servers and then one for /srv/dumps from dumps servers.

The setup is always that any _other_ servers (can) pull from the one source of truth. Not the other way around, no "push to new servers", only "pull from current server".

checking the firewall rules. an iptables -L on phab1001 shows:

ACCEPT     tcp  --  phab1001.eqiad.wmnet  anywhere             tcp dpt:rsync
ACCEPT     tcp  --  phab2001.codfw.wmnet  anywhere             tcp dpt:rsync
ACCEPT     tcp  --  phab2002.codfw.wmnet  anywhere             tcp dpt:rsync
ACCEPT     tcp  --  labstore1006.wikimedia.org  anywhere             tcp dpt:rsync
ACCEPT     tcp  --  labstore1007.wikimedia.org  anywhere             tcp dpt:rsync

so the current AND new codfw servers and the dumps servers are allowed on the rsync port.
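As a rough sketch, the same check could be scripted by filtering an `iptables -L` listing for rules that accept the rsync port. The function name is hypothetical, and it assumes the default column layout shown above (source host in the fourth column).

```shell
# Sketch: read an `iptables -L` listing on stdin and print the source hosts
# of every ACCEPT rule that matches the rsync destination port.
# Assumes the default column order: target, prot, opt, source, destination.
rsync_allowed_sources() {
    awk '$1 == "ACCEPT" && index($0, "dpt:rsync") { print $4 }'
}
```

On a host this would be run as root, for example `iptables -L | rsync_allowed_sources`, and the result compared against the list of hosts that are expected to pull data.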

Then the next step is rsyncd itself and hosts allowed in it.

/etc/rsync.d/frag-srv-repos is the relevant snippet for /srv/repos and contains:

hosts allow = phab1001.eqiad.wmnet phab2001.codfw.wmnet phab2002.codfw.wmnet localhost

path            = /srv/repos
read only       = yes
write only      = no

^ all of the allowed hosts can pull from phab1001, none of them can write to it.

This means we should start with pulling _on phab2002 from phab1001_ and replace phab2001.

No puppet change needed..should just work.

regarding the UIDs and possible privilege issues after rsync:

  • the owner of directories and files under /srv/repos is phd:www-data or phd:phd on phab1001
  • the owner of directories and files under /srv/repos is apache2modsec:www-data or apache2modsec:apache2modsec on phab2001
  • the UID of user phd is uid=497(phd) gid=498(phd) groups=498(phd) on both phab1001 and phab2001, no mismatch here but see above
  • phab2002 does not have a phd user yet since that will only be created once we can apply the phabricator puppet role

Change 817811 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] phabricator: add phabricator-roots on new phabricator hardware

https://gerrit.wikimedia.org/r/817811

Dzahn changed the task status from Open to In Progress. (Wed, Jul 27, 3:59 PM)

Change 818183 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] admin: add gerrit access groups to gerrit migration role

https://gerrit.wikimedia.org/r/818183

^ sorry, I mixed up tickets for phabricator and gerrit because we are working on both at the same time. Did we also have one for rsync and gerrit?

Change 818183 merged by Dzahn:

[operations/puppet@production] admin: add gerrit access groups to gerrit migration role

https://gerrit.wikimedia.org/r/818183

Change 817811 merged by Dzahn:

[operations/puppet@production] phabricator: add phabricator-roots on new phabricator hardware

https://gerrit.wikimedia.org/r/817811

members of the shell admin group "phabricator-roots" (@20after4, @thcipriani, @brennen, @jnuche, @hashar, @demon, @dancy, @dduvall)

You just got new ssh access to new replacement hardware hosts:

  • phab1004.eqiad.wmnet
  • phab2002.codfw.wmnet

Dzahn triaged this task as High priority. (Fri, Jul 29, 6:36 PM)

With the changes above (T313360#8105754) it's now possible to rsync from old to new phab hosts.

Next I will do a "pre" rsync of /srv/repos from phab1001 to phab1004 (and from phab2001 to phab2002? or rather from phab1001 to both new hosts? also see T313360#8106017)

Question for you guys would be if I'm missing anything. So far we are talking about /srv/repos and /srv/dumps (T313360#8105645) both for what is in backups and what we are copying for the migration. That's also what was done last time since it was already puppetized. Do we need to add anything though?

Change 818513 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] phabricator: use migration role for pre-syncing data

https://gerrit.wikimedia.org/r/818513

Change 818513 merged by Dzahn:

[operations/puppet@production] phabricator: use migration role for pre-syncing data

https://gerrit.wikimedia.org/r/818513

Mentioned in SAL (#wikimedia-operations) [2022-07-29T22:01:12Z] <mutante> phab1001 - rsync -avp --bwlimit=1000 /srv/repos/ rsync://phab1004.eqiad.wmnet/phabricator-srv-repos (running slowly inside a screen session as root) (T313360, T280597)

Mentioned in SAL (#wikimedia-operations) [2022-08-01T20:57:51Z] <mutante> phab1001 - rsyncing repo data /srv/repos/ to phab2002 (in addition to phab1004 previously) T313360

Mentioned in SAL (#wikimedia-operations) [2022-08-15T22:33:51Z] <mutante> rsyncing /srv/repos and /srv/dumps from phab1001 to phab2002 before applying prod puppet role (T313360)

@thcipriani So far I am expecting to copy /srv/repos and /srv/dumps from old to new phab servers (and that has already happened from 1001 -> 1004 and from 1001 -> 2002), with /home/ still being a "maybe".

It would be nice to get a couple of other eyes on this, to see whether there is more than that we should copy.

Other things in /srv on phab1001 are:

871M	/srv/deployment
1.6G	/srv/dumps
4.0K	/srv/git.wikimedia.org
16K	/srv/lost+found
0	/srv/phab
57G	/srv/repos

/srv/repos is also 57G on phab1004 and phab2002 already

/srv/dumps is also 1.6G on phab1004 and phab2002 already
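Matching `du` totals are a good sign but do not prove the trees are identical. A minimal sketch of a stricter comparison, using hypothetical helper names: build a sorted checksum manifest of each tree and compare the two. `cksum` is a CRC, so this detects copy differences rather than tampering, and on 57G of repos it would take a while.

```shell
# Sketch: print a sorted manifest (checksum, size, relative path) of every
# regular file under directory $1.
manifest() {
    ( cd "$1" || exit 1
      find . -type f -exec cksum {} + | LC_ALL=C sort )
}

# Succeed only if both directories exist and contain the same files
# with the same contents.
same_tree() {
    [ -d "$1" ] && [ -d "$2" ] && [ "$(manifest "$1")" = "$(manifest "$2")" ]
}
```

Across two hosts, one would run `manifest /srv/repos > repos.manifest` on each side and `diff` the resulting files instead of calling `same_tree` directly.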

regarding the UIDs.. user 'phd' has a reserved UID of 498. per docs (https://wikitech.wikimedia.org/wiki/UID) it's supposed to be 498:498.

But .. it's not in existing prod. It is:

phab1001: 497:498
phab2001: 497:498
phab2002: 498:498
phab1004: n/a, not yet created

So basically it's wrong right now and correct on the new server but that means _not the same_.

Mentioned in SAL (#wikimedia-operations) [2022-08-16T23:37:44Z] <mutante> phab2002 - chown -R phd:www-data /srv/repos/ (because of UID mismatch) T313360

Mentioned in SAL (#wikimedia-operations) [2022-08-16T23:44:17Z] <mutante> phab1001 - repeated rsync of /srv/repos to phab2002, then chown -R phd /srv/repos/ (without setting the group) - this way UID is fixed and privs match exactly phab1001 - T313360
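After a chown like the ones logged above, a quick audit can confirm that nothing is left with the old numeric owner. This is only a sketch with a hypothetical function name; it relies on the `-uid` test, which GNU and BSD `find` provide but POSIX does not require.

```shell
# Sketch: list everything under directory $1 that is NOT owned by numeric UID $2.
# An empty result means every file and directory has the expected owner.
not_owned_by() {
    find "$1" ! -uid "$2" -print
}
```

On the new host this could be run as `not_owned_by /srv/repos "$(id -u phd)"`; any path it prints still needs its owner fixed.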

Change 823765 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] phabricator::migration: add phd user with sysmted::sysuser

https://gerrit.wikimedia.org/r/823765

Change 823767 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] phabricator: replace user{} with systemd::sysuser for daemon user

https://gerrit.wikimedia.org/r/823767

After review comments on https://gerrit.wikimedia.org/r/c/operations/puppet/+/823765, it turns out we should use a new UID in the range above 900.

So what we are going to do now is change it to 920 on new servers! (and not change it on old servers)

https://gerrit.wikimedia.org/r/c/operations/puppet/+/823765

edited here and in admin.yaml

https://wikitech.wikimedia.org/w/index.php?title=UID&type=revision&diff=2004642&oldid=2004195

Change 823765 merged by Dzahn:

[operations/puppet@production] phabricator::migration: add phd with systemd::sysuser, reserve UID 920

https://gerrit.wikimedia.org/r/823765

Mentioned in SAL (#wikimedia-operations) [2022-08-17T23:23:17Z] <mutante> phab2002 - chmod -R phd /srv/repos | find /srv/repos/ -gid 498 -exec chown phd:phd {} \; T313360

Change 823767 merged by Dzahn:

[operations/puppet@production] phabricator: replace user{} with systemd::sysuser, only on new hosts

https://gerrit.wikimedia.org/r/823767

Dzahn updated the task description.
Dzahn updated the task description.