Investigate and reduce resource use by rsync of dumps between peers, labs, mirrors
Open, Medium, Public

Assigned To

None

Authored By

	ArielGlenn
	Oct 12 2017, 9:10 AM

Description

I recall folks being concerned that rsyncs might contribute to memory pressure on dataset1001, increasing the possibility of NFS lockup discussed in T169680.
In any case, if there are easy things to do that can reduce resource use, we should do them.

Related Objects

Mentioned Here: T228575: Decrease number of open tickets with assignee field set for more than two years (aka cookie licking) (March-June 2020 edition)
T169680: NFS on dataset1001 overloaded, high load on the hosts that mount it

Event Timeline

ArielGlenn created this task. Oct 12 2017, 9:10 AM

Restricted Application added a subscriber: Aklapper. Oct 12 2017, 9:10 AM

ArielGlenn triaged this task as Medium priority. Oct 12 2017, 9:10 AM

One obvious fix is to avoid copying over files that are still being written. We can now easily tell which ones those are, at least for the regular xml/sql dump runs. This patch implements it for that case: https://gerrit.wikimedia.org/r/#/c/385203/
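As a rough illustration of the idea (not the contents of the linked patch), skipping in-progress files can be expressed as an rsync exclude pattern. The sketch below assumes that files still being written carry a recognizable suffix; the suffix value `.inprog`, the function name, and the example paths are placeholders, since this task does not say how such files are identified.

```python
from typing import List

# Assumption: in-progress files end with this suffix. The task does not state
# how they are marked, so treat the value as a placeholder.
IN_PROGRESS_SUFFIX = ".inprog"


def build_rsync_command(source: str, dest: str,
                        suffix: str = IN_PROGRESS_SUFFIX) -> List[str]:
    """Return an rsync argv that skips files still being written."""
    return [
        "rsync",
        "-a",                    # archive mode: recurse, keep perms and times
        f"--exclude=*{suffix}",  # skip anything matching the in-progress marker
        source,
        dest,
    ]


if __name__ == "__main__":
    # Hypothetical source path and rsync daemon module, for illustration only.
    print(" ".join(build_rsync_command("/data/dumps/", "peer::dumps/")))
```

The function only builds the argument list, so it can be inspected or logged before anything is actually run.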

I did some research earlier and looked at the rsync code; versions 3.0.0 and later build the file list incrementally, which uses much less memory than older versions did. Anything running Ubuntu precise or later will have at least 3.0.0, so we can practically rule out use of older versions by our mirrors. I tried checking the file count lookahead and some other details, but the tl;dr is that I still wonder whether rsyncing smaller subdirs one at a time would be less resource-intensive. This needs some testing.
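One way to set up such a test is to generate one rsync invocation per immediate subdirectory instead of a single invocation for the whole tree. This is a minimal sketch under stated assumptions: the function name, the example paths, and the choice of `-a` are illustrative and not taken from the actual dumps configuration.

```python
import os
from typing import List


def per_subdir_commands(root: str, dest: str) -> List[List[str]]:
    """Build one rsync argv per immediate subdirectory of root."""
    with os.scandir(root) as entries:
        # Only real directories; symlinks and plain files are ignored here.
        subdirs = sorted(
            e.name for e in entries if e.is_dir(follow_symlinks=False)
        )
    commands = []
    for name in subdirs:
        # Trailing slashes make rsync copy the directory contents into the
        # matching destination directory.
        src = os.path.join(root, name) + "/"
        target = dest.rstrip("/") + "/" + name + "/"
        commands.append(["rsync", "-a", src, target])
    return commands


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    for cmd in per_subdir_commands(".", "/tmp/mirror"):
        print(" ".join(cmd))
```

Timing each generated command against a single top-level run would show whether the split reduces setup time or memory use on a given host.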

ArielGlenn moved this task from Short-term backlog to This week on the User-ArielGlenn board. Nov 6 2017, 10:29 PM

Setup time before file list transmission seems nearly the same for top-level directories and subdirs; this was tested on dataset1001, which has a very large filesystem. Things yet to be tested: making the include/exclude list shorter or less complex, and using separate rsync stanzas for subdirectories to see if that's faster.

ArielGlenn moved this task from This week to Short-term backlog on the User-ArielGlenn board. Mar 29 2018, 2:40 PM

We're in pretty good shape now: we rsync only complete files, as soon as they are produced, and the filesystems on the dumpsdata server side are less populated. Subdirectory rsyncing didn't help at all. Next steps should now be coordinated with @Bstorm to see what might be done on the labstore side of these rsyncs.

This task has been assigned to the same task owner for more than two years. Resetting task assignee due to inactivity, to decrease task cookie-licking and to get a slightly more realistic overview of plans. Please feel free to assign this task to yourself again if you still realistically work or plan to work on this task - it would be welcome!

For tips on how to manage individual work in Phabricator (noisy notifications, lists of tasks, etc.), see https://phabricator.wikimedia.org/T228575#6237124 for available options.
(For the record, two emails were sent to assignee addresses before resetting assignees. See T228575 for more info and for potential feedback. Thanks!)
