
Restructure rsyncs of dumps to the labstore boxes
Open, Medium, Public

Description

Currently the box where dumps are generated (dumpsdata1001) rsyncs with a tight bandwidth cap to three other servers in serial; one of these is our public-facing webserver and one is our fallback server. There are two issues with this: the web server sometimes remains very out of date, as today when it does not show any completed stub files for the wikidata run even though those were completed on June 2, and the fallback host may also be out of date by up to two days.

We should rsync to the fallback host with a bandwidth cap, so that dumps generation is not impacted, and then from that host rsync to the labstore boxes with a much higher cap, or none at all, if they can handle it. The fallback host should probably get a 10Gb NIC too, although then disk iops on its end will be the limiting factor.
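For illustration only, here is a minimal sketch of what that two-stage copy could look like, written as if run from the fallback host: a capped pull from the primary, then uncapped pushes to the labstore boxes. The host names come from this task; the paths, rsync options, and the 40000 KB/s cap are assumptions, not the production values.

```python
# Hypothetical two-stage copy, run on the fallback host (dumpsdata1003):
# pull from the primary with a bandwidth cap so dumps generation is not
# starved, then push to the labstore boxes with no cap.
# Paths and the cap value are illustrative assumptions.
import subprocess

PRIMARY = "dumpsdata1001.eqiad.wmnet"
LABSTORES = ["labstore1006.wikimedia.org", "labstore1007.wikimedia.org"]
DUMPS_DIR = "/data/xmldatadumps/public/"

def rsync(src, dest, bwlimit_kbps=None):
    """Run one rsync pass; bwlimit_kbps=None means no bandwidth cap."""
    cmd = ["rsync", "-a", "--delete"]
    if bwlimit_kbps:
        cmd.append(f"--bwlimit={bwlimit_kbps}")
    subprocess.run(cmd + [src, dest], check=True)

# Stage 1: primary -> fallback, capped.
rsync(f"{PRIMARY}:{DUMPS_DIR}", DUMPS_DIR, bwlimit_kbps=40000)

# Stage 2: fallback -> each labstore box in turn, as fast as disks and NIC allow.
for host in LABSTORES:
    rsync(DUMPS_DIR, f"{host}:{DUMPS_DIR}")
```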

Or, we should look for some other way of moving files around that ensures more timely updates to the fallback host and the labstore boxes.

Things we can look into:

  • moving to nfsv4; how would this improve performance? are any cache race issues present in the current codebase?
  • designing a different mechanism to transfer the index.html and status files, preserved from a 'snapshot' of the dir, after all content files have finished
  • rsync from primary nfs server to a second server only which handles all other rsyncs, potentially with higher/no bandwidth caps to the other servers
  • deploying two primary nfs servers which each store about half of the content, allowing more clients to write to each one without throttling cpu/iops/bandwidth on the primaries when more clients are needed to complete dumps of more content in the same period of time
  • ??

Event Timeline

ArielGlenn triaged this task as Medium priority. Jun 9 2020, 9:39 AM
ArielGlenn created this task.

Adding @Bstorm as she tuned the labstore boxes often enough and will surely have some good comments/ideas.

Στιγμιότυπο από 2020-06-09 12-40-40.png (708×880 px, 180 KB)
For the record, screenshot taken just now of the top level web page (served from labstore1006 I guess).

Assuming we had the bandwidth, could we beat the disk iops cap by running a couple of rsyncs at once, given that there are disk arrays at both ends?
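A rough sketch of that idea, purely as an assumption about how it might be tried: shard the top-level per-wiki directories across a small pool of concurrent rsync streams so the disk arrays on both ends stay busier than with a single serial stream. The destination, paths, and worker count below are made up for illustration.

```python
# Hypothetical "a couple of rsyncs at once" experiment: shard the per-wiki
# directories over a small worker pool, each worker running its own rsync.
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

SRC_ROOT = Path("/data/xmldatadumps/public")                          # assumed source path
DEST = "labstore1006.wikimedia.org:/srv/dumps/xmldatadumps/public"    # assumed destination
WORKERS = 2                                                           # "a couple" of streams

def rsync_subtree(wiki_dir: Path) -> int:
    """Copy one wiki's directory; return the rsync exit code."""
    return subprocess.run(["rsync", "-a", str(wiki_dir), DEST]).returncode

wiki_dirs = sorted(p for p in SRC_ROOT.iterdir() if p.is_dir())
with ThreadPoolExecutor(max_workers=WORKERS) as pool:
    exit_codes = list(pool.map(rsync_subtree, wiki_dirs))
```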

Also for the record, I attach the start and end dates and times of rsyncs from dumpsdata1001 to the three servers for the last few days. Note that the rsyncs occur in two passes: first all of the dump sql/xml output files are copied over, which generally takes a while if any large wikis have output, and then a small tarball of status and html files is copied over, to be unpacked later; this is a very fast copy. These are all marked in the attached file of logs.
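To make the two-pass structure concrete, here is a minimal sketch of the ordering described above. The paths, exclude patterns, and tarball name are assumptions; the real scripts differ.

```python
# Pass 1: all sql/xml output files (slow when large wikis have new output).
# Pass 2: the small tarball of status/html files, snapshotted beforehand and
# unpacked on the remote end afterwards. Paths and patterns are assumptions.
import subprocess

SRC = "/data/xmldatadumps/public/"
DEST = "dumpsdata1003.eqiad.wmnet:/data/xmldatadumps/public/"
STATUS_TARBALL = "/data/xmldatadumps/status_snapshot.tar.gz"

subprocess.run(
    ["rsync", "-a", "--exclude=*.html", "--exclude=*.json", SRC, DEST],
    check=True,
)
subprocess.run(["rsync", "-a", STATUS_TARBALL, DEST], check=True)
```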


You can see that we rsync to each host in order, dumpsdata1003 (the fallback), labstore1006, then labstore1007 and then we loop around again for the next run.
The June 6 rsync to dumpsdata1003 took over a day and a half, and the subsequent rsync to labstore1006 took over 4 days.

I have two parts to my plan. The first is to use the fallback host to pull from the primary with the usual bandwidth cap, and then turn around and rsync to the labstore boxes one at a time as fast as its disks and 1G NIC will permit. This is subject to tweaking if the labstore boxes seem unhappy.
The idea behind this is to reduce the wait time between rsyncs to the fallback host as much as possible, as well as to reduce the amount of data that can accumulate in the "backlog" for the next host during the rsync of the current one. This would mean we lose less data and time redoing dumps if we do have to rely on the fallback host at some point, and data gets out to the public sooner.

The second part is to rsync one wiki at a time for the bigger wikis, sending first the wiki's content and then the status tarball for that wiki; the rest could be done in large groups (all the a's, all the b's, and so on). This isn't really feasible with the current setup because an rsync of an average wiki, say elwiki, to labstore1006 or 1007 takes over 2 minutes just to generate the file list; with 26 groups plus each big wiki done separately, you're already adding about 2 hours to each rsync pass. But rsyncs to dumpsdata1003 get through the file-list generation step in about 15 seconds, which adds up to around 15 minutes of delay total, very doable.
The idea behind this is to keep the status files tarball of each wiki from being too outdated. We necessarily bundle it up before starting the rsync of the files referenced in the various html and json status files, so that when the status files are unpacked remotely, they don't refer or link to files that aren't yet available to the user. But if we bundle it up once at the start of the rsync, it could be days old. This way we can limit the age to a few hours, with any luck. A rough sketch of the ordering follows.
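This is only a sketch of that per-wiki / per-letter-group ordering, not the actual scripts: the group definitions, paths, and the tarball helper are assumptions, and it ignores details like excluding the big wikis from their letter groups.

```python
# Hypothetical per-group rsync: snapshot the status/html files for a group,
# copy that group's content files, then copy the (still fresh) status tarball.
import subprocess
import string

SRC_ROOT = "/data/xmldatadumps/public"
DEST = "dumpsdata1003.eqiad.wmnet:/data/xmldatadumps/public"
BIG_WIKIS = ["enwiki", "wikidatawiki", "commonswiki"]  # illustrative list

def make_status_tarball(pattern: str) -> str:
    """Snapshot status/html/json files for wikis matching `pattern` (hypothetical helper)."""
    tarball = f"/tmp/status-{pattern.strip('*')}.tar.gz"
    subprocess.run(
        ["bash", "-c",
         f"cd {SRC_ROOT} && tar czf {tarball} {pattern}/*.html {pattern}/*.json"],
        check=False,  # some groups may have no matching files
    )
    return tarball

def rsync_group(pattern: str) -> None:
    tarball = make_status_tarball(pattern)  # snapshot status files first
    subprocess.run(                          # then the content files
        ["bash", "-c", f"rsync -a {SRC_ROOT}/{pattern} {DEST}/"], check=True)
    subprocess.run(["rsync", "-a", tarball, f"{DEST}/"], check=True)

# Big wikis one at a time, the rest in per-letter groups (all the a's, b's, ...).
for wiki in BIG_WIKIS:
    rsync_group(wiki)
for letter in string.ascii_lowercase:
    rsync_group(f"{letter}*")
```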

Here's a file showing rsync times to generate file lists from dumpsdata1001 to the various servers, for a group of wikis, for one large wiki, and for some smallish wiki, just so we have it. The real rsync commands have a bunch of excludes but those don't affect the run times.

Change 605990 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/puppet@production] restructure rsync of xml/sql dumps from primary source to other servers

https://gerrit.wikimedia.org/r/605990

Welp. While this restructure needs to happen, the real cause of the slow rsyncs from the dumpsdata1001 server was a bunch of dumps_check_exception scripts that had piled up, apparently each failing to finish before the next one started. I'll be adding a check for that now. Load was a ridiculous 200 over there, and is now back down to less than 2.
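For reference, a guard like the one the follow-up patch describes could be as simple as a non-blocking flock; this is a generic sketch under assumed names, not the actual puppet change, and the lock file path is made up.

```python
# Minimal "don't start if already running" guard using a non-blocking
# exclusive lock on a lock file. Lock path and script structure are assumptions.
import fcntl
import sys

LOCKFILE = "/var/lock/dumps_exception_checker.lock"

def run_checks():
    pass  # the real exception-checking work would go here

def main():
    with open(LOCKFILE, "w") as lock:
        try:
            # Fails immediately if another run already holds the lock.
            fcntl.flock(lock, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError:
            print("dumps exception checker already running, exiting", file=sys.stderr)
            sys.exit(0)
        run_checks()

if __name__ == "__main__":
    main()
```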

Change 606955 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/puppet@production] the dumps exception checker should not start if it's already running

https://gerrit.wikimedia.org/r/606955

Change 606955 merged by ArielGlenn:
[operations/puppet@production] the dumps exception checker should not start if it's already running

https://gerrit.wikimedia.org/r/606955

Change 606994 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/puppet@production] alert on high load on xml dumps nfs primary

https://gerrit.wikimedia.org/r/606994

Change 606994 merged by ArielGlenn:
[operations/puppet@production] alert on high load on xml dumps nfs primary

https://gerrit.wikimedia.org/r/606994

Change 608885 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/puppet@production] fix the long-running dumps exception checker issue

https://gerrit.wikimedia.org/r/c/operations/puppet/+/608885

Change 608885 merged by ArielGlenn:
[operations/puppet@production] fix the long-running dumps exception checker issue

https://gerrit.wikimedia.org/r/c/operations/puppet/+/608885

Change 613639 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/puppet@production] start restructure of dumps rsync

https://gerrit.wikimedia.org/r/613639

Change 613639 merged by ArielGlenn:
[operations/puppet@production] start restructure of dumps rsync

https://gerrit.wikimedia.org/r/613639

Change 614755 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/puppet@production] dumps rsync refactor, better opts and flags handling

https://gerrit.wikimedia.org/r/614755

Change 614826 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/puppet@production] rename the main rsyncer script in prep for script that rsyncs via secondary

https://gerrit.wikimedia.org/r/614826

Change 614839 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/puppet@production] script for rsyncing dumps via secondary storage server

https://gerrit.wikimedia.org/r/614839

Change 614755 merged by ArielGlenn:
[operations/puppet@production] dumps rsync refactor, better opts and flags handling

https://gerrit.wikimedia.org/r/614755

Change 614826 merged by ArielGlenn:
[operations/puppet@production] rename the dump rsyncer script preparing for new one that rsyncs via secondary

https://gerrit.wikimedia.org/r/614826