Currently the box where dumps are generated (dumpsdata1001) rsyncs, with a tight bandwidth cap, to three other servers in serial; one of these is our public-facing webserver and one is our fallback server. There are two issues with this: the web server sometimes remains very out of date (as of today it does not show any completed stub files for the wikidata run, even though those were completed on June 2), and the fallback host may also be out of date by up to two days.
We should rsync to the fallback host with a bandwidth cap, so that dumps generation is not impacted, and from that host rsync to the labstore boxes with a much higher cap, or no cap at all if they can handle it. The fallback host should probably get a 10GbE NIC too, although then disk iops on its end will be the limiting factor.
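A minimal sketch of what that two-stage sync could look like, in Python wrapping plain rsync; the hostnames, paths and the cap value are placeholders, not our actual settings:

```python
#!/usr/bin/env python3
"""Sketch of the proposed two-stage rsync: a capped push from the dumps
primary to the fallback host, then uncapped pushes from the fallback host
to the labstore boxes. Hostnames, paths and the cap are placeholders."""
import subprocess

DUMPS_DIR = "/data/xmldatadumps/public/"     # assumed path to the generated dumps
FALLBACK = "dumpsdata1002.example.net"       # hypothetical fallback host
LABSTORES = ["labstore1.example.net", "labstore2.example.net"]  # hypothetical
CAP_KBPS = 40000                             # ~40 MB/s; tune so generation isn't hurt


def rsync(src, dest, bwlimit_kbps=None):
    """Run one rsync pass, optionally capped with --bwlimit (KB/s)."""
    cmd = ["rsync", "-a", "--delete"]
    if bwlimit_kbps:
        cmd.append("--bwlimit={}".format(bwlimit_kbps))
    cmd += [src, dest]
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Stage 1 (runs on the primary): capped copy to the fallback host.
    rsync(DUMPS_DIR, "{}:{}".format(FALLBACK, DUMPS_DIR), bwlimit_kbps=CAP_KBPS)
    # Stage 2 (runs on the fallback host): uncapped copies to the labstore
    # boxes, if their disks can keep up.
    for host in LABSTORES:
        rsync(DUMPS_DIR, "{}:{}".format(host, DUMPS_DIR))
```

Note that rsync's --bwlimit is in KB/s, so whatever cap we settle on for the primary needs converting accordingly.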
Or, we should look for some other way of moving files around that ensures more timely updates to the fallback host and the labstore boxes.
Things we can look into:
- moving to NFSv4; how would this improve performance? are any cache race issues present in the current codebase?
- designing a different mechanism that transfers index.html and status files, preserved from a 'snapshot' of the dir, only after all content files have finished (a sketch follows this list),
- rsync from the primary nfs server to a second server only, which then handles all other rsyncs, potentially with higher/no bandwidth caps to the other servers (roughly the scheme sketched above)
- deploying two primary nfs servers, each storing about half of the content, so that more clients can write to each one without throttling cpu/iops/bandwidth on the primaries when more clients are needed to complete dumps of more content in the same period of time (see the second sketch after this list)
- ??
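For the 'status files last' idea above, a rough sketch of the ordering, again just wrapping rsync; the paths, destination host and status-file patterns are assumptions:

```python
#!/usr/bin/env python3
"""Sketch of the 'status files last' idea: snapshot index.html and status
files before the content pass, rsync everything except those files, then push
the snapshotted copies so the mirror never advertises files it doesn't have
yet. Paths, hostname and file patterns are assumptions."""
import shutil
import subprocess
import tempfile
from pathlib import Path

DUMPS_DIR = Path("/data/xmldatadumps/public")             # assumed local dumps tree
DEST = "fallback.example.net:/data/xmldatadumps/public/"  # hypothetical target
STATUS_PATTERNS = ["index.html", "*status*.json", "*status*.html"]  # assumed names


def run(cmd):
    subprocess.run(cmd, check=True)


def sync_with_status_last():
    with tempfile.TemporaryDirectory() as tmp:
        snap = Path(tmp)
        # 1. Snapshot the status/index files as they are right now.
        for pattern in STATUS_PATTERNS:
            for f in DUMPS_DIR.rglob(pattern):
                target = snap / f.relative_to(DUMPS_DIR)
                target.parent.mkdir(parents=True, exist_ok=True)
                shutil.copy2(f, target)
        # 2. Push all content files, excluding the status/index files.
        excludes = ["--exclude=" + pattern for pattern in STATUS_PATTERNS]
        run(["rsync", "-a"] + excludes + [str(DUMPS_DIR) + "/", DEST])
        # 3. Push the snapshotted status/index files last, so they only ever
        #    describe content that has already arrived on the other end.
        run(["rsync", "-a", str(snap) + "/", DEST])


if __name__ == "__main__":
    sync_with_status_last()
```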
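And for the two-primary option, a toy sketch of how dump writers could deterministically pick which primary holds a given wiki; in practice the split would probably need to be size-aware (big wikis pinned by hand) rather than a plain hash, and the hostnames are placeholders:

```python
"""Toy sketch: stable assignment of each wiki to one of two primary NFS
servers by hashing the wiki name. Hostnames are placeholders, and a real
split would need to account for wiki size, not just count."""
import zlib

PRIMARIES = ["dumpsdata1001.example.net", "dumpsdata1002.example.net"]  # hypothetical


def primary_for(wiki: str) -> str:
    """Return the primary that should store (and receive writes for) this wiki."""
    return PRIMARIES[zlib.crc32(wiki.encode("utf-8")) % len(PRIMARIES)]


if __name__ == "__main__":
    for wiki in ("enwiki", "wikidatawiki", "frwiki", "commonswiki"):
        print(wiki, "->", primary_for(wiki))
```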