The tools backup failed again last night. I'm not sure why, possibly related to high overnight load. When I tried to restart it manually, labstore1001 surged in load and became barely responsive. We hover around a load of 10 or greater on an 8-core box at the best of times. But whether we can handle the backup jobs under our normal overload may be secondary; there are some pieces of the replication setup that need attention:
- Some rsync options appear nonfunctional or errant, like "--filter=._/etc/replication-rsync.conf"
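For reference, if I'm reading rsync's filter syntax right, the `_` there is shorthand for the space that separates the merge-rule prefix `.` from the filename, so the option should be equivalent to the spelled-out form below. The sample rules are purely illustrative, not our actual config:

```
# These two spellings should be equivalent; both tell rsync to read
# filter rules from the named file ('.' is the merge-file prefix):
#   --filter=._/etc/replication-rsync.conf
#   --filter='. /etc/replication-rsync.conf'
#
# /etc/replication-rsync.conf would then contain rules such as:
- /tmp/
- *.swp
+ /project/
```

If the option is silently doing nothing, the first thing to check is whether the rules inside the merge file are valid, since a syntax problem there is easy to miss.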
- The tools data at rest on the remote end is consistently larger than the source (why?)
- The job is ionice'd, and that now works since I changed the disk I/O scheduler to CFQ, but it is not nice'd otherwise and causes load issues of its own
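Since the job is only ionice'd, one option is to wrap it in both ionice and nice. A minimal sketch; the wrapper name and the rsync arguments are illustrative, not our actual invocation:

```shell
#!/bin/sh
# Run a command at lowest CPU and I/O priority.
# ionice -c 2 -n 7: best-effort class, lowest priority (honored by CFQ);
# nice -n 19:       lowest CPU scheduling priority.
throttle() {
    nice -n 19 ionice -c 2 -n 7 "$@"
}

# Usage (illustrative paths/host):
#   throttle rsync -a /srv/tools/ labstore2001:/srv/backup/tools/
```

This only makes the backup yield to other work; it does not bound its total I/O, so it helps with responsiveness rather than runtime.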
- We keep snapshots locally with loose cleanup logic
- How much history we keep on labstore1002 seems not well understood. We keep some snapshots from tools, though at this moment the selection seems random:
  tools20160209020010 backup swi-a-s--- 1.00t tools 65.86
  tools20160219020015 backup swi-a-s--- 1.00t tools  7.03
  tools20160219211007 backup swi-a-s--- 1.00t tools  0.00
  We seem to keep only one day's history for non-tools backups.
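If we want retention to be deliberate rather than random-looking, a cleanup pass could key off the timestamp embedded in the snapshot name. A sketch assuming names like tools20160209020010 and a 7-day window (the window, volume group, and lvremove usage are assumptions, so the removal itself is left commented out):

```shell
#!/bin/sh
# Print tools snapshots whose embedded YYYYMMDD is older than KEEP_DAYS.
# Snapshot names are assumed to look like tools20160209020010.
KEEP_DAYS=7
cutoff=$(date -d "$KEEP_DAYS days ago" +%Y%m%d)

prune_list() {
    # Read snapshot names on stdin, print the ones older than $cutoff.
    while read -r snap; do
        day=$(printf '%s\n' "$snap" | sed 's/^tools\([0-9]\{8\}\).*/\1/')
        [ "$day" -lt "$cutoff" ] && printf '%s\n' "$snap"
    done
}

# Usage against the real volume group (commented out for safety):
# lvs --noheadings -o lv_name tools | prune_list | while read -r s; do
#     lvremove -f "tools/$s"
# done
```

The point is the selection logic: retention becomes an explicit number of days instead of whatever happens to survive.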
- We continually see issues during our backup process where the snapshotting/backup processes cause high load (sometimes very high load) and affect NFS operations
- Monitoring captures the action of the backup job but not its result
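One way to monitor the result rather than the action: have the backup job write a success stamp only on clean exit, and alert on the stamp's age. A minimal sketch; the stamp path and the 26-hour threshold are assumptions:

```shell
#!/bin/sh
# On a clean backup exit, record a success timestamp; a separate check
# then fails when the stamp is older than the allowed window.
STAMP=/var/run/tools-backup.success   # hypothetical path
MAX_AGE_SECS=93600                    # 26h: one daily run plus slack

record_success() {
    date +%s > "$STAMP"
}

check_backup_fresh() {
    [ -f "$STAMP" ] || return 1
    last=$(cat "$STAMP")
    now=$(date +%s)
    [ $((now - last)) -le "$MAX_AGE_SECS" ]
}
```

The backup wrapper would call record_success only when rsync exits 0; the monitoring side calls check_backup_fresh, so a job that runs but fails (or never runs) eventually alerts.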
- Backups are staggered by only an hour and often end up running concurrently
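Rather than relying on the one-hour stagger, each backup job could take a shared lock so they serialize regardless of start time. A sketch using flock; the lock path and timeout are assumptions:

```shell
#!/bin/sh
# Serialize backup jobs: a job that starts while another holds the lock
# waits for it to finish instead of running concurrently.
LOCK=/var/lock/labstore-backup.lock   # hypothetical lock file

run_exclusive() {
    # -w 3600: wait up to an hour for the lock, then give up and fail.
    flock -w 3600 "$LOCK" "$@"
}

# Usage (illustrative):
#   run_exclusive rsync -a /srv/tools/ labstore2001:/srv/backup/tools/
```

With this in place the stagger becomes a scheduling nicety rather than the only thing preventing overlap.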
- Snapshots filling up on the remote labstore2001 or the local labstore1001 can kill the backup jobs
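A pre-flight check could refuse to start (or page us) when snapshot usage is near full on either side. A sketch that parses lvs-style name/percent output; the 90% threshold and the exact lvs columns are assumptions:

```shell
#!/bin/sh
# Fail if any snapshot's data percentage is at or above THRESHOLD.
# Expects input shaped like the output of:
#   lvs --noheadings -o lv_name,data_percent tools
THRESHOLD=90

snapshots_ok() {
    awk -v max="$THRESHOLD" '
        NF >= 2 && $2 + 0 >= max { bad = 1; print "snapshot near full:", $1 }
        END { exit bad }
    '
}

# Usage (illustrative):
#   lvs --noheadings -o lv_name,data_percent tools | snapshots_ok || exit 1
```

Running this on both ends before kicking off rsync would turn "snapshot filled up mid-backup" into a loud failure before any data moves.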