The rsync-published-datasets cron is scheduled to run every 15 minutes on stat1005 and stat1006. If an rsync takes more than 15 minutes to run, for example because of new large files, a new process is still started on the next tick. We should make sure that only one rsync-published-datasets process can run at any given time.
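One possible way to enforce this, as a minimal sketch rather than the actual puppet-managed setup, is to wrap the cron command in flock(1) from util-linux. The lock file path and the `run_exclusive` helper name below are hypothetical, and the rsync arguments are left as placeholders:

```shell
#!/bin/bash
# Hypothetical lock path; not taken from the real cron definition.
LOCKFILE="${LOCKFILE:-/tmp/rsync-published-datasets.lock}"

# Run the given command only if no other holder has the lock.
run_exclusive() {
    # -n: do not wait for the lock; flock exits with status 1 if it is
    # already held, so an overlapping cron run simply gives up.
    flock -n "$LOCKFILE" "$@"
}

# Example use from the cron script (arguments are placeholders):
# run_exclusive rsync <options> <source> <destination>
```

With this in place, a run that starts while the previous rsync is still copying a large file exits immediately instead of piling up a second process. The lock is released automatically when the wrapped command exits, even if it is killed.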
This happened again several times today. I killed the rsync processes and now everything looks good. It seems that when a lot of rsyncs are using the disk cache for IO, the OOM killer prefers to target anonymous memory (i.e. processes), which damages the websites hosted on thorium.
There is currently one huge rsync job running. I'll let it finish before re-enabling puppet:
elukey@thorium:~$ sudo lsof | grep 17453
[..]
rsync 17453 root 4r REG 253,0 25169879040 25310634 /srv/published-datasets-rsynced/stat1006/archive/public-datasets/all/bot_conflict/enwiki_20170420_reverts.json
rsync 17453 root 7u REG 253,0 29348593664 25304187 /srv/published-datasets-rsynced/stat1006/archive/public-datasets/all/bot_conflict/.enwiki_20170420_reverts.json.ByiCiO

elukey@thorium:~$ du -hs /srv/published-datasets-rsynced/stat1006/archive/public-datasets/all/bot_conflict/enwiki_20170420_reverts.json
24G /srv/published-datasets-rsynced/stat1006/archive/public-datasets/all/bot_conflict/enwiki_20170420_reverts.json

elukey@thorium:~$ du -hs /srv/published-datasets-rsynced/stat1006/archive/public-datasets/all/bot_conflict/.enwiki_20170420_reverts.json.ByiCiO
29G /srv/published-datasets-rsynced/stat1006/archive/public-datasets/all/bot_conflict/.enwiki_20170420_reverts.json.ByiCiO
My understanding was wrong: I thought that thorium was rsyncing from the stat hosts, but it is the other way around (stat1006 -> thorium). Disabling puppet on stat1006 until enwiki_20170420_reverts.json has been rsynced might help, so I am trying that.