The rsync-published-datasets cron is scheduled to run every 15 minutes on stat1005 and stat1006. If an rsync run takes longer than 15 minutes (for example because of new large files), a new process is started anyway while the previous one is still running. We should make sure that only one rsync-published-datasets process can run at any given time.
Description
Details
Project | Branch | Lines +/- | Subject
---|---|---|---
operations/puppet | production | +4 -2 | flock before attempting to run rsync of published-datasets
Event Timeline
This happened again several times today. I killed the rsync processes and now everything looks good. It seems that when a lot of rsyncs fill the disk cache with IO, the OOM killer prefers to target anonymous memory (i.e. process memory), which damages the websites served from thorium.
Mentioned in SAL (#wikimedia-operations) [2017-09-01T10:35:31Z] <elukey> stop puppet on thorium and disable root rsyncs - T174756
There is currently one huge rsync job running. I'll let it finish before restarting puppet:
elukey@thorium:~$ sudo lsof | grep 17453
[..]
rsync 17453 root 4r REG 253,0 25169879040 25310634 /srv/published-datasets-rsynced/stat1006/archive/public-datasets/all/bot_conflict/enwiki_20170420_reverts.json
rsync 17453 root 7u REG 253,0 29348593664 25304187 /srv/published-datasets-rsynced/stat1006/archive/public-datasets/all/bot_conflict/.enwiki_20170420_reverts.json.ByiCiO
elukey@thorium:~$ du -hs /srv/published-datasets-rsynced/stat1006/archive/public-datasets/all/bot_conflict/enwiki_20170420_reverts.json
24G /srv/published-datasets-rsynced/stat1006/archive/public-datasets/all/bot_conflict/enwiki_20170420_reverts.json
elukey@thorium:~$ du -hs /srv/published-datasets-rsynced/stat1006/archive/public-datasets/all/bot_conflict/.enwiki_20170420_reverts.json.ByiCiO
29G /srv/published-datasets-rsynced/stat1006/archive/public-datasets/all/bot_conflict/.enwiki_20170420_reverts.json.ByiCiO
My understanding was wrong: I thought that thorium was rsyncing from stat, but it is the other way around (stat1006 -> thorium). Disabling puppet on stat1006 until the file enwiki_20170420_reverts.json has been rsynced might help; trying it.
Change 379234 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] flock before attempting to run rsync of published-datasets
Change 379234 merged by Ottomata:
[operations/puppet@production] flock before attempting to run rsync of published-datasets
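The patch subject only says that a flock is taken before rsync runs; the actual puppet change is not reproduced in this task. As an illustration of the general idea (not the merged patch), here is a minimal Python sketch of a non-blocking exclusive lock around a command. The function name `run_exclusive`, the lock path, and the command are hypothetical and chosen only for this example.

```python
import fcntl
import os
import subprocess


def run_exclusive(lock_path, command):
    """Run `command` only if nobody else currently holds `lock_path`.

    Returns the command's exit code, or None if the lock was busy and the
    command was skipped. Illustrative sketch, not the actual puppet change.
    """
    # Open (creating if needed) the lock file; the lock is tied to this
    # open file descriptor.
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        try:
            # LOCK_NB makes the call fail immediately instead of waiting
            # behind a run that is still in progress.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return None
        return subprocess.call(command)
    finally:
        # Closing the descriptor releases the lock.
        os.close(fd)
```

With a wrapper like this, a cron entry that fires every 15 minutes would skip its run (return value `None`) while an earlier, slower run still holds the lock, rather than starting a second rsync alongside it. Skipping instead of waiting is a design choice for this sketch; a blocking lock would let runs queue up behind each other.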