rsync-published-datasets cron should not launch multiple rsync processes
Closed, Resolved, Public


The rsync-published-datasets cron is scheduled to run every 15 minutes on stat1005 and stat1006. If an rsync takes longer than 15 minutes, for example because of new large files, cron still starts another process on top of it. We should make sure that only one rsync-published-datasets process can run at any given time.

Event Timeline

This happened again several times today. I killed the rsync processes and everything looks good now. It seems that when many rsyncs fill the disk cache with their IO, the OOM killer prefers to target anonymous memory (i.e. processes), which damages the websites hosted on thorium.

elukey triaged this task as High priority. Sep 1 2017, 10:13 AM

Mentioned in SAL (#wikimedia-operations) [2017-09-01T10:35:31Z] <elukey> stop puppet on thorium and disable root rsyncs - T174756

There is currently one huge rsync job running. I'll let it finish before restarting puppet:

elukey@thorium:~$ sudo lsof | grep 17453
rsync     17453                   root    4r      REG              253,0 25169879040   25310634 /srv/published-datasets-rsynced/stat1006/archive/public-datasets/all/bot_conflict/enwiki_20170420_reverts.json
rsync     17453                   root    7u      REG              253,0 29348593664   25304187 /srv/published-datasets-rsynced/stat1006/archive/public-datasets/all/bot_conflict/.enwiki_20170420_reverts.json.ByiCiO

elukey@thorium:~$ du -hs /srv/published-datasets-rsynced/stat1006/archive/public-datasets/all/bot_conflict/enwiki_20170420_reverts.json
24G	/srv/published-datasets-rsynced/stat1006/archive/public-datasets/all/bot_conflict/enwiki_20170420_reverts.json

elukey@thorium:~$ du -hs /srv/published-datasets-rsynced/stat1006/archive/public-datasets/all/bot_conflict/.enwiki_20170420_reverts.json.ByiCiO
29G	/srv/published-datasets-rsynced/stat1006/archive/public-datasets/all/bot_conflict/.enwiki_20170420_reverts.json.ByiCiO

My understanding was wrong: I thought that thorium was rsyncing from the stat hosts, but it is the other way around (stat1006 -> thorium). Disabling puppet on stat1006 until enwiki_20170420_reverts.json has been rsynced might help, so I am trying that.

The file has been removed by @Halfak so everything should be back to normal.

Change 379234 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] flock before attempting to run rsync of published-datasets

Ottomata edited projects, added Analytics-Kanban; removed Analytics.
Ottomata moved this task from Next Up to In Code Review on the Analytics-Kanban board.

Change 379234 merged by Ottomata:
[operations/puppet@production] flock before attempting to run rsync of published-datasets
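
The patch itself is not reproduced here. As a minimal sketch of the approach its title names, assuming `flock` from util-linux is available: `flock -n` tries to take an exclusive lock on a file and exits immediately with a non-zero status if another process already holds it, so a cron run that overlaps a still-running rsync simply does nothing. The lock file path, the `run_exclusive` helper name, and the `SRC`/`DEST` placeholders below are hypothetical and are not taken from the puppet change.

```shell
#!/bin/bash
# Hypothetical lock file location (an assumption, not the path used in puppet).
LOCKFILE="${LOCKFILE:-/tmp/rsync-published-datasets.lock}"

# Run the given command only if nobody else holds the lock.
# -n makes flock fail right away (exit status 1 by default) instead of
# waiting for the current holder to finish.
run_exclusive() {
    flock -n "$LOCKFILE" "$@"
}

# Illustrative use from a script; SRC and DEST are placeholders:
#   run_exclusive rsync -a "$SRC" "$DEST"
#
# Or directly as a cron entry, without the helper:
#   */15 * * * * flock -n /tmp/rsync-published-datasets.lock rsync -a SRC DEST
```

With this in place, a second invocation that starts while the first is still copying a large file exits without launching another rsync, and the next scheduled run proceeds normally once the lock is released.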