rsync-published-datasets cron should not launch multiple rsync processes
Open, High, Public

Description

The rsync-published-datasets cron is scheduled to run every 15 minutes on stat1005 and stat1006. If an rsync run takes longer than 15 minutes, for example because of new large files, cron still starts a new process on schedule. We should make sure that only one rsync-published-datasets process can run at any given time.
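
One possible way to get this behaviour is an advisory lock taken before rsync starts, so that an overlapping run exits immediately instead of piling up. The following is only a minimal sketch, not the fix that was deployed: the helper names try_acquire_lock and run_exclusive are hypothetical, and the lock path and the rsync command line would have to be supplied by whoever wires this into the cron job.

```python
import fcntl
import subprocess


def try_acquire_lock(lock_path):
    """Try to take an exclusive, non-blocking lock on lock_path.

    Returns the open file object holding the lock, or None if another
    holder already has it. The lock is released when the file is closed
    or when the holding process exits.
    """
    lock_file = open(lock_path, "a")
    try:
        fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        # Another run still holds the lock: do not start a second one.
        lock_file.close()
        return None
    return lock_file


def run_exclusive(command, lock_path):
    """Run command only if no other run currently holds the lock.

    Returns the command's exit code, or None if the run was skipped.
    """
    lock_file = try_acquire_lock(lock_path)
    if lock_file is None:
        return None
    try:
        return subprocess.run(command).returncode
    finally:
        # Closing the file releases the lock for the next scheduled run.
        lock_file.close()
```

The cron entry would then call run_exclusive with the existing rsync command and a lock file path of its choosing; a skipped run returns None and the next scheduled run tries again. The flock(1) utility (flock -n) wrapped around the rsync command in the crontab would achieve much the same effect without a wrapper script.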

Ottomata created this task. Fri, Sep 1, 2:44 AM
Restricted Application added a subscriber: Aklapper. Fri, Sep 1, 2:44 AM
elukey added a subscriber: elukey. Fri, Sep 1, 10:10 AM

This happened again several times today. I killed the rsync processes and everything looks good now. It seems that when many rsyncs are using the disk cache for IO, the OOM killer prefers to target anonymous memory (i.e. processes), which damages the websites served from thorium.

elukey triaged this task as High priority. Fri, Sep 1, 10:13 AM

Mentioned in SAL (#wikimedia-operations) [2017-09-01T10:35:31Z] <elukey> stop puppet on thorium and disable root rsyncs - T174756

There is currently one huge rsync job running. I'll let it finish before restarting puppet:

elukey@thorium:~$ sudo lsof | grep 17453
[..]
rsync     17453                   root    4r      REG              253,0 25169879040   25310634 /srv/published-datasets-rsynced/stat1006/archive/public-datasets/all/bot_conflict/enwiki_20170420_reverts.json

rsync     17453                   root    7u      REG              253,0 29348593664   25304187 /srv/published-datasets-rsynced/stat1006/archive/public-datasets/all/bot_conflict/.enwiki_20170420_reverts.json.ByiCiO

elukey@thorium:~$ du -hs /srv/published-datasets-rsynced/stat1006/archive/public-datasets/all/bot_conflict/enwiki_20170420_reverts.json
24G	/srv/published-datasets-rsynced/stat1006/archive/public-datasets/all/bot_conflict/enwiki_20170420_reverts.json

elukey@thorium:~$ du -hs /srv/published-datasets-rsynced/stat1006/archive/public-datasets/all/bot_conflict/.enwiki_20170420_reverts.json.ByiCiO
29G	/srv/published-datasets-rsynced/stat1006/archive/public-datasets/all/bot_conflict/.enwiki_20170420_reverts.json.ByiCiO

My understanding was wrong: I thought that thorium was rsyncing from the stat hosts, but it is the other way around (stat1006 -> thorium). Disabling puppet on stat1006 until enwiki_20170420_reverts.json has been rsynced might help, so I am trying that.

elukey added a subscriber: Halfak. Fri, Sep 1, 1:56 PM

The file has been removed by @Halfak so everything should be back to normal.