Rotate Grid Engine accounting file
Closed, Resolved, Public

Description

Grid Engine writes a record of every job that is run to an accounting log that is placed in /data/project/.system/accounting on our deployment. This file gets quite large; it's currently 1.5G. We should figure out how to rotate/truncate it.

It's nice to have 30+ days of history for things like https://tools.wmflabs.org/grid-jobs/, but we really don't need more history than that. The rotation script documented at http://gridscheduler.sourceforge.net/howto/rotatelogs.html looks like it could be most of what we need.

Event Timeline

Restricted Application added a subscriber: Aklapper.

Could we simply use logrotate to rotate the file every month? Current month would be accounting and previous month accounting.1.

$SGE_ROOT/$SGE_CELL/common/accounting {
    compress
    missingok
    notifempty
    maxage 400
    monthly
    rotate 14
}

https://arc.liv.ac.uk/repos/darcs/sge/source/dist/util/resources/configs/logrotate.sge

Having just manually saved a ~1 month portion of >1 year of data that had built up and then truncated the file... yes, logrotate should work fine. Because this file lives in the tools project NFS share I think the log rotation should be done on the NFS master (currently labstore1004) rather than from a tools project instance. Honestly I think we can ignore rotating the file for the Trusty grid since we will be killing it off by April 1st, but we should add this rotation to the new grid from the start.

I think this rotation setup would work out fine:

/exp/project/tools/project/.system_sge/gridengine/default/common/accounting {
    compress
    delaycompress
    missingok
    notifempty
    rotate 8
    weekly
}
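As a rough check of how much history each proposed configuration keeps, the arithmetic can be sketched in Python. The helper name `retained_days` and the 30-day approximation for a month are illustrative assumptions, not part of logrotate, and the estimate ignores the live file, which holds up to one more period of records.

```python
def retained_days(period_days, rotate, maxage=None):
    """Approximate days of history held in rotated files.

    period_days: assumed length of one rotation period
                 (7 for weekly, roughly 30 for monthly)
    rotate:      number of rotated files logrotate keeps
    maxage:      optional cap in days; logrotate removes rotated
                 logs older than this when it rotates
    """
    days = period_days * rotate
    return days if maxage is None else min(days, maxage)


# "weekly" with "rotate 8"
weekly = retained_days(7, 8)
# "monthly" with "rotate 14" and "maxage 400"
monthly = retained_days(30, 14, maxage=400)
print(weekly, monthly)
```

Under these assumptions both configurations cover the 30+ days wanted for grid-jobs; the weekly one keeps about eight weeks of rotated history rather than more than a year.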

Change 485697 had a related patch set uploaded (by BryanDavis; owner: Bryan Davis):
[operations/puppet@production] toolforge: Rotate SGE accounting file from NFS master

https://gerrit.wikimedia.org/r/485697

Discussion of https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/485697/2 copied from gerrit for better visibility and historic tracking:

@Bstorm wrote:

Patch Set 2:

Since I once manually truncated the log and found that it basically broke the grid for a bit, my first thought is to use the system script (which we have installed and is documented here http://gridscheduler.sourceforge.net/howto/rotatelogs.html) and a systemd timer -- but reading that script suggests it works by using 'mv' and 'rm'! Not exactly "safe" tools. I could check that more if we want, but that didn't inspire confidence beyond basic logrotate.
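The concern about 'mv' can be illustrated with a small, generic Python sketch of POSIX file semantics. It does not model Grid Engine itself: whether the qmaster holds the accounting file open or reopens it for each record is not established in this discussion, and the function name and temporary paths are made up for the demonstration.

```python
import os
import tempfile


def demo_rename_with_open_writer():
    """Show what happens to a writer when its file is renamed (POSIX)."""
    workdir = tempfile.mkdtemp()
    live = os.path.join(workdir, "accounting")
    rotated = live + ".1"

    # A process that keeps the log open for appending.
    writer = open(live, "a")
    writer.write("before\n")
    writer.flush()

    # Rotation by rename, as 'mv' would do on the same filesystem.
    os.rename(live, rotated)

    # The writer's handle still points at the renamed file.
    writer.write("after\n")
    writer.flush()
    writer.close()

    live_exists = os.path.exists(live)
    with open(rotated) as f:
        rotated_lines = f.read().splitlines()
    return live_exists, rotated_lines


print(demo_rename_with_open_writer())
```

On a POSIX system the live path no longer exists after the rename and both lines end up in the rotated file, which is why rename-based rotation normally relies on the writer reopening its log or on the rotation tool recreating the file.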

We should probably rotate the messages file as well (which gets even longer than accounting at times, and did recently)...but I'd rather do it on the master? The shadow being active would be a fairly broken condition (and would only happen after an outage that happened for at least 2 minutes). If we find out we need to cycle the service or some such nonsense, we'd have the option if we ran this on the master itself.

@bd808 wrote:

Patch Set 2:

> Since I once manually truncated the log and found that it basically broke the grid for a bit, my first thought is to use the system script (which we have installed and is documented here http://gridscheduler.sourceforge.net/howto/rotatelogs.html) and a systemd timer -- but reading that script suggests it works by using 'mv' and 'rm'! Not exactly "safe" tools. I could check that more if we want, but that didn't inspire confidence beyond basic logrotate.

I have manually rotated the accounting file at least once in the past by copying a portion of the old file (using a fork of valhallasw's vanaf.py script) and the truncate shell utility. We could easily change the behavior of the logrotate script to copy + truncate if that seems safer somehow. Giovanni found https://arc.liv.ac.uk/repos/darcs/sge/source/dist/util/resources/configs/logrotate.sge which seems to align with a normal rename rotation, but maybe we have weirdness that needs other care.
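A copy + truncate pass of the kind described above could look roughly like the following Python sketch. It is not the vanaf.py script and not what was deployed. It assumes the colon-separated record layout from accounting(5), with end_time as the eleventh field in seconds since the epoch, and assumes no earlier field contains a colon; `trim_accounting`, `split_recent`, and the archive path are hypothetical names. It also does nothing about records appended while it runs, so any that arrive between the read and the truncate would be lost.

```python
import time

# Assumed 0-based position of end_time in an accounting record.
END_TIME_FIELD = 10


def split_recent(lines, cutoff_epoch):
    """Split records into (kept, dropped) by end_time."""
    kept, dropped = [], []
    for line in lines:
        if line.startswith("#"):
            kept.append(line)  # keep header comments in the live file
            continue
        fields = line.rstrip("\n").split(":")
        try:
            end_time = int(fields[END_TIME_FIELD])
        except (IndexError, ValueError):
            kept.append(line)  # keep anything we cannot parse
            continue
        (kept if end_time >= cutoff_epoch else dropped).append(line)
    return kept, dropped


def trim_accounting(path, archive_path, keep_days=30, now=None):
    """Append old records to archive_path and rewrite path in place."""
    now = time.time() if now is None else now
    cutoff = now - keep_days * 86400
    # surrogateescape round-trips any bytes that are not valid UTF-8.
    opts = {"encoding": "utf-8", "errors": "surrogateescape"}
    with open(path, "r+", **opts) as live:
        kept, dropped = split_recent(live.readlines(), cutoff)
        with open(archive_path, "a", **opts) as archive:
            archive.writelines(dropped)
        # Rewrite in place so the file keeps its inode, then cut the tail.
        live.seek(0)
        live.writelines(kept)
        live.truncate()
    return len(kept), len(dropped)
```

Rewriting in place, rather than renaming, keeps the same inode, which is the property that makes copy + truncate attractive if the writer holds the file open.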

> We should probably rotate the messages file as well (which gets even longer than accounting at times, and did recently)...but I'd rather do it on the master? The shadow being active would be a fairly broken condition (and would only happen after an outage that happened for at least 2 minutes). If we find out we need to cycle the service or some such nonsense, we'd have the option if we ran this on the master itself.

I would be happy to add more files and/or move this from the NFS master to the grid master. Running the rotate on the NFS server directly was just an attempt to reduce network/NFS IO. Let's chat about this task generally and figure out what we are comfortable doing. Mostly I just don't want to push this to the back of our never-ending list of things to fix and forget about it. Giant accounting files have caused us problems on multiple occasions, and switching over to the new grid seems like a good time to fix this.

Change 485697 merged by Bstorm:
[operations/puppet@production] toolforge: Rotate SGE accounting file from NFS master

https://gerrit.wikimedia.org/r/485697

bd808 assigned this task to Bstorm.

Change 486328 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] sonofgridengine: rotating the log files needs perms and create

https://gerrit.wikimedia.org/r/486328

Change 486328 merged by Bstorm:
[operations/puppet@production] sonofgridengine: rotating the log files needs perms and create

https://gerrit.wikimedia.org/r/486328