Page MenuHomePhabricator

Simple logrotate service for users of Tools as stopgap before central logging
Open, MediumPublic

Description

Now that we have migrated the tools to the new cluster we have some overhead to do processing for something like log rotation. This is going to be relatively primitive but at the moment I have been forced to say we have no option for this in the case of many runaway logs. A small practical solution will allow us to focus on the actual intended central logging outcome instead of constantly auditing and mitigating logs that could easily be truncated on a schedule.

Basic premise:

  • Create a 'logs' directory in every tools directly (by default and retroactively)
  • Handle files ending in .log and .err (keep a months worth unless x size type of thing. Basically, allow us to keep the lights on and stop saturating the Tools share)
  • Make jsub put .out and .err into 'logs' as .log and .err
  • Make other services were we control the default output path put logs in this dir
  • Be able to offload cases where a user has no ability to manage their own logs to this programattic mechanism

Event Timeline

I think this is a duplicate of T68623. Note that blocking that issue is not missing man-power, but technical impossibilities.

@scfc thanks, I had not seen that but yes very similar. I'm avoiding logrotate for this reason (though their is a potential solution there w/ setfacl but I'd rather not go there as it gets hairy over NFS)

I know jsub already has an option to log to a specific directory (Ive been using this for a while to avoid sge logs flooding my $HOME perhaps setting a global option using a username variable?

Of note some tools are already using the internal logs dir scheme

ls -ald /srv/tools/shared/tools/project/**/logs | wc

76     684    7228

In an effort to create good defaults without stepping on existing toes at the moment I'm excluding anyone who is currently using a 'logs' directory intra tool home. P4580

This is an existing file instead of a dir:

touch: cannot touch ‘/srv/tools/shared/tools/project/afcbot/logs/.norotate’: Not a director

Otherwise all of them now have a '.norotate' file and I the mechanism will ignore the dir.

Change 326153 had a related patch set uploaded (by Rush; owner: cpettet):
[operations/puppet@production] tools: job to copytruncate logs in place

https://gerrit.wikimedia.org/r/326153

du -sh /srv/tools/shared/tools/project/admin/*

2.2G	/srv/tools/shared/tools/project/admin/access.log
0	/srv/tools/shared/tools/project/admin/admin
123M	/srv/tools/shared/tools/project/admin/archived-packages
12K	/srv/tools/shared/tools/project/admin/bin
49M	/srv/tools/shared/tools/project/admin/error.log
4.0K	/srv/tools/shared/tools/project/admin/logs
0	/srv/tools/shared/tools/project/admin/public_html
4.0K	/srv/tools/shared/tools/project/admin/replica.my.cnf
4.0K	/srv/tools/shared/tools/project/admin/service.log
4.0K	/srv/tools/shared/tools/project/admin/service.manifest
12M	/srv/tools/shared/tools/project/admin/tool-admin-web
484K	/srv/tools/shared/tools/project/admin/toolhistory.err
2.2M	/srv/tools/shared/tools/project/admin/toolhistory.out
4.0K	/srv/tools/shared/tools/project/admin/toolinfo.json
60M	/srv/tools/shared/tools/project/admin/var

Change 326153 merged by Rush:
[operations/puppet@production] tools: job to copytruncate logs in place

https://gerrit.wikimedia.org/r/326153

Change 371633 had a related patch set uploaded (by Rush; owner: cpettet):
[operations/puppet@production] tools: followup to 326153

https://gerrit.wikimedia.org/r/371633

Change 371633 merged by Rush:
[operations/puppet@production] tools: followup to 326153

https://gerrit.wikimedia.org/r/371633

/usr/local/sbin/logcleanup --config /etc/logcleanup-config.yaml --debug

2017-08-12 17:56:41,618 DEBUG Namespace(config='/etc/logcleanup-config.yaml', debug=True, dir=None, end_with=None, max_copytruncate=None, min_rotate_size=None, rotation_day=None, tail_lines=None)
2017-08-12 17:56:41,621 DEBUG {'rotation_day': 'wednesday', 'end_with': ['log', 'err', 'out'], 'max_copytruncate': 10000000, 'min_rotate_size': 1000, 'debug': True, 'config': '/etc/logcleanup-config.yaml', 'tail_lines': 10000, 'dir': ['/srv/tools/shared/tools/project/admin']}
2017-08-12 17:56:41,621 DEBUG found 1 valid paths
2017-08-12 17:56:41,621 DEBUG Found 5 valid files from
2017-08-12 17:56:41,621 DEBUG /srv/tools/shared/tools/project/admin/toolhistory.err
2017-08-12 17:56:41,621 DEBUG /srv/tools/shared/tools/project/admin/toolhistory.err is 488563 bytes
2017-08-12 17:56:41,622 DEBUG /srv/tools/shared/tools/project/admin/error.log
2017-08-12 17:56:41,622 DEBUG /srv/tools/shared/tools/project/admin/error.log is 51244464 bytes
2017-08-12 17:56:41,622 WARNING /srv/tools/shared/tools/project/admin/error.log is larger than 10000000
2017-08-12 17:56:42,872 DEBUG /srv/tools/shared/tools/project/admin/error.log tailed to /srv/tools/shared/tools/project/admin/error.log.1
2017-08-12 17:56:42,976 INFO truncate /srv/tools/shared/tools/project/admin/error.log
2017-08-12 17:56:42,977 DEBUG /srv/tools/shared/tools/project/admin/toolhistory.out
2017-08-12 17:56:42,977 DEBUG /srv/tools/shared/tools/project/admin/toolhistory.out is 2235922 bytes
2017-08-12 17:56:42,977 DEBUG /srv/tools/shared/tools/project/admin/access.log
2017-08-12 17:56:42,977 DEBUG /srv/tools/shared/tools/project/admin/access.log is 2287822262 bytes
2017-08-12 17:56:42,978 WARNING /srv/tools/shared/tools/project/admin/access.log is larger than 10000000
2017-08-12 17:56:43,031 DEBUG /srv/tools/shared/tools/project/admin/access.log tailed to /srv/tools/shared/tools/project/admin/access.log.1
2017-08-12 17:56:44,262 INFO truncate /srv/tools/shared/tools/project/admin/access.log
2017-08-12 17:56:44,263 DEBUG /srv/tools/shared/tools/project/admin/service.log
2017-08-12 17:56:44,264 DEBUG /srv/tools/shared/tools/project/admin/service.log is 332 bytes
2017-08-12 17:56:44,264 DEBUG /srv/tools/shared/tools/project/admin/service.log is too small to rotate
2017-08-12 17:56:44,264 DEBUG processed 5 logs

du -sh /srv/tools/shared/tools/project/admin/*

12K	/srv/tools/shared/tools/project/admin/access.log
2.4M	/srv/tools/shared/tools/project/admin/access.log.1
0	/srv/tools/shared/tools/project/admin/admin
123M	/srv/tools/shared/tools/project/admin/archived-packages
12K	/srv/tools/shared/tools/project/admin/bin
0	/srv/tools/shared/tools/project/admin/error.log
2.4M	/srv/tools/shared/tools/project/admin/error.log.1
4.0K	/srv/tools/shared/tools/project/admin/logs
0	/srv/tools/shared/tools/project/admin/public_html
4.0K	/srv/tools/shared/tools/project/admin/replica.my.cnf
4.0K	/srv/tools/shared/tools/project/admin/service.log
4.0K	/srv/tools/shared/tools/project/admin/service.manifest
12M	/srv/tools/shared/tools/project/admin/tool-admin-web
484K	/srv/tools/shared/tools/project/admin/toolhistory.err
2.2M	/srv/tools/shared/tools/project/admin/toolhistory.out
4.0K	/srv/tools/shared/tools/project/admin/toolinfo.json
60M	/srv/tools/shared/tools/project/admin/var

Change 371634 had a related patch set uploaded (by Rush; owner: cpettet):
[operations/puppet@production] tools: remote logcleanup active cron for now

https://gerrit.wikimedia.org/r/371634

2017-08-12 18:01:08,214 DEBUG /srv/tools/shared/tools/project/admin/access.log
2017-08-12 18:01:08,214 DEBUG /srv/tools/shared/tools/project/admin/access.log is 2287903326 bytes
2017-08-12 18:01:08,214 WARNING /srv/tools/shared/tools/project/admin/access.log is larger than 10000000

but

140K /srv/tools/shared/tools/project/admin/access.log

Change 371634 merged by Rush:
[operations/puppet@production] tools: remote logcleanup active cron for now

https://gerrit.wikimedia.org/r/371634

So python still thinks the original access.log is the 2.2G from pre rotation?

fpath_size = os.path.getsize(fpath)
/srv/tools/shared/tools/project/admin/access.log 2288075289

edit:

du -b /srv/tools/shared/tools/project/admin/access.log

2288273468 /srv/tools/shared/tools/project/admin/access.log

du -b access.log
2288534827 access.log

webservice restart

du -b access.log
2288568410 access.log

access.log

tools.admin@tools-bastion-03:~$ du -b access.log
0 access.log

sleep 30

du -b access.log
2288607847 access.log


tools.admin@tools-bastion-03:~$ du -h access.log
32K access.log
tools.admin@tools-bastion-03:~$ du -b access.log
2288630767 access.log

tools.admin@tools-bastion-03:~$ ls -i access.log
80122951 access.log

tools.admin@tools-bastion-03:~$ truncate --size 0 access.log
tools.admin@tools-bastion-03:~$ du -b access.log
0	access.log
tools.admin@tools-bastion-03:~$ du -b access.log
2288733107	access.log
tools.admin@tools-bastion-03:~$ python
Python 2.7.6 (default, Oct 26 2016, 20:30:19)
[GCC 4.8.4] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import os.path; os.path.getsize('access.log')
2288755665

tools.admin@tools-bastion-03:~$ ls -i access.log
80122951 access.log

tools.admin@tools-bastion-03:~$ du -b access.log
2288823046	access.log
tools.admin@tools-bastion-03:~$ ls -i access.log
80122951 access.log
tools.admin@tools-bastion-03:~$ webservice stop
Stopping webservice.
tools.admin@tools-bastion-03:~$ > acc
access.log    access.log.5
tools.admin@tools-bastion-03:~$ > access.log
tools.admin@tools-bastion-03:~$ webservice start
Starting webservice.
tools.admin@tools-bastion-03:~$ du -b access.log && sleep 60
0	access.log
tools.admin@tools-bastion-03:~$ du -b access.log
39680	access.log

wut

root@labstore1004:~# du -b /srv/tools/shared/tools/project/admin/access.log
287995	/srv/tools/shared/tools/project/admin/access.log
root@labstore1004:~# truncate --size 0 /srv/tools/shared/tools/project/admin/access.log
root@labstore1004:~# du -b /srv/tools/shared/tools/project/admin/access.log
0	/srv/tools/shared/tools/project/admin/access.log
root@labstore1004:~# du -b /srv/tools/shared/tools/project/admin/access.log
307427	/srv/tools/shared/tools/project/admin/access.log

> lighttpd 26283 tools.admin 5w REG 0,30 599081 80122951 /mnt/nfs/labstore-secondary-tools-project/admin/access.log (nfs-tools-project.svc.eqiad.wmnet:/project/tools/project)

Mostly silly troubleshooting on my part here :) When truncating the file for lighttpd that is not using O_APPEND the seek position remains since the execution environment process still has the file handle open. Stopping the lighttpd process post truncation makes this clear. We have done cleanup of this variety a bunch of times and it has never really been sussed out, probably as the environment was not really stable enough for us to notice the long term effects and the cleanup is effective when the most serious offenders are usually scheduled jobs or jobs that do not maintain long running file handles on the log file.

I should have noticed this but I was confusing myself with a wc on the /wrong file/ and so the source data seemed to be there :) Faidon outlined the expected behavior of truncate on a file and I reversed my steps and then smacked my forehead.

O_APPEND is actually not recommened in NFS setups:

O_APPEND
              The file is opened in append mode.  Before each write(2), the
              file offset is positioned at the end of the file, as if with
              lseek(2).  The modification of the file offset and the write
              operation are performed as a single atomic step.

              O_APPEND may lead to corrupted files on NFS filesystems if
              more than one process appends data to a file at once.  This is
              because NFS does not support appending to a file, so the
              client kernel has to simulate it, which can't be done without
              a race condition.

The underlying logcleanup script itself makes a wrong assumption that if a file is x size that it also has n content for tailing. When it runs and sees an access.log that is larger than the max size it then tries to apply the tail count but if that file has already been truncated the operation fails as seek does not return as expected. A wc on the example access.log post truncate with the file handle open will return only /new lines/ since truncation and since this generally could not satisfy the count for the copy truncate function the logcleanup script locks up and load increases as it spins its wheels.

I explored a per-lighttpd* Toolforge execution host logcleanup script. i.e. using the sge supervising process to track down all running webservers to truncatethen send HUP. That gets messy potentially as it mucks about confusing SGE. Not sure what makes the most sense but doing this server side in an ongoing fashion has issues.

http://man7.org/linux/man-pages/man2/open.2.html