Page MenuHomePhabricator

Prevent overly-large log files
Closed, ResolvedPublic

Description

e.g. tools.magnus-toolserver's error.log which is ~10GB. We should do a more general scan of overly-large files, and prevent them from reaching these sizes.

See also: T98652: Clean up huge logs on toollabs


Proposing we set file size limit at 15G via limits.conf

Event Timeline

valhallasw updated the task description. (Show Details)
valhallasw raised the priority of this task from to Needs Triage.
valhallasw added a project: Toolforge.
valhallasw added a subscriber: valhallasw.
Restricted Application added a project: Cloud-Services. · View Herald TranscriptDec 28 2015, 2:42 PM
Restricted Application added subscribers: StudiesWorld, Aklapper. · View Herald Transcript
valhallasw added a comment.EditedJan 10 2016, 10:08 PM

error.log files larger than 100MB:

103236  /data/project/icommons/error.log
109640  /data/project/freefiles/error.log
120448  /data/project/checkwiki/error.log
121680  /data/project/magnustools/error.log
121816  /data/project/wikidata-primary-sources/error.log
122000  /data/project/svgtranslate/error.log
124260  /data/project/ptwikis/error.log
138844  /data/project/templatetiger/error.log
144464  /data/project/timescale/error.log
153048  /data/project/not-in-the-other-language/error.log
153980  /data/project/persondata/error.log
156044  /data/project/enwp10/error.log
166792  /data/project/slumpartikel/error.log
211876  /data/project/jackbot/error.log
228720  /data/project/query2map/error.log
231256  /data/project/flickr2commons/error.log
236932  /data/project/anagrimes/error.log
261136  /data/project/h2bot/error.log
275616  /data/project/glamtools/error.log
290800  /data/project/robin/error.log
309592  /data/project/dnbtools/error.log
331844  /data/project/wikiviz/error.log
376824  /data/project/catscan-pop/error.log
407536  /data/project/ipp/error.log
422624  /data/project/awb/error.log
452376  /data/project/drtrigonbot/error.log
457612  /data/project/xtools-articleinfo/error.log
523708  /data/project/quentinv57-tools/error.log
535424  /data/project/denkmalliste/error.log
771096  /data/project/wikivoyage/error.log
798812  /data/project/wp-world/error.log
1239664 /data/project/fiwiki-tools/error.log
1324972 /data/project/sigma/error.log
1443236 /data/project/contributionsurveyor/error.log
1937664 /data/project/pirsquared/error.log
2241992 /data/project/makeref/error.log
2296660 /data/project/jury/error.log
2375792 /data/project/pinyin-wiki/error.log
3307928 /data/project/wikidata-todo/error.log
3456740 /data/project/zoomable-images/error.log
4042048 /data/project/joanjoc/error.log
5627316 /data/project/refill/error.log
6612788 /data/project/catfood/error.log
6882436 /data/project/catscan3/error.log
7037692 /data/project/geohack/error.log
7249268 /data/project/zoomviewer/error.log
7712608 /data/project/wiwosm/error.log
7975800 /data/project/checkpersondata/error.log
10743520        /data/project/magnus-toolserver/error.log
15187608        /data/project/citations/error.log
22835896        /data/project/osm4wiki/error.log

the largest being 22GB. The issue seems to be mostly errors/warnings from the code running under those projects.

Nemo_bis added a subscriber: Nemo_bis.EditedJan 12 2016, 9:13 AM

You could just run xz -9e on all of them so that nothing gets lost. LZMA reduces them by 3 orders of magnitude, see example with default 7za:

$ for file in *7z; do 7z l $file; done | grep error.log
Listing archive: citations_error.log.7z
Path = citations_error.log.7z
2016-01-11 21:32:38 ....A  15699769394     93802301  error.log
Listing archive: magnus-toolserver_error.log.7z
Path = magnus-toolserver_error.log.7z
2016-01-11 21:44:42 ....A  11146635793    101039521  error.log
Listing archive: osm4wiki_error.log.7z
Path = osm4wiki_error.log.7z
2016-01-11 21:36:32 ....A  24381822868     15969637  error.log
valhallasw renamed this task from Remove overly-large log files to Prevent overly-large log files.Jan 12 2016, 9:26 AM
valhallasw updated the task description. (Show Details)
valhallasw set Security to None.

Thanks, that's good to know. I misformulated this task -- the issue is not so much the space, but rather the NFS usage that is necessary to store so much (useless) data. If I remember correctly, this was one of the issues in last June's NFS crash.

valhallasw updated the task description. (Show Details)Jan 12 2016, 9:40 AM
scfc added a subscriber: scfc.Jan 12 2016, 9:45 AM

The problem with compressing those log files is that the jobs writing to them need to reliably reopen them after they are moved elsewhere, and that is not an easy task (otherwise we would be rotating the log files probably weekly; cf. T68623).

The problem with compressing those log files is that the jobs writing to them need to reliably reopen them after they are moved elsewhere

Then don't move them? Can use xz -k and a separate command (awk?) to remove the top of the file.

So we can do this with limits.conf

*               -       fsize           10000

This would set a hard limit of 10M for files for example.

Behavior is:

  • Writing a file that would be bigger returns exit code non-0 and the file itself is capped at that limit

dd if=/dev/urandom of=test bs=14M count=1; echo $?

File size limit exceeded
153

  • reading a file that already exists and is larger than the limit set is fine
  • appending new information to file that is larger than the limit fails

ulimit -a
file size (blocks, -f) 10000

14M test

echo "foo" >> test; echo $?
File size limit exceeded
153

This mitigates the damages we are fire fighting from overly large / abandoned logs / misc files of huge. We have seen single log files hit 100G all the way up to 1.5T. We have gone through another long and painful cleanup recently and here is where we are for large files:

1./persondata/access.log: 8.4G
2./oar/articles.tsv: 8.5G
3./wiwosm/access.log: 8.5G
4./wikidata-analysis/oldData1/src/wda/data/dump20140106/pages-meta-history.xml.bz2: 8.9G
5./wikidata-analysis/oldData1/src/wda/data/dump20140123/pages-meta-history.xml.bz2: 9.4G
6./merlbot/IaQSWorklistRun.out: 9.7G
7./templatetiger/public_html/dumps/svwiki-2016-01-11-4DIT.txt: 9.7G
8./currentevents/dumps/dewiki/20151102/dewiki-20151102-pages-meta-history4.xml.7z: 9.8G
9./merlbot/IaQSWorklistQuickRun.out: 9.8G
10./wikidata-analysis/oldData1/src/wda/data/dump20140210/pages-meta-history.xml.bz2: 9.8G
11./replacer/eswiki-20160501-pages-articles.xml: 9.9G
12./yifeibot/dumps/metawiki-20140306-pages-meta-current.xml: 9.9G
13./liangent-php/mw-log/20160303175852000000-13261-DispatchRecentChanges.log: 10G
14./viafbot/enwiki-20140304-pages-articles.xml.bz2: 10G
15./wikidata-analysis/oldData1/src/wda/data/dump20140226/pages-meta-history.xml.bz2: 11G
16./wikiedudashboard/access.log: 11G
17./wikiviewstats/bck/full/daystats/dump-2013-01-01-2013-11-19.sql: 11G
18./wp-world/access.log: 11G
19./lbenedix/public_html/clmb/bots/complete/data/2014-01-31/revision.csv.bz2: 12G
20./lbenedix/public_html/clmb/bots/complete/data/revision.csv.bz2: 12G
21./templatetiger/public_html/dumps/svwiki-2016-02-03-hlme.txt: 12G
22./liangent-php/mw-log/20160422031143000000-25875-DispatchRecentChanges.log: 13G
23./templatetiger/public_html/dumps/svwiki-2016-03-05-5shK.txt: 13G
24./liangent-php/mw-log/20151125233947000000-16347-DispatchRecentChanges.log: 14G
25./oar/repository_metadata_2014-06-13.tar.gz: 14G
26./templatetiger/public_html/dumps/svwiki-2016-04-07.txt: 14G
27./liangent-php/mw-log/20160505192608000000-4588-DispatchRecentChanges.log: 15G
28./merlbot/IaWorklistRun.out: 15G
29./templatetiger/public_html/dumps/svwiki-2016-05-01.txt: 15G
30./heritage/logs/update_monuments.log: 17G
31./ifttt/www/python/src/ifttt.log: 17G
32./orphan-groups/wiki-links-graph/report/enwiki-pagelinks.csv: 18G
33./templatetiger/public_html/dumps/splittest/enwiki-2014-04-02-lPqO.txt: 19G
34./osm4wiki/error.log: 21G
35./templatetiger/public_html/dumps/enwiki-2015-12-01.txt: 24G
36./paws/userhomes/17347449/data_requests/sorin_revisions/output.jsonl: 25G
37./wiwosm/error.log: 27G
38./osm/access.log: 29G
39./wikivoyage/access.log: 29G
40./.shared/wikidatajson/20140929.json: 31G
41./wikiviewstats/bck/full/current/dump_pagemap.sql: 33G
42./wikiviewstats/bck/full/old/dump_pagemap.sql: 33G
43./ifttt/uwsgi.log: 37G
44./merlbot/AuszeichnungsKategorieFehlt_weekly.out: 39G
45./splinetools/dumps/enwiki-20141106-pages-articles.xml: 48G
46./cyberbot/CyberbotI/admin-highlight.out: 50G
47./persondata/templatedata/debug.txt: 50G
48./toolserver-home-archive/archive-2014-06-05.tar.xz: 52G
49./oar/repository_text_2014-06-13.tar.gz: 75G
50./geohack/access.log: 127G

So only about the top 40 files out of many thousands are greater than 10G. 17 of those 40 are log, error, out files. 7 more are archives we really don't want people expanding/managing on NFS.


Being practical I think we should set a limit on all Tools hosts that files cannot be larger than 15G. Ideally at the moment we say 10G but that seems aggressive for the moment, so I propose 15G and then revisit. This would go a long way towards sanity. It is true then that a runaway tool spamming NFS with logs will start to get permission errors at this point. However, as things stand now we already resort to setting root only permissions on runaway tools, it's just that we do it once it has caused serious issues for all tools on the same node or labs in general.

chasemp triaged this task as High priority.May 26 2016, 6:24 PM
chasemp updated the task description. (Show Details)

On second thought as a blanket limit maybe 50G makes sense and allows the liberal use of scratch share, but either way the current 'unlimited' is not working out.

So we went from considering deletion too heavy-handed to being fine with unrecoverable errors above a certain threshold?

What is wrong with my middle ground?

Then don't move them? Can use xz -k and a separate command (awk?) to remove the top of the file.

If a timestamp gets added to the filename, it's trivial to delete those files later (there is no good reason to keep all that private data indefinitely anyway).

So we went from considering deletion too heavy-handed

I don't think we considered deletion too heavy-handed, but we try to do this while considering the needs of users.

to being fine with unrecoverable errors above a certain threshold?

This is not different from quota, where you would get 'Disk quota exceeded'. Keep in mind that the toolserver had a 256MB disk space limit, so allowing 15GB files is not exactly unreasonable, especially given that a 15GB file can already easily completely fill a hosts' NFS bandwith for at least half an hour. It also makes it super easy to accidentally kill tools-login, for example.

What is wrong with my middle ground?

First, it requires a significant engineering effort to build and maintain such a system, and we are overloaded as-is. It's easy to say something is 'trivial' when you're not the one doing the work.
Secondly, it actually places *more* load on NFS rather than less, and thus doesn't actually help to solve the underlying issue.

I'm not sure why it would be a significant engineering effort, it's just a matter of issuing one find command (with ionice?). But sure, disk quotas are a common solution.

Ladsgroup added a subscriber: Ladsgroup.EditedMay 27 2016, 11:38 AM

syslog and daemon.log work this way:

  • syslog <-- log of today
  • syslog.1 <-- log of yesterday
  • syslog.2.gz (zipped) <-- log of the day before

And so on.
It would be great to have such system (maybe in order of week not day) for jsub continuous logs. Or better, let's user choose if they want this way or not. What do you think? Beside having less size, it would be much easier to find logs of certain dates.

Unfortunately that's not possible without straining NFS even more. SGE does not allow move-and-write-to-new-file style logrotation, which means we have to use copy-and-truncate, which would double the NFS load. In addition, the same issues with human resources apply: someone has to build and maintain a syslog config to do so.

valhallasw moved this task from Triage to Backlog on the Toolforge board.May 27 2016, 1:11 PM

Thanks @valhallasw

Anything that is reactive here doesn't satisfy the task in my mind. If we setup a cron that finds and truncates, or compresses, or rotates it changes the nature of the problem but doesn't address it as the task title suggest. Log growth is not predictable in this environment, some bots go crazy and log >1T in a few days or even attempt this rate within a single day. That kind of load burns our server to the ground historically, and while we are doing much better due to other rate limiting in place it has still been very time consuming to manage this problem. We don't as of now generally have the overhead on the labstore server to run these find jobs or long tailed compression and rotation jobs. I have spent a lot of time stabilizing this design but the real issue is preventing the need for constant intervention so we can make things better in the longer term.

@Ladsgroup you are absolutely right that model is much better, but we need to move logs off of NFS and into an actual log management system to accomplish it. Logs on NFS is a bad solution, has been a bad solution, and will never be a good solution. This change is really about moving us further towards being able to accomplish that goal.

scfc added a comment.May 27 2016, 6:35 PM

[…]

to being fine with unrecoverable errors above a certain threshold?

This is not different from quota, where you would get 'Disk quota exceeded'. Keep in mind that the toolserver had a 256MB disk space limit, so allowing 15GB files is not exactly unreasonable, especially given that a 15GB file can already easily completely fill a hosts' NFS bandwith for at least half an hour. It also makes it super easy to accidentally kill tools-login, for example.
[…]

On Toolserver, webserver logs were shared between all tools and thus not included in the quota (as a side-effect of that, debugging could be much harder than in Tools where each tool has a dedicated log).

What I sincerely dislike about the proposal to cut at 50 GByte (or 15 GByte) is the disconnect between cause and effect. On Toolserver, when you reached your quota you were warned (on login and by mail) and could act. It was your problem when you did not. With this proposal, it's the tool developer's duty to constantly fear that one of their files ran away. They get blamed, but have no monitoring at their disposal (unless hundreds of developers each write their own every-5-minute cron job to alarm them).

If you have a central monitoring system, you can also use more helpful solutions than just denying write access. Just a short glimpse at the file will often indicate what needs to be done to prevent it in the future, so you can point the developer to what the problem is and how to fix it.

It is in no way a perfect solution, but it is a solution that prevents the largest issues with a minimum of engineering effort. Having a better logging solution is on the radar (T127367).

For what it's worth, this is what happens on the grid if the h_fsize limit is set:

  • ulimit's fsize is set to 15/1000 of the value of h_fsize
  • process core dump with 19225 File size limit exceeded(core dumped) dd count=10000 if=/dev/zero of=/dev/stdout
  • this error is written to error.log

Well, unless error.log is also too large, which could be an issue as well for lighttpd logs...

No, it's definitely not optimal, but I'm not aware of any other solutions that would prevent runaway log files without significant human effort.

bd808 moved this task from Backlog to Shared Storage on the Data-Services board.Jul 24 2017, 12:40 AM
Framawiki removed a subscriber: Framawiki.
Framawiki added a subscriber: Framawiki.
Bstorm added a comment.EditedMar 12 2019, 7:03 PM

Linking in recent pages and even outages due to log files in excess of 1TB. There are far more of those, but the recent ones provide context around why I'm looking at this old task again.

Change 496082 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] toolforge: limit file size to 50 GB

https://gerrit.wikimedia.org/r/496082

Change 496082 merged by Bstorm:
[operations/puppet@production] toolforge: limit file size to 50 GB

https://gerrit.wikimedia.org/r/496082

Bstorm closed this task as Resolved.Mar 14 2019, 6:06 PM