
Cluebot writes massive logs that are making labstore run out of space and surge in load making toollabs unavailable
Closed, ResolvedPublic

Description

Cluebot has a 1.2T php error log file that's growing fast, and making tools run out of disk space. This is bad!

Event Timeline

yuvipanda raised the priority of this task from to Needs Triage.
yuvipanda updated the task description. (Show Details)
yuvipanda added a subscriber: yuvipanda.
Restricted Application added subscribers: StudiesWorld, Aklapper.Feb 17 2016, 7:10 PM
root@labstore1001:/srv/project/tools/project/cluebot/logs# ls -lsh
total 1.2T
757M -rw-rw---- 1 51109 51109 757M Feb 17 19:15 cbng_bot.err
221M -rw-rw---- 1 51109 51109 221M Feb 17 19:15 cbng_bot.out
848K -rw-rw---- 1 51109 51109 841K Feb 17 19:15 cbng_core.err
   0 -rw-rw---- 1 51109 51109    0 Feb  9 19:57 cbng_core.out
856K -rw-rw---- 1 51109 51109 852K Feb 17 19:15 cbng_redis.err
   0 -rw-rw---- 1 51109 51109    0 Feb  9 19:57 cbng_redis.out
856K -rw-rw---- 1 51109 51109 852K Feb 17 19:15 cbng_relay.err
   0 -rw-rw---- 1 51109 51109    0 Feb  9 19:57 cbng_relay.out
1.2T -rw-rw---- 1 51109 51109 1.2T Feb 17 19:15 cluebot3.err
1.1G -rw-rw---- 1 51109 51109 1.1G Feb 17 19:15 cluebot3.out
4.0M -rw-rw---- 1 51109 51109 4.0M Feb 17 19:15 cron.err
   0 -rw-rw---- 1 51109 51109    0 Feb  9 19:57 cron.out
chasemp triaged this task as High priority.Feb 17 2016, 7:59 PM
chasemp set Security to None.
chasemp added a subscriber: chasemp.

It's growing at a rate of >100G a day, which is unsustainable.

I've symlinked the log file to /dev/null; let's see how that goes!

I also restarted all the restartable jobs so that this change can take effect.
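The stopgap described above can be sketched as follows (the path and filename are illustrative, not the exact ones on labstore):

```shell
# Replace the runaway log with a symlink to /dev/null so that any new
# writes through this path are discarded.
LOG=cluebot3.err        # hypothetical stand-in for the real log path

rm -f "$LOG"            # drop the huge file; the space is only freed once
                        # no running process still holds the old inode open
ln -sf /dev/null "$LOG" # writes through the symlink now go nowhere

ls -l "$LOG"            # cluebot3.err -> /dev/null
```

The restart mentioned above is the important half: a process that already has the old file open keeps writing to the deleted inode until it reopens the path.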

Samtar added a subscriber: Samtar.Feb 17 2016, 8:21 PM

Adding Rich as maintainer of that tool...

I'll look into chilling down the log files a little bit...

RichSmith added a subscriber: Cobi.Feb 17 2016, 9:31 PM
DamianZaremba added a subscriber: DamianZaremba.EditedFeb 17 2016, 10:00 PM

I replied to the mailing list about this, in reply to the email about NFS running out of space.

Cleared down the (at the time) 1.5T log file. Currently it's at ~2G:

tools.cluebot@tools-bastion-01:~$ du -shc *
3.1M access.log
4.0K backup_database.sh
1.4M cluebot3
55M cluebotng
220M cluebotng-old
492K error.log
2.1G logs
38G mysql_backups
19M node
620K node_modules
1.3M public_html
4.0K replica.my.cnf
4.0K service.log
4.0K service.manifest
165M software
4.0K web-settings.php
40G total

Per the mailing list reply: do you have any advice on best practice for rotating log files created from jsub's stdout/stderr? I don't really want to have to restart the process on rotation; a copy-then-truncate rotation is an option if a graceful reload is not supported.

I'll have a look into doing nice log rotation/management in the app. Currently the log grows at a couple of hundred MB an hour, so I can manually truncate it until such time as the app handles rotation nicely (it could also be a cron).
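The copy-then-truncate approach can be sketched like this (the function name and path are my own, not from the tool). Since the jsub-managed process keeps its stderr file descriptor open on the same inode, truncating in place avoids a restart:

```shell
# Minimal copy-then-truncate rotation for a log held open by a running
# job. Caveat: any writes that land between the cp and the truncate are
# lost, so this is a best-effort scheme.
rotate() {
    log="$1"
    stamp=$(date +%Y-%m-%d)
    cp "$log" "$log.$stamp"   # snapshot the current contents
    truncate -s 0 "$log"      # frees the space without closing the fd;
                              # if the writer opened the file in append
                              # mode (the usual case for shell redirects)
                              # its next write lands at offset 0
    gzip -f "$log.$stamp"     # compress the rotated copy
}

# e.g. rotate "$HOME/logs/cluebot3.err"   # path is illustrative
```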

hey @DamianZaremba!

I think @yuvipanda sent some of the logging to /dev/null for now. The bad news is we have no great answer for this question. We have some longer-term considerations up in the air, like central logging, but for now self-rotation is more or less the only option. I'm not sure, but I think the rate at which this has grown increased somewhat dramatically within the last 6 weeks or so. Could it be indicative of more verbose logging, or some pervasive small error filling things up?

It wouldn't be a good idea for everyone, but if the logging is needed in this case, then considering the mass of it, a cron to rotate isn't the worst short-term solution.
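Such a cron-based rotation might look like the following crontab fragment (the script path and schedule are hypothetical, just to show the shape):

```shell
# Hypothetical crontab entry: run a copy-then-truncate rotation every
# 6 hours, and discard cron's own chatter so the rotation job doesn't
# generate logs of its own.
0 */6 * * * /data/project/cluebot/bin/rotate-logs.sh >/dev/null 2>&1
```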

Thanks @chasemp! I can see the logs are current linked to /dev/null.

After a very quick look, it seems the MySQL connection is being a little unreliable. Because the DB doesn't return a result, a number of large stack traces get printed out.

A very quick manual run results in:

mysqli_ping(): MySQL server has gone away in
mysqli_query(): Premature end of data (mysqlnd_wireprotocol.c:1243)
mysqli_ping(): MySQL server has gone away in
Packets out of order. Expected 1 received 111. Packet size=7233394 in (mysqli_ping is on this line)

This would match up with the Sentry alerts.

I like the idea of central logging or such, would just need to be careful as to what's logged (currently login tokens etc in the output).

I'll work on some application based log handling and see if the MySQL state can be improved - this is probably causing a few other issues.

The logs were also almost 100% 'undefined variable' and 'undefined index' PHP errors along with associated stack traces, with very few MySQL ones.

I've cleaned up the code in both bots and put in Monolog with a daily rotation. I'll keep an eye on the size for a couple of days to make sure things are looking ok.

CBNG was creating a few GB due to MySQL issues, but the bulk was indeed CB III with php warnings.

An hour of running with the changes has created the combined log size of ~2M. A little excessive still, but manageable.

Joe added a subscriber: Joe.Feb 19 2016, 2:11 PM

For the record, this morning (UTC) cluebot3.php was writing tons of warnings about incorrect types to its logfile on NFS.

The effect can be seen here

The load on labstore skyrocketed repeatedly, and in the last two cases I traced it back to cluebot.

So basically the excessive load from logging in cluebot3 has killed the performance of all of Toollabs and Labs in general for several days.

For now I've set the crontab to log errors to /dev/null; please do not change that back.

Joe renamed this task from Cluebot writes massive logs that are making labstore run out of space to Cluebot writes massive logs that are making labstore run out of space and surge in load making toollabs unavailable.Feb 19 2016, 2:11 PM
Joe raised the priority of this task from High to Unbreak Now!.

I've updated the crontabs to restart the processes with no stdout/stderr logging enabled (https://github.com/DamianZaremba/cluebotng/commit/3d026466688691972563313a6f1f8720c3efe714) to ensure the /dev/null doesn't get reverted and re-deployed the code.
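The shape of such a crontab entry is roughly this (a generic sketch, not the literal entry from the linked commit):

```shell
# Generic sketch: restart the bot from cron with both output streams
# discarded, so a redeploy can't silently reintroduce multi-terabyte
# stdout/stderr logs. The script path and schedule are placeholders.
*/10 * * * * /data/project/cluebot/bin/start-if-not-running.sh >/dev/null 2>&1
```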

The application logs seem to be at a stable volume still:

tools.cluebot@tools-bastion-01:~/logs$ du -shc *-*.log
512K	cluebot3-2016-02-18.log
8.0M	cluebot3-2016-02-19.log
4.2M	cluebotng-2016-02-18.log
17M	cluebotng-2016-02-19.log
29M	total

Apologies for the disk spam.... I'll try to spend some more time improving the logging.

So I just chowned the error log to root again, to prevent the writes that caused labstore1001 to be swamped again (see README_NOW_MESSAGE_FROM_WMF_OPS); at the time I created that file I was unaware of this task.

Load dropped from 70 to 11 within a minute.

Time of last write (before I chowned it): Feb 19 14:39 error.log

mark added a subscriber: mark.Feb 19 2016, 3:17 PM
scfc added a subscriber: scfc.Feb 20 2016, 7:06 PM

(Side note: I have moved the code that tested for and submitted continuous jobs from crontab to ~/.bigbrotherrc; cf. https://github.com/DamianZaremba/cluebotng/issues/2. This is not related to this task, but while looking at the crontab I couldn't resist.)

No news here for two weeks.
Can this task get an update and/or an assignee, please?

yuvipanda lowered the priority of this task from Unbreak Now! to Medium.Mar 10 2016, 7:35 PM

Not really UBN anymore.

scfc added a comment.Nov 24 2016, 4:03 PM

A lot of error.log is:

2016-11-24 15:39:44: (mod_fastcgi.c.2673) FastCGI-stderr: PHP Deprecated:  mysql_connect(): The mysql extension is deprecated and will be removed in the future: use mysqli or PDO instead in /mnt/nfs/labstore-secondary-tools-project/cluebot/public_html/includes/header.php on line 15
2016-11-24 15:39:44: (mod_fastcgi.c.2673) FastCGI-stderr: PHP Stack trace:
2016-11-24 15:39:44: (mod_fastcgi.c.2673) FastCGI-stderr: PHP   1. {main}() /mnt/nfs/labstore-secondary-tools-project/cluebot/public_html/index.php:0
2016-11-24 15:39:44: (mod_fastcgi.c.2673) FastCGI-stderr: PHP   2. require_once() /mnt/nfs/labstore-secondary-tools-project/cluebot/public_html/index.php:2
2016-11-24 15:39:44: (mod_fastcgi.c.2673) FastCGI-stderr: PHP   3. mysql_connect() /mnt/nfs/labstore-secondary-tools-project/cluebot/public_html/includes/header.php:15

http://stackoverflow.com/questions/1390607/how-could-i-change-this-mysql-to-mysqli makes that sound doable; fgrep -R mysql_ /data/project/cluebot/public_html shows only 116 occurrences of mysql_. If someone wants to volunteer, the code is at https://github.com/DamianZaremba/cluebotng-report.

scfc moved this task from Triage to Backlog on the Toolforge board.Dec 5 2016, 4:14 AM
bd808 closed this task as Resolved.Jul 12 2018, 5:04 PM
bd808 assigned this task to DamianZaremba.
bd808 added a subscriber: bd808.

Logs have been fine as far as I know for quite some time now.