Cluebot has a 1.2T php error log file that's growing fast, and making tools run out of disk space. This is bad!
Description
Event Timeline
root@labstore1001:/srv/project/tools/project/cluebot/logs# ls -lsh
total 1.2T
757M -rw-rw---- 1 51109 51109 757M Feb 17 19:15 cbng_bot.err
221M -rw-rw---- 1 51109 51109 221M Feb 17 19:15 cbng_bot.out
848K -rw-rw---- 1 51109 51109 841K Feb 17 19:15 cbng_core.err
   0 -rw-rw---- 1 51109 51109    0 Feb  9 19:57 cbng_core.out
856K -rw-rw---- 1 51109 51109 852K Feb 17 19:15 cbng_redis.err
   0 -rw-rw---- 1 51109 51109    0 Feb  9 19:57 cbng_redis.out
856K -rw-rw---- 1 51109 51109 852K Feb 17 19:15 cbng_relay.err
   0 -rw-rw---- 1 51109 51109    0 Feb  9 19:57 cbng_relay.out
1.2T -rw-rw---- 1 51109 51109 1.2T Feb 17 19:15 cluebot3.err
1.1G -rw-rw---- 1 51109 51109 1.1G Feb 17 19:15 cluebot3.out
4.0M -rw-rw---- 1 51109 51109 4.0M Feb 17 19:15 cron.err
   0 -rw-rw---- 1 51109 51109    0 Feb  9 19:57 cron.out
I replied to the mailing list about this, in reply to the email about NFS running out of space.
Cleared down the (at the time) 1.5T log file. Currently it's at ~2G:
tools.cluebot@tools-bastion-01:~$ du -shc *
3.1M access.log
4.0K backup_database.sh
1.4M cluebot3
55M cluebotng
220M cluebotng-old
492K error.log
2.1G logs
38G mysql_backups
19M node
620K node_modules
1.3M public_html
4.0K replica.my.cnf
4.0K service.log
4.0K service.manifest
165M software
4.0K web-settings.php
40G total
Per the mailing list reply: do you have advice on the best practice for rotating log files created by jsub's stdout/stderr redirection? I don't really want to have to restart the process on rotation; a copy-then-truncate based implementation is an option if a graceful reload is not supported.
I'll have a look into doing nice log rotation/management in the app. Currently the log grows at a couple of hundred MB an hour, so I can manually truncate it until such time as the app handles rotation nicely (this could also be a cron).
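For reference, the manual truncation mentioned above can be done in place without restarting the writing process; a minimal sketch, using a filename from the listing above:

```shell
# Sketch: empty a growing log in place, leaving the writer's open file
# descriptor alone. Caveat: this only reclaims space cleanly if the writer
# opened the file in append mode; a process writing through a plain '>'
# redirection keeps its old byte offset and leaves a sparse file behind.
truncate -s 0 cluebot3.err    # equivalently: : > cluebot3.err
```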
hey @DamianZaremba!
I think @yuvipanda sent some of the logging to /dev/null for now. The bad news is we have no great answer for this question. We have some longer term considerations up in the air like central logging but for now self-rotation is more or less the only option. I'm not sure but I think the rate at which this has grown increased somewhat dramatically within the last 6 weeks or so. Could it be indicative of more verbose logging or some pervasive small error filling things up?
It wouldn't be a good idea for everyone but if the logging is needed in this case, and considering the mass of it, a cron to rotate isn't the worst short term solution.
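One way to implement the cron-driven rotation suggested above is a per-tool logrotate config using copytruncate, which copies the log and then empties it so the running process keeps its open file descriptor. A sketch under assumed paths (the config filename, cron schedule, and log locations are illustrative, not taken from the tool):

```
# Hypothetical ~/logrotate.conf, run from the tool's crontab with e.g.:
#   0 0 * * * /usr/sbin/logrotate -s ~/logrotate.state ~/logrotate.conf
/data/project/cluebot/logs/*.err /data/project/cluebot/logs/*.out {
    daily
    rotate 7
    compress
    missingok
    notifempty
    copytruncate
}
```

Note copytruncate can lose the few lines written between the copy and the truncate, which is usually acceptable for verbose error logs like these.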
Thanks @chasemp! I can see the logs are currently linked to /dev/null.
After a very quick look, it seems the MySQL connection is being a little unreliable. When the DB fails to return a result, a number of large stack traces get printed out.
A very quick manual run results in:
mysqli_ping(): MySQL server has gone away in
mysqli_query(): Premature end of data (mysqlnd_wireprotocol.c:1243)
mysqli_ping(): MySQL server has gone away in
Packets out of order. Expected 1 received 111. Packet size=7233394 in (mysqli_ping is on this line)
This would match up with sentry alerts:
I like the idea of central logging or such, would just need to be careful as to what's logged (currently login tokens etc in the output).
I'll work on some application based log handling and see if the MySQL state can be improved - this is probably causing a few other issues.
The logs were also almost 100% 'undefined variable' and 'undefined index' PHP errors along with the associated stack traces, with very few MySQL ones.
I've cleaned up the code in both bots and put in Monolog with a daily rotation. I'll keep an eye on the size for a couple of days to make sure things are looking ok.
CBNG was creating a few GB due to MySQL issues, but the bulk was indeed CB III with php warnings.
An hour of running with the changes has produced a combined log size of ~2M. Still a little excessive, but manageable.
For the record, this morning (UTC) cluebot3.php was writing tons of warnings about incorrect types to its logfile on NFS.
The effect can be seen here
The load on labstore skyrocketed repeatedly, and in the last two cases I traced it back to cluebot. So basically the excessive load from logging in cluebot3 has killed the performance of all of Toollabs, and Labs in general, for several days.
For now I've set the crontab to send errors to /dev/null; please do not change that back.
I've updated the crontabs to restart the processes with no stdout/stderr logging enabled (https://github.com/DamianZaremba/cluebotng/commit/3d026466688691972563313a6f1f8720c3efe714) to ensure the /dev/null doesn't get reverted and re-deployed the code.
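For anyone reading along, the general pattern for discarding a process's streams rather than letting them accumulate on NFS looks roughly like this (the command itself is a stand-in, not the actual bot invocation):

```shell
# Illustrative sketch: run a noisy process with both stdout and stderr
# discarded, so nothing accumulates in log files on NFS.
sh -c 'echo "normal output"; echo "warning spam" >&2' > /dev/null 2>&1
```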
The application logs seem to be at a stable volume still:
tools.cluebot@tools-bastion-01:~/logs$ du -shc *-*.log
512K cluebot3-2016-02-18.log
8.0M cluebot3-2016-02-19.log
4.2M cluebotng-2016-02-18.log
17M  cluebotng-2016-02-19.log
29M  total
Apologies for the disk spam.... I'll try to spend some more time improving the logging.
So I just chowned the error log to root again, to prevent the writes that caused labstore1001 to be swamped again (see README_NOW_MESSAGE_FROM_WMF_OPS); at the time I created that file I was unaware of this task.
Load dropped from 70 to 11 within a minute.
(Side note: I have moved the code that tested for and submitted continuous jobs from crontab to ~/.bigbrotherrc; cf. https://github.com/DamianZaremba/cluebotng/issues/2. This is not related to this task, but while looking at the crontab I couldn't resist.)
A lot of error.log is:
2016-11-24 15:39:44: (mod_fastcgi.c.2673) FastCGI-stderr: PHP Deprecated: mysql_connect(): The mysql extension is deprecated and will be removed in the future: use mysqli or PDO instead in /mnt/nfs/labstore-secondary-tools-project/cluebot/public_html/includes/header.php on line 15
2016-11-24 15:39:44: (mod_fastcgi.c.2673) FastCGI-stderr: PHP Stack trace:
2016-11-24 15:39:44: (mod_fastcgi.c.2673) FastCGI-stderr: PHP 1. {main}() /mnt/nfs/labstore-secondary-tools-project/cluebot/public_html/index.php:0
2016-11-24 15:39:44: (mod_fastcgi.c.2673) FastCGI-stderr: PHP 2. require_once() /mnt/nfs/labstore-secondary-tools-project/cluebot/public_html/index.php:2
2016-11-24 15:39:44: (mod_fastcgi.c.2673) FastCGI-stderr: PHP 3. mysql_connect() /mnt/nfs/labstore-secondary-tools-project/cluebot/public_html/includes/header.php:15
http://stackoverflow.com/questions/1390607/how-could-i-change-this-mysql-to-mysqli makes that sound doable; fgrep -R mysql_ /data/project/cluebot/public_html shows only 116 occurrences of mysql_. If someone wants to volunteer, the code is at https://github.com/DamianZaremba/cluebotng-report.
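To scope that migration, the legacy call sites can be tallied per function first, which shows whether the 116 hits are a handful of functions or many; a sketch using the path quoted above:

```shell
# Sketch: count which legacy mysql_* functions are actually in use,
# to plan the port to mysqli_* equivalents.
grep -RhoE 'mysql_[a-z_]+' /data/project/cluebot/public_html \
    | sort | uniq -c | sort -rn
```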