Page MenuHomePhabricator

cirrus dump process generates too many log files
Closed, ResolvedPublic

Description

The script to generate json dumps from cirrus indices seems to confuse logrotate keeping a lot of log files in snapshot1007.eqiad.wmnet@/var/log/cirrusdump/
A possible cause is that the log files themselves contain the date in their name possibly confusing logrotate.

Event Timeline

Deskana subscribed.

This makes it hard to look at log files since there are tens of thousands of files in the relevant folders. This doesn't seem super pressing, but is worth a quick look some time.

Relatedly, we should only be keeping 90 days worth of these logs and that's not happening. I've cleaned up the files from 2016 and Jan 2017 manually but there should be a cron to clean these up.

I noticed many files ending in 201-67][MMDD] lying around; I've tossed these as well. They would not be caught by logrotate. I've also updated logrotate to rotate even the empty log files, since new ones are created on every run regardless. See https://gerrit.wikimedia.org/r/#/c/349922/

We ought to keep an eye on it and see what happens over the next few runs.

Well, we don't have any older files at least, but I don't understand where these come from:

-rw-rw-r-- 1 datasets datasets   0 May  2 06:25 cirrusdump-aawiki-20170206-cirrussearch-content.log
-rw-rw-r-- 1 datasets datasets  20 Apr 28 06:35 cirrusdump-aawiki-20170206-cirrussearch-content.log-20170429.gz
-rw-rw-r-- 1 datasets datasets  20 Apr 29 06:48 cirrusdump-aawiki-20170206-cirrussearch-content.log-20170430.gz
-rw-rw-r-- 1 datasets datasets  20 Apr 30 06:27 cirrusdump-aawiki-20170206-cirrussearch-content.log-20170501.gz
-rw-rw-r-- 1 datasets datasets   0 May  1 06:25 cirrusdump-aawiki-20170206-cirrussearch-content.log-20170502
-rw-rw-r-- 1 datasets datasets   0 May  2 06:25 cirrusdump-aawiki-20170206-cirrussearch-general.log
-rw-rw-r-- 1 datasets datasets  20 Apr 28 06:35 cirrusdump-aawiki-20170206-cirrussearch-general.log-20170429.gz
-rw-rw-r-- 1 datasets datasets  20 Apr 29 06:48 cirrusdump-aawiki-20170206-cirrussearch-general.log-20170430.gz
-rw-rw-r-- 1 datasets datasets  20 Apr 30 06:27 cirrusdump-aawiki-20170206-cirrussearch-general.log-20170501.gz
-rw-rw-r-- 1 datasets datasets   0 May  1 06:25 cirrusdump-aawiki-20170206-cirrussearch-general.log-20170502
-rw-rw-r-- 1 datasets datasets   0 May  2 06:25 cirrusdump-aawiki-20170213-cirrussearch-content.log

and lots more like them. These are the first in the list by date, i.e. 20170206 in the filename is as far back as they go. But why do we have these at all, generated on May 2?

I took another look at the cirrusdump logs today. Over 174k files, the oldest of them is from June 21. Something is very broken either in the naming or the rotation. The oldest are:


-rw-rw-r-- 1 datasets datasets 117 Jun 21 06:56 cirrusdump-viwiktionary-20170619-cirrussearch-general.log-20170623.gz
-rw-rw-r-- 1 datasets datasets 196 Jun 21 06:56 cirrusdump-viwiktionary-20170619-cirrussearch-content.log-20170623.gz
-rw-rw-r-- 1 datasets datasets 88 Jun 21 06:55 cirrusdump-viwikivoyage-20170619-cirrussearch-general.log-20170623.gz
-rw-rw-r-- 1 datasets datasets 89 Jun 21 06:55 cirrusdump-viwikivoyage-20170619-cirrussearch-content.log-20170623.gz
-rw-rw-r-- 1 datasets datasets 101 Jun 21 06:55 cirrusdump-viwikisource-20170619-cirrussearch-general.log-20170623.gz
-rw-rw-r-- 1 datasets datasets 120 Jun 21 06:55 cirrusdump-viwikisource-20170619-cirrussearch-content.log-20170623.gz
-rw-rw-r-- 1 datasets datasets 84 Jun 21 06:54 cirrusdump-viwikiquote-20170619-cirrussearch-general.log-20170623.gz
-rw-rw-r-- 1 datasets datasets 83 Jun 21 06:54 cirrusdump-viwikiquote-20170619-cirrussearch-content.log-20170623.gz
-rw-rw-r-- 1 datasets datasets 105 Jun 21 06:54 cirrusdump-viwikibooks-20170619-cirrussearch-general.log-20170623.gz
-rw-rw-r-- 1 datasets datasets 84 Jun 21 06:54 cirrusdump-viwikibooks-20170619-cirrussearch-content.log-20170623.gz

but there are plenty of others with earlier dates in the names, e.g.


-rw-rw-r-- 1 datasets datasets 20 Jun 22 06:48 cirrusdump-abwiktionary-20170206-cirrussearch-general.log-20170623.gz
-rw-rw-r-- 1 datasets datasets 20 Jun 22 06:48 cirrusdump-abwiktionary-20170206-cirrussearch-content.log-20170623.gz
-rw-rw-r-- 1 datasets datasets 20 Jun 22 06:48 cirrusdump-abwiki-20170206-cirrussearch-general.log-20170623.gz
-rw-rw-r-- 1 datasets datasets 20 Jun 22 06:48 cirrusdump-abwiki-20170206-cirrussearch-content.log-20170623.gz
-rw-rw-r-- 1 datasets datasets 20 Jun 22 06:48 cirrusdump-aawiktionary-20170206-cirrussearch-content.log-20170623.gz
-rw-rw-r-- 1 datasets datasets 20 Jun 22 06:48 cirrusdump-aawiktionary-20170206-cirrussearch-general.log-20170623.gz
-rw-rw-r-- 1 datasets datasets 20 Jun 22 06:47 cirrusdump-aawikibooks-20170206-cirrussearch-general.log-20170623.gz
-rw-rw-r-- 1 datasets datasets 20 Jun 22 06:47 cirrusdump-aawikibooks-20170206-cirrussearch-content.log-20170623.gz
-rw-rw-r-- 1 datasets datasets 20 Jun 22 06:47 cirrusdump-aawiki-20170206-cirrussearch-general.log-20170623.gz
-rw-rw-r-- 1 datasets datasets 20 Jun 22 06:47 cirrusdump-aawiki-20170206-cirrussearch-content.log-20170623.gz

Change 361433 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/puppet@production] for cirrusdumps, do not create new logs after old ones are rotated

https://gerrit.wikimedia.org/r/361433

Going to do a manual cleanup in the meantime.

We're down to about 11k files now, much more reasonable.

Change 361433 merged by ArielGlenn:
[operations/puppet@production] for cirrusdumps, do not create new logs after old ones are rotated

https://gerrit.wikimedia.org/r/361433

Now live, will check how things look tomorrow after log rotation.

Because the naming scheme was really screwed up from the previous config, I have cleaned out a bunch more old files. It was hard to tell whether the new config works properly with al that cruft in there, though I *think* it was ok. Will check again tomorrow morning.

Well this was a nice idea but fail. The problem is with our understanding of what "dateext" does. It appends the date YYMMDD to every logfile present in the directory that matches the expression in the conf file (in our case, /var/log/cirrusdump/*.log). We say we will keep files around for 22 days which means that for 22 days everything ending in .log gets the new day's timestamp before any file ending in .log ever gets removed. We need rotation happening in here someplace :-P

Change 362383 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/puppet@production] actually rotate the cirrusdump log files

https://gerrit.wikimedia.org/r/362383

Change 362383 merged by ArielGlenn:
[operations/puppet@production] ditch dateext and just use normal rotation for cirrusdump logs

https://gerrit.wikimedia.org/r/362383

I should have done this in the beginning, that is, discussed removing the not so useful dateext, given that there's a datestring in the log filename already.

So after too many changesets, we'll see what logrot looks like tomorrow.

ArielGlenn claimed this task.

The weekly run of cirrussearch dumps failed because of the nfs incident in T169680: NFS on dataset1001 overloaded, high load on the hosts that mount it but logs were rotated as we expect them to be, at last.

I just had a look at the directory and I still don't like what I'm seeing in there. Files from July when maxage is 22. All with a .1 ending. Re-opening to look at it again.

Change 389984 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/puppet@production] use a simple find to toss old cirrussearch dump log files

https://gerrit.wikimedia.org/r/389984

Change 389984 merged by ArielGlenn:
[operations/puppet@production] use a simple find to toss old cirrussearch dump log files

https://gerrit.wikimedia.org/r/389984

I should have done that ^^ in the beginning instead of fooling around with logrotate all this time. I've cleaned up log files manually with the same find that's in the changeset. We'll wait a week to make sure things run as expected before closing the ticket.

Hi @ArielGlenn - is everything looking good for this work?

Nope. There's still some old cirrusdump-cirrus-metastore-* files left around. Not very many but I need to add them to the find. The rest looks ok.

Ok, I'll keep it open then, thanks!

Change 392808 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/puppet@production] include all log files in cirrussearch cron cleanup

https://gerrit.wikimedia.org/r/392808

Change 392808 merged by ArielGlenn:
[operations/puppet@production] include all log files in cirrussearch cron cleanup

https://gerrit.wikimedia.org/r/392808

At long last this is working as advertised. Closing for good.