Page MenuHomePhabricator

error.log is not rotated in beta
Closed, ResolvedPublic

Description

Today I needed to check details for an exception in beta, so I SSH'd to mwlog01 and tried running a grep on the error.log file. This took suspiciously long, so I checked:

$ ls -lhS
-rw-r--r-- 1 udp2log udp2log 8.1G Sep  4 13:12 error.log
-rw-r--r-- 1 udp2log udp2log 2.2G Sep  4 13:10 testwiki.log
-rw-r--r-- 1 udp2log udp2log 2.1G Sep  4 13:06 scap.log

And also:

$ head error.log
2023-06-07 00:11:28.047555 [856d79105d5cbb7009e335de] deployment-mwmaint02 [...]

$ tail error.log
[...]
2023-08-15 03:13:26.748092 [6a14043a1b9f774ea0564677]
[...]

$ ls archive | grep error
error.log-20230606.gz
error.log-20230607.gz

June 6 is today - 90 days, which I assume is the standard log retention period. I'm not sure how the log rotation is (supposed to be) set up for beta, but at first glance it looks like I've just been lucky and found out about this 89 days after the last rotation?

Also, error.log is not the only file affected: by examining archive/, I can see that other log files are stuck at 20230607, but not all of them.

Event Timeline

Daimona triaged this task as Unbreak Now! priority.Sep 4 2023, 1:22 PM

Not only is the file so large that it can't be easily grepped (though this can be worked around e.g. with tail if the entry you're looking for is recent), but apparently there are no new entries after Aug 15, which presumably means we aren't seeing errors from beta?

On Jun 7 r927746 was merged. AIUI this is for production, but a logrotate update happening on the last day that log rotation worked seems no coincidence to me.

On deployment-mwlog01.deployment-prep.eqiad1.wikimedia.cloud:

$ ls -l /etc/logrotate.d/udp2log-mw 
-r--r--r-- 1 root root 863 Jun  7 13:38 /etc/logrotate.d/udp2log-mw

That is consistent with the error.log no more being rotated since June 7th.

The configuration file is defined in Puppet by modules/udp2log/templates/logrotate_udp2log.erb which had two changes at that date

commit 9ddc4bc194e75e5f6eb4d5842e44c671f4af978d
Author: Keith Herron <kherron@wikimedia.org>
Date:   Tue Jun 6 13:31:19 2023 -0400

    mwlog: upgrade logrotate and use ignoreduplicates
    
    upgrade logrotate to bookworm version (3.21.0) and make use of
    ignoreduplicates config option
    
    also inverts mw-log config in order of most to least specific
    
    Bug: T338127
    Change-Id: I39147838759b3711b523bec0ba194e73028a065f
commit b2002a55936a397bed86745c8b244d84ac2adbb8
Author: Keith Herron <kherron@wikimedia.org>
Date:   Mon Jun 5 09:45:33 2023 -0400

    mwlog: fix mw-log logrotate glob
    
    current glob is being evaluated as literal filename, update
    to wildcard and let file age exclude api.log from daily rotate
    
    Bug: T338127
    Bug: T277445
    Change-Id: Id4f5539944bb0822c6bf1a6ae12138d962de6901
$ /usr/sbin/logrotate --version
logrotate 3.14.0

So it's using 3.14.0, but r927746 should've upgraded it to 3.21.0. Apparently that really was prod-only? Also, the ignoreduplicates config option was added, but it doesn't exist in 3.14.0:

$ /usr/sbin/logrotate --debug /etc/logrotate.d/udp2log-mw
WARNING: logrotate in debug mode does nothing except printing debug messages!  Consider using verbose mode (-v) instead if this is not what you want.

reading config file /etc/logrotate.d/udp2log-mw
olddir is now /srv/mw-log/archive
error: /etc/logrotate.d/udp2log-mw:18 unknown option 'ignoreduplicates' -- ignoring line
error: /etc/logrotate.d/udp2log-mw:27 duplicate log entry for /srv/mw-log/api.log
olddir is now /srv/mw-log/archive
error: /etc/logrotate.d/udp2log-mw:38 unknown option 'ignoreduplicates' -- ignoring line

@herron @fgiunchedi This seems to be a consequence of T338127, woud you have any ideas as to what's going on here? My ineducated guess at the moment is that because beta is still on buster, it's not receiving the updated logrotate version from r927746. And that for some reason, the current logrotate configuration doesn't work properly in 3.14.0.

FTR, @hashar and I talked about this earlier today in #wikimedia-releng. If the guess above is correct, then it might be easier to just migrate the host to bullseye, which is part of T327742.

@herron @fgiunchedi This seems to be a consequence of T338127, woud you have any ideas as to what's going on here? My ineducated guess at the moment is that because beta is still on buster, it's not receiving the updated logrotate version from r927746. And that for some reason, the current logrotate configuration doesn't work properly in 3.14.0.

I agree with your guess, namely that ignoreduplicates works only on logrotate >= bullseye and not buster. The easiest way forward I'd say is to upgrade beta's mwlog to bullseye

@Anonymous_ehacker congratulations on learning how to use ChatGPT! But please do it in private, comments like this are a waste of everyone's time.

Onack16888 renamed this task from error.log is not rotated in beta to error.log在測試版中沒有旋轉.Nov 19 2023, 8:58 PM
Onack16888 subscribed.
Daimona renamed this task from error.log在測試版中沒有旋轉 to error.log is not rotated in beta.Nov 19 2023, 9:11 PM
Daimona removed a subscriber: Onack16888.

Mentioned in SAL (#wikimedia-releng) [2023-12-07T00:24:38Z] <James_F> Created deployment-mwlog02 on Bullseye in Beta Cluster for T345566

Change 980965 had a related patch set uploaded (by Jforrester; author: Jforrester):

[operations/mediawiki-config@master] [BETA CLUSTER] Switch log host from mwlog01 to mwlog02

https://gerrit.wikimedia.org/r/980965

Change 980966 had a related patch set uploaded (by Jforrester; author: Jforrester):

[operations/puppet@production] deployment-prep: Switch mwlog01 to mwlog02

https://gerrit.wikimedia.org/r/980966

Change 980965 merged by jenkins-bot:

[operations/mediawiki-config@master] [BETA CLUSTER] Switch log host from mwlog01 to mwlog02

https://gerrit.wikimedia.org/r/980965

Change 980966 merged by Ori:

[operations/puppet@production] deployment-prep: Switch mwlog01 to mwlog02

https://gerrit.wikimedia.org/r/980966

Mentioned in SAL (#wikimedia-releng) [2023-12-07T02:08:44Z] <James_F> Switching deployment-mwlog01 off in Beta Cluster for T345566

OK, update: mwlog02 now exists, has the mount, all the config has switched over, and from the network trace it's receiving the files, but it's not updating the files.

Notes from last night:

  • Despite having the role::logging::mediawiki::udp2log role, mwlog02 didn't have Nagios installed; as a short-term fix, we applied a (non-puppetised) install of nagios-nrpe-server monitoring-plugins-basic monitoring-plugins-standard, but this is clearly a bug/issue.
  • Despite having the attached volume of mwlog-data-01 on /dev/sdb, the instance doesn't have it mounted; we temporarily mounted it, but this is also a bug.
  • All the files in mwlog-data-01 had a different owner than udp2log (presumably different UIDs from different OS flavours); we chowned to try to fix.
  • Restarting the service didn't kick it into appending to the files. Rebooting the instance also didn't (but as the mount isn't present post-reboot, that's unsurprising).

We theorised that if the underlying bug about the mount being fixed, this might Just Work™, but we ran out of debugging time at the hackathon.

tstarling claimed this task.
tstarling subscribed.

The issue with logs not being rotated is apparently fixed. If there are some remaining issues with mwlog02, they should probably be discussed on a task that is not "UBN" priority.

The issue with logs not being rotated is apparently fixed.

Is it fixed? We switched it off for the move to mwlog02 but didn't switch it back on (because it was broken). Maybe someone fixed it later but didn't note here?

Seems properly rotated now:

tgr@deployment-mwlog02:~$ ls -l /srv/mw-log/archive/err*
-rw-r--r-- 1 udp2log udp2log 39300 Mar 12 15:45 /srv/mw-log/archive/error.log-20240313.gz
-rw-r--r-- 1 udp2log udp2log  2528 Mar 13 09:21 /srv/mw-log/archive/error.log-20240314.gz
-rw-r--r-- 1 udp2log udp2log   574 Mar 14 01:20 /srv/mw-log/archive/error.log-20240315.gz
-rw-r--r-- 1 udp2log udp2log  2277 Mar 15 22:45 /srv/mw-log/archive/error.log-20240316.gz
-rw-r--r-- 1 udp2log udp2log  1371 Mar 16 07:23 /srv/mw-log/archive/error.log-20240317.gz
-rw-r--r-- 1 udp2log udp2log  1518 Mar 17 15:32 /srv/mw-log/archive/error.log-20240318.gz
-rw-r--r-- 1 udp2log udp2log  2804 Mar 18 16:19 /srv/mw-log/archive/error.log-20240319.gz
-rw-r--r-- 1 udp2log udp2log  1205 Mar 19 18:56 /srv/mw-log/archive/error.log-20240320.gz
-rw-r--r-- 1 udp2log udp2log  1928 Mar 20 18:32 /srv/mw-log/archive/error.log-20240321.gz
-rw-r--r-- 1 udp2log udp2log  2150 Mar 21 21:53 /srv/mw-log/archive/error.log-20240322.gz
-rw-r--r-- 1 udp2log udp2log  1400 Mar 22 18:21 /srv/mw-log/archive/error.log-20240323.gz
-rw-r--r-- 1 udp2log udp2log  3176 Mar 23 16:15 /srv/mw-log/archive/error.log-20240324.gz
-rw-r--r-- 1 udp2log udp2log   575 Mar 24 01:20 /srv/mw-log/archive/error.log-20240325.gz
-rw-r--r-- 1 udp2log udp2log  1824 Mar 25 16:46 /srv/mw-log/archive/error.log-20240326.gz
-rw-r--r-- 1 udp2log udp2log 52860 Mar 26 21:46 /srv/mw-log/archive/error.log-20240327

It seems to be managed by this puppet template; not sure what differs from mwlog01 that made it happen.