Page MenuHomePhabricator

mw-log-cleanup fails often with errors about files not found
Closed, ResolvedPublic

Description

I noticed mw-log-cleanup unit failing on mwlog hosts, likely because it is running during/close to file rotation:

`
Jul 06 02:00:02 mwlog1002 systemd[1]: Finished cleanup mediawiki logs.
Jul 07 02:00:00 mwlog1002 systemd[1]: Starting cleanup mediawiki logs...
Jul 07 02:04:03 mwlog1002 mw-log-cleanup[1320200]: find: ‘/srv/mw-log/archive/Renameuser.log-20230706’: No such file or directory
Jul 07 02:04:03 mwlog1002 mw-log-cleanup[1320200]: find: ‘/srv/mw-log/archive/SaveParse.log-20230706’: No such file or directory
Jul 07 02:04:03 mwlog1002 mw-log-cleanup[1320200]: find: ‘/srv/mw-log/archive/slow-parse.log-20230706’: No such file or directory
Jul 07 02:04:03 mwlog1002 mw-log-cleanup[1320200]: find: ‘/srv/mw-log/archive/SpamBlacklistHit.log-20230706’: No such file or directory
Jul 07 02:04:03 mwlog1002 mw-log-cleanup[1320200]: find: ‘/srv/mw-log/archive/SimpleAntiSpam.log-20230706’: No such file or directory
Jul 07 02:04:03 mwlog1002 mw-log-cleanup[1320200]: find: ‘/srv/mw-log/archive/resourceloader.log-20230706’: No such file or directory
Jul 07 02:04:03 mwlog1002 systemd[1]: mw-log-cleanup.service: Main process exited, code=exited, status=1/FAILURE
Jul 07 02:04:03 mwlog1002 systemd[1]: mw-log-cleanup.service: Failed with result 'exit-code'.
Jul 07 02:04:03 mwlog1002 systemd[1]: Failed to start cleanup mediawiki logs.
Jul 07 02:04:03 mwlog1002 systemd[1]: mw-log-cleanup.service: Consumed 3min 51.728s CPU time.
Jul 08 02:00:00 mwlog1002 systemd[1]: Starting cleanup mediawiki logs...
Jul 08 02:00:26 mwlog1002 mw-log-cleanup[1595206]: find: ‘/srv/mw-log/archive/runJobs.log-20230707’: No such file or directory
Jul 08 02:01:06 mwlog1002 mw-log-cleanup[1595206]: find: ‘/srv/mw-log/archive/security.log-20230707’: No such file or directory
Jul 08 02:01:06 mwlog1002 mw-log-cleanup[1595206]: find: ‘/srv/mw-log/archive/SimpleAntiSpam.log-20230707’: No such file or directory
Jul 08 02:01:06 mwlog1002 mw-log-cleanup[1595206]: find: ‘/srv/mw-log/archive/resourceloader.log-20230707’: No such file or directory
Jul 08 02:01:06 mwlog1002 mw-log-cleanup[1595206]: find: ‘/srv/mw-log/archive/slow-parsoid.log-20230707’: No such file or directory
Jul 08 02:01:06 mwlog1002 systemd[1]: mw-log-cleanup.service: Main process exited, code=exited, status=1/FAILURE
Jul 08 02:01:06 mwlog1002 systemd[1]: mw-log-cleanup.service: Failed with result 'exit-code'.
Jul 08 02:01:06 mwlog1002 systemd[1]: Failed to start cleanup mediawiki logs.
Jul 08 02:01:06 mwlog1002 systemd[1]: mw-log-cleanup.service: Consumed 1min 3.623s CPU time.
Jul 09 02:00:00 mwlog1002 systemd[1]: Starting cleanup mediawiki logs...
Jul 09 02:00:12 mwlog1002 systemd[1]: mw-log-cleanup.service: Succeeded.
Jul 09 02:00:12 mwlog1002 systemd[1]: Finished cleanup mediawiki logs.
Jul 09 02:00:12 mwlog1002 systemd[1]: mw-log-cleanup.service: Consumed 6.699s CPU time.
Jul 10 02:00:00 mwlog1002 systemd[1]: Starting cleanup mediawiki logs...
Jul 10 02:00:03 mwlog1002 systemd[1]: mw-log-cleanup.service: Succeeded.
Jul 10 02:00:03 mwlog1002 systemd[1]: Finished cleanup mediawiki logs.
Jul 11 02:00:00 mwlog1002 systemd[1]: Starting cleanup mediawiki logs...
Jul 11 02:00:04 mwlog1002 systemd[1]: mw-log-cleanup.service: Succeeded.
Jul 11 02:00:04 mwlog1002 systemd[1]: Finished cleanup mediawiki logs.
Jul 12 02:00:00 mwlog1002 systemd[1]: Starting cleanup mediawiki logs...
Jul 12 02:05:51 mwlog1002 mw-log-cleanup[2633672]: find: ‘/srv/mw-log/archive/PageTriage.log-20230711’: No such file or directory
Jul 12 02:05:51 mwlog1002 mw-log-cleanup[2633672]: find: ‘/srv/mw-log/archive/php7.2-fpm.log-20230711’: No such file or directory
Jul 12 02:05:51 mwlog1002 mw-log-cleanup[2633672]: find: ‘/srv/mw-log/archive/rdbmapi.log-20230711’: No such file or directory
Jul 12 02:05:55 mwlog1002 mw-log-cleanup[2633672]: find: ‘/srv/mw-log/archive/LoginNotify.log-20230711’: No such file or directory
Jul 12 02:07:25 mwlog1002 mw-log-cleanup[2633672]: find: ‘/srv/mw-log/archive/memcached.log-20230711’: No such file or directory
Jul 12 02:07:28 mwlog1002 mw-log-cleanup[2633672]: find: ‘/srv/mw-log/archive/objectcache.log-20230711’: No such file or directory
Jul 12 02:07:57 mwlog1002 mw-log-cleanup[2633672]: find: ‘/srv/mw-log/archive/rdbmJobExecutor.log-20230711’: No such file or directory
Jul 12 02:07:57 mwlog1002 mw-log-cleanup[2633672]: find: ‘/srv/mw-log/archive/Parsoid.log-20230711’: No such file or directory
Jul 12 02:08:18 mwlog1002 mw-log-cleanup[2633672]: find: ‘/srv/mw-log/archive/PageViewInfo.log-20230711’: No such file or directory
Jul 12 02:08:20 mwlog1002 mw-log-cleanup[2633672]: find: ‘/srv/mw-log/archive/mobile.log-20230711’: No such file or directory
Jul 12 02:08:20 mwlog1002 mw-log-cleanup[2633672]: find: ‘/srv/mw-log/archive/preferences.log-20230711’: No such file or directory
Jul 12 02:08:20 mwlog1002 mw-log-cleanup[2633672]: find: ‘/srv/mw-log/archive/resourceloader.log-20230711’: No such file or directory
Jul 12 02:08:20 mwlog1002 mw-log-cleanup[2633672]: find: ‘/srv/mw-log/archive/php7.4-fpm.log-20230711’: No such file or directory
Jul 12 02:08:21 mwlog1002 systemd[1]: mw-log-cleanup.service: Main process exited, code=exited, status=1/FAILURE
Jul 12 02:08:21 mwlog1002 systemd[1]: mw-log-cleanup.service: Failed with result 'exit-code'.
Jul 12 02:08:21 mwlog1002 systemd[1]: Failed to start cleanup mediawiki logs.
Jul 12 02:08:21 mwlog1002 systemd[1]: mw-log-cleanup.service: Consumed 7min 4.414s CPU time.

We should make sure that cleanup runs well after logrotate does its daily routines and/or not concurrently anyways. I think that declaring After=logrotate.service in the cleanup unit should be enough to achieve what we want in this case

Event Timeline

Change 938228 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] udp2log: run mw-log-cleanup after logrotate

https://gerrit.wikimedia.org/r/938228

Change 938228 merged by Filippo Giunchedi:

[operations/puppet@production] udp2log: run mw-log-cleanup after logrotate

https://gerrit.wikimedia.org/r/938228

fgiunchedi claimed this task.

This is working as expected (i.e. mw-log-cleanup waits for logrotate to be done, if it is running)

Jul 18 00:00:06 mwlog2002 systemd[1]: Starting Rotate log files...
Jul 18 03:06:55 mwlog2002 systemd[1]: logrotate.service: Succeeded.
Jul 18 03:06:55 mwlog2002 systemd[1]: Finished Rotate log files.
Jul 18 03:06:55 mwlog2002 systemd[1]: logrotate.service: Consumed 2h 45min 24.592s CPU time.
Jul 18 03:06:55 mwlog2002 systemd[1]: Starting Rotate log files...
Jul 18 03:13:32 mwlog2002 systemd[1]: logrotate.service: Succeeded.
Jul 18 03:13:32 mwlog2002 systemd[1]: Finished Rotate log files.
Jul 18 03:13:32 mwlog2002 systemd[1]: logrotate.service: Consumed 6min 7.209s CPU time.
Jul 18 03:13:32 mwlog2002 systemd[1]: Starting cleanup mediawiki logs...
Jul 18 03:13:36 mwlog2002 systemd[1]: mw-log-cleanup.service: Succeeded.
Jul 18 03:13:36 mwlog2002 systemd[1]: Finished cleanup mediawiki logs.
lmata triaged this task as Medium priority.Jul 18 2023, 5:35 PM
lmata moved this task from Inbox to Prioritized on the Observability-Logging board.
lmata moved this task from Inbox to Done on the SRE Observability (FY2023/2024-Q1) board.