every Sunday at 00:00 UTC, logrotate fails on netflow hosts
Closed, Resolved · Public

Description

This happens every Sunday at 00:00 UTC, and fixes itself every Monday at 00:00 UTC:

00:04:03	<+icinga-wm>	PROBLEM - Check systemd state on netflow3001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
00:04:25	<+icinga-wm>	PROBLEM - Check systemd state on netflow5001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
00:04:57	<+icinga-wm>	PROBLEM - Check systemd state on netflow2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
00:04:59	<+icinga-wm>	PROBLEM - Check systemd state on netflow4001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
00:05:33	<+icinga-wm>	PROBLEM - Check systemd state on netflow1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state

Event Timeline

Restricted Application added a subscriber: Aklapper.
ayounsi claimed this task.
ayounsi subscribed.

T262751 had a more verbose error log:

Sep 13 00:00:01 netflow1001 systemd[1]: Starting Rotate log files...
Sep 13 00:00:01 netflow1001 logrotate[25118]: kafkatee-webrequest: unrecognized service
Sep 13 00:00:01 netflow1001 logrotate[25118]: error: error running non-shared postrotate script for /var/cache/kafkatee/webrequest/kafkatee.stats.json of '/var/cache/kafkatee/webrequest/kafkatee.stats.json '
Sep 13 00:00:02 netflow1001 systemd[1]: logrotate.service: Main process exited, code=exited, status=1/FAILURE
Sep 13 00:00:02 netflow1001 systemd[1]: logrotate.service: Failed with result 'exit-code'.
Sep 13 00:00:02 netflow1001 systemd[1]: Failed to start Rotate log files.

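kafkatee-webrequest was removed from all the netflow hosts, but its logrotate config was left behind, and the postrotate script in it fails because the service no longer exists. I removed that leftover with: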

sudo rm /etc/logrotate.d/kafkatee-webrequest
sudo service logrotate restart

And confirmed with a Puppet run that the file isn't re-created.
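
For similar cases, the lookup from the journal error back to the leftover config can be sketched roughly as below. This is only a sketch under assumptions: the helper names `unrecognized_services` and `configs_mentioning` are made up for illustration, and it assumes the postrotate script mentions the service by name, as the "unrecognized service" line in the log above suggests.

```shell
#!/bin/sh
# Hypothetical helper: read journal lines on stdin and print the service
# names that logrotate reported as "unrecognized service".
unrecognized_services() {
    sed -n 's/.*logrotate\[[0-9]*]: \([^:]*\): unrecognized service.*/\1/p' | sort -u
}

# Hypothetical helper: list files under a logrotate.d-style directory ($1)
# whose contents mention the given service name ($2).
configs_mentioning() {
    grep -rlF -- "$2" "$1"
}

# Possible usage on an affected host (not run here):
#   journalctl -u logrotate.service | unrecognized_services
#   configs_mentioning /etc/logrotate.d kafkatee-webrequest
```

Any file this turns up still needs a manual check (and a Puppet run, as above) before it is deleted, since a config can mention a service name for other reasons.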