
The prune_old_srv_syslog_directories.service can't delete non-empty directories on centrallog instances
Closed, Invalid · Public

Description

The prune_old_srv_syslog_directories.service fails when it attempts to delete non-empty directories.
The output of journalctl -u prune_old_srv_syslog_directories.service shows: /usr/bin/find: cannot delete ‘/srv/syslog/.linux.dhcp.DictModel’: Directory not empty
Looking inside of that directory there's a log file from November 30 2023 which is old enough for the service to delete the file and directory.

Output of ls -la /srv/syslog/.linux.dhcp.DictModel/:

total 13832
drwxr-x--- 2 root ops      4096 Nov 27 19:42 .
drwxr-xr-x 1 root root 13926400 Apr 12 00:00 ..
-rw-r----- 1 root ops    219051 Nov 30 11:13 syslog.log

This is the content of the /lib/systemd/system/prune_old_srv_syslog_directories.service unit:

[Unit]
Description=clean up logs from old hosts
Documentation=https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state

[Service]
Type=oneshot
User=root
ExecStart=/usr/bin/find /srv/syslog/ -mtime +135 -delete

I don't think this is a permissions issue, as the unit is run as the root user. I think the find command is missing an argument that would let it delete non-empty directories.
Either forcing the delete or deleting the directories recursively may also work.

From the find man pages: The -delete action will fail to remove a directory unless it is empty. This explains why the service is failing.
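The error can be reproduced in a scratch directory. The sketch below assumes GNU find and GNU touch (for the relative -d dates); the directory name oldhost and the chosen ages are hypothetical. It relies on two documented behaviours: -delete processes a directory's contents before the directory itself, and -mtime +135 only matches entries whose age in whole 24-hour periods is greater than 135. A directory can therefore match while a file inside it does not, which leaves the directory non-empty when find tries to remove it. This is one scenario that produces the message, not a confirmed diagnosis of the centrallog hosts.

```shell
#!/bin/sh
# Scratch tree standing in for /srv/syslog (hypothetical layout).
root=$(mktemp -d)
mkdir "$root/oldhost"
echo "entry" > "$root/oldhost/syslog.log"

# The file is newer than the cutoff, the directory is older.
# Touch the directory last, because creating the file updated its mtime.
touch -d '100 days ago' "$root/oldhost/syslog.log"
touch -d '140 days ago' "$root/oldhost"

# Same expression as the unit's ExecStart, pointed at the scratch tree.
# -mindepth 1 keeps find from considering $root itself.
find "$root" -mindepth 1 -mtime +135 -delete
status=$?

# find reports "Directory not empty" and exits non-zero;
# both the directory and the file are still present.
echo "find exit status: $status"
ls -la "$root/oldhost"
```

Adding -print in place of -delete shows which entries the age test selects, which makes it easy to see whether the file or only its parent directory is matching.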

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2024-04-12T01:05:50Z] <denisse> Manually deleting /srv/syslog/.linux.dhcp.DictModel/syslog.log from November 30 on centrallog1002 and centrallog2002 after the prune_old_srv_syslog_directories.service failed to delete the non-empty directory - T362376

I've been testing this on a Pontoon host, but there the arguments and actions specified in the unit work as expected. I'm still debugging the issue.

I wondered whether the filesystem where /srv is mounted had any mount flags, such as ro or noexec, that would prevent the deletion of those files and directories. The mount options of /dev/mapper/vg0-srv are rw,relatime,stripe=256, and none of them seems to be the culprit, so the issue lies somewhere else.

Change #1019139 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] syslog: Update log cleanup command in syslog central server

https://gerrit.wikimedia.org/r/1019139

andrea.denisse changed the task status from Open to In Progress.Apr 12 2024, 2:32 AM
andrea.denisse claimed this task.
andrea.denisse triaged this task as Low priority.

Change #1019139 abandoned by Andrea Denisse:

[operations/puppet@production] syslog: Update log cleanup command in syslog central server

Reason:

This is fixed by another patch.

https://gerrit.wikimedia.org/r/1019139

I've fixed the issue with the following on centrallog hosts:

find /srv/syslog -ipath '/srv/syslog/.[^.]*' -type f -delete

Subsequent runs of this timer will work as expected going forward.
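For reference, the cleanup expression above can be previewed before deleting anything by swapping -delete for -print. The following is a minimal sketch on a scratch tree, assuming GNU find; the directory realhost is a hypothetical stand-in for a normal host directory. It shows that the -ipath pattern selects only regular files under top-level dot-directories and leaves the directories themselves in place.

```shell
#!/bin/sh
# Scratch tree standing in for /srv/syslog (hypothetical layout).
root=$(mktemp -d)
mkdir "$root/.linux.dhcp.DictModel" "$root/realhost"
echo a > "$root/.linux.dhcp.DictModel/syslog.log"
echo b > "$root/realhost/syslog.log"

# Dry run: list what the expression would remove.
matches=$(find "$root" -ipath "$root/.[^.]*" -type f -print)
echo "would delete: $matches"

# Real run: only regular files below dot-directories are removed.
find "$root" -ipath "$root/.[^.]*" -type f -delete

# The dot-directory is now empty but still exists; realhost is untouched.
ls -A "$root/.linux.dhcp.DictModel"
ls -A "$root/realhost"
```

Because -type f restricts the match to regular files, the now-empty dot-directories are left for a later run of the age-based cleanup to remove.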