Yesterday I noticed a filesystem-space alarm for thanos-be2001: the swift logs in `/var/log/swift` had grown large and were not being rotated. As a band-aid I truncated 10G off `server.log.1`, though the underlying problem persists. Before truncating I saved the head and tail of the file to show that rotation isn't happening reliably:
```
root@thanos-be2001:~# head -5 /root/server.log.1_head1000
Feb 8 13:07:46 thanos-be2001 container-server: [pid: 25744|app: 0|req: 38/1799] 10.64.0.136 () {28 vars in 450 bytes} [Tue Feb 8 13:07:46 2022] HEAD /sdb3/50430/AUTH_dispersion/dispersion_0_949 => generated 0 bytes in 1 msecs (HTTP/1.1 507) 2 headers in 96 bytes (1 switches on core 0)
Feb 8 13:07:46 thanos-be2001 container-server: [pid: 25770|app: 0|req: 38/1800] 10.64.0.136 () {28 vars in 452 bytes} [Tue Feb 8 13:07:46 2022] HEAD /sdb3/13950/AUTH_dispersion/dispersion_1_3376 => generated 0 bytes in 1 msecs (HTTP/1.1 507) 2 headers in 96 bytes (1 switches on core 0)
Feb 8 13:07:46 thanos-be2001 container-server: [pid: 25779|app: 0|req: 38/1801] 10.64.0.136 () {28 vars in 448 bytes} [Tue Feb 8 13:07:46 2022] HEAD /sdb3/9096/AUTH_dispersion/dispersion_0_939 => generated 0 bytes in 1 msecs (HTTP/1.1 507) 2 headers in 96 bytes (1 switches on core 0)
Feb 8 13:07:46 thanos-be2001 container-server: [pid: 25778|app: 0|req: 38/1802] 10.64.0.136 () {28 vars in 452 bytes} [Tue Feb 8 13:07:46 2022] HEAD /sdb3/29505/AUTH_dispersion/dispersion_1_3383 => generated 0 bytes in 1 msecs (HTTP/1.1 507) 2 headers in 96 bytes (1 switches on core 0)
Feb 8 13:07:46 thanos-be2001 container-server: [pid: 25765|app: 0|req: 38/1803] 10.64.0.136 () {28 vars in 452 bytes} [Tue Feb 8 13:07:46 2022] HEAD /sda3/49443/AUTH_dispersion/dispersion_1_3382 => generated 0 bytes in 1 msecs (HTTP/1.1 507) 2 headers in 96 bytes (1 switches on core 0)
```
```
root@thanos-be2001:~# tail -5 /root/server.log.1_tail1000
Feb 13 00:00:01 thanos-be2001 object-server: 10.192.0.192 - - [13/Feb/2022:00:00:01 +0000] "HEAD /sdn1/9829/AUTH_thanos/thanos/01FQQ3ZE4Z34PQNMVHK6Z1165C/meta.json" 200 6932 "HEAD http://thanos-swift.discovery.wmnet/v1/AUTH_thanos/thanos/01FQQ3ZE4Z34PQNMVHK6Z1165C/meta.json" "txcd5a24830a1340cf98564-0062084a01" "proxy-server 263746" 0.0012 "-" 1879 0
Feb 13 00:00:01 thanos-be2001 object-server: 10.192.0.192 - - [13/Feb/2022:00:00:01 +0000] "GET /sdi1/11883/AUTH_thanos/thanos/01EJDJ1YPR4Q11CEN814MHPKPM/no-compact-mark.json" 404 70 "GET http://thanos-swift.discovery.wmnet/v1/AUTH_thanos/thanos/01EJDJ1YPR4Q11CEN814MHPKPM/no-compact-mark.json" "txf705e6031d8f4e00ade0d-0062084a01" "proxy-server 263766" 0.0008 "-" 1857 0
Feb 13 00:00:01 thanos-be2001 object-server: 10.192.0.192 - - [13/Feb/2022:00:00:01 +0000] "HEAD /sdj1/41390/AUTH_thanos/thanos/01EHAJXBR63ET3CC3W1PHT5HWF/meta.json" 200 6771 "HEAD http://thanos-swift.discovery.wmnet/v1/AUTH_thanos/thanos/01EHAJXBR63ET3CC3W1PHT5HWF/meta.json" "txf06b4c25dce1479b8d6e2-0062084a01" "proxy-server 263752" 0.0009 "-" 1860 0
Feb 13 00:00:01 thanos-be2001 object-server: 10.192.0.192 - - [13/Feb/2022:00:00:01 +0000] "HEAD /sdg1/14346/AUTH_thanos/thanos/01FQPVM3SDKXHXMMY5Q28EV4GQ/meta.json" 200 8702 "HEAD http://thanos-swift.discovery.wmnet/v1/AUTH_thanos/thanos/01FQPVM3SDKXHXMMY5Q28EV4GQ/meta.json" "tx56dad1b384e7463eb856f-0062084a01" "proxy-server 263746" 0.0010 "-" 1807 0
Feb 13 00:00:01 thanos-be2001 object-server: 10.192.0.192 - - [13/Feb/2022:00:00:01 +0000] "GET /sdl1/2436/AUTH_thanos/thanos/01EZZFV77S7V5VQ33PYK1KGG4V/no-compact-mark.json" 404 70 "GET http://thanos-swift.discovery.wmnet/v1/AUTH_thanos/thanos/01EZZFV77S7V5VQ33PYK1KGG4V/no-compact-mark.json" "tx2fe09675b13644d9ad152-0062084a01" "proxy-server 263754" 0.0009 "-" 1869 0
```
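For the record, truncating in place is the right band-aid here because the writer's file descriptor stays valid; deleting the file instead would leave the space unreclaimable until the daemon closed it. A minimal sketch of the idea on a throwaway file (not the real `server.log.1`):

```
# Reclaim space from a log that a daemon still holds open, shown on a
# throwaway file. Truncating keeps the inode (and thus the writer's fd)
# intact; whether subsequent writes leave a sparse hole depends on how
# the writer opened the file (O_APPEND or not).
log=$(mktemp)
head -c 1048576 /dev/zero >> "$log"   # simulate a large log (1 MiB)
truncate -s 0 "$log"                  # zero it in place; same inode
stat -c '%s' "$log"                   # prints 0
rm -f "$log"
```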
Since yesterday some form of rotation did happen (fresh empty `.log` files are in place), but rsyslog has kept logging to the old `.log.1` files. The root filesystem is fine for now since the `.2` files were compressed:
```
root@thanos-be2001:~# ls -latr /var/log/swift/server.log* /var/log/swift/background.log*
-rw-r--r-- 1 root root 3733934 Feb 12 23:59 /var/log/swift/background.log.2.gz
-rw-r--r-- 1 root root 4564556338 Feb 13 15:38 /var/log/swift/server.log.2.gz
-rw-r--r-- 1 root root 0 Feb 14 00:11 /var/log/swift/server.log
-rw-r--r-- 1 root root 0 Feb 14 00:11 /var/log/swift/background.log
-rw-r--r-- 1 root root 9136349 Feb 14 08:43 /var/log/swift/background.log.1
-rw-r--r-- 1 root root 12239141629 Feb 14 08:43 /var/log/swift/server.log.1
root@thanos-be2001:~# lsof /var/log/swift/server.log.1
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
rsyslogd 1216 root 12w REG 9,0 12241448112 262171 /srv/log/swift/server.log.1
root@thanos-be2001:~# lsof /var/log/swift/background.log.1
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
rsyslogd 1216 root 28w REG 9,0 9137791 262170 /srv/log/swift/background.log.1
```
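The `lsof` output matches the classic rename-rotation failure mode: the rotation renamed the files, but rsyslog's descriptors follow the inode, not the path, so it keeps writing to `.log.1` until it is told to reopen. A minimal reproduction with throwaway files:

```
# Why rsyslog keeps writing to server.log.1: an open fd tracks the
# inode, not the name, so a logrotate-style rename does not redirect
# the writer. (Throwaway files stand in for the real logs.)
dir=$(mktemp -d)
exec 9>>"$dir/server.log"                  # writer fd, as rsyslog holds one
mv "$dir/server.log" "$dir/server.log.1"   # rotation renames the file
: > "$dir/server.log"                      # rotation creates a fresh file
echo "still going to the old inode" >&9    # write lands in server.log.1
cat "$dir/server.log.1"                    # prints the line above
exec 9>&-
rm -rf "$dir"
```

The fix for a writer in this state is to have it reopen its output files, for rsyslog by sending it SIGHUP (e.g. a one-off `systemctl kill -s HUP rsyslog.service`), which is presumably what the rotation hook should be doing here but isn't.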
Note that this problem seems to be present on the ms-be hosts too, though my guess is that with their lower logging volume the root filesystem doesn't tend to fill up:
```
root@cumin1001:~# cumin 'thanos-be* or ms-be*' 'lsof /var/log/swift/server.log.1 >/dev/null && echo yes || echo no'
85 hosts will be targeted:
ms-be[2028-2065].codfw.wmnet,ms-be[1028-1033,1035-1067].eqiad.wmnet,thanos-be[2001-2004].codfw.wmnet,thanos-be[1001-1004].eqiad.wmnet
Ok to proceed on 85 hosts? Enter the number of affected hosts to confirm or "q" to quit 85
===== NODE GROUP =====
(35) ms-be[2028,2030,2032,2037-2038,2040,2046-2047,2050-2051,2053-2054,2057,2060,2063,2065].codfw.wmnet,ms-be[1028-1031,1035-1038,1042,1046,1048-1049,1054,1058-1060,1065,1067].eqiad.wmnet,thanos-be2001.codfw.wmnet
----- OUTPUT of 'lsof /var/log/sw...o yes || echo no' -----
yes
===== NODE GROUP =====
(50) ms-be[2029,2031,2033-2036,2039,2041-2045,2048-2049,2052,2055-2056,2058-2059,2061-2062,2064].codfw.wmnet,ms-be[1032-1033,1039-1041,1043-1045,1047,1050-1053,1055-1057,1061-1064,1066].eqiad.wmnet,thanos-be[2002-2004].codfw.wmnet,thanos-be[1001-1004].eqiad.wmnet
----- OUTPUT of 'lsof /var/log/sw...o yes || echo no' -----
no
================
PASS |██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (85/85) [00:01<00:00, 42.91hosts/s]
FAIL | | 0% (0/85) [00:01<?, ?hosts/s]
100.0% (85/85) success ratio (>= 100.0% threshold) for command: 'lsof /var/log/sw...o yes || echo no'.
100.0% (85/85) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
```
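A durable fix would presumably be to make whatever rotates these files tell rsyslog to reopen them afterwards. On Debian the rsyslog package ships `/usr/lib/rsyslog/rsyslog-rotate` (a SIGHUP wrapper used by its own logrotate snippet) for exactly this. A hypothetical logrotate stanza, as a sketch only (paths and retention are illustrative, not the actual puppet config):

```
/var/log/swift/*.log {
    daily
    rotate 7
    compress
    delaycompress
    missingok
    notifempty
    sharedscripts
    postrotate
        /usr/lib/rsyslog/rsyslog-rotate
    endscript
}
```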