I noticed a filesystem alarm for thanos-be2001 yesterday, and indeed the swift logs in /var/log/swift were big and not being rotated. As a band-aid I truncated 10G off server.log.1, though the underlying problem persists. Before truncating I saved the old head/tail of the file to show that rotation isn't always happening:
root@thanos-be2001:~# head -5 /root/server.log.1_head1000
Feb 8 13:07:46 thanos-be2001 container-server: [pid: 25744|app: 0|req: 38/1799] 10.64.0.136 () {28 vars in 450 bytes} [Tue Feb 8 13:07:46 2022] HEAD /sdb3/50430/AUTH_dispersion/dispersion_0_949 => generated 0 bytes in 1 msecs (HTTP/1.1 507) 2 headers in 96 bytes (1 switches on core 0)
Feb 8 13:07:46 thanos-be2001 container-server: [pid: 25770|app: 0|req: 38/1800] 10.64.0.136 () {28 vars in 452 bytes} [Tue Feb 8 13:07:46 2022] HEAD /sdb3/13950/AUTH_dispersion/dispersion_1_3376 => generated 0 bytes in 1 msecs (HTTP/1.1 507) 2 headers in 96 bytes (1 switches on core 0)
Feb 8 13:07:46 thanos-be2001 container-server: [pid: 25779|app: 0|req: 38/1801] 10.64.0.136 () {28 vars in 448 bytes} [Tue Feb 8 13:07:46 2022] HEAD /sdb3/9096/AUTH_dispersion/dispersion_0_939 => generated 0 bytes in 1 msecs (HTTP/1.1 507) 2 headers in 96 bytes (1 switches on core 0)
Feb 8 13:07:46 thanos-be2001 container-server: [pid: 25778|app: 0|req: 38/1802] 10.64.0.136 () {28 vars in 452 bytes} [Tue Feb 8 13:07:46 2022] HEAD /sdb3/29505/AUTH_dispersion/dispersion_1_3383 => generated 0 bytes in 1 msecs (HTTP/1.1 507) 2 headers in 96 bytes (1 switches on core 0)
Feb 8 13:07:46 thanos-be2001 container-server: [pid: 25765|app: 0|req: 38/1803] 10.64.0.136 () {28 vars in 452 bytes} [Tue Feb 8 13:07:46 2022] HEAD /sda3/49443/AUTH_dispersion/dispersion_1_3382 => generated 0 bytes in 1 msecs (HTTP/1.1 507) 2 headers in 96 bytes (1 switches on core 0)
root@thanos-be2001:~# tail -5 /root/server.log.1_tail1000
Feb 13 00:00:01 thanos-be2001 object-server: 10.192.0.192 - - [13/Feb/2022:00:00:01 +0000] "HEAD /sdn1/9829/AUTH_thanos/thanos/01FQQ3ZE4Z34PQNMVHK6Z1165C/meta.json" 200 6932 "HEAD http://thanos-swift.discovery.wmnet/v1/AUTH_thanos/thanos/01FQQ3ZE4Z34PQNMVHK6Z1165C/meta.json" "txcd5a24830a1340cf98564-0062084a01" "proxy-server 263746" 0.0012 "-" 1879 0
Feb 13 00:00:01 thanos-be2001 object-server: 10.192.0.192 - - [13/Feb/2022:00:00:01 +0000] "GET /sdi1/11883/AUTH_thanos/thanos/01EJDJ1YPR4Q11CEN814MHPKPM/no-compact-mark.json" 404 70 "GET http://thanos-swift.discovery.wmnet/v1/AUTH_thanos/thanos/01EJDJ1YPR4Q11CEN814MHPKPM/no-compact-mark.json" "txf705e6031d8f4e00ade0d-0062084a01" "proxy-server 263766" 0.0008 "-" 1857 0
Feb 13 00:00:01 thanos-be2001 object-server: 10.192.0.192 - - [13/Feb/2022:00:00:01 +0000] "HEAD /sdj1/41390/AUTH_thanos/thanos/01EHAJXBR63ET3CC3W1PHT5HWF/meta.json" 200 6771 "HEAD http://thanos-swift.discovery.wmnet/v1/AUTH_thanos/thanos/01EHAJXBR63ET3CC3W1PHT5HWF/meta.json" "txf06b4c25dce1479b8d6e2-0062084a01" "proxy-server 263752" 0.0009 "-" 1860 0
Feb 13 00:00:01 thanos-be2001 object-server: 10.192.0.192 - - [13/Feb/2022:00:00:01 +0000] "HEAD /sdg1/14346/AUTH_thanos/thanos/01FQPVM3SDKXHXMMY5Q28EV4GQ/meta.json" 200 8702 "HEAD http://thanos-swift.discovery.wmnet/v1/AUTH_thanos/thanos/01FQPVM3SDKXHXMMY5Q28EV4GQ/meta.json" "tx56dad1b384e7463eb856f-0062084a01" "proxy-server 263746" 0.0010 "-" 1807 0
Feb 13 00:00:01 thanos-be2001 object-server: 10.192.0.192 - - [13/Feb/2022:00:00:01 +0000] "GET /sdl1/2436/AUTH_thanos/thanos/01EZZFV77S7V5VQ33PYK1KGG4V/no-compact-mark.json" 404 70 "GET http://thanos-swift.discovery.wmnet/v1/AUTH_thanos/thanos/01EZZFV77S7V5VQ33PYK1KGG4V/no-compact-mark.json" "tx2fe09675b13644d9ad152-0062084a01" "proxy-server 263754" 0.0009 "-" 1869 0
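For reference, the band-aid amounted to roughly the following; this is only a sketch (the exact invocation wasn't recorded), and note that GNU coreutils truncate -s -10G drops the 10G at the end of the file, i.e. the newest entries:

# save head/tail excerpts before shrinking the file
head -n 1000 /var/log/swift/server.log.1 > /root/server.log.1_head1000
tail -n 1000 /var/log/swift/server.log.1 > /root/server.log.1_tail1000
# shrink the file in place by 10G (rsyslog keeps the same fd open)
truncate -s -10G /var/log/swift/server.log.1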
Since yesterday some form of rotation did happen (fresh, empty .log files are in place), but rsyslog has kept logging to the old .log.1 files. The root filesystem has enough free space for now, since the older rotations were compressed:
root@thanos-be2001:~# ls -latr /var/log/swift/server.log* /var/log/swift/background.log*
-rw-r--r-- 1 root root     3733934 Feb 12 23:59 /var/log/swift/background.log.2.gz
-rw-r--r-- 1 root root  4564556338 Feb 13 15:38 /var/log/swift/server.log.2.gz
-rw-r--r-- 1 root root           0 Feb 14 00:11 /var/log/swift/server.log
-rw-r--r-- 1 root root           0 Feb 14 00:11 /var/log/swift/background.log
-rw-r--r-- 1 root root     9136349 Feb 14 08:43 /var/log/swift/background.log.1
-rw-r--r-- 1 root root 12239141629 Feb 14 08:43 /var/log/swift/server.log.1
root@thanos-be2001:~# lsof /var/log/swift/server.log.1
COMMAND   PID USER   FD   TYPE DEVICE    SIZE/OFF   NODE NAME
rsyslogd 1216 root   12w   REG    9,0 12241448112 262171 /srv/log/swift/server.log.1
root@thanos-be2001:~# lsof /var/log/swift/background.log.1
COMMAND   PID USER   FD   TYPE DEVICE SIZE/OFF   NODE NAME
rsyslogd 1216 root   28w   REG    9,0  9137791 262170 /srv/log/swift/background.log.1
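For completeness, a quick way to confirm that rsyslog is still attached to the rotated file rather than the fresh zero-byte one is to compare inode numbers against the fd it holds open; a sketch, with pid 1216 and fd 12 taken from the lsof output above:

# inode numbers of the new vs. rotated file
ls -li /var/log/swift/server.log /var/log/swift/server.log.1
# where rsyslogd's fd 12 actually points
readlink /proc/1216/fd/12
# current write offset of that fd (the "pos:" line keeps growing while server.log stays at 0 bytes)
cat /proc/1216/fdinfo/12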
Note that this problem seems to be present on ms-be hosts too, though my guess is that with their lower logging volume the root filesystem doesn't tend to fill up:
root@cumin1001:~# cumin 'thanos-be* or ms-be*' 'lsof /var/log/swift/server.log.1 >/dev/null && echo yes || echo no'
85 hosts will be targeted:
ms-be[2028-2065].codfw.wmnet,ms-be[1028-1033,1035-1067].eqiad.wmnet,thanos-be[2001-2004].codfw.wmnet,thanos-be[1001-1004].eqiad.wmnet
Ok to proceed on 85 hosts? Enter the number of affected hosts to confirm or "q" to quit
85
===== NODE GROUP =====
(35) ms-be[2028,2030,2032,2037-2038,2040,2046-2047,2050-2051,2053-2054,2057,2060,2063,2065].codfw.wmnet,ms-be[1028-1031,1035-1038,1042,1046,1048-1049,1054,1058-1060,1065,1067].eqiad.wmnet,thanos-be2001.codfw.wmnet
----- OUTPUT of 'lsof /var/log/sw...o yes || echo no' -----
yes
===== NODE GROUP =====
(50) ms-be[2029,2031,2033-2036,2039,2041-2045,2048-2049,2052,2055-2056,2058-2059,2061-2062,2064].codfw.wmnet,ms-be[1032-1033,1039-1041,1043-1045,1047,1050-1053,1055-1057,1061-1064,1066].eqiad.wmnet,thanos-be[2002-2004].codfw.wmnet,thanos-be[1001-1004].eqiad.wmnet
----- OUTPUT of 'lsof /var/log/sw...o yes || echo no' -----
no
================
PASS |██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (85/85) [00:01<00:00, 42.91hosts/s]
FAIL |                                                                                                                                                            |   0% (0/85) [00:01<?, ?hosts/s]
100.0% (85/85) success ratio (>= 100.0% threshold) for command: 'lsof /var/log/sw...o yes || echo no'.
100.0% (85/85) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
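As a stopgap, HUPing rsyslog should make it close and reopen its output files (the default SIGHUP behaviour in current rsyslog versions), which fleet-wide would look roughly like this; untested here, so treat it as a sketch:

# single host: ask rsyslog to close and reopen its log files
systemctl kill -s HUP rsyslog.service

# fleet-wide via cumin, same host selector as above
cumin 'thanos-be* or ms-be*' 'systemctl kill -s HUP rsyslog.service'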
My hunch is one of the following: for some reason rsyslog never receives the SIGHUP that logrotate's postrotate script is supposed to send; or it does receive the SIGHUP but fails to create/reopen its destination files; or the new destination files get created but rsyslog never switches its writes over to them.
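To narrow this down it's probably worth dry-running logrotate and looking at what the postrotate stanza actually invokes; the paths below are assumptions (the swift fragment may live under a different name in /etc/logrotate.d/):

# what the postrotate scripts are configured to run
grep -A3 postrotate /etc/logrotate.d/*
# debug/dry-run: shows whether rotation and postrotate would fire, without touching anything
logrotate -d /etc/logrotate.d/rsyslog
# when logrotate last recorded a rotation of these files
grep swift /var/lib/logrotate/status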