
syslog / centrallog log volume growth
Open, Medium, Public

Assigned To
None
Authored By
fgiunchedi
May 25 2022, 7:40 AM
Tokens
"Love" token, awarded by jcrespo.

Description

While investigating T300056 I took a look at what is causing so much (sys)log growth over time. Turns out, it is swift logs (!)

root@centrallog2002:~# du -hcs /srv/syslog | grep total
926G	total
root@centrallog2002:~# du -hcs /srv/syslog/*/swift* | grep total
712G	total
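To see which hosts account for most of that space, the same `du` check can be broken down per host. This is a minimal sketch, assuming the `/srv/syslog/<host>/swift*` layout shown above; the function name `top_swift_hosts` is hypothetical and not part of any existing tooling.

```shell
# top_swift_hosts DIR [N]
# Print "<KiB><TAB><host>" for the N hosts (default 10) whose swift logs
# use the most disk space, largest first.
# Assumes logs are stored as DIR/<host>/swift*, as in the du commands above.
top_swift_hosts() (
    dir=$1
    n=${2:-10}
    for host in "$dir"/*/; do
        # Skip hosts that have no swift logs at all.
        ls "${host}"swift* >/dev/null 2>&1 || continue
        # du -c appends a grand-total line; keep only its size column.
        kib=$(du -ck "${host}"swift* | tail -n 1 | cut -f1)
        printf '%s\t%s\n' "$kib" "$(basename "$host")"
    done | sort -rn | head -n "$n"
)

# Example: top_swift_hosts /srv/syslog 5
```

Sizes are reported in KiB rather than with `-h` so that `sort -rn` orders them correctly.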

Event Timeline

lmata triaged this task as Medium priority. May 25 2022, 4:30 PM

Followup from o11y team meeting yesterday:

  • Consider if it is doable and viable to trim retention for swift only
  • Medium/longer term, consider whether ingesting swift logs into logstash is feasible in terms of log volume and used space

Mentioned in SAL (#wikimedia-operations) [2022-06-08T15:13:42Z] <godog> trim swift logs older than 30d from centrallog2002 - T309171

Mentioned in SAL (#wikimedia-operations) [2022-06-08T15:17:00Z] <godog> trim swift logs older than 30d from centrallog1001 - T309171
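The exact command used for the trim is not recorded in this task. A minimal sketch of how such a trim could be previewed, assuming the `/srv/syslog/<host>/swift*` layout; `list_old_swift_logs` is a hypothetical helper that only lists candidates and deletes nothing.

```shell
# list_old_swift_logs DIR DAYS
# Print swift log files under DIR whose age, rounded down to whole days,
# is greater than DAYS (this is what find's "-mtime +DAYS" matches).
# Assumption: swift logs are regular files named swift* below DIR/<host>/.
list_old_swift_logs() (
    dir=$1
    days=$2
    find "$dir" -type f -name 'swift*' -mtime +"$days" -print
)

# Example (dry run): list_old_swift_logs /srv/syslog 30
```

Once the listing has been reviewed, replacing `-print` with `-delete` (supported by GNU find) would remove the matched files.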

This is back. I tracked it down to a large volume of logs from container-server, essentially the REPLICATE access log:

root@centrallog2002:~# grep -c REPLICATE /srv/syslog/ms-be1061/swift.log
2723041
root@centrallog2002:~# wc -l !$
wc -l /srv/syslog/ms-be1061/swift.log
3541507 /srv/syslog/ms-be1061/swift.log
root@centrallog2002:~#
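In the output above, REPLICATE lines are 2723041 of 3541507, or roughly 77% of the file. The same check can be wrapped in a small helper to compare hosts; this is a sketch, and `replicate_share` is a hypothetical name.

```shell
# replicate_share FILE
# Print the percentage of lines in FILE that contain "REPLICATE".
replicate_share() (
    file=$1
    # grep -c exits non-zero when nothing matches but still prints 0.
    matched=$(grep -c REPLICATE "$file" || true)
    total=$(wc -l < "$file")
    LC_ALL=C awk -v m="$matched" -v t="$total" \
        'BEGIN { if (t > 0) printf "%.1f%%\n", 100 * m / t; else print "n/a" }'
)

# Example: replicate_share /srv/syslog/ms-be1061/swift.log
```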

This logging is new on Bullseye systems; previously REPLICATE activity did not show up in the logs. I'm in favor of dropping these logs altogether since they are essentially spam IMHO, much like we did in https://gerrit.wikimedia.org/r/340142. What do you think, @MatthewVernon?

Change 806166 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] swift: drop REPLICATE 'access log' from container-server

https://gerrit.wikimedia.org/r/806166

Change 806166 merged by Filippo Giunchedi:

[operations/puppet@production] swift: drop REPLICATE 'access log' from container-server

https://gerrit.wikimedia.org/r/806166

For context:

root@centrallog2002:~$ du -hcs /srv/syslog/ms-be* | grep total
804G    total
root@centrallog2002:~$ du -hcs /srv/syslog | grep total
1.1T    total

{F35246396}
{F35246398}

Change 806173 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] swift: introduce rsyslog config to ban logs before centrallog

https://gerrit.wikimedia.org/r/806173

Change 806173 merged by Filippo Giunchedi:

[operations/puppet@production] swift: introduce rsyslog config to ban logs before centrallog

https://gerrit.wikimedia.org/r/806173

Mentioned in SAL (#wikimedia-operations) [2022-06-16T11:35:21Z] <godog> trim swift logs older than 25d from centrallog hosts - T309171

The immediate issue has been mitigated, and filesystem usage should return to reasonable levels over the next few days. I'll leave the task open in the meantime.