Thumbor high log volume and unstructured logging
Open, Low, Public

Description

Thumbor produces about 50 million log events per day. In addition, its logs are not structured, so each line of a stack trace is emitted as a separate event, which inflates this number.

It's unclear to me what in this log stream is useful, but there are dashboards with recent updates.

Event Timeline

Marking as high priority because these logs make up a significant portion of the logstash-k8s indexes. We either need to expand the size of the cluster or reduce the volume of logs.

Change #1048592 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] logstash: move thumbor logs to logstash-thumbor partition

https://gerrit.wikimedia.org/r/1048592

Mentioned in SAL (#wikimedia-operations) [2024-06-21T23:54:02Z] <cwhite> delete remaining 2024.03 log indexes to make room on logstash eqiad and codfw T368180

Change #1048592 merged by Cwhite:

[operations/puppet@production] logstash: move thumbor logs to logstash-thumbor partition

https://gerrit.wikimedia.org/r/1048592

Change #1048605 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] logstash: reduce thumbor replicas

https://gerrit.wikimedia.org/r/1048605

Change #1048605 merged by Cwhite:

[operations/puppet@production] logstash: reduce thumbor replicas

https://gerrit.wikimedia.org/r/1048605

Change #1049108 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] thumbor: reduce log spam

https://gerrit.wikimedia.org/r/1049108

Change #1049108 merged by jenkins-bot:

[operations/deployment-charts@master] thumbor: reduce log spam

https://gerrit.wikimedia.org/r/1049108

Change #1049117 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/software/thumbor-plugins@master] logging: actually use critical when swiftclient/tornado.access use info

https://gerrit.wikimedia.org/r/1049117

Log level reduced to stop the bleeding:

[Image: image.png (357×1 px, 48 KB)]

The main problem at hand is that the imported swiftclient module logs 4 lines on each 404 (which Thumbor gets frequently), and tornado.access logs a line on each 200, which also happens frequently. Changing the log level has mitigated this, and this fix will filter those messages out even if we want to use INFO in the future.
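A minimal sketch of the kind of per-logger override described above, using only the Python standard library `logging` module. The logger names `swiftclient` and `tornado.access` come from the comment; the helper name `quiet_noisy_loggers` is hypothetical, and this is not the actual thumbor-plugins code.

```python
import logging

# Logger names taken from the discussion above; everything else is illustrative.
NOISY_LOGGERS = ("swiftclient", "tornado.access")


def quiet_noisy_loggers(level=logging.CRITICAL):
    """Raise the threshold on the noisy third-party loggers only.

    Hypothetical helper: the root logger level is left untouched, so the
    application can still log at INFO without re-enabling the per-request
    lines from these modules.
    """
    for name in NOISY_LOGGERS:
        logging.getLogger(name).setLevel(level)


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    quiet_noisy_loggers()
    logging.getLogger("tornado.access").info("200 GET /example")  # suppressed
    logging.getLogger("thumbor").info("application message")      # still emitted
```

Setting the level on the named loggers rather than on the root logger is what would keep the override effective if the service-wide level is later lowered back to INFO.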

So far, the previous fix has given a ~18x reduction in log volume. Thank you!

We should be able to re-incorporate thumbor into the k8s indexes with this reduced amount of logs.

Change #1049272 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] logstash: fix curator typo

https://gerrit.wikimedia.org/r/1049272

Change #1049272 merged by Cwhite:

[operations/puppet@production] logstash: fix curator typo

https://gerrit.wikimedia.org/r/1049272

Change #1051214 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] logstash: route thumbor logs in routing filter

https://gerrit.wikimedia.org/r/1051214

colewhite lowered the priority of this task from High to Medium. Jul 1 2024, 11:10 PM

There are still a lot of Python stack traces emitted as unstructured logs. Logs generated this way are largely unhelpful because a stack trace cannot easily be reconstructed from multiple log events.
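To illustrate the difference, here is a sketch of a formatter that emits each record, traceback included, as one JSON object per line, so that a stack trace arrives as a single event rather than one event per line. It uses only the standard library. The names `JsonFormatter` and `make_logger` and the field names are assumptions for illustration, not Thumbor's or the logging platform's actual schema.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Hypothetical formatter: one JSON object per log record."""

    def format(self, record):
        event = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            # formatException returns the whole traceback as one string;
            # json.dumps escapes the newlines, keeping the event on one line.
            event["stack_trace"] = self.formatException(record.exc_info)
        return json.dumps(event)


def make_logger(name, stream=None):
    """Build a logger that writes JSON lines to `stream` (stderr by default)."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid a second, unstructured copy via the root logger
    return logger


if __name__ == "__main__":
    log = make_logger("example")
    try:
        1 / 0
    except ZeroDivisionError:
        log.exception("image processing failed")
```

With output in this shape, the event count no longer grows with the depth of each traceback, and the full trace can be read from a single record.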

Change #1058123 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] thumbor: filter duplicate error messages

https://gerrit.wikimedia.org/r/1058123

Change #1058123 merged by jenkins-bot:

[operations/deployment-charts@master] thumbor: filter duplicate error messages

https://gerrit.wikimedia.org/r/1058123

The Thumbor logs continue to be largely useless because they are not structured. Since the load on the logging platform is a multiple of the number of events generated, I propose we drop them.

Change #1153322 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] logstash: drop thumbor unstructured logs

https://gerrit.wikimedia.org/r/1153322

Change #1153322 merged by Cwhite:

[operations/puppet@production] logstash: drop thumbor unstructured logs

https://gerrit.wikimedia.org/r/1153322

colewhite lowered the priority of this task from Medium to Low. Jun 12 2025, 2:24 PM

The Thumbor unstructured logs are dropped now. Keeping this open, but reprioritizing, as it may be useful to someone someday.