
Switch k8s logs to their own kafka topics
Closed, Resolved, Public

Description

Logs from all k8s clusters currently flow into the rsyslog-* kafka topics (split by severity). While this is simple and works under normal circumstances, even a single spammy producer causes lag that affects all other producers.

Similarly to what we do with prometheus, we should instead switch to a model where kafka-logging topics are split at least by k8s cluster, if not further (e.g. cluster + namespace).
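As a rough illustration of the split described above, here is a minimal sketch of deriving a topic name from cluster, optional namespace, and severity. The `k8s-logs` prefix, the `.` separator, the `topic_for` helper, and the cluster/namespace names are all hypothetical and not taken from the actual puppet changes.

```python
from typing import Optional


def topic_for(cluster: str, severity: str, namespace: Optional[str] = None) -> str:
    """Build a per-cluster (optionally per-namespace) topic name.

    Assumption: the "k8s-logs" prefix and "." separator are made up for this
    sketch; the real topic naming scheme may differ.
    """
    parts = ["k8s-logs", cluster]
    if namespace is not None:
        # Finer split: cluster + namespace
        parts.append(namespace)
    # Keep the existing split by severity as the last component
    parts.append(severity)
    return ".".join(parts)


if __name__ == "__main__":
    # Hypothetical cluster and namespace names
    print(topic_for("cluster-a", "notice"))
    print(topic_for("cluster-a", "notice", namespace="my-service"))
```

With this kind of scheme a spammy producer only builds up lag on its own cluster's (or namespace's) topics, rather than on a topic shared by everyone.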

As a bonus, moving to this model also effectively increases logstash ingestion capacity, since we will be able to consume from more topics concurrently instead of a single funnel/topic. At the moment we normally have 6 partitions and 6 logstash consumers, so each consumer effectively reads a given topic with a single thread.
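The capacity argument can be sketched with a simplified model of a Kafka consumer group, where each partition is read by at most one consumer at a time. The `assign_round_robin` helper, the topic names, and the consumer names below are hypothetical, and the round-robin spread is only an approximation of what a real partition assignor does.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def assign_round_robin(
    partitions_per_topic: Dict[str, int], consumers: List[str]
) -> Dict[str, List[Tuple[str, int]]]:
    """Spread every (topic, partition) pair across consumers round-robin.

    Simplified model: each partition goes to exactly one consumer.
    """
    assignment: Dict[str, List[Tuple[str, int]]] = defaultdict(list)
    pairs = [
        (topic, partition)
        for topic, count in sorted(partitions_per_topic.items())
        for partition in range(count)
    ]
    for index, pair in enumerate(pairs):
        assignment[consumers[index % len(consumers)]].append(pair)
    return dict(assignment)


if __name__ == "__main__":
    consumers = [f"logstash-{i}" for i in range(6)]  # hypothetical names

    # One shared topic with 6 partitions: one partition per consumer
    shared = assign_round_robin({"shared-topic": 6}, consumers)
    print({c: len(p) for c, p in shared.items()})

    # Three per-cluster topics with 6 partitions each: three partitions per consumer
    split = assign_round_robin(
        {"cluster-a": 6, "cluster-b": 6, "cluster-c": 6}, consumers
    )
    print({c: len(p) for c, p in split.items()})
```

In this model a single 6-partition topic gives each of the 6 consumers exactly one partition to read, while several per-cluster topics give each consumer several partitions that it can read concurrently.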

Event Timeline

Change #1040170 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] k8s: send logs to per-cluster kafka topics

https://gerrit.wikimedia.org/r/1040170

Change #1040170 merged by Filippo Giunchedi:

[operations/puppet@production] k8s: send logs to per-cluster kafka topics

https://gerrit.wikimedia.org/r/1040170

Change #1042917 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] logstash: add auto_offset_reset to kafka input

https://gerrit.wikimedia.org/r/1042917

Change #1042918 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] logstash: consume k8s logs topics

https://gerrit.wikimedia.org/r/1042918

Change #1042917 merged by Filippo Giunchedi:

[operations/puppet@production] logstash: add auto_offset_reset to kafka input

https://gerrit.wikimedia.org/r/1042917

Change #1042918 merged by Filippo Giunchedi:

[operations/puppet@production] logstash: consume k8s logs topics

https://gerrit.wikimedia.org/r/1042918

Change #1057819 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] rsyslog: send all k8s logs to dedicated kafka topics

https://gerrit.wikimedia.org/r/1057819

Change #1057819 merged by Filippo Giunchedi:

[operations/puppet@production] rsyslog: send all k8s logs to dedicated kafka topics

https://gerrit.wikimedia.org/r/1057819

Change #1059025 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] rsyslog: fix kafka-k8s double logging

https://gerrit.wikimedia.org/r/1059025

Change #1059025 merged by Filippo Giunchedi:

[operations/puppet@production] rsyslog: fix kafka-k8s double logging

https://gerrit.wikimedia.org/r/1059025

fgiunchedi claimed this task.

This is done. From the dashboards we can now tell, for example, which k8s cluster is generating activity and using space (https://grafana.wikimedia.org/goto/19eEXF9SR?orgId=1).

2024-08-02-103058_2456x1519_scrot.png (1×2 px, 491 KB)

The topics are not perfectly balanced in terms of throughput, and we'll probably need to keep an eye on space usage / retention. We can address that in followups as required; resolving.