rsyslog receiver on centrallog hosts misplaces some log host entries
Open, Stalled, Needs Triage, Public

Description

While investigating something else we noticed syslog.log files on centrallog hosts containing entries that do not pertain to the expected host, for example:

centrallog1002:~$ zgrep -v '[0-9][0-9] titan2001 ' /srv/syslog/titan2001/syslog.log | wc -l
grep: /srv/syslog/titan2001/syslog.log: binary file matches
7158
centrallog1002:~$ wc -l /srv/syslog/titan2001/syslog.log
1386612 /srv/syslog/titan2001/syslog.log

This problem shows up for old syslog files, e.g. from as far back as the beginning of syslog retention (October 2024 at the time of writing).

Followup actions:

  • test a recent version of rsyslog, e.g. 8.2504.0 from unstable/trixie. Backported packages 8.2504.0-1~bpo12+1 are available in ~filippo on centrallog hosts.
  • if the bug still shows up, then report upstream (upstream issue: https://github.com/rsyslog/rsyslog/issues/5645)

Event Timeline

I took a look at logs in centrallog1002 and the problem is still showing for logs as recent as May 5 for several hosts (thanos-be1004, ms-be2066, elastic1082, etc.).

I'm still digging into this issue, but so far I've noticed that this is happening for logs of other hosts as well. For example, I can see thanos-be2001 logs in /srv/syslog/grafana1002/syslog.log. Looking at our code (receiver.pp, receiver.erb.conf, etc.), I don't think the issue resides there. My hypothesis so far is that this could be some sort of race condition where, due to the high volume of logs, some messages reuse the most recently evaluated hostname and are stored in that host's folder. I'll continue investigating to see if this could be the case.

I wrote a script to search for other instances where logs are stored under the wrong host, and this seems to be happening for plenty of hosts; you can find the full list here. The list is too long to review every entry by hand, but I looked at some of the instances shown (e.g. thanos-be* logs in the Grafana folder) and I can confirm that the logs belong to a different hostname and that the thanos-be* match is not there merely because the name appears in the log message.
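
A minimal sketch of how such a mismatch search could work. This is not the script mentioned above. It assumes each line starts with either a traditional timestamp (May 19 19:59:36 host ...) or an ISO 8601 timestamp followed by the hostname, as in the samples quoted in this task. The function name find_misplaced and the regular expression are illustrative assumptions.

```python
import re
from collections import Counter

# The hostname is assumed to be the first token after the timestamp.
# Two timestamp styles are handled, based on the samples quoted in this task:
#   May 19 19:59:36 prometheus1005 ...
#   2025-05-12T21:37:37.856903+00:00 centrallog1002 ...
LINE_RE = re.compile(
    r"^(?:[A-Z][a-z]{2}\s+\d{1,2}\s\d{2}:\d{2}:\d{2}"  # traditional style
    r"|\d{4}-\d{2}-\d{2}T\S+)"                          # ISO 8601 style
    r"\s(?P<host>\S+)\s"
)


def find_misplaced(lines, expected_host):
    """Count lines whose hostname field differs from expected_host.

    Lines that match neither timestamp style are counted under the key
    None instead of being guessed at.
    """
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        host = match.group("host") if match else None
        if host != expected_host:
            counts[host] += 1
    return counts
```

For a real file, the lines could be read with open(path, errors="replace"), since grep reports binary content in at least one of these files, and expected_host would be the directory name under /srv/syslog/.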

Thank you for the extensive investigation @andrea.denisse !

Reporting after the discussion at the team meeting yesterday: we could certainly chase the bug in rsyslog, and we should certainly test the latest rsyslog version and report the problem upstream. However, for me personally this is kind of the nail in the coffin for rsyslog in its role as the central syslog receiver. Looking for alternatives to rsyslog-receiver also ties in with the investigations we have been doing on doing a better job closer to the log producers (e.g. T333731).

To clarify what I meant: for the purposes of this task and the bug, I think it is worthwhile testing out new rsyslog versions as outlined in the task description. Mid/long term, though, the use of rsyslog as the central syslog receiver should be re-evaluated.

Hi Filippo, thanks for the guidance on this.

I looked into the compatibility of the rsyslog_8.2504.0-1~bpo12+1 package on Bookworm (using dpkg-deb -I rsyslog_8.2504.0-1~bpo12+1_amd64.deb and apt-cache showpkg rsyslog) and all of the runtime dependencies are met. It seems the only conflict would be with rsyslog-openssl and rsyslog-kafka, as they depend on the older rsyslog version (8.2302.0-1), but I noticed you have matching builds of those in your home folder as well. Thanks for putting everything together, I'll test with those versions.

Mentioned in SAL (#wikimedia-operations) [2025-05-12T21:31:15Z] <denisse> Testing rsyslog_8.2504.0-1~bpo12+1 on centrallog1002 - T383309

The current configuration is valid with rsyslog 8.2504.0:

sudo /usr/sbin/rsyslogd -N1
rsyslogd: version 8.2504.0, config validation run (level 1), master config /etc/rsyslog.conf
rsyslogd: End of config validation run. Bye.

Looking at lsof output after the upgrade the Kafka module (omkafka) and OpenSSL are working correctly (I used the appropriate packages matching the rsyslog version).

I tried to manually create a log entry: logger "denisse: rsyslog upgrade test"
And it was correctly logged: 2025-05-12T21:37:37.856903+00:00 centrallog1002 denisse: denisse: rsyslog upgrade test

So far, the upgrade seems to have worked. I'll continue to monitor it and if it's stable and the issue is fixed I think we could integrate it into reprepro to update the other centrallog instances. What do you think?

Thank you for taking a look and the upgrade @andrea.denisse ! Should be easy to verify soon whether the bug shows up indeed. +1 to add to reprepro, please note we already have a component/rsyslog which should/can be reused for this purpose

Mentioned in SAL (#wikimedia-operations) [2025-05-19T20:23:19Z] <denisse> Downgrade rsyslog, rsyslog-kafka, and rsyslog-openssl to 8.2302.0-1+deb12u1_amd64 - T383309

I rechecked after the rsyslog upgrade (performed on May 13), since we should now have enough logs to confirm whether the issue was resolved. Unfortunately, the upgrade did not fix the problem: I'm still seeing misplaced logs across all host folders.

For example, we have this log May 19 19:59:36 prometheus1005 in /srv/syslog/grafana1002/syslog.log.

I also verified that the rsyslog version was not downgraded:

denisse@centrallog1002:~$ dpkg -l | grep rsyslog
ii  prometheus-rsyslog-exporter          1.0.0+git20221110-1                  amd64        Export rsyslog metrics to Prometheus
ii  rsyslog                              8.2504.0-1~bpo12+1                   amd64        reliable system and kernel logging daemon
ii  rsyslog-kafka                        8.2504.0-1~bpo12+1                   amd64        Apache Kafka support for rsyslog
ii  rsyslog-openssl                      8.2504.0-1~bpo12+1                   amd64        TLS/DTLS protocol support for rsyslog (OpenSSL)

The host is still running the upgraded version, so upgrading the rsyslog version doesn't fix the issue. I downgraded rsyslog, rsyslog-kafka, and rsyslog-openssl to 8.2302.0-1+deb12u1_amd64 (the version provided by Debian Bookworm) so it's consistent with the other centrallog hosts.

I created an upstream issue.

andrea.denisse changed the task status from Open to Stalled. May 19 2025, 8:24 PM
andrea.denisse claimed this task.
andrea.denisse updated the task description.

Change #1151386 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] centrallog: Add a temporary rsyslog debug config file

https://gerrit.wikimedia.org/r/1151386

Change #1151386 merged by Andrea Denisse:

[operations/puppet@production] centrallog: Add a temporary rsyslog debug config file

https://gerrit.wikimedia.org/r/1151386

Change #1163901 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] centrallog: Disable temporary rsyslog debug config file.

https://gerrit.wikimedia.org/r/1163901

Change #1163901 abandoned by Andrea Denisse:

[operations/puppet@production] centrallog: Disable temporary rsyslog debug config file.

https://gerrit.wikimedia.org/r/1163901

Change #1163901 restored by Andrea Denisse:

[operations/puppet@production] centrallog: Disable temporary rsyslog debug config file.

https://gerrit.wikimedia.org/r/1163901

Change #1163901 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] centrallog: Log with standard and custom template

https://gerrit.wikimedia.org/r/1163901

Change #1163901 merged by Andrea Denisse:

[operations/puppet@production] centrallog: Log with standard and custom template

https://gerrit.wikimedia.org/r/1163901

Change #1173442 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] centrallog: Add debug template to rsyslog config

https://gerrit.wikimedia.org/r/1173442

Change #1173442 merged by Andrea Denisse:

[operations/puppet@production] centrallog: Add sampling rules for debug logging

https://gerrit.wikimedia.org/r/1173442

Change #1178932 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] centrallog: Add sampling rules for debug logging

https://gerrit.wikimedia.org/r/1178932

Change #1178932 merged by Andrea Denisse:

[operations/puppet@production] centrallog: Add sampling rules for debug logging

https://gerrit.wikimedia.org/r/1178932

Change #1178969 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] centrallog: Remove debug sampling

https://gerrit.wikimedia.org/r/1178969

Change #1178969 merged by Andrea Denisse:

[operations/puppet@production] centrallog: Remove debug sampling

https://gerrit.wikimedia.org/r/1178969

Mentioned in SAL (#wikimedia-operations) [2025-08-15T22:22:44Z] <denisse> Remove log debug file from host - T383309

Hi folks,

After enabling the debug template and analyzing the logs, the only anomaly I found was that some logs have FROMHOST containing an IP address instead of a FQDN. This seems to indicate that inconsistent hostname resolution may be the cause of logs occasionally being written to the wrong host folders.
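
A minimal sketch of how the FROMHOST anomaly could be tallied. It assumes the FROMHOST values have already been extracted from the debug records; the debug template's output format is not shown in this task, so that extraction step is left out. The helper names is_ip_literal and classify_fromhost are hypothetical.

```python
import ipaddress
from collections import Counter


def is_ip_literal(value):
    """Return True if value parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(value)
    except ValueError:
        return False
    return True


def classify_fromhost(values):
    """Tally FROMHOST values as 'ip' (address literal) or 'name' (anything else)."""
    return Counter("ip" if is_ip_literal(v) else "name" for v in values)
```

Comparing the share of 'ip' values among misplaced entries with the share among correctly placed ones would be one way to check whether the two phenomena are actually correlated.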

I updated upstream with the findings and asked for advice. I'll continue to think about what could be causing the issue while we receive a reply.

For now, I have stopped the debug logging since we have gathered enough data and have deleted the debug file from the host to prevent it from occupying unnecessary space.

Change #1179228 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] centrallog: Remove unused debug logging config

https://gerrit.wikimedia.org/r/1179228

@andrea.denisse Do you still have any debug logs available? I’m just curious...

Change #1179228 merged by Andrea Denisse:

[operations/puppet@production] centrallog: Remove unused debug logging config

https://gerrit.wikimedia.org/r/1179228

Hi Tiziano, I don't have any debug logs. I captured and analyzed them on the host so that the logs didn't leave the prod infra, but I'll enable debug logging again to share them with the upstream maintainers, as they would like to see the log headers.

My plan is to enable debug logging and to share a sample of sanitized logs with the rsyslog maintainers for their advice. I can leave the file on the host if you'd like to analyze it. Any findings you make could be very useful for understanding or solving the issue.
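
One possible way to sanitize a sample before sharing it, as a rough sketch only. It assumes that replacing known hostnames and IPv4 addresses with stable placeholders is sufficient, which may not hold for real log content (message bodies can contain other sensitive data). The Sanitizer class and its placeholder scheme are hypothetical and not part of any existing tooling.

```python
import re

# Simple IPv4 matcher; IPv6 and other identifiers are NOT handled here.
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


class Sanitizer:
    """Replace known hostnames and IPv4 addresses with stable placeholders.

    The same input value always maps to the same placeholder, so
    relationships between log lines are preserved for whoever reads the sample.
    """

    def __init__(self, hostnames):
        self._aliases = {}
        # Longest names first so a longer name wins over a shorter prefix.
        names = sorted(set(hostnames), key=len, reverse=True)
        pattern = "|".join(re.escape(name) for name in names)
        self._host_re = (
            re.compile(r"(?<![\w-])(?:" + pattern + r")(?![\w-])") if names else None
        )

    def _alias(self, kind, value):
        key = (kind, value)
        if key not in self._aliases:
            index = sum(1 for k in self._aliases if k[0] == kind) + 1
            self._aliases[key] = f"{kind}-{index}"
        return self._aliases[key]

    def sanitize(self, line):
        if self._host_re is not None:
            line = self._host_re.sub(lambda m: self._alias("host", m.group(0)), line)
        return IPV4_RE.sub(lambda m: self._alias("ip", m.group(0)), line)
```

Because the mapping is kept in memory, the same Sanitizer instance would need to process the whole sample for placeholders to stay consistent across lines.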

Change #1179753 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] centrallog: Enable debug logging for the rsyslog-receiver

https://gerrit.wikimedia.org/r/1179753

Change #1179753 merged by Andrea Denisse:

[operations/puppet@production] centrallog: Enable debug logging for the rsyslog-receiver

https://gerrit.wikimedia.org/r/1179753

Hi Tiziano, the file with the debug logs is in /srv/syslog/rsyslog-debug/rsyslog-debug-standard.log on the centrallog hosts.