Page MenuHomePhabricator

ossl rsyslog errors post-migration
Open, Needs TriagePublic

Description

I've been looking at the rsyslog dashboard and while things generally seems to work, some hosts are reporting errors: https://grafana.wikimedia.org/d/000000596/rsyslog?orgId=1&refresh=5m&from=now-1h&to=now

e.g. on ms-fe1013 there's a whole bunch of errors:

Nov 21 13:30:54 ms-fe1013 rsyslogd[2459015]: nsd_ossl:TLS session terminated with remote syslog server. [v8.2102.0]
Nov 21 13:30:55 ms-fe1013 rsyslogd[2459015]: nsd_ossl:TLS Connection initiated with remote syslog server. [v8.2102.0]
Nov 21 13:30:55 ms-fe1013 rsyslogd[2459015]: nsd_ossl:No shared curve between syslog client and server. [v8.2102.0]
Nov 21 13:30:55 ms-fe1013 rsyslogd[2459015]: nsd_ossl:No shared curve between syslog client and server. [v8.2102.0]
Nov 21 13:30:55 ms-fe1013 rsyslogd[2459015]: nsd_ossl:TLS session terminated with remote syslog server. [v8.2102.0]
Nov 21 13:30:56 ms-fe1013 rsyslogd[2459015]: nsd_ossl:TLS Connection initiated with remote syslog server. [v8.2102.0]
Nov 21 13:30:57 ms-fe1013 rsyslogd[2459015]: nsd_ossl:No shared curve between syslog client and server. [v8.2102.0]
Nov 21 13:30:57 ms-fe1013 rsyslogd[2459015]: nsd_ossl:TLS session terminated with remote syslog server. [v8.2102.0]
Nov 21 13:30:58 ms-fe1013 rsyslogd[2459015]: nsd_ossl:TLS Connection initiated with remote syslog server. [v8.2102.0]
Nov 21 13:30:58 ms-fe1013 rsyslogd[2459015]: nsd_ossl:No shared curve between syslog client and server. [v8.2102.0]
Nov 21 13:30:58 ms-fe1013 rsyslogd[2459015]: nsd_ossl:TLS session terminated with remote syslog server. [v8.2102.0]
Nov 21 13:30:59 ms-fe1013 rsyslogd[2459015]: nsd_ossl:TLS session terminated with remote syslog server. [v8.2102.0]
Nov 21 13:31:00 ms-fe1013 rsyslogd[2459015]: nsd_ossl:TLS Connection initiated with remote syslog server. [v8.2102.0]
Nov 21 13:31:00 ms-fe1013 rsyslogd[2459015]: nsd_ossl:No shared curve between syslog client and server. [v8.2102.0]
Nov 21 13:31:00 ms-fe1013 rsyslogd[2459015]: nsd_ossl:TLS session terminated with remote syslog server. [v8.2102.0]
Nov 21 13:31:06 ms-fe1013 rsyslogd[2459015]: nsd_ossl:TLS Connection initiated with remote syslog server. [v8.2102.0]

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript
fgiunchedi renamed this task from ossl rsyslog post-migration to ossl rsyslog errors post-migration.Nov 21 2023, 1:35 PM

On the rsyslog side these are the errors:

Nov 21 13:42:58 centrallog2002 rsyslogd[2845781]: nsd_ossl:TLS session terminated with remote syslog server. [v8.2102.0]
Nov 21 13:42:58 centrallog2002 rsyslogd[2845781]: nsd_ossl:TLS session terminated with remote syslog server. [v8.2102.0]
Nov 21 13:42:58 centrallog2002 rsyslogd[2845781]: nsd_ossl:TLS session terminated with remote syslog server. [v8.2102.0]
Nov 21 13:42:58 centrallog2002 rsyslogd[2845781]: nsd_ossl:TLS session terminated with remote syslog server. [v8.2102.0]
Nov 21 13:42:59 centrallog2002 rsyslogd[2845781]: nsd_ossl:TLS session terminated with remote syslog server. [v8.2102.0]
Nov 21 13:42:59 centrallog2002 rsyslogd[2845781]: nsd_ossl:TLS session terminated with remote syslog server. [v8.2102.0]
Nov 21 13:42:59 centrallog2002 rsyslogd[2845781]: nsd_ossl:TLS session terminated with remote syslog server. [v8.2102.0]
Nov 21 13:42:59 centrallog2002 rsyslogd[2845781]: nsd_ossl:TLS session terminated with remote syslog server. [v8.2102.0]

Interestingly only centrallog2002 seems affected, I can't see any errors on centrallog1002

Mentioned in SAL (#wikimedia-operations) [2023-11-21T13:49:49Z] <godog> test upgrade rsyslog on centrallog2002 - T351710

Mentioned in SAL (#wikimedia-operations) [2023-11-21T14:07:17Z] <godog> revert rsyslog upgrade on centrallog2002 - T351710

Change 976234 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] centralserver: remove icinga tls listener check

https://gerrit.wikimedia.org/r/976234

Change 976236 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] centralserver: probe syslog receiver with client auth

https://gerrit.wikimedia.org/r/976236

Change 976234 merged by Filippo Giunchedi:

[operations/puppet@production] centralserver: remove icinga tls listener check

https://gerrit.wikimedia.org/r/976234

Change 976236 merged by Filippo Giunchedi:

[operations/puppet@production] centralserver: probe syslog receiver with client auth

https://gerrit.wikimedia.org/r/976236

@fgiunchedi seems like a mismatch on configured curves between clients and servers, could I suggest providing a more detailed TLS configuration for both rsyslog servers and clients?

I think tls.tlscfgmd is what you need to set per https://www.rsyslog.com/doc/v8-stable/configuration/modules/imrelp.html#tls-tlscfgcmd

I'd recommend using the same settings that we use for the strong ssl settings globally (I'll update this comment later)

Thank you @Vgutierrez for the suggestion, I've dug a little bit into the situation and the code and I believe the message is a red-herring, in the sense that taking another bullseye host (e.g. logstash1023) I see the error although things are working as expected in that case. i.e. logstash1023 is sending to both centrallog hosts fine.
So far my suspicion is that this is related to the amount of log traffic produced, as I'm seeing plenty of (re)connections from e.g. ms-fe hosts which do syslog quite a bit

nice, but please set a sane TLS configuration :) ideally nothing lower than TLSv1.2 and solid ciphersuites

nice, but please set a sane TLS configuration :) ideally nothing lower than TLSv1.2 and solid ciphersuites

Tracked in https://phabricator.wikimedia.org/T351788

Change 976656 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] Revert "centrallog: update tls_netstream_driver to use ossl"

https://gerrit.wikimedia.org/r/976656

Change 976656 merged by Filippo Giunchedi:

[operations/puppet@production] Revert "centrallog: update tls_netstream_driver to use ossl"

https://gerrit.wikimedia.org/r/976656

I tested a revert to gtls for centrallog hosts (the receiver part only), rsyslog now stays silent on centrallog though I still see the (re) connections from high syslog traffic hosts like ms-fe. We have queues to spool data on disk on end hosts so I'm not too worried at this point.

While working on rsyslog-receiver configuration I noticed we're running a single rsyslog that doubles both as the fleetwide syslog plus the receiver, this is more complex than it needs to be IMHO so I'll take this occasion to split the receiver into its own rsyslog instance (in a followup task)

Change 977090 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] rsyslog: move centrallog to ossl

https://gerrit.wikimedia.org/r/977090

Change 977090 merged by Filippo Giunchedi:

[operations/puppet@production] rsyslog: move centrallog to ossl

https://gerrit.wikimedia.org/r/977090

Change 979108 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] centralserver: reintroduce tls-remedy

https://gerrit.wikimedia.org/r/979108

Change 979108 merged by Filippo Giunchedi:

[operations/puppet@production] centralserver: reintroduce tls-remedy for centralserver

https://gerrit.wikimedia.org/r/979108

Current situation:

  • We have a separate rsyslog-receiver unit/instance with only the receiver bits on centrallog hosts
  • The fleet is running with ossl both for clients and server
  • The rsyslog remedy (i.e. restart the server if a probe with openssl fails) is back on
  • The messages/errors remain on centrallog hosts, the problem is most noticeable for hosts with high log volume (e.g. ms-fe)
  • The problem is mitigated by the client queues, which take over when a disconnection happens and then eventually flush the logs once the connection is back up

Change 980434 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] rsyslog: update receiver config for version 8.2302

https://gerrit.wikimedia.org/r/980434

Change 980434 merged by Filippo Giunchedi:

[operations/puppet@production] rsyslog: update receiver config for version 8.2302

https://gerrit.wikimedia.org/r/980434

Mentioned in SAL (#wikimedia-operations) [2023-12-06T08:49:33Z] <godog> test rsyslog version from bullseye-backports on centrallog - T351710

Change 981287 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] rsyslog: add receiver action names

https://gerrit.wikimedia.org/r/981287

Change 981287 merged by Filippo Giunchedi:

[operations/puppet@production] rsyslog: add receiver action names

https://gerrit.wikimedia.org/r/981287

I've added some more visibility into how much we are writing to local files, compared to the amount of logs we are receiving.

Turns out we receive (across both centrallog hosts) ~40k logs/s though we're writing ~10k/s. Meaning we ~75% of received logs are just dropped! I believe this is the high volume of swift access log, for which we log only state-changing requests. Instead, we should not send these logs to centrallog in the first place.

Change 981298 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] swift: write to local files and ban before centrallog

https://gerrit.wikimedia.org/r/981298

Mentioned in SAL (#wikimedia-operations) [2023-12-06T08:49:33Z] <godog> test rsyslog version from bullseye-backports on centrallog - T351710

This works and is running, though not always in the sense that in some occasions non-hostname files are created, for example:

root@centrallog1002:/srv/syslog# cat ontain/syslog.log 
Dec  7 14:49:11 ontain attribute [givenName]>#033[m
Dec  7 14:49:12 ontain attribute [givenName]>#033[m

Not a huge deal IMHO at least for now, I'm expecting that resolving T352968 will improve the situation though

Change 982085 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] Move to standard rsyslog-rotate shared script

https://gerrit.wikimedia.org/r/982085

Change 982085 merged by Filippo Giunchedi:

[operations/puppet@production] Move to standard rsyslog-rotate shared script

https://gerrit.wikimedia.org/r/982085

Mentioned in SAL (#wikimedia-operations) [2023-12-06T08:49:33Z] <godog> test rsyslog version from bullseye-backports on centrallog - T351710

This works and is running, though not always in the sense that in some occasions non-hostname files are created, for example:

root@centrallog1002:/srv/syslog# cat ontain/syslog.log 
Dec  7 14:49:11 ontain attribute [givenName]>#033[m
Dec  7 14:49:12 ontain attribute [givenName]>#033[m

Not a huge deal IMHO at least for now, I'm expecting that resolving T352968 will improve the situation though

I was mistaken, even though reducing volume of logs by 75% did certainly help with load in general, this issue persists