Page MenuHomePhabricator

Encrypt syslog traffic
Closed, ResolvedPublic

Description

Currently syslog traffic towards the logging servers is not encrypted. rsyslog natively supports TLS-encrypted udp/tcp streams so we should be able to do that via its gtls driver.

  • Add listener for TLS syslog to rsyslog-receiver
  • Gradually switch clients to send TLS syslog

Event Timeline

Change 369950 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] rsyslog: add support to receive syslog over TLS

https://gerrit.wikimedia.org/r/369950

fgiunchedi renamed this task from encrypt syslog traffic to Encrypt syslog traffic.Aug 18 2017, 3:32 PM
fgiunchedi updated the task description. (Show Details)

Change 378922 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] base: send syslog over TLS

https://gerrit.wikimedia.org/r/378922

Change 369950 merged by Filippo Giunchedi:
[operations/puppet@production] rsyslog: add support to receive syslog over TLS

https://gerrit.wikimedia.org/r/369950

Change 378922 merged by Filippo Giunchedi:
[operations/puppet@production] base: ability to send syslog over TLS

https://gerrit.wikimedia.org/r/378922

Change 381753 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] base: create /etc/rsyslog as required

https://gerrit.wikimedia.org/r/381753

Change 381753 merged by Filippo Giunchedi:
[operations/puppet@production] base: create /etc/rsyslog as required

https://gerrit.wikimedia.org/r/381753

Change 381768 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: enable syslog over tls for ulsfo/esams

https://gerrit.wikimedia.org/r/381768

Change 381944 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] base: check remote_syslog and remote_syslog_tls for emptyness

https://gerrit.wikimedia.org/r/381944

Change 381944 merged by Filippo Giunchedi:
[operations/puppet@production] base: check remote_syslog and remote_syslog_tls for emptyness

https://gerrit.wikimedia.org/r/381944

Change 381768 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: enable syslog over tls for ulsfo

https://gerrit.wikimedia.org/r/381768

Change 381959 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: use empty list for remote_syslog

https://gerrit.wikimedia.org/r/381959

Change 381959 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: use empty list for remote_syslog

https://gerrit.wikimedia.org/r/381959

Change 381970 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: enable syslog over tls for codfw

https://gerrit.wikimedia.org/r/381970

Change 381970 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: enable syslog over tls for codfw

https://gerrit.wikimedia.org/r/381970

Change 381997 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] base: cater for rsyslog on trusty running as 'syslog'

https://gerrit.wikimedia.org/r/381997

Change 381998 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] rsyslog: bump maximum TCP sessions

https://gerrit.wikimedia.org/r/381998

Change 381997 merged by Filippo Giunchedi:
[operations/puppet@production] base: cater for rsyslog on trusty running as 'syslog'

https://gerrit.wikimedia.org/r/381997

Change 381998 merged by Filippo Giunchedi:
[operations/puppet@production] rsyslog: bump maximum TCP sessions

https://gerrit.wikimedia.org/r/381998

Change 382415 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: enable syslog over tls for eqiad

https://gerrit.wikimedia.org/r/382415

Change 382415 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: enable syslog over tls for eqiad

https://gerrit.wikimedia.org/r/382415

Change 383352 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: enable syslog over tls for esams

https://gerrit.wikimedia.org/r/383352

syslog-tls is deployed everywhere but esams (coming shortly)

Traffic looks good to me, and cpu usage overall is around 15% on wezen/lithium.

Connection could though on the two hosts isn't the same, which suggests some hosts are connected to only one of the two endpoints (see P6099 for a list, generated with

diff -u \
  <(ssh lithium.eqiad.wmnet -- ss -tr | awk '{print $5}' | cut -d: -f1 | sort -u ) \
  <(ssh wezen.codfw.wmnet -- ss -tr | awk '{print $5}'  | cut -d: -f1 | sort -u)

The eqiad change was reverted yesterday due to (among the problem above) labservices machines hanging and not being able to successfully talk TLS with syslog servers. I'll be conducting more tests and apply the change in eqiad more gradually.

Change 383352 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: enable syslog over tls for esams

https://gerrit.wikimedia.org/r/383352

Change 386347 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: extend syslog-tls eqiad rollout

https://gerrit.wikimedia.org/r/386347

Change 386347 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: extend syslog-tls eqiad rollout

https://gerrit.wikimedia.org/r/386347

Change 386372 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] centralserver: add icinga monitoring for syslog-tls

https://gerrit.wikimedia.org/r/386372

Change 386372 merged by Filippo Giunchedi:
[operations/puppet@production] centralserver: add icinga monitoring for syslog-tls

https://gerrit.wikimedia.org/r/386372

Change 386404 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] rsyslog: bump maximum fds for rsyslog-receiver

https://gerrit.wikimedia.org/r/386404

Change 386404 merged by Filippo Giunchedi:
[operations/puppet@production] rsyslog: bump maximum fds for rsyslog-receiver

https://gerrit.wikimedia.org/r/386404

Results of today's experiments: initially we were bumping into max open files limit (fixed) and after that was fixed rsyslog was regularly crashing with sigabrt. I saved a few core dumps in root's home on wezen (running rsyslog 8.23.0-2~bpo8+1 from jessie-backports) and lithium (running jessie stock rsyslog 8.4.2-1+deb8u2). Looks like we're running into https://github.com/rsyslog/rsyslog/issues/1728

root@wezen:~# gdb /usr/sbin/rsyslogd /root/core.wezen.in\:imtcp.25779.1508941894 
GNU gdb (Debian 7.11.1-2~bpo8+1) 7.11.1
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/sbin/rsyslogd...(no debugging symbols found)...done.
[New LWP 25782]
[New LWP 25784]
[New LWP 25785]
[New LWP 25786]
[New LWP 25781]
[New LWP 25787]
[New LWP 25788]
[New LWP 25783]
[New LWP 25779]
[New LWP 25780]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/usr/sbin/rsyslogd -n -d'.
Program terminated with signal SIGABRT, Aborted.
#0  0x00007f1d8d457067 in raise () from /lib/x86_64-linux-gnu/libc.so.6
[Current thread is 1 (Thread 0x7f1d89d3f700 (LWP 25782))]
(gdb) set pagination off
(gdb) thread apply all bt

Thread 10 (Thread 0x7f1d8a541700 (LWP 25780)):
#0  0x00007f1d8d503893 in select () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007f1d8ce0c00c in ?? () from /usr/lib/x86_64-linux-gnu/rsyslog/imuxsock.so
#2  0x0000562fa3c16970 in ?? ()
#3  0x00007f1d8e3f9064 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#4  0x00007f1d8d50a62d in clone () from /lib/x86_64-linux-gnu/libc.so.6

Thread 9 (Thread 0x7f1d8ea3a780 (LWP 25779)):
#0  0x00007f1d8d503893 in select () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x0000562fa3be4388 in main ()

Thread 8 (Thread 0x7f1d8993e700 (LWP 25783)):
#0  0x00007f1d8d50ac03 in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007f1d8bdc01b6 in ?? () from /usr/lib/x86_64-linux-gnu/rsyslog/imudp.so
#2  0x00007f1d8bdc0ac9 in ?? () from /usr/lib/x86_64-linux-gnu/rsyslog/imudp.so
#3  0x00007f1d8bdc0be0 in ?? () from /usr/lib/x86_64-linux-gnu/rsyslog/imudp.so
#4  0x0000562fa3c16970 in ?? ()
#5  0x00007f1d8e3f9064 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#6  0x00007f1d8d50a62d in clone () from /lib/x86_64-linux-gnu/libc.so.6

Thread 7 (Thread 0x7f1d73fff700 (LWP 25788)):
#0  0x00007f1d8e3fd04f in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00007f1d8bfc8809 in ?? () from /usr/lib/x86_64-linux-gnu/rsyslog/lmtcpsrv.so
#2  0x00007f1d8e3f9064 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#3  0x00007f1d8d50a62d in clone () from /lib/x86_64-linux-gnu/libc.so.6

Thread 6 (Thread 0x7f1d8853a700 (LWP 25787)):
#0  0x00007f1d8e3fd04f in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00007f1d8bfc8809 in ?? () from /usr/lib/x86_64-linux-gnu/rsyslog/lmtcpsrv.so
#2  0x00007f1d8e3f9064 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#3  0x00007f1d8d50a62d in clone () from /lib/x86_64-linux-gnu/libc.so.6

Thread 5 (Thread 0x7f1d8a140700 (LWP 25781)):
#0  0x00007f1d8e3ffa9d in read () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00007f1d8cc048ef in ?? () from /usr/lib/x86_64-linux-gnu/rsyslog/imklog.so
#2  0x00007f1d8cc04c19 in klogLogKMsg () from /usr/lib/x86_64-linux-gnu/rsyslog/imklog.so
#3  0x00007f1d8cc03eb4 in ?? () from /usr/lib/x86_64-linux-gnu/rsyslog/imklog.so
#4  0x0000562fa3c16970 in ?? ()
#5  0x00007f1d8e3f9064 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#6  0x00007f1d8d50a62d in clone () from /lib/x86_64-linux-gnu/libc.so.6

Thread 4 (Thread 0x7f1d8893b700 (LWP 25786)):
#0  0x00007f1d8e3fd04f in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00007f1d8bfc8809 in ?? () from /usr/lib/x86_64-linux-gnu/rsyslog/lmtcpsrv.so
#2  0x00007f1d8e3f9064 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#3  0x00007f1d8d50a62d in clone () from /lib/x86_64-linux-gnu/libc.so.6

Thread 3 (Thread 0x7f1d88d3c700 (LWP 25785)):
#0  0x00007f1d8e3fd04f in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00007f1d8bfc8809 in ?? () from /usr/lib/x86_64-linux-gnu/rsyslog/lmtcpsrv.so
#2  0x00007f1d8e3f9064 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#3  0x00007f1d8d50a62d in clone () from /lib/x86_64-linux-gnu/libc.so.6

Thread 2 (Thread 0x7f1d8953d700 (LWP 25784)):
#0  0x00007f1d8e3ffa3d in write () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x0000562fa3bf98d6 in ?? ()
#2  0x0000562fa3bf9a9b in ?? ()
#3  0x0000562fa3bf9df9 in dbgprintf ()
#4  0x0000562fa3c35732 in cnfexprPrint ()
#5  0x0000562fa3c35a40 in cnfstmtPrintOnly ()
#6  0x0000562fa3c0d4bf in ?? ()
#7  0x0000562fa3c0dbeb in ?? ()
#8  0x0000562fa3c1d1b2 in ?? ()
#9  0x0000562fa3c09b97 in ?? ()
#10 0x0000562fa3c065be in wtiWorker ()
#11 0x0000562fa3c05649 in ?? ()
#12 0x00007f1d8e3f9064 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#13 0x00007f1d8d50a62d in clone () from /lib/x86_64-linux-gnu/libc.so.6

Thread 1 (Thread 0x7f1d89d3f700 (LWP 25782)):
#0  0x00007f1d8d457067 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007f1d8d458448 in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00007f1d8d4951b4 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x00007f1d8d51aaa7 in __fortify_fail () from /lib/x86_64-linux-gnu/libc.so.6
#4  0x00007f1d8d518cc0 in __chk_fail () from /lib/x86_64-linux-gnu/libc.so.6
#5  0x00007f1d8d51aa17 in __fdelt_warn () from /lib/x86_64-linux-gnu/libc.so.6
#6  0x00007f1d8a544e04 in ?? () from /usr/lib/x86_64-linux-gnu/rsyslog/lmnsd_ptcp.so
#7  0x00007f1d8bfc8ff3 in ?? () from /usr/lib/x86_64-linux-gnu/rsyslog/lmtcpsrv.so
#8  0x00007f1d8bfc92c7 in ?? () from /usr/lib/x86_64-linux-gnu/rsyslog/lmtcpsrv.so
#9  0x0000562fa3c16970 in ?? ()
#10 0x00007f1d8e3f9064 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#11 0x00007f1d8d50a62d in clone () from /lib/x86_64-linux-gnu/libc.so.6
(gdb)

Today rsyslogd was "stuck" accepting new connections on lithium and wezen, at about the same time. This is a strace from check_ssl on einsteinium:

einsteinium:~$ strace -f -- /usr/lib/nagios/plugins/check_ssl -H lithium.eqiad.wmnet -p 6514
...
mmap(NULL, 30123, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f882070a000
close(3)                                = 0
access("/etc/ld.so.nohwcap", F_OK)      = -1 ENOENT (No such file or directory)
open("/lib/x86_64-linux-gnu/libnss_dns.so.2", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\0\21\0\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0644, st_size=22952, ...}) = 0
mmap(NULL, 2117896, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f881d68c000
mprotect(0x7f881d691000, 2093056, PROT_NONE) = 0
mmap(0x7f881d890000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x4000) = 0x7f881d890000
close(3)                                = 0
access("/etc/ld.so.nohwcap", F_OK)      = -1 ENOENT (No such file or directory)
open("/lib/x86_64-linux-gnu/libresolv.so.2", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\220:\0\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0644, st_size=84856, ...}) = 0
mmap(NULL, 2189928, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f881d475000
mprotect(0x7f881d489000, 2093056, PROT_NONE) = 0
mmap(0x7f881d688000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x13000) = 0x7f881d688000
mmap(0x7f881d68a000, 6760, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f881d68a000
close(3)                                = 0
mprotect(0x7f881d688000, 4096, PROT_READ) = 0
mprotect(0x7f881d890000, 4096, PROT_READ) = 0
munmap(0x7f882070a000, 30123)           = 0
stat("/etc/resolv.conf", {st_mode=S_IFREG|0444, st_size=420, ...}) = 0
open("/etc/resolv.conf", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0444, st_size=420, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f8820711000
read(3, "################################"..., 4096) = 420
read(3, "", 4096)                       = 0
close(3)                                = 0
munmap(0x7f8820711000, 4096)            = 0
socket(PF_INET, SOCK_DGRAM|SOCK_NONBLOCK, IPPROTO_IP) = 3
connect(3, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("208.80.154.254")}, 16) = 0
poll([{fd=3, events=POLLOUT}], 1, 0)    = 1 ([{fd=3, revents=POLLOUT}])
sendmmsg(3, {{{msg_name(0)=NULL, msg_iov(1)=[{"+ \1\0\0\1\0\0\0\0\0\0\7lithium\5eqiad\5wmnet"..., 37}], msg_controllen=0, msg_flags=MSG_RST|MSG_ERRQUEUE|MSG_MORE|0xb60010}, 37}, {{msg_name(0)=NULL, msg_iov(1)=[{"\306\245\1\0\0\1\0\0\0\0\0\0\7lithium\5eqiad\5wmnet"..., 37}], msg_controllen=0, msg_flags=MSG_CTRUNC|MSG_TRUNC|MSG_EOR|MSG_FIN|MSG_SYN|MSG_CONFIRM|MSG_RST|MSG_ERRQUEUE|MSG_MORE|MSG_CMSG_CLOEXEC|0x9fb80000}, 37}}, 2, MSG_NOSIGNAL) = 2
poll([{fd=3, events=POLLIN}], 1, 1000)  = 1 ([{fd=3, revents=POLLIN}])
ioctl(3, FIONREAD, [53])                = 0
recvfrom(3, "+ \201\200\0\1\0\1\0\0\0\0\7lithium\5eqiad\5wmnet"..., 2048, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("208.80.154.254")}, [16]) = 53
poll([{fd=3, events=POLLIN}], 1, 999)   = 1 ([{fd=3, revents=POLLIN}])
ioctl(3, FIONREAD, [101])               = 0
recvfrom(3, "\306\245\201\200\0\1\0\0\0\1\0\0\7lithium\5eqiad\5wmnet"..., 65536, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("208.80.154.254")}, [16]) = 101
close(3)                                = 0
socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 3
ioctl(3, SNDCTL_TMR_TIMEBASE or SNDRV_TIMER_IOCTL_NEXT_DEVICE or TCGETS, 0x7ffcdfb8d970) = -1 ENOTTY (Inappropriate ioctl for device)
lseek(3, 0, SEEK_CUR)                   = -1 ESPIPE (Illegal seek)
ioctl(3, SNDCTL_TMR_TIMEBASE or SNDRV_TIMER_IOCTL_NEXT_DEVICE or TCGETS, 0x7ffcdfb8d970) = -1 ENOTTY (Inappropriate ioctl for device)
lseek(3, 0, SEEK_CUR)                   = -1 ESPIPE (Illegal seek)
fcntl(3, F_SETFD, FD_CLOEXEC)           = 0
connect(3, {sa_family=AF_INET, sin_port=htons(6514), sin_addr=inet_addr("10.64.32.154")}, 16) = 0
write(3, "\26\3\1\0\310\1\0\0\304\3\3\350\376\36d\216&b\234\333\306bN\236Oj\7\321\264`Y\272"..., 205) = 205
read(3,

Mentioned in SAL (#wikimedia-operations) [2018-02-21T16:00:30Z] <godog> restart rsyslogd on lithium and wezen - T136312

Upstream has fixed the issue, should be included in the next rsyslog release. When that happens we'll try it out on the central syslog servers.

Mentioned in SAL (#wikimedia-operations) [2018-06-12T10:21:34Z] <godog> bounce stuck rsyslog on lithium / wezen - T136312

Latest rsyslog release containing the fix is already packaged in Debian unstable, it'd be easier to backport that to stretch instead of jessie. Once we have a replacement for lithium in place (T195416) and running stretch I'll test the backport there.

Mentioned in SAL (#wikimedia-operations) [2018-10-25T15:02:05Z] <godog> test rsyslog 8.38 upgrade on lithium - T136312

Change 470382 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: enable syslog-tls in eqiad

https://gerrit.wikimedia.org/r/470382

Change 470382 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: enable syslog-tls in eqiad

https://gerrit.wikimedia.org/r/470382

This has been deployed today, we're running TLS connections everywhere in the fleet. Exceptions being hosts that had puppet disabled when https://gerrit.wikimedia.org/r/470382 and will fix themselves at the next puppet run.