Page MenuHomePhabricator

Striker error log events not getting into ELK cluster due to truncation of JSON payload
Open, Needs TriagePublic

Description

The wsgi container for Striker running on californium is configured to send logs both to local disk and to the WMF ELK stack. Investigating behavior because of a reported fatal error on irc that I could find in the error messages in the on-disk logs on californium but not in a kibana search.

I found the real problem in T151422#2858048. The UDP packets are being limited to 1500 bytes with 1472 bytes for payload. So any json string longer than 1472 will be truncated in transit and discarded by the logstash json codec.


It looks like the proper firewall holes are not open to actually allow this. (hypothesis disproven in T151422#2857363)

$ telnet logstash1003.eqiad.wmnet 11514
Trying 10.64.48.113...
# connection attempt sits open here for minutes...
telnet: Unable to connect to remote host: Connection timed out

The ferm rules that need to be fixed are likely on the logstash100[123] servers themselves.


Details

Related Gerrit Patches:

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptNov 23 2016, 12:14 AM
bd808 moved this task from Backlog to Ready on the Striker board.Dec 8 2016, 7:08 AM

The ferm rules allow $DOMAIN_NETWORKS. Californium is 208.80.154.147/26 which must at least be a part of $::network::constants::deployable_networks or scap3 wouldn't work. I guess I need to track down what $DOMAIN_NETWORKS actually expands to.

I verified that the packets are making it all the way to logstash1003, and being dropped by iptables.

here's the expanded ruleset:

ACCEPT     udp  --  10.128.0.0/24        anywhere             udp dpt:11514
ACCEPT     udp  --  10.192.0.0/22        anywhere             udp dpt:11514
ACCEPT     udp  --  10.192.16.0/22       anywhere             udp dpt:11514
ACCEPT     udp  --  10.192.20.0/24       anywhere             udp dpt:11514
ACCEPT     udp  --  10.192.21.0/24       anywhere             udp dpt:11514
ACCEPT     udp  --  10.192.32.0/22       anywhere             udp dpt:11514
ACCEPT     udp  --  10.192.48.0/22       anywhere             udp dpt:11514
ACCEPT     udp  --  10.20.0.0/24         anywhere             udp dpt:11514
ACCEPT     udp  --  10.64.0.0/22         anywhere             udp dpt:11514
ACCEPT     udp  --  10.64.16.0/22        anywhere             udp dpt:11514
ACCEPT     udp  --  10.64.20.0/24        anywhere             udp dpt:11514
ACCEPT     udp  --  10.64.21.0/24        anywhere             udp dpt:11514
ACCEPT     udp  --  10.64.32.0/22        anywhere             udp dpt:11514
ACCEPT     udp  --  10.64.36.0/24        anywhere             udp dpt:11514
ACCEPT     udp  --  10.64.37.0/24        anywhere             udp dpt:11514
ACCEPT     udp  --  10.64.4.0/24         anywhere             udp dpt:11514
ACCEPT     udp  --  localnet/22          anywhere             udp dpt:11514
ACCEPT     udp  --  10.64.5.0/24         anywhere             udp dpt:11514
ACCEPT     udp  --  10.64.52.0/24        anywhere             udp dpt:11514
ACCEPT     udp  --  10.64.53.0/24        anywhere             udp dpt:11514
ACCEPT     udp  --  198.35.26.0/28       anywhere             udp dpt:11514
ACCEPT     udp  --  text-lb.ulsfo.wikimedia.org/27  anywhere             udp dpt:11514
ACCEPT     udp  --  208.80.153.0/27      anywhere             udp dpt:11514
ACCEPT     udp  --  text-lb.codfw.wikimedia.org/27  anywhere             udp dpt:11514
ACCEPT     udp  --  208.80.153.32/27     anywhere             udp dpt:11514
ACCEPT     udp  --  208.80.153.64/27     anywhere             udp dpt:11514
ACCEPT     udp  --  208.80.153.96/27     anywhere             udp dpt:11514
ACCEPT     udp  --  208.80.154.0/26      anywhere             udp dpt:11514
ACCEPT     udp  --  208.80.154.128/26    anywhere             udp dpt:11514
ACCEPT     udp  --  text-lb.eqiad.wikimedia.org/27  anywhere             udp dpt:11514
ACCEPT     udp  --  208.80.154.64/26     anywhere             udp dpt:11514
ACCEPT     udp  --  208.80.155.96/27     anywhere             udp dpt:11514
ACCEPT     udp  --  91.198.174.0/25      anywhere             udp dpt:11514
ACCEPT     udp  --  text-lb.esams.wikimedia.org/27  anywhere             udp dpt:11514

We realized that californium is hitting logstash1003 over ipv6...

yuvipanda@logstash1003:~$ sudo ip6tables -L | grep 11514

ACCEPT     udp      2620:0:860:101::/64  anywhere             udp dpt:11514
ACCEPT     udp      2620:0:860:102::/64  anywhere             udp dpt:11514
ACCEPT     udp      2620:0:860:103::/64  anywhere             udp dpt:11514
ACCEPT     udp      2620:0:860:104::/64  anywhere             udp dpt:11514
ACCEPT     udp      2620:0:860:118::/64  anywhere             udp dpt:11514
ACCEPT     udp      2620:0:860:122::/64  anywhere             udp dpt:11514
ACCEPT     udp      2620:0:860:1::/64    anywhere             udp dpt:11514
ACCEPT     udp      2620:0:860:2::/64    anywhere             udp dpt:11514
ACCEPT     udp      2620:0:860:3::/64    anywhere             udp dpt:11514
ACCEPT     udp      2620:0:860:4::/64    anywhere             udp dpt:11514
ACCEPT     udp      2620:0:860:ed1a::/64  anywhere             udp dpt:11514
ACCEPT     udp      2620:0:861:101::/64  anywhere             udp dpt:11514
ACCEPT     udp      2620:0:861:102::/64  anywhere             udp dpt:11514
ACCEPT     udp      2620:0:861:103::/64  anywhere             udp dpt:11514
ACCEPT     udp      2620:0:861:104::/64  anywhere             udp dpt:11514
ACCEPT     udp      2620:0:861:105::/64  anywhere             udp dpt:11514
ACCEPT     udp      2620:0:861:106::/64  anywhere             udp dpt:11514
ACCEPT     udp      2620:0:861:107::/64  anywhere             udp dpt:11514
ACCEPT     udp      2620:0:861:108::/64  anywhere             udp dpt:11514
ACCEPT     udp      2620:0:861:117::/64  anywhere             udp dpt:11514
ACCEPT     udp      2620:0:861:118::/64  anywhere             udp dpt:11514
ACCEPT     udp      2620:0:861:119::/64  anywhere             udp dpt:11514
ACCEPT     udp      2620:0:861:1::/64    anywhere             udp dpt:11514
ACCEPT     udp      2620:0:861:2::/64    anywhere             udp dpt:11514
ACCEPT     udp      2620:0:861:3::/64    anywhere             udp dpt:11514
ACCEPT     udp      2620:0:861:4::/64    anywhere             udp dpt:11514
ACCEPT     udp      2620:0:861:ed1a::/64  anywhere             udp dpt:11514
ACCEPT     udp      2620:0:862:102::/64  anywhere             udp dpt:11514
ACCEPT     udp      2620:0:862:1::/64    anywhere             udp dpt:11514
ACCEPT     udp      2620:0:862:ed1a::/64  anywhere             udp dpt:11514
ACCEPT     udp      2620:0:863:101::/64  anywhere             udp dpt:11514
ACCEPT     udp      2620:0:863:1::/64    anywhere             udp dpt:11514
ACCEPT     udp      2620:0:863:ed1a::/64  anywhere             udp dpt:11514
yuvipanda@logstash1003:~$ 
yuvipanda@logstash1003:~$

So I added a TRACE but looks like there's no traffic from californium to logstash1003 in general?

So I added a TRACE but looks like there's no traffic from californium to logstash1003 in general?

The logging for Striker would be the only direct communication as far as I know. It would only happen on restart of the wsgi container or when an error is encountered in the live application at https://toolsadmin.wikimedia.org/. These are rare but important events. For testing we could temporarily disable Puppet and increase the logging verbosity.

Mentioned in SAL (#wikimedia-operations) [2016-12-08T16:30:39Z] <YuviPanda> disable puppet on californium to help debug T151422 with bd808

@yuvipanda helped me debug this further by temporarily disabling Puppet on californium and increasing the log verbosity of Striker by editing /etc/striker/striker.ini. I then tailed /srv/log/striker/*.log to verify that we were generating log events while Yuvi watched things in iptables on logstash1003.

Iptables saw events like:

16:33:46.171599 IP 208.80.154.147.60457 > 10.64.48.113.11514: UDP, length 165

The 208.80.154.147 address is in the 208.80.154.128/26 range so it should be allowed by the ingress rules.

This reminded me to check the logstash input configuration to make sure that the port 11514 traffic was properly allowed. Grep in /etc/logstash/conf.d confirmed that we have this port open:

/etc/logstash/conf.d/10-input-udp-logback.conf
input {
  udp {
    type => "logback"
    port => 11514
    codec => "json"
  }
}

The logstash configuration typically also needs a filter added to route log events from a particular input to Elasticsearch. This is a safeguard that we put in early on to prevent accidentally DOS attacking ourselves with accidental log event traffic. The input will tag events with type=logback so lets look at the filters for that type:

/etc/logstash/conf.d/20-filter_logback.conf
# vim:set sw=2 ts=2 sts=2 et
# Parse logback input
filter {
  if [type] == "logback" {
    # General message cleanup
    mutate {
      replace => [ "host", "%{HOSTNAME}" ]
      add_tag => [ "logback", "es" ]
    }

    if [program] == "cassandra" {
      mutate {
        replace => [ "type",  "cassandra" ]
      }
    } # end [program] == "cassandra"
  } # end [type] == "logback"
}

Looks ok. Anything tagged type=logback will route to Elasticsearch because of the "es" tag that is added. Hmmm... but there is another quirk of the input filters. The type will only be set on records that do not already have a type in the event payload. What do the log events actually look like?

$ sudo tcpdump -vvX port 11514 and src 208.80.154.147
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 262144 byt
es
16:41:47.479520 IP (tos 0x0, ttl 63, id 38683, offset 0, flags [DF], proto UDP (
17), length 495)
    californium.wikimedia.org.57453 > logstash1003.eqiad.wmnet.11514: [udp sum o
k] UDP, length 467
        0x0000:  4500 01ef 971b 4000 3f11 fd4d d050 9a93  E.....@.?..M.P..
        0x0010:  0a40 3071 e06d 2cfa 01db 9d10 7b22 7479  .@0q.m,.....{"ty
        0x0020:  7065 223a 2022 7374 7269 6b65 7222 2c20  pe":."striker",.
        0x0030:  2268 6f73 7422 3a20 2263 616c 6966 6f72  "host":."califor
        0x0040:  6e69 756d 222c 2022 7265 7175 6573 745f  nium",."request_
        0x0050:  6964 223a 2022 6566 3331 3561 6564 3033  id":."ef315aed03
        0x0060:  3837 3435 6332 6232 6534 6163 3030 3033  8745c2b2e4ac0003
        0x0070:  3261 6361 3737 222c 2022 6c65 7665 6c22  2aca77",."level"
        0x0080:  3a20 2244 4542 5547 222c 2022 6c6f 6767  :."DEBUG",."logg
        0x0090:  6572 5f6e 616d 6522 3a20 2264 6a61 6e67  er_name":."djang
        0x00a0:  6f5f 6175 7468 5f6c 6461 7022 2c20 2270  o_auth_ldap",."p
        0x00b0:  6174 6822 3a20 222f 7372 762f 6465 706c  ath":."/srv/depl
        0x00c0:  6f79 6d65 6e74 2f73 7472 696b 6572 2f76  oyment/striker/v
        0x00d0:  656e 762f 6c69 622f 7079 7468 6f6e 332e  env/lib/python3.
        0x00e0:  342f 7369 7465 2d70 6163 6b61 6765 732f  4/site-packages/
        0x00f0:  646a 616e 676f 5f61 7574 685f 6c64 6170  django_auth_ldap
        0x0100:  2f63 6f6e 6669 672e 7079 222c 2022 7374  /config.py",."st
        0x0110:  6163 6b5f 696e 666f 223a 206e 756c 6c2c  ack_info":.null,
        0x0120:  2022 4074 696d 6573 7461 6d70 223a 2022  ."@timestamp":."
        0x0130:  3230 3136 2d31 322d 3038 5431 363a 3431  2016-12-08T16:41
        0x0140:  3a34 372e 3437 385a 222c 2022 4076 6572  :47.478Z",."@ver
        0x0150:  7369 6f6e 223a 2022 3122 2c20 2274 6167  sion":."1",."tag
        0x0160:  7322 3a20 5b5d 2c20 226d 6573 7361 6765  s":.[],."message
        0x0170:  223a 2022 7365 6172 6368 5f73 2827 6f75  ":."search_s('ou
        0x0180:  3d70 656f 706c 652c 6463 3d77 696b 696d  =people,dc=wikim
        0x0190:  6564 6961 2c64 633d 6f72 6727 2c20 312c  edia,dc=org',.1,
        0x01a0:  2027 2863 6e3d 2528 7573 6572 2973 2927  .'(cn=%(user)s)'
        0x01b0:  2920 7265 7475 726e 6564 2031 206f 626a  ).returned.1.obj
        0x01c0:  6563 7473 3a20 7569 643d 6264 3830 382c  ects:.uid=bd808,
        0x01d0:  6f75 3d70 656f 706c 652c 6463 3d77 696b  ou=people,dc=wik
        0x01e0:  696d 6564 6961 2c64 633d 6f72 6722 7d    imedia,dc=org"}

Bam! Striker is already adding a type to the json payload, so the rules in 20-filter_logback.conf won't be applied. We need a filter for type=striker.

$ sudo grep striker /etc/logstash/conf.d/*
/etc/logstash/conf.d/50-filter_striker.conf:  if [type] == "striker" {
/etc/logstash/conf.d/50-filter_striker.conf:  } # end [type] == "striker"

We have one? I didn't forget about this before?

/etc/logstash/conf.d/50-filter_striker.conf
# vim:set sw=2 ts=2 sts=2 et
# Parse Striker input
#
# Striker uses python-logstash to send in UDP datagrams with JSON payload on
# the same port as is used for logback messaging.
filter {
  if [type] == "striker" {
    mutate {
      add_tag => [ "es" ]
    }
  } # end [type] == "striker"
}

So that's all we need in theory. Let's go look at in kibana again. And there are log events?!

The telnet in the initial report is bogus because this is a UDP event stream not a TCP stream. But I started testing because I had a user report of a fatal error and I found it in the on-disk logs and not in kibana. So now I need to craft a repeatable fatal and check that output with tcpdump to see if the logging is different or possibly not being routed to the UDP output for some reason.

bd808 renamed this task from Striker error logs not getting into ELK cluster to Some Striker errors not getting into ELK cluster.Dec 8 2016, 5:19 PM
bd808 updated the task description. (Show Details)

The telnet in the initial report is bogus because this is a UDP event stream not a TCP stream. But I started testing because I had a user report of a fatal error and I found it in the on-disk logs and not in kibana. So now I need to craft a repeatable fatal and check that output with tcpdump to see if the logging is different or possibly not being routed to the UDP output for some reason.

When I filter out debug and info events in the last 30 days there is only one warning level log event left. The bug report in T151409: ToolsAdmin: Error 500 for ZppixBot repo listing certainly falls in this 30 day window, so the hypothesis that fatal errors don't make it to kibana is looking a bit stronger.

The Python stack traces are too long to encode in a UDP packet.

20:22:04.269268 IP (tos 0x0, ttl 63, id 11981, offset 0, flags [+], proto UDP (17), length 1500)
    californium.wikimedia.org.37439 > logstash1003.eqiad.wmnet.11514: UDP, bad length 5327 > 1472
        0x0000:  4500 05dc 2ecd 2000 3f11 81af d050 9a93  E.......?....P..
        0x0010:  0a40 3071 923f 2cfa 14d7 c6a1 7b22 7374  .@0q.?,.....{"st
        0x0020:  6174 7573 5f63 6f64 6522 3a20 3530 302c  atus_code":.500,
        0x0030:  2022 7072 6f63 6573 7322 3a20 3231 3131  ."process":.2111
        0x0040:  332c 2022 7374 6163 6b5f 7472 6163 6522  3,."stack_trace"
        0x0050:  3a20 2254 7261 6365 6261 636b 2028 6d6f  :."Traceback.(mo
        0x0060:  7374 2072 6563 656e 7420 6361 6c6c 206c  st.recent.call.l
        0x0070:  6173 7429 3a5c 6e20 2046 696c 6520 5c22  ast):\n..File.\"
        0x0080:  2f73 7276 2f64 6570 6c6f 796d 656e 742f  /srv/deployment/
        0x0090:  7374 7269 6b65 722f 7665 6e76 2f6c 6962  striker/venv/lib
        0x00a0:  2f70 7974 686f 6e33 2e34 2f73 6974 652d  /python3.4/site-
        0x00b0:  7061 636b 6167 6573 2f64 6a61 6e67 6f2f  packages/django/
        0x00c0:  636f 7265 2f75 726c 7265 736f 6c76 6572  core/urlresolver
        0x00d0:  732e 7079 5c22 2c20 6c69 6e65 2035 3634  s.py\",.line.564
        0x00e0:  2c20 696e 2072 6576 6572 7365 5c6e 2020  ,.in.reverse\n..
        0x00f0:  2020 6578 7472 612c 2072 6573 6f6c 7665  ..extra,.resolve
        0x0100:  7220 3d20 7265 736f 6c76 6572 2e6e 616d  r.=.resolver.nam
        0x0110:  6573 7061 6365 5f64 6963 745b 6e73 5d5c  espace_dict[ns]\
        0x0120:  6e4b 6579 4572 726f 723a 2027 7374 7269  nKeyError:.'stri
        0x0130:  6b65 722e 746f 6f6c 7327 5c6e 5c6e 4475  ker.tools'\n\nDu
        0x0140:  7269 6e67 2068 616e 646c 696e 6720 6f66  ring.handling.of
        0x0150:  2074 6865 2061 626f 7665 2065 7863 6570  .the.above.excep
        0x0160:  7469 6f6e 2c20 616e 6f74 6865 7220 6578  tion,.another.ex
        0x0170:  6365 7074 696f 6e20 6f63 6375 7272 6564  ception.occurred
        0x0180:  3a5c 6e5c 6e54 7261 6365 6261 636b 2028  :\n\nTraceback.(
        0x0190:  6d6f 7374 2072 6563 656e 7420 6361 6c6c  most.recent.call
        0x01a0:  206c 6173 7429 3a5c 6e20 2046 696c 6520  .last):\n..File.
        0x01b0:  5c22 2f73 7276 2f64 6570 6c6f 796d 656e  \"/srv/deploymen
        0x01c0:  742f 7374 7269 6b65 722f 7665 6e76 2f6c  t/striker/venv/l
        0x01d0:  6962 2f70 7974 686f 6e33 2e34 2f73 6974  ib/python3.4/sit
        0x01e0:  652d 7061 636b 6167 6573 2f64 6a61 6e67  e-packages/djang
        0x01f0:  6f2f 7465 6d70 6c61 7465 2f64 6566 6175  o/template/defau
        0x0200:  6c74 7461 6773 2e70 795c 222c 206c 696e  lttags.py\",.lin
        0x0210:  6520 3530 312c 2069 6e20 7265 6e64 6572  e.501,.in.render
        0x0220:  5c6e 2020 2020 6375 7272 656e 745f 6170  \n....current_ap
        0x0230:  703d 6375 7272 656e 745f 6170 7029 5c6e  p=current_app)\n
        0x0240:  2020 4669 6c65 205c 222f 7372 762f 6465  ..File.\"/srv/de
        0x0250:  706c 6f79 6d65 6e74 2f73 7472 696b 6572  ployment/striker
        0x0260:  2f76 656e 762f 6c69 622f 7079 7468 6f6e  /venv/lib/python
        0x0270:  332e 342f 7369 7465 2d70 6163 6b61 6765  3.4/site-package
        0x0280:  732f 646a 616e 676f 2f63 6f72 652f 7572  s/django/core/ur
        0x0290:  6c72 6573 6f6c 7665 7273 2e70 795c 222c  lresolvers.py\",
        0x02a0:  206c 696e 6520 3537 342c 2069 6e20 7265  .line.574,.in.re
        0x02b0:  7665 7273 655c 6e20 2020 206b 6579 295c  verse\n....key)\
        0x02c0:  6e64 6a61 6e67 6f2e 636f 7265 2e75 726c  ndjango.core.url
        0x02d0:  7265 736f 6c76 6572 732e 4e6f 5265 7665  resolvers.NoReve
        0x02e0:  7273 654d 6174 6368 3a20 2773 7472 696b  rseMatch:.'strik
        0x02f0:  6572 2e74 6f6f 6c73 2720 6973 206e 6f74  er.tools'.is.not
        0x0300:  2061 2072 6567 6973 7465 7265 6420 6e61  .a.registered.na
        0x0310:  6d65 7370 6163 655c 6e5c 6e44 7572 696e  mespace\n\nDurin
        0x0320:  6720 6861 6e64 6c69 6e67 206f 6620 7468  g.handling.of.th
        0x0330:  6520 6162 6f76 6520 6578 6365 7074 696f  e.above.exceptio
        0x0340:  6e2c 2061 6e6f 7468 6572 2065 7863 6570  n,.another.excep
        0x0350:  7469 6f6e 206f 6363 7572 7265 643a 5c6e  tion.occurred:\n
        0x0360:  5c6e 5472 6163 6562 6163 6b20 286d 6f73  \nTraceback.(mos
        0x0370:  7420 7265 6365 6e74 2063 616c 6c20 6c61  t.recent.call.la
        0x0380:  7374 293a 5c6e 2020 4669 6c65 205c 222f  st):\n..File.\"/
        0x0390:  7372 762f 6465 706c 6f79 6d65 6e74 2f73  srv/deployment/s
        0x03a0:  7472 696b 6572 2f76 656e 762f 6c69 622f  triker/venv/lib/
        0x03b0:  7079 7468 6f6e 332e 342f 7369 7465 2d70  python3.4/site-p
        0x03c0:  6163 6b61 6765 732f 646a 616e 676f 2f63  ackages/django/c
        0x03d0:  6f72 652f 6861 6e64 6c65 7273 2f62 6173  ore/handlers/bas
        0x03e0:  652e 7079 5c22 2c20 6c69 6e65 2031 3332  e.py\",.line.132
        0x03f0:  2c20 696e 2067 6574 5f72 6573 706f 6e73  ,.in.get_respons
        0x0400:  655c 6e20 2020 2072 6573 706f 6e73 6520  e\n....response.
        0x0410:  3d20 7772 6170 7065 645f 6361 6c6c 6261  =.wrapped_callba
        0x0420:  636b 2872 6571 7565 7374 2c20 2a63 616c  ck(request,.*cal
        0x0430:  6c62 6163 6b5f 6172 6773 2c20 2a2a 6361  lback_args,.**ca
        0x0440:  6c6c 6261 636b 5f6b 7761 7267 7329 5c6e  llback_kwargs)\n
        0x0450:  2020 4669 6c65 205c 222e 2f73 7472 696b  ..File.\"./strik
        0x0460:  6572 2f74 6f6f 6c73 2f76 6965 7773 2e70  er/tools/views.p
        0x0470:  795c 222c 206c 696e 6520 3539 2c20 696e  y\",.line.59,.in
        0x0480:  2064 6563 6f72 6174 6564 5c6e 2020 2020  .decorated\n....
        0x0490:  7265 7475 726e 2066 282a 6172 6773 2c20  return.f(*args,.
        0x04a0:  2a2a 6b77 6172 6773 295c 6e20 2046 696c  **kwargs)\n..Fil
        0x04b0:  6520 5c22 2e2f 7374 7269 6b65 722f 746f  e.\"./striker/to
        0x04c0:  6f6c 732f 7669 6577 732e 7079 5c22 2c20  ols/views.py\",.
        0x04d0:  6c69 6e65 2031 3032 2c20 696e 2074 6f6f  line.102,.in.too
        0x04e0:  6c5c 6e20 2020 2027 6361 6e5f 6564 6974  l\n....'can_edit
        0x04f0:  273a 2074 6f6f 6c5f 6d65 6d62 6572 2874  ':.tool_member(t
        0x0500:  6f6f 6c2c 2072 6571 2e75 7365 7229 2c5c  ool,.req.user),\
        0x0510:  6e20 2046 696c 6520 5c22 2f73 7276 2f64  n..File.\"/srv/d
        0x0520:  6570 6c6f 796d 656e 742f 7374 7269 6b65  eployment/strike
        0x0530:  722f 7665 6e76 2f6c 6962 2f70 7974 686f  r/venv/lib/pytho
        0x0540:  6e33 2e34 2f73 6974 652d 7061 636b 6167  n3.4/site-packag
        0x0550:  6573 2f64 6a61 6e67 6f2f 7368 6f72 7463  es/django/shortc
        0x0560:  7574 732e 7079 5c22 2c20 6c69 6e65 2036  uts.py\",.line.6
        0x0570:  372c 2069 6e20 7265 6e64 6572 5c6e 2020  7,.in.render\n..
        0x0580:  2020 7465 6d70 6c61 7465 5f6e 616d 652c  ..template_name,
        0x0590:  2063 6f6e 7465 7874 2c20 7265 7175 6573  .context,.reques
        0x05a0:  743d 7265 7175 6573 742c 2075 7369 6e67  t=request,.using
        0x05b0:  3d75 7369 6e67 295c 6e20 2046 696c 6520  =using)\n..File.
        0x05c0:  5c22 2f73 7276 2f64 6570 6c6f 796d 656e  \"/srv/deploymen
        0x05d0:  742f 7374 7269 6b65 722f 7665            t/striker/ve
bd808 renamed this task from Some Striker errors not getting into ELK cluster to Striker error log events not getting into ELK cluster due to UDP truncation of JSON payload.Dec 8 2016, 8:27 PM
bd808 updated the task description. (Show Details)

Apparently californium can't negotiate a packet size larger than 1500 bytes with the logstash servers.

californium:~
bd808$ python3
Python 3.4.3 (default, Oct 14 2015, 20:28:29)
[GCC 4.8.4] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import IN, socket, sys
>>> s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
>>> s.connect(('logstash1003.eqiad.wmnet',11514))
>>> s.setsockopt(socket.IPPROTO_IP, IN.IP_MTU_DISCOVER, IN.IP_PMTUDISC_DO)
>>> try:
...     s.send(b'#' * 65000)
... except socket.error:
...     option = getattr(IN, 'IP_MTU', 14) # constant taken from <linux/in.h>
...     print('MTU:', s.getsockopt(socket.IPPROTO_IP, option))
...
MTU: 1500

Change 329020 had a related patch set uploaded (by BryanDavis):
logstash: Add a json_lines TCP input

https://gerrit.wikimedia.org/r/329020

Change 329020 merged by Andrew Bogott:
logstash: Add a json_lines TCP input

https://gerrit.wikimedia.org/r/329020

Change 350638 had a related patch set uploaded (by BryanDavis; owner: Bryan Davis):
[labs/striker@master] Switch logstash transport from UDP to TCP

https://gerrit.wikimedia.org/r/350638

Change 350638 merged by jenkins-bot:
[labs/striker@master] Switch logstash transport from UDP to TCP

https://gerrit.wikimedia.org/r/350638

Switching to logstash.TCPLogstashHandler does not seem to have worked. Errors can be triggered by loading https://toolsadmin.wikimedia.org/tools/id/bd808-test2 for further debugging. That tool deliberately has a malformed for a diffusion repo in the database with the name tool-bd808-test2-!T151422! to trigger this bug.

With the new log event routing introduced in T217932 events are getting to the ELK cluster. Its not all perfect yet however. The log output from the app is being seen in journald split into lines where the payload seems to be maxing out at 2048 bytes. Here is an example of a single stack trace generated from the known bad https://toolsadmin.wikimedia.org/tools/id/bd808-test2 page of the app:

$ sudo journalctl --no-pager -u uwsgi-striker
Mar 28 22:16:57 labweb1002 uwsgi-striker[1524]: @cee: {"host": "labweb1002", "stack_info": null, "type": "striker", "request": "<WSGIRequest: GET '/tools/id/bd808-test2'>", "@timestamp": "2019-03-28T22:16:57.887Z", "level": "ERROR", "stack_trace": "Traceback (most recent call last):\n  File \"/srv/deployment/striker/venv/lib/python3.5/site-packages/django/core/handlers/exception.py\", line 34, in inner\n    response = get_response(request)\n  File \"/srv/deployment/striker/venv/lib/python3.5/site-packages/django/core/handlers/base.py\", line 126, in _get_response\n    response = self.process_exception_by_middleware(e, request)\n  File \"/srv/deployment/striker/venv/lib/python3.5/site-packages/django/core/handlers/base.py\", line 124, in _get_response\n    response = wrapped_callback(request, *callback_args, **callback_kwargs)\n  File \"./striker/tools/views/decorators.py\", line 48, in decorated\n    return f(*args, **kwargs)\n  File \"./striker/tools/views/tool.py\", line 63, in view\n    'can_suppress': False,\n  File \"/srv/deployment/striker/venv/lib/python3.5/site-packages/django/shortcuts.py\", line 36, in render\n    content = loader.render_to_string(template_name, context, request, using=using)\n  File \"/srv/deployment/striker/venv/lib/python3.5/site-packages/django/template/loader.py\", line 62, in render_to_string\n    return template.render(context, request)\n  File \"/srv/deployment/striker/venv/lib/python3.5/site-packages/django/template/backends/django.py\", line 61, in render\n    return self.template.render(context)\n  File \"/srv/deployment/striker/venv/lib/python3.5/site-packages/django/template/base.py\", line 171, in render\n    return self._render(context)\n  File \"/srv/deployment/striker/venv/lib/python3.5/site-packages/django/template/base.py\", line 163, in _render\n    return self.nodelist.render(context)\n  File \"/srv/deployment/striker/venv/lib/python3.5/site-packages/django/template/base.py\", line 937, in render\n    bit = node.render_annotated(context)\n  File \"/srv/deployment/striker/venv/lib/python3.5/site-packages/django/te
Mar 28 22:16:57 labweb1002 uwsgi-striker[1524]: mplate/base.py\", line 904, in render_annotated\n    return self.render(context)\n  File \"/srv/deployment/striker/venv/lib/python3.5/site-packages/django/template/loader_tags.py\", line 150, in render\n    return compiled_parent._render(context)\n  File \"/srv/deployment/striker/venv/lib/python3.5/site-packages/django/template/base.py\", line 163, in _render\n    return self.nodelist.render(context)\n  File \"/srv/deployment/striker/venv/lib/python3.5/site-packages/django/template/base.py\", line 937, in render\n    bit = node.render_annotated(context)\n  File \"/srv/deployment/striker/venv/lib/python3.5/site-packages/django/template/base.py\", line 904, in render_annotated\n    return self.render(context)\n  File \"/srv/deployment/striker/venv/lib/python3.5/site-packages/django/template/loader_tags.py\", line 62, in render\n    result = block.nodelist.render(context)\n  File \"/srv/deployment/striker/venv/lib/python3.5/site-packages/django/template/base.py\", line 937, in render\n    bit = node.render_annotated(context)\n  File \"/srv/deployment/striker/venv/lib/python3.5/site-packages/django/template/base.py\", line 904, in render_annotated\n    return self.render(context)\n  File \"/srv/deployment/striker/venv/lib/python3.5/site-packages/django/template/defaulttags.py\", line 209, in render\n    nodelist.append(node.render_annotated(context))\n  File \"/srv/deployment/striker/venv/lib/python3.5/site-packages/django/template/base.py\", line 904, in render_annotated\n    return self.render(context)\n  File \"/srv/deployment/striker/venv/lib/python3.5/site-packages/django/template/defaulttags.py\", line 442, in render\n    url = reverse(view_name, args=args, kwargs=kwargs, current_app=current_app)\n  File \"/srv/deployment/striker/venv/lib/python3.5/site-packages/django/urls/base.py\", line 90, in reverse\n    return iri_to_uri(resolver._reverse_with_prefix(view, prefix, *args, **kwargs))\n  File \"/srv/deployment/striker/venv/lib/python3.5/site-packages/django/urls/resolvers.py\", line 622, in _reverse_with_pre
Mar 28 22:16:57 labweb1002 uwsgi-striker[1524]: fix\n    raise NoReverseMatch(msg)\ndjango.urls.exceptions.NoReverseMatch: Reverse for 'repo_view' with keyword arguments '{'repo': 'tool-bd808-test2-!T151422!', 'tool': 'bd808-test2'}' not found. 1 pattern(s) tried: ['tools\\\\/id\\\\/(?P<tool>[-a-zA-Z0-9_]+)\\\\/repos\\\\/id\\\\/(?P<repo>[-a-zA-Z0-9_]+)$']\n", "lineno": 228, "@version": "1", "path": "/srv/deployment/striker/venv/lib/python3.5/site-packages/django/utils/log.py", "logger_name": "django.request", "status_code": 500, "message": "Internal Server Error: /tools/id/bd808-test2", "process": 1573, "thread_name": "b'uWSGIWorker48Core0'", "tags": []}

I need to dig into the logging config on the app to see if it is doing this splitting. If not then the next place to check will be the uwsgi process, and ultimately journald itself. We are making progress though! :)

bd808 renamed this task from Striker error log events not getting into ELK cluster due to UDP truncation of JSON payload to Striker error log events not getting into ELK cluster due to truncation of JSON payload.Mar 28 2019, 10:38 PM

I need to dig into the logging config on the app to see if it is doing this splitting. If not then the next place to check will be the uwsgi process, and ultimately journald itself. We are making progress though! :)

I have been poking around with config changes in my local development environment. Here's a bit of what I have found:

  • I can reproduce message splitting at 2048 bytes of message body in journald by setting up uwsgi to log to stdout/err like is done in prod
  • If I run sudo /usr/bin/uwsgi --die-on-term --autoload --ini /etc/uwsgi/apps-enabled/striker.ini to put the uwsgi process in the foreground messages are not split. Verified by using tee to copy the log output from stdout to a file where counting newlines is easier.

This makes me strongly suspect journald as the source of line splitting. Splitting on a 2K boundary is the same issue as https://bugs.freedesktop.org/show_bug.cgi?id=86465. That bug was addressed in https://github.com/systemd/systemd/pull/4875 and its follow up at https://github.com/systemd/systemd/pull/6838.

The version of systemd that we are running should in theory include both fixes, but I can recreate all the splits documented at https://bugs.freedesktop.org/show_bug.cgi?id=86465#c8 still:

  • perl -e "print 'MESSAGE=', 'a'x4096, 'b'" | logger --journald == 1 message (4097 bytes of body)
  • perl -e "print 'a'x4096, 'b'" | systemd-cat == 3 messages (2048 + 2048 + 1 bytes of body)
  • perl -e "print 'a'x4096, 'b'" | logger == 5 messages (995 + 995 + 995 + 995 + 117 bytes of body)
$ ssh labweb1001.wikimedia.org
$ dpkg -l|grep ^ii|awk '{printf "%-20s %s\n", $2, $3}'|grep systemd
libpam-systemd:amd64 232-25+deb9u11
libsystemd0:amd64    232-25+deb9u11
systemd              232-25+deb9u11
systemd-sysv         232-25+deb9u11

I got confused by the tarball that apt-get source systemd pulled down for me. We are actually running systemd v232-25 and that predates the fix for the 2K splitting issue which I believe first shipped in v235. In theory this will get fixed magically when we upgrade to Buster and get systemd v241-5.