Page MenuHomePhabricator

intermittent brief data dropouts for esams netflow data
Closed, ResolvedPublic

Description

This graph (past 24h of packets/minute, broken down by exporter IP aka router IP) shows the problem quite clearly:


Closeup on an interval with several dropouts:

It occurs only for cr{2,3}-esams and cr2-knams, and not for any other timeseries. So, it seems likely that whatever is broken is at some chokepoint common to all three routers.

LibreNMS shows the switch port utilization of ganeti3003 (which runs netflow3001) as being quite low: ~400Mbps/10Gbps. nic_saturation_exporter agrees there's no warmth there.

We increased the UDP socket SO_RCVBUF of both samplicator (https://gerrit.wikimedia.org/r/c/operations/puppet/+/596648) and pmacctd (https://gerrit.wikimedia.org/r/c/operations/puppet/+/596660) on netflow3001, and while that almost entirely stopped the packet drops that were being shown in netflow3001's kernel socket statistics, it didn't seem to change the rate at which the problem occurs.

Event Timeline

CDanis created this task.May 19 2020, 4:28 PM
Restricted Application added a project: Operations. · View Herald TranscriptMay 19 2020, 4:28 PM
Restricted Application added a subscriber: Aklapper. · View Herald Transcript
CDanis triaged this task as Low priority.May 19 2020, 4:28 PM
CDanis added a subscriber: faidon.

Relevant Turnilo where lots of things happened in a short timeframe:

  • The very far left is when T248394 got deployed in AMS
  • The transition gray/color is when T246186 got deployed
  • The first "drops" in the colored area are when T248980 got deployed

And that last one seems to be when things got worse.
Maybe it's librdkafka having an issue with the extra TLS (processing/timeout/latency/etc).

Joe added a subscriber: Joe.May 20 2020, 7:06 AM

Looking at kafka, it seems there is a bizarre pattern in producing the data to the "netflow" topic:

https://grafana.wikimedia.org/d/000000234/kafka-by-topic?panelId=34&fullscreen&orgId=1&refresh=5m&var-datasource=eqiad%20prometheus%2Fops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=All&var-topic=netflow

while there is no such pattern for the wmf_netflow topic (what's the difference between the two?)

CDanis added a subscriber: elukey.May 20 2020, 12:43 PM

There's lots of this in kern.log on netflow3001:

May 26 15:15:03 netflow3001 kernel: [1214948.953863] nfacctd[422]: segfault at 0 ip 00007f54d61ebb17 sp 00007ffdcefb36e8 error 4 in librdkafka.so.1[7f54d61e9000+9d000]
May 26 15:15:03 netflow3001 kernel: [1214948.953870] Code: 65 05 00 48 83 c4 08 89 e8 5b 5d 41 5c 41 5d c3 bd 42 ff ff ff eb d4 e8 47 ec ff ff 0f 1f 80 00 00 00 00 4c 8d 05 29 2b 00 00 <48> 8b 3f 45 31 c9 31 d2 b9 01 00 00 00 e9 47 14 03 00 0f 1f 80 00
May 26 15:16:04 netflow3001 kernel: [1215009.857717] nfacctd[676]: segfault at 0 ip 00007f54d61ebb17 sp 00007ffdcefb36e8 error 4 in librdkafka.so.1[7f54d61e9000+9d000]
May 26 15:16:04 netflow3001 kernel: [1215009.857735] Code: 65 05 00 48 83 c4 08 89 e8 5b 5d 41 5c 41 5d c3 bd 42 ff ff ff eb d4 e8 47 ec ff ff 0f 1f 80 00 00 00 00 4c 8d 05 29 2b 00 00 <48> 8b 3f 45 31 c9 31 d2 b9 01 00 00 00 e9 47 14 03 00 0f 1f 80 00
May 26 15:24:04 netflow3001 kernel: [1215490.159680] nfacctd[2544]: segfault at 0 ip 00007f54d61ebb17 sp 00007ffdcefb36e8 error 4 in librdkafka.so.1[7f54d61e9000+9d000]
May 26 15:24:04 netflow3001 kernel: [1215490.159699] Code: 65 05 00 48 83 c4 08 89 e8 5b 5d 41 5c 41 5d c3 bd 42 ff ff ff eb d4 e8 47 ec ff ff 0f 1f 80 00 00 00 00 4c 8d 05 29 2b 00 00 <48> 8b 3f 45 31 c9 31 d2 b9 01 00 00 00 e9 47 14 03 00 0f 1f 80 00
May 26 16:24:03 netflow3001 kernel: [1219089.036484] nfacctd[19200]: segfault at 0 ip 00007f54d61ebb17 sp 00007ffdcefb36e8 error 4 in librdkafka.so.1[7f54d61e9000+9d000]
May 26 16:24:03 netflow3001 kernel: [1219089.036491] Code: 65 05 00 48 83 c4 08 89 e8 5b 5d 41 5c 41 5d c3 bd 42 ff ff ff eb d4 e8 47 ec ff ff 0f 1f 80 00 00 00 00 4c 8d 05 29 2b 00 00 <48> 8b 3f 45 31 c9 31 d2 b9 01 00 00 00 e9 47 14 03 00 0f 1f 80 00
May 26 16:33:03 netflow3001 kernel: [1219628.700557] nfacctd[21331]: segfault at 0 ip 00007f54d61ebb17 sp 00007ffdcefb36e8 error 4 in librdkafka.so.1[7f54d61e9000+9d000]
May 26 16:33:03 netflow3001 kernel: [1219628.700566] Code: 65 05 00 48 83 c4 08 89 e8 5b 5d 41 5c 41 5d c3 bd 42 ff ff ff eb d4 e8 47 ec ff ff 0f 1f 80 00 00 00 00 4c 8d 05 29 2b 00 00 <48> 8b 3f 45 31 c9 31 d2 b9 01 00 00 00 e9 47 14 03 00 0f 1f 80 00
May 26 16:34:03 netflow3001 kernel: [1219688.759653] nfacctd[21563]: segfault at 0 ip 00007f54d61ebb17 sp 00007ffdcefb36e8 error 4 in librdkafka.so.1[7f54d61e9000+9d000]
May 26 16:34:03 netflow3001 kernel: [1219688.759662] Code: 65 05 00 48 83 c4 08 89 e8 5b 5d 41 5c 41 5d c3 bd 42 ff ff ff eb d4 e8 47 ec ff ff 0f 1f 80 00 00 00 00 4c 8d 05 29 2b 00 00 <48> 8b 3f 45 31 c9 31 d2 b9 01 00 00 00 e9 47 14 03 00 0f 1f 80 00
May 26 16:36:03 netflow3001 kernel: [1219808.701503] nfacctd[22643]: segfault at 0 ip 00007f54d61ebb17 sp 00007ffdcefb36e8 error 4 in librdkafka.so.1[7f54d61e9000+9d000]
May 26 16:36:03 netflow3001 kernel: [1219808.701513] Code: 65 05 00 48 83 c4 08 89 e8 5b 5d 41 5c 41 5d c3 bd 42 ff ff ff eb d4 e8 47 ec ff ff 0f 1f 80 00 00 00 00 4c 8d 05 29 2b 00 00 <48> 8b 3f 45 31 c9 31 d2 b9 01 00 00 00 e9 47 14 03 00 0f 1f 80 00

Seems to correlate with these events in the nfacctd journal output:

May 26 16:33:01 netflow3001 nfacctd[21331]: INFO ( default_kafka/kafka ): *** Purging cache - START (PID: 21331) ***
May 26 16:33:03 netflow3001 nfacctd[21331]: ERROR ( default_kafka/kafka ): Failed to produce to topic netflow partition -1: Local: Queue full
May 26 16:33:03 netflow3001 nfacctd[21331]: ERROR ( default_kafka/kafka ): Connection failed to Kafka: p_kafka_close()
May 26 16:34:01 netflow3001 nfacctd[21563]: INFO ( default_kafka/kafka ): [/etc/pmacct/librdkafka.conf] Reading librdkafka global config.
May 26 16:34:01 netflow3001 nfacctd[21563]: INFO ( default_kafka/kafka ): *** Purging cache - START (PID: 21563) ***
May 26 16:34:03 netflow3001 nfacctd[21563]: ERROR ( default_kafka/kafka ): Failed to produce to topic netflow partition -1: Local: Queue full
May 26 16:34:03 netflow3001 nfacctd[21563]: ERROR ( default_kafka/kafka ): Connection failed to Kafka: p_kafka_close()
May 26 16:35:01 netflow3001 nfacctd[21783]: INFO ( default_kafka/kafka ): [/etc/pmacct/librdkafka.conf] Reading librdkafka global config.
May 26 16:35:01 netflow3001 nfacctd[21783]: INFO ( default_kafka/kafka ): *** Purging cache - START (PID: 21783) ***
May 26 16:35:06 netflow3001 nfacctd[21783]: INFO ( default_kafka/kafka ): *** Purging cache - END (PID: 21783, QN: 119429/119429, ET: 2) ***
May 26 16:36:01 netflow3001 nfacctd[22643]: INFO ( default_kafka/kafka ): [/etc/pmacct/librdkafka.conf] Reading librdkafka global config.
May 26 16:36:01 netflow3001 nfacctd[22643]: INFO ( default_kafka/kafka ): *** Purging cache - START (PID: 22643) ***
May 26 16:36:03 netflow3001 nfacctd[22643]: ERROR ( default_kafka/kafka ): Failed to produce to topic netflow partition -1: Local: Queue full
May 26 16:36:03 netflow3001 nfacctd[22643]: ERROR ( default_kafka/kafka ): Connection failed to Kafka: p_kafka_close()

Mentioned in SAL (#wikimedia-operations) [2020-05-26T17:47:00Z] <cdanis> netflow3001: disabling puppet and testing some pmacct/librdkafka config tweaks T253128

Change 598841 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] pmacctd: various increases to buffer sizes

https://gerrit.wikimedia.org/r/598841

One more of these today:

May 27 18:33:01 netflow3001 nfacctd[28442]: INFO ( default_kafka/kafka ): [/etc/pmacct/librdkafka.conf] Reading librdkafka global config.                                                    
May 27 18:33:01 netflow3001 nfacctd[28442]: INFO ( default_kafka/kafka ): *** Purging cache - START (PID: 28442) ***                                                                         
May 27 18:33:14 netflow3001 nfacctd[28442]: INFO ( default_kafka/kafka ): *** Purging cache - END (PID: 28442, QN: 118233/118233, ET: 4) ***                                                 
May 27 18:33:59 netflow3001 nfacctd[31200]: WARN ( default_kafka/kafka ): Missing data detected (plugin_buffer_size=1444 plugin_pipe_size=20480000).                                         
May 27 18:33:59 netflow3001 nfacctd[31200]: WARN ( default_kafka/kafka ): Increase values or look for plugin_buffer_size, plugin_pipe_size in CONFIG-KEYS document.

However this just means that a few flows are missing, not that we crashed while uploading the flows for an entire minute-wide bucket.

We had one more of these:

May 28 10:57:33 netflow3001 nfacctd[31200]: WARN ( default_kafka/kafka ): Missing data detected (plugin_buffer_size=1444 plugin_pipe_size=20480000).                                         
May 28 10:57:33 netflow3001 nfacctd[31200]: WARN ( default_kafka/kafka ): Increase values or look for plugin_buffer_size, plugin_pipe_size in CONFIG-KEYS document.

I'm testing bumping plugin_pipe_size to 30MB and assuming that looks stable, will call this good.

Overall, things look great since the changes:

Mentioned in SAL (#wikimedia-operations) [2020-05-29T14:59:02Z] <cdanis> disabling puppet on netflow* to deploy Ic71e96f0 T253128

Change 598841 merged by CDanis:
[operations/puppet@production] nfacctd: various increases to buffer sizes

https://gerrit.wikimedia.org/r/598841

CDanis closed this task as Resolved.May 29 2020, 3:22 PM
CDanis claimed this task.