
UDP rcvbuferrors and inerrors on graphite hosts
Closed, Resolved (Public)

Description

looks like the current rate of incoming UDP packets is causing receive errors; the ~2x increase seen in the graph is from enabling xhprof

Screen Shot 2015-06-02 at 11.27.27.png (507×1 px, 71 KB)

see also the graphite dashboard at https://gdash.wikimedia.org/dashboards/graphite.eqiad/
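For reference, the rcvbuferrors and inerrors counters in question are the kernel's UDP statistics (also visible via netstat -su). A minimal sketch that watches them, assuming the Linux /proc/net/snmp layout with InErrors and RcvbufErrors columns:

import time

def udp_counters():
    # /proc/net/snmp has a 'Udp:' header line followed by a 'Udp:' value line
    with open('/proc/net/snmp') as f:
        rows = [line.split() for line in f if line.startswith('Udp:')]
    names, values = rows[0][1:], [int(v) for v in rows[1][1:]]
    return dict(zip(names, values))

prev = udp_counters()
while True:
    time.sleep(1)
    cur = udp_counters()
    print('InErrors/s: %d  RcvbufErrors/s: %d' % (
        cur['InErrors'] - prev['InErrors'],
        cur['RcvbufErrors'] - prev['RcvbufErrors']))
    prev = cur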

Event Timeline

fgiunchedi raised the priority of this task to Medium. Jun 2 2015, 6:28 PM
fgiunchedi added projects: Grafana, acl*sre-team.
fgiunchedi subscribed.

Change 215383 had a related patch set uploaded (by Filippo Giunchedi):
Sample profiling data at 1:10,000

https://gerrit.wikimedia.org/r/215383

Change 215383 merged by jenkins-bot:
Sample profiling data at 1:10,000

https://gerrit.wikimedia.org/r/215383

increasing the xhprof sampling ratio helped by reducing the incoming packet rate, though the problem is still there (the new increase is due to varnishstatsd, https://gerrit.wikimedia.org/r/214651). Likely offenders are statsdlb and/or statsite

Screen Shot 2015-06-02 at 15.06.02.png (502×1 px, 96 KB)
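Client-side sampling works roughly along these lines; an illustrative sketch, not the actual xhprof change, with send and the metric name as stand-ins:

import random

SAMPLE_RATE = 1.0 / 10000  # 1:10,000, as in the merged change

def maybe_send_timing(send, metric, ms):
    # Emit roughly one in every 10,000 samples; the trailing |@<rate>
    # lets the statsd server scale aggregates back up.
    if random.random() < SAMPLE_RATE:
        send('%s:%d|ms|@%g' % (metric, ms, SAMPLE_RATE))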

> Likely offenders are statsdlb and/or statsite

Close: it's statsdlb and carbon-c-relay.

$ grep -Pv ' 0\s+$' /proc/net/udp
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode ref pointer drops
15935: 00000000:07D3 00000000:0000 07 00000000:00000000 00:00000000 00000000   115        0 742460050 2 0000000000000000 12273633
22057: 00000000:1FBD 00000000:0000 07 00000000:00000000 00:00000000 00000000   113        0 701339285 2 0000000000000000 149627693

$ getent passwd 115
carbon-c-relay:x:115:118::/var/lib/carbon-c-relay:/bin/false

$ getent passwd 113
statsdlb:x:113:65534::/nonexistent:/bin/false

Right now statsdlb can receive multiple metric strings in a single datagram, but it only sends one metric per datagram. It has to handle each metric string separately because it has to hash the key name to determine which downstream receiver to forward it to. But it could simply keep a buffer per receiver and flush whenever a size threshold is hit.
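A minimal sketch of that idea (illustrative Python, not statsdlb's actual C code): hash the key to pick a receiver, append the line to that receiver's buffer, and flush whenever the next line would push the buffer past a safe datagram size:

import socket
import zlib

MAX_PAYLOAD = 1400  # stay under a typical MTU; the exact threshold is tunable

class BufferingForwarder(object):
    def __init__(self, receivers):
        # receivers: list of (host, port) pairs for downstream statsite instances
        self.receivers = receivers
        self.buffers = [b''] * len(receivers)
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def forward(self, line):
        # Hash the key name (everything before ':') so a given metric
        # always lands on the same downstream instance.
        key = line.split(b':', 1)[0]
        i = zlib.crc32(key) % len(self.receivers)
        if len(self.buffers[i]) + len(line) + 1 > MAX_PAYLOAD:
            self.flush(i)
        self.buffers[i] += line + b'\n'

    def flush(self, i):
        if self.buffers[i]:
            self.sock.sendto(self.buffers[i], self.receivers[i])
            self.buffers[i] = b''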

indeed, the UDP traffic attributed to carbon-c-relay is graphite clients sending to 2003/udp:

graphite1001:~$ grep -Pv ' 0\s+$' /proc/net/udp
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode ref pointer drops             
15935: 00000000:07D3 00000000:0000 07 00000000:00000000 00:00000000 00000000   115        0 802210643 2 0000000000000000 129687
22057: 00000000:1FBD 00000000:0000 07 00000000:00000000 00:00000000 00000000   113        0 701339285 2 0000000000000000 149816708
graphite1001:~$ ps fwaux | grep -i carbon-c-rela
filippo   5020  0.0  0.0  11868   920 pts/16   S+   22:50   0:00              \_ grep -i carbon-c-rela
carbon-+ 18826  4.5  0.0 1366752 48096 ?       Ssl  19:52   8:02 /usr/bin/carbon-c-relay -H graphite.wikimedia.org_local -p 1903 -f /etc/carbon/local-relay.conf
carbon-+ 18871  5.2  0.0 1388124 55704 ?       Ssl  19:52   9:22 /usr/bin/carbon-c-relay -H graphite.wikimedia.org_frontend -p 2003 -f /etc/carbon/frontend-relay.conf
graphite1001:~$ sudo lsof -p 18826 | grep 802210643
graphite1001:~$ sudo lsof -p 18871 | grep 802210643
carbon-c- 18871 carbon-c-relay    5u  IPv4          802210643      0t0       UDP *:2003 
graphite1001:~$

I took a one-minute survey of statsd traffic to get the top users:

graphite1001:~$ sudo timeout 1m ngrep -q -W byline . udp dst port 8125  | grep -v -e '^U ' -e '^$' | sed -e 's/:.*//' | sort | cut -d. -f1 | uniq -dc | sort -rn | head -20
7677869 MediaWiki
2114283 varnish
 734322 restbase
  75951 kafka
  69421 jobrunner
  64569 varnishkafka
  31518 parsoid
  21038 media
  18955 eventbus
  14506 Hadoop
  12538 frontend
   7595 logstash
   4793 aqs
   4193 ocg
   1108 ResourceLoader
   1019 servers
    850 zuul
    758 kartotherian
    748 restbase-test
    704 graphoid

and expanded to the first three fields of the metric names:

graphite1001:~$ sudo timeout 1m ngrep -q -W byline . udp dst port 8125  | grep -v -e '^U ' -e '^$' | sed -e 's/:.*//' | sort | cut -d. -f1,2,3 | uniq -dc | sort -rn | head -50
2394083 MediaWiki.db.commit_masters
1569647 MediaWiki.image_cache.hit
 961492 varnish.eqiad.backends
 522954 varnish.esams.backends
 413060 MediaWiki.resourceloader_build.all
 371008 varnish.codfw.backends
 248226 MediaWiki.CirrusSearch.codfw
 222360 varnish.ulsfo.backends
 202917 MediaWiki.CirrusSearch.connectionPool
 194341 MediaWiki.resourceloader_cache.minify_js
 167073 restbase.internal-update.sys_table_-table-_
 166937 MediaWiki.resourceloader_build.user_options
 166718 MediaWiki.resourceloader_build.user_tokens
 156828 MediaWiki.db.commit_all
 128783 MediaWiki.timing.viewResponseTime
 121857 MediaWiki.pcache.hit
 120764 MediaWiki.jobqueue.inserts
 109984 MediaWiki.jobqueue.pickup_delay
 109982 MediaWiki.jobqueue.pops
 109842 MediaWiki.jobqueue.acks
 105002 MediaWiki.resourceloader_cache.minify_css
 103974 MediaWiki.jobqueue.inserts_actual
  98627 MediaWiki.CirrusSearch.poolCounter
  82596 MediaWiki.AbuseFilter.tokenizerCache
  76598 restbase.internal-update.sys_key-rev-value_-bucket-_-key--_revision--_tid-
  63602 kafka.cluster.analytics-eqiad
  54992 MediaWiki.jobqueue.run
  54992 MediaWiki.jobqueue.rss_delta
  50668 restbase.internal-update.sys_page-revisions_page_-title--_revision-
  48106 restbase.internal-update.sys_key-value_-bucket-_-key--_tid-
  47493 MediaWiki.pcache.miss
  35259 restbase.internal-update.sys_parsoid_html_-title--_revision--_tid-
  32145 MediaWiki.ParsoidBatchAPI.items
  32145 MediaWiki.ParsoidBatchAPI.batches
  26788 restbase.internal.sys_table_-table-_
  25054 MediaWiki.CirrusSearch.eqiad
  24027 restbase.internal-update.sys_events_
  23166 varnish.clients.ssl_cipher
  20686 restbase.internal-update.v1_page_html_-title-_-revision--_tid-
  20381 MediaWiki.centralauth.session
  20252 parsoid.wt2html.pageWithOldid
  16764 MediaWiki.jobqueue.dupe_inserts
  15154 restbase.internal-update.sys_parsoid_pagebundle_-title--_revision--_tid-
  14864 restbase.external.sys_table_-table-_
  14602 media.thumbnail.client
  14574 restbase.internal-update.sys_parsoid_data-parsoid_-title--_revision--_tid-
  14440 MediaWiki.image_cache.miss
  13671 MediaWiki.jobqueue.pickup_root_age
  13041 restbase.internal.sys_page-revisions_page_-title--_revision-
  12985 restbase.internal.sys_key-rev-value_-bucket-_-key--_revision--_tid-

Change 282986 had a related patch set uploaded (by Ori.livneh):
Don't report image cache hits / misses

https://gerrit.wikimedia.org/r/282986

Change 282990 had a related patch set uploaded (by Ori.livneh):
Revert "Measure commitMasterChanges() run time"

https://gerrit.wikimedia.org/r/282990

Change 282996 had a related patch set uploaded (by Ori.livneh):
Revert "Measure commitMasterChanges() run time"

https://gerrit.wikimedia.org/r/282996

Change 282998 had a related patch set uploaded (by Ori.livneh):
Don't report image cache hits / misses

https://gerrit.wikimedia.org/r/282998

Change 282998 merged by Ori.livneh:
Don't report image cache hits / misses

https://gerrit.wikimedia.org/r/282998

Change 282996 merged by Ori.livneh:
Revert "Measure commitMasterChanges() run time"

https://gerrit.wikimedia.org/r/282996

Change 283000 had a related patch set uploaded (by Ori.livneh):
Revert "Measure commitMasterChanges() run time"

https://gerrit.wikimedia.org/r/283000

Change 283000 merged by Ori.livneh:
Revert "Measure commitMasterChanges() run time"

https://gerrit.wikimedia.org/r/283000

Change 283001 had a related patch set uploaded (by Ori.livneh):
Don't report image cache hits / misses

https://gerrit.wikimedia.org/r/283001

Change 282986 merged by Ori.livneh:
Don't report image cache hits / misses

https://gerrit.wikimedia.org/r/282986

Change 283001 merged by Ori.livneh:
Don't report image cache hits / misses

https://gerrit.wikimedia.org/r/283001

Change 282990 merged by jenkins-bot:
Revert "Measure commitMasterChanges() run time"

https://gerrit.wikimedia.org/r/282990

Mentioned in SAL [2016-07-01T11:23:38Z] <godog> bounce statsdlb on graphite1001, drops are back after yesterday's reboot T101141

this issue seems to be back after yesterday's reboot of graphite1001 for a trusty kernel upgrade:

graphite1001:~$ grep -Pv ' 0\s+$' /proc/net/udp
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode ref pointer drops             
22126: 00000000:1FBD 00000000:0000 07 00000300:00034000 00:00000000 00000000   113        0 3066951 2 0000000000000000 255879

Change 296920 had a related patch set uploaded (by Filippo Giunchedi):
use SO_REUSEPORT in code, not SO_REUSEADDR

https://gerrit.wikimedia.org/r/296920

Mentioned in SAL [2016-07-01T15:53:17Z] <godog> temporarily run 3x statsdlb instances on graphite1001 to minimise drops - T101141

Change 296920 merged by Filippo Giunchedi:
use SO_REUSEPORT in code, not SO_REUSEADDR

https://gerrit.wikimedia.org/r/296920
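For context: SO_REUSEPORT is what makes the temporary 3x-instances trick above work. It lets several processes bind the same UDP port, with the kernel spreading incoming datagrams between them, which SO_REUSEADDR alone does not do for UDP. A minimal Python sketch of the pattern, assuming Linux >= 3.9:

import socket

def statsd_listener(port=8125):
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # Each of N worker processes sets SO_REUSEPORT and binds the same
    # port; the kernel then load-balances datagrams across them.
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    s.bind(('0.0.0.0', port))
    while True:
        data, addr = s.recvfrom(65535)
        # ... parse and forward the statsd metric lines ...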

I've tried replacing statsdlb with statsd_proxy today, and so far so good. I can still see occasional errors, but at a rate an order of magnitude lower now.

I spot-checked some statsd metrics and grafana dashboards and couldn't find discrepancies; the switch happened at 2016-07-28T11:17.
One thing that changed, though, is the kafka-by-topic dashboard, which now shows intermittent data for by-topic messages/bytes in/out: https://grafana.wikimedia.org/dashboard/db/kafka-by-topic?from=now-1h&to=now
I'll need to dig deeper into the dashboard/data to see what's going on (cc @Ottomata @elukey)

This seems to be related only to main-eqiad (kafka100[12].eqiad). I can see these horrible messages in jmxlog:

[28 Jul 2016 13:56:28] [ServerScheduler_Worker-1] 75635  ERROR (com.googlecode.jmxtrans.jobs.ServerJob:41) - Error
java.io.IOException: Failed to retrieve RMIServer stub: javax.naming.ServiceUnavailableException [Root exception is java.rmi.ConnectException: Connection refused to host: kafka1001.eqiad.wmnet; nested exception is:
	java.net.ConnectException: Connection refused]
	at javax.management.remote.rmi.RMIConnector.connect(RMIConnector.java:370)
	at javax.management.remote.JMXConnectorFactory.connect(JMXConnectorFactory.java:268)
	at com.googlecode.jmxtrans.util.JmxUtils.getServerConnection(JmxUtils.java:351)
	at com.googlecode.jmxtrans.util.JmxConnectionFactory.makeObject(JmxConnectionFactory.java:31)
	at org.apache.commons.pool.impl.GenericKeyedObjectPool.borrowObject(GenericKeyedObjectPool.java:1212)
	at com.googlecode.jmxtrans.jobs.ServerJob.execute(ServerJob.java:37)
	at org.quartz.core.JobRunShell.run(JobRunShell.java:216)
	at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:549)
Caused by: javax.naming.ServiceUnavailableException [Root exception is java.rmi.ConnectException: Connection refused to host: kafka1001.eqiad.wmnet; nested exception is:
	java.net.ConnectException: Connection refused]
	at com.sun.jndi.rmi.registry.RegistryContext.lookup(RegistryContext.java:118)
	at com.sun.jndi.toolkit.url.GenericURLContext.lookup(GenericURLContext.java:203)
	at javax.naming.InitialContext.lookup(InitialContext.java:411)
	at javax.management.remote.rmi.RMIConnector.findRMIServerJNDI(RMIConnector.java:1928)
	at javax.management.remote.rmi.RMIConnector.findRMIServer(RMIConnector.java:1895)
	at javax.management.remote.rmi.RMIConnector.connect(RMIConnector.java:287)
	... 7 more
Caused by: java.rmi.ConnectException: Connection refused to host: kafka1001.eqiad.wmnet; nested exception is:
	java.net.ConnectException: Connection refused
	at sun.rmi.transport.tcp.TCPEndpoint.newSocket(TCPEndpoint.java:619)
	at sun.rmi.transport.tcp.TCPChannel.createConnection(TCPChannel.java:216)
	at sun.rmi.transport.tcp.TCPChannel.newConnection(TCPChannel.java:202)
	at sun.rmi.server.UnicastRef.newCall(UnicastRef.java:341)
	at sun.rmi.registry.RegistryImpl_Stub.lookup(Unknown Source)
	at com.sun.jndi.rmi.registry.RegistryContext.lookup(RegistryContext.java:114)
	... 12 more
Caused by: java.net.ConnectException: Connection refused
	at java.net.PlainSocketImpl.socketConnect(Native Method)
	at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
	at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
	at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
	at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
	at java.net.Socket.connect(Socket.java:579)
	at java.net.Socket.connect(Socket.java:528)
	at java.net.Socket.<init>(Socket.java:425)
	at java.net.Socket.<init>(Socket.java:208)
	at sun.rmi.transport.proxy.RMIDirectSocketFactory.createSocket(RMIDirectSocketFactory.java:40)
	at sun.rmi.transport.proxy.RMIMasterSocketFactory.createSocket(RMIMasterSocketFactory.java:147)
	at sun.rmi.transport.tcp.TCPEndpoint.newSocket(TCPEndpoint.java:613)
	... 17 more

Restarting doesn't seem to help.

Correction: even the graphs for analytics-eqiad are showing some changes, though not as heavily as main-eqiad.

Mentioned in SAL [2016-07-28T14:59:17Z] <godog> bounce carbon-cache on graphite1001 - T101141

Mentioned in SAL [2016-07-28T15:15:52Z] <godog> bounce statsite on graphite1001 - T101141

I looked at the whisper files for OneMinuteRate and I think the NULLs are due to data not making it into the whisper bucket in time, since statsite flushes every 60s while statsd-proxy writes to statsite every 10ms. I'm going to try flushing statsite more frequently (every 30s), which should improve things.
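For reference on why a late flush shows up as a NULL: whisper keeps one value per step-aligned bucket, so a datapoint's timestamp determines its bucket as ts - (ts % step). With a 60s step, two consecutive flushes that drift across a bucket boundary can leave a bucket with no value at all. A quick illustration (the 60s step is an assumption about the retention config):

STEP = 60  # whisper bucket size in seconds

def bucket(ts, step=STEP):
    # whisper keeps one value per step-aligned bucket
    return ts - (ts % step)

# Two flushes 62s apart that straddle a bucket boundary leave a hole:
print(bucket(119), bucket(181))  # -> 60 180; bucket 120 never gets a value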

Change 301635 had a related patch set uploaded (by Filippo Giunchedi):
statsite: flush to graphite every 30s

https://gerrit.wikimedia.org/r/301635

Change 301635 merged by Filippo Giunchedi:
statsite: flush to graphite every 30s

https://gerrit.wikimedia.org/r/301635

Mentioned in SAL [2016-07-28T16:49:59Z] <godog> bounce statsite on graphite1001 T101141

recap: jmxtrans pushes gauges to statsd every 15s, and I can see the metrics being received by statsd-proxy and forwarded to statsite at roughly the same interval.

What doesn't seem to happen reliably for jmxtrans gauges is the statsite -> carbon flush

> Mentioned in SAL [2016-07-28T16:49:59Z] <godog> bounce statsite on graphite1001 T101141

It looks like this did something to some of the metrics (though not all):

https://grafana.wikimedia.org/dashboard/db/media

Screen Shot 2016-07-28 at 20.35.30.png (1×1 px, 233 KB)

The drop is immediate at 16:50, after the above log entry.

> Mentioned in SAL [2016-07-28T16:49:59Z] <godog> bounce statsite on graphite1001 T101141
>
> It looks like this did something to some of the metrics (though not all):

indeed, the change noted in SAL was about switching the statsd -> graphite flush period from 60s to 30s. The metric for e.g. cached thumbnails served is based on the sum, with scaleToSeconds() applied afterwards; you could use the rate metric instead, which is always per-second regardless of the flush period.

Anyway, the flush period change didn't have the results I expected, so I'm going to revert it now regardless.

Change 301765 had a related patch set uploaded (by Filippo Giunchedi):
Revert "statsite: flush to graphite every 30s"

https://gerrit.wikimedia.org/r/301765

Change 301765 merged by Filippo Giunchedi:
Revert "statsite: flush to graphite every 30s"

https://gerrit.wikimedia.org/r/301765

Mentioned in SAL [2016-07-29T08:51:03Z] <godog> switch back statsite flush period to 60s T101141

trying to track down where some metrics don't get flushed to graphite, taking kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate as an example.

jmxtrans -> statsd-proxy on graphite1001 (8125/udp)

root@graphite1001:~# timeout 3m ngrep -q -W byline . udp dst port 8125  | grep kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate:9229.955827609365|g
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate:9261.413268614526|g
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate:9294.83859068178|g
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate:9234.212510770094|g
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate:9208.05079053432|g
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate:9223.560965714945|g
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate:9244.575335172642|g
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate:9208.014260557184|g
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate:9200.133815280758|g
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate:9220.049515174913|g
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate:9175.8143242711|g
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate:9203.801788213077|g
Terminated

statsd-proxy -> statsite (8126-8131/udp)

root@graphite1001:~# timeout 3m ngrep -d lo -t  -W byline kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate 'udp dst port (8126 or 8127 or 8128 or 8129 or 8130 or 8131)' | grep -e '^[TU] ' -e kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate
match: kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate
U 2016/07/29 13:40:03.798197 127.0.0.1:54905 -> 127.0.0.1:8131
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate:10034.088016421334|g
U 2016/07/29 13:40:18.838229 127.0.0.1:54905 -> 127.0.0.1:8131
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate:9981.073858063919|g
U 2016/07/29 13:40:33.718221 127.0.0.1:54905 -> 127.0.0.1:8131
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate:10003.718249726055|g
U 2016/07/29 13:40:48.728114 127.0.0.1:54905 -> 127.0.0.1:8131
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate:9989.917392425285|g
U 2016/07/29 13:41:03.728204 127.0.0.1:54905 -> 127.0.0.1:8131
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate:9990.644593542436|g
U 2016/07/29 13:41:18.738096 127.0.0.1:54905 -> 127.0.0.1:8131
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate:10035.694702257877|g
U 2016/07/29 13:41:33.848399 127.0.0.1:54905 -> 127.0.0.1:8131
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate:10001.913578578356|g
U 2016/07/29 13:41:48.818131 127.0.0.1:54905 -> 127.0.0.1:8131
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate:9986.1155052628|g
U 2016/07/29 13:42:03.718238 127.0.0.1:54905 -> 127.0.0.1:8131
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate:9985.875323647155|g
U 2016/07/29 13:42:18.718167 127.0.0.1:54905 -> 127.0.0.1:8131
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate:10029.00048466752|g
U 2016/07/29 13:42:33.738370 127.0.0.1:54905 -> 127.0.0.1:8131
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate:10033.396873696549|g
U 2016/07/29 13:42:48.708271 127.0.0.1:54905 -> 127.0.0.1:8131
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate:10036.771501899368|g
Terminated

capturing the traffic that goes statsite -> carbon-c-relay frontend -> local relay (2003 -> 1903), there's this:

root@graphite1001:~# timeout 20m ngrep -t -d lo -q -W byline kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate 'tcp dst port 1903'  | grep -e '^[TU] ' -e kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate
match: kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate
T 2016/07/29 14:49:26.524723 10.64.32.155:42988 -> 10.64.32.155:1903 [A]
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate 9995.078023 1469803765
T 2016/07/29 14:51:26.509645 10.64.32.155:42988 -> 10.64.32.155:1903 [A]
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate 9966.901415 1469803885
T 2016/07/29 14:53:26.932847 10.64.32.155:42988 -> 10.64.32.155:1903 [A]
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate 9991.672961 1469804005
T 2016/07/29 14:54:26.417651 10.64.32.155:42988 -> 10.64.32.155:1903 [A]
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate 10001.118134 1469804065
T 2016/07/29 14:57:26.752286 10.64.32.155:42988 -> 10.64.32.155:1903 [A]
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate 9944.986556 1469804245
T 2016/07/29 14:58:27.142492 10.64.32.155:42988 -> 10.64.32.155:1903 [A]
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate 9906.701438 1469804305
T 2016/07/29 14:59:27.104136 10.64.32.155:42988 -> 10.64.32.155:1903 [A]
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate 9873.971604 1469804365
T 2016/07/29 15:00:26.704842 10.64.32.155:42988 -> 10.64.32.155:1903 [AP]
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate 9807.134333 1469804425
T 2016/07/29 15:01:27.078196 10.64.32.155:42988 -> 10.64.32.155:1903 [A]
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate 9852.840489 1469804485
T 2016/07/29 15:03:26.806722 10.64.32.155:42988 -> 10.64.32.155:1903 [A]
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate 9829.313463 1469804605
T 2016/07/29 15:07:27.892026 10.64.32.155:42988 -> 10.64.32.155:1903 [AP]
Terminated

and the statsd-proxy -> statsite traffic at the same time:

root@graphite1001:~# timeout 20m ngrep -t -d lo -q -W byline kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate 'udp dst port (8126 or 8127 or 8128 or 8129 or 8130 or 8131)' | grep -e '^[TU] ' -e kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate
MinuteRate:9804.832992504611|g
U 2016/07/29 15:02:03.738260 127.0.0.1:54905 -> 127.0.0.1:8131
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate:9760.638489792827|g
U 2016/07/29 15:02:18.718536 127.0.0.1:54905 -> 127.0.0.1:8131
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate:9777.759951355441|g
U 2016/07/29 15:02:33.718558 127.0.0.1:54905 -> 127.0.0.1:8131
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate:9774.637898335885|g
U 2016/07/29 15:02:48.758135 127.0.0.1:54905 -> 127.0.0.1:8131
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate:9790.173290691353|g
U 2016/07/29 15:03:03.718092 127.0.0.1:54905 -> 127.0.0.1:8131
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate:9821.527271966477|g
U 2016/07/29 15:03:18.728119 127.0.0.1:54905 -> 127.0.0.1:8131
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate:9829.31346348762|g
U 2016/07/29 15:03:33.728333 127.0.0.1:54905 -> 127.0.0.1:8131
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate:9843.430718072295|g
U 2016/07/29 15:03:48.718521 127.0.0.1:54905 -> 127.0.0.1:8131
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate:9799.106728749253|g
U 2016/07/29 15:04:03.738372 127.0.0.1:54905 -> 127.0.0.1:8131
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate:9762.278110286614|g
U 2016/07/29 15:04:18.768362 127.0.0.1:54905 -> 127.0.0.1:8131
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate:9747.04034892221|g
U 2016/07/29 15:04:33.738529 127.0.0.1:54905 -> 127.0.0.1:8131
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate:9731.92816369084|g
U 2016/07/29 15:04:48.738312 127.0.0.1:54905 -> 127.0.0.1:8131
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate:9720.605189040985|g
U 2016/07/29 15:05:03.728253 127.0.0.1:54905 -> 127.0.0.1:8131
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate:9743.367768833981|g
U 2016/07/29 15:05:18.728121 127.0.0.1:54905 -> 127.0.0.1:8131
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate:9782.272634596275|g
U 2016/07/29 15:05:33.888288 127.0.0.1:54905 -> 127.0.0.1:8131
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate:9806.045824671204|g
U 2016/07/29 15:05:48.728199 127.0.0.1:54905 -> 127.0.0.1:8131
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate:9771.983558343027|g
U 2016/07/29 15:06:03.718235 127.0.0.1:54905 -> 127.0.0.1:8131
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate:9789.94828648672|g
U 2016/07/29 15:06:18.758332 127.0.0.1:54905 -> 127.0.0.1:8131
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate:9789.148823708021|g
U 2016/07/29 15:06:33.798181 127.0.0.1:54905 -> 127.0.0.1:8131
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate:9810.695610113127|g
U 2016/07/29 15:06:48.808179 127.0.0.1:54905 -> 127.0.0.1:8131
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate:9783.861943505917|g
U 2016/07/29 15:07:03.808230 127.0.0.1:54905 -> 127.0.0.1:8131
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate:9769.43363691246|g
U 2016/07/29 15:07:18.728360 127.0.0.1:54905 -> 127.0.0.1:8131
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate:9780.347978634356|g
U 2016/07/29 15:07:33.808352 127.0.0.1:54905 -> 127.0.0.1:8131
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate:9794.406386858032|g
U 2016/07/29 15:07:48.838092 127.0.0.1:54905 -> 127.0.0.1:8131
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate:9772.183595136628|g
Terminated

e.g. the 14:52, 14:55 and 14:56 flushes do not show up inbound for carbon-c-relay, so it looks like statsite might be the one dropping them
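One quick way to spot such gaps is to diff the minute buckets seen on each hop; a tiny sketch with made-up timestamps standing in for the captures above:

def minutes(timestamps):
    # 'HH:MM:SS' strings extracted from the ngrep output
    return {t[:5] for t in timestamps}

upstream   = minutes(['14:52:26', '14:55:26', '14:56:26', '14:57:26'])  # entering statsite
downstream = minutes(['14:57:26'])                                      # leaving statsite
print(sorted(upstream - downstream))  # -> ['14:52', '14:55', '14:56']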

I don't fully understand what's going on here, but the problem is getting steadily worse, making a mess of graphs and averages. One quick thing I did notice, though:

> Change 296920 merged by Filippo Giunchedi:
> use SO_REUSEPORT in code, not SO_REUSEADDR
>
> https://gerrit.wikimedia.org/r/296920

The code should probably be using both socket options, rather than just one or the other. But I doubt that's the problem unless statsdlb is restarting constantly and failing to do so quickly...

@BBlack we've stopped using statsdlb in favour of statsd-proxy, but yeah I didn't see statsdlb restarting quickly even when we were using it.

An example of the problem in the tls-ciphers dashboard is missing datapoints, which show up as spikes in the graphs, e.g. between Aug 16th and Aug 18th:

screenshot_h1Vhan.png (775×622 px, 154 KB)

I believe these correspond to restarts of the carbon daemons, e.g. for https://gerrit.wikimedia.org/r/#/c/305238/

I've also noticed the UDP inerrors have come back, at a steady level, after the 16th:

screenshot_gqPQQx.png (307×456 px, 29 KB)

and according to SAL, a varnish rolling restart was in progress at the time for T142810.

We've been observing periodic elevated (>500/s) UDP inerrors / buffer errors on graphite1001 since yesterday, _after_ having switched statsd traffic to graphite1004 in T196484. The only statsd client still sending traffic to graphite1001 is ores in this case, and errors are still elevated even with that modest traffic (compared to the firehose of all UDP statsd traffic).

fgiunchedi renamed this task from udp rcvbuferrors and inerrors on graphite1001 to UDP rcvbuferrors and inerrors on graphite hosts.Jan 16 2019, 11:43 AM

Resolving, since we have significantly lessened the UDP traffic load.