looks like the current rate of incoming udp packets causes receive errors; the ~2x increase seen in the graph is from enabling xhprof
see also the graphite dashboard at https://gdash.wikimedia.org/dashboards/graphite.eqiad/
Status | Assigned | Task
---|---|---
Resolved | fgiunchedi | T105218 check_graphite - "UNKNOWN: More than half of the datapoints are undefined"
Resolved | fgiunchedi | T101141 UDP rcvbuferrors and inerrors on graphite hosts
Resolved | fgiunchedi | T116031 mediawiki should send statsd metrics in batches
Resolved | fgiunchedi | T116033 diamond should send statsd metrics in batches
Resolved | Pchelolo | T121231 restbase should send metrics in batches
Resolved | fgiunchedi | T121861 switch diamond to use graphite line protocol
Resolved | fgiunchedi | T105679 add more statsite processes to get more balanced load
Resolved | fgiunchedi | T126447 add more statsdlb instances for more throughput
Resolved | aaron | T132327 jobrunner should send statsd in batches
Resolved | fgiunchedi | T132472 mediawiki statsd traffic
Declined | Ottomata | T141524 eventbus should send statsd in batches
Change 215383 had a related patch set uploaded (by Filippo Giunchedi):
Sample profiling data at 1:10,000
increasing the xhprof sampling helped by reducing the incoming packet rate, though the problem is still there (the new increase is due to varnishstatsd https://gerrit.wikimedia.org/r/214651). Likely offenders are statsdlb and/or statsite
Close: it's statsdlb and carbon-c-relay.
$ grep -Pv ' 0\s+$' /proc/net/udp
   sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode ref pointer drops
15935: 00000000:07D3 00000000:0000 07 00000000:00000000 00:00000000 00000000   115        0 742460050 2 0000000000000000 12273633
22057: 00000000:1FBD 00000000:0000 07 00000000:00000000 00:00000000 00000000   113        0 701339285 2 0000000000000000 149627693
$ getent passwd 115
carbon-c-relay:x:115:118::/var/lib/carbon-c-relay:/bin/false
$ getent passwd 113
statsdlb:x:113:65534::/nonexistent:/bin/false
Right now statsdlb can receive multiple metric strings in a single datagram, but it only sends one metric per datagram. It has to handle each metric string separately because it has to hash the key name to determine which downstream receiver it should forward it to. But it could simply keep a buffer for each receiver, and flush when some size threshold is hit, as sketched below.
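Roughly this idea (a Python sketch of the batching logic only; the receiver list and flush threshold are hypothetical, and statsdlb itself is written in C):

```python
# Sketch of per-receiver batching: hash each key to pick a downstream receiver,
# append the metric to that receiver's buffer, and flush once a size threshold is hit.
import socket
from zlib import crc32

RECEIVERS = [('127.0.0.1', 8126), ('127.0.0.1', 8127), ('127.0.0.1', 8128)]  # hypothetical
FLUSH_BYTES = 1400  # stay under a typical MTU
buffers = {r: bytearray() for r in RECEIVERS}
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

def forward(metric_line: bytes) -> None:
    key = metric_line.split(b':', 1)[0]
    receiver = RECEIVERS[crc32(key) % len(RECEIVERS)]  # same key always goes to the same receiver
    buf = buffers[receiver]
    buf += metric_line + b'\n'
    if len(buf) >= FLUSH_BYTES:                        # send a batch, not one metric per datagram
        sock.sendto(bytes(buf), receiver)
        buf.clear()
```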
indeed, the udp traffic dropped on the carbon-c-relay side is from graphite clients sending to 2003/udp
graphite1001:~$ grep -Pv ' 0\s+$' /proc/net/udp
   sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode ref pointer drops
15935: 00000000:07D3 00000000:0000 07 00000000:00000000 00:00000000 00000000   115        0 802210643 2 0000000000000000 129687
22057: 00000000:1FBD 00000000:0000 07 00000000:00000000 00:00000000 00000000   113        0 701339285 2 0000000000000000 149816708
graphite1001:~$ ps fwaux | grep -i carbon-c-rela
filippo   5020  0.0  0.0  11868   920 pts/16  S+   22:50   0:00  \_ grep -i carbon-c-rela
carbon-+ 18826  4.5  0.0 1366752 48096 ?      Ssl  19:52   8:02 /usr/bin/carbon-c-relay -H graphite.wikimedia.org_local -p 1903 -f /etc/carbon/local-relay.conf
carbon-+ 18871  5.2  0.0 1388124 55704 ?      Ssl  19:52   9:22 /usr/bin/carbon-c-relay -H graphite.wikimedia.org_frontend -p 2003 -f /etc/carbon/frontend-relay.conf
graphite1001:~$ sudo lsof -p 18826 | grep 802210643
graphite1001:~$ sudo lsof -p 18871 | grep 802210643
carbon-c- 18871 carbon-c-relay    5u  IPv4 802210643      0t0  UDP *:2003
graphite1001:~$
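For reference, the hex local port and the trailing drops column in /proc/net/udp can also be decoded programmatically; a minimal sketch, assuming the usual Linux procfs layout:

```python
# Sketch: decode /proc/net/udp to list sockets with drops, mapping uid -> user and hex port -> port.
import pwd

with open('/proc/net/udp') as f:
    next(f)  # skip the header line
    for line in f:
        fields = line.split()
        local_port = int(fields[1].split(':')[1], 16)   # local_address is hex, e.g. 00000000:1FBD -> 8125
        uid, inode, drops = int(fields[7]), fields[9], int(fields[-1])
        if drops:
            user = pwd.getpwuid(uid).pw_name
            print(f'port {local_port} uid {uid} ({user}) inode {inode} drops {drops}')
```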
I took a one-minute survey of statsd traffic to get top users
graphite1001:~$ sudo timeout 1m ngrep -q -W byline . udp dst port 8125 | grep -v -e '^U ' -e '^$' | sed -e 's/:.*//' | sort | cut -d. -f1 | uniq -dc | sort -rn | head -20
7677869 MediaWiki
2114283 varnish
 734322 restbase
  75951 kafka
  69421 jobrunner
  64569 varnishkafka
  31518 parsoid
  21038 media
  18955 eventbus
  14506 Hadoop
  12538 frontend
   7595 logstash
   4793 aqs
   4193 ocg
   1108 ResourceLoader
   1019 servers
    850 zuul
    758 kartotherian
    748 restbase-test
    704 graphoid
and expanded to the first three fields
graphite1001:~$ sudo timeout 1m ngrep -q -W byline . udp dst port 8125 | grep -v -e '^U ' -e '^$' | sed -e 's/:.*//' | sort | cut -d. -f1,2,3 | uniq -dc | sort -rn | head -50
2394083 MediaWiki.db.commit_masters
1569647 MediaWiki.image_cache.hit
 961492 varnish.eqiad.backends
 522954 varnish.esams.backends
 413060 MediaWiki.resourceloader_build.all
 371008 varnish.codfw.backends
 248226 MediaWiki.CirrusSearch.codfw
 222360 varnish.ulsfo.backends
 202917 MediaWiki.CirrusSearch.connectionPool
 194341 MediaWiki.resourceloader_cache.minify_js
 167073 restbase.internal-update.sys_table_-table-_
 166937 MediaWiki.resourceloader_build.user_options
 166718 MediaWiki.resourceloader_build.user_tokens
 156828 MediaWiki.db.commit_all
 128783 MediaWiki.timing.viewResponseTime
 121857 MediaWiki.pcache.hit
 120764 MediaWiki.jobqueue.inserts
 109984 MediaWiki.jobqueue.pickup_delay
 109982 MediaWiki.jobqueue.pops
 109842 MediaWiki.jobqueue.acks
 105002 MediaWiki.resourceloader_cache.minify_css
 103974 MediaWiki.jobqueue.inserts_actual
  98627 MediaWiki.CirrusSearch.poolCounter
  82596 MediaWiki.AbuseFilter.tokenizerCache
  76598 restbase.internal-update.sys_key-rev-value_-bucket-_-key--_revision--_tid-
  63602 kafka.cluster.analytics-eqiad
  54992 MediaWiki.jobqueue.run
  54992 MediaWiki.jobqueue.rss_delta
  50668 restbase.internal-update.sys_page-revisions_page_-title--_revision-
  48106 restbase.internal-update.sys_key-value_-bucket-_-key--_tid-
  47493 MediaWiki.pcache.miss
  35259 restbase.internal-update.sys_parsoid_html_-title--_revision--_tid-
  32145 MediaWiki.ParsoidBatchAPI.items
  32145 MediaWiki.ParsoidBatchAPI.batches
  26788 restbase.internal.sys_table_-table-_
  25054 MediaWiki.CirrusSearch.eqiad
  24027 restbase.internal-update.sys_events_
  23166 varnish.clients.ssl_cipher
  20686 restbase.internal-update.v1_page_html_-title-_-revision--_tid-
  20381 MediaWiki.centralauth.session
  20252 parsoid.wt2html.pageWithOldid
  16764 MediaWiki.jobqueue.dupe_inserts
  15154 restbase.internal-update.sys_parsoid_pagebundle_-title--_revision--_tid-
  14864 restbase.external.sys_table_-table-_
  14602 media.thumbnail.client
  14574 restbase.internal-update.sys_parsoid_data-parsoid_-title--_revision--_tid-
  14440 MediaWiki.image_cache.miss
  13671 MediaWiki.jobqueue.pickup_root_age
  13041 restbase.internal.sys_page-revisions_page_-title--_revision-
  12985 restbase.internal.sys_key-rev-value_-bucket-_-key--_revision--_tid-
Change 282986 had a related patch set uploaded (by Ori.livneh):
Don't report image cache hits / misses
Change 282990 had a related patch set uploaded (by Ori.livneh):
Revert "Measure commitMasterChanges() run time"
Change 282996 had a related patch set uploaded (by Ori.livneh):
Revert "Measure commitMasterChanges() run time"
Change 282998 had a related patch set uploaded (by Ori.livneh):
Don't report image cache hits / misses
Change 283000 had a related patch set uploaded (by Ori.livneh):
Revert "Measure commitMasterChanges() run time"
Change 283001 had a related patch set uploaded (by Ori.livneh):
Don't report image cache hits / misses
Mentioned in SAL [2016-07-01T11:23:38Z] <godog> bounce statsdlb on graphite1001, drops are back after yesterday's reboot T101141
this issue seems to be back after yesterday's reboot of graphite1001 for a trusty kernel upgrade:
graphite1001:~$ grep -Pv ' 0\s+$' /proc/net/udp
   sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode ref pointer drops
22126: 00000000:1FBD 00000000:0000 07 00000300:00034000 00:00000000 00000000   113        0 3066951 2 0000000000000000 255879
Change 296920 had a related patch set uploaded (by Filippo Giunchedi):
use SO_REUSEPORT in code, not SO_REUSEADDR
Mentioned in SAL [2016-07-01T15:53:17Z] <godog> temporarily run 3x statsdlb instances on graphite1001 to minimise drops - T101141
Change 296920 merged by Filippo Giunchedi:
use SO_REUSEPORT in code, not SO_REUSEADDR
I've tried replacing statsdlb with statsd_proxy today and so far so good. I can still see occasional errors, but they're an order of magnitude lower now.
I spot-checked some statsd metrics and grafana dashboards and couldn't find discrepancies; the switch happened at 2016-07-28T11:17.
One thing that did change, though, is the kafka-by-topic dashboard, which now shows intermittent data for messages/bytes in/out by topic: https://grafana.wikimedia.org/dashboard/db/kafka-by-topic?from=now-1h&to=now
I'll need to dig deeper into the dashboard/data to see what's going on (cc @Ottomata @elukey )
This seems to be related only to main-eqiad (kafka100[12].eqiad). I can see these horrible messages in jmxlog:
[28 Jul 2016 13:56:28] [ServerScheduler_Worker-1] 75635 ERROR (com.googlecode.jmxtrans.jobs.ServerJob:41) - Error
java.io.IOException: Failed to retrieve RMIServer stub: javax.naming.ServiceUnavailableException [Root exception is java.rmi.ConnectException: Connection refused to host: kafka1001.eqiad.wmnet; nested exception is: java.net.ConnectException: Connection refused]
	at javax.management.remote.rmi.RMIConnector.connect(RMIConnector.java:370)
	at javax.management.remote.JMXConnectorFactory.connect(JMXConnectorFactory.java:268)
	at com.googlecode.jmxtrans.util.JmxUtils.getServerConnection(JmxUtils.java:351)
	at com.googlecode.jmxtrans.util.JmxConnectionFactory.makeObject(JmxConnectionFactory.java:31)
	at org.apache.commons.pool.impl.GenericKeyedObjectPool.borrowObject(GenericKeyedObjectPool.java:1212)
	at com.googlecode.jmxtrans.jobs.ServerJob.execute(ServerJob.java:37)
	at org.quartz.core.JobRunShell.run(JobRunShell.java:216)
	at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:549)
Caused by: javax.naming.ServiceUnavailableException [Root exception is java.rmi.ConnectException: Connection refused to host: kafka1001.eqiad.wmnet; nested exception is: java.net.ConnectException: Connection refused]
	at com.sun.jndi.rmi.registry.RegistryContext.lookup(RegistryContext.java:118)
	at com.sun.jndi.toolkit.url.GenericURLContext.lookup(GenericURLContext.java:203)
	at javax.naming.InitialContext.lookup(InitialContext.java:411)
	at javax.management.remote.rmi.RMIConnector.findRMIServerJNDI(RMIConnector.java:1928)
	at javax.management.remote.rmi.RMIConnector.findRMIServer(RMIConnector.java:1895)
	at javax.management.remote.rmi.RMIConnector.connect(RMIConnector.java:287)
	... 7 more
Caused by: java.rmi.ConnectException: Connection refused to host: kafka1001.eqiad.wmnet; nested exception is: java.net.ConnectException: Connection refused
	at sun.rmi.transport.tcp.TCPEndpoint.newSocket(TCPEndpoint.java:619)
	at sun.rmi.transport.tcp.TCPChannel.createConnection(TCPChannel.java:216)
	at sun.rmi.transport.tcp.TCPChannel.newConnection(TCPChannel.java:202)
	at sun.rmi.server.UnicastRef.newCall(UnicastRef.java:341)
	at sun.rmi.registry.RegistryImpl_Stub.lookup(Unknown Source)
	at com.sun.jndi.rmi.registry.RegistryContext.lookup(RegistryContext.java:114)
	... 12 more
Caused by: java.net.ConnectException: Connection refused
	at java.net.PlainSocketImpl.socketConnect(Native Method)
	at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
	at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
	at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
	at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
	at java.net.Socket.connect(Socket.java:579)
	at java.net.Socket.connect(Socket.java:528)
	at java.net.Socket.<init>(Socket.java:425)
	at java.net.Socket.<init>(Socket.java:208)
	at sun.rmi.transport.proxy.RMIDirectSocketFactory.createSocket(RMIDirectSocketFactory.java:40)
	at sun.rmi.transport.proxy.RMIMasterSocketFactory.createSocket(RMIMasterSocketFactory.java:147)
	at sun.rmi.transport.tcp.TCPEndpoint.newSocket(TCPEndpoint.java:613)
	... 17 more
Restarting doesn't seem to work.
Correction: even the graphs for analytics-eqiad are showing some changes, but not as heavily as main-eqiad.
Mentioned in SAL [2016-07-28T14:59:17Z] <godog> bounce carbon-cache on graphite1001 - T101141
Mentioned in SAL [2016-07-28T15:15:52Z] <godog> bounce statsite on graphite1001 - T101141
I looked at the whisper files for OneMinuteRate and I think the NULLs are due to data not making it to the whisper bucket in time, since statsite flushes every 60s and statsd-proxy writes to statsite every 10ms. I'm going to try flushing statsite more frequently (30s), which should improve things.
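To spot the gaps, something along these lines can list the empty 60s buckets directly from a whisper file (a sketch, assuming the whisper Python module is available; the .wsp path is hypothetical and would need to point at the actual metric on graphite1001):

```python
# Sketch: list empty 60s buckets in a whisper file over the last hour.
import time
import whisper

path = '/var/lib/carbon/whisper/kafka/cluster/analytics-eqiad/OneMinuteRate.wsp'  # hypothetical path
now = int(time.time())
(start, end, step), values = whisper.fetch(path, now - 3600, now)
for ts, value in zip(range(start, end, step), values):
    if value is None:
        print('missing datapoint at', time.strftime('%H:%M', time.localtime(ts)))
```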
Change 301635 had a related patch set uploaded (by Filippo Giunchedi):
statsite: flush to graphite every 30s
Mentioned in SAL [2016-07-28T16:49:59Z] <godog> bounce statsite on graphite1001 T101141
recap: jmxtrans pushes gauges to statsd every 15s; I can see the metrics being received by statsd-proxy and forwarded to statsite at roughly the same interval.
What doesn't seem to be happening reliably for jmxtrans gauges is the statsite -> carbon flush.
It looks like this did something to some of the metrics (though not all):
https://grafana.wikimedia.org/dashboard/db/media
The drop is immediate at 16:50, after the above log entry.
indeed, so the change in SAL was about changing the statsd -> graphite flush period from 60s to 30s. The metric for cached thumbnails served, for example, is based on sum with scaleToSeconds() applied afterwards; you could use the rate metric instead, which is always per-second.
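A back-of-the-envelope illustration of why the sum-based panels would halve at 16:50, under the assumption that statsite resets its counters on every flush and whisper keeps one value per 60s bucket (last write wins):

```python
# Rough illustration only: a counter receiving ~1200 events/minute.
events_per_minute = 1200

# With a 60s flush, each whisper bucket gets the full minute's sum.
sum_per_flush_60s = events_per_minute            # 1200 per datapoint

# With a 30s flush, each flush only covers half a minute, so the value landing
# in the 60s bucket is roughly halved -> graphs built on the sum series drop,
# while a per-second rate series is unaffected by the flush period.
sum_per_flush_30s = events_per_minute // 2       # ~600 per datapoint
rate_per_second = events_per_minute / 60         # 20/s regardless of flush period

print(sum_per_flush_60s, sum_per_flush_30s, rate_per_second)
```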
Anyway, the flush period change didn't have the results I expected, so I'm going to revert it now regardless.
Change 301765 had a related patch set uploaded (by Filippo Giunchedi):
Revert "statsite: flush to graphite every 30s"
Change 301765 merged by Filippo Giunchedi:
Revert "statsite: flush to graphite every 30s"
Mentioned in SAL [2016-07-29T08:51:03Z] <godog> switch back statsite flush period to 60s T101141
trying to track down where some metrics don't get flushed to graphite, taking kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate as an example.
jmxtrans -> statsd-proxy on graphite1001 (8125/udp)
root@graphite1001:~# timeout 3m ngrep -q -W byline . udp dst port 8125 | grep kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate:9229.955827609365|g
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate:9261.413268614526|g
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate:9294.83859068178|g
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate:9234.212510770094|g
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate:9208.05079053432|g
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate:9223.560965714945|g
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate:9244.575335172642|g
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate:9208.014260557184|g
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate:9200.133815280758|g
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate:9220.049515174913|g
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate:9175.8143242711|g
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate:9203.801788213077|g
Terminated
statsd-proxy -> statsite (8126-8131/udp)
root@graphite1001:~# timeout 3m ngrep -d lo -t -W byline kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate 'udp dst port (8126 or 8127 or 8128 or 8129 or 8130 or 8131)' | grep -e '^[TU] ' -e kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate
match: kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate
U 2016/07/29 13:40:03.798197 127.0.0.1:54905 -> 127.0.0.1:8131
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate:10034.088016421334|g
U 2016/07/29 13:40:18.838229 127.0.0.1:54905 -> 127.0.0.1:8131
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate:9981.073858063919|g
U 2016/07/29 13:40:33.718221 127.0.0.1:54905 -> 127.0.0.1:8131
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate:10003.718249726055|g
U 2016/07/29 13:40:48.728114 127.0.0.1:54905 -> 127.0.0.1:8131
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate:9989.917392425285|g
U 2016/07/29 13:41:03.728204 127.0.0.1:54905 -> 127.0.0.1:8131
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate:9990.644593542436|g
U 2016/07/29 13:41:18.738096 127.0.0.1:54905 -> 127.0.0.1:8131
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate:10035.694702257877|g
U 2016/07/29 13:41:33.848399 127.0.0.1:54905 -> 127.0.0.1:8131
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate:10001.913578578356|g
U 2016/07/29 13:41:48.818131 127.0.0.1:54905 -> 127.0.0.1:8131
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate:9986.1155052628|g
U 2016/07/29 13:42:03.718238 127.0.0.1:54905 -> 127.0.0.1:8131
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate:9985.875323647155|g
U 2016/07/29 13:42:18.718167 127.0.0.1:54905 -> 127.0.0.1:8131
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate:10029.00048466752|g
U 2016/07/29 13:42:33.738370 127.0.0.1:54905 -> 127.0.0.1:8131
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate:10033.396873696549|g
U 2016/07/29 13:42:48.708271 127.0.0.1:54905 -> 127.0.0.1:8131
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate:10036.771501899368|g
Terminated
capturing the traffic that goes statsite -> carbon-c-relay frontend -> local (2003 -> 1903), there's this:
root@graphite1001:~# timeout 20m ngrep -t -d lo -q -W byline kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate 'tcp dst port 1903' | grep -e '^[TU] ' -e kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate
match: kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate
T 2016/07/29 14:49:26.524723 10.64.32.155:42988 -> 10.64.32.155:1903 [A]
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate 9995.078023 1469803765
T 2016/07/29 14:51:26.509645 10.64.32.155:42988 -> 10.64.32.155:1903 [A]
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate 9966.901415 1469803885
T 2016/07/29 14:53:26.932847 10.64.32.155:42988 -> 10.64.32.155:1903 [A]
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate 9991.672961 1469804005
T 2016/07/29 14:54:26.417651 10.64.32.155:42988 -> 10.64.32.155:1903 [A]
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate 10001.118134 1469804065
T 2016/07/29 14:57:26.752286 10.64.32.155:42988 -> 10.64.32.155:1903 [A]
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate 9944.986556 1469804245
T 2016/07/29 14:58:27.142492 10.64.32.155:42988 -> 10.64.32.155:1903 [A]
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate 9906.701438 1469804305
T 2016/07/29 14:59:27.104136 10.64.32.155:42988 -> 10.64.32.155:1903 [A]
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate 9873.971604 1469804365
T 2016/07/29 15:00:26.704842 10.64.32.155:42988 -> 10.64.32.155:1903 [AP]
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate 9807.134333 1469804425
T 2016/07/29 15:01:27.078196 10.64.32.155:42988 -> 10.64.32.155:1903 [A]
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate 9852.840489 1469804485
T 2016/07/29 15:03:26.806722 10.64.32.155:42988 -> 10.64.32.155:1903 [A]
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate 9829.313463 1469804605
T 2016/07/29 15:07:27.892026 10.64.32.155:42988 -> 10.64.32.155:1903 [AP]
Terminated
and traffic statsd-proxy -> statsite at the same time
root@graphite1001:~# timeout 20m ngrep -t -d lo -q -W byline kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate 'tcp dst port 1903' | grep -e '^[TU] ' -e kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate
MinuteRate:9804.832992504611|g
U 2016/07/29 15:02:03.738260 127.0.0.1:54905 -> 127.0.0.1:8131
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate:9760.638489792827|g
U 2016/07/29 15:02:18.718536 127.0.0.1:54905 -> 127.0.0.1:8131
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate:9777.759951355441|g
U 2016/07/29 15:02:33.718558 127.0.0.1:54905 -> 127.0.0.1:8131
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate:9774.637898335885|g
U 2016/07/29 15:02:48.758135 127.0.0.1:54905 -> 127.0.0.1:8131
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate:9790.173290691353|g
U 2016/07/29 15:03:03.718092 127.0.0.1:54905 -> 127.0.0.1:8131
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate:9821.527271966477|g
U 2016/07/29 15:03:18.728119 127.0.0.1:54905 -> 127.0.0.1:8131
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate:9829.31346348762|g
U 2016/07/29 15:03:33.728333 127.0.0.1:54905 -> 127.0.0.1:8131
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate:9843.430718072295|g
U 2016/07/29 15:03:48.718521 127.0.0.1:54905 -> 127.0.0.1:8131
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate:9799.106728749253|g
U 2016/07/29 15:04:03.738372 127.0.0.1:54905 -> 127.0.0.1:8131
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate:9762.278110286614|g
U 2016/07/29 15:04:18.768362 127.0.0.1:54905 -> 127.0.0.1:8131
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate:9747.04034892221|g
U 2016/07/29 15:04:33.738529 127.0.0.1:54905 -> 127.0.0.1:8131
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate:9731.92816369084|g
U 2016/07/29 15:04:48.738312 127.0.0.1:54905 -> 127.0.0.1:8131
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate:9720.605189040985|g
U 2016/07/29 15:05:03.728253 127.0.0.1:54905 -> 127.0.0.1:8131
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate:9743.367768833981|g
U 2016/07/29 15:05:18.728121 127.0.0.1:54905 -> 127.0.0.1:8131
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate:9782.272634596275|g
U 2016/07/29 15:05:33.888288 127.0.0.1:54905 -> 127.0.0.1:8131
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate:9806.045824671204|g
U 2016/07/29 15:05:48.728199 127.0.0.1:54905 -> 127.0.0.1:8131
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate:9771.983558343027|g
U 2016/07/29 15:06:03.718235 127.0.0.1:54905 -> 127.0.0.1:8131
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate:9789.94828648672|g
U 2016/07/29 15:06:18.758332 127.0.0.1:54905 -> 127.0.0.1:8131
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate:9789.148823708021|g
U 2016/07/29 15:06:33.798181 127.0.0.1:54905 -> 127.0.0.1:8131
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate:9810.695610113127|g
U 2016/07/29 15:06:48.808179 127.0.0.1:54905 -> 127.0.0.1:8131
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate:9783.861943505917|g
U 2016/07/29 15:07:03.808230 127.0.0.1:54905 -> 127.0.0.1:8131
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate:9769.43363691246|g
U 2016/07/29 15:07:18.728360 127.0.0.1:54905 -> 127.0.0.1:8131
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate:9780.347978634356|g
U 2016/07/29 15:07:33.808352 127.0.0.1:54905 -> 127.0.0.1:8131
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate:9794.406386858032|g
U 2016/07/29 15:07:48.838092 127.0.0.1:54905 -> 127.0.0.1:8131
kafka.cluster.analytics-eqiad.kafka.kafka1014_eqiad_wmnet_9999.kafka.server.BrokerTopicMetrics.MessagesInPerSec.webrequest_text.OneMinuteRate:9772.183595136628|g
Terminated
e.g. 14:52, 14:55 and 14:56 do not show up inbound for carbon-c-relay; looks like it might be statsite
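To make the gaps easier to see, the two captures can be diffed by minute; a rough sketch, assuming the ngrep output above was saved to two files (both filenames are hypothetical):

```python
# Sketch: compare the minutes seen entering statsite vs. the minutes seen leaving for carbon-c-relay.
import re

def minutes(path):
    # Pull the HH:MM out of ngrep timestamps like "2016/07/29 14:52:03.798197".
    pat = re.compile(r'2016/07/29 (\d\d:\d\d):')
    with open(path) as f:
        return {m.group(1) for line in f for m in [pat.search(line)] if m}

inbound = minutes('statsd-proxy-to-statsite.log')   # hypothetical saved capture
outbound = minutes('statsite-to-relay.log')         # hypothetical saved capture
print('minutes never flushed to carbon-c-relay:', sorted(inbound - outbound))
```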
I don't fully understand what's going on here, but the problem is getting steadily worse, making a mess of graphs and averages. One quick thing I did notice, though:
The code should probably be using both socket options, rather than just one or the other. But I doubt that's the problem unless statsdlb is restarting constantly and failing to do so quickly...
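For illustration, setting both options before bind() would look like this (a Python sketch of the idea; statsdlb does the equivalent in C):

```python
# Sketch: set both SO_REUSEADDR and SO_REUSEPORT before binding the statsd port.
# SO_REUSEPORT (Linux >= 3.9) is what lets several processes share 8125/udp.
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
sock.bind(('0.0.0.0', 8125))
```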
@BBlack we've stopped using statsdlb in favour of statsd-proxy, but yeah I didn't see statsdlb restarting quickly even when we were using it.
An example of the problem in the tls-ciphers dashboard is missing datapoints, which show up as spikes in graphs, e.g. between Aug 16th and Aug 18th:
I believe these correspond to restarts of the carbon daemons, e.g. for https://gerrit.wikimedia.org/r/#/c/305238/
I've also noticed the udp inerrors come back and on a steady level after the 16th:
and according to SAL a varnish rolling restart was in progress at the time for T142810
We've been observing periodic elevated (>500/s) udp inerrors / buffer errors on graphite1001 since yesterday, _after_ having switched statsd traffic to graphite1004 in T196484. The only statsd client still sending traffic to graphite1001 is ores in this case, and errors are still elevated even with modest traffic (compared to the firehose of all udp statsd traffic).
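To put a number on the rate, the kernel UDP counters can be sampled over an interval; a sketch assuming the standard /proc/net/snmp Udp line format:

```python
# Sketch: sample UDP InErrors / RcvbufErrors from /proc/net/snmp and print the per-second rate.
import time

def udp_counters():
    with open('/proc/net/snmp') as f:
        lines = [l.split() for l in f if l.startswith('Udp:')]  # header line, then values line
    header, values = lines[0][1:], [int(v) for v in lines[1][1:]]
    return dict(zip(header, values))

before = udp_counters()
time.sleep(10)
after = udp_counters()
for key in ('InErrors', 'RcvbufErrors'):
    print(key, (after[key] - before[key]) / 10.0, 'per second')
```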