MediaWiki should send statsd metrics in batches
Closed, Resolved (Public)

Description

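While investigating T101141 (udp rcvbuferrors and inerrors on graphite1001), it occurred to me that MediaWiki should also be batching its statsd metrics. At the moment it doesn't: each MediaWiki metric arrives in its own datagram, while the varnish metrics are already sent several per datagram, separated by newlines (strace from statsdlb):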

recvfrom(3, "MediaWiki.image_cache.hit:1|c", 16777215, 0, NULL, NULL) = 29
recvfrom(3, "MediaWiki.jobqueue.pickup_delay.all:0|ms", 16777215, 0, NULL, NULL) = 40
recvfrom(3, "MediaWiki.resourceloader_cache.minify_css.hit:1|c", 16777215, 0, NULL, NULL) = 49
recvfrom(3, "varnish.clients.ssl.tlsv1:506|c\nvarnish.clients.ssl_cipher.ecdhe-ecdsa-aes128-sha256:112|c\nvarnish.clients.ssl_cipher.ecdhe-rsa-aes128-sha:38|c\nvarnish.clients.ssl_sessions.negotiated:1696|c\nvarnish.clients.ssl_cipher.ecdhe-ecdsa-aes128-gcm-sha256:1827|c\nv"..., 16777215, 0, NULL, NULL) = 680
recvfrom(3, "MediaWiki.resourceloader_build.user_options:0.10204315185547|ms", 16777215, 0, NULL, NULL) = 63
recvfrom(3, "MediaWiki.CirrusSearch.requestTime:15|ms", 16777215, 0, NULL, NULL) = 40
recvfrom(3, "MediaWiki.resourceloader_cache.minify_css.hit:1|c", 16777215, 0, NULL, NULL) = 49
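For illustration, batching on the sending side could look roughly like the sketch below. It is not MediaWiki's implementation (MediaWiki is PHP); the function names and the 1432-byte payload budget are assumptions made for this example. The only things taken from the trace above are the newline separator used by the varnish datagram and the metric lines themselves.

```python
import socket

# Assumed payload budget so that one datagram fits in a typical Ethernet
# frame; this value is illustrative and should be tuned to the real MTU.
MAX_PAYLOAD = 1432


def batch_metrics(lines, max_payload=MAX_PAYLOAD):
    """Pack statsd lines into newline-separated payloads.

    Each payload is at most max_payload bytes, except that a single line
    longer than the budget is still emitted on its own.
    """
    payloads = []
    current = b""
    for line in lines:
        encoded = line.encode("utf-8")
        candidate = encoded if not current else current + b"\n" + encoded
        if current and len(candidate) > max_payload:
            # Adding this line would overflow: flush and start a new payload.
            payloads.append(current)
            current = encoded
        else:
            current = candidate
    if current:
        payloads.append(current)
    return payloads


def send_batched(lines, host, port=8125):
    """Send the batched payloads over UDP, one datagram per payload."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        for payload in batch_metrics(lines):
            sock.sendto(payload, (host, port))
    finally:
        sock.close()
```

With the three distinct MediaWiki lines from the strace, `batch_metrics` produces a single 120-byte payload instead of three datagrams.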
fgiunchedi claimed this task.

Change 255720 had a related patch set uploaded (by Addshore):
Fix packet reduction in SamplingStatsdClient

https://gerrit.wikimedia.org/r/255720

Change 255720 merged by jenkins-bot:
Fix packet reduction in SamplingStatsdClient

https://gerrit.wikimedia.org/r/255720

Looks like this was deployed yesterday? I'm still seeing single-metric packets from the appservers:

@ ..a...I9<MediaWiki.jobqueue.inserts.cirrusSearchLinksUpdatePrioritized:1|c
09:40:47.186175 IP mw1255.eqiad.wmnet.46543 > graphite1001.eqiad.wmnet.8125: UDP, length 29
E..97.@.?...
@0Z
@ ......%..MediaWiki.image_cache.hit:1|c
09:40:47.186177 IP mw1006.eqiad.wmnet.45108 > graphite1001.eqiad.wmnet.8125: UDP, length 49
E..M..@.?...
@.$
@ ..4...9..MediaWiki.db.commit_masters:0.0078678131103516|ms
09:40:47.186180 IP mw1114.eqiad.wmnet.50468 > graphite1001.eqiad.wmnet.8125: UDP, length 29
E..9..@.?..f
@.^
@ ..$...%"=MediaWiki.image_cache.hit:1|c
09:40:47.186181 IP mw1007.eqiad.wmnet.60829 > graphite1001.eqiad.wmnet.8125: UDP, length 49
E..M..@.?.p.
@.%
@ ......9..MediaWiki.jobqueue.pickup_delay.refreshLinks:0|ms
09:40:47.186181 IP mw1210.eqiad.wmnet.60087 > graphite1001.eqiad.wmnet.8125: UDP, length 41
E..E..@.?...
@0&
@ ......1..MediaWiki.session.read:1.0688304901123|ms
09:40:47.186183 IP mw1033.eqiad.wmnet.57628 > graphite1001.eqiad.wmnet.8125: UDP, length 54
E..R..@.?.j.
@.?
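A quick way to check captured payloads like the ones above is to count how many metrics each datagram carries. The sketch below is a hypothetical helper, not part of any tool mentioned in this task; it assumes the plain `name:value|type` line format seen in the captures and ignores any optional sample-rate field.

```python
def parse_payload(payload):
    """Split one statsd datagram into (name, value, type) tuples."""
    metrics = []
    for line in payload.split("\n"):
        if not line:
            continue  # tolerate a trailing newline
        name, _, rest = line.partition(":")
        fields = rest.split("|")
        value = fields[0]
        metric_type = fields[1] if len(fields) > 1 else ""
        metrics.append((name, value, metric_type))
    return metrics


def is_batched(payload):
    """True when the datagram carries more than one metric."""
    return len(parse_payload(payload)) > 1
```

For example, `is_batched("MediaWiki.image_cache.hit:1|c")` is false, while a newline-joined payload such as the varnish one seen earlier would count as batched.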
ori added a comment. Dec 9 2015, 9:45 AM

Only to group 0 wikis; the main wikis are on the previous version for a few more hours.

The train has hit all wikis now. I'm still seeing some jobqueue-related traffic that isn't batched, but it looks fairly minor now:

09:50:25.047676 IP mw1009.eqiad.wmnet.39387 > graphite1001.eqiad.wmnet.8125: UDP, length 31
E..;:v@.?...
@.'
@ ......'n.jobrunner.some-full.mw1009:1|c
--
09:50:25.048325 IP mw1010.eqiad.wmnet.33675 > graphite1001.eqiad.wmnet.8125: UDP, length 54
E..R{w@.?...
@.(
@ ......>3fjobrunner.pop.wikibase-addUsagesForPage.ok.mw1010:1|c
--
09:50:25.049354 IP mw1004.eqiad.wmnet.44443 > graphite1001.eqiad.wmnet.8125: UDP, length 36
E..@.g@.?..     
@."
@ ......,L.jobrunner.prioritychange.mw1004:3|c
--
09:50:25.049699 IP mw1004.eqiad.wmnet.44443 > graphite1001.eqiad.wmnet.8125: UDP, length 31
E..;.h@.?..
@."
@ ......'Z.jobrunner.some-full.mw1004:1|c
--
09:50:25.051731 IP mw1013.eqiad.wmnet.51005 > graphite1001.eqiad.wmnet.8125: UDP, length 36
E..@%.@.?..V
@.+
@ ..=...,1Pjobrunner.pop.enqueue.ok.mw1013:1|c
--
09:50:25.051997 IP mw1015.eqiad.wmnet.33026 > graphite1001.eqiad.wmnet.8125: UDP, length 36
E..@7.@.?...
@.-
@ ......,x.jobrunner.prioritychange.mw1015:3|c
hashar added a subscriber: hashar. Dec 16 2015, 11:40 PM

jobrunner is a different system though; it lives in mediawiki/services/jobrunner.git
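To see which system the remaining unbatched packets come from, one could tally single-metric datagrams by the first component of the metric name. This is only a sketch: `unbatched_by_prefix` is a name made up here, and it assumes the payloads have already been extracted from a capture as strings.

```python
from collections import Counter


def unbatched_by_prefix(payloads):
    """Count single-metric datagrams per top-level metric prefix."""
    counts = Counter()
    for payload in payloads:
        lines = [line for line in payload.split("\n") if line]
        if len(lines) == 1:
            # The prefix is everything before the first dot,
            # e.g. "jobrunner" or "MediaWiki".
            counts[lines[0].split(".", 1)[0]] += 1
    return counts
```

Fed with the payloads from the last capture, every entry would fall under the `jobrunner` prefix, consistent with the traffic coming from the separate jobrunner service.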