Investigate apparent restbase request rate under-reporting in graphite: statsd issue?
Closed, Resolved · Public

Description

During load testing on the restbase/cassandra test cluster, graphite is reporting significantly lower request rates than expected from the sum of the client-reported request rates. The reported number is about 38% of the expected one.

This could be an issue in

  • graphite,
  • statsd,
  • restbase, or
  • the network (UDP packet loss).

The load on graphite1001 looks moderate. It might still be worth checking whether there is an issue with a hanging statsd thread or the like.
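A quick way to check both suspects from a shell on graphite1001 (a rough sketch; it assumes the collector process can be found by the name txstatsd and that sysstat's pidstat is available):

# kernel-side UDP drop counters; InErrors / RcvbufErrors growing steadily
# means datagrams arrive faster than the listening daemon drains its socket
grep '^Udp:' /proc/net/snmp

# CPU usage of the txstatsd process, sampled every 5s, 3 times; a
# single-threaded collector pegged near 100% of one core would explain
# metrics being silently dropped
pidstat -u -p "$(pgrep -f txstatsd | head -n 1)" 5 3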

Event Timeline

GWicke raised the priority of this task from to Needs Triage.
GWicke updated the task description.
GWicke added projects: RESTBase, acl*sre-team.
GWicke added subscribers: GWicke, ori, yuvipanda.
GWicke renamed this task from "Investigate apparent graphite request rate under-reporting" to "Investigate apparent restbase request rate under-reporting in graphite: statsd issue?". Feb 18 2015, 4:45 PM
GWicke triaged this task as Medium priority.
GWicke set Security to None.

Likely UDP packet drops, and txstatsd maxed out:

graphite1001:~$ while sleep 1 ; do grep Udp: /proc/net/snmp ; done
Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors InCsumErrors
Udp: 13043345140 9730618683 2795830101 16611466 128098037 0 0
Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors InCsumErrors
Udp: 13043356191 9730618719 2795863685 16611466 128098038 0 0
Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors InCsumErrors
Udp: 13043368451 9730618741 2795893997 16611466 128099479 0 0
Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors InCsumErrors
Udp: 13043384257 9730618776 2795920339 16611466 128099792 0 0
Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors InCsumErrors
Udp: 13043401934 9730618785 2795940445 16611466 128102562 0 0
Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors InCsumErrors
Udp: 13043421447 9730618857 2795961437 16611470 128103531 0 0
Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors InCsumErrors
Udp: 13043442740 9730618883 2795980767 16611479 128105614 0 0
Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors InCsumErrors
Udp: 13043465589 9730618898 2796000362 16611479 128106399 0 0
Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors InCsumErrors
Udp: 13043488385 9730618941 2796019361 16611480 128107120 0 0
Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors InCsumErrors
Udp: 13043509568 9730618986 2796040384 16611480 128107741 0 0
Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors InCsumErrors
Udp: 13043532847 9730618986 2796060643 16611480 128109464 0 0
Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors InCsumErrors
Udp: 13043555541 9730618994 2796079183 16611480 128111065 0 0
Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors InCsumErrors
Udp: 13043578344 9730618996 2796097080 16611480 128112979 0 0
Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors InCsumErrors
Udp: 13043601451 9730619040 2796113069 16611480 128114356 0 0
Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors InCsumErrors
Udp: 13043622627 9730619049 2796132950 16611480 128115342 0 0
Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors InCsumErrors
Udp: 13043645157 9730619100 2796150214 16611486 128115824 0 0
Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors InCsumErrors
Udp: 13043667254 9730619119 2796167870 16611486 128117644 0 0

This also makes me think that the way diamond believes this works isn't how it really works: the network drops graph at http://gdash.wikimedia.org/dashboards/graphite/ is flat even though the kernel counters above show errors climbing constantly.
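To turn the raw Udp: counters above into a per-second rate, one can sample /proc/net/snmp twice a fixed interval apart and divide the deltas by the interval. A rough sketch (field positions follow the Udp: header shown in the output above; the 10-second interval is arbitrary):

# grab InDatagrams ($2), InErrors ($4) and RcvbufErrors ($6) twice, 10s apart
s1=$(awk '$1 == "Udp:" && $2 ~ /^[0-9]+$/ {print $2, $4, $6}' /proc/net/snmp)
sleep 10
s2=$(awk '$1 == "Udp:" && $2 ~ /^[0-9]+$/ {print $2, $4, $6}' /proc/net/snmp)

# per-second deltas: error counters growing at a rate comparable to (or higher
# than) InDatagrams means a large share of the statsd traffic is being dropped
paste <(echo "$s1") <(echo "$s2") | awk '{
    printf "InDatagrams/s=%.0f InErrors/s=%.0f RcvbufErrors/s=%.0f\n",
           ($4 - $1) / 10, ($5 - $2) / 10, ($6 - $3) / 10 }'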

So, the short-term plan is to reduce txstatsd load by stopping MediaWiki metrics from going into txstatsd (cc @ori) and moving them to another statsd port to relieve pressure. A longish-term plan is in T89858 too, with more to come.

Done in https://gerrit.wikimedia.org/r/191372

The restbase request rate as reported by graphite increased with this patch:

[pasted_file: screenshot of the graphite request-rate graph (82 KB)]

It is still not quite at 100% of the expected rate, though. Is txstatsd still maxed out?

@ori, re https://gerrit.wikimedia.org/r/191372: Are requests to port 8135 handled by statsd rather than txstatsd?

No, that was just another way of turning off the metrics from MediaWiki; there wasn't anything listening on that port. I moved the metrics rather than actually turning them off so that I could experiment on graphite1001 with some actual load.
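This works because statsd metrics are plain fire-and-forget UDP datagrams: the sender typically gets no error back when nothing is listening on the destination port, so redirecting a metric stream to an unused port effectively discards it without touching the clients' code paths. A minimal illustration using bash's /dev/udp redirection (the metric name is a made-up placeholder; 8135 is the port discussed above):

# send one counter increment in the statsd line protocol to a port with no
# listener; the local write succeeds and the datagram is simply thrown away
echo -n "example.test.counter:1|c" > /dev/udp/localhost/8135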

GWicke claimed this task.

This is now resolved thanks to Ori's statsdlb work. Thank you, @ori!
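For context, the general idea behind a statsd load balancer such as statsdlb is to shard the incoming UDP stream across several statsd backend processes by hashing the metric name, so that all samples for a given metric keep landing on the same backend (and therefore aggregate correctly) while the parsing load is spread over multiple cores. A hand-wavy sketch of that routing decision (the hash function, backend count and port numbering are invented for illustration and are not statsdlb's actual implementation):

datagram="restbase.req.2xx:1|c"   # an example statsd datagram
n_backends=4                      # hypothetical number of backend statsd processes
name="${datagram%%:*}"            # hash the metric name only, not the value
idx=$(( $(printf '%s' "$name" | cksum | cut -d' ' -f1) % n_backends ))
echo "route ${name} -> backend #${idx} (e.g. port $((8126 + idx)))"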