wikibase_shared/<current_train_version>-wikidatawiki-hhvm:CacheAwarePropertyInfoStore memcached key not well distributed, causing excessive traffic
Closed, Duplicate · Public

Description

The SRE team noticed that a specific host (mc1023) is close to saturating its network uplink [1]. Further investigation of the Grafana graphs for the entire cluster [2] showed that this is a recurring pattern that seems to move from host to host. Running memkeys on mc1023, we found that the key

wikibase_shared/1_32_0-wmf_20-wikidatawiki-hhvm:CacheAwarePropertyInfoStore

is generating more than 600 Mbps of traffic. The fact that the train version is encoded in the key name supports the theory that the key name changes with each train and can be hashed to a different server, which would explain why the traffic seems to move from host to host.
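To illustrate the theory, here is a minimal sketch of how a key that embeds the train version can land on a different host after each deployment. It assumes a plain hash-modulo placement and a made-up host list (`cache-00` to `cache-17`); the actual hashing scheme used by the production memcached cluster is not stated in this task and is probably different. `pick_server`, `key_for` and `placements` are hypothetical helper names.

```python
import hashlib
from typing import Dict, Iterable, Sequence

# Hypothetical host list, for illustration only (not the real cluster).
SERVERS = [f"cache-{n:02d}" for n in range(18)]


def pick_server(key: str, servers: Sequence[str]) -> str:
    """Map a key to one server with hash-modulo placement (an assumption)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(servers)
    return servers[index]


def key_for(train_version: str) -> str:
    """Build the cache key in the format quoted in the description."""
    return (
        f"wikibase_shared/{train_version}"
        "-wikidatawiki-hhvm:CacheAwarePropertyInfoStore"
    )


def placements(train_versions: Iterable[str]) -> Dict[str, str]:
    """Return the server each train version's key would be placed on."""
    return {v: pick_server(key_for(v), SERVERS) for v in train_versions}


if __name__ == "__main__":
    # Each new train version yields a new key name, so the single hot key
    # can move to another host while still carrying all of its traffic.
    for version, host in placements(
        ["1_32_0-wmf_18", "1_32_0-wmf_19", "1_32_0-wmf_20"]
    ).items():
        print(version, "->", host)
```

Whatever the placement function is, the point is the same: all requests for this one key go to a single host at a time, and that host changes whenever the key name changes.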

This will cause an outage soon and needs to be fixed.

[1] https://grafana.wikimedia.org/dashboard/db/prometheus-machine-stats?panelId=8&fullscreen&orgId=1&var-server=mc1023&var-datasource=eqiad%20prometheus%2Fops&from=now-7d&to=now-1m

[2] https://grafana.wikimedia.org/dashboard/db/prometheus-cluster-breakdown?orgId=1&from=now-30m&to=now&var-datasource=eqiad%20prometheus%2Fops&var-cluster=memcached&var-instance=All

Event Timeline

Restricted Application added a subscriber: Aklapper. · Sep 11 2018, 8:16 PM
akosiaris triaged this task as High priority. · Sep 11 2018, 8:17 PM
akosiaris added a project: Performance-Team.

https://grafana.wikimedia.org/dashboard/db/t204083?orgId=1 shows the excessive traffic moving around the various memcached hosts over the past year.

jijiki added a subscriber: jijiki. · Sep 11 2018, 9:13 PM
mark added a comment. · Sep 11 2018, 9:17 PM

T97368 appears to be about the same issue.