
wikibase_shared/<current_train_version>-wikidatawiki-hhvm:CacheAwarePropertyInfoStore memcached key not well distributed, causing excessive traffic
Closed, Duplicate · Public

Description

The SRE team noticed that one host (mc1023) is close to saturating its network uplink [1]. The Grafana graphs for the entire memcached cluster [2] show that this is a recurring pattern that seems to move from host to host over time. Running memkeys on mc1023, we found that the key

wikibase_shared/1_32_0-wmf_20-wikidatawiki-hhvm:CacheAwarePropertyInfoStore

is responsible for >600 Mbps of traffic. The train version is encoded in the key name, which supports the theory that the key is renamed with each train and therefore hashed to a different server. That would explain why the traffic seems to follow hosts around.
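To illustrate the theory, here is a minimal sketch of how a versioned key name can land on a different server after each train. It assumes a plain hash-modulo placement over a hypothetical pool of 18 shards with made-up host names. The real placement is done by whatever hashing the production memcached proxy uses, which is not described in this task, so the specific hosts printed here mean nothing. Only the behaviour matters: one key always maps to exactly one server, and changing the version in the name can move it.

```python
import hashlib

# Hypothetical pool of 18 shards; the host names are invented for illustration.
SERVERS = [f"mc-example-{i:02d}" for i in range(18)]


def server_for(key: str, servers=SERVERS) -> str:
    """Map a key to a single server with a simple hash-modulo scheme.

    This is an assumed stand-in for the production hashing, not a copy of it.
    """
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return servers[int.from_bytes(digest[:8], "big") % len(servers)]


def property_info_key(train_version: str) -> str:
    """Build a key with the same shape as the one reported in this task."""
    return (
        f"wikibase_shared/{train_version}"
        "-wikidatawiki-hhvm:CacheAwarePropertyInfoStore"
    )


if __name__ == "__main__":
    # Each train produces a new key name, so the single hot key may move.
    for n in range(18, 23):
        key = property_info_key(f"1_32_0-wmf_{n}")
        print(key, "->", server_for(key))
```

Because every read of the property info goes through this one key, all of that traffic is served by whichever single host the current key name hashes to, no matter how many servers are in the pool.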

This will cause an outage soon and needs to be fixed.

[1] https://grafana.wikimedia.org/dashboard/db/prometheus-machine-stats?panelId=8&fullscreen&orgId=1&var-server=mc1023&var-datasource=eqiad%20prometheus%2Fops&from=now-7d&to=now-1m

[2] https://grafana.wikimedia.org/dashboard/db/prometheus-cluster-breakdown?orgId=1&from=now-30m&to=now&var-datasource=eqiad%20prometheus%2Fops&var-cluster=memcached&var-instance=All