
Audit Memcache load (Spring 2017)
Closed, Resolved, Public

Description

After T58602 was resolved, memcache bandwidth usage started dropping again, but now it is even worse than before.

https://ganglia.wikimedia.org/latest/graph.php?r=year&z=xlarge&c=Memcached+eqiad&m=cpu_report&s=by+name&mc=2&g=network_report

Details

Reference
bz72024

Event Timeline

bzimport raised the priority of this task to Medium. Nov 22 2014, 3:52 AM
bzimport set Reference to bz72024.
bzimport added a subscriber: Unknown Object (MLST).

If we check the result of memkeys (on eth0) to see the top keys on all 18 mc10** servers, we could check for:
a) The sites key to see if that problem is back
b) Any other key that has excessive or unexpected usage

Captured about 60 seconds worth of requests to mc1001 via tcpdump:

sudo tcpdump -i eth0 -s 500 -A -t port 11211 | cut -c 9- | grep gets > tmp.txt
# Separate WANCache since it typically has one more colon-separated segment before dbname/global
cat tmp.txt | grep WANCache > tmp.wan.txt
cat tmp.txt | grep -v WANCache > tmp.nonwan.txt
# Strip ids and hashes
cat tmp.wan.txt | sed 's/:[0-9a-f][0-9a-f][0-9a-f]\+/:*/g' > tmp.wan.norm.txt
cat tmp.nonwan.txt | sed 's/:[0-9a-f][0-9a-f][0-9a-f]\+/:*/g' > tmp.nonwan.norm.txt
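To illustrate what the normalization step does, here is a minimal sketch that runs the same sed expression over a few sample keys. The keys are made up for illustration (they are not taken from the capture), and `normalize_keys` is a hypothetical helper name. The `\+` in the pattern assumes GNU sed.

```shell
# Replace every colon-separated segment that starts with three or more
# lowercase hex characters with "*", as in the commands above.
normalize_keys() {
  sed 's/:[0-9a-f][0-9a-f][0-9a-f]\+/:*/g'
}

# Hypothetical sample keys, for illustration only.
printf '%s\n' \
  'WANCache:v:enwiki:revisiontext:textid:123456' \
  'WANCache:v:global:revision:enwiki:4242:987654' \
  'WANCache:v:enwiki:gadgets-definition:9:2' \
  | normalize_keys | cut -d':' -f4-
# revisiontext:textid:*
# revision:enwiki:*:*
# gadgets-definition:9:2
```

Note that segments shorter than three characters (such as `:9:2`) are left as-is, so keys with short numeric ids are not collapsed into a single bucket.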

EDIT: See next comment

Krinkle renamed this task from "find out what causes too high memcache io" to "Audit Memcache load (Spring 2017)". Mar 30 2017, 12:25 AM

Aggregated from mc10* (mc1001-mc1018) over approximately 60 seconds.

Popular WANCache keys
$ cat *.wan.norm.txt | cut -d':' -f4- | sort | uniq -c | sort -rn | head
2407183 revisiontext:textid:* 
1391617 file:* 
1090343 page:10:* 
 782408 page:content-model:* 
 687809 revision:enwiki:*:* 
 632893 revision:enwiktionary:*:* 
 521202 image_redirect:* 
 462218 revision:commonswiki:*:* 
 426820 page-restrictions:*:* 
 401333 gadgets-definition:9:2 
 248394 titleblacklist:normalized-unicode:* 
 222898 messages:en 
 216919 messages:en:hash:v1 
 150800 gadgets-definition:*
 144216 Wikimedia\Rdbms\LoadBalancer:server-read-only:* 
 126767 revision:zhwiki:*:* 
  94437 user:id:enwiki:* 
Popular local cluster keys
$ cat *.nonwan.norm.txt | cut -d':' -f2- | sort | uniq -c | sort -rn | head
 504279 Wikimedia\Rdbms\ChronologyProtector:*:v1 
 310501 preprocess-hash:*:1 
 130739 preprocess-hash:*:0 
 113401 textextracts:*:*:en:1:1 
  97314 pcache:idoptions:* 
  45978 page:last-dc-purge:* 
  22789 CacheAwarePropertyInfoStore 
  13967 flaggedrevs:includesSynced:* 
  11550 textextracts:*:*:es:1:1 
  11261 textextracts:*:*:en:1: 
   8685 OtherProjectsSites:* 
   4602 textextracts:*:*:de:1:1 
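The two listings above use the same counting pipeline and differ only in the first field kept by `cut` (4 for WANCache keys, 2 for local cluster keys). A minimal sketch of that pipeline as a reusable function follows. `top_keys` and its arguments are hypothetical names, and the sample input is made up for illustration.

```shell
# Count identical normalized keys and list the most frequent first.
#   $1: first colon-separated field to keep (4 for WANCache, 2 for local cluster)
#   $2: number of rows to print (defaults to 10, like plain `head`)
top_keys() {
  cut -d':' -f"$1"- | sort | uniq -c | sort -rn | head -n "${2:-10}"
}

# Hypothetical, already-normalized sample input.
printf '%s\n' \
  'WANCache:v:enwiki:file:*' \
  'WANCache:v:enwiki:revisiontext:textid:*' \
  'WANCache:v:dewiki:revisiontext:textid:*' \
  | top_keys 4 5
```

With this input, `revisiontext:textid:*` is listed first with a count of 2, followed by `file:*` with a count of 1. The `sort` before `uniq -c` matters because `uniq` only merges adjacent identical lines.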
  • Total number of gets captured: 13,929,151 lines (about 13.9 million)
  • WANCache: 12,397,808 (89%)
  • Local cluster: 1,531,343 (11%)
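As a quick check of the split above, the percentages can be recomputed from the line counts. This is a minimal sketch; `share` is a hypothetical helper, and the counts are the ones reported in this comment.

```shell
# Print part/total as a whole-number percentage.
share() {
  awk -v part="$1" -v total="$2" 'BEGIN { printf "%.0f%%\n", 100 * part / total }'
}

share 12397808 13929151   # WANCache      -> 89%
share 1531343 13929151    # Local cluster -> 11%
```

The two counts also add up to the reported total: 12,397,808 + 1,531,343 = 13,929,151.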

(Unrelated: during the capture, mc1005 received 0 memcached gets. Is it not pooled?)

Krinkle claimed this task.