Page MenuHomePhabricator

Investigate increase in GET ops registered by mcrouter for the mediawiki appserver cluster
Closed, ResolvedPublic0 Estimated Story Points

Assigned To
Authored By
elukey
May 17 2019, 5:39 PM
Referenced Files
F29279472: get.png
May 30 2019, 2:18 PM
F29184944: Screen Shot 2019-05-21 at 2.25.13 PM.png
May 21 2019, 12:26 PM
F29154957: mcrouter_ops.png
May 20 2019, 8:36 AM
F29090798: Screen Shot 2019-05-17 at 7.29.44 PM.png
May 17 2019, 5:39 PM
F29090814: Screen Shot 2019-05-17 at 7.30.54 PM.png
May 17 2019, 5:39 PM
Tokens
"Evil Spooky Haunted Tree" token, awarded by jijiki.

Description

While working on T223310, I noticed the following increase in GET ops registered by mcrouter on appservers:

Screen Shot 2019-05-17 at 7.29.44 PM.png (716×2 px, 291 KB)

On apis is not that heavy but still present:

Screen Shot 2019-05-17 at 7.30.54 PM.png (730×2 px, 397 KB)

From the memcached metrics I don't see the same increase, and if I break down the graph by mc host, I cannot see anything either it is difficult to spot the same increase since memcached aggregates get+gets calls in the same metric (gets in our case are a disproportionate amount compared to get).

If I break down the above mcrouter graphs by mw instance I can see that most of the appservers started to make more GET requests, not only few ones.

The only noticeable thing that I can see from the SAL is @jijiki increasing to 5% the percentage of traffic served by PHP7: https://tools.wmflabs.org/sal/log/AWp3wVrdOwpQ-3Pkw6oo
Zoom showing two bumps:

mcrouter_ops.png (597×886 px, 64 KB)

We know for sure that the increase is related to php7, but we still don't know why.

Event Timeline

elukey triaged this task as Medium priority.May 17 2019, 5:39 PM
elukey created this task.

I zoomed to check when the increase happened, and I noticed something interesting:

https://grafana.wikimedia.org/d/000000549/mcrouter?panelId=1&fullscreen&orgId=1&from=1556564681483&to=1556826259465&var-source=eqiad%20prometheus%2Fops&var-cluster=appserver&var-instance=All

The first bump seems to match to the SAL entry related to @jijiki raising to 1% the PHP7.2 anonymous users requests, meanwhile the second one (the biggest) seems to match the SAL entry related to PHP7.2 to 5%.

I don't see a clear increase in GET ops registered by memcached in its related dashboard, so it is difficult to track down the problem. I tried to dump some localhost traffic on mw1238 to port 11213 (mcrouter) but I need a bit more time to get something useful out of it.

@Addshore After writing the above entry I remembered that we discussed a while ago about a difference in behavior for memcached in hhvm and php7 by Wikidata traffic. Could it be related?

Not a very lucky morning:

  • I tried to build and deploy the memkeys package (currently only on mc hosts) on mw1238, to allow a better visualization of the traffic flowing from mediawiki to mcrouter on port 11213, but ended up in a segfault - T223863
  • used tcpdump, and noticed that numbers for get/gets command were not the ones that this graph reported. After checking a bit the mcrouter stats, I found that get and gets are returned by different metrics, and our prometheus exporter does not take that into account. Opened https://github.com/Dev25/mcrouter_exporter/pull/8 to upstream.

I still didn't find any trace of the big increase in GET traffic in memcached's graphs. Either the traffic is spread evenly among multiple shards or something else is going on.

I also tried to grab a few seconds pcap dump (via tcpdump) of localhost:11213 traffic on mw1238, removed gets and tried to see if the remaining get commands showed a big top talker, but up to now I didn't spot anything.

Grabbed a .pcap from mw1238 (09:26:41 -> 09:27:00 UTC, so ~20s of traffic) and tried to create a meaningful summary of the get traffic:

tcpdump -r test.pcap -A | grep -v gets| egrep -o "get\ .*" | sort | cut -d ":" -f 1,2,3,4 | uniq -c | sort -n -k 1
reading from file test.pcap, link-type EN10MB (Ethernet)
   1 get WANCache:t:bgwiki:MessageBlobStore
   1 get WANCache:t:bswiki:MessageBlobStore
   1 get WANCache:t:itwiki:user-quicktouched
   1 get WANCache:v:commonswiki:echo-notification-counts
   1 get WANCache:v:commonswiki:sidebar
   1 get WANCache:v:commonswiki:valid-tags-db WANCache
   1 get WANCache:v:commonswiki:valid-tags-hook WANCache
   1 get WANCache:v:dewiki:messages
   1 get WANCache:v:enwiki:cirrussearch-interwiki-matrix
   1 get WANCache:v:eswiki:file
   1 get WANCache:v:eswiki:image_redirect
   1 get WANCache:v:eswiki:page-restrictions
   1 get WANCache:v:eswiki:sidebar
   1 get WANCache:v:euwiki:MessageBlobStore
   1 get WANCache:v:itwiki:babel
   1 get WANCache:v:itwiki:echo-notification-counts
   1 get WANCache:v:itwiki:page-restrictions
   1 get WANCache:v:itwiki:sidebar
   1 get WANCache:v:itwiki:sitenotice
   1 get WANCache:v:itwiki:unpatrollable-page
   1 get WANCache:v:jawiki:SiteStats
   1 get WANCache:v:jawiki:sidebar
   1 get WANCache:v:nlwiki:page-restrictions
   1 get WANCache:v:plwiki:page-restrictions
   1 get WANCache:v:plwiki:sidebar
   1 get WANCache:v:ptwiki:SiteStats
   1 get WANCache:v:ptwiki:anonnotice
   1 get WANCache:v:ptwiki:page-restrictions
   1 get WANCache:v:ruwiki:page-restrictions
   1 get WANCache:v:viwiki:file
   1 get WANCache:v:viwiki:image_redirect
   1 get WANCache:v:viwiki:page-restrictions
   1 get WANCache:v:zhwiki:file
   2 get WANCache:v:commonswiki:pagetranslation
   2 get WANCache:v:commonswiki:preprocess-hash
   2 get WANCache:v:eswiki:messages
   2 get WANCache:v:frwiki:file
   2 get WANCache:v:frwiki:image_redirect
   2 get WANCache:v:frwiki:sidebar
   2 get WANCache:v:global:echo-notification-counts
   2 get WANCache:v:itwiki:messages
   2 get WANCache:v:itwiki:page
   2 get WANCache:v:jawiki:page-restrictions
   2 get WANCache:v:nlwiki:messages
   2 get WANCache:v:plwiki:messages
   2 get WANCache:v:ruwiki:page-content-model
   2 get WANCache:v:viwiki:messages
   2 get WANCache:v:zhwiki:sidebar
   3 get WANCache:t:commonswiki:user-quicktouched
   3 get WANCache:v:commonswiki:page
   3 get WANCache:v:commonswiki:spamblacklist
   3 get WANCache:v:enwiki:image_redirect
   3 get WANCache:v:ptwiki:messages
   3 get WANCache:v:zhwiki:page-restrictions
   4 get WANCache:t:global:echo-notification-updated
   4 get WANCache:v:commonswiki:page-restrictions
   4 get WANCache:v:frwiki:anonnotice
   4 get WANCache:v:frwiki:page
   4 get WANCache:v:frwiki:preprocess-hash
   4 get WANCache:v:jawiki:file
   4 get WANCache:v:jawiki:image_redirect
   4 get WANCache:v:jawiki:messages
   5 get WANCache:v:enwiki:file
   5 get WANCache:v:frwiki:page-restrictions
   5 get WANCache:v:global:cirrussearch-boost-templates
   5 get WANCache:v:nlwiki:image_redirect
   5 get WANCache:v:ptwiki:file
   5 get WANCache:v:ptwiki:image_redirect
   6 get WANCache:v:enwiki:page-restrictions
   7 get WANCache:v:nlwiki:file
   8 get WANCache:v:commonswiki:babel
   8 get WANCache:v:frwiki:messages
   8 get WANCache:v:viwiki:messages-big
   9 get WANCache:v:commonswiki:messages
  10 get WANCache:v:itwiki:messages-big
  11 get WANCache:v:nlwiki:messages-big
  14 get WANCache:v:nlwiki:preprocess-hash
  15 get WANCache:v:enwiki:sidebar
  15 get WANCache:v:eswiki:page-content-model
  15 get WANCache:v:nlwiki:page
  15 get WANCache:v:plwiki:page-content-model
  15 get WANCache:v:zhwiki:messages
  16 get WANCache:v:global:centralauth-user
  16 get WANCache:v:global:user
  16 get WANCache:v:itwiki:page-content-model
  16 get WANCache:v:ptwiki:preprocess-hash
  17 get WANCache:v:nlwiki:page-content-model
  17 get WANCache:v:ptwiki:messages-big
  17 get WANCache:v:ptwiki:page-content-model
  17 get WANCache:v:viwiki:page-content-model
  18 get WANCache:v:jawiki:preprocess-hash
  19 get WANCache:v:enwiki:messages
  21 get WANCache:v:global:globalpreferences
  22 get WANCache:v:global:Wikimedia\Rdbms\LoadBalancer
  26 get WANCache:v:commonswiki:page-content-model
  28 get WANCache:v:global:lag-times
  32 get WANCache:v:jawiki:page-content-model
  33 get WANCache:v:global:NameTableSqlStore
  34 get WANCache:v:eswiki:messages-big
  43 get WANCache:v:jawiki:page
  50 get WANCache:v:zhwiki:page-content-model
  59 get WANCache:v:global:resourceloader
  61 get WANCache:v:ptwiki:page
  63 get WANCache:v:enwiki:page-content-model
  66 get WANCache:v:frwiki:page-content-model
  76 get WANCache:v:plwiki:messages-big
  77 get WANCache:v:commonswiki:file
 125 get WANCache:v:global:revision-row-1.29
 131 get WANCache:v:global:BlobStore
 211 get WANCache:v:zhwiki:messages-big
 239 get WANCache:v:commonswiki:messages-big
 382 get WANCache:v:global:CacheAwarePropertyInfoStore

In total 2247 get commands for ~20s of traffic, ~100 ops (that is in line with what I see in the mcrouter graphs).

Switching off php7 confirmed it's the cause of the increased number of GETs.

I would like to first repackage and deploy the mccrouter exporter to include https://github.com/Dev25/mcrouter_exporter/pull/8 before we investigate this further.

In fact, I find it not really plausible that the numbers we report for hhvm make sense: having the same number of GET and SET requests seems strange.

Other 20s of traffic from mw1238 (dst port 11213 on localhost):

tcpdump -r test_after.pcap -A | grep -v gets| egrep -o "get\ .*" | sort | cut -d ":" -f 1,2,3,4 | uniq -c | sort -n -k 1
reading from file test_after.pcap, link-type EN10MB (Ethernet)
   1 get WANCache:t:fiwiki:MessageBlobStore
   1 get WANCache:t:nowiktionary:MessageBlobStore
   1 get WANCache:v:enwiki:cirrussearch-interwiki-matrix
   1 get WANCache:v:enwiki:image_redirect
   1 get WANCache:v:global:CentralNoticeChoiceData
   1 get WANCache:v:global:centralauth-user
   1 get WANCache:v:global:globalpreferences
   1 get WANCache:v:global:user
   1 get WANCache:v:jawiki:page-restrictions
   1 get WANCache:v:jawiki:sidebar
   1 get WANCache:v:nowiktionary:messages
   1 get WANCache:v:ptwiki:messages
   2 get WANCache:v:commonswiki:file
   2 get WANCache:v:enwiki:file
   2 get WANCache:v:enwiki:page-restrictions
   2 get WANCache:v:global:BlobStore
   2 get WANCache:v:jawiki:messages
   4 get WANCache:v:global:NameTableSqlStore
   4 get WANCache:v:global:Wikimedia\Rdbms\LoadBalancer
   5 get WANCache:v:global:cirrussearch-boost-templates
   9 get WANCache:v:global:lag-times
  12 get WANCache:v:enwiki:sidebar
  14 get WANCache:v:enwiki:messages
  15 get WANCache:v:jawiki:page-content-model
  22 get WANCache:v:enwiki:messages-big
  29 get WANCache:v:global:resourceloader
  32 get WANCache:v:enwiki:page-content-model
  49 get WANCache:v:jawiki:messages-big

Will also dump from another host to get more data.

20s of traffic from mw1239 (dst port 11213 on localhost):

tcpdump -r test_after.pcap -A | grep -v gets| egrep -o "get\ .*" | sort | cut -d ":" -f 1,2,3,4 | uniq -c | sort -n -k 1
reading from file test_after.pcap, link-type EN10MB (Ethernet)
   1 get WANCache:t:global:echo-notification-updated
   1 get WANCache:t:jawiki:MessageBlobStore
   1 get WANCache:t:kkwiki:MessageBlobStore
   1 get WANCache:t:plwiki:user-quicktouched
   1 get WANCache:t:zhwiki:MessageBlobStore
   1 get WANCache:v:arwiki:page-restrictions
   1 get WANCache:v:dewiki:file
   1 get WANCache:v:dewiki:image_redirect
   1 get WANCache:v:dewiki:page-restrictions
   1 get WANCache:v:dewiki:preprocess-hash
   1 get WANCache:v:dewiki:sidebar
   1 get WANCache:v:enwiki:MessageBlobStore
   1 get WANCache:v:eswiki:file
   1 get WANCache:v:eswiki:image_redirect
   1 get WANCache:v:eswiki:page-restrictions
   1 get WANCache:v:eswiki:sidebar
   1 get WANCache:v:frwiki:anonnotice
   1 get WANCache:v:frwiki:page
   1 get WANCache:v:frwiki:page-restrictions
   1 get WANCache:v:frwiki:preprocess-hash
   1 get WANCache:v:frwiki:sidebar
   1 get WANCache:v:jawiki:page-restrictions
   1 get WANCache:v:plwiki:page-restrictions
   1 get WANCache:v:viwiki:file
   1 get WANCache:v:viwiki:image_redirect
   1 get WANCache:v:viwiki:page-restrictions
   1 get WANCache:v:viwiki:sidebar
   1 get WANCache:v:zhwiki:SiteStats
   1 get WANCache:v:zhwiki:sidebar
   2 get WANCache:v:arwiki:messages
   2 get WANCache:v:arwiki:messages-big
   2 get WANCache:v:arwiki:page-content-model
   2 get WANCache:v:enwiki:image_redirect
   2 get WANCache:v:eswiki:messages
   2 get WANCache:v:frwiki:messages
   2 get WANCache:v:global:CentralNoticeChoiceData
   2 get WANCache:v:jawiki:page-content-model
   2 get WANCache:v:plwiki:messages
   2 get WANCache:v:plwiki:page-content-model
   2 get WANCache:v:viwiki:messages
   2 get WANCache:v:zhwiki:preprocess-hash
   3 get WANCache:v:dewiki:messages
   3 get WANCache:v:global:centralauth-user
   3 get WANCache:v:global:globalpreferences
   3 get WANCache:v:global:user
   3 get WANCache:v:zhwiki:page-restrictions
   5 get WANCache:v:enwiki:file
   5 get WANCache:v:enwiki:page-restrictions
   6 get WANCache:v:plwiki:messages-big
   6 get WANCache:v:zhwiki:image_redirect
  10 get WANCache:v:global:NameTableSqlStore
  10 get WANCache:v:zhwiki:file
  12 get WANCache:v:zhwiki:page
  13 get WANCache:v:global:Wikimedia\Rdbms\LoadBalancer
  15 get WANCache:v:commonswiki:file
  15 get WANCache:v:dewiki:page-content-model
  15 get WANCache:v:eswiki:page-content-model
  15 get WANCache:v:frwiki:page-content-model
  15 get WANCache:v:viwiki:page-content-model
  16 get WANCache:v:global:BlobStore
  16 get WANCache:v:global:revision-row-1.29
  17 get WANCache:v:zhwiki:messages
  18 get WANCache:v:enwiki:sidebar
  20 get WANCache:v:global:lag-times
  22 get WANCache:v:enwiki:messages
  34 get WANCache:v:eswiki:messages-big
  46 get WANCache:v:dewiki:messages-big
  52 get WANCache:v:global:resourceloader
  52 get WANCache:v:zhwiki:page-content-model
  55 get WANCache:v:viwiki:messages-big
  69 get WANCache:v:enwiki:page-content-model
 425 get WANCache:v:zhwiki:messages-big

Grabbed a sample of localhost traffic to port 11213 on mw1238 from 11:45:39 to 11:48:05 (146s).

tcpdump filters:

  • tcpdump -r test_2.pcap -A | grep -v gets| egrep -o "get\ .*" | wc -l -> 7071 -> 48 rps (inline with stats from mcrouter for the same time period)
  • tcpdump -r test_2.pcap -A | egrep -o "gets\ .*" | wc -l -> 456835 -> 3129 rps (in line with stats from mcrouter - new exporter just deployed on mw1238 showing also gets traffic)

Screen Shot 2019-05-21 at 2.25.13 PM.png (1×2 px, 235 KB)

After restoring traffic served by php-fpm, it is clear that the increase in GET is correlated with it. Very interesting that GETS remained stable.

I would suspect a look at the actual code of the extensions in php7 and hhvm is needed to clarify this, but my guess would be that HHVM uses GETS also outside of CAS operations.

Some clarification about:

From the memcached metrics I don't see the same increase, and if I break down the graph by mc host, I cannot see anything either.

I checked today and memcached reports only a single "get" metric, that includes get+gets calls (and that matches with the sum of all ops reported by mcrouter after https://github.com/Dev25/mcrouter_exporter/pull/8). This means that a rise in "get" ops is not visible currently in the memcache dashboard due to the disproportionate amount of gets vs get.

Interesting data that might support what Joe thinks (namely that HHVM for some reason uses more gets than get):

tcpdump -r test_3.pcap -A | grep -v gets| egrep -o "get\ .*" | sort | cut -d ":" -f 1,2,3,4 | uniq -c | sort -n -k 1 | tail
reading from file test_3.pcap, link-type EN10MB (Ethernet)
  98 get WANCache:v:global:NameTableSqlStore
 109 get WANCache:v:global:resourceloader
 111 get WANCache:v:dewiki:page-content-model
 113 get WANCache:v:commonswiki:file
 122 get WANCache:v:ruwiki:messages-big
 171 get WANCache:v:ruwiki:page
 255 get WANCache:v:global:CacheAwarePropertyInfoStore
 305 get WANCache:v:enwiki:page-content-model
 337 get WANCache:v:global:revision-row-1.29
 366 get WANCache:v:global:BlobStore
tcpdump -r test_3.pcap -A | egrep -o "gets\ .*" | sort | cut -d ":" -f 1,2,3,4 | uniq -c | sort -n -k 1 | tail
reading from file test_3.pcap, link-type EN10MB (Ethernet)
 674 gets WANCache:v:global:NameTableSqlStore
 708 gets WANCache:v:global:resourceloader
 850 gets WANCache:v:zhwiki:messages-big
 951 gets WANCache:v:enwiki:page-content-model
 973 gets WANCache:v:commonswiki:page
1371 gets WANCache:v:enwiki:page
1382 gets WANCache:v:commonswiki:file
3466 gets WANCache:v:global:revision-row-1.29
3782 gets WANCache:v:global:BlobStore
6477 gets WANCache:v:global:CacheAwarePropertyInfoStore

In this pcap, grabbed from mw1238, I compared get vs gets, and it seems that the traffic is similar, but in different volumes of course.

The best guess that I can give after checking the HHVM memcached extension's code is that the following might be the reason:

https://github.com/facebook/hhvm/blob/master/hphp/runtime/ext/memcached/ext_memcached.cpp#L508

If memcached_behavior_set (libmemcached) sets a global behavior, rather than a per request one, it gets turned on after the first getByKey call that sets the support_cas flag. This could in turn force all the next get calls to be issued as gets.

It shouldn't be too difficult to test in labs..