
Test onhost memcached performance and functionality
Closed, ResolvedPublic

Assigned To
None
Authored By
jijiki
Sep 27 2020, 7:58 PM

Description

In T244340 we discussed introducing an onhost memcached in order to further speed up our cache fetches.

In this test we will install memcached on mwdebug1001, listening on localhost:11210, and tell mcrouter to use it via a WarmUpRoute. Additionally, we will add a Prometheus exporter.

Functionality test:
Servers: mwdebug1001 (with onhost memcached), & mwdebug1002
Clients: mwdebug2001, mwbug2001

We will run the same subset of URLs against both servers

Performance testing in production:
Manually install a mcrouter config on mw2271 (appserver) and use the following config:

<snip>
  "onhost": {
    "servers": [
      "127.0.0.1:11210:ascii:plain"
    ]
  }
<snip>
  {
    "aliases": [
      "/codfw/mw/"
    ],
    "route": {
      "type": "OperationSelectorRoute",
      "operation_policies": {
        "get": {
          "type": "WarmUpRoute",
          "cold": "PoolRoute|onhost",
          "warm": {
            "failover": "PoolRoute|gutter",
            "failover_errors": [
              "tko"
            ],
            "failover_exptime": 600,
            "normal": "PoolRoute|codfw",
            "type": "FailoverWithExptimeRoute"
          },
          "exptime": 10
        }
      },
      "default_policy": {
        "failover": "PoolRoute|gutter",
        "failover_errors": [
          "tko"
        ],
        "failover_exptime": 600,
        "normal": "PoolRoute|codfw",
        "type": "FailoverWithExptimeRoute"
      }
    }
  },

In other words:

  • All GETs (except those for /*/mw-wan) are first looked up in the onhost memcached (cold). On a miss they are fetched from our memcached cluster (warm) and added (ADD) back to the onhost memcached with a max TTL of 10 seconds.
  • All other commands are routed to the memcached cluster or the gutter pool
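
To make the GET path above concrete, here is a minimal Python sketch of the cold/warm lookup as described in the two bullets. It is not mcrouter's implementation: `FakeCache` and `warm_up_get` are hypothetical names, both pools are simulated in memory, and the TTL is applied as a fixed 10 seconds rather than a cap.

```python
import time


class FakeCache:
    """In-memory stand-in for a memcached pool (illustration only)."""

    def __init__(self):
        self._data = {}

    def get(self, key, now=None):
        now = time.monotonic() if now is None else now
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if expires_at is not None and now >= expires_at:
            # Expired items behave like misses.
            del self._data[key]
            return None
        return value

    def set(self, key, value, ttl=None, now=None):
        now = time.monotonic() if now is None else now
        self._data[key] = (value, None if ttl is None else now + ttl)

    def add(self, key, value, ttl=None, now=None):
        """Store only if the key is absent, like memcached ADD."""
        if self.get(key, now) is not None:
            return False
        self.set(key, value, ttl=ttl, now=now)
        return True


def warm_up_get(key, cold, warm, exptime=10, now=None):
    """Look up `key` in the cold (onhost) pool first, then the warm (cluster) pool.

    On a warm hit the value is ADDed to the cold pool with `exptime` seconds.
    Returns (value, source) where source is "cold", "warm" or "miss".
    """
    value = cold.get(key, now)
    if value is not None:
        return value, "cold"
    value = warm.get(key, now)
    if value is not None:
        cold.add(key, value, ttl=exptime, now=now)
        return value, "warm"
    return None, "miss"
```

With a key present only in the warm pool, the first call is answered by the cluster, calls within the next 10 seconds are answered locally, and after that the key is fetched from the cluster again.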

dashboard: https://grafana.wikimedia.org/d/jI1SNdFMz/xxxx-effie-compare-mediawiki-hosts?orgId=1

Event Timeline

ArielGlenn triaged this task as Medium priority.Sep 28 2020, 9:46 AM

Change 630856 had a related patch set uploaded (by Effie Mouzeli; owner: Effie Mouzeli):
[operations/puppet@production] hieradata: enable onhost memcached on mwdebug1001

https://gerrit.wikimedia.org/r/630856

Change 630856 merged by Effie Mouzeli:
[operations/puppet@production] hieradata: enable onhost memcached on mwdebug1001

https://gerrit.wikimedia.org/r/630856

Change 631246 had a related patch set uploaded (by Effie Mouzeli; owner: Effie Mouzeli):
[operations/puppet@production] hieradata: enable onhost memcached on mw2271

https://gerrit.wikimedia.org/r/631246

Change 631246 merged by Effie Mouzeli:
[operations/puppet@production] hieradata: enable onhost memcached on mw2271

https://gerrit.wikimedia.org/r/631246

Mentioned in SAL (#wikimedia-operations) [2020-09-30T19:01:50Z] <effie> disable puppet on mw2271 and use onhost memcached - T263958

I installed memcached on mwdebug1001 and configured mcrouter as described in the task description. Functionality-wise, I didn't see any related errors in kibana while running the urllist.

  • Memcached on mwdebug1001 warmed up quite fast

image.png (1×3 px, 352 KB)

  • There was a noticeable difference between the host's RX network traffic and mwdebug1002's, but mcrouter's latency on mwdebug1001 increased. CPU, load avg and memory were similar

image.png (1×3 px, 460 KB)

Interestingly, even though we expected that mwdebug1001 would perform slightly better, graphs tell otherwise:

image.png (1×3 px, 475 KB)

It appears that p50 response times of mwdebug1002 were better, while avg, p95 and request rate were more or less the same

Graphs: mwdebug1001 vs mwdebug1002, mwdebug1001 memcached

I installed memcached on a mw2271 appserver and configured mcrouter as above. This experiment was surely more representative since this is a production server. mw2272 shares the same specs and role with mw2271, so it made sense to compare them.

  • Mcrouter's worst latency is nearly half on mw2272, network RX utilisation is definitely lower on mw2271, while CPU stats are pretty much the same. Memory consumption differs, which is expected as the onhost memcached fills up (that includes expired items as well).

image.png (558×1 px, 120 KB)

image.png (533×911 px, 106 KB)

  • The GET hit ratio is ~0.6 (production's is ~0.9+). The test didn't run long enough to see how far this would go, but on the other hand in production we are not forcing a TTL of 10s

image.png (1×3 px, 584 KB)

  • Lastly, just like in the mwdebug test, p50 response times of mw2271 were slightly worse

image.png (1×3 px, 1 MB)

Graphs: mw2271 vs mw2272, mw2271 memcached
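
For reference, the GET hit ratio mentioned above can be derived from the `get_hits` and `get_misses` counters in memcached's `stats` output. A small sketch follows; the helper names are hypothetical and the sample counters are made-up round numbers chosen to give 0.6, not values measured on mw2271.

```python
def parse_stats(raw):
    """Turn memcached `stats` output ("STAT name value" lines) into a dict."""
    stats = {}
    for line in raw.splitlines():
        parts = line.split(None, 2)
        if len(parts) == 3 and parts[0] == "STAT":
            stats[parts[1]] = parts[2]
    return stats


def get_hit_ratio(stats):
    """GET hit ratio = get_hits / (get_hits + get_misses)."""
    hits = int(stats["get_hits"])
    misses = int(stats["get_misses"])
    total = hits + misses
    return hits / total if total else 0.0


# Illustrative counters only (NOT taken from the test hosts).
SAMPLE_STATS = "STAT get_hits 600\r\nSTAT get_misses 400\r\nEND\r\n"

print(get_hit_ratio(parse_stats(SAMPLE_STATS)))
```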

Testing a single page

That said, onhost memcached is a clear winner in every respect when running an ab test against mw2271 (with onhost memcached) vs mw2272. The test rendered the Barack Obama page with 10000 requests at a concurrency of 20
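
A run with those parameters could be assembled roughly as below. This is only a sketch: `ab_command` is a hypothetical helper, the URL is a placeholder and not the one used in the test, and any extra options the actual run used are unknown. Only ab's standard `-n` (total requests) and `-c` (concurrency) flags are assumed.

```python
import shlex


def ab_command(url, requests=10000, concurrency=20):
    """Build an ApacheBench argument list: -n total requests, -c concurrency."""
    return ["ab", "-n", str(requests), "-c", str(concurrency), url]


# Placeholder URL (assumption); point it at the appserver under test.
cmd = ab_command("http://appserver.example/wiki/Barack_Obama")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd)` once for each host being compared.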

Network/CPU/Memory

image.png (483×1 px, 148 KB)

But it is evident that mw2271 served more requests per second, and faster:

image.png (802×1 px, 171 KB)

What happens when onhost memcached is unavailable? https://phabricator.wikimedia.org/T244340#6211682 @elukey @aaron

With the configuration in the description and using the mcrouter command line options we are already using, we asked mcrouter where a key would be retrieved from:

  • Normal operations: onhost memcached ONLINE, memcached cluster ONLINE, gutter pool ONLINE
get __mcrouter__.route(get,koko)
VALUE __mcrouter__.route(get,koko) 0 35
127.0.0.1:11210          # localhost
10.64.16.109:11211   # mc1026.eqiad.wmnet
  • Stopped onhost memcached instance: onhost memcached OFFLINE, memcached cluster ONLINE, gutter pool ONLINE
Oct  2 12:27:34 mwdebug1001 mcrouter[23942]: I1002 12:27:34.705948 23943 ProxyDestination.cpp:453] 127.0.0.1:11210 marked hard TKO. Total hard TKOs: 1; soft TKOs: 0. Reply: mc_res_connect_error
Oct  2 12:28:49 mwdebug1001 mcrouter[23942]: I1002 12:28:49.443056 23943 AsyncSocket.cpp:2194] AsyncSocket::handleConnect(this=0x7f1bf08a04f0, fd=40 host=127.0.0.1:11210) exception: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused)
get __mcrouter__.route(get,koko)
VALUE __mcrouter__.route(get,koko) 0 18
10.64.16.109:11211   # mc1026.eqiad.wmnet
END
  • Blocked access to the memcached cluster: onhost memcached OFFLINE, memcached cluster OFFLINE, gutter pool ONLINE
get __mcrouter__.route(get,koko)
VALUE __mcrouter__.route(get,koko) 0 18
10.64.32.101:11211   # mc-gp1002.eqiad.wmnet
END

I guess the above results make us happy!
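
The route checks above can be scripted. Below is a minimal sketch that sends the same `__mcrouter__.route` query over the memcached ASCII protocol and extracts the destinations from the reply. The function names are hypothetical, and the mcrouter listening port (11213 here) is an assumption, since it is not stated in this task.

```python
import socket


def parse_route_reply(reply):
    """Extract host:port destinations from a __mcrouter__.route reply."""
    destinations = []
    for line in reply.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing annotations
        if not line or line == "END" or line.startswith("VALUE "):
            continue
        destinations.append(line)
    return destinations


def query_route(key, host="127.0.0.1", port=11213, op="get", timeout=2.0):
    """Ask a running mcrouter where `op` on `key` would be routed.

    The port default is an assumption; use your mcrouter's listening port.
    """
    command = f"get __mcrouter__.route({op},{key})\r\n".encode()
    data = b""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(command)
        while not data.endswith(b"END\r\n"):
            chunk = sock.recv(4096)
            if not chunk:
                break
            data += chunk
    return parse_route_reply(data.decode())


# Reply from the "normal operations" case above.
NORMAL_REPLY = (
    "VALUE __mcrouter__.route(get,koko) 0 35\r\n"
    "127.0.0.1:11210\r\n"
    "10.64.16.109:11211\r\n"
    "END\r\n"
)
print(parse_route_reply(NORMAL_REPLY))
```

The first destination is the onhost instance and the second is the cluster shard. When only one destination comes back, the onhost pool has been taken out of the route.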

Onhost TKOs:
Looking at the logs, mcrouter TKOs the onhost memcached pool almost immediately:

$ sudo systemctl stop memcached; date;
Fri Oct  2 13:07:19 UTC 2020

Oct  2 13:07:19 mwdebug1001 mcrouter[23942]: I1002 13:07:19.068133 23943 ProxyDestination.cpp:453] 127.0.0.1:11210 marked hard TKO. Total hard TKOs: 1; soft TKOs: 0. Reply: mc_res_connect_error

and takes about a minute to add it back:

$ sudo systemctl start memcached; date;
Fri Oct  2 13:07:34 UTC 2020

Oct  2 13:08:28 mwdebug1001 mcrouter[23942]: I1002 13:08:28.482733 23943 ProxyDestination.cpp:453] 127.0.0.1:11210 unmarked TKO. Total hard TKOs: 0; soft TKOs: 0. Reply: mc_res_ok
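
To put a number on "about a minute", the TKO window can be computed from the two log lines above. The sketch below parses the glog-style timestamps that mcrouter emits. The regex covers only the two message forms seen in this task, the function names are hypothetical, and the year is passed in because the glog prefix does not carry it.

```python
import re
from datetime import datetime

# Matches e.g. "I1002 13:07:19.068133 23943 ProxyDestination.cpp:453] 127.0.0.1:11210 marked hard TKO"
TKO_RE = re.compile(
    r"[IWEF](\d{4}) (\d{2}:\d{2}:\d{2}\.\d{6}) \d+ \S+\] "
    r"(\S+) (marked hard TKO|unmarked TKO)"
)


def parse_tko_events(lines, year=2020):
    """Return (timestamp, destination, event) tuples for TKO log lines."""
    events = []
    for line in lines:
        match = TKO_RE.search(line)
        if not match:
            continue
        mmdd, clock, dest, event = match.groups()
        ts = datetime.strptime(f"{year}{mmdd} {clock}", "%Y%m%d %H:%M:%S.%f")
        events.append((ts, dest, event))
    return events


def tko_durations(events):
    """Seconds between each 'marked hard TKO' and the next 'unmarked TKO', per destination."""
    down_since = {}
    durations = []
    for ts, dest, event in sorted(events):
        if event == "marked hard TKO":
            down_since.setdefault(dest, ts)
        elif dest in down_since:
            durations.append((dest, (ts - down_since.pop(dest)).total_seconds()))
    return durations


LOG_LINES = [
    "Oct  2 13:07:19 mwdebug1001 mcrouter[23942]: I1002 13:07:19.068133 23943 ProxyDestination.cpp:453] 127.0.0.1:11210 marked hard TKO. Total hard TKOs: 1; soft TKOs: 0. Reply: mc_res_connect_error",
    "Oct  2 13:08:28 mwdebug1001 mcrouter[23942]: I1002 13:08:28.482733 23943 ProxyDestination.cpp:453] 127.0.0.1:11210 unmarked TKO. Total hard TKOs: 0; soft TKOs: 0. Reply: mc_res_ok",
]
print(tko_durations(parse_tko_events(LOG_LINES)))
```

For the lines above this gives a TKO window of roughly 69 seconds. Since memcached was started again at 13:07:34, mcrouter took roughly 54 seconds after the restart to unmark the destination.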