
Test onhost memcached performance and functionality
Open, Medium, Public

Description

In T244340 we discussed introducing an onhost memcached in order to further speed up our cache fetches.

In this test we will install memcached on mwdebug1001, listening on localhost:11210, and tell mcrouter to use it via the WarmUpRoute. Additionally, we will add a Prometheus exporter.

Functionality test:
Servers: mwdebug1001 (with onhost memcached), & mwdebug1002
Clients: mwdebug2001, mwbug2001

We will run the same subset of URLs against both servers

Performance testing in production:
Manually install a mcrouter config on mw2271 (appserver) and use the following config:

<snip>
	  "onhost": {
		"servers": [
		  "127.0.0.1:11210:ascii:plain"
		]
	  }
<snip>
  {
    "aliases": [
      "/codfw/mw/"
    ],
    "route": {
      "type": "OperationSelectorRoute",
      "operation_policies": {
        "get": {
          "type": "WarmUpRoute",
          "cold": "PoolRoute|onhost",
          "warm": {
            "failover": "PoolRoute|gutter",
            "failover_errors": [
              "tko"
            ],
            "failover_exptime": 600,
            "normal": "PoolRoute|codfw",
            "type": "FailoverWithExptimeRoute"
          },
          "exptime": 10
        }
      },
      "default_policy": {
        "failover": "PoolRoute|gutter",
        "failover_errors": [
          "tko"
        ],
        "failover_exptime": 600,
        "normal": "PoolRoute|codfw",
        "type": "FailoverWithExptimeRoute"
      }
    }
  },

In other words:

  • All GETs (except those for /*/mw-wan) are first looked up in the onhost memcached (cold). On a miss they are fetched from our memcached cluster (warm) and then written back (ADD) to the onhost memcached with a maximum TTL of 10 seconds.
  • All other commands are routed to the memcached cluster, or to the gutter pool on failover.
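
The GET path described above can be modelled in a few lines of Python. This is only a sketch of the lookup order and the short-TTL backfill, not mcrouter's implementation: `TTLCache` and `warmup_get` are hypothetical names, the caches are plain dictionaries, and the backfill happens inline here, whereas a real router may do it asynchronously.

```python
import time


class TTLCache:
    """Tiny in-memory stand-in for a memcached pool (illustration only)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._items = {}  # key -> (value, expiry timestamp or None)

    def get(self, key):
        entry = self._items.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if expires_at is not None and self._clock() >= expires_at:
            del self._items[key]  # lazily drop expired items
            return None
        return value

    def set(self, key, value, ttl=None):
        expires_at = None if ttl is None else self._clock() + ttl
        self._items[key] = (value, expires_at)

    def add(self, key, value, ttl=None):
        """Store only if the key is absent, like memcached ADD."""
        if self.get(key) is not None:
            return False
        self.set(key, value, ttl)
        return True


def warmup_get(key, cold, warm, exptime=10):
    """Try the cold (onhost) cache first, then fall back to the warm cluster."""
    value = cold.get(key)
    if value is not None:
        return value, "cold"
    value = warm.get(key)
    if value is not None:
        cold.add(key, value, ttl=exptime)  # backfill onhost with a short TTL
        return value, "warm"
    return None, "miss"


if __name__ == "__main__":
    now = [0.0]  # fake clock so the example does not need to sleep
    onhost = TTLCache(clock=lambda: now[0])
    cluster = TTLCache(clock=lambda: now[0])
    cluster.set("koko", "v1")
    print(warmup_get("koko", onhost, cluster))  # served by the cluster, then backfilled
    print(warmup_get("koko", onhost, cluster))  # served by the onhost cache
    now[0] = 11.0                               # past the 10-second TTL
    print(warmup_get("koko", onhost, cluster))  # back to the cluster
```

The short TTL bounds how stale an onhost copy can get, at the cost of sending a key back to the cluster at least once every 10 seconds per host.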

dashboard: https://grafana.wikimedia.org/d/jI1SNdFMz/xxxx-effie-compare-mediawiki-hosts?orgId=1

Event Timeline

jijiki created this task. · Sun, Sep 27, 7:58 PM
Restricted Application added a subscriber: Aklapper. · Sun, Sep 27, 7:58 PM
ArielGlenn triaged this task as Medium priority. · Mon, Sep 28, 9:46 AM
jijiki moved this task from Inbox 🐅 to Radar 📻 on the User-jijiki board. · Tue, Sep 29, 10:00 AM
jijiki moved this task from Radar 📻 to In Progress 🏋️‍♀️ on the User-jijiki board.

Change 630856 had a related patch set uploaded (by Effie Mouzeli; owner: Effie Mouzeli):
[operations/puppet@production] hieradata: enable onhost memcached on mwdebug1001

https://gerrit.wikimedia.org/r/630856

Change 630856 merged by Effie Mouzeli:
[operations/puppet@production] hieradata: enable onhost memcached on mwdebug1001

https://gerrit.wikimedia.org/r/630856

Change 631246 had a related patch set uploaded (by Effie Mouzeli; owner: Effie Mouzeli):
[operations/puppet@production] hieradata: enable onhost memcached on mw2271

https://gerrit.wikimedia.org/r/631246

Change 631246 merged by Effie Mouzeli:
[operations/puppet@production] hieradata: enable onhost memcached on mw2271

https://gerrit.wikimedia.org/r/631246

Mentioned in SAL (#wikimedia-operations) [2020-09-30T19:01:50Z] <effie> disable puppet on mw2271 and use onhost memcached - T263958

jijiki updated the task description. · Thu, Oct 1, 2:37 PM
jijiki updated the task description. · Thu, Oct 1, 2:40 PM
jijiki added a comment. · Edited · Thu, Oct 1, 3:31 PM

I installed memcached on mwdebug1001 and configured mcrouter as described in the task description. Functionality-wise, I didn't see any related errors in Kibana while running the URL list.

  • Memcached on mwdebug1001 warmed up quite fast

  • There was a noticeable difference between the host's RX network traffic and mwdebug1002's, but mcrouter's latency on mwdebug1001 increased. CPU, load average and memory were similar.

Interestingly, even though we expected mwdebug1001 to perform slightly better, the graphs say otherwise:

It appears that the p50 response times of mwdebug1002 were better, while the avg, the p95 and the request rate were more or less the same.

Graphs: mwdebug1001 vs mwdebug1002, mwdebug1001 memcached

jijiki added a comment. · Edited · Thu, Oct 1, 8:28 PM

I installed memcached on mw2271, an appserver, and configured mcrouter as above. This experiment was more representative, since mw2271 is a production server. mw2272 has the same specs and role as mw2271, so it made sense to compare the two.

  • Mcrouter's worst latency is nearly halved on mw227, and network TX utilisation is clearly lower on mw2271, while CPU stats are pretty much the same. Memory consumption differs, which is expected as the onhost memcached fills up (that includes expired items as well).

  • The GET hit ratio is ~0.6 (production's is ~0.9+). The test didn't run long enough to see how far this would go, but on the other hand in production we are not forcing a TTL of 10s.

  • Lastly, just like in the mwdebug test, p50 response times of mw2271 were slightly worse.
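
For reference, a GET hit ratio like the one quoted above is usually hits divided by total lookups. The sketch below derives it from the `get_hits` and `get_misses` counters in memcached's `stats` output. It is an assumption that the figures in this comment were computed this way (they may equally come from the exporter or from mcrouter), and the counter values in the example are made up for illustration.

```python
def get_hit_ratio(stats_text: str) -> float:
    """Compute get_hits / (get_hits + get_misses) from memcached `stats` output."""
    stats = {}
    for line in stats_text.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[0] == "STAT":
            stats[parts[1]] = parts[2]
    hits = int(stats["get_hits"])
    misses = int(stats["get_misses"])
    total = hits + misses
    # Avoid dividing by zero on a freshly started instance
    return hits / total if total else 0.0


if __name__ == "__main__":
    # Made-up counters, NOT measurements from mw2271
    sample = "STAT get_hits 600\r\nSTAT get_misses 400\r\nEND\r\n"
    print(get_hit_ratio(sample))
```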

Graphs: mw2271 vs mw2272, mw2271 memcached

What happens when onhost memcached is unavailable? https://phabricator.wikimedia.org/T244340#6211682 @elukey @aaron

With the configuration in the description, and using the mcrouter command line options we already use, we asked mcrouter where a key would be retrieved from:

  • Normal operations: onhost memcached ONLINE, memcached cluster ONLINE, gutter pool ONLINE
get __mcrouter__.route(get,koko)
VALUE __mcrouter__.route(get,koko) 0 35
127.0.0.1:11210          # localhost
10.64.16.109:11211   # mc1026.eqiad.wmnet
  • Stopped onhost memcached instance: onhost memcached OFFLINE, memcached cluster ONLINE, gutter pool ONLINE
Oct  2 12:27:34 mwdebug1001 mcrouter[23942]: I1002 12:27:34.705948 23943 ProxyDestination.cpp:453] 127.0.0.1:11210 marked hard TKO. Total hard TKOs: 1; soft TKOs: 0. Reply: mc_res_connect_error
Oct  2 12:28:49 mwdebug1001 mcrouter[23942]: I1002 12:28:49.443056 23943 AsyncSocket.cpp:2194] AsyncSocket::handleConnect(this=0x7f1bf08a04f0, fd=40 host=127.0.0.1:11210) exception: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused)
get __mcrouter__.route(get,koko)
VALUE __mcrouter__.route(get,koko) 0 18
10.64.16.109:11211   # mc1026.eqiad.wmnet
END
  • Blocked access to the memcached cluster: onhost memcached OFFLINE, memcached cluster OFFLINE, gutter pool ONLINE
get __mcrouter__.route(get,koko)
VALUE __mcrouter__.route(get,koko) 0 18
10.64.32.101:11211   # mc-gp1002.eqiad.wmnet
END
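
To run the three checks above repeatedly, the replies can be parsed into a list of destinations. The following is a minimal sketch: `parse_route_reply` is a hypothetical helper, and it assumes the reply follows the usual memcached ASCII framing (a `VALUE <key> <flags> <bytes>` header, the value, then `END`) with destinations separated by CRLF inside the value. That separator is inferred from the byte counts shown (35 = 15 + 2 + 18).

```python
def parse_route_reply(raw: bytes) -> list:
    """Extract destination host:port entries from a route-introspection reply."""
    header, _, rest = raw.partition(b"\r\n")
    parts = header.split()
    if len(parts) != 4 or parts[0] != b"VALUE":
        raise ValueError(f"unexpected reply header: {header!r}")
    length = int(parts[3])  # size of the value block in bytes
    payload = rest[:length]
    if rest[length:length + 2] != b"\r\n":
        raise ValueError("value block is not terminated by CRLF")
    return [line.decode("ascii") for line in payload.split(b"\r\n") if line]


if __name__ == "__main__":
    # Replies rebuilt from the outputs quoted above; the trailing END is assumed
    normal = (b"VALUE __mcrouter__.route(get,koko) 0 35\r\n"
              b"127.0.0.1:11210\r\n10.64.16.109:11211\r\nEND\r\n")
    gutter = (b"VALUE __mcrouter__.route(get,koko) 0 18\r\n"
              b"10.64.32.101:11211\r\nEND\r\n")
    print(parse_route_reply(normal))
    print(parse_route_reply(gutter))
```

Comparing the parsed list against the expected destinations for each scenario would make this check easy to repeat after a config change.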

I guess the above results make us happy!

Onhost TKOs:
Looking at the logs, mcrouter TKOs the onhost memcached pool almost immediately:

$ sudo systemctl stop memcached; date;
Fri Oct  2 13:07:19 UTC 2020

Oct  2 13:07:19 mwdebug1001 mcrouter[23942]: I1002 13:07:19.068133 23943 ProxyDestination.cpp:453] 127.0.0.1:11210 marked hard TKO. Total hard TKOs: 1; soft TKOs: 0. Reply: mc_res_connect_error

It takes about a minute to add it back:

$ sudo systemctl start memcached; date;
Fri Oct  2 13:07:34 UTC 2020

Oct  2 13:08:28 mwdebug1001 mcrouter[23942]: I1002 13:08:28.482733 23943 ProxyDestination.cpp:453] 127.0.0.1:11210 unmarked TKO. Total hard TKOs: 0; soft TKOs: 0. Reply: mc_res_ok
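
The delays can be read straight off the timestamps above. A small sketch, assuming the `date` output and the mcrouter log lines use the same clock and timezone; `parse_date_output` and `parse_glog_stamp` are hypothetical helper names. Since `date` only has one-second resolution and runs after `systemctl` returns, the results are approximate.

```python
from datetime import datetime

YEAR = 2020  # the glog prefix omits the year; taken from the `date` output


def parse_date_output(line: str) -> datetime:
    """Parse `date` output such as 'Fri Oct  2 13:07:34 UTC 2020'."""
    return datetime.strptime(" ".join(line.split()), "%a %b %d %H:%M:%S UTC %Y")


def parse_glog_stamp(stamp: str, year: int = YEAR) -> datetime:
    """Parse a glog prefix such as 'I1002 13:08:28.482733' (level, MMDD, time)."""
    return datetime.strptime(f"{year}{stamp[1:]}", "%Y%m%d %H:%M:%S.%f")


# Timestamps copied from the excerpts above
stopped = parse_date_output("Fri Oct  2 13:07:19 UTC 2020")
marked_tko = parse_glog_stamp("I1002 13:07:19.068133")
started = parse_date_output("Fri Oct  2 13:07:34 UTC 2020")
unmarked_tko = parse_glog_stamp("I1002 13:08:28.482733")

detection = marked_tko - stopped   # time until the pool was marked TKO
recovery = unmarked_tko - started  # time until the pool was added back

if __name__ == "__main__":
    print(f"marked TKO after   {detection.total_seconds():.3f}s")
    print(f"unmarked TKO after {recovery.total_seconds():.3f}s")
```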