
Update memcached package and configuration options
Closed, Resolved (Public)

Description

The release notes for memcached versions 1.4.24 and 1.4.25 describe several changes designed to improve performance. One of the most substantial is the introduction of a slab rebalancer / automover.

As data is stored into memcached, it pre-allocates pages of memory into slab classes of a particular size (e.g. 90 bytes, 120 bytes). If you fill your cache with 90-byte objects and then start writing 120-byte objects, there will be much less space available for the 120-byte objects. With the slab automover improvements, freed memory can be reclaimed back into a global pool and reassigned to new slab classes.
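As a rough illustration of how those slab classes are laid out, here is a small sketch approximating chunk-size progression for a given growth factor (the 96-byte base, 8-byte alignment, and 1MB page size are assumptions for illustration; the real values depend on build options and startup flags, and memcached also caps the number of classes):

# Sketch: approximate slab class sizes for a given chunk growth factor.
# Base size, alignment and page size are illustrative assumptions.
def slab_class_sizes(growth_factor, base=96, align=8, page_size=1024 * 1024):
    sizes = []
    size = base
    while size < page_size / 2:
        if size % align:  # chunk sizes are rounded up to the alignment
            size += align - (size % align)
        sizes.append(size)
        size = int(size * growth_factor)
    return sizes

# A small factor (e.g. 1.05) yields many finely spaced classes; a larger
# one (memcached's default is 1.25) yields fewer, coarser classes.
print(len(slab_class_sizes(1.05)))
print(len(slab_class_sizes(1.25)))

An item goes into the smallest class whose chunk fits it, so the spacing of the classes determines how much memory is wasted per item.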

This could be a major win for us, because our memcached objects (raw and parsed revisions) are highly irregular in size.

There is also a new LRU implementation, which was designed to provide better protection for hot items, to perform most evictions asynchronously using a background thread, and to reduce lock contention in read operations. This could also be a substantial win for us.

The wiki makes the following suggestion:

To get all of the benefits of the last few releases, we recommend adding the following startup options:

-o slab_reassign,slab_automove,lru_crawler,lru_maintainer

A modern start line includes a few other items:

-o slab_reassign,slab_automove,lru_crawler,lru_maintainer,maxconns_fast,hash_algorithm=murmur3

Many of these options are likely to become defaults in the future.
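For concreteness, a full invocation combining those options with the usual flags might look like the following sketch (the -m/-p/-u/-c values are placeholders, not our production settings):

memcached -m 16384 -p 11211 -u memcached -c 10000 -o slab_reassign,slab_automove,lru_crawler,lru_maintainer,maxconns_fast,hash_algorithm=murmur3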

I'd like to ask that someone in ops evaluate 1.4.25, and package / deploy it to all memcached servers if it works well.

Details

Related Gerrit Patches:
operations/puppet (production): Refactor memcached role to allow a more flexible hieradata config
operations/puppet (production): Upgrade memcached on mc2009 to 1.4.28
operations/puppet (production): Restore mc1007 memcached growth factor to 1.05 as the rest of the cluster.
operations/puppet (production): Deploy memcached 1.4.25 to mc1010 as part of a performance experiment.
operations/puppet (production): Fix memcached gmond module for Python syntax error.
operations/puppet (production): Add get_hits_ratio calculation to memcached's gmond agent.
operations/puppet (production): Add new suggested memcached settings to mc1009 as part of perf experiment.
operations/puppet (production): Re-enable refresh on unit file change for memcached.
operations/puppet (production): Add a space after each memcached extra command line option to ensure proper settings.
operations/puppet (production): Add new memcached features/settings to mc1009 as part of perf experiment.
operations/puppet (production): Raise the memcached chunk growth factor on mc1007 as part of a perf experiment.
operations/puppet (production): Add the possibility to specify memcached's chunk growth factor.
operations/puppet (production): Restore basic memcached settings to mc1009 as part of a performance test.
operations/puppet (production): Remove testing parameters/settings from mc1009's memcached.
operations/puppet (production): Remove duplicate of 'lru_crawler' in the mc[12]009 memcached configs.
operations/puppet (production): Configure mc1009 with the latest memcached version as performance test.
operations/puppet (production): Update memcached version on mc1009 as part of a performance test.
operations/puppet (production): Example of possible configuration to run mc2009 with the latest memcached version.
operations/puppet (production): memcached: on mc2010, set 'maxconns_fast', 'hash_algorithm=murmur3', 'lru_crawler'

Related Objects

Mentioned In
T217020: Test different growth factors for memcached (prep step for upgrade to newer versions)
T213089: Upgrade memcached for Debian Stretch/Buster
rOPUPc98279e0a913: Restore mc1007 memcached growth factor to 1.05 as the rest of the cluster.
rOPUPf3f8b34032d8: Restore mc1007 memcached growth factor to 1.05 as the rest of the cluster.
rOPUP384b0da2d7af: Example of possible configuration to run mc2009 with the latest memcached…
rOPUP6911ed5647e1: Configure mc1009 with the latest memcached version as performance test.
rOPUPb74052442f88: Update memcached version on mc1009 as part of a performance test.
rOPUP41cd373e871d: Configure mc1009 with the latest memcached version as performance test.
rOPUP26fc1e2b5c7d: Restore basic memcached settings to mc1009 as part of a performance test.
rOPUP5fc63b6f8483: Restore basic memcached settings to mc1009 as part of a performance test.
rOPUP1d1ba358e3df: Remove duplicate of 'lru_crawler' in the mc[12]009 memcached configs.
rOPUP0386c188740b: Remove testing parameters/settings from mc1009's memcached.
rOPUPe4a978dc262a: Add the possibility to specify memcached's chunk growth factor.
rOPUP6f7f7be49a8a: Add the possibility to specify memcached's chunk growth factor.
rOPUP50d0179cae3e: Add the possibility to specify memcached's chunk growth factor.
rOPUPc5f76191fb94: Add the possibility to specify memcached's chunk growth factor.
rOPUPb9336ce46f04: Add the possibility to specify memcached's chunk growth factor.
rOPUP30c3daa6f074: Add new memcached features/settings to mc1009 as part of perf experiment.
rOPUPd7b9e8ec05f4: Raise the memcached chunk growth factor on mc1007 as part of a perf experiment.
rOPUPef27e247faeb: Add new suggested memcached settings to mc1009 as part of perf experiment.
rOPUPfbbb05ae840f: Add get_hits_ratio calculation to memcached's gmond agent.
rOPUP95f068eee693: Deploy memcached 1.4.25 to mc1010 as part of a performance experiment.
rOPUP6703ca02d765: Deploy memcached 1.4.25 to mc1010 as part of a performance experiment.
rOPUP98be13f36e3f: Deploy memcached 1.4.25 to mc1010 as part of a performance experiment.
rOPUP45af66bc8041: Deploy memcached 1.4.25 to mc1010 as part of a performance experiment.
rOPUPcbbcac261bb3: Deploy memcached 1.4.25 to mc1010 as part of a performance experiment.
rOPUPdb3259977f53: Fix memcached gmond module for Python syntax error.
rOPUP4c60593a99ce: Provision Diamond collector for Memcached
rOPUP130561da2a58: Add get_hits_ratio calculation to memcached's gmond agent.
rOPUPc4c5a4d3764c: Add new suggested memcached settings to mc1009 as part of perf experiment.
rOPUP2e189f9d7b39: Re-enable refresh on unit file change for memcached.
rOPUPda96c40bc1ce: Add a space after each memcached extra command line option to ensure proper…
rOPUP308d8704f442: Raise the memcached chunk growth factor on mc1007 as part of a perf experiment.
rOPUPa7388feffd18: Add the possibility to specify memcached's chunk growth factor.
Mentioned Here
T137345: Rack/Setup new memcache servers mc1019-36
P2910 mem_wasted.py

Event Timeline


Hit ratio for today (remembering that some hosts got restarted during the last maintenance window):

mc1004.eqiad.wmnet: 0.9128341632
mc1013.eqiad.wmnet: 0.9004886832
mc1002.eqiad.wmnet: 0.9124152852
mc1016.eqiad.wmnet: 0.9187834295
mc1005.eqiad.wmnet: 0.9181986155
mc1010.eqiad.wmnet: 0.8806511072
mc1008.eqiad.wmnet: 0.9197412476
mc1014.eqiad.wmnet: 0.9127896803
mc1017.eqiad.wmnet: 0.920642405
mc1018.eqiad.wmnet: 0.917222886
mc1007.eqiad.wmnet: 0.9407402109
mc1006.eqiad.wmnet: 0.9162372236
mc1011.eqiad.wmnet: 0.9160034643
mc1003.eqiad.wmnet: 0.9067202868
mc1001.eqiad.wmnet: 0.914404811
mc1009.eqiad.wmnet: 0.9004796522
mc1012.eqiad.wmnet: 0.8925207704
mc1015.eqiad.wmnet: 0.9232272475

Stats for mc1009 before the upgrade to -o slab_reassign,slab_automove,lru_crawler,lru_maintainer:

mc1009_stats_settings_1463486759

mc1009_stats_slabs_stats_1463486759

mc1009_stats_slabs_conns_1463486759

I am going to merge https://gerrit.wikimedia.org/r/#/c/288951/1 to finally check the full potential of 1.4.25. I will use the following as terms of comparison:

  1. previous mc1009 stats, collected at various stages
  2. current mc1007 stats, because it is running with growth factor 1.15
  3. current mc1010 stats, which is running with the current 1.4.21 config

Since I'd prefer not to restart three hosts but only one, I'll just grab mc1007/10 stats one day before mc1009 to have the same "time of service".

elukey added a comment. Edited May 17 2016, 12:27 PM

Summary of evictions. Things to notice:

  • 0 corresponds to hosts restarted yesterday
  • mc1009 has been restarted multiple times, so it has a longer service time than the others (and sadly its control host was restarted yesterday)
mc1004.eqiad.wmnet:
    STAT evictions 69999283
mc1005.eqiad.wmnet:
    STAT evictions 94605787
mc1001.eqiad.wmnet:
    STAT evictions 78547649
mc1002.eqiad.wmnet:
    STAT evictions 94877684
mc1003.eqiad.wmnet:
    STAT evictions 97879845
mc1008.eqiad.wmnet:
    STAT evictions 24342022
mc1007.eqiad.wmnet:
    STAT evictions 0
mc1018.eqiad.wmnet:
    STAT evictions 93724574
mc1006.eqiad.wmnet:
    STAT evictions 88435752
mc1017.eqiad.wmnet:
    STAT evictions 81581919
mc1009.eqiad.wmnet:
    STAT evictions 8966582
mc1016.eqiad.wmnet:
    STAT evictions 123570821
mc1014.eqiad.wmnet:
    STAT evictions 96344914
mc1013.eqiad.wmnet:
    STAT evictions 0
mc1012.eqiad.wmnet:
    STAT evictions 0
mc1011.eqiad.wmnet:
    STAT evictions 94893219
mc1010.eqiad.wmnet:
    STAT evictions 0
mc1015.eqiad.wmnet:
    STAT evictions 0

Change 288951 merged by Elukey:
Add new suggested memcached settings to mc1009 as part of perf experiment.

https://gerrit.wikimedia.org/r/288951

Mentioned in SAL [2016-05-17T13:22:42Z] <elukey> memcached restarted on mc1009 with -o slab_reassign,slab_automove,lru_crawler,lru_maintainer as part of a perf experiment (T129963)

elukey added a comment. Edited May 18 2016, 11:57 AM

I took a look at Ganglia's mem_report and all the caches seem to have almost recovered from the last restart event, except of course mc1009, which was restarted only yesterday. I would be inclined to compare mc1009 (1.4.25 + gf 1.15) vs mc1007 (1.4.21 + gf 1.15) vs mc1010 (1.4.21 + gf 1.05) in a couple of days. Please remember that the procedure will be staggered: mc1010/mc1007 first, then mc1009 the day after, to respect the different restart times.

mc1009 stats

Memcached 1.4.25 brought new things that we enabled:

  • lru_crawler
The LRU Crawler is an optional background thread which will walk from the tail
toward the head of requested slab classes, actively freeing memory for expired
items. This is useful if you have a mix of items with both long and short
TTL's, but aren't accessed very often. This system is not required for normal
usage, and can add small amounts of latency and increase CPU usage.

elukey@mc1009:~$ echo stats | nc localhost 11211 -q2  | grep crawler
STAT lru_crawler_running 0
STAT lru_crawler_starts 16868
STAT crawler_reclaimed 9219943
STAT crawler_items_checked 1009919728
  • lru_maintainer

https://github.com/memcached/memcached/blob/master/doc/new_lru.txt

elukey@mc1009:~$ echo stats | nc localhost 11211 -q2  | egrep 'maintainer|moves'
STAT lru_maintainer_juggles 23061049
STAT moves_to_cold 11454503
STAT moves_to_warm 7447
STAT moves_within_lru 7138957
  • slab_automove and reassign

https://github.com/memcached/memcached/pull/113

elukey@mc1009:~$ echo stats | nc localhost 11211 -q2  | grep slab
STAT slab_reassign_rescues 3213797
STAT slab_reassign_evictions_nomem 0
STAT slab_reassign_inline_reclaim 997
STAT slab_reassign_busy_items 137
STAT slab_reassign_running 0
STAT slabs_moved 29939
STAT slab_global_page_pool 0

These features look very useful: they should let memcached reallocate memory across slab classes without requiring restarts.

Preliminary stats:

elukey@mc1009:~$ echo stats | nc localhost 11211 -q2  | egrep 'eviction|curr_items|total_items|get_hit|get_miss'
STAT get_hits 427568131
STAT get_misses 57852192
STAT slab_reassign_evictions_nomem 0
STAT curr_items 12809333
STAT total_items 28398292
STAT evictions 0
elukey@mc1009:~$ echo stats | nc localhost 11211 -q2 | egrep 'get_hits|get_misses' | cut -d " " -f 3 | sed -e 's/\r//' | paste -d " " - - | awk '{print $1"/("$1"+"$2")"}' | bc -l
.88082306798354187842

Still need to catch up on elements stored and hit ratio; let's see over the next couple of days.

Re-checked Ganglia: mem_report for mc1009 seems to have reached a stable state, but the memory allocated is around 44GB rather than 80+ like its peers in the cluster. I checked some metrics (please remember that mc1010/mc1007 have been running for a day longer than mc1009):

  1. The get hit ratio keeps growing, but slowly: it gained about one point (+0.01) since the last check. The cluster average is around 0.91, but sadly I don't have mc1009's starting value since I didn't grab it before upgrading to 1.4.25 (I didn't even know about these metrics at the time, shame on me).
elukey@mc1009:~$ echo stats | nc localhost 11211 -q2 | egrep 'get_hits|get_misses' | cut -d " " -f 3 | sed -e 's/\r//' | paste -d " " - - | awk '{print $1"/("$1"+"$2")"}' | bc -l
.89007378368972122257
  2. The number of items currently stored and "seen" shows something interesting: mc1009 is a day behind, but it has seen ~50M items and kept only ~15M, while the others have seen ~80M and kept ~30M, so mc1009 retains a smaller fraction of the items it has seen.
elukey@neodymium:~$ sudo -i salt -t 120 mc10[01][079]* cmd.run 'echo "stats" | nc localhost 11211 -q 2 | egrep "curr_items|total_items"'
mc1007.eqiad.wmnet:
    STAT curr_items 29810762
    STAT total_items 80771223
mc1009.eqiad.wmnet:
    STAT curr_items 15915232
    STAT total_items 53759326
mc1010.eqiad.wmnet:
    STAT curr_items 31258642
    STAT total_items 83053704
  3. The previous point with more stats, including the new lru_crawler metrics.
elukey@neodymium:~$ sudo -i salt -t 120 mc10[01][079]* cmd.run 'echo "stats" | nc localhost 11211 -q 2 | egrep "items|crawler|unfetched|reassign|move"'
mc1007.eqiad.wmnet:
    STAT slab_reassign_running 0
    STAT slabs_moved 0
    STAT curr_items 29810805
    STAT total_items 80909350
    STAT expired_unfetched 5751226
    STAT evicted_unfetched 3313332
    STAT crawler_reclaimed 0
mc1017.eqiad.wmnet:
    STAT curr_items 32938742
    STAT total_items 556545645
    STAT expired_unfetched 92287835
    STAT evicted_unfetched 46505297
    STAT crawler_reclaimed 0
mc1010.eqiad.wmnet:
    STAT slab_reassign_running 0
    STAT slabs_moved 0
    STAT curr_items 31258629
    STAT total_items 83196330
    STAT expired_unfetched 5738716
    STAT evicted_unfetched 3238702
    STAT crawler_reclaimed 0
mc1009.eqiad.wmnet:
    STAT slab_reassign_rescues 10578984
    STAT slab_reassign_evictions_nomem 0
    STAT slab_reassign_inline_reclaim 2463
    STAT slab_reassign_busy_items 446
    STAT slab_reassign_running 0
    STAT slabs_moved 71405
    STAT lru_crawler_running 0
    STAT lru_crawler_starts 24412
    STAT curr_items 15937135
    STAT total_items 53915317
    STAT expired_unfetched 15858396
    STAT evicted_unfetched 0
    STAT crawler_reclaimed 21541362
    STAT crawler_items_checked 2259508010
    STAT moves_to_cold 18261616
    STAT moves_to_warm 16205
    STAT moves_within_lru 12625957
  4. I also checked stats slabs, comparing chunk size and total chunks for each host, and didn't see anything odd (there are differences, of course, but mc1009 looks fine).

I want to follow up with the memcached devs to see if this behavior is somehow expected. The observations above are only food for thought; final considerations will be made by comparing the final snapshots of all the hosts. I will snapshot mc1007/mc1010 today and mc1009 tomorrow.

Snapshots:

mc1007_stats_1463649014
mc1007_stats_slabs_1463649014
mc1007_stats_settings_1463649014
mc1007_stats_conns_1463649014

mc1010_stats_1463649014
mc1010_stats_slabs_1463649014
mc1010_stats_settings_1463649014
mc1010_stats_conns_1463649014

Also, for the record:

elukey@neodymium:~$ sudo -i salt -t 120 mc10[01][079]* cmd.run 'echo "stats" | nc localhost 11211 -q 2 | grep uptime'
mc1007.eqiad.wmnet:
    STAT uptime 263690
mc1010.eqiad.wmnet:
    STAT uptime 261916
mc1009.eqiad.wmnet:
    STAT uptime 158626

Also preliminary stats for mc1009 (the snapshot to be used with mc1010,mc1007 will be taken tomorrow since mc1009 has been running one day less than the others):

mc1009_stats_1463649014
mc1009_stats_slabs_1463649014
mc1009_stats_settings_1463649014
mc1009_stats_conns_1463649014

elukey added a comment. Edited May 20 2016, 10:01 AM

Reporting a conversation with dormando on the #memcached Freenode channel: https://phabricator.wikimedia.org/P3153

Comments are about mc1009's latest stats - https://phabricator.wikimedia.org/T129963#2308421

Important pre-read about LRU: https://github.com/memcached/memcached/blob/master/doc/new_lru.txt

Highlights:

  • The LRU crawler has collected 22M expired items from the LRU.
  • Before these new features, stacks of expired items mixed with unexpired ones could cause you to evict less-used unexpired items from the tail of the LRU.
  • About the new HOT/WARM/COLD LRUs: you have enough items to overflow from HOT (fixed at 32% of memory), but only 19M items out of the 56M total_items seen by the cache ever made it to COLD.
  • From COLD, only 17,068 items were ever hit a second time and lived long enough to be seen again by the algorithm.
  • These new patches mostly help with longer-tail items: things that are hit infrequently used to be evicted when they reached the bottom of the LRU, but now that memory is aggressively reclaimed they stick around more often.

@ori: https://github.com/memcached/memcached/pull/127 - might be really nice to have.

So, from what the memcached developer can see in our stats, 1.4.25 is working as designed.

elukey added a comment. Edited May 20 2016, 11:45 AM

Compared chunk size vs. number of chunks for the hosts under testing to get a visual difference (I tried to combine the graphs, but my spreadsheet skills are almost nonexistent, so I gave up):

mc1009 - growth factor 1.15 - memcached 1.4.25

mc1010 - growth factor 1.05 - memcached 1.4.21

mc1007 - growth factor 1.15 - memcached 1.4.21
ERRATA: x axis is "Chunk size"

Data grabbed with:

echo stats slabs | nc localhost 11211 -q2 | egrep 'chunk_size|total_chunks' \
  | cut -d " " -f 3 | sed -e 's/\r//' | paste - -

Finally the Snapshots:

mc1007 - growth factor 1.15 - memcached 1.4.21

mc1007_stats_1463649014
mc1007_stats_slabs_1463649014
mc1007_stats_settings_1463649014
mc1007_stats_conns_1463649014

mc1010 - growth factor 1.05 - memcached 1.4.21

mc1010_stats_1463649014
mc1010_stats_slabs_1463649014
mc1010_stats_settings_1463649014
mc1010_stats_conns_1463649014

mc1009 - growth factor 1.15 - memcached 1.4.25

mc1009_stats_1463754173
mc1009_stats_slabs_1463754173
mc1009_stats_settings_1463754173
mc1009_stats_conns_1463754173
mc1009_stats_items_1463754173

Also, for the record:

elukey@neodymium:~$ sudo -i salt -t 120 mc10[01][079]* cmd.run 'echo "stats" | nc localhost 11211 -q 2 | grep uptime'
mc1007.eqiad.wmnet:
    STAT uptime 263690
mc1010.eqiad.wmnet:
    STAT uptime 261916
mc1009.eqiad.wmnet:
    STAT uptime 263016
elukey added a comment. Edited May 21 2016, 9:34 AM

Some mc1009 metrics from today:

  1. Memory consumption is still growing, very slowly; it will probably reach the 80G limit in a few days. Good for comparing mc1009 allocations with and without the new LRU features enabled.

  2. Hit ratio landed in the 0.9 zone and is still growing.
  3. Evictions are zero, but LRU activity cleaning up expired items and garbage in the slab LRUs is still very high:
STAT slab_reassign_rescues 28312054
STAT slab_reassign_evictions_nomem 0
STAT slab_reassign_inline_reclaim 6337
STAT slab_reassign_busy_items 1245
STAT slab_reassign_running 0
STAT lru_crawler_running 0
STAT lru_crawler_starts 35021
STAT lru_maintainer_juggles 77756899
STAT expired_unfetched 32984788
STAT evictions 0
STAT crawler_reclaimed 46786973
STAT crawler_items_checked 4865848292
STAT lrutail_reflocked 13788
STAT moves_to_cold 31028256
STAT moves_to_warm 29814
STAT moves_within_lru 24410728
  4. A lot of items are stored in memcached (total_items) but only a few stay in the longer-lived part of the cache (curr_items). This would be bad if evictions were also high, but there are none (the LRU crawler cleans up expired items in the background before evictions are needed):
STAT curr_items 22721384
STAT total_items 111805622
STAT evictions 0
STAT crawler_reclaimed 46786973
STAT expired_unfetched 32984788

As we were saying on IRC, this "slow" behavior seems to be a risk for us rather than a win, but it gives us some points for reflection. The new features memcached offers might not perform as we expect because they "suffer" from the way we store items, for example setting low TTLs (I'm very ignorant on this point, but the expired_unfetched metric seems to point in this direction).


I had a chat with dormando on IRC this morning about the above paragraph, all saved in https://phabricator.wikimedia.org/P3165.

He suggests measuring the hit ratio continuously over the day with fixed time windows to catch differences, rather than calculating it since the last restart: for example, sample get_hits and get_misses every 30 seconds, subtract the previous get_hits from the current one (and likewise for get_misses), and compute the hit ratio from the deltas (e.g. a window that adds 9,000 hits and 1,000 misses has a windowed ratio of 9000/(9000+1000) = 0.9). Moreover, zero evictions should be a good indicator that the hit ratio shouldn't be worse than before, only equal or better.

ori added a comment. May 23 2016, 11:05 AM

> He suggests measuring the hit ratio continuously over the day with fixed time windows to catch differences, rather than calculating it since the last restart: for example, sampling get_hits and get_misses every 30 seconds and computing the hit ratio from the deltas.

Sounds good. I hacked together this script, which I'm now running in a screen session on tin:

tin:/home/ori/mc/window/window.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

import contextlib
import random
import telnetlib
import time


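# Sample all eighteen memcached shards (mc1001 through mc1018).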
servers = [('mc10%02d.eqiad.wmnet' % n) for n in range(1, 19)]

while 1:
    random.shuffle(servers)
    for server in servers:
        ts = int(time.time())
        try:
            with contextlib.closing(telnetlib.Telnet(server, 11211)) as client:
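                # Dump the full 'stats' output and save it raw,
                # one file per host per sampling round.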
                client.write('stats\n')
                data = client.read_until('END')
                name = server.split('.')[0]
                with open('%s.%s.txt' % (name, ts), 'wt') as f:
                    f.write(data)
                print('[%s] %s: OK' % (ts, name))
        except:
            continue
    print('sleeping for 300 seconds')
    time.sleep(300)

Storing the raw data for now seemed like the best way to ensure that a bug in our analysis wouldn't force us to throw away everything we had accumulated.
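To later turn those snapshots into the windowed hit ratios suggested above, a rough post-processing sketch could look like this (a hypothetical helper, assuming the mcNNNN.<timestamp>.txt naming used by window.py):

#!/usr/bin/env python
# Sketch: compute windowed hit ratios from the snapshots saved by window.py.
import glob
import re

def read_stats(path):
    # Parse 'STAT <name> <value>' lines from a saved snapshot.
    stats = {}
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) == 3 and parts[0] == 'STAT':
                stats[parts[1]] = parts[2]
    return stats

def windowed_hit_ratios(server):
    # Order the server's snapshots by the timestamp embedded in the name.
    paths = sorted(glob.glob('%s.*.txt' % server),
                   key=lambda p: int(re.search(r'\.(\d+)\.txt$', p).group(1)))
    prev = None
    for path in paths:
        cur = read_stats(path)
        if prev is not None:
            hits = int(cur['get_hits']) - int(prev['get_hits'])
            misses = int(cur['get_misses']) - int(prev['get_misses'])
            if hits + misses > 0:
                yield path, float(hits) / (hits + misses)
        prev = cur

for path, ratio in windowed_hit_ratios('mc1009'):
    print('%s %.4f' % (path, ratio))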

ori added a comment. May 23 2016, 11:12 AM

> He suggests measuring the hit ratio continuously over the day with fixed time windows to catch differences, rather than calculating it since the last restart.

Actually, this is exactly what Ganglia does. See /usr/lib/ganglia/python_modules/gmond_memcached.py on one of the memcached hosts.

Change 290233 had a related patch set uploaded (by Elukey):
Add get_hits_ratio calculation to memcached's gmond agent.

https://gerrit.wikimedia.org/r/290233

Change 290233 merged by Ori.livneh:
Add get_hits_ratio calculation to memcached's gmond agent.

https://gerrit.wikimedia.org/r/290233

Change 290394 had a related patch set uploaded (by Elukey):
Fix memcached gmond module for Python syntax error.

https://gerrit.wikimedia.org/r/290394

Change 290394 merged by Elukey:
Fix memcached gmond module for Python syntax error.

https://gerrit.wikimedia.org/r/290394

ori added a comment. May 24 2016, 8:00 AM

This data is now in Graphite, too. For example: https://graphite.wikimedia.org/S/BW .

> This data is now in Graphite, too. For example: https://graphite.wikimedia.org/S/BW .

Ah, I didn't get it yesterday! I thought we had to create the Diamond module, but apparently it is already there. Nice! I'll try to build a Grafana dashboard!

elukey added a comment. Edited May 24 2016, 8:52 AM

Remaining issues:

  1. I can't find the new get_hits_ratio metric on Ganglia; not sure if we need to add more settings to enable it.
  2. mc1012 memcached metrics are not shown by Ganglia. I checked with TCP dumps and everything flows correctly from mc1012 to carbon, and the non-memcached metrics are fine. I didn't see any issue in syslog, and restarting gmond multiple times didn't help. Graphite metrics are good.

Mentioned in SAL [2016-05-25T09:42:14Z] <elukey> restarted gmetad on uranium to test if new memcached metrics would be picked up (T129963)

Mentioned in SAL [2016-05-27T08:00:32Z] <elukey> restarted memcached on mc1009 to collect metrics for T129963

Change 291916 had a related patch set uploaded (by Elukey):
Deploy memcached 1.4.25 to mc1010 as part of a performance experiment.

https://gerrit.wikimedia.org/r/291916

Mentioned in SAL [2016-05-31T16:52:52Z] <elukey> disabling puppet on mc10* hosts as prep step for https://gerrit.wikimedia.org/r/#/c/291916. Memcached 1.4.25 will be deployed to mc1010 as part of a perf. test (T129963)

Change 291916 merged by Elukey:
Deploy memcached 1.4.25 to mc1010 as part of a performance experiment.

https://gerrit.wikimedia.org/r/291916

Change 295702 had a related patch set uploaded (by Elukey):
Restore mc1007 memcached growth factor to 1.05 as the rest of the cluster.

https://gerrit.wikimedia.org/r/295702

I have been very slow to follow up on this task due to other priorities; I'll add a summary of all my findings very soon. gerrit/295702 targets another interesting piece of information that we don't yet have, namely whether the growth factor set to 1.15 could improve 1.4.21's performance (and this one is my bad).

Change 295702 merged by Elukey:
Restore mc1007 memcached growth factor to 1.05 as the rest of the cluster.

https://gerrit.wikimedia.org/r/295702

Mentioned in SAL [2016-06-24T07:10:21Z] <elukey> memcached on mc1007 restarted with growth factor 1.05 (T129963)

The growth factor didn't play any role in mc1007's hit ratio; I double-checked with the last few days of data.

elukey added a comment. Edited Jun 28 2016, 3:50 PM

EDIT: graphs available in https://grafana.wikimedia.org/dashboard/db/t129963

Time to make a summary, now that we have a lot of data. The main difference between 1.4.21 and 1.4.25 is the way the cache is built and maintained over time. Here are the relevant changes:

  • 1.4.23 release: the LRU is now broken down into HOT/WARM/COLD segments, and stored items move through the three stages (managed by a background thread, the LRU maintainer). Once they reach COLD, if not accessed, they are deleted by another background thread (the LRU crawler). Locks are no longer used for read operations.
  • 1.4.25 release: improved the slab automover feature, so freed memory can be reclaimed back into a global pool and reassigned to new slab classes.

An example of the first point is depicted by the following graphs:

Some interesting things to notice:

  • mc1009 was already running 1.4.25 and took ~11 days (27/05 -> 06/06) to reach full cache utilization.
  • mc1010 was running 1.4.21 until 31/05 and was then restarted with 1.4.25. It took ~7 days to reach full cache utilization. During that time evictions dropped to zero, and they resumed (at a lower rate) once memory was fully utilized.
  • Hosts running 1.4.25 (mc1009 and mc1010) show a much lower eviction rate.
  • mc1010's hit ratio didn't change much between 1.4.21 and 1.4.25.
  • mc1007 runs 1.4.21 with growth factor 1.15 (as opposed to the default 1.05). Its impressive hit ratio seems to be a shard coincidence; to verify, I restarted it on 24/06 with growth factor 1.05 (visible as the drop in Active memory).
  • mc1009/mc1010 (1.4.25) hit ratios are in line with the others (1.4.21).

Other interesting metrics about LRU Crawler/Maintainer only for mc1009/mc1010:

Expired-unfetched reclaims seem to be one of the reasons why evictions are lower with 1.4.25: the new threads actively remove garbage asynchronously. This might also explain why the caches take more time to reach full utilization compared to 1.4.21.

We mainly care about hit ratio improvements, so I'd say we didn't see a huge win with 1.4.25, even though on paper it seems superior. We might not want to migrate the whole cluster straight away, but instead keep mc1009/mc1010 as canaries and see how they behave over time compared to the 1.4.21-based hosts. We should also work on the client side, tracking down things like who caches elements that are never hit again (the expired-unfetched items, for example).

There are two new memcached releases that could help:

They introduce new awesome logging tools that will probably make life easier for us.

elukey added a comment. Jul 5 2016, 8:09 AM

1.4.28 is out and contains only bugfixes for the last logging features.

1.4.29 is out and includes a big change, namely the maximum item size is now configurable and not fixed to 1MB.

Now we have to decide how to proceed:

  1. close this task and wait for new Debian stable releases to upgrade memcached, using all the info collected in here as background.
  2. keep 1.4.21 as the official version, with some canaries running 1.4.28. This would give us only the new logging features (plus bug fixes), without big changes like the ones that landed in 1.4.29.

Change 313803 had a related patch set uploaded (by Elukey):
Upgrade memcached on mc2009 to 1.4.28

https://gerrit.wikimedia.org/r/313803

Change 313803 merged by Elukey:
Upgrade memcached on mc2009 to 1.4.28

https://gerrit.wikimedia.org/r/313803

Mentioned in SAL (#wikimedia-operations) [2016-10-04T10:23:16Z] <elukey> installed memcached 1.4.28-1.1+wmf1 on mc2009 as part of a performance test - T129963

Change 314260 had a related patch set uploaded (by Elukey):
Refactor memcached role to allow a more flexible hieradata config

https://gerrit.wikimedia.org/r/314260

Change 314260 merged by Elukey:
Refactor memcached role to allow a more flexible hieradata config

https://gerrit.wikimedia.org/r/314260

Mentioned in SAL (#wikimedia-releng) [2016-10-12T13:37:32Z] <elukey> upgraded memcached on deployment-memc04 to 1.4.28-1.1+wmf1 as part of a perf experiment (T129963) - rollback: wipe https://wikitech.wikimedia.org/wiki/Hiera:Deployment-prep/host/deployment-memc04, apt-get remove memcached on deployment-memc04, puppet run

Update after a long time:

  • We have tested version 1.4.25 (which introduced big changes, like a maximum of 64 slab classes) with several extended options and growth factor 1.15 (as opposed to the 1.05 currently used). No big benefit showed up in the primary metrics.
  • The awesome memcached dev team released a new advanced logging feature in 1.4.28: watch (fetchers|evictions|mutations), which lets us inspect fetch requests, evictions, and mutations in real time (see the sketch after this list).
  • deployment-memc04 in Deployment-Prep has been upgraded to 1.4.28.
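For the record, the new log stream can be sampled from a plain connection by sending the watch command; a quick sketch in the same telnetlib style as window.py above (run against localhost on a 1.4.28 host; the ten-second window is arbitrary):

# Sketch: briefly sample memcached 1.4.28's 'watch fetchers' log stream.
import contextlib
import telnetlib
import time

with contextlib.closing(telnetlib.Telnet('localhost', 11211)) as client:
    client.write('watch fetchers\n')  # server replies OK, then streams log lines
    deadline = time.time() + 10
    while time.time() < deadline:
        line = client.read_until('\n', timeout=1)
        if line:
            print(line.rstrip())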

Next steps:

  • testing 1.4.28 in deployment-prep and, if satisfied, upgrading the two prod hosts that are running 1.4.25 from the previous experiment.
  • deciding whether to upgrade all the mc* hosts to 1.4.28 or just keep two "logging" hosts for debugging purposes.
elukey changed the task status from Open to Stalled. Nov 25 2016, 10:10 AM

This task is currently blocked by T137345

elukey moved this task from Backlog to Ops Backlog on the User-Elukey board. Dec 14 2016, 11:08 AM
elukey moved this task from Ops Backlog to Stalled on the User-Elukey board. Dec 14 2016, 5:42 PM
elukey changed the task status from Stalled to Open. May 3 2017, 7:58 AM

Finally we were able to decommission the old mc1001->mc1018 hardware and replace it with mc1019-mc1036.

Some time has passed and I believe that this task should now be focused on testing the next memcached version that we'll probably use, namely the one in Debian Stretch: https://packages.debian.org/stretch/memcached (1.4.33).

If this is ok I'll try to check what changed and deploy 1.4.33 in deployment-prep and prod (single host).

elukey moved this task from Stalled to Ops Backlog on the User-Elukey board. Aug 11 2017, 9:12 AM
elukey moved this task from Ops Backlog to Backlog on the User-Elukey board. Feb 16 2018, 12:01 PM
elukey removed elukey as the assignee of this task. Mar 23 2018, 10:58 AM
elukey moved this task from Backlog to Keep an eye on it on the User-Elukey board. Mar 23 2018, 3:55 PM

Stretch now packages 1.4.33, while the last version tested in this task was 1.4.28. Release notes between the two:

So 1.4.33 seems to contain all the patches that "stabilize" 1.4.29, which was a big change. In theory this could be a good moment to review our shards and see:

  1. if we need to change our growth factor (likely yes: as stated during the past months, there is now a maximum of 64 slab sizes, and from previous tests our current value of 1.05 was too conservative).
  2. if we need to use the new -I option, which allows items bigger than 1MB to be stored (see the example after this list).
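Both knobs are plain startup flags, so a test invocation could look like the following sketch (the values are placeholders for the experiment, not recommendations):

memcached -m 16384 -f 1.25 -I 2m -o slab_reassign,slab_automove,lru_crawler,lru_maintainer

where -f sets the chunk growth factor and -I raises the maximum item size (here to 2MB).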

The main idea for the next months could be to test memcached 1.4.33 (packaged for jessie) in deployment-prep and on a couple of prod shards, come up with a good set of new parameters, and then upgrade the whole fleet when ready (with extreme care, since restarting memcached means wiping the MediaWiki cache).

Cc: @Krinkle @Imarlier to hear your thoughts and decide whether the above plan is good (it will surely require perf-team guidance/assistance/review :)

elukey closed this task as Resolved. Jan 7 2019, 3:39 PM
elukey claimed this task.

Going to close this task and open another one to track the upgrade to stretch or buster; this one is full of information and I wouldn't like to overload it further.