
Enable extstore to a subset of memcached servers (experiment)
Stalled, Medium, Public

Assigned To: jijiki
Authored By: jijiki, Dec 6 2023, 4:40 PM

Description

What

Extstore is a memcached feature that extends memcached's memory space. Instead of completely evicting a key (to make room for new ones), memcached keeps the hash table and keys in memory, but moves values to external storage (disk, flash, whatever).
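As a sketch of what enabling this looks like (the path, cache size, and item-size threshold below are illustrative, not the production values), extstore is configured via `-o` options at memcached startup:

```shell
# Back memcached with a 64G file on local flash. Items larger than
# ext_item_size bytes become eligible to have their values flushed
# to extstore instead of being evicted.
memcached -m 4096 \
  -o ext_path=/srv/extstore/data:64G,ext_item_size=512
```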

Details can be found here: https://github.com/memcached/memcached/wiki/Extstore

Why?

We have slabs with a higher eviction rate than others, and it is currently unknown how those evictions impact production (though we could investigate this in a separate task).
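Per-slab eviction counters can be read straight off a host with the `stats items` protocol command; a minimal sketch (example hostname, and `nc` flags vary by distribution):

```shell
# "items:<slab>:evicted" counts items evicted before their expiry,
# broken down per slab class.
echo "stats items" | nc -q 1 mc1045.eqiad.wmnet 11211 | grep evicted
```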

Enabling extstore in production to determine whether it has a positive impact is a low-effort project.

How?

  • puppet changes
  • hiera keys to enable/disable the feature
  • update prometheus-memcached-exporter to v0.14.2 T350807
  • update the Grafana dashboards (memcached-Slabs and memcache) to include extstore metrics
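The hiera side could look roughly like the following sketch (key names are hypothetical; the real ones live in operations/puppet):

```yaml
# Hypothetical hiera keys for toggling the feature per host/role:
profile::memcached::extstore: true
profile::memcached::extstore_path: /srv/extstore/data
profile::memcached::extstore_size: 64G
```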

eqiad

  • mc1045.eqiad.wmnet
  • mc1046.eqiad.wmnet
  • mc1047.eqiad.wmnet
  • mc1048.eqiad.wmnet
  • mc1049.eqiad.wmnet
  • mc1050.eqiad.wmnet

codfw

  • mc2045.codfw.wmnet
  • mc2046.codfw.wmnet
  • mc2047.codfw.wmnet
  • mc2048.codfw.wmnet
  • mc2049.codfw.wmnet
  • mc2050.codfw.wmnet

Event Timeline

jijiki triaged this task as Medium priority.

Change #1035633 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/puppet@production] memcached: add extstore option

https://gerrit.wikimedia.org/r/1035633

jijiki changed the task status from Open to In Progress. May 24 2024, 11:49 AM
jijiki claimed this task.
jijiki changed the status of subtask T273950: Modernise memcached systemd unit / sync, and make it presentable from Open to In Progress.
jijiki updated the task description. (Show Details)

Change #1036281 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/puppet@production] hieradata: enable extstore on mc1049 and mc2049

https://gerrit.wikimedia.org/r/1036281

Change #1035633 merged by Effie Mouzeli:

[operations/puppet@production] memcached: add extstore option

https://gerrit.wikimedia.org/r/1035633

Change #1036281 merged by Effie Mouzeli:

[operations/puppet@production] hieradata: enable extstore on mc1049 and mc2049

https://gerrit.wikimedia.org/r/1036281

Change #1037053 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/puppet@production] memcached: test extstore on 10 servers

https://gerrit.wikimedia.org/r/1037053

Change #1037053 merged by Effie Mouzeli:

[operations/puppet@production] memcached: test extstore on 10 servers

https://gerrit.wikimedia.org/r/1037053

Change #1038262 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/puppet@production] memcached: enable extstore on mc1050/mc2050

https://gerrit.wikimedia.org/r/1038262

jijiki added a subscriber: aaron.

Change #1038262 merged by Effie Mouzeli:

[operations/puppet@production] memcached: enable extstore on mc1050/mc2050

https://gerrit.wikimedia.org/r/1038262

We have enabled extstore on 6/18 servers per memcached cluster. A few observations:

  • System: apart from the expected disk ops and the increase in cached memory, nothing stands out.
  • We no longer see items being evicted before expiring; this is expected, since they are moved to extstore instead.
  • The hit rate is almost the same.

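The hit-rate comparison above comes down to `get_hits / (get_hits + get_misses)` from memcached's `stats` output; a small self-contained sketch (the sample numbers are made up, not production data):

```python
# Minimal sketch: compute the hit rate and read eviction counters
# from raw memcached "stats" output.
def parse_stats(raw: str) -> dict:
    """Parse 'STAT <name> <value>' lines into a dict of strings."""
    stats = {}
    for line in raw.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[0] == "STAT":
            stats[parts[1]] = parts[2]
    return stats

def hit_rate(stats: dict) -> float:
    hits = int(stats["get_hits"])
    misses = int(stats["get_misses"])
    total = hits + misses
    return hits / total if total else 0.0

# Illustrative sample, as returned by "stats" on the text protocol:
sample = """\
STAT get_hits 900
STAT get_misses 100
STAT evictions 0
END"""

s = parse_stats(sample)
print(f"hit rate: {hit_rate(s):.2%}")  # -> hit rate: 90.00%
print(f"evictions: {s['evictions']}")  # -> evictions: 0
```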
I have noticed differences in the patterns of some slabs. The comparison was done on mc1050, which I switched to extstore a few days after it started running on bookworm.

  • The 1d offset is from before enabling extstore on mc1050.
  • Objects over 176B are eligible to be moved to extstore.

Slab 4 (max 152B, yellow offset 1d)
current items and cas hits

image.png (490×1 px, 64 KB)

image.png (470×1 px, 157 KB)

Slab 10 (max 384B, yellow offset 1d)
memory requested

image.png (470×1 px, 68 KB)

Slab 30 (max 7k, yellow offset 1d)
memory requested

image.png (457×1 px, 76 KB)

I am not sure yet how this affects performance overall. cc @Krinkle or whoever would like to dig further with me. We could consider rolling this change out to one of our 2 DCs, and digging through how other metrics are affected (or not), e.g. mw-* latency, DB traffic, etc.

We discussed this in the MwEng-SvcOps meeting (13 June 2024). Extstore was enabled on a few hosts in the DC. Some stats differed, but we weren't able to come up with a testing strategy to prove or disprove an observable benefit from the MW application in this state, since keys are sharded across all hosts and traffic generally involves many different keys.

We noticed that on the given hosts, contrary to my own expectations, there is a continuous non-zero trickle of evictions of tiny keys (e.g. < 200 bytes). This is surprising, because during research for T278392 and T336004, @tstarling and I found that these slabs were not under pressure, and empirical testing showed that these values reliably persisted for at least a minute in practice. This is worrying, because we've since migrated to mcrouter-primary-dc for short-lived auth/nonce tokens, Rdbms-ChronologyProtector positions, and rate limiter counters.

The good news, and the reason we noticed, is that with extstore enabled, much more space is given to the tiny-value "slab 4", and there has been a flat line of zero undue evictions since.

Aside from this little happy accident, we found no per-host stats that are reason for concern. I agreed with Effie that rolling it out fully to the primary DC, with the secondary as an A/B-esque control, would be a fine next step. We expect the worst outcome to be "no effect", and hope that there is enough similarity between the DCs' traffic, yet enough separation in our data collection, to notice an improvement here (e.g. via MW statsd->prometheus stats that we have per-DC, using WANCache as a way to measure cache hit ratio on meaningful keys, as well as Apache-level latency numbers such as the Appserver RED dashboards).
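For the per-DC comparison, one way to watch the get hit ratio side by side is a PromQL query against prometheus-memcached-exporter's counters; a sketch (the `site` label is an assumption about how the memcached targets are labelled here):

```
# Per-DC memcached get hit ratio over a 5m window:
sum by (site) (rate(memcached_commands_total{command="get", status="hit"}[5m]))
  /
sum by (site) (rate(memcached_commands_total{command="get"}[5m]))
```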

Change #1052739 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/puppet@production] memcached: enable extstore to eqiad only

https://gerrit.wikimedia.org/r/1052739

Mentioned in SAL (#wikimedia-operations) [2024-07-10T14:15:49Z] <effie> disable puppet on mw memcached hosts - T352885

Change #1052739 merged by Effie Mouzeli:

[operations/puppet@production] memcached: enable extstore to eqiad only

https://gerrit.wikimedia.org/r/1052739

@Krinkle Extstore is fully enabled on eqiad

jijiki changed the task status from In Progress to Stalled. Aug 2 2024, 10:57 AM

Based on findings in T370185, we are unsure if we saw any overall benefit here. I will mark this as stalled until I have some time to dig deeper.

Change #1162904 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/puppet@production] memcached: enable extstore on codfw

https://gerrit.wikimedia.org/r/1162904

Change #1162904 merged by Effie Mouzeli:

[operations/puppet@production] memcached: enable extstore on codfw

https://gerrit.wikimedia.org/r/1162904