In T129963 we explored some newer versions of memcached to deploy for the MW object cache, but we never decided to upgrade all the shards. All the mcXXXX hosts are running Jessie, and sooner or later we'll have to think about either Stretch or Buster :)
The major complication is that Redis (backing the MW Session Storage) is co-located on the same nodes, so upgrading the OS means upgrading both Memcached and Redis at the same time. T265643
While we could wait for the new Session Storage Service to go live (which should, in theory, get rid of Redis in favor of something else), I would like to choose a new version of memcached and run it on a couple of production shards for a couple of months to study and tune settings, since from T129963 we know that a lot has changed. Some highlights:
- the maximum number of slab classes for a "recent" 1.4 or 1.5 version of memcached is 64, while we are currently using a lot more (160+) on each shard due to the growth factor that we use. In T129963 we tested increasing the growth factor to 1.15, and it seemed to work nicely.
- the LRU logic has been completely changed, more info in https://github.com/memcached/memcached/blob/master/doc/new_lru.txt and https://memcached.org/blog/modern-lru
- SLAB automover: freed memory can be reclaimed back into a global pool and reassigned to other slab classes (currently memory assigned to a slab class cannot be reclaimed for another use, even if free).
- features that were brand new when we last tested are now mature and already battle-tested by a lot of other users.
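To illustrate why the growth factor drives the slab class count, here is a minimal sketch. The 48-byte minimum chunk, 1 MiB item cap and 8-byte alignment are assumptions modelled on stock memcached defaults, not our production config:

```python
# Minimal sketch of how memcached derives slab classes from the growth
# factor (-f). Assumptions: 48-byte minimum chunk, 1 MiB max item size,
# chunk sizes rounded up to 8-byte alignment.
def slab_class_count(growth_factor, min_chunk=48, max_item=1024 * 1024):
    size = min_chunk
    classes = 0
    while size < max_item:
        classes += 1
        # next class: previous chunk size scaled by the growth factor,
        # rounded up to 8-byte alignment
        size = (int(size * growth_factor) + 7) & ~7
    return classes

# A small growth factor packs items tightly but needs many more classes;
# raising it toward 1.15 brings the count down sharply.
print(slab_class_count(1.05), slab_class_count(1.15), slab_class_count(1.25))
```

This is why a shard can end up with 160+ classes under a small factor, while a larger factor fits comfortably under the 64-class cap of newer memcached.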
After we decide what to do with the redis instances residing on our memcached cluster T265643, reimage at least one shard to Buster T252391, and resolve whatever minor or major issues arise, we will be ready to move on and upgrade our memcached clusters in December 2020. Because we rely on our memcached instances being hot 🔥, once the rolling upgrade commences we will reimage one memcached server per day.
While a server is being reimaged:
- All its data (memcached + redis) will be lost
- memcached: mcrouter will failover to the gutter pool to replace the missing shard
- redis: nutcracker will eject the server and spread its keys across the remaining servers in the pool
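Nutcracker distributes keys with consistent hashing, so ejecting one server only moves the keys that mapped to it; everything else stays put. A minimal sketch of the idea (the hash ring below is illustrative and not nutcracker's exact ketama implementation; the host names are just examples from our pool):

```python
import hashlib
from bisect import bisect

def _hash(s):
    # 32-bit hash derived from md5; illustrative, not nutcracker's exact hash
    return int(hashlib.md5(s.encode()).hexdigest()[:8], 16)

class Ring:
    """Toy consistent-hash ring with virtual nodes per server."""
    def __init__(self, servers, vnodes=100):
        self.points = sorted(
            (_hash(f"{s}-{i}"), s) for s in servers for i in range(vnodes)
        )
        self.hashes = [h for h, _ in self.points]

    def server_for(self, key):
        # walk clockwise to the first point at or after the key's hash
        idx = bisect(self.hashes, _hash(key)) % len(self.points)
        return self.points[idx][1]

servers = ["mc1019", "mc1020", "mc1021", "mc1022"]
before = Ring(servers)
after = Ring([s for s in servers if s != "mc1020"])  # mc1020 ejected

keys = [f"session:key{i}" for i in range(1000)]
moved = sum(before.server_for(k) != after.server_for(k) for k in keys)
# Only keys previously owned by mc1020 change owner; the rest are untouched.
print(f"{moved} of {len(keys)} keys moved")
```

The key property for us: during a reimage, only roughly 1/Nth of the redis keyspace is redistributed, not the whole pool.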
User-facing issues:
- some unsaved states
- some lost user actions
- some failed sessions
Redis Lock Manager issues
The Redis Lock Manager (defined in ProductionServices.php) uses 3 redis servers for two purposes: a) preventing more than one file with the same name from being uploaded at once, which needs at least 2 of the 3 redis servers to be online; b) dispatching changes to wikis from Wikidata, via a maintenance script that runs 3 dispatchers, each of which "locks" a wiki while updating it so it won't be updated by another dispatcher at the same time. So after we have reimaged 15 of the 18 memcached hosts in the active datacenter we should:
- Wikidata dispatch: reduce the number of dispatchers to 1, so redis locking will not be needed at all
- file upload: choose a different set of 3 memcached servers and gradually replace the current ones in wmf-config/ProductionServices.php
Reimaging a server can take up to two hours, but we will know exactly how much after we do the first one.
- mc1019 -> LockManager Redis
- mc1024 -> LockManager Redis
- mc1033 -> LockManager Redis
- mc2019 -> LockManager Redis
- mc2020 -> LockManager Redis
- mc2021 -> LockManager Redis