In T129963 we explored some newer versions of memcached to deploy for the MW object cache, but we never settled on upgrading all the shards. All the mcXXXX hosts are running Jessie, and sooner or later we'll have to think about either Stretch or Buster :)
The major complication is that Redis, which backs the MW session storage, is co-located on the same nodes, so upgrading the OS means upgrading both memcached and Redis at the same time (T265643).
~~While we could wait for the new Session Storage Service to go live (which should in theory get rid of Redis in favor of something else)~~, I would like to choose a new version of memcached and try it on a couple of production shards for a couple of months to study and tune settings, since from T129963 we know that a lot has changed. Some highlights:
* the maximum number of slab classes in a "recent" 1.4 or 1.5 version of memcached is 64, while we are currently using a lot more (160+) on each shard due to the growth factor that we use. In T129963 we tested increasing the growth factor to 1.15, and it seemed to work nicely (see the sketch after this list).
* the LRU logic has been completely reworked; more info at https://github.com/memcached/memcached/blob/master/doc/new_lru.txt and https://memcached.org/blog/modern-lru
* slab automover: freed memory can be reclaimed into a global pool and reassigned to new slab classes (currently, memory assigned to a slab class cannot be reclaimed for another use, even if free).
* newer features are ready to use and have already been tested by a lot of people.
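To make the slab-class point concrete, here is a back-of-the-envelope sketch (in Python, not memcached's exact sizing code) of how the `-f` growth factor drives the number of slab classes: chunk sizes grow geometrically from a minimum up to the 1 MB item limit, so the class count is roughly logarithmic in the factor. The 96-byte minimum chunk is an assumption for illustration, and real counts differ slightly because memcached aligns chunk sizes.

```python
import math

MIN_CHUNK = 96               # assumed smallest chunk size, in bytes
MAX_ITEM = 1024 * 1024       # default -I item size limit (1 MB)

def slab_class_count(growth_factor: float) -> int:
    """Approximate number of slab classes for a given -f growth factor."""
    return math.ceil(math.log(MAX_ITEM / MIN_CHUNK, growth_factor))

for f in (1.05, 1.15, 1.25):
    print(f"-f {f}: ~{slab_class_count(f)} slab classes")

# -f 1.05: ~191 classes  (roughly the 160+ territory mentioned above)
# -f 1.15: ~67 classes   (close to the 64-class ceiling)
# -f 1.25: ~42 classes   (memcached's default factor)
```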
##Upgrade Plan##
After we decide what to do with the Redis instances residing on our memcached cluster (T265643), reimage at least one shard to Buster (T252391), and resolve whatever minor or major issues arise, we will be ready to move on and upgrade our memcached clusters in **December 2020**. Because we rely on our memcached instances being hot 🔥, once the rolling upgrade process commences **we will be reimaging one memcached server per day**.
While a server is being reimaged:
**data issues**
* All its data (memcached + redis) will be lost
* memcached: mcrouter will fail over to the gutter pool to replace the missing shard
* redis: nutcracker will eject the server and spread its keys across the rest of the pool (see the hashing sketch below)
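For illustration, a minimal consistent-hash ring in Python (a sketch of the general ketama idea, not nutcracker's actual implementation) showing why ejecting a single shard only moves that shard's keys, while everything else keeps hashing to the same host:

```python
import bisect
import hashlib

class Ring:
    """Toy consistent-hash ring with virtual nodes."""

    def __init__(self, servers, vnodes=160):
        self.ring = sorted(
            (int(hashlib.md5(f"{s}-{i}".encode()).hexdigest(), 16), s)
            for s in servers for i in range(vnodes)
        )
        self.points = [p for p, _ in self.ring]

    def server_for(self, key):
        h = int(hashlib.md5(key.encode()).hexdigest(), 16)
        idx = bisect.bisect(self.points, h) % len(self.ring)
        return self.ring[idx][1]

servers = [f"mc10{n}" for n in range(19, 37)]   # mc1019..mc1036
before = Ring(servers)
after = Ring([s for s in servers if s != "mc1025"])  # eject one shard

keys = [f"key:{i}" for i in range(100_000)]
moved = sum(before.server_for(k) != after.server_for(k) for k in keys)
print(f"{moved / len(keys):.1%} of keys moved")  # ~1/18, i.e. ~5.6%
```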
**user-facing issues**
* some unsaved states
* some lost user actions
* some failed sessions
**Redis Lock Manager issues**
The [[ https://doc.wikimedia.org/mediawiki-core/master/php/classRedisLockManager.html | Redis Lock Manager ]] (defined in [[ https://github.com/wikimedia/operations-mediawiki-config/blob/master/wmf-config/ProductionServices.php | ProductionServices.php ]]) uses 3 Redis servers to **a)** help avoid uploading more than one file with the same name and **b)** dispatch changes to wikis from Wikidata. The latter is a maintenance script that runs 3 dispatchers, each of which "locks" a wiki while updating it so it won't be updated by another dispatcher. While we are reimaging one of these servers, we should coordinate with someone from WMDE to **reduce the number of dispatchers to 1, so locking will not be needed.**
If anyone has any idea how to deal with **a)**, please speak up!
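For context, a simplified quorum-lock sketch in Python using the redis-py client, in the spirit of RedisLockManager / Redlock rather than MediaWiki's actual code (the host names are hypothetical): a lock holder needs a majority of the 3 servers, so one server being down mid-reimage still lets the other two grant locks.

```python
import uuid
import redis

# Hypothetical hosts standing in for the 3 LockManager Redis servers.
SERVERS = [redis.Redis(host=h) for h in ("rdb1", "rdb2", "rdb3")]

def acquire(key: str, ttl_ms: int = 10_000):
    token = uuid.uuid4().hex
    granted = 0
    for r in SERVERS:
        try:
            # SET key token NX PX ttl: grab the lock only if nobody holds it
            if r.set(key, token, nx=True, px=ttl_ms):
                granted += 1
        except redis.RedisError:
            pass  # a down server (e.g. mid-reimage) simply doesn't vote
    if granted >= len(SERVERS) // 2 + 1:  # majority of 3 = 2
        return token
    release(key, token)  # no quorum; undo any partial grants
    return None

def release(key: str, token: str):
    for r in SERVERS:
        try:
            # delete only if we still own the lock on this server
            if r.get(key) == token.encode():
                r.delete(key)
        except redis.RedisError:
            pass
```

Note the check-and-delete in `release()` is not atomic; real implementations do it with a Lua script, which the sketch skips for brevity.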
**Server list**
Reimaging a server can take up to two hours, but we will know exactly how long after we do the first one.
**eqiad**
[] mc1019 -> LockManager Redis
[] mc1020
[] mc1021
[] mc1022
[] mc1023
[] mc1024 -> LockManager Redis
[] mc1025
[] mc1026
[] mc1027
[] mc1028
[] mc1029
[] mc1030
[] mc1031
[] mc1032
[] mc1033 -> LockManager Redis
[] mc1034
[] mc1035
[] mc1036
**codfw**
[] mc2019 -> LockManager Redis
[] mc2020 -> LockManager Redis
[] mc2021 -> LockManager Redis
[] mc2022
[] mc2023
[] mc2024
[] mc2025
[] mc2026
[] mc2027
[] mc2029
[] mc2030
[] mc2031
[] mc2032
[] mc2033
[] mc2034
[] mc2035
[] mc2036
[] mc2037