In T129963 we explored some newer versions of memcached to deploy for the MW object cache, but we never settled on upgrading all the shards. All the mcXXXX hosts are running Jessie, and sooner or later we'll have to think about either Stretch or Buster :)
The major complication is that Redis, which backs the MW session storage, is co-located on the same nodes, so upgrading the OS means upgrading both Memcached and Redis at the same time. T265643
~~While we could wait for the new Session Storage Service to go live (which should, in theory, get rid of Redis in favor of something else)~~, I would like to choose a new version of memcached and try it on a couple of production shards for a couple of months to study and tune settings, since from T129963 we know that a lot has changed. Some highlights:
* the maximum number of slab classes for a "recent" 1.4 or 1.5 version of memcached is 64, whereas we are currently using a lot more (160+) on each shard due to the growth factor that we use. In T129963 we tested increasing the growth factor to 1.15, and it seemed to work nicely.
* the LRU logic has been completely changed, more info in https://github.com/memcached/memcached/blob/master/doc/new_lru.txt and https://memcached.org/blog/modern-lru
* slab automover - freed memory can be reclaimed back into a global pool and reassigned to new slab classes (currently memory assigned to a slab class cannot be reclaimed for another use, even when free).
* features that were new at the time of T129963 have since matured and are already in use and tested by a lot of people.
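As a back-of-the-envelope illustration of why the growth factor matters for the 64-class limit, the sketch below estimates how many slab classes a given factor produces. The starting chunk size is an assumption and real memcached aligns chunk sizes (and derives the base size from `-n`), so actual counts differ slightly:

```python
# Rough estimate of the number of slab classes memcached creates for a
# given chunk growth factor (-f). The starting chunk size is an
# assumption: real memcached derives it from -n and rounds sizes,
# so actual counts differ slightly.
MIN_CHUNK = 96          # assumed smallest chunk size, in bytes
MAX_ITEM = 1024 * 1024  # default maximum item size (1 MB)

def slab_class_count(factor, min_chunk=MIN_CHUNK, max_item=MAX_ITEM):
    count = 0
    size = float(min_chunk)
    while size < max_item:
        size *= factor   # each class holds chunks `factor` times larger
        count += 1
    return count

for f in (1.05, 1.15, 1.25):
    print(f"growth factor {f}: ~{slab_class_count(f)} slab classes")
```

A small factor such as 1.05 yields far more than 64 classes, while memcached's default of 1.25 stays comfortably under the limit; this is why raising the factor (as tested in T129963) goes hand in hand with the upgrade.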
## Upgrade Plan ##
After we decide what to do with the redis instances residing in our memcached cluster T265643, reimage at least one shard to buster T252391, and resolve whatever minor or major issues arise, we will be ready to move on and upgrade our memcached clusters in **December 2020**. Because we rely on our memcached instances being hot 🔥, after the rolling upgrade process commences, **we will be reimaging one memcached server per day**.
While a server is being reimaged:
**data issues**
* All its data (memcached + redis) will be lost
* memcached: mcrouter will failover to the gutter pool to replace the missing shard
* redis: nutcracker will eject the server and spread its keys across the rest of the pool
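To illustrate the redistribution above, here is a toy ketama-style consistent-hashing sketch (not nutcracker's actual implementation; the shard names and key count are made up): when one server is ejected, only the keys it owned move, and they land spread across the surviving servers rather than on a single replacement:

```python
import hashlib
from bisect import bisect
from collections import Counter

def build_ring(servers, points=160):
    """Toy ketama-style ring: each server owns many points, so an
    ejected server's keys scatter across all survivors."""
    return sorted(
        (int(hashlib.md5(f"{s}-{i}".encode()).hexdigest(), 16), s)
        for s in servers for i in range(points)
    )

def lookup(ring, key):
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return ring[bisect(ring, (h,)) % len(ring)][1]

servers = [f"mc10{n}" for n in range(19, 25)]   # hypothetical pool
keys = [f"key:{i}" for i in range(2000)]

full = build_ring(servers)
degraded = build_ring([s for s in servers if s != "mc1020"])  # eject one

before = {k: lookup(full, k) for k in keys}
after = {k: lookup(degraded, k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]

# Every moved key previously lived on the ejected server, and the moved
# keys spread over several surviving servers.
print(len(moved), Counter(after[k] for k in moved))
```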
**user facing issues**
* some unsaved states
* some lost user actions
* some failed sessions
**Redis Lock Manager issues**
The [[ https://doc.wikimedia.org/mediawiki-core/master/php/classRedisLockManager.html | Redis Lock Manager ]] (defined in [[ https://github.com/wikimedia/operations-mediawiki-config/blob/master/wmf-config/ProductionServices.php | ProductionServices.php ]]) uses 3 Redis servers for two purposes: **a)** preventing more than one file with the same name from being uploaded at once, which requires at least 2 of the 3 Redis servers to be online; **b)** dispatching changes from Wikidata to wikis: a maintenance script runs 3 dispatchers, each of which "locks" a wiki while updating it so it won't be updated by another dispatcher at the same time. So after we have reimaged 15/18 memcached hosts in the active datacenter we should:
* Wikidata dispatch: reduce the number of dispatchers to 1, so Redis locking will not be needed at all
* file upload: choose a different set of 3 memcached hosts and gradually replace the current ones in `wmf-config/ProductionServices.php`
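For reference, the 2/3 requirement in **a)** is a plain majority quorum. A minimal sketch of that logic (the `try_lock` callable is hypothetical; this is not MediaWiki's actual RedisLockManager code):

```python
def acquire_lock(servers, try_lock):
    """Grant the lock only if a majority of the Redis lock servers
    (2 of 3 in our setup) accept it. try_lock(server) is a hypothetical
    callable returning True when that server grants the lock."""
    grants = [s for s in servers if try_lock(s)]
    return len(grants) * 2 > len(servers)

# With one of the three lock servers down, locking still works:
status = {"mc1022": True, "mc1031": True, "mc1034": False}
print(acquire_lock(list(status), status.get))   # True: 2/3 granted
# With two servers down, uploads relying on the lock start failing:
status = {"mc1022": True, "mc1031": False, "mc1034": False}
print(acquire_lock(list(status), status.get))   # False: only 1/3 granted
```

This is why the lock servers must be swapped out gradually once most of the cluster has been reimaged, rather than losing two of the three at once.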
**Server list**
Reimaging a server can take up to two hours, but we will know exactly how long after we do the first one.
**eqiad**
[x] mc1019 A6
[ ] mc1020 A6
[ ] mc1021 A6
[x] mc1022 -> LockManager Redis A6
[ ] mc1023 A6
[ ] mc1024 B6
[ ] mc1025 B6
[ ] mc1026 B6
[ ] mc1027 B6
[ ] mc1028 (its pair is **mc2037**) C4
[ ] mc1029 C4
[ ] mc1030 C4
[x] mc1031 -> LockManager Redis C4
[x] mc1032 C4
[x] mc1033 D4
[x] mc1034 -> LockManager Redis D4
[x] mc1035 D4
[x] mc1036 D4
**codfw**
[x] mc2019 A1
[ ] mc2020 A5
[ ] mc2021 A8
[x] mc2022 A8
[ ] mc2023 B1
[ ] mc2024 B5
[ ] mc2025 B8
[ ] mc2026 B8
[ ] mc2027 C1
[ ] mc2029 C3
[ ] mc2030 C5
[x] mc2031 -> LockManager Redis C5
[x] mc2032 D1
[x] mc2033 D4
[x] mc2034 -> LockManager Redis D4
[x] mc2035 D5
[x] mc2036 D8
[ ] mc2037 C2