Upgrade memcached cluster to Debian Buster
Closed, Resolved, Public

Description

In T129963 we explored some newer versions of memcached to deploy for the MW object cache, but we never decided to upgrade all the shards. All the mcXXXX hosts are running Jessie, and sooner or later we'll have to think about either Stretch or Buster :)

The major complication is that Redis and the MW Session Storage are co-located on the same nodes, so upgrading the OS means upgrading both Memcached and Redis at the same time (T265643).

While we could wait for the new Session Storage Service to be alive (which should in theory get rid of Redis in favor of something else), I would like to choose a new version of memcached and try it on a couple of production shards for a couple of months to study and tune settings, since we know from T129963 that a lot has changed. Some highlights:

  • the maximum number of slab classes for a "recent" 1.4 or 1.5 version of memcached is 64, while we are currently using a lot more (160+) on each shard due to the growth factor that we use. In T129963 we tested increasing the growth factor to 1.15, and it seemed to work nicely.
  • the LRU logic has been completely changed, more info in https://github.com/memcached/memcached/blob/master/doc/new_lru.txt and https://memcached.org/blog/modern-lru
  • SLAB automover - freed memory can be reclaimed back into a global pool and reassigned to new slab classes (currently memory assigned to a slab class cannot be reclaimed, even if free, for another use).
  • the new features are ready to use and have already been tested by a lot of people.
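
A rough sketch of why the growth factor drives the number of slab classes. It loosely follows the way memcached derives chunk sizes (start from a small chunk, multiply by the factor, round up to 8 bytes, stop near the maximum item size). The starting size of 96 bytes, the 1 MiB limit, and the function name are assumptions for illustration, not values taken from our configuration, so the counts are only indicative.

```python
def slab_class_sizes(growth_factor, first_chunk=96, item_max=1024 * 1024, align=8):
    """Approximate the chunk size of each slab class for a given growth factor.

    first_chunk, item_max and align are illustrative assumptions.
    """
    sizes = []
    size = first_chunk
    while size <= item_max / growth_factor:
        # Round the chunk size up to the alignment boundary.
        if size % align:
            size += align - (size % align)
        sizes.append(size)
        size = int(size * growth_factor)
    # The last class holds items up to the maximum item size.
    sizes.append(item_max)
    return sizes


if __name__ == "__main__":
    for factor in (1.05, 1.15, 1.25):
        print(f"growth factor {factor}: {len(slab_class_sizes(factor))} slab classes")
```

A smaller growth factor produces many closely spaced classes, while a larger one produces fewer, coarser classes, which is the trade-off behind moving to 1.15.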

Upgrade Plan

After we decide what to do with the redis instances residing in our memcached cluster (T265643), reimage at least one shard to buster (T252391), and resolve whatever minor or major issues arise, we will be ready to move on and upgrade our memcached clusters in December 2020. Because we rely on our memcached instances being hot🔥, once the rolling upgrade starts we will reimage one memcached server per day.
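
As an illustration of the one-server-per-day pace, here is a minimal sketch that assigns each host a reimage date. The start date, the weekend skipping, and the function name are assumptions made for the example and are not part of the plan itself.

```python
from datetime import date, timedelta


def rolling_schedule(hosts, start, skip_weekends=True):
    """Assign one host per day, optionally skipping Saturdays and Sundays."""
    schedule = []
    day = start
    for host in hosts:
        # weekday() returns 5 for Saturday and 6 for Sunday.
        while skip_weekends and day.weekday() >= 5:
            day += timedelta(days=1)
        schedule.append((day, host))
        day += timedelta(days=1)
    return schedule


if __name__ == "__main__":
    # Hypothetical start date, chosen only for the example.
    for day, host in rolling_schedule(["mc1019", "mc1020", "mc1021"], date(2020, 12, 10)):
        print(day.isoformat(), host)
```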

While a server is being reimaged:

data issues

  • All its data (memcached + redis) will be lost
  • memcached: mcrouter will failover to the gutter pool to replace the missing shard
  • redis: nutcracker will eject the server and spread its keys across the pool

user facing issues

  • some unsaved states
  • some lost user actions
  • some failed sessions

Redis Lock Manager issues

The Redis Lock Manager (defined in ProductionServices.php) uses 3 redis servers for two things: a) avoiding uploads of more than one file with the same name, which needs at least 2 of the 3 redis servers to be online; b) dispatching changes from Wikidata to wikis: a maintenance script runs 3 dispatchers, each of which "locks" a wiki while updating it so that another dispatcher won't update it at the same time. So after we have reimaged 15/18 memcached hosts in the active datacenter we should:

  • Wikidata dispatch: reduce the number of dispatchers to 1 dispatcher, so redis locking will not be needed at all
  • file upload: Choose a different set of 3 memcached servers and gradually replace them in wmf-config/ProductionServices.php
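
The 2-out-of-3 requirement can be checked before taking a host down. The sketch below is only an illustration: the host set is the eqiad hosts marked "LockManager Redis" in the server list, and the helper names are hypothetical.

```python
# eqiad hosts marked "LockManager Redis" in the server list.
LOCK_HOSTS = {"mc1022", "mc1031", "mc1034"}


def has_quorum(lock_hosts, offline):
    """Return True if a strict majority of the lock hosts is still online."""
    online = len(set(lock_hosts) - set(offline))
    return online >= len(lock_hosts) // 2 + 1


def safe_to_reimage(host, lock_hosts, already_offline=()):
    """Return True if taking `host` down keeps the lock manager quorum."""
    return has_quorum(lock_hosts, set(already_offline) | {host})


if __name__ == "__main__":
    print(safe_to_reimage("mc1022", LOCK_HOSTS))
    print(safe_to_reimage("mc1031", LOCK_HOSTS, already_offline={"mc1022"}))
```

With three lock hosts, one of them can be reimaged at a time, but never two at once, which is why the lock hosts are left for the end and replaced gradually.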

Server list
Reimaging a server can take up to two hours, but we will know exactly how long after we do the first one.

eqiad

  • mc1019 A6
  • mc1020 A6
  • mc1021 A6
  • mc1022 -> LockManager Redis A6
  • mc1023 A6
  • mc1024 B6
  • mc1025 B6
  • mc1026 B6
  • mc1027 B6
  • mc1028 (its pair is mc2037) C4
  • mc1029 C4
  • mc1030 C4
  • mc1031 -> LockManager Redis C4
  • mc1032 C4
  • mc1033 D4
  • mc1034 -> LockManager Redis D4
  • mc1035 D4
  • mc1036 D4

codfw

  • mc2019 A1
  • mc2020 A5
  • mc2021 A8
  • mc2022 -> LockManager A8
  • mc2023 B1
  • mc2024 B5
  • mc2025 B8
  • mc2026 B8
  • mc2027 C1
  • mc2029 C3
  • mc2030 C5
  • mc2031 -> LockManager Redis C5
  • mc2032 D1
  • mc2033 D4
  • mc2034 -> LockManager Redis D4
  • mc2035 D5
  • mc2036 D8
  • mc2037 C2

Details

Project            Branch      Lines +/-
operations/puppet  production  +7 -329
operations/puppet  production  +18 -0
operations/puppet  production  +18 -0
operations/puppet  production  +18 -0
operations/puppet  production  +18 -0
operations/puppet  production  +18 -0
operations/puppet  production  +18 -7
operations/puppet  production  +18 -0
operations/puppet  production  +18 -0
operations/puppet  production  +18 -0
operations/puppet  production  +18 -0
operations/puppet  production  +18 -0
operations/puppet  production  +18 -0
operations/puppet  production  +18 -0
operations/puppet  production  +18 -0
operations/puppet  production  +16 -0
operations/puppet  production  +0 -28
operations/puppet  production  +16 -2
operations/puppet  production  +2 -0
operations/puppet  production  +168 -39
operations/puppet  production  +16 -8
operations/puppet  production  +0 -8
operations/puppet  production  +58 -8
operations/puppet  production  +19 -1
operations/puppet  production  +3 -0

Event Timeline

There are a very large number of changes, so older changes are hidden.

Completed auto-reimage of hosts:

['mc1031.eqiad.wmnet']

and were ALL successful.

Change 649608 had a related patch set uploaded (by Effie Mouzeli; owner: Effie Mouzeli):
[operations/puppet@production] hiera: upgrade mc1033, mc2033 to buster

https://gerrit.wikimedia.org/r/649608

Change 649608 merged by Effie Mouzeli:
[operations/puppet@production] hiera: upgrade mc1033, mc2033 to buster

https://gerrit.wikimedia.org/r/649608

Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts:

mc1033.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202012151123_jiji_2053_mc1033_eqiad_wmnet.log.

Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts:

mc2033.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202012151123_jiji_2080_mc2033_codfw_wmnet.log.

Completed auto-reimage of hosts:

['mc1033.eqiad.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['mc2033.codfw.wmnet']

and were ALL successful.

Change 649844 had a related patch set uploaded (by Effie Mouzeli; owner: Effie Mouzeli):
[operations/puppet@production] hiera: reimage mc1022,mc2022 to buster

https://gerrit.wikimedia.org/r/649844

Change 649844 merged by Effie Mouzeli:
[operations/puppet@production] hiera: reimage mc1022,mc2022 to buster

https://gerrit.wikimedia.org/r/649844

Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts:

mc1022.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202012161058_jiji_29479_mc1022_eqiad_wmnet.log.

Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts:

mc2022.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202012161059_jiji_29594_mc2022_codfw_wmnet.log.

Completed auto-reimage of hosts:

['mc2022.codfw.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['mc1022.eqiad.wmnet']

and were ALL successful.

Change 649904 had a related patch set uploaded (by Effie Mouzeli; owner: Effie Mouzeli):
[operations/puppet@production] hiera: upgrade mc1019, mc2019 to buster

https://gerrit.wikimedia.org/r/649904

Change 649904 merged by Effie Mouzeli:
[operations/puppet@production] hiera: upgrade mc1019, mc2019 to buster

https://gerrit.wikimedia.org/r/649904

Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts:

mc1019.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202012161622_jiji_29246_mc1019_eqiad_wmnet.log.

Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts:

mc2019.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202012161622_jiji_29305_mc2019_codfw_wmnet.log.

Completed auto-reimage of hosts:

['mc1019.eqiad.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['mc2019.codfw.wmnet']

and were ALL successful.

Change 650078 had a related patch set uploaded (by Effie Mouzeli; owner: Effie Mouzeli):
[operations/puppet@production] hiera: upgrade mc1020, mc2020 to buster

https://gerrit.wikimedia.org/r/650078

Change 650078 merged by Effie Mouzeli:
[operations/puppet@production] hiera: upgrade mc1020, mc2020 to buster

https://gerrit.wikimedia.org/r/650078

Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts:

mc2020.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202012171621_jiji_20034_mc2020_codfw_wmnet.log.

Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts:

mc1020.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202012171621_jiji_20012_mc1020_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['mc1020.eqiad.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['mc2020.codfw.wmnet']

and were ALL successful.

Change 650491 had a related patch set uploaded (by Effie Mouzeli; owner: Effie Mouzeli):
[operations/puppet@production] hiera: upgrade mc1021, mc2021 to buster

https://gerrit.wikimedia.org/r/650491

Change 650491 merged by Effie Mouzeli:
[operations/puppet@production] hiera: upgrade mc1021, mc2021 to buster

https://gerrit.wikimedia.org/r/650491

Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts:

mc1021.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202012181315_jiji_25083_mc1021_eqiad_wmnet.log.

Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts:

mc2021.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202012181315_jiji_25182_mc2021_codfw_wmnet.log.

Completed auto-reimage of hosts:

['mc1021.eqiad.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['mc2021.codfw.wmnet']

and were ALL successful.

Change 651019 had a related patch set uploaded (by Effie Mouzeli; owner: Effie Mouzeli):
[operations/puppet@production] hiera: upgrade mc1023, mc2023 to buster

https://gerrit.wikimedia.org/r/651019

Change 651019 merged by Effie Mouzeli:
[operations/puppet@production] hiera: upgrade mc1023, mc2023 to buster

https://gerrit.wikimedia.org/r/651019

Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts:

mc1023.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202012210733_jiji_12827_mc1023_eqiad_wmnet.log.

Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts:

mc2023.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202012210733_jiji_12857_mc2023_codfw_wmnet.log.

Completed auto-reimage of hosts:

['mc1023.eqiad.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['mc2023.codfw.wmnet']

and were ALL successful.

Change 651227 had a related patch set uploaded (by Effie Mouzeli; owner: Effie Mouzeli):
[operations/puppet@production] hiera: upgrade mc1024, mc2024 to buster

https://gerrit.wikimedia.org/r/651227

Change 651227 merged by Effie Mouzeli:
[operations/puppet@production] hiera: upgrade mc1024, mc2024 to buster

https://gerrit.wikimedia.org/r/651227

Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts:

mc1024.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202012211740_jiji_14701_mc1024_eqiad_wmnet.log.

Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts:

mc2024.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202012211740_jiji_14740_mc2024_codfw_wmnet.log.

Completed auto-reimage of hosts:

['mc2024.codfw.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['mc1024.eqiad.wmnet']

and were ALL successful.

Change 654432 had a related patch set uploaded (by Effie Mouzeli; owner: Effie Mouzeli):
[operations/puppet@production] hiera: upgrade mc1025, mc2025 to buster

https://gerrit.wikimedia.org/r/654432

Change 654432 merged by Effie Mouzeli:
[operations/puppet@production] hiera: upgrade mc1025, mc2025 to buster

https://gerrit.wikimedia.org/r/654432

Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts:

mc1025.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202101051519_jiji_31831_mc1025_eqiad_wmnet.log.

Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts:

mc2025.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202101051519_jiji_31845_mc2025_codfw_wmnet.log.

Completed auto-reimage of hosts:

['mc1025.eqiad.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['mc2025.codfw.wmnet']

and were ALL successful.

Change 654639 had a related patch set uploaded (by Effie Mouzeli; owner: Effie Mouzeli):
[operations/puppet@production] hiera: upgrade mc1026, mc2026 to buster

https://gerrit.wikimedia.org/r/654639

Change 654639 merged by Effie Mouzeli:
[operations/puppet@production] hiera: upgrade mc1026, mc2026 to buster

https://gerrit.wikimedia.org/r/654639

Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts:

mc1026.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202101061642_jiji_3837_mc1026_eqiad_wmnet.log.

Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts:

mc2026.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202101061642_jiji_3954_mc2026_codfw_wmnet.log.

Completed auto-reimage of hosts:

['mc2026.codfw.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['mc1026.eqiad.wmnet']

and were ALL successful.

Change 654908 had a related patch set uploaded (by Effie Mouzeli; owner: Effie Mouzeli):
[operations/puppet@production] hiera: upgrade mc1027, mc2027 to buster

https://gerrit.wikimedia.org/r/654908

Change 654908 merged by Effie Mouzeli:
[operations/puppet@production] hiera: upgrade mc1027, mc2027 to buster

https://gerrit.wikimedia.org/r/654908

Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts:

mc2027.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202101071831_jiji_32074_mc2027_codfw_wmnet.log.

Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts:

mc1027.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202101071831_jiji_32119_mc1027_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['mc2027.codfw.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['mc1027.eqiad.wmnet']

and were ALL successful.

Change 655372 had a related patch set uploaded (by Effie Mouzeli; owner: Effie Mouzeli):
[operations/puppet@production] hiera: upgrade mc1030, mc2030 to buster

https://gerrit.wikimedia.org/r/655372

Change 655374 had a related patch set uploaded (by Effie Mouzeli; owner: Effie Mouzeli):
[operations/puppet@production] hiera: upgrade mc1028, mc2037 to buster

https://gerrit.wikimedia.org/r/655374

Change 655373 had a related patch set uploaded (by Effie Mouzeli; owner: Effie Mouzeli):
[operations/puppet@production] hiera: upgrade mc1029, mc2029 to buster

https://gerrit.wikimedia.org/r/655373

Change 655372 merged by Effie Mouzeli:
[operations/puppet@production] hiera: upgrade mc1030, mc2030 to buster

https://gerrit.wikimedia.org/r/655372

Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts:

mc1030.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202101110803_jiji_29098_mc1030_eqiad_wmnet.log.

Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts:

mc2030.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202101110803_jiji_29178_mc2030_codfw_wmnet.log.

Completed auto-reimage of hosts:

['mc2030.codfw.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['mc1030.eqiad.wmnet']

and were ALL successful.

Change 655374 merged by Effie Mouzeli:
[operations/puppet@production] hiera: upgrade mc1028, mc2037 to buster

https://gerrit.wikimedia.org/r/655374

Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts:

mc1028.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202101111539_jiji_20058_mc1028_eqiad_wmnet.log.

Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts:

mc2037.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202101111539_jiji_20266_mc2037_codfw_wmnet.log.

Completed auto-reimage of hosts:

['mc1028.eqiad.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['mc2037.codfw.wmnet']

and were ALL successful.

Change 655373 merged by Effie Mouzeli:
[operations/puppet@production] hiera: upgrade mc1029, mc2029 to buster

https://gerrit.wikimedia.org/r/655373

Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts:

mc1029.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202101131133_jiji_27489_mc1029_eqiad_wmnet.log.

Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts:

mc2029.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202101131134_jiji_28597_mc2029_codfw_wmnet.log.

Completed auto-reimage of hosts:

['mc1029.eqiad.wmnet']

Of which those FAILED:

['mc1029.eqiad.wmnet']

Completed auto-reimage of hosts:

['mc2029.codfw.wmnet']

Of which those FAILED:

['mc2029.codfw.wmnet']

jijiki claimed this task.

Despite what the above messages say, mc2029 and mc1029 were properly reimaged 🎉

Change 656385 had a related patch set uploaded (by Effie Mouzeli; owner: Effie Mouzeli):
[operations/puppet@production] hiera: clean up memcached configuration

https://gerrit.wikimedia.org/r/656385

Change 656385 merged by Effie Mouzeli:
[operations/puppet@production] hiera: clean up memcached configuration

https://gerrit.wikimedia.org/r/656385