
Reduce number of shards in redis_sessions cluster
Closed, ResolvedPublic

Description

The redis_sessions cluster currently hosts ~2.5GB of data per DC, sharded across 18 servers, which is overkill. Let's reduce the number of shards to a more sensible number, e.g. 8.

Requirements:

  • 2 shards per row (if possible)
  • mediawiki rdb servers in InitialiseSettings.php
    • mc1030 C4
    • mc1022 A6
    • mc1034 D4
    • mc2030 C5
    • mc2022 A8
    • mc2034 D4

How?
We will remove one server pair (eqiad-codfw) first, and then the rest of them in one go.

Notes
The following memcached servers should be reimaged:

  • mc2019
  • mc2021
  • mc2023
  • mc2025
  • mc2027
  • mc2037
  • mc2029
  • mc2031
  • mc2033
  • mc2035
  • mc1037
  • mc1038

Event Timeline

Note that deployment of T113916 was halted because the Redis capacity was actually considered too small. (That task is about migrating the module dependency store from a MW core DB table, written to during GET requests, to the Main Stash instead.)

I do believe, however, that after the MainStash DB (x2, T113916) is fully online and configured as the MainStash backend in MW, this cluster can indeed be downsized, perhaps even much further to only 3 or 4 servers. ChronologyProtector would likely be its only consumer, perhaps with one or two others besides it that store small, short-lived dc-local values at an even lower frequency.

The first step is to reduce it to 8 servers. The next one would be to use the redis_misc cluster for that data (T280586), once T212129 is completed.

akosiaris triaged this task as Medium priority. Apr 21 2021, 11:16 AM

@kostajh @Krinkle, I would like to move this task forward. My plan is to remove one or two redis shards per day until we have 8 left. Right now the dataset is 2.2GB, with about 110-120MB of data in each shard. Our current maxmemory setting is 500MB, meaning each shard can grow to 500MB before LRU eviction kicks in. That gives us a max capacity of 4GB (8 × 500MB) before we start seeing evictions, and we have enough memory to raise that to 1GB per shard or more if needed.
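The arithmetic above can be sketched as a quick back-of-the-envelope check (the figures are the ones quoted in this comment, not live measurements):

```python
# Rough capacity math for the shrink, using the numbers quoted above.
dataset_gb = 2.2      # current total dataset size per DC
shards_after = 8      # target number of shards
maxmemory_mb = 500    # per-shard Redis maxmemory before LRU eviction

# Expected per-shard footprint once the data is spread over 8 shards.
per_shard_mb = dataset_gb * 1024 / shards_after

# Total headroom before evictions start, at the current maxmemory.
capacity_gb = shards_after * maxmemory_mb / 1024

print(f"~{per_shard_mb:.0f}MB per shard after the shrink")
print(f"~{capacity_gb:.1f}GB total before LRU evictions start")
```

So each shard would sit well below the eviction threshold, with room to roughly double the dataset before LRU kicks in.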

Good to go from both of us. Last time we did maintenance (T252391), we realized that the instrumentation relying on the stronger persistence was no longer needed, and it was disabled at the time (October 2020). It has not been re-enabled since, so the code we looked at in WikimediaEvents is actually unused at the moment. All other consumers of MainStash in prod are, to my knowledge, resilient enough to total or partial churn to be okay without notice if it happens just once or twice a year.

My only recommendation would be to remove the shards that are going away all at once on day 1, so that there's only one re-hashing instead of repeated session/data losses several days in a row.
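To illustrate why a single re-hash beats repeated ones, here is a toy sketch using naive modulo sharding. This is not the production hashing scheme (a consistent-hashing ring churns far less per step), but it shows how incremental removals compound key movement compared to one shrink:

```python
# Toy model: count keys whose shard assignment never changes across a
# sequence of cluster sizes, under naive modulo sharding.
def surviving_fraction(n_keys, sizes):
    """Fraction of keys that stay on the same shard index through all sizes."""
    stable = 0
    for key in range(n_keys):
        assignments = [key % n for n in sizes]
        if all(a == assignments[0] for a in assignments):
            stable += 1
    return stable / n_keys

n_keys = 100_000
# Shrinking 18 -> 8 in one step vs. removing one shard per day.
one_shot = surviving_fraction(n_keys, [18, 8])
stepwise = surviving_fraction(n_keys, list(range(18, 7, -1)))

print(f"one-shot 18->8: {one_shot:.1%} of keys keep their shard")
print(f"one-at-a-time:  {stepwise:.1%} of keys keep their shard")
```

In this toy model, stepping through every intermediate size leaves almost no key untouched, while the one-shot shrink disturbs each key at most once; that is the intuition behind removing all the departing shards on day 1.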

Change 713619 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/mediawiki-config@master] ProductionServices: change rdb* servers in eqiad and codfw

https://gerrit.wikimedia.org/r/713619

Change 713619 merged by Effie Mouzeli:

[operations/mediawiki-config@master] ProductionServices: change rdb* servers in eqiad and codfw

https://gerrit.wikimedia.org/r/713619

Mentioned in SAL (#wikimedia-operations) [2021-08-18T13:57:10Z] <jiji@deploy1002> Synchronized wmf-config/ProductionServices.php: Config: [[gerrit:713619|ProductionServices: change rdb* servers in eqiad and codfw (T280582)]] (duration: 01m 51s)

Change 713655 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/puppet@production] hieradata: remove shard01 from redis_sessions

https://gerrit.wikimedia.org/r/713655

Change 713655 merged by Effie Mouzeli:

[operations/puppet@production] hieradata: remove shard01 from redis_sessions

https://gerrit.wikimedia.org/r/713655

Change 713842 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/puppet@production] hieradata: remove 9 redis shards

https://gerrit.wikimedia.org/r/713842

Change 713842 merged by Effie Mouzeli:

[operations/puppet@production] hieradata: remove 9 redis shards

https://gerrit.wikimedia.org/r/713842

jijiki renamed this task from Shrink redis_sessions cluster to Reduce number of shards in redis_sessions cluster. Thu, Aug 19, 6:58 PM

Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts:

['mc2019.codfw.wmnet', 'mc1037.eqiad.wmnet', 'mc1038.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202108200649_jiji_30725.log.

Completed auto-reimage of hosts:

['mc1037.eqiad.wmnet', 'mc2019.codfw.wmnet', 'mc1038.eqiad.wmnet']

and were ALL successful.

Change 714049 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/mediawiki-config@master] ProductionServices: replace redis_lock eqiad servers

https://gerrit.wikimedia.org/r/714049

Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts:

['mc2021.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202108201419_jiji_32012.log.

Completed auto-reimage of hosts:

['mc2021.codfw.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts:

['mc2023.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202108230658_jiji_14160.log.

Completed auto-reimage of hosts:

['mc2023.codfw.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts:

['mc2025.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202108231035_jiji_13984.log.

Completed auto-reimage of hosts:

['mc2025.codfw.wmnet']

and were ALL successful.

Mentioned in SAL (#wikimedia-operations) [2021-08-23T12:55:10Z] <jiji@deploy1002> Synchronized wmf-config/ProductionServices.php: Config: [[gerrit:713619|ProductionServices: change rdb* servers in eqiad and codfw (T280582)]] (duration: 00m 57s)

Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts:

['mc2027.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202108231602_jiji_24205.log.

Completed auto-reimage of hosts:

['mc2027.codfw.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts:

['mc2037.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202108240555_jiji_4664.log.

Completed auto-reimage of hosts:

['mc2037.codfw.wmnet']

and were ALL successful.

jijiki updated the task description.

Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts:

['mc2029.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202108241104_jiji_13758.log.

Completed auto-reimage of hosts:

['mc2029.codfw.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts:

['mc2031.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202108241356_jiji_8432.log.

Completed auto-reimage of hosts:

['mc2031.codfw.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts:

['mc2033.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202108250836_jiji_17437.log.

Completed auto-reimage of hosts:

['mc2033.codfw.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts:

['mc2035.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202108251009_jiji_4949.log.

Completed auto-reimage of hosts:

['mc2035.codfw.wmnet']

and were ALL successful.

jijiki updated the task description.