
Reduce number of shards in redis_sessions cluster
Closed, ResolvedPublic

Description

The redis_sessions cluster currently hosts ~2.5GB of data per DC, sharded across 18 servers, which is overkill. Let's reduce the number of shards to a more sensible number, e.g. 8.

Requirements:

  • 2 shards per row (if possible)
  • mediawiki rdb servers in InitialiseSettings.php
    • mc1030 C4
    • mc1022 A6
    • mc1034 D4
    • mc2030 C5
    • mc2022 A8
    • mc2034 D4

How?
We will remove one server pair (eqiad-codfw) first, and then the rest of them in one go.

Notes
The following memcached servers should be reimaged:

  • mc2019
  • mc2021
  • mc2023
  • mc2025
  • mc2027
  • mc2037
  • mc2029
  • mc2031
  • mc2033
  • mc2035
  • mc1037
  • mc1038

Event Timeline

Note that deployment of T113916 was halted because the Redis capacity was actually considered too small. (That task is about migrating the module dependency store from a MW core DB table, written to during GET requests, to the Main Stash instead.)

I do believe, however, that after the MainStash DB (x2, T113916) is fully online and configured as the MainStash backend in MW, this cluster can indeed be downsized, perhaps even much further to only 3 or 4 servers. ChronologyProtector would likely be its only consumer, perhaps with one or two others besides it that store small, short-lived dc-local values at an even lower frequency.

The first step is to reduce it to 8 servers. The next one would be to use the redis_misc cluster for that data (T280586), once T212129 is completed.

akosiaris triaged this task as Medium priority. Apr 21 2021, 11:16 AM

@kostajh @Krinkle, I would like to move this task forward. My plan is to remove one or two redis shards per day until we have 8 left. Right now the dataset is 2.2GB, with about 110-120MB of data in each shard. Our current maxmemory setting is 500MB, meaning each shard can grow to 500MB before LRU eviction kicks in. That gives us a max capacity of 4GB (8 × 500MB) before we start seeing evictions, and we have enough memory to raise that to 1GB per shard or more if needed.
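The arithmetic above can be sketched as a quick back-of-the-envelope check (the figures are the ones quoted in this comment, not live measurements):

```python
# Rough capacity math for the shrink, using the numbers quoted above.
dataset_gb = 2.2      # current total dataset size per DC
shards_after = 8      # target number of shards
maxmemory_mb = 500    # per-shard Redis maxmemory before LRU eviction

# Expected per-shard footprint once the data is spread over 8 shards.
per_shard_mb = dataset_gb * 1024 / shards_after

# Total headroom before evictions start, at the current maxmemory.
capacity_gb = shards_after * maxmemory_mb / 1024

print(f"~{per_shard_mb:.0f}MB per shard after the shrink")
print(f"~{capacity_gb:.1f}GB total before LRU evictions start")
```

So each shard would sit well below the eviction threshold, with room to roughly double the dataset before LRU kicks in.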

Good to go from both of us. Last time we did maintenance (T252391), we realized that the instrumentation relying on the stronger persistence was no longer needed, and it was disabled at the time (October 2020). It has not been re-enabled since, so the code we looked at in WikimediaEvents is actually unused at the moment. All other consumers of MainStash in prod are, to my knowledge, resilient enough to total or partial churn to be okay without notice if it happens just once or twice a year.

My only recommendation would be to remove the shards that are going away all at once on day 1, so that there's only one re-hashing instead of repeated session/data losses several days in a row.
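To illustrate why a single re-hash beats repeated ones, here is a toy sketch using naive modulo sharding. This is not the production hashing scheme (a consistent-hashing ring churns far less per step), but it shows how incremental removals compound key movement compared to one shrink:

```python
# Toy model: count keys whose shard assignment never changes across a
# sequence of cluster sizes, under naive modulo sharding.
def surviving_fraction(n_keys, sizes):
    """Fraction of keys that stay on the same shard index through all sizes."""
    stable = 0
    for key in range(n_keys):
        assignments = [key % n for n in sizes]
        if all(a == assignments[0] for a in assignments):
            stable += 1
    return stable / n_keys

n_keys = 100_000
# Shrinking 18 -> 8 in one step vs. removing one shard per day.
one_shot = surviving_fraction(n_keys, [18, 8])
stepwise = surviving_fraction(n_keys, list(range(18, 7, -1)))

print(f"one-shot 18->8: {one_shot:.1%} of keys keep their shard")
print(f"one-at-a-time:  {stepwise:.1%} of keys keep their shard")
```

In this toy model, stepping through every intermediate size leaves almost no key untouched, while the one-shot shrink disturbs each key at most once; that is the intuition behind removing all the departing shards on day 1.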

Change 713619 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/mediawiki-config@master] ProductionServices: change rdb* servers in eqiad and codfw

https://gerrit.wikimedia.org/r/713619

Change 713619 merged by Effie Mouzeli:

[operations/mediawiki-config@master] ProductionServices: change rdb* servers in eqiad and codfw

https://gerrit.wikimedia.org/r/713619

Mentioned in SAL (#wikimedia-operations) [2021-08-18T13:57:10Z] <jiji@deploy1002> Synchronized wmf-config/ProductionServices.php: Config: [[gerrit:713619|ProductionServices: change rdb* servers in eqiad and codfw (T280582)]] (duration: 01m 51s)

Change 713655 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/puppet@production] hieradata: remove shard01 from redis_sessions

https://gerrit.wikimedia.org/r/713655

Change 713655 merged by Effie Mouzeli:

[operations/puppet@production] hieradata: remove shard01 from redis_sessions

https://gerrit.wikimedia.org/r/713655

Change 713842 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/puppet@production] hieradata: remove 9 redis shards

https://gerrit.wikimedia.org/r/713842

Change 713842 merged by Effie Mouzeli:

[operations/puppet@production] hieradata: remove 9 redis shards

https://gerrit.wikimedia.org/r/713842

jijiki renamed this task from Shrink redis_sessions cluster to Reduce number of shards in redis_sessions cluster. Thu, Aug 19, 6:58 PM

Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts:

['mc2019.codfw.wmnet', 'mc1037.eqiad.wmnet', 'mc1038.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202108200649_jiji_30725.log.

Completed auto-reimage of hosts:

['mc1037.eqiad.wmnet', 'mc2019.codfw.wmnet', 'mc1038.eqiad.wmnet']

and were ALL successful.

Change 714049 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/mediawiki-config@master] ProductionServices: replace redis_lock eqiad servers

https://gerrit.wikimedia.org/r/714049

Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts:

['mc2021.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202108201419_jiji_32012.log.

Completed auto-reimage of hosts:

['mc2021.codfw.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts:

['mc2023.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202108230658_jiji_14160.log.

Completed auto-reimage of hosts:

['mc2023.codfw.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts:

['mc2025.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202108231035_jiji_13984.log.

Completed auto-reimage of hosts:

['mc2025.codfw.wmnet']

and were ALL successful.

Mentioned in SAL (#wikimedia-operations) [2021-08-23T12:55:10Z] <jiji@deploy1002> Synchronized wmf-config/ProductionServices.php: Config: [[gerrit:713619|ProductionServices: change rdb* servers in eqiad and codfw (T280582)]] (duration: 00m 57s)

Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts:

['mc2027.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202108231602_jiji_24205.log.

Completed auto-reimage of hosts:

['mc2027.codfw.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts:

['mc2037.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202108240555_jiji_4664.log.

Completed auto-reimage of hosts:

['mc2037.codfw.wmnet']

and were ALL successful.

jijiki updated the task description.

Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts:

['mc2029.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202108241104_jiji_13758.log.

Completed auto-reimage of hosts:

['mc2029.codfw.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts:

['mc2031.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202108241356_jiji_8432.log.

Completed auto-reimage of hosts:

['mc2031.codfw.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts:

['mc2033.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202108250836_jiji_17437.log.

Completed auto-reimage of hosts:

['mc2033.codfw.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts:

['mc2035.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202108251009_jiji_4949.log.

Completed auto-reimage of hosts:

['mc2035.codfw.wmnet']

and were ALL successful.

jijiki updated the task description.