
Avoid constant evictions on Redis main stash
Closed, Resolved, Public

Description

Evictions are fine under pressure or when the service is degraded, but ideally they would not happen by default at a constant rate every second. That suggests we're over capacity and/or that there is too much stuff without any sort of TTL.

It looks like there are about 100 evictions/second across the Redis main stash (aka redis_sessions).
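For reference, a rate like this can be derived from the `evicted_keys` counter that Redis reports under `INFO stats`. A minimal sketch, assuming a redis-py style client; the helper name `eviction_rate` and the 10-second sampling interval are illustrative and not how the dashboard computes it:

```python
import time


def eviction_rate(client, interval=10.0, sleep=time.sleep):
    """Return evictions per second, sampled over `interval` seconds.

    `client` is assumed to be a redis-py client, or anything whose
    info('stats') returns a dict containing an 'evicted_keys' counter.
    """
    before = client.info('stats')['evicted_keys']
    sleep(interval)
    after = client.info('stats')['evicted_keys']
    return (after - before) / float(interval)
```

The counter is cumulative since server start, so only the difference between two samples is meaningful.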

Event Timeline

I checked one of these (mc1020) and noticed that the majority of its space is still held by Echo.

mc1020 Breakdown of redis key scan
* 2.9M   :echo:seen
*   180K centralauth:session:
*    28K :MWSession:
*    11K ChronologyProtector:
*    11K (everything else)
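A breakdown like the one above can be approximated by walking the keyspace and counting keys per known substring. This is only a sketch: `group_of` and `key_breakdown` are hypothetical helper names, the substrings are copied from the list above, and it counts keys rather than bytes used.

```python
from collections import Counter

# Substrings taken from the breakdown above; anything else is lumped together.
KNOWN_GROUPS = (
    ':echo:seen',
    'centralauth:session:',
    ':MWSession:',
    'ChronologyProtector:',
)


def group_of(key):
    """Map a Redis key (bytes or str) to one of the known groups."""
    if isinstance(key, bytes):
        key = key.decode('utf-8', 'replace')
    for group in KNOWN_GROUPS:
        if group in key:
            return group
    return '(everything else)'


def key_breakdown(keys):
    """Count keys per group for any iterable of key names."""
    return Counter(group_of(key) for key in keys)
```

With a redis-py client this could be fed from `client.scan_iter(count=1000)`, which walks the keyspace incrementally rather than blocking the server the way `KEYS *` would.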

This is quite likely the reason we are still seeing pre-emptive evictions from the MW MainStash. I was hoping that perhaps these evictions are just Redis slowly cleaning up the now-unused Echo keys. However that doesn't make sense because:

* 50+ evictions per second
* 150K evictions per hour
* 3M evictions per day

The Redis mainstash was last written to by Echo in October 2019 (T222851), so those keys would all have been evicted by now if it were just that.
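The arithmetic behind that can be sketched as follows. It uses the 2.9M Echo keys counted on mc1020 and the 3M/day figure quoted above. Whether that rate is per shard or cluster-wide is not stated here, so both cases are shown; the even split over 14 shards is an assumption.

```python
def days_to_drain(key_count, evictions_per_day):
    """Days needed to evict `key_count` keys at a steady daily eviction rate."""
    return key_count / float(evictions_per_day)


ECHO_KEYS = 2.9e6          # ':echo:seen' keys counted on mc1020
EVICTIONS_PER_DAY = 3e6    # daily eviction figure quoted above
SHARDS = 14                # assumption: rate is cluster-wide and spread evenly

# Case 1: every eviction lands on this shard's Echo keys.
print('%.1f days' % days_to_drain(ECHO_KEYS, EVICTIONS_PER_DAY))
# Case 2: this shard only sees 1/14th of the evictions.
print('%.1f days' % days_to_drain(ECHO_KEYS, EVICTIONS_PER_DAY / SHARDS))
```

Either way the result is on the order of a day to a couple of weeks, not the seven months that have passed since October 2019.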

Instead, I suspect rather that it was/is at capacity and thus is regularly evicting about as many keys as were added at a fairly constant rate, and those evictions are (to a first approximation) random, thus affecting newer data as well.

Best I can tell, the migration in Oct 2019 (T222851) was final and complete, so these are probably all safe to remove now. But I'll wait for confirmation from @Catrope before I do that.

From a few spot checks, it looks like most if not all of these actually have no meaningful value (e.g. the 19700101000001 epoch default), and also no TTL set. That would explain why Redis is still trying its best to keep them around.

As part of T222851, these issues were already fixed. The keys were given a TTL of 1 year, and Echo no longer stores these "empty" epoch values (it just deletes the key instead of overwriting it with the epoch). So at least this portion of keys is fine to remove, even in the hypothetical case that something still uses this and/or in the event we were to switch back.
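To illustrate the behaviour described above (this is not Echo's actual code, which is PHP), a write path along those lines might look like the sketch below. `store_seen_time` is a hypothetical helper, and `client` is assumed to be a redis-py style client:

```python
ONE_YEAR = 365 * 24 * 60 * 60       # TTL in seconds
EPOCH_DEFAULT = '19700101000001'    # the "empty" value seen in the spot checks


def store_seen_time(client, key, timestamp):
    """Hypothetical helper: store a seen-time with a TTL, or drop the key.

    Returns True if a value was written, False if the key was deleted.
    """
    if not timestamp or timestamp == EPOCH_DEFAULT:
        # Nothing meaningful to remember: delete instead of storing the epoch.
        client.delete(key)
        return False
    # Every write carries an expiry, so the key stays evictable and
    # eventually goes away on its own.
    client.set(key, timestamp, ex=ONE_YEAR)
    return True
```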

Script is at P11211, which I ran on mc1020. This freed up 80% of its storage capacity and there are no longer pre-emptive evictions taking place. Yay :)

Screenshot 2020-05-16 at 17.52.06.png (588×2 px, 242 KB)

from __future__ import print_function
import os
import time

import redis


# Connects to the local Redis instance; the password is read from $AUTH.
red = redis.StrictRedis(password=os.environ['AUTH'])

query_key_pattern = '*:echo:seen:*'

# Bytes literal so the check also works on Python 3, where redis-py
# returns keys as bytes by default.
cond_key = b':echo:seen:'
cond_ttl = -1  # TTL reported for a key that exists but has no expiry
# cond_value = 's:14:"19700101000001";'

checked = 0
printed = 0
deleted = 0

for key in red.scan_iter(match=query_key_pattern, count=1000):
    if cond_key not in key:
        # sanity
        print('F', end='')
        continue

    checked += 1
    if checked > 1000:
        # idle for 100ms
        time.sleep(0.100)
        checked = 0

    if red.ttl(key) == cond_ttl:
        red.delete(key)
        deleted += 1
        printed += 1
        if printed > 80:
            print('.. deleted ' + str(deleted) + ' so far')
            printed = 0
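Before deleting anything, a read-only pass over the same key pattern can show how many keys would be affected. A minimal sketch, assuming a redis-py style client; `count_without_ttl` is a hypothetical helper and not part of P11211:

```python
def count_without_ttl(client, pattern='*:echo:seen:*', batch=1000):
    """Read-only pass: count keys matching `pattern` that have no expiry.

    Returns a (matched, without_ttl) tuple and never modifies the keyspace.
    """
    matched = 0
    without_ttl = 0
    for key in client.scan_iter(match=pattern, count=batch):
        matched += 1
        if client.ttl(key) == -1:   # -1: key exists but has no TTL
            without_ttl += 1
    return matched, without_ttl
```

Note that this still issues one `TTL` round trip per key, so on a busy shard it would want the same kind of throttling as the script above.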

Mentioned in SAL (#wikimedia-operations) [2020-05-16T17:24:27Z] <Krinkle> krinkle@mc1020 Prune old echo:seen: keys that have ttl:-1 from Redis main stash, ref T252945

Krinkle updated the task description.
@Krinkle wrote in task description

Screenshot 2020-05-16 at 17.20.10.png (596×2 px, 380 KB)

That suggests we're over capacity and/or that there is too much stuff without any sort of TTL.

@elukey confirmed this on IRC. The invisible ceiling in the above graph matches the memory limit allocated to Redis mainstash shards:

<elukey> this is from /etc/redis/tcp_6379.conf

maxmemory 500Mb
maxmemory-policy volatile-lru

<Krinkle> I know LRU stands for "least-recently-used"...
<Krinkle> I also know not to expect perfection in caching/lru
<Krinkle> but.. how come it still hasn't gotten to the millions of unused no-ttl keys from months months ago?
<elukey> ahh interesting!
<elukey> "Evicts the least recently used keys out of all keys with an "expire" field set"
<elukey> this is the "volatile-lru"
<elukey> otherwise there is allkeys-lru
<Krinkle> so given that we very bravely gave everything a TTL...
<Krinkle> and that 90% is used up by legacy no-ttl Echo values from pre-2019
<Krinkle> that means it's basically exclusively deleting keys we want to keep
<Krinkle> great.
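The effect described in that exchange can be reproduced with a toy model. The sketch below is a deliberate simplification: it uses exact LRU ordering and a capacity counted in keys, whereas Redis samples keys approximately and limits memory in bytes. `ToyStore`, `simulate` and the `legacy:`/`session:` key names are made up for illustration.

```python
from collections import OrderedDict


class ToyStore(object):
    """Tiny key store that evicts on write once `capacity` keys are held."""

    def __init__(self, capacity, policy):
        self.capacity = capacity
        self.policy = policy          # 'volatile-lru' or 'allkeys-lru'
        self.keys = OrderedDict()     # key -> has_ttl, least recently used first
        self.evicted = []

    def set(self, key, has_ttl):
        self.keys.pop(key, None)
        while len(self.keys) >= self.capacity:
            self._evict_one()
        self.keys[key] = has_ttl      # most recently used goes last

    def _evict_one(self):
        # volatile-lru only considers keys that have an expiry set.
        victim = next(
            (k for k, has_ttl in self.keys.items()
             if has_ttl or self.policy == 'allkeys-lru'),
            None)
        if victim is None:
            raise MemoryError('nothing evictable under ' + self.policy)
        del self.keys[victim]
        self.evicted.append(victim)


def simulate(policy):
    store = ToyStore(capacity=10, policy=policy)
    for i in range(9):
        store.set('legacy:%d' % i, has_ttl=False)   # old keys, never used again
    for i in range(20):
        store.set('session:%d' % i, has_ttl=True)   # fresh keys with a TTL
    return store
```

Under `volatile-lru` the nine no-TTL keys survive every round and only fresh `session:` keys are ever evicted, which matches what was observed on the main stash. Under `allkeys-lru` the stale keys would be the first to go.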

Mentioned in SAL (#wikimedia-operations) [2020-05-16T17:56:18Z] <Krinkle> krinkle@mc1023 Pruning old echo:seen: Redis keys that didn't use a ttl yet, ref T252945

Mentioned in SAL (#wikimedia-operations) [2020-05-16T18:24:03Z] <Krinkle> krinkle@mc1025 Pruning the old echo:seen: Redis keys that didn't have a ttl yet, ref T252945

Mentioned in SAL (#wikimedia-operations) [2020-05-16T18:30:07Z] <Krinkle> krinkle@mc1024 Pruning the old echo:seen: Redis keys that didn't have a ttl yet, ref T252945

Mentioned in SAL (#wikimedia-operations) [2020-05-16T18:54:55Z] <Krinkle> krinkle@mc1026 Pruning the old echo:seen: Redis keys that didn't have a ttl yet, ref T252945

Mentioned in SAL (#wikimedia-operations) [2020-05-16T18:58:50Z] <Krinkle> krinkle@mc1027 Pruning the old echo:seen: Redis keys that didn't have a ttl yet, ref T252945

[…] Script is at P11211, which I ran on mc1020. This freed up 80% of its storage capacity and there are no longer pre-emptive evictions taking place. Yay :)

Screenshot 2020-05-16 at 17.52.06.png (588×2 px, 242 KB)

I've since run it on the rest of the 14 shards as well. From the Redis dashboard (Grafana):

Screenshot 2020-05-18 at 15.52.51.png (1×2 px, 424 KB)

Screenshot 2020-05-18 at 15.53.21.png (1×2 px, 512 KB)

We have not used the Echo keys in Redis since January 2020, with rOMWCea2b01c00341: Echo: remove transition echo seen-time storage. When porting from Redis to Kask, we noticed that these keys had infinite TTLs in Redis, and that bothered us, but we didn't realize it was this damaging. In October 2019, we set the TTL to one year and stopped storing the epoch values you saw, and a few weeks later we moved to Kask as the primary backend with Redis as a fallback. See T222851: Improve Echo seentime code for multi-DC access and the patches linked from there.

All of these Echo keys should have been unused for months, and should be safe to delete.

Gilles claimed this task.