
Audit redis usage on toollabs
Closed, ResolvedPublic

Description

The tools-redis instance is currently full (15G) and staying there. Figure out who is using it for what.

How to audit

$ ssh tools-redis

# get redis-rdb-tools
$ git clone  https://github.com/sripathikrishnan/redis-rdb-tools
$ cd redis-rdb-tools
$ virtualenv venv
$ source venv/bin/activate
$ python setup.py develop

# copy the rdb dump
$ sudo cp /srv/redis/tools-redis-XXXX.rdb tools-redis-valhallasw-dump.rdb
$ sudo chown valhallasw tools-redis-valhallasw-dump.rdb && chmod 600 tools-redis-valhallasw-dump.rdb

# create memory.csv
$ rdb -c memory  tools-redis-valhallasw-dump.rdb > memory.csv

Then run the following script (saved as step1.py):

import csv
import collections
import sys

def sizeof_fmt(num, suffix='B'):
    for unit in ['','K','M','G','T','P','E','Z']:
        if abs(num) < 1024.0:
            return "%3.1f%s%s" % (num, unit, suffix)
        num /= 1024.0
    return "%.1f%s%s" % (num, 'Y', suffix)

reader = csv.DictReader(sys.stdin)
sizes = collections.defaultdict(int)     # total bytes per key prefix
numkeys = collections.defaultdict(int)   # number of keys per key prefix

for line in reader:
    name = line['key']
    size = int(line['size_in_bytes'])
    # Aggregate every key under its prefixes of length 3 to 14 characters.
    for l in range(3, 15):
        sizes[name[:l]] += size
        numkeys[name[:l]] += 1
    # Keys that parse as plain hexadecimal have no recognisable prefix;
    # tally those separately.
    try:
        int(name, 16)
    except ValueError:
        pass
    else:
        sizes['!! hash without prefix'] += size
        numkeys['!! hash without prefix'] += 1


# Collapse runs of prefixes that cover the same total size, keeping only the
# last (longest) prefix of each run, and skip anything below ~100kB.
outfile = open('out1.tab', 'w')
lastsize = -1
lastline = ""

for k in sorted(sizes.keys()):
    size = sizes[k]
    if size < 100000:
        continue
    if k.startswith('rq:job:'):
        # skip the per-job prefixes; the aggregate shows up under 'rq:job'
        continue
    line = k + '\t' + sizeof_fmt(numkeys[k], "") + '\t' + sizeof_fmt(size) + "\n"
    if size != lastsize:
        outfile.write(lastline)
    lastline = line
    lastsize = size
outfile.write(lastline)
outfile.close()

using

$ pv memory.csv | python step1.py

and then read out1.tab.
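As a quick complement to the prefix report in out1.tab, the largest individual keys can also be read straight off memory.csv (a sketch; column names as produced by rdb -c memory):

import csv
import heapq

# Print the ten largest individual keys from the rdb memory report.
with open('memory.csv') as f:
    rows = csv.DictReader(f)
    top = heapq.nlargest(10, rows, key=lambda r: int(r['size_in_bytes']))

for row in top:
    print('%s\t%s\t%s' % (row['key'], row['size_in_bytes'], row['type']))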

Event Timeline

yuvipanda raised the priority of this task from to Needs Triage.
yuvipanda updated the task description.
yuvipanda added a project: Toolforge.
yuvipanda added subscribers: yuvipanda, coren, scfc.
coren triaged this task as High priority.
coren moved this task from Backlog to In Progress on the Toolforge board.

Made a copy of the redis db to /srv/redis/tools-redis-valhallasw-dump.rdb. Now running rdb -c memory [1] to see if there's any clear memory hog.

[1] https://github.com/sripathikrishnan/redis-rdb-tools

Another idea is to use CLIENT LIST and INFO together to correlate increases in memory use with the clients responsible, but the CLIENT command is disabled.
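For reference, on a Redis instance where CLIENT is not disabled, that correlation could look roughly like this (a sketch using redis-py; host name and sampling interval are assumptions):

import time
import redis  # sketch only; assumes redis-py and that CLIENT LIST is permitted

r = redis.StrictRedis(host='tools-redis')

while True:
    used = r.info('memory')['used_memory']
    # CLIENT LIST reports, per connection, the remote address and the last
    # command issued; logging both makes it possible to line up memory jumps
    # with the client responsible.
    clients = [(c['addr'], c['cmd']) for c in r.client_list()]
    print('%s %s %s' % (time.time(), used, clients))
    time.sleep(60)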

I now have a memory.csv in the following format:

database,type,key,size_in_bytes,encoding,num_elements,len_largest_element
0,hash,"rq:job:b3b0810c-42c8-448b-9663-9a419388fbdf",2202,hashtable,7,597

Parsing it with simple python tools takes ~1 min for the entire file, so that's quite OK. The main offender seems to be the 'rq:job' job queue (?), which basically takes up 11GB of our 12GB. Now we just have to figure out who this is :-)

Based on the contents of the jobs, I'm guessing it's https://github.com/notconfusing/cocytus that's pushing all these elements in the queue. @notconfusing, do you have any idea why the jobs are not cleared from redis?

This issue can probably just be solved by setting a sensible TTL; the jobs themselves have a TTL (180s), but apparently the rq library doesn't actually tell Redis about this, as the Redis TTL is -1 (=infinite).
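For reference, a quick way to confirm the missing expiry is to ask Redis for a job key's TTL (a sketch with redis-py; the key name is the example from the dump above):

import redis  # sketch; assumes redis-py is available

r = redis.StrictRedis(host='tools-redis')
# -1 means the key exists but has no expiry set; -2 means it does not exist.
print(r.ttl('rq:job:b3b0810c-42c8-448b-9663-9a419388fbdf'))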

Some more results (prefix, number of keys with that prefix, total size of all keys with that prefix):

AnomieBOT       504     213.9KB
bub_            167.5K  59.5MB
pir^2_iw:http:  129     265.2KB
rq:             5.2M    11.5GB
rq:finished:ch  1       1.1GB
rq:job          5.2M    10.5GB
rq:queue        3       8.4MB
tools.orphanta  1       109.2KB
xtools          808     12.8MB
xtoolsCNTx0000  19      595.0KB
xtoolsI18N_x00  1       537.8KB
xtoolsMetapDat  1       117.0KB
xtoolsNamespac  643     918.3KB
xtoolsrfap      140     10.6MB
xtoolsrfapde.w  2       104.4KB
xtoolsrfapen.w  138     10.5MB

It's clear the issue lies with the rq queue, which has 5.2 million (!) unprocessed entries from the last week alone (the oldest entry is from April 11).

There are also a number of keys that look like hashes but don't have a prefix (115 keys, 1.3MB). The total size is so small that I don't think we need to be too bothered by it.

I talked to @notconfusing and he was OK with me killing all the rq: jobs, which I have now done.

Can you specify exactly what action you took to remove these jobs from the queue? Our service has been down while I've been hunting for the bug, so far to no avail. I'd like to start it up again, but with an additional task that periodically clears the queue manually, so we can stay up until a better solution is found.

@Dfko: you'll have to keep track of the requests you put in; it's not possible to get a list of all requests from redis (except for admins). Maybe rq keeps track of them itself (maybe in redis), but that's not clear to me.

In any case, the jobs should have their redis TTL set to a reasonable value (i.e. that of the underlying task expiry time). It's not clear to me why this doesn't happen, because rq is supposed to do this by default. Maybe this is a bug in the rq lib, maybe cocytus is using rq incorrectly.
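For illustration, this is roughly how the expiry is normally specified when enqueueing with rq (a sketch; my_task is a placeholder, and the ttl keyword may not exist in older rq releases):

from redis import StrictRedis
from rq import Queue


def my_task():
    pass  # placeholder for the real cocytus task


q = Queue(connection=StrictRedis(host='tools-redis'))

# result_ttl: how long the finished-job hash is kept in Redis (seconds).
# ttl: how long the job may wait in the queue before being discarded.
job = q.enqueue(my_task, result_ttl=180, ttl=180)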

Yeah, I've been trying to figure this out. My best guess from the documentation so far is that failed jobs may not get timeouts set the same way originally submitted jobs do; I need to work with this hypothesis more and maybe dig into the source of rq.

Maybe the best option is to start up cocytus for a while and see what gets entered into Redis? Just keep an eye on http://graphite.wmflabs.org/render/?width=586&height=308&_salt=1430297062.188&target=tools.tools-redis.redis.6379.memory.internal_view&target=tools.tools-redis.redis.6379.memory.external_view&from=-3hours , and let me know when it gets up to, say, 600MB? Then I'll make another redis dump and send you a list of all rq: keys (100MB at 2.2kB/key ~ roughly 50k keys), which will hopefully help determine what's going on.

valhallasw lowered the priority of this task from High to Medium.Apr 29 2015, 8:48 AM

Great, I was going to ask you about that anyway.

@valhallasw We're up to about a gigabyte, can I get a dump now?

Looks like this is happening again; I've poked the culprits from last time to see if it's them again.

yuvipanda changed the task status from Open to Stalled.Apr 30 2015, 5:01 AM
yuvipanda changed the task status from Stalled to Open.

Just realized @Dfko is one of the 'culprits' :) Am making a dump now.

We started it up again to get a key dump (see previous 3 comments)
@yuvipanda

Still figuring out the significance of this, but there is a failure in a line dealing with TTLs in the rq library's registry.py that results in churn from the failed queue back onto the failed queue, which may be behind this:

07:11:24 failed: crossref_push.heartbeat() (73066422-b69a-4479-8c8c-d728a91672b8)
07:11:27 TypeError: unsupported operand type(s) for +: 'int' and 'unicode'
Traceback (most recent call last):
  File "/data/project/cocytus/c-env/local/lib/python2.7/site-packages/rq/worker.py", line 572, in perform_job
    finished_job_registry.add(job, result_ttl, pipeline)
  File "/data/project/cocytus/c-env/local/lib/python2.7/site-packages/rq/registry.py", line 29, in add
    score = ttl if ttl < 0 else current_timestamp() + ttl
TypeError: unsupported operand type(s) for +: 'int' and 'unicode'
Traceback (most recent call last):
  File "/data/project/cocytus/c-env/local/lib/python2.7/site-packages/rq/worker.py", line 572, in perform_job
    finished_job_registry.add(job, result_ttl, pipeline)
  File "/data/project/cocytus/c-env/local/lib/python2.7/site-packages/rq/registry.py", line 29, in add
    score = ttl if ttl < 0 else current_timestamp() + ttl
TypeError: unsupported operand type(s) for +: 'int' and 'unicode'
07:11:27 Moving job to failed queue.

Possibly unhelpful suggestion (feel free to ignore) - have you considered using celery for queuing instead?

Considered early on, and this is making me consider it again, though there might not be enough time left in the project for a big move like that.

I was not able to find an explicit TTL format that would not trigger this bug, but not specifying it at all seems not to trigger it. I am going to try to start things up again and see if it works that way. The default TTL is 500 seconds.

How are you defining the TTL currently? The error suggests that result_ttl passed by worker.py:572 is a str instead of an int.

As for not defining a TTL: that should use https://github.com/nvie/rq/blob/5cb873b438f1db009447e4feb17fb0f2075b272f/rq/worker.py#L40 which uses a TTL of 420 or 500s. Those values sound reasonable to me, but might be too short for you.
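A minimal Python 2 sketch of the failure mode, assuming the TTL arrives as a unicode string (values are made up):

# Python 2: comparing unicode to an int does not fail (u'500' < 0 is simply
# False), so execution falls through to the addition, which then raises the
# TypeError seen in the worker log above.
ttl = u'500'                  # a TTL passed straight through from the command line
current_timestamp = 1430000000

# score = ttl if ttl < 0 else current_timestamp + ttl   # raises TypeError
score = current_timestamp + int(ttl)                     # coercing first works
print(score)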

It happens whenever I pass anything via the command-line parameter of the worker, whether it can be parsed as an int or not, so I am guessing they simply forgot to call int() on that parameter. I haven't yet tracked down where that should be happening, though.
I think we will have to live with dropping events under load rather than filling the queues, but 15 minutes seems plenty to keep up with the bursts we've seen so far, since the final stage in the pipeline does not take long.

I've run it for about 20 minutes; the queue seems to have plateaued at about 2 gigs and is stable. Will keep an eye on it for a bit longer.

The problem is that without a TTL you fill up the redis queue, causing it to be unusable for other tools as well...

It happens whenever I pass anything via the command-line parameter of the worker, whether it can be parsed as an int or not, so I am guessing they simply forgot to call int() on that parameter.

That might be https://github.com/nvie/rq/pull/359

so upgrading rq might help with that.

As for the TTL -- in general, I think that should be a backup measure, not the main method. You probably want to actively collect the failed tasks and do something with them instead.

The problem is that without a TTL you fill up the redis queue, causing it to be unusable for other tools as well...

If you don't specify a TTL, rq should use the default TTL (which is 420s for tasks and 500s for results).

Do you think the default of 15 minutes will be too long? >20 events per second is rare, so a very conservative estimate of queue load would be 18,000 events in flight.

We have workers on the failed queue already. We were getting jobs cycling from failed -> failed forever due to this, but that should be resolved now.

That should be fine; the entries are ~2.5kB each, so 18k events is < 50MB.
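A quick back-of-the-envelope check of those figures (the rates and sizes are the ones quoted above):

events_per_sec = 20            # rough peak event rate
ttl_sec = 15 * 60              # the 15-minute default under discussion
bytes_per_entry = 2.5 * 1024   # ~2.5kB per rq:job entry

in_flight = events_per_sec * ttl_sec             # 18,000 events
total_mb = in_flight * bytes_per_entry / 1e6     # ~46MB, i.e. < 50MB
print('%d events, ~%.0f MB' % (in_flight, total_mb))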

There seems to be some sort of leak again, growing to ~6GB from the 500MB we had two weeks ago: http://graphite.wmflabs.org/render/?width=586&height=308&_salt=1430297062.188&target=tools.tools-redis.redis.6379.memory.internal_view&target=tools.tools-redis.redis.6379.memory.external_view&from=-14days

Some numbers on the analysis tools. rdb runs at approx 1.5MB/s, so the current ~3GB dump takes half an hour. I'm not sure why the memory usage is almost twice the dump size, though...

From what I can see, most of it is from rq again:

rq:             1.5M    4.7GB
rq:finished:ch  1.0     1.2GB
rq:job          1.5M    3.4GB

so there's a 1.2GB (?!) single key in rq:finished, and 3.4GB of rq:job entries...
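To pin down that oversized rq:finished entry, the key's type and element count can be checked directly (a sketch; the table above only shows the truncated prefix 'rq:finished:ch', so the key name below is a placeholder):

import redis  # sketch; assumes redis-py

r = redis.StrictRedis(host='tools-redis')
key = 'rq:finished:...'   # placeholder: substitute the real key name
print(r.type(key))        # rq's finished-job registry is kept in a sorted set
print(r.zcard(key))       # number of job ids recorded in that registry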

Interesting...
Getting back on it.
Can you post that dump for me somewhere?

Yep, it's in /home/dfko/redis/dfko.csv.

I'm disabling the tool causing the issues (same as last time, one from @Dfko) until they fix it.

Suspended all those jobs. Am cleaning out redis of rq:* now.