
Revscoring 2.0 takes up too much memory
Closed, ResolvedPublic

Description

wsgi processes on ores-staging are at 3.1GB RES, compared to the former 750MB.

Event Timeline

Restricted Application added a subscriber: Aklapper.

My hypothesis is that it's all 'thresholds' because that's the new big thing.

>>> m = Model.load(open("submodules/editquality/models/enwiki.damaging.gradient_boosting.model"))
>>> len(m.info['statistics'].label_thresholds[True])
19165

OK so it looks like we're keeping data on ~20000 unique thresholds for every label.

Basically we have a unique threshold for every single input observation in testing.

That is far too many. How do we trim that but preserve information?

Maybe we could round. Rounding to 4 digits should be pretty good.
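A rough sketch of the rounding idea (the values here are made up; the real structure in revscoring's model info may differ): thresholds that only differ past the 4th decimal place collapse into a single entry.

# Hypothetical illustration: deduplicate thresholds after rounding to
# 4 decimal places, so ~20000 observed values collapse to far fewer.
raw_thresholds = [0.123456, 0.123489, 0.500012, 0.500047, 0.999991]
rounded = sorted({round(t, 4) for t in raw_thresholds})
print(rounded)  # [0.1235, 0.5, 1.0]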

With one model loaded (enwiki damaging):

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND  
32176 halfak    20   0  802432 191448  30008 S   0.0  2.4   0:02.15 python

With two models loaded (enwiki damaging, eswiki reverted):

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
32176 halfak    20   0  856580 245140  29948 S   0.0  3.1   0:02.62 python

So that adds about 54k of RES (245140k vs. 191448k). What if I keep a reference to the model info and then remove the references to the model itself?

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND  
32176 halfak    20   0  856576 245508  30008 S   0.0  3.1   0:02.85 python

Well... somehow that took more RES. I even used gc.collect(). I wonder if there is a reference to the model itself in model.info somewhere. Hmm. Could it be that the "info" itself accounts for most of the RES? Let's trim it down. :) I'll remove the info for enwiki damaging.

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
32176 halfak    20   0  855808 244740  30008 S   0.0  3.1   0:03.08 python

Well, that had practically no effect. Let's remove the info for eswiki reverted.

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
32176 halfak    20   0  812272 201544  30008 S   0.0  2.6   0:03.24 python

Oooh interesting! That caused it to drop a lot. I wonder if that's because Python could unload a bunch of the stuff that's needed to track info at all.
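For reference, the shape of the drop-the-model experiment above, as a hypothetical sketch (it assumes Model is importable from the revscoring top level, matching the snippets in this task):

# Hypothetical sketch: load a model, keep only a reference to its info,
# drop the model, force a collection, then re-check RES for the PID in top.
import gc
from revscoring import Model

m = Model.load(open("submodules/editquality/models/enwiki.damaging.gradient_boosting.model"))
info = m.info   # keep a reference to just the model info
del m           # drop the model object itself
gc.collect()    # then compare RES in `top` before and after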

I just tried the article quality model for enwiki and found a much larger RES:

PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
552 halfak    20   0 1095328 507468  27796 S   0.0  6.4   0:03.27 python

Let's try dropping the model and keeping the info again.

PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
552 halfak    20   0 1088928 501076  27796 S   0.0  6.4   0:03.45 python

Hmm... Again, this has basically zero effect. I think that next time I'll look into compressing that model info and seeing what that does. I can also experiment with explicitly dropping 'thresholds' and seeing how that goes.
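A hypothetical sketch of that thresholds-dropping experiment, assuming label_thresholds is a plain dict-like mapping of label -> threshold statistics:

# Hypothetical: clear the per-label threshold tables on the loaded model's
# info, collect, and see how much RES is returned to the OS.
import gc

thresholds = m.info['statistics'].label_thresholds
for label in list(thresholds.keys()):
    del thresholds[label]
gc.collect()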

Just thought I should check on the draft quality model. Loading just that into memory got the biggest jump in RES so far:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                                
29532 halfak    20   0 1657316 0.995g  26636 S   0.0 13.3   0:13.87 python

Here's the size after importing sys

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                                
29532 halfak    20   0 1657316 0.986g  17392 S   0.0 13.1   0:13.93 python
>>> m = Model.load(bz2.open("ores-wmflabs-deploy/submodules/draftquality/models/enwiki.draft_quality.gradient_boosting.model.bz2"))
>>> m.info['statistics'].label_thresholds.keys()
odict_keys(['OK', 'spam', 'vandalism', 'attack'])
>>> len(m.info['statistics'].label_thresholds['OK'])
152494
>>> import sys
>>> sys.getsizeof(m.info['statistics'].label_thresholds['OK'])
1224128
>>> sys.getsizeof(m.info['statistics'].label_thresholds['OK']) / 1024
1195.4375
>>> sys.getsizeof(m.info['statistics'].label_thresholds['OK']) / 1024*1024
1224128.0
>>> sys.getsizeof(m.info['statistics'].label_thresholds['OK']) / (1024*1024)
1.16741943359375
>>> sys.getsizeof(m.info['statistics'].label_thresholds) / (1024*1024)
5.340576171875e-05
>>> sys.getsizeof(m.info['statistics'].label_thresholds['spam']) / (1024*1024)
1.16741943359375

Looks like each label's thresholds list is about 1.2MB.

>>> stat = m.info['statistics'].label_thresholds['attack'].pop()
>>> (sys.getsizeof(stat) * len(m.info['statistics'].label_thresholds['attack']) ) / (1024*1024)
9.30743408203125

OK, but if I multiply the size of an individual stat by the number of them stored in the list, I get 9.3MB -- a whole order of magnitude more than sys.getsizeof() reported for the list itself. That's way too much.
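That gap is because sys.getsizeof() is shallow: on a list it only counts the list object and its array of pointers, not the objects those pointers reference. A rough deep estimate can be made with a small helper like this (hypothetical, not part of revscoring):

# Hypothetical deep-size helper: follow containers and instance __dict__s,
# counting each object once.
import sys

def deep_sizeof(obj, seen=None):
    seen = set() if seen is None else seen
    if id(obj) in seen:
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(deep_sizeof(k, seen) + deep_sizeof(v, seen)
                    for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(deep_sizeof(item, seen) for item in obj)
    if hasattr(obj, '__dict__'):
        size += deep_sizeof(obj.__dict__, seen)
    return size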

I think I'm going to try implementing __slots__ in ScaledPredictionStatistics.
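The idea, sketched on a toy class rather than the real ScaledPredictionStatistics: __slots__ replaces the per-instance __dict__ with fixed attribute storage.

# Toy illustration of __slots__ (not the actual ScaledPredictionStatistics).
import sys

class WithDict:
    def __init__(self, counts):
        self.tp, self.fp, self.tn, self.fn = counts

class WithSlots:
    __slots__ = ('tp', 'fp', 'tn', 'fn')

    def __init__(self, counts):
        self.tp, self.fp, self.tn, self.fn = counts

a = WithDict((10, 11, 12, 13))
b = WithSlots((10, 11, 12, 13))
print(sys.getsizeof(a) + sys.getsizeof(a.__dict__))  # instance + its attribute dict
print(sys.getsizeof(b))                              # smaller: no per-instance __dict__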

With slots, I get:

In [10]: sps = ScaledPredictionStatistics(counts=(10,11,12,13))
In [10] used 0.0039 MiB RAM in 0.10s, peaked 0.00 MiB above current, total RAM usage 76.37 MiB

Without slots, I get exactly the same per-instance cost:

In [2]: sps = ScaledPredictionStatistics(counts=(10,11,12,13))
In [2] used 0.0039 MiB RAM in 0.10s, peaked 0.00 MiB above current, total RAM usage 78.15 MiB

Damn.

Just playing around, I dumped the thresholds table to json:

>>> import json
>>> m = Model.load(open("models/enwiki.damaging.gradient_boosting.model", "r"))
>>> o = json.dumps(m.info['statistics'].label_thresholds.format(formatting='json'))
>>> f = open("out", "w")
>>> f.write(o)
3598434
>>> f.close()

This means that we could have the thresholds in memory as JSON at only 3.5MB. What about doing that, and hydrating into ModelInfo once per day when the threshold cache is recalculated?
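A hypothetical sketch of that approach (the names here are illustrative, not an existing ORES API): hold the thresholds as a compact JSON string and only parse them back into Python objects when they're actually needed.

import json

# Hypothetical: keep the ~3.5MB JSON string resident instead of the full
# object graph, and hydrate it on demand (e.g. when the daily threshold
# cache is recalculated).
thresholds_json = json.dumps(
    m.info['statistics'].label_thresholds.format(formatting='json'))

def get_thresholds():
    return json.loads(thresholds_json)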

When formatting to JSON, the thresholds are rounded and limited. In this case, the default is 4 decimal places. You can adjust this with the thresholds_ndigits parameter.

In my tests, not rounding at all gets us 129MB, rounding to 4 digits gets us 3.5MB, and rounding to 3 digits gets us down to 470K. I'll submit a PR.
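For illustration, assuming thresholds_ndigits is passed through the same format() call used above (that exact plumbing is an assumption on my part):

# Hypothetical usage: round thresholds to 3 decimal places when formatting,
# which in the tests above brought the JSON down from ~3.5MB to ~470K.
compact = m.info['statistics'].label_thresholds.format(
    formatting='json', thresholds_ndigits=3)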

OK, just released revscoring 2.0.8. Now I'm going to rebuild all of the models -- starting with the big set of editquality wikis.

Halfak claimed this task.