The wsgi processes on ores-staging are at 3.1GB of RES, compared to the former 750MB.
Description
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | Halfak | T177544 Revscoring 2.0 takes up too much memory
Resolved | | Halfak | T177636 Reduce label_thresholds granularity
Event Timeline
```
>>> m = Model.load(open("submodules/editquality/models/enwiki.damaging.gradient_boosting.model"))
>>> len(m.info['statistics'].label_thresholds[True])
19165
```
OK so it looks like we're keeping data on ~20000 unique thresholds for every label.
Basically we have a unique threshold for every single input observation in testing.
That is far too many. How do we trim that but preserve information?
Maybe we could round. Rounding to 4 digits should be pretty good.
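To get a feel for how much rounding would collapse the threshold set, here's a quick sketch with synthetic scores (the real model's 19165 thresholds came from the REPL session above; these are just random stand-ins):

```python
import random

# Simulate one raw threshold per test observation, as in the enwiki
# damaging model (~19k unique float scores in [0, 1]).
random.seed(0)
raw_thresholds = sorted({random.random() for _ in range(19165)})

# Rounding to 4 decimal places merges near-duplicate thresholds:
# at most 10001 distinct values can survive in [0, 1].
rounded = sorted({round(t, 4) for t in raw_thresholds})

print(len(raw_thresholds))  # ~19165
print(len(rounded))         # bounded above by 10001
```

The information loss is tiny: any threshold you'd actually query lands within 0.0001 of a retained one.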
With one model loaded (enwiki damaging):
```
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
32176 halfak    20   0  802432 191448  30008 S   0.0  2.4   0:02.15 python
```
With two models loaded (enwiki damaging, eswiki reverted):
```
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
32176 halfak    20   0  856580 245140  29948 S   0.0  3.1   0:02.62 python
```
So that adds about 54MB of RES (191448 KB to 245140 KB). What if I keep a reference to the model info and then remove the references to the model itself?
```
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
32176 halfak    20   0  856576 245508  30008 S   0.0  3.1   0:02.85 python
```
Well... somehow that took slightly more RES, even after gc.collect(). I wonder if there's a reference to the model itself somewhere in model.info. Hmm. Could it be that the "info" itself accounts for most of the RES? Let's trim it down. :) I'll remove the info for enwiki damaging.
```
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
32176 halfak    20   0  855808 244740  30008 S   0.0  3.1   0:03.08 python
```
Well, that had practically no effect. Let's remove the info for eswiki reverted.
```
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
32176 halfak    20   0  812272 201544  30008 S   0.0  2.6   0:03.24 python
```
Oooh, interesting! That caused it to drop a lot. I wonder if that's because Python could unload a bunch of the data structures needed to track the info at all.
I just tried the article quality model for enwiki and found a much larger RES:
```
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
  552 halfak    20   0 1095328 507468  27796 S   0.0  6.4   0:03.27 python
```
Let's try dropping the model and keeping the info again.
```
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
  552 halfak    20   0 1088928 501076  27796 S   0.0  6.4   0:03.45 python
```
Hmm... Again this is basically zero effect. I think that next time I'll be looking into compressing that model info and seeing what that does. I can experiment with explicitly dropping 'thresholds' and seeing how that goes.
Just thought I should check on the draft quality model. Loading just that into memory gave the biggest jump in RES so far:
```
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
29532 halfak    20   0 1657316 0.995g  26636 S   0.0 13.3   0:13.87 python
```
Here's the size after importing sys:
```
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
29532 halfak    20   0 1657316 0.986g  17392 S   0.0 13.1   0:13.93 python
```
```
>>> m = Model.load(bz2.open("ores-wmflabs-deploy/submodules/draftquality/models/enwiki.draft_quality.gradient_boosting.model.bz2"))
>>> m.info['statistics'].label_thresholds.keys()
odict_keys(['OK', 'spam', 'vandalism', 'attack'])
>>> len(m.info['statistics'].label_thresholds['OK'])
152494
>>> import sys
>>> sys.getsizeof(m.info['statistics'].label_thresholds['OK'])
1224128
>>> sys.getsizeof(m.info['statistics'].label_thresholds['OK']) / 1024
1195.4375
>>> sys.getsizeof(m.info['statistics'].label_thresholds['OK']) / 1024*1024
1224128.0
>>> sys.getsizeof(m.info['statistics'].label_thresholds['OK']) / (1024*1024)
1.16741943359375
>>> sys.getsizeof(m.info['statistics'].label_thresholds) / (1024*1024)
5.340576171875e-05
>>> sys.getsizeof(m.info['statistics'].label_thresholds['spam']) / (1024*1024)
1.16741943359375
```
Looks like each label_thresholds list is about 1.2MB -- though sys.getsizeof only measures the list object itself, not the stat objects it references.
```
>>> stat = m.info['statistics'].label_thresholds['attack'].pop()
>>> (sys.getsizeof(stat) * len(m.info['statistics'].label_thresholds['attack'])) / (1024*1024)
9.30743408203125
```
OK, but if I multiply the size of an individual stat by the number of them stored in the list, I get 9.3MB per label -- nearly an order of magnitude more! That's way too much.
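The gap comes from sys.getsizeof being shallow: the list reports only its own pointer array, while the stat objects live elsewhere on the heap. A rough deep-size helper (a common recipe, not part of revscoring) makes this visible:

```python
import sys

def deep_getsizeof(obj, seen=None):
    """Recursively sum sys.getsizeof over an object and everything it references."""
    if seen is None:
        seen = set()
    if id(obj) in seen:
        return 0  # don't double-count shared objects
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(deep_getsizeof(k, seen) + deep_getsizeof(v, seen)
                    for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(deep_getsizeof(item, seen) for item in obj)
    elif hasattr(obj, '__dict__'):
        size += deep_getsizeof(obj.__dict__, seen)
    return size

# Toy stand-in for a label_thresholds list of per-threshold stats:
stats = [{'tp': 1, 'fp': 2, 'tn': 3, 'fn': 4} for _ in range(1000)]
print(sys.getsizeof(stats))   # just the list's pointer array
print(deep_getsizeof(stats))  # includes the dicts inside
```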
I think I'm going to try implementing __slots__ in ScaledPredictionStatistics.
With slots, I get:
```
In [10]: sps = ScaledPredictionStatistics(counts=(10,11,12,13))
In [10] used 0.0039 MiB RAM in 0.10s, peaked 0.00 MiB above current, total RAM usage 76.37 MiB
```
Without slots, I get the exact same thing:
```
In [2]: sps = ScaledPredictionStatistics(counts=(10,11,12,13))
In [2] used 0.0039 MiB RAM in 0.10s, peaked 0.00 MiB above current, total RAM usage 78.15 MiB
```
Damn.
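For what it's worth, __slots__ does shrink instances when measured directly; the per-statement RAM deltas above are probably too coarse to show a few dozen bytes per object. A toy comparison (not the real ScaledPredictionStatistics):

```python
import sys

class WithDict:
    def __init__(self, tp, fp, tn, fn):
        self.tp, self.fp, self.tn, self.fn = tp, fp, tn, fn

class WithSlots:
    __slots__ = ('tp', 'fp', 'tn', 'fn')
    def __init__(self, tp, fp, tn, fn):
        self.tp, self.fp, self.tn, self.fn = tp, fp, tn, fn

d = WithDict(10, 11, 12, 13)
s = WithSlots(10, 11, 12, 13)

# A regular instance pays for a per-instance __dict__; a slotted one
# stores its attributes in fixed offsets and has no __dict__ at all.
regular_total = sys.getsizeof(d) + sys.getsizeof(d.__dict__)
slotted_total = sys.getsizeof(s)
print(regular_total, slotted_total)
```

The savings only add up across the hundreds of thousands of instances, which a single-statement memory probe won't see.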
Just playing around, I dumped the thresholds table to json:
```
>>> m = Model.load(open("models/enwiki.damaging.gradient_boosting.model", "r"))
>>> o = json.dumps(m.info['statistics'].label_thresholds.format(formatting='json'))
>>> f = open("out", "w")
>>> f.write(o)
3598434
>>> f.close()
```
This means that we could hold the thresholds in memory as JSON at only ~3.5MB. What about doing that, and hydrating them into ModelInfo once per day when the threshold cache is recalculated?
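A minimal sketch of that idea: keep the serialized string resident and only hydrate objects on demand. (LazyThresholds is a hypothetical stand-in, not the real ModelInfo machinery.)

```python
import json

class LazyThresholds:
    """Hold thresholds as a compact JSON string; hydrate only when asked."""

    def __init__(self, thresholds):
        # Store the serialized form (a few MB) instead of hundreds of
        # thousands of tiny Python objects.
        self._json = json.dumps(thresholds)

    def hydrate(self):
        # Rebuild the nested dicts/lists only when a caller needs them;
        # the result can be discarded (and garbage-collected) afterwards.
        return json.loads(self._json)

# Hypothetical shape: label -> list of per-threshold stat dicts.
thresholds = {'OK': [{'threshold': 0.1234, 'precision': 0.9, 'recall': 0.8}]}
lazy = LazyThresholds(thresholds)
assert lazy.hydrate() == thresholds
```

The daily recalculation would then just rebuild the string, keeping steady-state RES flat.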
When formatting as JSON, the thresholds are rounded and limited. In this case, the default is 4 decimal places. You can adjust this with the thresholds_ndigits parameter.
In my tests, not rounding at all got us 129MB, rounding to 4 digits gets us 3.5MB, and rounding to 3 digits gets us down to 470K. I'll submit a PR.
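The size falls off so fast because rounding both shortens each serialized float and merges near-duplicate thresholds. A quick illustration with synthetic scores (the exact numbers will differ from the real model's):

```python
import json
import random

random.seed(0)
scores = [random.random() for _ in range(10000)]

def serialized_size(values, ndigits=None):
    """Bytes of JSON needed for the (optionally rounded, deduplicated) values."""
    if ndigits is not None:
        values = sorted({round(v, ndigits) for v in values})
    return len(json.dumps(values))

print(serialized_size(scores))             # full-precision floats, ~17 digits each
print(serialized_size(scores, ndigits=4))  # shorter reprs, some values merged
print(serialized_size(scores, ndigits=3))  # smaller still
```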
OK, just released revscoring 2.0.8. Now I'm going to rebuild all of the models -- starting with the big set of editquality wikis.