
Reduce memory footprint of topic models
Closed, Resolved · Public

Description

The memory footprint of our topic models is too big to safely deploy.

Options:

  • Try to use gensim's memory-map mode and see if that helps
  • Reduce the dimensionality of the vectors to 50 cells
  • Reduce the vocab to 150k or 100k

It's important that we confirm fitness can be maintained as we experiment, so the first option -- using memory-mapped files -- is probably the best thing to try first.

Event Timeline

Halfak created this task. Jan 23 2020, 4:03 PM

I'm investigating memory usage. I'm working from a python terminal on my dev laptop. Essentially, I'm tracking VSZ and RSS while running commands.
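For reproducibility, the VSZ/RSS numbers below were gathered with ps, but the same values can be read in-process on Linux from /proc. A minimal sketch (Linux-only; not part of the task's tooling):

```python
def memory_usage_kb(pid="self"):
    """Read VmSize (VSZ) and VmRSS (RSS) in kB from /proc/<pid>/status (Linux)."""
    usage = {}
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith(("VmSize:", "VmRSS:")):
                key, value = line.split(":")
                usage[key] = int(value.split()[0])  # value looks like "  35600 kB"
    return usage

print(memory_usage_kb())  # e.g. {'VmSize': 35600, 'VmRSS': 9340}
```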

Before loading anything:

  • VSZ: 35600
  • RSS: 9340

After from revscoring import Model:

  • VSZ: 495752
  • RSS: 76216

After enwiki = Model.load(open("models/enwiki.articletopic.gradient_boosting.model"))

  • VSZ: 1010852
  • RSS: 567348

After arwiki = Model.load(open("models/arwiki.articletopic.gradient_boosting.model"))

  • VSZ: 1385732
  • RSS: 941856

After enwiki2 = Model.load(open("models/enwiki.articletopic.gradient_boosting.model"))

  • VSZ: 1464596
  • RSS: 1020768

This is higher memory usage than I think we are really prepared for. After loading all of the models, it ends up being about 3x as much memory as we needed before. As the final load shows, memory for a duplicate model does get shared fairly well, but the total is still too much.

I wonder if we can use gensim's memory-map mode to get around this. Alternatively, we can reduce the dimensions of our vectors or reduce the size of the vocabulary.

First, I'm trying out memory-maps. I converted our word2vec-format vectors into gensim's KeyedVectors ("kv") format with:

>>> from gensim.models import KeyedVectors
>>> model = KeyedVectors.load_word2vec_format("enwiki-20191201-learned_vectors.100_cell.300k.vec")
>>> model.save("enwiki-20191201-learned_vectors.100_cell.300k.kv")

Then I experimented with trying to load the raw word2vec format and the "kv" format.

Before loading anything:
VSZ: 33740
RSS: 8676

After model = KeyedVectors.load_word2vec_format("enwiki-20191201-learned_vectors.100_cell.300k.vec"):
VSZ: 2903948
RSS: 358768

After restarting and model = KeyedVectors.load("enwiki-20191201-learned_vectors.100_cell.300k.kv", mmap='r'):
VSZ: 2927572
RSS: 261680

The effect seems to be pretty minimal based on what ps reports for VSZ and RSS, but the model load() method runs almost instantly. So I'm guessing there's something funny going on with how RSS is reported -- memory-mapped pages only count toward RSS once they are actually touched, so RSS right after load() understates what scoring will eventually use.
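That would match how mmap behaves in general: mapping a file grows VSZ immediately, but RSS only grows as pages are faulted in by actual reads. A stdlib sketch with a throwaway temp file (not the real vector files):

```python
import mmap
import os
import tempfile

# Create a sparse 64 MB file and map it read-only. The mapping itself is
# nearly free; RSS grows only as individual pages are read.
fd, path = tempfile.mkstemp()
os.truncate(fd, 64 * 1024 * 1024)

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    assert len(mm) == 64 * 1024 * 1024  # full size visible in VSZ
    assert mm[0] == 0                   # reading faults in just this one page
    mm.close()

os.close(fd)
os.remove(path)
```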

Loading the regular vectors:

>>> import time
>>> from gensim.models import KeyedVectors
>>> start = time.time(); model = KeyedVectors.load_word2vec_format("enwiki-20191201-learned_vectors.100_cell.300k.vec"); print(time.time() - start, "seconds")
/srv/home/halfak/venv/3.5/lib/python3.5/site-packages/smart_open/smart_open_lib.py:402: UserWarning: This function is deprecated, use smart_open.open instead. See the migration notes for details: https://github.com/RaRe-Technologies/smart_open/blob/master/README.rst#migrating-to-the-new-open-function
  'See the migration notes for details: %s' % _MIGRATION_NOTES_URL
26.84015417098999 seconds
>>> import time
>>> from gensim.models import KeyedVectors
>>> start = time.time(); model = KeyedVectors.load("enwiki-20191201-learned_vectors.100_cell.300k.kv", mmap='r'); print(time.time() - start, "seconds")
/srv/home/halfak/venv/3.5/lib/python3.5/site-packages/smart_open/smart_open_lib.py:402: UserWarning: This function is deprecated, use smart_open.open instead. See the migration notes for details: https://github.com/RaRe-Technologies/smart_open/blob/master/README.rst#migrating-to-the-new-open-function
  'See the migration notes for details: %s' % _MIGRATION_NOTES_URL
1.036184310913086 seconds

I wonder what happens if I don't tell gensim to load it as a mmap.

>>> import time
>>> from gensim.models import KeyedVectors
>>> start = time.time(); model = KeyedVectors.load("enwiki-20191201-learned_vectors.100_cell.300k.kv"); print(time.time() - start, "seconds")
/srv/home/halfak/venv/3.5/lib/python3.5/site-packages/smart_open/smart_open_lib.py:402: UserWarning: This function is deprecated, use smart_open.open instead. See the migration notes for details: https://github.com/RaRe-Technologies/smart_open/blob/master/README.rst#migrating-to-the-new-open-function
  'See the migration notes for details: %s' % _MIGRATION_NOTES_URL
1.0818400382995605 seconds

Interesting! Is it even using a mmap or is that argument being ignored somehow?

Aha! It looks like memory usage is greater when we do not use the mmap='r' option. Here's what I see after I run model = KeyedVectors.load("enwiki-20191201-learned_vectors.100_cell.300k.kv").

VSZ: 2925512
RSS: 378676

OK, so I've now generated learned vectors with 50 cells and a 100k vocab. I just trained the enwiki articletopic model on them.

  • The old gnews model: pr_auc (micro=0.718, macro=0.555)
  • The 100d, 300k vocab model: pr_auc (micro=0.801, macro=0.676)
  • The 50d, 100k vocab model: pr_auc (micro=0.789, macro=0.646)
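For reference on how micro vs. macro pr_auc differ: micro pools every (label, example) decision into one ranking before computing average precision, while macro averages per-label scores. A self-contained sketch of that aggregation (not revscoring's actual implementation):

```python
def average_precision(y_true, scores):
    """Step-wise area under the precision-recall curve:
    the mean of precision@k taken at each true positive's rank."""
    ranked = sorted(zip(scores, y_true), reverse=True)
    hits, total_prec = 0, 0.0
    for k, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            total_prec += hits / k
    return total_prec / max(sum(y_true), 1)

def pr_auc(y_true_per_label, scores_per_label):
    """Return (micro, macro) average precision over a set of labels."""
    # macro: mean of the per-label average precisions
    macro = sum(average_precision(t, s)
                for t, s in zip(y_true_per_label, scores_per_label))
    macro /= len(y_true_per_label)
    # micro: pool all (label, example) pairs into one global ranking
    flat_true = [y for t in y_true_per_label for y in t]
    flat_score = [x for s in scores_per_label for x in s]
    micro = average_precision(flat_true, flat_score)
    return micro, macro
```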

Generally, this isn't much of a loss. I think we should move forward with this.

Halfak claimed this task. Jan 24 2020, 6:54 PM
Halfak moved this task from Active to Review on the Scoring-platform-team (Current) board.

All models updated. Looks good: https://github.com/wikimedia/drafttopic/pull/47

Change 567117 had a related patch set uploaded (by Halfak; owner: Halfak):
[scoring/ores/assets@master] Switch 100 cell, 300k vocab for 50 cell, 100k vocab.

https://gerrit.wikimedia.org/r/567117

Change 567117 merged by Accraze:
[scoring/ores/assets@master] Switch 100 cell, 300k vocab for 50 cell, 100k vocab.

https://gerrit.wikimedia.org/r/567117

Halfak closed this task as Resolved. Feb 5 2020, 4:27 PM