
Compress Gensim models
Closed, Resolved · Public

Description

See https://gist.github.com/generall/68fddb87ae1845d6f54c958ed3d0addb
and https://medium.com/@vasnetsov93/shrinking-fasttext-embeddings-so-that-it-fits-google-colab-cd59ab75959e

Essentially, we should be able to get our vector embeddings into a smaller RAM footprint. Let's experiment with this to see if it can help us.

Event Timeline

Halfak triaged this task as Low priority. Apr 6 2020, 5:04 PM

According to the Medium article (though this issue comes up often on forums):

Main RAM issues of fasttext are:

  1. "binary model carries not only weights for words and n-grams, but also weights for translating vectors back to words", i.e. negative vectors
  1. "vocab and n-gram matrices are very large"
    • the author solves it by : (i) shrinking vocab to N most common words and (ii) shrinking/remapping Ngram matrix (this matrix uses hashes not words) to take less space -> higher possibility of collisions
    • I think it's interesting, but it sounds like a complex way to achieve results that could be done by fasttext parameters
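A rough sketch of that trade-off (illustrative only, assuming uniform hashing; this is not the gist's actual code): the n-gram matrix has one embedding row per bucket, so fewer buckets means a smaller matrix but more n-grams sharing a row.

# Rough sketch (illustrative only): fewer buckets -> smaller n-gram matrix,
# but more n-grams hashed onto the same row.

def expected_collision_fraction(n_ngrams, n_buckets):
    # Under uniform hashing, an n-gram avoids collisions only if every other
    # n-gram lands in a different bucket.
    return 1 - (1 - 1 / n_buckets) ** (n_ngrams - 1)

for buckets in (2_000_000, 500_000, 100_000):
    frac = expected_collision_fraction(n_ngrams=1_000_000, n_buckets=buckets)
    print("buckets=%d -> matrix rows=%d, ~%.0f%% of n-grams collide" % (buckets, buckets, frac * 100))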

Fasttext parameters that could help with memory usage:

  1. parameters for fasttext that are already set in the Makefile
    • vector_dimensions=50
    • vector_params=--param 'dim=$(vector_dimensions)' --param 'loss="ova"'
  2. ngram size
  3. bucket size and min word count (see the sketch after this list)
    • https://medium.com/@adityamohanty/understanding-fasttext-an-embedding-to-look-forward-to-3ee9aa08787
    • "The bucket-size represents the total size of arrays available for all n-grams." -> if you decrease the default bucket size, more n-grams get hashed into the same bucket -> the model gets smaller
    • "min word count, which can be increased to ignore words below a certain threshold." -> the "-minCount" parameter (default [1]): increasing it should make the model smaller

Other options to consider:

  • fasttext options - https://fasttext.cc/docs/en/options.html
  • "-epoch"[5] - increasing the number could increase precision that was lost because of model shrinking
  • "-lr"[0.1] - increase/decrease (same as epoch)

Evaluation:

  • test a range of values for (i) bucket size, (ii) wordNgrams, (iii) min word count, in combination with (iv) epoch and (v) learning rate.
  • evaluate the results either by (i) comparison to the initial model (cosine similarity - code snippet in the Medium article) or (ii) precision (FB research GitHub tutorial).
  • use (i) psutil (https://psutil.readthedocs.io/en/latest/), or another library, to record memory usage, or (ii) record model sizes (see the sketch after this list).
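A sketch of the measurement side, assuming we compare a shrunk vector file against the original (paths and probe words are placeholders):

import os
import numpy as np
import psutil
from gensim.models import KeyedVectors

def rss_mb():
    # Resident set size of the current process, in MB.
    return psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

before = rss_mb()
small = KeyedVectors.load_word2vec_format("shrunk_vectors.txt")  # placeholder path
print("shrunk vectors add roughly %.1f MB of RAM" % (rss_mb() - before))

# Fidelity check: how far did each surviving word's vector move from the original?
full = KeyedVectors.load_word2vec_format("full_vectors.txt")     # placeholder path
for word in ("article", "music", "france"):                      # placeholder probe words
    if word in small and word in full:
        print(word, round(cosine(small[word], full[word]), 3))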

I'm not sure if we can use bigrams with our current pipeline. We could extend the pipeline to produce bigrams, though I'm not sure how that makes the model smaller. Maybe we can get better fitness from a smaller vocab with bigrams?

We've already limited the vector size to 50 and the vocab size to 100k. I imagine that this would cut out all of the words that would fall under any reasonable -minCount. This is one of the reasons I think compression (hashing) is a good idea. I can't see any other way to get this vector set to be smaller.

We have a pipeline in https://github.com/wikimedia/drafttopic that uses the limited embeddings to train and test models. I can help by running that pipeline to look for losses/gains in fitness resulting from changes to the embeddings.

I went through the documentation for gensim and fasttext (very limited info) and a lot of other pages. :) To proceed further: what is the goal of python-mwtext?

If I understand it correctly:

  • python-mwtext is doing:
    1. the supervised fasttext model is trained with a labeled dataset
    2. the word embeddings are extracted from the model and saved for further use by gensim KeyedVectors.load_word2vec_format()
  • gensim:
    1. supports only unsupervised models, i.e. it does not support labeled embeddings, so in python-mwtext you had to extract them manually from the fasttext model:
output.write("{0} {1}\n".format(words, dimensions))
for word in model.get_words():
        vector = model.get_word_vector(word)
        output.write(" ".join([word] + [str(v) for v in vector]))
        output.write("\n")

Additional comments:

  • you are right, bigrams do not make the model smaller (my mistake); they could only improve the results
  • fasttext uses word AND character n-gram embeddings; both are important, and the character embeddings are mostly useful for unknown words
    • [if I understand it correctly] the hashing matrices are included in the model
  • there is a "min/max length of char ngram" parameter; I couldn't figure out yet how char embeddings are stored in fasttext and whether this option has any effect on model size, i.e. whether characters are stored individually for each word, or in n-grams, or...?
  • the best article about fasttext internals I could find (even though it's outdated):

Back to the Medium article from the initial post:

  • there is no note about supervised/unsupervised learning. It seems the author's goal is to take only the word embeddings produced by an unsupervised fasttext model with default parameters and redo the hashing matrix. He seems to lose the ability to embed out-of-vocabulary words, as I can't see anything about char embeddings.

Summary:

  • what is the purpose of the word embeddings extracted from the supervised model?
  • I will create a Jupyter notebook for evaluating the fasttext parameters, plotting precision and model size as a function of the parameters; I will use the dbpedia dataset for testing: https://github.com/facebookresearch/fastText/blob/master/classification-example.sh (see the sketch after this list)
  • Evaluation:
    1. adjust the model size by changing the [bucket size, min word count, dim] parameters, with a max dictionary size set as before
    2. evaluate model performance while changing the [epoch, learning rate] parameters on the train/test dataset
    3. MOST IMPORTANT: there is a quantization option that compresses the model; this might solve the model memory issue
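A rough outline of what that notebook could do (dataset paths and the parameter grid are placeholders; model.test() reports the number of examples plus precision and recall at k=1):

import os
import fasttext

TRAIN, TEST = "dbpedia.train", "dbpedia.test"  # placeholder dataset paths

results = []
for dim in (10, 25, 50):                # placeholder grid
    for min_count in (1, 5, 20):
        model = fasttext.train_supervised(TRAIN, dim=dim, minCount=min_count, loss="ova")
        n, precision, recall = model.test(TEST)
        model.save_model("tmp.bin")
        size_mb = os.path.getsize("tmp.bin") / 1024 ** 2
        results.append((dim, min_count, round(precision, 3), round(size_mb, 1)))

# precision vs. model size as a function of the parameters
for row in sorted(results):
    print(row)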

[again, limited documentation, I did my best to summarize how I understand it]

What is the purpose of getting word embeddings from the supervised model? These embeddings are specifically tailored for fasttext classification. They can be used to train a new model. Is this the purpose?

Right. We let the fasttext model adjust the embedding by training a classifier and then we re-use the adjusted embeddings for another classifier that will allow us to add more features -- e.g. pronoun counts. We have found that this allows us to increase fitness of the eventual model.

fasttext uses word AND character n-gram embeddings; both are important, and the character embeddings are mostly useful for unknown words

I think we drop the sub-word embeddings as part of our process of extracting the embeddings from the trained fasttext model. This is probably good because we do want to keep the model small, and misspellings/rare words are less useful to us.

there is "min/max length of char ngram" parameter; I couldn't figure out yet how char embeddings are stored in fasttext and if this option has any effect on its size, i.e. if the characters are stored individually for each word or in ngrams or...?

The file format we work from is very simple. It is basically "<word> <vector_val1> <vector_val2>...", so presumably it couldn't differentiate between a word and a character when re-reading this file. E.g. the word "a" would look identical to the character "a".
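For concreteness, a tiny sketch of that format and how gensim reads it back (the contents below are made up):

# vectors.txt (hypothetical contents):
#   3 4
#   the 0.12 -0.05 0.33 0.01
#   a 0.07 0.22 -0.18 0.40
#   wiki -0.31 0.09 0.14 -0.02
# The first line is the "<vocab_size> <dimensions>" header; after that gensim
# only sees tokens and vectors, so a sub-word entry would be indistinguishable
# from a one-letter word.
from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format("vectors.txt")  # placeholder path
print(kv["a"])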

there is no note about supervised/unsupervised learning. It seems the author's goal is to take only the word embeddings produced by an unsupervised fasttext model with default parameters and redo the hashing matrix. He seems to lose the ability to embed out-of-vocabulary words, as I can't see anything about char embeddings.

Right. I think this works for us. When we encounter a word that doesn't appear in the most common 100k, we ignore it when vectorizing an article.

Ultimately, we want some embeddings that look like the ones described on this line: https://github.com/mediawiki-utilities/python-mwtext/blob/master/Makefile#L26 But we need them to be smaller without a substantial loss in our ability to build topic prediction models using them. I'm not sure if making the full embedding produced by fasttext smaller will help with that. Am I missing something?

MOST IMPORTANT: there is a quantization option that compresses the model; this might solve the model memory issue https://fasttext.cc/blog/2017/10/02/blog-post.html#model-compression

Woah. This looks really interesting -- especially the feature selection. I wonder if we could work with the 10k most *important* words and reduce our vector size by an order of magnitude! I'm not sure I understand weight quantization. It looks like it might not reduce the size of the model in memory -- just the size of the serialization of the model.

Short update

  • initial DraftTopic testing issues [resolved]: (i) module version issues, (ii) default MiniConda Python 3.8 issue - 3.8 has some issues with GenSim and other modules used in DraftTopic, (iii) calculations are memory/CPU intensive (I created a virtual env on a machine I don't use), (iv) the sample dataset was too big and downloading crashed multiple times - Aaron provided a smaller version
  • I created word2vec GenSim models according to python-mwtext: https://github.com/mediawiki-utilities/python-mwtext/blob/master/mwtext/utilities/word2vec2gensim.py
  • I created models with dictionary sizes limited to [10k,25k,50k,75k,100k] from the provided preprocessed text as a baseline, using:
model = KeyedVectors.load_word2vec_format(input_path, limit=limit)
  • I created quantized models with cutoff dictionaries

(again an issue with Python 3.8, so I ended up using FastText command line/terminal commands)

model = fasttext.train_supervised(input_path, **params)
# one quantized model per cutoff value in [10k, 25k, 50k, 75k, 100k]
model.quantize(input=train_data, retrain=True, cutoff=cutoff)

https://github.com/mediawiki-utilities/python-mwtext/blob/master/mwtext/utilities/learn_vectors.py

  • as I had so many issues with Python 3.8, I switched to 3.7.2 -> fewer issues
  • I added commands to the Makefile to create [10k,25k,50k,75k,100k] versions of enwiki.balanced_article_sample.w_article_text.json and enwiki.balanced_article_sample.w_draft_text.json
  • I created cached versions of sampled datasets
  • I trained the models with the new balanced cached datasets - I hadn't noticed that the model training command calls "drafttopic.feature_lists.enwiki.drafttopic", which has a hardcoded word2vec GenSim model/dictionary (.kv) - "enwiki-20191201-learned_vectors.50_cell.100k.kv"
    • I followed the KISS principle ("keep it simple, stupid") and created one "drafttopic.feature_lists.enwiki" file per word2vec dictionary, e.g.:
drafttopic.feature_lists.enwiki_gs100k (referring to the gensim model limited to a 100k dict)
drafttopic.feature_lists.enwiki_qt100k (referring to the quantized model with cutoff=100k)
drafttopic.feature_lists.enwiki_gs75k 
...

IMPORTANT note:
I believe there is an inconsistency in the DraftTopic Makefile:

  • you download "enwiki-20191201-learned_vectors.50_cell.100k.kv", but in "datasets/enwiki.balanced_article_sample.w_draft_cache.json" you use 100_cell.300k (I guess the previous version), and in model training "models/enwiki.drafttopic.gradient_boosting.model" you use "drafttopic.feature_lists.enwiki.drafttopic", which has the hardcoded "enwiki-20191201-learned_vectors.50_cell.100k.kv"

https://github.com/wikimedia/drafttopic/blob/master/Makefile

Current status
Models are being trained; hopefully the calculations will finish in <24h and I can finally compare "FastText quantization cutoff" to "GenSim limit".

Nice catch of that 100_cell.300k item in the Makefile. I'm surprised we were still able to build the models with that there.

Halfak renamed this task from Compress Gensim models with term hashing to Compress Gensim models. May 27 2020, 4:19 PM

@aaron, I need help with understanding the results and whether what I did to test vocabulary sizes makes sense.
(in the following I use brackets to list file name options)
(i) Does what I did make sense?

  • As I mentioned in the previous update, I extracted word2vec vocabs with sizes [10k,25k,50k,75k,100k], first using the gensim limit parameter and then using the fasttext quantize function => I created 10 word2vec datasets named w2v_[gensim,quantized][10k,25k,50k,75k,100k].kv
  • I created a new file for each word2vec dataset in drafttopic\drafttopic\feature_lists: enwiki_[gs,qt][10k,25k,50k,75k,100k].py
  • I added new commands to drafttopic\Makefile
  • I ran the commands related to enwiki except "tuning" - I killed that process when I saw it uses grid search and would try 16 parameter combinations :)

(ii) Results explanation

  • Statistics - UNDERSTAND
  • rates/match_rate/filter_rate - DON'T UNDERSTAND (fractions of the dataset?)
  • recall/precision/accuracy/f1/fpr/roc_auc/pr_auc - UNDERSTAND
  • !recall/!precision/!f1 - DON'T UNDERSTAND (what does the "!" mean?)

(iii) Summary
Differences between the classification stats of the 10k and 100k versions are small, but the quantized 10k word2vec dataset performs a bit better than the gensim-limited 100k dataset, e.g.:

|                            | f1 micro | f1 macro |
| enwiki.articletopic_gs100k | 0.681    | 0.512    |
| enwiki.articletopic_qt100k | 0.69     | 0.533    |
| enwiki.articletopic_gs10k  | 0.672    | 0.504    |
| enwiki.articletopic_qt10k  | 0.685    | 0.53     |
| enwiki.drafttopic_gs100k   | None     | None     |
| enwiki.drafttopic_qt100k   | 0.645    | 0.485    |
| enwiki.drafttopic_gs10k    | 0.627    | 0.454    |
| enwiki.drafttopic_qt10k    | 0.641    | 0.478    |

The None values are due to a missing 0 in "Geography.Regions.Africa.Central Africa" in the report - I think.
(iv) You may find the results in the attachment
I included :

  • drafttopic\Makefile
  • reports from drafttopic\model_info
  • files from drafttopic\drafttopic\feature_lists (I changed the enwiki_kvs in each file, nothing else)

gs = gensim
qt = quantized

ADDITIONAL NOTES

  • in the Makefile, enwiki section, tuning reports, there is:
	   	--labels-config=labels-config.yaml \

it should be .json like in every other wiki (I guess just a typo..):

	   	--labels-config=labels-config.json \
  • just a thought, for the future it might be better to have the .kv file as an argument for the .py files in drafttopic\feature_lists\ ; it took me a while to figure this out when I deleted the default enwiki-20191201-learned_vectors.50_cell.100k.kv and the script suddenly stopped working :D

Please give me feedback on what you think about it...

Sorry @aaron. I think this ping was meant for me. @Pavol86, thanks for your work! I'll review this and get back to you tomorrow.

Results of dimensionality reduction on the test (sampled) dataset:

| model name          | quantized | vocabulary cutoff | dimensions | retrain | qnorm | f1 micro | f1 macro |
| enwiki.articletopic | True      | 10k               | 10         | False   | False | 0.639    | 0.45     |
| enwiki.articletopic | True      | 10k               | 10         | True    | False | 0.649    | 0.468    |
| enwiki.articletopic | True      | 10k               | 10         | False   | True  | 0.639    | 0.451    |
| enwiki.articletopic | True      | 10k               | 10         | True    | True  | 0.65     | 0.468    |
| enwiki.articletopic | True      | 10k               | 25         | False   | False | 0.693    | 0.54     |
| enwiki.articletopic | True      | 10k               | 25         | True    | False | 0.698    | 0.557    |
| enwiki.articletopic | True      | 10k               | 25         | False   | True  | 0.693    | 0.54     |
| enwiki.articletopic | True      | 10k               | 25         | True    | True  | 0.698    | 0.557    |
| enwiki.articletopic | True      | 10k               | 50         | False   | False | 0.703    | 0.558    |
| enwiki.articletopic | True      | 10k               | 50         | True    | False | 0.706    | 0.573    |
| enwiki.articletopic | True      | 10k               | 50         | False   | True  | 0.704    | 0.558    |
| enwiki.articletopic | True      | 10k               | 50         | True    | True  | 0.706    | 0.576    |
| enwiki.drafttopic   | True      | 10k               | 10         | False   | False | 0.598    | 0.406    |
| enwiki.drafttopic   | True      | 10k               | 10         | True    | False | 0.61     | 0.425    |
| enwiki.drafttopic   | True      | 10k               | 10         | False   | True  | 0.598    | 0.407    |
| enwiki.drafttopic   | True      | 10k               | 10         | True    | True  | 0.611    | 0.426    |
| enwiki.drafttopic   | True      | 10k               | 25         | False   | False | 0.651    | 0.493    |
| enwiki.drafttopic   | True      | 10k               | 25         | True    | False | 0.656    | 0.508    |
| enwiki.drafttopic   | True      | 10k               | 25         | False   | True  | 0.652    | 0.494    |
| enwiki.drafttopic   | True      | 10k               | 25         | True    | True  | 0.654    | 0.507    |
| enwiki.drafttopic   | True      | 10k               | 50         | False   | False | 0.662    | 0.512    |
| enwiki.drafttopic   | True      | 10k               | 50         | True    | False | None     | None     |
| enwiki.drafttopic   | True      | 10k               | 50         | False   | True  | 0.663    | 0.51     |
| enwiki.drafttopic   | True      | 10k               | 50         | True    | True  | None     | None     |

Note:
again, a missing value in "Geography.Regions.Africa.Central Africa" -> "None"

Comments on the results:

  • retrain:

According to FASTTEXT.ZIP: COMPRESSING TEXT CLASSIFICATION MODELS :

Bottom-up strategy: re-training. The first works aiming at compressing CNN models like the one proposed by (Gong et al., 2014) used the reconstruction from off-the-shelf PQ, i.e., without any retraining. However, as observed in Sablayrolles et al. (2016), when using quantization methods like PQ, it is better to re-train the layers occurring after the quantization, so that the network can re-adjust itself to the quantization. There is a strong argument arguing for this re-training strategy: the square magnitude of vectors is reduced, on average, by the average quantization error for any quantizer satisfying the Lloyd conditions; see Jegou et al. (2011) for details. This suggests a bottom-up learning strategy where we first quantize the input matrix, then retrain and quantize the output matrix (the input matrix being frozen). Experiments in Section 4 show that it is worth adopting this strategy.

From the results you can see this really makes sense, as the results are better with retrain.

  • qnorm :

From the FastText GitHub:

-qnorm quantizing the norm separately [0]

I couldn't find any explanation of this parameter anywhere. As you can see, it makes no difference in the results. People usually set it to "True", so I advise setting it to True :P . My basic understanding of quantization from the paper is that you select "the most important" words for classification of each label until you fill the cutoff limit (10k in this case), then you retrain the model -> this readjusts the vector values for the new vocabulary. I could not decipher where the "qnorm" comes in.

Directory structure of attached model reports:
dimensions : retrain = False, qnorm = False
dimensions_retrain : retrain = True, qnorm = False
dimensions_qnorm : retrain = False, qnorm = True
dimensions_retrain_qnorm : retrain = True, qnorm = True

Conclusion:
There is a drop of ~1 % in accuracy between 25 and 50 dimensions, but ~6 % between 25 and 10 dimensions.
I suggest:
model.quantize(input=train_data, cutoff=10000, qnorm=True, retrain=True)

If you/we need to shrink the models even more, then adjust 'dim' in params to 25. This can be done in the Makefile:
model = fasttext.train_supervised(input_path, **params)
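Putting both suggestions together, a minimal end-to-end sketch (paths are placeholders; keep dim=50 if the ~1% fitness cost of 25 dimensions matters):

import fasttext

params = {"dim": 25, "loss": "ova"}   # or dim=50 to avoid the ~1% drop
model = fasttext.train_supervised("train.txt", **params)                   # placeholder path
model.quantize(input="train.txt", cutoff=10000, qnorm=True, retrain=True)
model.save_model("model.ftz")         # quantized fasttext models conventionally use .ftz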
I created a pull request on GitHub, but it seems to have ended up with some error; we can resolve it on a call.

Halfak assigned this task to Pavol86.
Halfak moved this task from Parked to Completed on the Machine-Learning-Team (Active Tasks) board.

We now have models that are built using the compressed vectors. They seem to give us good fitness.