
Generate word vectors for ar, cs, en, and ko using FastText
Closed, Resolved (Public)

Description

We should generate embeddings at several vector lengths, e.g. 50, 150, and 250. We can then use the different lengths to compare the fitness and performance of models built on top of them.

Event Timeline

Copy-pasted from T235183:

I found https://fasttext.cc/docs/en/unsupervised-tutorial.html. It seems like a great tutorial for generating word vectors. I think we should start here with length 50 vectors and compare them to length 100 vectors.
We should set up a job on stat1007, or maybe even a Hadoop job, to clean up the text of the XML dumps and then generate vectors from it.
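
As a rough illustration of the clean-up step, here is a minimal Python sketch. It is not the script the tutorial uses, and the function name `clean_wiki_text` and the specific rules (drop tags, drop simple templates, keep link labels, replace punctuation with spaces, lowercase) are assumptions chosen for illustration only.

```python
import re


def clean_wiki_text(raw: str) -> str:
    """Roughly normalize wiki markup into lowercase, space-separated tokens.

    Hypothetical helper: the rules below are illustrative, not the tutorial's.
    """
    text = re.sub(r"<[^>]+>", " ", raw)          # drop XML/HTML tags
    text = re.sub(r"\{\{[^{}]*\}\}", " ", text)  # drop simple, non-nested templates
    # [[target|label]] -> label, [[target]] -> target
    text = re.sub(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]", r"\1", text)
    text = re.sub(r"[^\w\s]", " ", text)         # punctuation -> space (Unicode-aware)
    return " ".join(text.lower().split())        # lowercase, collapse whitespace
```

For example, `clean_wiki_text("Praha je [[hlavní město]].")` gives `"praha je hlavní město"`. Note that `\w` does not match combining marks, so vocalized Arabic text would need different handling.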

I suggest starting with running the scripts from the tutorial on stat1007.
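
A sketch of how the runs for several models and vector lengths could be enumerated, assuming the `fasttext` binary built in the tutorial is on the PATH. The function name `fasttext_commands`, the input path, and the `vectors_<model>_<dim>` output naming are hypothetical; only the `cbow`/`skipgram` subcommands and the `-input`, `-output`, and `-dim` options come from the tutorial.

```python
from itertools import product


def fasttext_commands(input_path, output_dir,
                      models=("cbow", "skipgram"), dims=(50, 100)):
    """Return one fastText CLI argument list per (model, dim) pair."""
    commands = []
    for model, dim in product(models, dims):
        # Hypothetical naming scheme for the output prefix.
        output_prefix = f"{output_dir}/vectors_{model}_{dim}"
        commands.append([
            "fasttext", model,
            "-input", input_path,
            "-output", output_prefix,  # fastText writes <prefix>.bin and <prefix>.vec
            "-dim", str(dim),
        ])
    return commands


if __name__ == "__main__":
    # Print the commands instead of running them, so they can be reviewed first.
    for cmd in fasttext_commands("data/enwiki.txt", "result"):
        print(" ".join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` once the commands look right.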

@Halfak I used FastText to generate vector models with 50 and 100 dimensions. You can find them on stat1007 under:

$ cd /home/kevinbazira/fasttext_word_representations/2__unsupervised-tutorial

You'll find the following files:

  1. For cbow dim 50
    • enwiki_latest_pages_articles_result_cbow_50.bin
    • enwiki_latest_pages_articles_result_cbow_50.vec
  2. For skipgram dim 50
    • enwiki_latest_pages_articles_result_skipgram_50.bin
    • enwiki_latest_pages_articles_result_skipgram_50.vec
  3. For skipgram dim 100 (PS: these are still being generated and will be complete soon)
    • enwiki_latest_pages_articles_result_skipgram_100.bin
    • enwiki_latest_pages_articles_result_skipgram_100.vec

PS: The .bin file is the binary model, and the .vec file is a text file containing the word vectors.
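
For anyone who wants to inspect the .vec files, here is a minimal Python sketch of a reader. It assumes the usual fastText text layout: a header line with the word count and the dimension, then one word per line followed by its values. The function name `read_vec` is hypothetical.

```python
def read_vec(lines):
    """Parse fastText .vec text from an iterable of lines.

    Assumes a 'count dim' header followed by 'word v1 ... v_dim' lines.
    Returns (count, dim, {word: [floats]}).
    """
    lines = iter(lines)
    count, dim = (int(x) for x in next(lines).split())
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return count, dim, vectors
```

It accepts any iterable of lines, so an open file works: `with open(path, encoding="utf-8") as f: count, dim, vectors = read_vec(f)`. This holds every vector in memory, which may be heavy for a full enwiki vocabulary.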