
Generate word vectors for ar, cs, en, and ko using FastText
Closed, Resolved (Public)

Description

We should generate embeddings at several vector lengths (e.g. 50, 150, 250). Then we can use the different lengths to experiment with the fitness and performance of models built on top of them.

Event Timeline

Halfak created this task. Oct 10 2019, 3:13 PM
Halfak added a comment. Dec 9 2019, 4:35 PM

Copy-pasted from T235183:

I found https://fasttext.cc/docs/en/unsupervised-tutorial.html. It seems like a great tutorial for generating word vectors. I think we should start here with length 50 vectors and compare them to length 100 vectors.
We should set up a job on stat1007, or maybe even a Hadoop job, to clean up the text of the XML dumps and then generate vectors from it.

I suggest starting with running the scripts from the tutorial on stat1007.
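
For reference, here is a minimal sketch of how the tutorial's training commands could be generated for several vector lengths. It only builds the command lines and does not run them. The `-input`, `-output`, and `-dim` flags and the `cbow`/`skipgram` subcommands come from the fastText tutorial; the helper name `fasttext_commands`, the corpus file name, and the output prefix are hypothetical, and it assumes a `fasttext` binary is on the PATH.

```python
from itertools import product


def fasttext_commands(corpus, out_prefix, models=("cbow", "skipgram"), dims=(50, 100)):
    """Build one fastText CLI invocation per (model, dimension) pair.

    Hypothetical helper: it returns argument lists and runs nothing.
    """
    commands = []
    for model, dim in product(models, dims):
        commands.append([
            "fasttext", model,
            "-input", corpus,
            # fastText writes <output>.bin and <output>.vec
            "-output", f"{out_prefix}_{model}_{dim}",
            "-dim", str(dim),
        ])
    return commands


# Assumed file names, for illustration only.
for cmd in fasttext_commands("enwiki_clean.txt", "enwiki_result"):
    print(" ".join(cmd))
```

Each list could be passed to `subprocess.run` on a machine that has fastText built, once the dump text has been cleaned.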

kevinbazira added a comment. Edited Dec 17 2019, 3:28 PM

@Halfak I used FastText to generate 50- and 100-dimensional vector models. Please find them on stat1007 under:

$ cd /home/kevinbazira/fasttext_word_representations/2__unsupervised-tutorial

You'll find the following files:

  1. For cbow dim 50
    • enwiki_latest_pages_articles_result_cbow_50.bin
    • enwiki_latest_pages_articles_result_cbow_50.vec
  2. For skipgram dim 50
    • enwiki_latest_pages_articles_result_skipgram_50.bin
    • enwiki_latest_pages_articles_result_skipgram_50.vec
  3. For skipgram dim 100 (PS: These are still generating; they'll be complete soon)
    • enwiki_latest_pages_articles_result_skipgram_100.bin
    • enwiki_latest_pages_articles_result_skipgram_100.vec

PS: The .bin file is the binary model, and the .vec file is a text file containing the word vectors.
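
In case it helps with inspecting the .vec files, here is a rough sketch of a reader for fastText's text vector format: a header line with the vocabulary size and the dimension, then one word per line followed by its values. The function name `read_vec` and the inline sample are made up for illustration; a real file would be opened with `open(path, encoding="utf-8")`.

```python
import io


def read_vec(handle):
    """Parse fastText's text vector format into (n_words, dim, {word: [floats]})."""
    header = handle.readline().split()
    n_words, dim = int(header[0]), int(header[1])
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return n_words, dim, vectors


# Tiny made-up sample standing in for a real .vec file.
sample = io.StringIO("2 3\nthe 0.1 0.2 0.3\nof -0.4 0.5 0.6\n")
n_words, dim, vectors = read_vec(sample)
print(n_words, dim, vectors["of"])
```

This loads everything into memory, which is fine for a quick check but probably not for a full enwiki vocabulary.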

Halfak claimed this task. Jan 21 2020, 3:58 PM
Halfak moved this task from Active to Pending deployment on the Scoring-platform-team (Current) board.
Halfak added a subscriber: kevinbazira.
Halfak closed this task as Resolved. Feb 5 2020, 4:27 PM