We should generate embeddings at several vector lengths (e.g. 50, 150, 250). We can then use them to compare the fitness and performance of models built on top of each length.
| Status | Assigned | Task |
|---|---|---|
| Resolved | Halfak | T243451 Deploy ORES -- Late Jan 2020 |
| Resolved | Halfak | T235181 Build WikiProject directory topic models for ar, cs, and kowiki |
| Resolved | Halfak | T235183 Experiment with different vector lengths for ar, cs, en, and kowiki topic models. |
| Resolved | Halfak | T235184 Generate word vectors for ar, cs, en, and ko using FastText |
| Open | None | T242013 Implement native NN model in revscoring |
| Open | None | T241270 Add wikidata features to topic models |
Copy-pasted from T235183:
I found https://fasttext.cc/docs/en/unsupervised-tutorial.html. It seems like a great tutorial for generating word vectors. I think we should start here with length 50 vectors and compare them to length 100 vectors.
We should set up a job on stat1007, or maybe even a Hadoop job, to clean up the text of the XML dumps and then generate vectors from it.
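For the clean-up step, here is a minimal sketch of what a text normalizer could look like. It is not the tutorial's preprocessing script; the function name `clean_text` and the exact rules (unescape entities, drop tags, collapse whitespace, lowercase) are assumptions for illustration. It leaves non-Latin characters intact, which matters for ar and ko.

```python
import html
import re

# Hypothetical helper: the rules below are an assumption, not the tutorial's script.
TAG_RE = re.compile(r"<[^>]+>")   # anything that looks like an XML/HTML tag
WS_RE = re.compile(r"\s+")        # runs of whitespace, including newlines


def clean_text(raw: str) -> str:
    """Return lowercased plain text with tags and entities removed."""
    text = html.unescape(raw)     # "&lt;ref&gt;" becomes "<ref>" so it is stripped next
    text = TAG_RE.sub(" ", text)  # replace tags with a space to keep words apart
    text = WS_RE.sub(" ", text)   # collapse whitespace to single spaces
    return text.strip().lower()
```

Real dump text also contains wiki markup (templates, links, tables) that this sketch does not handle, so a production job would need more rules than these.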
I suggest starting by running the tutorial's scripts on stat1007.
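To cover several vector lengths in one go, the tutorial's command line (`fasttext skipgram -input ... -output ... -dim ...`) could be generated for each model type and dimension. The sketch below only builds the argument lists; the helper names, the input path, the output prefix, and the output naming scheme are hypothetical, and it assumes a `fasttext` binary is on the PATH.

```python
from itertools import product
from typing import List, Sequence


def build_fasttext_command(model: str, dim: int,
                           input_path: str, output_prefix: str) -> List[str]:
    """Build one fastText CLI invocation for an unsupervised model."""
    if model not in ("skipgram", "cbow"):
        raise ValueError(f"unsupported model type: {model}")
    # Output naming (<prefix>.<model>.<dim>) is an assumption for illustration.
    output = f"{output_prefix}.{model}.{dim}"
    return ["fasttext", model,
            "-input", input_path,
            "-output", output,
            "-dim", str(dim)]


def build_jobs(models: Sequence[str], dims: Sequence[int],
               input_path: str, output_prefix: str) -> List[List[str]]:
    """Build one command per (model, dim) combination."""
    return [build_fasttext_command(m, d, input_path, output_prefix)
            for m, d in product(models, dims)]


# Hypothetical paths; replace with the cleaned dump and a real output location.
jobs = build_jobs(["cbow", "skipgram"], [50, 100],
                  "data/cleaned.txt", "result/vectors")
```

Each entry in `jobs` can then be passed to `subprocess.run(cmd, check=True)` to train the corresponding model.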
@Halfak I used FastText to generate 50- and 100-dimensional vector models. You can find them on stat1007 under:
$ cd /home/kevinbazira/fasttext_word_representations/2__unsupervised-tutorial
You'll find the following files:
- For cbow dim 50
- For skipgram dim 50
- For skipgram dim 100 (PS: these are still generating and will be complete soon)
PS: the .bin file is the binary model, and the .vec text file contains the word vectors.