Generate word vectors for ar, cs, en, and ko using FastText
Closed, Resolved, Public

Assigned To

Authored By

	Halfak
	Oct 10 2019, 3:13 PM

Description

We should generate embeddings with several different vector lengths (e.g., 50, 150, 250). We can then use the different vector lengths to experiment with the fitness and performance of models built on top of them.

Related Objects

Status	Assigned	Task
Resolved	Halfak	T243451 Deploy ORES -- Late Jan 2020
Resolved	Halfak	T235181 Build WikiProject directory topic models for ar, cs, and kowiki
Resolved	Halfak	T235183 Experiment with different vector lengths for ar, cs, en, and kowiki topic models.
Resolved	Halfak	T235184 Generate word vectors for ar, cs, en, and ko using FastText
Resolved	Isaac	T242013 Implement native NN model in revscoring
Resolved	Isaac	T241270 Add wikidata features to topic models

Event Timeline

Halfak created this task. Oct 10 2019, 3:13 PM

Halfak moved this task from Unsorted to Backlog/Lift Wing on the Machine-Learning-Team board. Oct 23 2019, 9:06 PM

Copy-pasted from T235183:

I found https://fasttext.cc/docs/en/unsupervised-tutorial.html. It seems like a great tutorial for generating word vectors. I think we should start here with length 50 vectors and compare them to length 100 vectors.
We should set up a job on stat1007, or maybe even a Hadoop job, to clean up the text of the XML dumps and then generate vectors from the cleaned text.

I suggest starting with running the scripts from the tutorial on stat1007.
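
As a rough sketch of what running the tutorial for several vector lengths could look like, the snippet below builds one fastText command line per (model, dimension) pair. The `cbow` and `skipgram` subcommands and the `-input`, `-output`, and `-dim` flags come from the fastText tutorial linked above; the input path, output directory, and file-name pattern are hypothetical placeholders, not paths from this task.

```python
import shlex
from itertools import product

# Hypothetical paths: the task does not say where the cleaned dump text lives.
INPUT_PATH = "data/enwiki_clean.txt"
OUTPUT_DIR = "result"


def build_commands(models=("cbow", "skipgram"), dims=(50, 100)):
    """Return one fastText CLI argument list per (model, dim) pair."""
    commands = []
    for model, dim in product(models, dims):
        # fastText appends .bin and .vec to this prefix when it saves output.
        output_prefix = f"{OUTPUT_DIR}/enwiki_{model}_{dim}"
        commands.append([
            "fasttext", model,
            "-input", INPUT_PATH,
            "-output", output_prefix,
            "-dim", str(dim),
        ])
    return commands


if __name__ == "__main__":
    # Print the commands instead of running them, so they can be reviewed first.
    for cmd in build_commands():
        print(shlex.join(cmd))
```

Printing rather than executing keeps the sketch safe to run anywhere; on stat1007 each list could be passed to `subprocess.run` once the paths are confirmed.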

Halfak assigned this task to kevinbazira. Dec 16 2019, 5:42 PM

Halfak edited projects, added Machine-Learning-Team (Active Tasks); removed Machine-Learning-Team.

@Halfak I used FastText to generate vector models with 50 and 100 dimensions. You can find them on stat1007 under:

$ cd /home/kevinbazira/fasttext_word_representations/2__unsupervised-tutorial

You'll find the following files:

For cbow dim 50
- enwiki_latest_pages_articles_result_cbow_50.bin
- enwiki_latest_pages_articles_result_cbow_50.vec

For skipgram dim 50
- enwiki_latest_pages_articles_result_skipgram_50.bin
- enwiki_latest_pages_articles_result_skipgram_50.vec

For skipgram dim 100 (PS: these are still being generated and will be complete soon)
- enwiki_latest_pages_articles_result_skipgram_100.bin
- enwiki_latest_pages_articles_result_skipgram_100.vec

PS: The .bin file is the binary model, and the .vec file is a text file containing the word vectors.
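
For anyone who wants to inspect the .vec files without installing fastText, here is a minimal sketch of a reader. It assumes the standard fastText text layout: a header line with the vocabulary size and the vector dimension, followed by one line per word with space-separated values. The function name `read_vec` and the tiny in-memory sample are illustrative only.

```python
import io


def read_vec(handle):
    """Parse a fastText .vec text stream into (n_words, dim, {word: vector})."""
    header = handle.readline().split()
    n_words, dim = int(header[0]), int(header[1])
    vectors = {}
    for line in handle:
        # Strip the newline (and any trailing space) before splitting.
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip lines that do not match the declared dimension
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return n_words, dim, vectors


if __name__ == "__main__":
    # Made-up sample in the same layout, not real values from the models above.
    sample = io.StringIO("2 3\nthe 0.1 0.2 0.3 \nof -0.5 0.0 1.5 \n")
    n_words, dim, vectors = read_vec(sample)
    print(n_words, dim, sorted(vectors))
```

To read one of the real files, open it as UTF-8 text, for example `open("enwiki_latest_pages_articles_result_cbow_50.vec", encoding="utf-8")`, and pass the handle to `read_vec`. The full English vocabulary may be large, so loading everything into a dict is only practical for quick checks.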

Halfak claimed this task. Jan 21 2020, 3:58 PM

Halfak moved this task from Parked to Pending deployment on the Machine-Learning-Team (Active Tasks) board.

Halfak added a subscriber: kevinbazira.

Halfak added a parent task: T243451: Deploy ORES -- Late Jan 2020. Jan 22 2020, 8:43 PM

Halfak moved this task from Pending deployment to Completed on the Machine-Learning-Team (Active Tasks) board. Feb 5 2020, 4:26 PM

Halfak closed this task as Resolved. Feb 5 2020, 4:27 PM

Isaac closed subtask T242013: Implement native NN model in revscoring as Resolved. Sep 10 2020, 1:52 PM