Aug 6 2020
I tested some of the above-mentioned models on a test dataset (~150k items) that is completely separate from the data the models were trained on. All of the models performed more or less the same on the new dataset. Since none of them performed significantly better than the others, and all other models in drafttopic are trained on a balanced dataset using GBC, we have decided to stick with the same approach for Wikidata too.
Aug 4 2020
| Classifier | GradientBoosting |
| Parameters | n_estimators=150 max_depth=5 max_features="log2" learning_rate=0.1 |
| Number of samples | 63961 (imbalanced) |
| Vocab size | 50000 |
| Embedding dimension | 50 |
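As a rough illustration of the configuration above, here is how such a classifier could be set up with scikit-learn; the tiny synthetic dataset is only there to make the sketch runnable and is not related to the actual training data:

```python
# Minimal sketch (not the actual drafttopic pipeline) of a GradientBoosting
# classifier with the parameters listed in the table above.
from sklearn.ensemble import GradientBoostingClassifier

clf = GradientBoostingClassifier(
    n_estimators=150, max_depth=5, max_features="log2", learning_rate=0.1
)

# Tiny synthetic data, just to demonstrate fit/predict.
X = [[0, 0], [1, 1], [0, 1], [1, 0]] * 10
y = [0, 1, 0, 1] * 10
clf.fit(X, y)
pred = clf.predict([[1, 1]])
```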
| Classifier | Fasttext |
| Parameters | loss=ova epoch=25 dim=50 lr=0.1 pretrainedVectors=word2vec/wikidata-20200501-learned_vectors.50_cell.50k.vec minCount=1000000 |
| Number of samples | 63961 (imbalanced) |
| Pretrained vectors: vocab size | 50000 |
| Pretrained vectors: embedding dimension | 50 |
| Classifier | Fasttext |
| Parameters | loss=ova epoch=25 dim=50 lr=0.1 pretrainedVectors=word2vec/wikidata-20200501-learned_vectors.50_cell.vec minCount=1000000 |
| Number of samples | 63961 (imbalanced) |
| Pretrained vectors: vocab size | 10000 |
| Pretrained vectors: embedding dimension | 50 |
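For reference, the Fasttext parameters in the tables above map onto the keyword arguments of the `fasttext` Python package; a sketch of that mapping (the training-file path is a placeholder, not the file actually used):

```python
# Hypothetical parameter mapping for the configuration above. With the
# fasttext package installed, this could be passed to
# fasttext.train_supervised("train.txt", **params).
params = dict(
    loss="ova",
    epoch=25,
    dim=50,
    lr=0.1,
    pretrainedVectors="word2vec/wikidata-20200501-learned_vectors.50_cell.50k.vec",
    minCount=1000000,
)
```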
Jul 30 2020
| Classifier | Fasttext |
| Parameters | loss=ova epoch=25 dim=50 lr=0.1 pretrainedVectors=word2vec/wikidata-20200501-learned_vectors.50_cell.50k.vec minCount=1000000 |
| Number of samples | 255785 (balanced; at least 4000 per label) |
| Pretrained vectors: vocab size | 50000 |
| Pretrained vectors: embedding dimension | 50 |
| Classifier | Fasttext |
| Parameters | loss=ova epoch=25 dim=50 lr=0.1 pretrainedVectors=word2vec/wikidata-20200501-learned_vectors.50_cell.vec |
| Number of samples | 63944 (balanced) |
| Pretrained vectors: vocab size | 10000 |
| Pretrained vectors: embedding dimension | 50 |
Jul 29 2020
| Classifier | GradientBoosting |
| Parameters | n_estimators=150 max_depth=5 max_features="log2" learning_rate=0.1 |
| Number of samples | 63944 (balanced) |
| Vocab size | 50000 |
| Embedding dimension | 50 |
Jul 28 2020
Subsequent comments will contain performance reports for different Wikidata models, to give an idea of how performance changes as we vary factors like the classifier (Fasttext vs. GradientBoosting), vocab size, training data size, etc.
Sort order of Wikidata statements
We decided to order the Wikidata statements instead of randomizing them, for the reasons @Isaac mentioned above. For that, the ordering of properties is collected from here (SortedProperties) using the API. We then sort the statements by their properties, using the SortedProperties list as a reference.
Some statements have properties that do not appear in the SortedProperties list. Those statements are simply moved to the end, so they appear after all the statements that are in their correct positions.
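The sorting step above can be sketched as follows; the property IDs here are illustrative stand-ins, not the actual SortedProperties data:

```python
# Sketch of sorting statements by a reference property order.
# `sorted_properties` is a hypothetical ordering standing in for the
# list fetched from SortedProperties via the API.
sorted_properties = ["P31", "P21", "P106"]
rank = {pid: i for i, pid in enumerate(sorted_properties)}

def sort_statements(property_ids):
    # Properties missing from the reference list rank after all known
    # ones; Python's sort is stable, so their relative order is kept.
    return sorted(property_ids, key=lambda pid: rank.get(pid, len(rank)))

print(sort_statements(["P106", "P9999", "P31"]))  # → ['P31', 'P106', 'P9999']
```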
Jul 13 2020
There are some Wikidata items that represent Wikipedia pages that aren't articles. E.g., Q8207058 sitelinks to Portal:Earth Sciences on English Wikipedia (the same applies to other languages).
Mar 12 2020
Could someone help me with this error?