
Mar 2 2021

Pavol86 committed rDRAFTTOPICa06db9ec4953: it is alive.
it is alive
Mar 2 2021, 8:25 AM

Feb 23 2021

Pavol86 committed rDRAFTTOPICc199e4b069e5: cjk models.
cjk models
Feb 23 2021, 5:39 AM

Feb 22 2021

Pavol86 committed rDRAFTTOPIC340552b9f690: japanase model.
japanase model
Feb 22 2021, 7:15 PM
Pavol86 committed rDRAFTTOPIC8a4d31990081: update.
update
Feb 22 2021, 3:49 PM
Pavol86 committed rDRAFTTOPIC8e2cf6a18a1d: updated code.
updated code
Feb 22 2021, 3:26 PM
Pavol86 committed rDRAFTTOPIC42adfae355eb: new.
new
Feb 22 2021, 2:17 PM

Feb 18 2021

Pavol86 committed rDRAFTTOPIC6b2516f6139f: update.
update
Feb 18 2021, 9:49 AM

Feb 16 2021

Pavol86 committed rDRAFTTOPICc42c40c8cdb7: update.
update
Feb 16 2021, 12:13 PM

Feb 14 2021

Pavol86 committed rDRAFTTOPIC9e68530451da: CJK_models.
CJK_models
Feb 14 2021, 1:12 AM

Feb 13 2021

Pavol86 committed rOEQ7892c8bd025f: update.
update
Feb 13 2021, 7:16 PM

Feb 7 2021

Pavol86 committed rOEQ2eb91643755d: update.
update
Feb 7 2021, 11:16 PM

Feb 4 2021

Pavol86 committed rOEQe83914895a62: new models.
new models
Feb 4 2021, 4:07 PM

Jan 26 2021

Pavol86 committed rOEQc7ec75e4531d: model_info added.
model_info added
Jan 26 2021, 9:04 PM

Jan 14 2021

Pavol86 committed rOEQaa628903e953: updated test-requirements.
updated test-requirements
Jan 14 2021, 5:08 PM
Pavol86 committed rOEQ0c788dae17f4: update.
update
Jan 14 2021, 10:04 AM

Jan 13 2021

Pavol86 committed rOEQf4ffad4e3fb5: updated tests.
updated tests
Jan 13 2021, 8:34 PM

Jan 4 2021

Pavol86 committed rOEQ811202cbaa0a: update.
update
Jan 4 2021, 4:33 PM

Dec 26 2020

Pavol86 committed rOEQ4222d4639ca3: added cjk features.
added cjk features
Dec 26 2020, 5:33 PM

Dec 24 2020

Pavol86 committed rOEQ7778eaf31b1e: initial_commit_with_cjk_features.
initial_commit_with_cjk_features
Dec 24 2020, 5:46 PM

Aug 17 2020

Pavol86 updated Pavol86.
Aug 17 2020, 4:08 PM
Pavol86 added a comment to T238712: [Open question] How to more effectively detect spambot accounts?.

@leila, if this task is still open, I would like to help.

Aug 17 2020, 1:00 PM · ConfirmEdit (CAPTCHA extension), Research-Freezer

Jul 23 2020

Pavol86 added a comment to T111179: Tokenization of "word" things for CJK.

@VulpesVulpes825 thank you for the recommendation! I do not speak any of the languages, so I am "best guessing" all the way :). In the end, the CJK tokenization should be part of the deltas library: https://github.com/halfak/deltas . I prepared the code to be merged (pull request) into deltas, and I have a call with @Halfak today. I will keep you updated.

Jul 23 2020, 9:26 AM · Machine-Learning-Team (Active Tasks), Chinese-Sites, artificial-intelligence, revscoring
Pavol86 added a comment to T111179: Tokenization of "word" things for CJK.

FINAL NOTES (hopefully :) ):
Japanese:

  • I initially hadn't tried SudachiPy because I saw poor performance stats, but it is the only JP tokenizer that I was able to get running just by "pip install" without any additional instructions
  • SudachiPy model loads quickly:
jp_sudachi model load time: 0.03719305992126465
  • Sudachi provides three modes of splitting: in A mode, texts are divided into the shortest units, equivalent to the UniDic short unit; in B mode, into middle units; in C mode, it extracts named entities
  • three dictionaries are available:
Small: includes only the vocabulary of UniDic
Core: includes basic vocabulary (default)
Full: includes miscellaneous proper nouns
  • there is only a slight difference in tokenizer performance between the dictionaries (small is slightly faster than core, etc.), see:

Deltas_Japanese_Tokenizer-Sudachi_Dictionaries_cjk_True.png (501×486 px, 38 KB)

  • I recommend using the full dict with split mode A. To download/use the full dict:
pip install sudachidict_full
sudachipy link -t full
Jul 23 2020, 8:29 AM · Machine-Learning-Team (Active Tasks), Chinese-Sites, artificial-intelligence, revscoring

Jul 15 2020

Pavol86 added a comment to T111179: Tokenization of "word" things for CJK.

@Halfak I need your feedback on the following. As agreed on our call last week, I did the following:

  1. make the ch, jp, ko tokenizer decision more explicit in the code
  2. add "# noqa" to lines that should have >85 chars - as a workaround for flake8/pep8 test
  3. performance tests: run the code 100-1000 times on the same article and compare performance between the previous/new version, with cjk tokenization True/False

3.1 I tested the performance of the original deltas tokenizer; see the following boxplots. The y-axis marks the type of wiki and type of text (EN wiki with EN text, Chinese wiki with Chinese text, ..., EN wiki with Chinese text, etc.)

Deltas Orig.png (332×392 px, 11 KB)
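
The repeated-run measurement described in item 3 can be sketched roughly as follows. This is only an illustration using the standard library: `simple_tokenize` is a hypothetical stand-in for the tokenizer under test, not the deltas tokenizer, and the summary figures are a simplification of what a boxplot shows.

```python
import re
import statistics
import time


def simple_tokenize(text):
    """Hypothetical stand-in tokenizer: runs of word characters or single symbols."""
    return re.findall(r"\w+|[^\w\s]", text)


def benchmark(func, text, repeats=100):
    """Call func(text) `repeats` times and return the per-call durations in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(text)
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Reduce raw timings to a few summary figures."""
    return {
        "min": min(durations),
        "median": statistics.median(durations),
        "mean": statistics.mean(durations),
        "max": max(durations),
    }


# Run the same "article" many times, as in the performance test.
article = "Tokenization of CJK text, e.g. 中文 or 日本語, is the hard part. " * 200
stats = summarize(benchmark(simple_tokenize, article, repeats=100))
print({name: f"{value * 1e3:.3f} ms" for name, value in stats.items()})
```

Keeping the raw per-call durations, rather than a single total, is what makes it possible to draw distributions per wiki/text combination afterwards.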

3.2 I found out that loading the Chinese tokenizer model is a bottleneck, so I tested pkuseg, thulac, and jieba on Chinese wiki with Chinese text. Jieba is the only tokenizer that needs to be initialized only once and is then kept in memory. Pkuseg and Thulac take 2-3s to initialize. Model load of each tokenizer (including jap and kor)
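
One common way to keep a slow model load from being paid on every call is to cache the tokenizer instance per language. The sketch below only illustrates that idea: `SlowTokenizer` is a hypothetical stand-in that simulates load time with a short sleep, and it does not reflect how pkuseg, thulac, or jieba are actually constructed.

```python
import functools
import time


class SlowTokenizer:
    """Hypothetical tokenizer whose model is expensive to load."""

    def __init__(self, lang):
        time.sleep(0.05)  # simulate the model load time
        self.lang = lang

    def tokenize(self, text):
        return text.split()


@functools.lru_cache(maxsize=None)
def get_tokenizer(lang):
    """Build the tokenizer for `lang` on first use and reuse it afterwards."""
    return SlowTokenizer(lang)


def tokenize(text, lang):
    return get_tokenizer(lang).tokenize(text)


# The first call pays the load cost; later calls reuse the cached instance.
for attempt in ("first", "second"):
    start = time.perf_counter()
    tokenize("a b c", "zh")
    print(f"{attempt} call: {time.perf_counter() - start:.4f} s")
```

With this pattern the load cost is paid once per process and per language, so a benchmark should report load time and per-call tokenization time separately.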

Jul 15 2020, 2:21 PM · Machine-Learning-Team (Active Tasks), Chinese-Sites, artificial-intelligence, revscoring

Jul 8 2020

Pavol86 added a comment to T111179: Tokenization of "word" things for CJK.

NOTE1: I have some issues with creating a pull request for the deltas package; we should be able to resolve it with Aaron...
NOTE2: this is the application + explanation of the code. I thank @jeena and @VulpesVulpes825 for their ideas; we will check the performance of other dictionaries/tools, but for now I wanted to have a working basic CJK tokenizer.

Jul 8 2020, 11:16 PM · Machine-Learning-Team (Active Tasks), Chinese-Sites, artificial-intelligence, revscoring

Jun 21 2020

Pavol86 added a comment to T111179: Tokenization of "word" things for CJK.
Note

Export from a Jupyter notebook. After I struggled with Japanese tokenizers, I decided to get at least one working. The hardest part is finding a free segmented dataset; now I know why this task is still open :). I forgot that nothing is free in Japan and open source is a very new concept.. (my exp. ...)

Jun 21 2020, 8:31 PM · Machine-Learning-Team (Active Tasks), Chinese-Sites, artificial-intelligence, revscoring

Jun 19 2020

Pavol86 added a comment to T111179: Tokenization of "word" things for CJK.

@VulpesVulpes825 thank you for your response. Do you know if there are any datasets/dictionaries that are used for benchmarking of tokenization methods?

Jun 19 2020, 2:27 PM · Machine-Learning-Team (Active Tasks), Chinese-Sites, artificial-intelligence, revscoring

Jun 17 2020

Pavol86 added a comment to T111179: Tokenization of "word" things for CJK.

I can see that this thread is quite old, so first of all, just a couple of notes that we can talk about with Aaron on Thursday's call..

Jun 17 2020, 3:02 PM · Machine-Learning-Team (Active Tasks), Chinese-Sites, artificial-intelligence, revscoring

Jun 8 2020

Pavol86 added a comment to T247523: Compress Gensim models.

Results of dimensionality reduction on the testing (sampled) dataset:

model name           quantized  vocabulary cutoff  dimensions  retrain  qnorm  f1 micro  f1 macro
enwiki.articletopic  True       10k                10          False    False  0.639     0.45
enwiki.articletopic  True       10k                10          True     False  0.649     0.468
enwiki.articletopic  True       10k                10          False    True   0.639     0.451
enwiki.articletopic  True       10k                10          True     True   0.65      0.468
enwiki.articletopic  True       10k                25          False    False  0.693     0.54
enwiki.articletopic  True       10k                25          True     False  0.698     0.557
enwiki.articletopic  True       10k                25          False    True   0.693     0.54
enwiki.articletopic  True       10k                25          True     True   0.698     0.557
enwiki.articletopic  True       10k                50          False    False  0.703     0.558
enwiki.articletopic  True       10k                50          True     False  0.706     0.573
enwiki.articletopic  True       10k                50          False    True   0.704     0.558
enwiki.articletopic  True       10k                50          True     True   0.706     0.576
enwiki.drafttopic    True       10k                10          False    False  0.598     0.406
enwiki.drafttopic    True       10k                10          True     False  0.61      0.425
enwiki.drafttopic    True       10k                10          False    True   0.598     0.407
enwiki.drafttopic    True       10k                10          True     True   0.611     0.426
enwiki.drafttopic    True       10k                25          False    False  0.651     0.493
enwiki.drafttopic    True       10k                25          True     False  0.656     0.508
enwiki.drafttopic    True       10k                25          False    True   0.652     0.494
enwiki.drafttopic    True       10k                25          True     True   0.654     0.507
enwiki.drafttopic    True       10k                50          False    False  0.662     0.512
enwiki.drafttopic    True       10k                50          True     False  None      None
enwiki.drafttopic    True       10k                50          False    True   0.663     0.51
enwiki.drafttopic    True       10k                50          True     True   None      None

Jun 8 2020, 8:12 AM · Machine-Learning-Team (Active Tasks), drafttopic-modeling

May 30 2020

Pavol86 updated subscribers of T247523: Compress Gensim models.

@aaron, I need help understanding the results and checking whether what I did to test vocabulary sizes makes sense.
(in the following I use brackets to list file name options)
(i) Does it make sense what I did?

  • As I mentioned in the previous update, I extracted word2vec vocabularies with sizes [10k,25k,50k,75k,100k] using the gensim limit parameter and then using the fasttext quantize function => I created 10 word2vec datasets named w2v_[gensim,quantized][10k,25k,50k,75k,100k].kv
  • I created a new file for each word2vec dataset in drafttopic\drafttopic\feature_lists : enwiki_[gs,qt][10k,25k,50k,75k,100k].py
  • I added new commands to drafttopic\Makefile
  • I ran the commands related to enwiki except "tuning" - I killed the process when I saw that it uses grid search and would try 16 parameter combinations :)

(ii) Results explanation

  • Statistics - UNDERSTAND
  • rates/match_rate/filter_rate - DON'T UNDERSTAND (fractions of the dataset?)
  • recall/precision/accuracy/f1/fpr/roc_auc/pr_auc - UNDERSTAND
  • !recall/!precision/!f1 - DON'T UNDERSTAND (what does the "!" mean?)

(iii) Summary
The differences between the classification stats of the 10k and 100k versions are small, but the quantized 10k word2vec dataset performs a bit better than the gensim-limited 100k dataset, e.g.:

model name                  f1 micro  f1 macro
enwiki.articletopic_gs100k  0.681     0.512
enwiki.articletopic_qt100k  0.69      0.533
enwiki.articletopic_gs10k   0.672     0.504
enwiki.articletopic_qt10k   0.685     0.53
enwiki.drafttopic_gs100k    None      None
enwiki.drafttopic_qt100k    0.645     0.485
enwiki.drafttopic_gs10k     0.627     0.454
enwiki.drafttopic_qt10k     0.641     0.478
May 30 2020, 6:08 PM · Machine-Learning-Team (Active Tasks), drafttopic-modeling

May 26 2020

Pavol86 added a comment to T247523: Compress Gensim models.

Short update

  • initial DraftTopic testing issues [resolved]: (i) module version issues, (ii) default MiniConda python3.8 issue - 3.8 has some issues with GenSim and other modules used in DraftTopic, (iii) calculations are memory/CPU intensive (I created a virtual env. on a machine I don't use), (iv) sample dataset too big, downloading crashed multiple times - Aaron provided a smaller version
  • I created word2vec GenSim models according to python-mwtext : https://github.com/mediawiki-utilities/python-mwtext/blob/master/mwtext/utilities/word2vec2gensim.py
  • As a baseline, I created models with dictionary size limit = [10k,25k,50k,75k,100k] from the provided preprocessed text with:
model = KeyedVectors.load_word2vec_format(input_path, limit=limit)
  • I created quantized models with cutoff dictionaries

(again an issue with Python 3.8, so I ended up using the FastText command line/terminal commands)

model = fasttext.train_supervised(input_path, **params)
model.quantize(input=train_data, retrain=True, cutoff=[10k,25k,50k,75k,100k])

https://github.com/mediawiki-utilities/python-mwtext/blob/master/mwtext/utilities/learn_vectors.py

  • as I had so many issues with python 3.8 I switched to 3.7.2 -> fewer issues
  • I added commands to the Makefile to create [10k,25k,50k,75k,100k] versions of enwiki.balanced_article_sample.w_article_text.json and enwiki.balanced_article_sample.w_draft_text.json
  • I created cached versions of sampled datasets
  • I trained the models with the new balanced cached datasets - at first I didn't notice that the model training command calls "drafttopic.feature_lists.enwiki.drafttopic", which has a hardcoded word2vec GenSim model/dictionary (.kv) - "enwiki-20191201-learned_vectors.50_cell.100k.kv"
    • I followed the KISS principle ("keep it simple, stupid") and created one "drafttopic.feature_lists.enwiki" file per word2vec dictionary, e.g.:
drafttopic.feature_lists.enwiki_gs100k (referring to the gensim model limited to a 100k dict)
drafttopic.feature_lists.enwiki_qt100k (referring to the quantized model with cutoff=100k)
drafttopic.feature_lists.enwiki_gs75k 
...
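
The two model-building steps in this update could be looped over the vocabulary sizes roughly as below. This is a sketch under assumptions, not the code that was actually run: the function names are hypothetical, the gensim output names follow the w2v_[gensim,quantized][...].kv pattern mentioned in the later comment, the `.ftz` output name is invented for illustration, and the quantized models in this thread were produced with the FastText command line rather than the Python API.

```python
from pathlib import Path

# Bracket notation from the comment, spelled out as integers.
VOCAB_SIZES = {"10k": 10_000, "25k": 25_000, "50k": 50_000,
               "75k": 75_000, "100k": 100_000}


def build_gensim_baselines(input_path, out_dir):
    """Save one vocabulary-limited KeyedVectors file per size (hypothetical helper)."""
    from gensim.models import KeyedVectors  # third-party, imported lazily

    written = []
    for label, limit in VOCAB_SIZES.items():
        # limit=N keeps only the first N vectors of the word2vec file.
        model = KeyedVectors.load_word2vec_format(str(input_path), limit=limit)
        out_path = Path(out_dir) / f"w2v_gensim{label}.kv"
        model.save(str(out_path))
        written.append(out_path)
    return written


def build_quantized_models(train_data, out_dir, **params):
    """Train and quantize one fastText model per cutoff (hypothetical helper)."""
    import fasttext  # third-party, imported lazily

    written = []
    for label, cutoff in VOCAB_SIZES.items():
        # quantize() changes the model in place, so train a fresh one per cutoff.
        model = fasttext.train_supervised(input=str(train_data), **params)
        model.quantize(input=str(train_data), retrain=True, cutoff=cutoff)
        out_path = Path(out_dir) / f"quantized{label}.ftz"  # name is an assumption
        model.save_model(str(out_path))
        written.append(out_path)
    return written
```

Training once per cutoff is the simple but expensive choice; saving the trained model once and reloading it before each quantization would avoid the repeated training.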
May 26 2020, 10:55 PM · Machine-Learning-Team (Active Tasks), drafttopic-modeling

May 6 2020

Pavol86 added a comment to T247523: Compress Gensim models.

I went through the documentation for gensim and fasttext (very limited info), and a lot of other pages.. :) To proceed further: what is the goal of python-mwtext?

May 6 2020, 1:11 PM · Machine-Learning-Team (Active Tasks), drafttopic-modeling

May 4 2020

Pavol86 added a comment to T247523: Compress Gensim models.

according to Medium article (but this issue is common on forums) :

May 4 2020, 8:56 PM · Machine-Learning-Team (Active Tasks), drafttopic-modeling