
Pavol86 (Pavol Mulinka)
Data Scientist / Machine learning Enthusiast

Projects

User does not belong to any projects.

Today

  • Clear sailing ahead.

Tomorrow

  • Clear sailing ahead.

Thursday

  • Clear sailing ahead.

User Details

User Since
May 4 2020, 8:37 PM (37 w, 16 h)
Availability
Available
IRC Nick
pavol
LDAP User
Unknown
MediaWiki User
Pavol86 [ Global Accounts ]

Recent Activity

Thu, Jan 14

Pavol86 committed rOEQaa628903e953: updated test-requirements (authored by Pavol86).
updated test-requirements
Thu, Jan 14, 5:08 PM
Pavol86 committed rOEQ0c788dae17f4: update (authored by Pavol86).
update
Thu, Jan 14, 10:04 AM

Wed, Jan 13

Pavol86 committed rOEQf4ffad4e3fb5: updated tests (authored by Pavol86).
updated tests
Wed, Jan 13, 8:34 PM

Mon, Jan 4

Pavol86 committed rOEQ811202cbaa0a: update (authored by Pavol86).
update
Mon, Jan 4, 4:33 PM

Sat, Dec 26

Pavol86 committed rOEQ4222d4639ca3: added cjk features (authored by Pavol86).
added cjk features
Sat, Dec 26, 5:33 PM

Thu, Dec 24

Pavol86 committed rOEQ7778eaf31b1e: initial_commit_with_cjk_features (authored by Pavol86).
initial_commit_with_cjk_features
Thu, Dec 24, 5:46 PM

Aug 17 2020

Pavol86 updated Pavol86.
Aug 17 2020, 4:08 PM
Pavol86 added a comment to T238712: [Open question] How to more effectively detect spambot accounts?.

@leila, if this task is still open, I would like to help.

Aug 17 2020, 1:00 PM · ConfirmEdit (CAPTCHA extension), Research-Backlog

Jul 23 2020

Pavol86 added a comment to T111179: Tokenization of "word" things for CJK.

@VulpesVulpes825 thank you for the recommendation! I do not speak any of the languages, so I am "best guessing" all the way :). The CJK tokenization should end up as part of the deltas library - https://github.com/halfak/deltas . I prepared the code to be merged (pull request) into deltas and I have a call with @Halfak today. I will keep you updated.

Jul 23 2020, 9:26 AM · Machine Learning Platform (Current), Chinese-Sites, artificial-intelligence, revscoring
Pavol86 added a comment to T111179: Tokenization of "word" things for CJK.

FINAL NOTES (hopefully :) ):
Japanese:

  • I hadn't tried SudachiPy at first because I saw poor performance stats, but it is the only JP tokenizer that I was able to get running just by "pip install", without any additional instructions
  • SudachiPy model loads quickly:
jp_sudachi model load time: 0.03719305992126465
Sudachi provides three modes of splitting: in A mode, texts are divided into the shortest units (equivalent to the UniDic short unit); in B mode, into middle units; in C mode, it extracts named entities. There are three dictionaries:
Small: includes only the vocabulary of UniDic
Core: includes basic vocabulary (default)
Full: includes miscellaneous proper nouns
  • there is only a slight difference in tokenizer performance with each dict (small is slightly faster than core, etc.)

  • I recommend using the full dict with split mode A; to download and use the full dict:
pip install sudachidict_full
sudachipy link -t full
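A minimal sketch of the resulting usage, assuming SudachiPy and the full dict are installed as above (the sample sentence is illustrative):

from sudachipy import dictionary, tokenizer

# after "sudachipy link -t full" the full dict is the default;
# newer SudachiPy versions accept dictionary.Dictionary(dict="full") instead
tok = dictionary.Dictionary().create()
mode = tokenizer.Tokenizer.SplitMode.A  # shortest units, as recommended above
print([m.surface() for m in tok.tokenize("日本語の形態素解析", mode)])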
Jul 23 2020, 8:29 AM · Machine Learning Platform (Current), Chinese-Sites, artificial-intelligence, revscoring

Jul 15 2020

Pavol86 added a comment to T111179: Tokenization of "word" things for CJK.

@Halfak I need your feedback on the following. Following our call last week, I did the following:

  1. made the ch, jp, ko tokenizer decision more explicit in the code
  2. added "# noqa" to lines that need to be >85 chars, as a workaround for the flake8/pep8 check
  3. performance tests: ran the code 100-1000x on the same article and compared performance between the previous/new version, with CJK tokenization True/False

3.1 I tested the performance of the original deltas tokenizer; see the following boxplots - the y-axis marks the type of wiki and the type of text (EN wiki with EN text, Chinese wiki with Chinese text, ..., EN wiki with Chinese text, etc.)


3.2 I found out that the loading of the Chinese tokenizer model is a bottleneck, so I tested pkuseg, thulac, and jieba on Chinese wiki with Chinese text. Jieba is the only tokenizer that needs to be initialized just once and is then kept in memory; Pkuseg and Thulac take 2-3 s to initialize. I measured the model load of each tokenizer (including the Japanese and Korean ones).
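A minimal sketch of how such load times can be measured, assuming jieba and pkuseg are installed (the print format mirrors the jp_sudachi timing line above):

import time

import jieba
import pkuseg

# jieba initializes lazily; force the dictionary load once, it then stays in memory
start = time.time()
jieba.initialize()
print(f"zh_jieba model load time: {time.time() - start}")

# pkuseg loads its model on every construction (the 2-3 s cost noted above)
start = time.time()
seg = pkuseg.pkuseg()
print(f"zh_pkuseg model load time: {time.time() - start}")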

Jul 15 2020, 2:21 PM · Machine Learning Platform (Current), Chinese-Sites, artificial-intelligence, revscoring

Jul 8 2020

Pavol86 added a comment to T111179: Tokenization of "word" things for CJK.

NOTE 1: I have some issues with creating a pull request for the deltas package; we should be able to resolve it with Aaron...
NOTE 2: this is the application + explanation of the code. I thank @jeena and @VulpesVulpes825 for their ideas; we will check the performance of other dictionaries/tools, but for now I wanted to have a working basic CJK tokenizer..

Jul 8 2020, 11:16 PM · Machine Learning Platform (Current), Chinese-Sites, artificial-intelligence, revscoring

Jun 21 2020

Pavol86 added a comment to T111179: Tokenization of "word" things for CJK.
Note

Export from a Jupyter notebook. After I struggled with JapaneseTokenizers I decided to get at least one working - the hardest part is to find a free segmented dataset; now I know why this task is still open :) - I forgot that nothing is free in Japan and open source is a very new concept there.. (my exp. ...)

Jun 21 2020, 8:31 PM · Machine Learning Platform (Current), Chinese-Sites, artificial-intelligence, revscoring

Jun 19 2020

Pavol86 added a comment to T111179: Tokenization of "word" things for CJK.

@VulpesVulpes825 thank you for your response. Do you know if there are any datasets/dictionaries that are used for benchmarking tokenization methods?

Jun 19 2020, 2:27 PM · Machine Learning Platform (Current), Chinese-Sites, artificial-intelligence, revscoring

Jun 17 2020

Pavol86 added a comment to T111179: Tokenization of "word" things for CJK.

I can see that this thread is quite old, so first of all just a couple of notes that we can talk about with Aaron on the Thursday call..

Jun 17 2020, 3:02 PM · Machine Learning Platform (Current), Chinese-Sites, artificial-intelligence, revscoring

Jun 8 2020

Pavol86 added a comment to T247523: Compress Gensim models.

Results of dimensionality reduction on the testing (sampled) dataset:

| model name | quantized | vocabulary cutoff | dimensions | retrain | qnorm | f1 micro | f1 macro |
| enwiki.articletopic | True | 10k | 10 | False | False | 0.639 | 0.45 |
| enwiki.articletopic | True | 10k | 10 | True | False | 0.649 | 0.468 |
| enwiki.articletopic | True | 10k | 10 | False | True | 0.639 | 0.451 |
| enwiki.articletopic | True | 10k | 10 | True | True | 0.65 | 0.468 |
| enwiki.articletopic | True | 10k | 25 | False | False | 0.693 | 0.54 |
| enwiki.articletopic | True | 10k | 25 | True | False | 0.698 | 0.557 |
| enwiki.articletopic | True | 10k | 25 | False | True | 0.693 | 0.54 |
| enwiki.articletopic | True | 10k | 25 | True | True | 0.698 | 0.557 |
| enwiki.articletopic | True | 10k | 50 | False | False | 0.703 | 0.558 |
| enwiki.articletopic | True | 10k | 50 | True | False | 0.706 | 0.573 |
| enwiki.articletopic | True | 10k | 50 | False | True | 0.704 | 0.558 |
| enwiki.articletopic | True | 10k | 50 | True | True | 0.706 | 0.576 |
| enwiki.drafttopic | True | 10k | 10 | False | False | 0.598 | 0.406 |
| enwiki.drafttopic | True | 10k | 10 | True | False | 0.61 | 0.425 |
| enwiki.drafttopic | True | 10k | 10 | False | True | 0.598 | 0.407 |
| enwiki.drafttopic | True | 10k | 10 | True | True | 0.611 | 0.426 |
| enwiki.drafttopic | True | 10k | 25 | False | False | 0.651 | 0.493 |
| enwiki.drafttopic | True | 10k | 25 | True | False | 0.656 | 0.508 |
| enwiki.drafttopic | True | 10k | 25 | False | True | 0.652 | 0.494 |
| enwiki.drafttopic | True | 10k | 25 | True | True | 0.654 | 0.507 |
| enwiki.drafttopic | True | 10k | 50 | False | False | 0.662 | 0.512 |
| enwiki.drafttopic | True | 10k | 50 | True | False | None | None |
| enwiki.drafttopic | True | 10k | 50 | False | True | 0.663 | 0.51 |
| enwiki.drafttopic | True | 10k | 50 | True | True | None | None |

Jun 8 2020, 8:12 AM · Machine Learning Platform (Current), drafttopic-modeling

May 30 2020

Pavol86 updated subscribers of T247523: Compress Gensim models.

@aaron, I need help with understanding the results, and with whether what I did to test vocabulary sizes makes sense.
(in the following I use brackets to list file name options)
(i) Does what I did make sense?

  • As I mentioned in the previous update, I extracted the word2vec vocab with sizes [10k,25k,50k,75k,100k] using the gensim limit parameter and then the fasttext quantize function => this created 10 word2vec datasets named w2v_[gensim,quantized][10k,25k,50k,75k,100k].kv (see the sketch after this list)
  • I created a new file for each word2vec dataset in drafttopic\drafttopic\feature_lists : enwiki_[gs,qt][10k,25k,50k,75k,100k].py
  • I added new commands to drafttopic\Makefile
  • I ran the commands related to enwiki except "tuning" - I killed the process when I saw it uses grid search and would try 16 parameter combinations :)
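A sketch of the gensim vocabulary-limiting step described above (the input path is illustrative; output names follow the w2v_gensim[10k,...].kv convention):

from gensim.models import KeyedVectors

# keep only the `limit` most frequent words from the word2vec-format vectors
for limit in (10000, 25000, 50000, 75000, 100000):
    kv = KeyedVectors.load_word2vec_format("enwiki.w2v.txt", limit=limit)
    kv.save("w2v_gensim{}k.kv".format(limit // 1000))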

(ii) Results explanation

  • Statistics - UNDERSTAND
  • rates/match_rate/filter_rate - DON'T UNDERSTAND (fractions of the dataset?)
  • recall/precision/accuracy/f1/fpr/roc_auc/pr_auc - UNDERSTAND
  • !recall/!precision/!f1 - DON'T UNDERSTAND (what does the "!" mean?)

(iii) Summary
Differences between the classification stats of the 10k and 100k versions are small, but the quantized 10k word2vec dataset performs a bit better than the gensim-limited 100k dataset, e.g.:

| model name | f1 micro | f1 macro |
| enwiki.articletopic_gs100k | 0.681 | 0.512 |
| enwiki.articletopic_qt100k | 0.69 | 0.533 |
| enwiki.articletopic_gs10k | 0.672 | 0.504 |
| enwiki.articletopic_qt10k | 0.685 | 0.53 |
| enwiki.drafttopic_gs100k | None | None |
| enwiki.drafttopic_qt100k | 0.645 | 0.485 |
| enwiki.drafttopic_gs10k | 0.627 | 0.454 |
| enwiki.drafttopic_qt10k | 0.641 | 0.478 |
May 30 2020, 6:08 PM · Machine Learning Platform (Current), drafttopic-modeling

May 26 2020

Pavol86 added a comment to T247523: Compress Gensim models.

Short update

  • initial DraftTopic testing issues [resolved]: (i) module version issues, (ii) default MiniConda Python 3.8 issue - 3.8 has some issues with GenSim and other modules used in DraftTopic, (iii) calculations are memory/CPU intensive (I created a virtual env on a machine I don't use), (iv) the sample dataset was too big and downloading crashed multiple times - Aaron provided a smaller version
  • I created word2vec GenSim models according to python-mwtext : https://github.com/mediawiki-utilities/python-mwtext/blob/master/mwtext/utilities/word2vec2gensim.py
  • I created limit = [10k,25k,50k,75k,100k] dictionary-size models from the provided preprocessed text as a baseline with:
from gensim.models import KeyedVectors
model = KeyedVectors.load_word2vec_format(input_path, limit=limit)
  • I created quantized models with cutoff dictionaries

(again an issue with Python 3.8, so I ended up using the FastText command-line/terminal commands)

import fasttext

model = fasttext.train_supervised(input_path, **params)
model.quantize(input=train_data, retrain=True, cutoff=100000)  # one run per cutoff in [10k, 25k, 50k, 75k, 100k]

https://github.com/mediawiki-utilities/python-mwtext/blob/master/mwtext/utilities/learn_vectors.py

  • as I had so many issues with Python 3.8 I switched to 3.7.2 -> fewer issues
  • I added commands to the Makefile to create [10k,25k,50k,75k,100k] versions of enwiki.balanced_article_sample.w_article_text.json and enwiki.balanced_article_sample.w_draft_text.json
  • I created cached versions of the sampled datasets
  • I trained the models with the new balanced cached datasets - I hadn't noticed that the model training command calls "drafttopic.feature_lists.enwiki.drafttopic", which has a hardcoded word2vec GenSim model/dictionary (.kv) - "enwiki-20191201-learned_vectors.50_cell.100k.kv"
    • I followed the KISS principle ("keep it simple, stupid") and created one "drafttopic.feature_lists.enwiki" file per word2vec dictionary, e.g.:
drafttopic.feature_lists.enwiki_gs100k (referring to the gensim model limited to a 100k dict)
drafttopic.feature_lists.enwiki_qt100k (referring to the quantized model with cutoff=100k)
drafttopic.feature_lists.enwiki_gs75k 
...
May 26 2020, 10:55 PM · Machine Learning Platform (Current), drafttopic-modeling

May 6 2020

Pavol86 added a comment to T247523: Compress Gensim models.

I went through the documentation for gensim and fasttext (very limited info) and a lot of other pages.. :) To proceed further: what is the goal of python-mwtext?

May 6 2020, 1:11 PM · Machine Learning Platform (Current), drafttopic-modeling

May 4 2020

Pavol86 added a comment to T247523: Compress Gensim models.

According to a Medium article (but this issue is common on forums):

May 4 2020, 8:56 PM · Machine Learning Platform (Current), drafttopic-modeling