User Details
- User Since
- Apr 20 2021, 8:39 AM (170 w, 3 d)
- Roles
- Disabled
- IRC Nick
- tanny411
- LDAP User
- AKhatun
- MediaWiki User
- AKhatun (WMF)
Thu, Jun 27
Update week of 17 June - 23 June, 2024:
- Changed read/write methods to fix memory issues. This works for small and medium wikis; larger wikis (e.g. enwiki, jawiki) still hit additional errors.
- Refactor generate_anchor_dictionary script to modularize better.
Jun 17 2024
Update week of 10 June - 16 June, 2024:
- found and fixed bug causing precision loss
- updated dag to follow guidelines, fixed formatting and ran tests
- added shards in dag (to be used later)
- airflow-dags MR sent
- worked on fixing memory issues. converted a script to use Spark instead of pure Python.
- Rebase MR and make changes in dag (to incorporate new changes in research-datasets) [New changes not used yet]
Jun 11 2024
Update week of 03 June - 09 June, 2024:
- fixed pre-commit errors
- find bug that causes drop in performance
May 31 2024
Update week of 27 May - 02 June, 2024:
- Finish debugging. Fix dag and research-datasets code. Push updated code to both repos.
- fix moving embedding to hdfs
- fix ICU import
- push MR
May 23 2024
Update week of 20 - 26 May, 2024:
- Debug and fix code to run airflow dag
May 19 2024
Update week of 13 - 19 May, 2024:
May 11 2024
Update week of 6 - 12 May, 2024:
- successfully ran test airflow dag, included a main file and fsspec for testing
- created all required functions in main file.
- creating dag for the entire pipeline.
May 2 2024
Update week of 29 April - 5 May, 2024:
- Set up code in research_datasets, create a test file
- Chat with Fabian on how to connect code to airflow_dags repo
- Set up airflow dags repo properly and resolve issues setting up a test dag
Apr 28 2024
Update week of 22-28 April, 2024:
- Refactor code and add CLI arguments.
- Set up dev airflow instances.
Apr 20 2024
Update week 15 to 21 April 2024:
- Discussed how to change our code to Airflow friendly version
- Identifying changes and decisions to be made wrt add-a-link repo
Apr 15 2024
Update week 8 to 14 April 2024:
- Went over airflow and research dataset repos
- Sketched an overview of our current code base workflow and a few research airflow repos.
Apr 6 2024
Apr 5 2024
The exploratory part of link-recommendation for add-a-link is done.
Apr 4 2024
Update 1/4/2024 - 7/4/2024:
- Update MR according to comments
- Update meta for project
- Run an experiment with 100k cap of samples on ALL wikis.
- Start running another experiment with 1M cap of samples on ALL wikis.
- Discuss Airflow with Fabian.
Mar 30 2024
Update 25/3/2024 - 31/3/2024:
- Modify and push code. Collect all wikis data.
- Train a model on all wikis
- Discuss possible pipeline solutions
Mar 22 2024
Update 18/3/2024 - 24/3/2024:
- Clean and upload main and secondary wiki results. Add single language result for comparison.
- Prepare and run script to gather data for all languages (anchor dicts etc)
- Clean code to push as an intermediate stage of language agnostic modeling
Mar 17 2024
Update 11/3/2024 - 17/3/2024:
- Scale the model to 50 languages: run the pipeline for 50 languages and train a model with at most 100k samples per language. Used fallback chains to select the languages at the center.
- Test on a different set of 50 languages (randomly chosen) in a 0-shot manner and compare performance.
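The per-language sample cap mentioned above could be sketched roughly as follows. This is only an illustration, not the project's code: the function name `cap_samples_per_language`, the `(wiki_db, sample)` record shape, and the fixed random seed are all assumptions.

```python
import random
from collections import defaultdict


def cap_samples_per_language(records, cap=100_000, seed=0):
    """Keep at most `cap` randomly chosen samples for each wiki.

    `records` is an iterable of (wiki_db, sample) pairs; this record
    shape is assumed for illustration only.
    """
    by_wiki = defaultdict(list)
    for wiki_db, sample in records:
        by_wiki[wiki_db].append(sample)

    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    capped = {}
    for wiki_db, samples in by_wiki.items():
        if len(samples) > cap:
            # Subsample without replacement down to the cap.
            samples = rng.sample(samples, cap)
        capped[wiki_db] = samples
    return capped
```

Wikis with fewer samples than the cap are kept whole, so small wikis are not dropped, only large ones are down-sampled.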
Mar 8 2024
Update 4/3/2024 - 10/3/2024:
- Finished grid search and stored best fit model
- Add mwtokenizer in one more place, fix code to accommodate wiki_db feature, fix label encoder, push draft MR
Mar 3 2024
Update 26/2/2024 - 3/3/2024:
- Trained a combined model with all data and with stratified split
- Hyperparameter tuning in progress
Feb 23 2024
Update 19/2/2024 - 25/2/2024:
- Replace w2v with outlink embedding, created baseline and ran 11 test wikis, MR sent
- Training and evaluating a combined model with all test language wikis
- with and without wiki_db feature
Feb 18 2024
Update 29/01/2024 - 04/02/2024:
- Understanding the feature generation and model training component of link-recommendation model
- Tested language performance of several models by changing the model's language. Tested multilingual models 2 languages at a time.
Jan 29 2024
Jan 27 2024
Update week 22/1/2024 - 28/1/2024:
- Fixed regex that was causing a lot of the models to have low-recall
- MR sent
Jan 19 2024
Results of evaluations after solving this ticket can be found here: https://meta.wikimedia.org/wiki/Research:Improving_multilingual_support_for_link_recommendation_model_for_add-a-link_task#Results.
Update week 15/1/2024 - 21/1/2024:
- MR sent to fix unicode errors. Multiple languages tested.
- Tested all previously failed languages. wikipedia2vec==2.0.0 introduces a new IndexError that occurs in several languages.
- Reverted to 2 venvs. This time conda has w2v==2.0.0 for jawiki and fywiki. venv has w2v==1.0.5 for rest of the languages.
- Sent MR4
Jan 14 2024
Update week 8/1/2024 - 14/1/2024:
- Test and fix jawiki error by adding required dependencies.
- Attempt to fix Unicode errors in zhwiki and fywiki (using different version of Wikipedia2Vec)
Jan 7 2024
Update week 1/1/2024 - 7/1/2024:
- mwtokenizer MR merged, new version released
- link-recommendation MR updated and refactored to integrate new mwtokenizer
- Ran non-WS languages and some previously failed languages. There were some improvements. More debugging required.
- MR merged
Dec 23 2023
Update week 19/12/2023 - 24/12/2023:
- Make changes in mwtokenizer
- replace ▁ with " " in the tokenizer
- separate punctuation from tokens.
- Sent MR
- In progress:
- Use the updated mwtokenizer to improve link-recommendation.
- Refactor and consolidate ngram functions in link-recommendation code.
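The two tokenizer changes listed above (replacing ▁ with a space, and separating punctuation from tokens) could look something like the sketch below. It is not mwtokenizer's actual implementation: `postprocess_tokens` is a hypothetical helper, ▁ is assumed to be the SentencePiece-style whitespace marker (U+2581), and "punctuation" is assumed to mean any character in a Unicode `P*` general category.

```python
import unicodedata

WS_MARKER = "\u2581"  # "▁", assumed to be a SentencePiece-style whitespace marker


def postprocess_tokens(tokens):
    """Replace the whitespace marker with a space and split punctuation off.

    Hypothetical helper for illustration; not mwtokenizer's real code.
    """
    out = []
    for token in tokens:
        token = token.replace(WS_MARKER, " ")
        buf = ""
        for ch in token:
            # Unicode general categories starting with "P" are punctuation.
            if unicodedata.category(ch).startswith("P"):
                if buf:
                    out.append(buf)
                    buf = ""
                out.append(ch)  # each punctuation mark becomes its own token
            else:
                buf += ch
        if buf:
            out.append(buf)
    return out
```

Joining the returned tokens still reproduces the text (with the marker turned into a space), which is one property such a post-processing step would probably want to keep.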
Dec 20 2023
Update week 11/12/2023 - 17/12/2023:
- Fix sentence tokenization errors in link-recommendation. Send MR. Improves bowiki, but no improvement in mywiki. WS languages remain the same.
- Some analysis into the cause of the issues above.
Dec 10 2023
Update week 27/11/2023 - 3/12/2023:
- mwtokenizer issues resolved. MR merged.
- Read through link recommendation docs
- Pull code, set up dev env, run code for test wikis.
- Some errors reported and fixed (T352525)
Nov 28 2023
Nov 24 2023
Update week 20/11/2023 - 26/11/2023:
- Revised MR for issue 38. Merged after few iterations.
- Working on issue 32. Pushed.
Nov 18 2023
Update week 13/11/2023 - 19/11/2023:
- Discussed Issue 32 and 38.
- Pushed code for Issue 38
- Make diff for 38
- Create new issues for edge cases found while solving 38
Nov 12 2023
Update week 6/11/2023 - 12/11/2023:
- Issue 37 solved and merged.
- Start looking at Issue 38 and 32
Nov 5 2023
Update week 30/10/2023 - 5/11/2023:
- Working on Issue 37. Dissecting regex.
Oct 30 2023
Update week 23/10/2023 - 29/10/2023:
- Done finding sentence terminator symbols using the sentence per paragraph method. Also listed more symbols from Terminal_punctuation. See DOC for details. MR 21 sent, merged.
- Discussed Issue 37 and 9. MR merged for Issue 9.
- Started Issue 37
Oct 20 2023
Update week 16/10/2023 - 22/10/2023:
Oct 15 2023
Update week 9/10/2023 - 15/10/2023:
- Working on updating sentence tokenizer for non white space languages
Oct 6 2023
Update week 2/10/2023 - 8/10/2023:
- 1st MR merged
- Discussed how evaluation was set up and next steps.
- Researched and took notes for non-whitespace separated languages. Doc.
- Wikimedia Connect!!
Sep 30 2023
Update week 25/09/2023 - 1/10/2023:
Sep 28 2023
Thanks @colewhite. I'm all set!
Sep 25 2023
Update week 18/09/2023 - 24/09/2023:
- Read onboarding docs to familiarize with mwtokenizer and add-a-link tasks: link-recommendation-2023_work-plan
- Going over mwtokenizer code base
- Issues setting up:
- Don’t have kinit access yet
- Cannot access Jupyter either
- Issues setting up venv in personal pc (couldn’t solve)
- Created venv on stat1008 but hit an error while installing. [Resolved: needed proxy]
Sep 23 2023
I am getting this error when I kinit
kinit: Client 'akhatun@WIKIMEDIA' not found in Kerberos database while getting initial credentials
Am I supposed to get a temporary password through email?
Sep 18 2023
Jun 30 2023
Week 26/6/23 - 2/7/23 Update:
- Listed and analyzed redirects in Wiktionary.
Jun 29 2023
Week 19/6/23 - 25/6/23 Update:
Jun 19 2023
Week 12/6/23 - 18/6/23 Update:
- Finished report (Images not added yet)
- MR11 sent for readme and figures
Jun 11 2023
Week 5/6/23 - 11/6/23 Update:
- Fixed template parsing error, MR10 got merged.
- Started working on report
Jun 9 2023
Jun 3 2023
Week 29/5/23 - 4/6/23 Update:
- Fixed template parsing to accommodate the use of lang param in template
- Parsed and saved all language wiktionary misspelling.
- Did some analysis, de-duplication, and saved all_wiki combined wiktionary misspellings.
- More errors found in template parsing (named params occur before un-named params causing incorrect parsing)
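The named-before-unnamed parameter problem noted above can be illustrated with a small sketch. It is not the project's parser: `parse_template_call` is a hypothetical helper that handles only flat template calls (no nested templates or piped links), and the example template call in the usage note is made up for illustration.

```python
def parse_template_call(wikitext):
    """Split a flat template call like {{name|a|k=v|b}} into its parts.

    Returns (name, positional, named). Positional arguments are numbered by
    their order among the unnamed arguments only, so a named argument that
    appears first does not shift them. Nested templates and piped links are
    not handled; this is an illustration only.
    """
    inner = wikitext.strip()
    if not (inner.startswith("{{") and inner.endswith("}}")):
        raise ValueError("not a template call")
    parts = inner[2:-2].split("|")
    name = parts[0].strip()
    positional, named = [], {}
    for part in parts[1:]:
        key, sep, value = part.partition("=")
        if sep:
            # "key=value" is a named parameter wherever it appears.
            named[key.strip()] = value.strip()
        else:
            positional.append(part.strip())
    return name, positional, named
```

With this approach, `{{misspelling of|lang=en|receive}}` yields `receive` as the first positional argument even though `lang=en` comes before it, whereas indexing the raw `|`-separated parts by position would pick up `lang=en` instead.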
May 28 2023
Week 22/5/23 - 28/5/23 Update:
- Updated MR9 with summary
- Created Issue 14. Changed wiktionary parser script to make it work with all languages. Need to figure out some changes in template params.
May 22 2023
Week 15/5/23 - 21/5/23 Update:
- Checked example uses of "misspelling of" templates in all 16 collected languages. All languages look similar to enwiktionary except trwiktionary (small change) and viwiktionary
May 12 2023
Week 8/5/23 - 14/5/23 Update:
- Created Issues 12 and 13. Started working on them: identify misspelling of templates in other languages and find usage of these templates. The templates would be collected from Q50368067, misspelling of named templates in other languages, and their redirects.
Week 1/5/23 - 7/5/23 Update:
- Incorporated feedback and had MR7 merged (refactor repo)
- Analysis done on extracted misspellings, sent MR8. Based on feedback, some more analysis done.
Apr 30 2023
Week 24/4/23 - 30/4/23 Update:
- Incorporate Isaac's feedback for MR5 and 6. All MRs merged after some editing and discussion.
- Created MR7 to refactor the repo
- extracted misspelling from all language wikipedias.
- Todo: analysis on extracted data
Apr 24 2023
Week 10/4/23 - 16/4/23 Update:
- Add info on language detection (language, confidence, text sent to model). Analyze examples.
- Add proxy tables: tables that were not detected by mwparserfromhell.
- Separate cell data of tables: each cell in table is now a node. Stuck with cell data/paragraph text to send to model.
Apr 8 2023
Week 3/4/23 - 9/4/23 Update:
- Pushed revised code that includes all additional formatting as a list (as discussed).
- Fixed quotations detected. Added fasttext language detection.
- Analysed collected misspellings from context. Some work needs to be done to increase the precision of the detected language.
Apr 1 2023
Week 27/3/23 - 2/4/23 Update:
- Apply additional filter information to extracted misspellings: Capitalization, word length, part of a list item, inside of quotations (in any language)
- Still need to figure out the data's structure and add fasttext detected language information
Mar 25 2023
Week 20/3/23 - 26/3/23 Update:
Mar 18 2023
Week 13/3/23 - 19/3/23 Update:
Mar 11 2023
Week 6/3/23 - 12/3/23 Update:
- Compared collected en and fr misspellings with AutowikiBrowser Typo list. Merge requested. Summary here
- Started working on extracting wikipedia text to find the ratio of misspellings
Mar 4 2023
Week 27/2/23 - 5/3/23 Update:
- Address comments for Issue #5
- Parse sections line by line, consider templates in # items (numbered list)
- Count the number of definitions by # count, excluding ## #: #; and #*
- Also change the data format a bit to make it more readable
- To address Issue 6: get list of misspellings from another Language and compare the collected lists to existing approaches
- collected bnwiktionary templates. It does not have many Bangla words; its content is the same as what is present in enwiktionary. Will work with the existing collected Spanish misspellings instead.
- for English, compared collected list with enwiki Lists_of_common_misspellings
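The definition-counting rule from the list above (count `#` lines, excluding `##`, `#:`, `#;` and `#*`) can be written as a short sketch. The function name `count_definitions` and the assumption that the section text arrives as one newline-separated string are illustrative, not taken from the project code.

```python
import re

# A definition line starts with a single "#" that is not immediately
# followed by "#", ":", ";" or "*" (the excluded prefixes listed above).
DEFINITION_LINE = re.compile(r"^#(?![#:;*])")


def count_definitions(section_text):
    """Count top-level numbered-list items in a wikitext section."""
    return sum(
        1 for line in section_text.splitlines() if DEFINITION_LINE.match(line)
    )
```

A negative lookahead keeps the rule in one pattern, so adding another excluded prefix later only means extending the character class.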
Mar 3 2023
Feb 25 2023
Week 20/2/23 - 26/2/23 Update:
Feb 18 2023
Week 13/2/23 - 19/2/23 Update:
Feb 16 2023
Week 6/2/23 - 12/2/23 Update:
- Set up jupyter notebook (fix issues with getting spark3)
- Get list of enwiktionary pages that use the misspelling_of template using the following tables:
- mediawiki_templatelinks, mediawiki_linktarget, mediawiki_wikitext_current
- Parsed enwiktionary pages to get heading name (typically POS: Noun, Adj, etc), language of misspelling, and the correct spelling from the template
- Some analysis on parsed wikis to get language and heading distribution
Feb 9 2023
Thank you, accessed!
Feb 6 2023
Week 1/2/23 - 5/2/23 Update:
- Caught up on previous work on copy editing both in research team and growth team
- Learned about templates in Wiktionary in different languages and the possible categories they may be in
Feb 3 2023
Jul 11 2022
Jul 8 2022
Jul 7 2022
Update:
I tested a few options in the statbox, I am not sure how much this will represent the prod env, but here goes: