Mon, Apr 15

AKhatun_WMF added a comment to T361926: Improve training and inference pipeline for multilingual link recommendation model.

Update week 8 to 14 April 2024:

  • Went over airflow and research dataset repos
  • Sketched an overview of our current code base workflow and a few research airflow repos.
Mon, Apr 15, 3:46 AM · Research (FY2023-24-Research-April-June)

Sat, Apr 6

AKhatun_WMF closed T354659: Exploratory work on language-agnostic model for link recommendation for add-a-link, a subtask of T342526: Improving multilingual support for link recommendation model for add-a-link task, as Resolved.
Sat, Apr 6, 12:00 AM · address-knowledge-gaps, Epic, Research
AKhatun_WMF closed T354659: Exploratory work on language-agnostic model for link recommendation for add-a-link as Resolved.
Sat, Apr 6, 12:00 AM · Research (FY2023-24-Research-January-March)

Fri, Apr 5

AKhatun_WMF added a comment to T354659: Exploratory work on language-agnostic model for link recommendation for add-a-link.

The exploratory part of link-recommendation for add-a-link is done.

Fri, Apr 5, 5:14 AM · Research (FY2023-24-Research-January-March)

Thu, Apr 4

AKhatun_WMF added a comment to T354659: Exploratory work on language-agnostic model for link recommendation for add-a-link.

Update 1/4/2024 - 7/4/2024:

  • Update MR according to comments
  • Run an experiment with 100k cap of samples on ALL wikis.
  • Start running another experiment with 1M cap of samples on ALL wikis.
  • Discuss Airflow with Fabian.
Thu, Apr 4, 7:00 PM · Research (FY2023-24-Research-January-March)

Sat, Mar 30

AKhatun_WMF added a comment to T354659: Exploratory work on language-agnostic model for link recommendation for add-a-link.

Update 25/3/2024 - 31/3/2024:

  • Modify and push code. Collect all wikis data.
  • Train a model on all wikis
  • Discuss possible pipeline solutions
Sat, Mar 30, 5:11 PM · Research (FY2023-24-Research-January-March)

Fri, Mar 22

AKhatun_WMF added a comment to T354659: Exploratory work on language-agnostic model for link recommendation for add-a-link.

Update 18/3/2024 - 20/3/2024:

  • Clean and upload main and secondary wiki results. Add single language result for comparison.
  • Prepare and run script to gather data for all languages (anchor dicts etc)
  • Clean code to push as an intermediate stage of language agnostic modeling
Fri, Mar 22, 3:00 PM · Research (FY2023-24-Research-January-March)

Mar 17 2024

AKhatun_WMF added a comment to T354659: Exploratory work on language-agnostic model for link recommendation for add-a-link.

Update 11/3/2024 - 17/3/2024:

  • Scale the model to 50 languages: Run pipeline for 50 languages and train a model with max 100k samples per language. Used fallback chains to select languages at the center.
  • Test on a different set of 50 languages (randomly chosen) in a 0-shot manner and compare performance.
Mar 17 2024, 4:37 PM · Research (FY2023-24-Research-January-March)
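
The per-language cap mentioned above can be sketched roughly as follows. This is only an illustration, not the pipeline's actual code: the function name `cap_per_language`, the `(wiki_db, example)` input shape, and random subsampling as the capping strategy are all assumptions.

```python
import random
from collections import defaultdict


def cap_per_language(samples, cap, seed=0):
    """Group (wiki_db, example) pairs by wiki and keep at most `cap` per wiki.

    Hypothetical helper: wikis above the cap are randomly subsampled,
    wikis below it are kept whole.
    """
    by_wiki = defaultdict(list)
    for wiki_db, example in samples:
        by_wiki[wiki_db].append(example)

    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    capped = {}
    for wiki_db, examples in by_wiki.items():
        if len(examples) > cap:
            examples = rng.sample(examples, cap)
        capped[wiki_db] = examples
    return capped


# Toy data: one wiki above the cap, one below it.
samples = [("enwiki", i) for i in range(10)] + [("bnwiki", i) for i in range(3)]
capped = cap_per_language(samples, cap=5)
```

With a cap of 100k, large wikis would contribute at most 100k samples each while small wikis would contribute everything they have.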

Mar 8 2024

AKhatun_WMF added a comment to T354659: Exploratory work on language-agnostic model for link recommendation for add-a-link.

Update 4/3/2024 - 10/3/2024:

  • Finished grid search and stored best fit model
  • Add mwtokenizer in one more place, fix code to accommodate wiki_db feature, fix label encoder, push draft MR
Mar 8 2024, 7:12 PM · Research (FY2023-24-Research-January-March)

Mar 3 2024

AKhatun_WMF added a comment to T354659: Exploratory work on language-agnostic model for link recommendation for add-a-link.

Update 26/2/2024 - 3/3/2024:

  • Trained a combined model with all data and with stratified split
  • Hyperparameter tuning in progress
Mar 3 2024, 3:26 PM · Research (FY2023-24-Research-January-March)
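
A stratified split like the one mentioned above could look roughly like this. The update does not say which column the split was stratified on; grouping by `wiki_db` is an assumption here, and `stratified_split` is a hypothetical helper written with the standard library only.

```python
import random
from collections import defaultdict


def stratified_split(rows, key, test_fraction=0.2, seed=0):
    """Split rows into train/test so each group keeps the same test fraction.

    `key` maps a row to its stratum (assumed here: the row's wiki_db).
    """
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)

    rng = random.Random(seed)
    train, test = [], []
    for _, members in sorted(groups.items()):
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


# Toy rows for two wikis of different sizes.
rows = [{"wiki_db": "enwiki", "label": i % 2} for i in range(10)]
rows += [{"wiki_db": "bnwiki", "label": i % 2} for i in range(5)]
train, test = stratified_split(rows, key=lambda r: r["wiki_db"])
```

Splitting inside each group keeps small wikis represented in both sets instead of leaving their share to chance.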

Feb 23 2024

AKhatun_WMF added a comment to T354659: Exploratory work on language-agnostic model for link recommendation for add-a-link.

Update 19/2/2024 - 25/2/2024:

  • Replace w2v with outlink embedding, created baseline and ran 11 test wikis, MR sent
  • Training and evaluating a combined model with all test language wikis
    • with and without wiki_db feature
Feb 23 2024, 5:18 PM · Research (FY2023-24-Research-January-March)

Feb 18 2024

AKhatun_WMF added a comment to T354659: Exploratory work on language-agnostic model for link recommendation for add-a-link.

Update 29/01/2024 - 04/02/2024:

  • Understanding the feature generation and model training component of link-recommendation model
  • Tested language performance of several models by changing the model's language. Tested multilingual models 2 languages at a time.
Feb 18 2024, 2:57 PM · Research (FY2023-24-Research-January-March)

Jan 29 2024

AKhatun_WMF closed T347696: Improving language-dependent models for add-a-link, a subtask of T309263: Support languages whose add-a-link models were not published, as Resolved.
Jan 29 2024, 2:22 PM · MoveComms-Support (Oct-Dec-2023), Chinese-Sites, Machine-Learning-Team, Growth-Team, Add-Link
AKhatun_WMF closed T347696: Improving language-dependent models for add-a-link, a subtask of T342526: Improving multilingual support for link recommendation model for add-a-link task, as Resolved.
Jan 29 2024, 2:22 PM · address-knowledge-gaps, Epic, Research
AKhatun_WMF closed T347696: Improving language-dependent models for add-a-link as Resolved.
Jan 29 2024, 2:22 PM · Research (FY2023-24-Research-January-March)

Jan 27 2024

AKhatun_WMF added a comment to T347696: Improving language-dependent models for add-a-link.

Update week 22/1/2024 - 28/1/2024:

  • Fixed regex that was causing a lot of the models to have low recall
  • MR sent
Jan 27 2024, 5:40 AM · Research (FY2023-24-Research-January-March)

Jan 19 2024

AKhatun_WMF added a comment to T347696: Improving language-dependent models for add-a-link.

Results of evaluations after solving this ticket can be found here: https://meta.wikimedia.org/wiki/Research:Improving_multilingual_support_for_link_recommendation_model_for_add-a-link_task#Results.

Jan 19 2024, 4:17 PM · Research (FY2023-24-Research-January-March)
AKhatun_WMF updated the task description for T347696: Improving language-dependent models for add-a-link.
Jan 19 2024, 4:16 PM · Research (FY2023-24-Research-January-March)
AKhatun_WMF updated the task description for T347696: Improving language-dependent models for add-a-link.
Jan 19 2024, 4:04 PM · Research (FY2023-24-Research-January-March)
AKhatun_WMF added a comment to T347696: Improving language-dependent models for add-a-link.

Update week 15/1/2024 - 21/1/2024:

  • MR sent to fix unicode errors. Multiple languages tested.
  • Tested all previously failed languages. wikipedia2vec==2.0.0 introduces a new IndexError that occurs in several languages.
    • Reverted to 2 venvs. This time conda has w2v==2.0.0 for jawiki and fywiki. venv has w2v==1.0.5 for the rest of the languages.
    • Sent MR4
Jan 19 2024, 3:41 PM · Research (FY2023-24-Research-January-March)

Jan 14 2024

AKhatun_WMF added a comment to T347696: Improving language-dependent models for add-a-link.

Update week 8/1/2024 - 14/1/2024:

  • Test and fix jawiki error by adding required dependencies.
  • Attempt to fix Unicode errors in zhwiki and fywiki (using different version of Wikipedia2Vec)
Jan 14 2024, 6:50 PM · Research (FY2023-24-Research-January-March)

Jan 7 2024

AKhatun_WMF added a comment to T347696: Improving language-dependent models for add-a-link.

Update week 1/1/2024 - 7/1/2024:

  • mwtokenizer MR merged, new version released
  • link-recommendation MR updated and refactored to integrate new mwtokenizer
  • Ran non-WS languages and some previously failed languages. There were some improvements. More debugging required.
  • MR merged
Jan 7 2024, 7:02 PM · Research (FY2023-24-Research-January-March)

Dec 23 2023

AKhatun_WMF added a comment to T347696: Improving language-dependent models for add-a-link.

Update week 19/12/2023 - 24/12/2023:

  • Make changes in mwtokenizer
    • replace ▁ with " " in the tokenizer
    • separate punctuation from tokens.
    • Sent MR
  • In progress:
    • Use the updated mwtokenizer to improve link-recommendation.
    • Refactor and consolidate ngram functions in link-recommendation code.
Dec 23 2023, 12:34 AM · Research (FY2023-24-Research-January-March)
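
The two tokenizer changes listed above (replacing ▁ with a space, and separating punctuation from tokens) can be illustrated with the following sketch. It is not the mwtokenizer implementation: the function names are hypothetical, and treating every character in a Unicode "P*" category as punctuation is an assumption.

```python
import unicodedata

SP_SPACE = "\u2581"  # the "▁" whitespace marker found in subword tokens


def split_punctuation(token):
    """Split a token so each punctuation character becomes its own piece."""
    pieces, current = [], ""
    for ch in token:
        if unicodedata.category(ch).startswith("P"):
            if current:
                pieces.append(current)
                current = ""
            pieces.append(ch)
        else:
            current += ch
    if current:
        pieces.append(current)
    return pieces


def postprocess(tokens):
    """Replace the ▁ marker with a plain space, then separate punctuation."""
    result = []
    for tok in tokens:
        result.extend(split_punctuation(tok.replace(SP_SPACE, " ")))
    return result


cleaned = postprocess(["\u2581Hello,", "\u2581world!"])
# cleaned == [" Hello", ",", " world", "!"]
```

Because no characters are dropped, joining the resulting pieces reproduces the text with ordinary spaces.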

Dec 20 2023

AKhatun_WMF added a comment to T347696: Improving language-dependent models for add-a-link.

Update week 11/12/2023 - 17/12/2023:

  • Fix sentence tokenization errors in link-recommendation. Send MR. Improves bowiki, but no improvement in mywiki. WS languages remain same.
  • Some analysis into the cause of the issues above.
Dec 20 2023, 6:18 AM · Research (FY2023-24-Research-January-March)

Dec 10 2023

AKhatun_WMF added a comment to T347696: Improving language-dependent models for add-a-link.

Update week 27/11/2023 - 3/12/2023:

  • mwtokenizer issues resolved. MR merged.
  • Read through link recommendation docs
  • Pull code, set up dev env, run code for test wikis.
  • Some errors reported and fixed (T352525)
Dec 10 2023, 4:09 AM · Research (FY2023-24-Research-January-March)

Nov 28 2023

AKhatun_WMF closed T346798: Fix issues in mwtokenizer package blocking usage in link recommendation model as Resolved.
Nov 28 2023, 5:27 PM · Research (FY2023-24-Research-October-December)
AKhatun_WMF closed T346798: Fix issues in mwtokenizer package blocking usage in link recommendation model, a subtask of T342526: Improving multilingual support for link recommendation model for add-a-link task, as Resolved.
Nov 28 2023, 5:27 PM · address-knowledge-gaps, Epic, Research

Nov 24 2023

AKhatun_WMF added a comment to T346798: Fix issues in mwtokenizer package blocking usage in link recommendation model.

Update week 20/11/2023 - 26/11/2023:

  • Revised MR for Issue 38. Merged after a few iterations.
  • Working on issue 32. Pushed.
Nov 24 2023, 8:47 PM · Research (FY2023-24-Research-October-December)

Nov 18 2023

AKhatun_WMF added a comment to T346798: Fix issues in mwtokenizer package blocking usage in link recommendation model.

Update week 13/11/2023 - 19/11/2023:

  • Discussed Issue 32 and 38.
  • Pushed code for Issue 38
  • Make diff for 38
  • Create new issues for edge cases found while solving 38
Nov 18 2023, 10:18 PM · Research (FY2023-24-Research-October-December)

Nov 12 2023

AKhatun_WMF added a comment to T346798: Fix issues in mwtokenizer package blocking usage in link recommendation model.

Update week 6/11/2023 - 12/11/2023:

  • Issue 37 solved and merged.
  • Start looking at Issue 38 and 32
Nov 12 2023, 1:47 AM · Research (FY2023-24-Research-October-December)

Nov 5 2023

AKhatun_WMF added a comment to T346798: Fix issues in mwtokenizer package blocking usage in link recommendation model.

Update week 30/10/2023 - 5/11/2023:

  • Working on Issue 37. Dissecting regex.
Nov 5 2023, 7:39 PM · Research (FY2023-24-Research-October-December)

Oct 30 2023

AKhatun_WMF added a comment to T346798: Fix issues in mwtokenizer package blocking usage in link recommendation model.

Update week 23/10/2023 - 29/10/2023:

  • Done finding sentence terminator symbols using the sentence per paragraph method. Also listed more symbols from Terminal_punctuation. See DOC for details. MR 21 sent, merged.
  • Discussed Issue 37 and 9. MR merged for Issue 9.
  • Started Issue 37
Oct 30 2023, 12:00 PM · Research (FY2023-24-Research-October-December)

Oct 20 2023

AKhatun_WMF added a comment to T346798: Fix issues in mwtokenizer package blocking usage in link recommendation model.

Update week 16/10/2023 - 22/10/2023:

  • Done Issue 41 to move Cree off of non-WS language list
  • Worked on Issue 40: Find the list of languages where most paragraphs have 1 sentence, analyze a few random wiki pages, and detect missing sentence-ending punctuation, if any.
Oct 20 2023, 7:04 PM · Research (FY2023-24-Research-October-December)
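
The check described for Issue 40 (languages where most paragraphs come out as a single sentence) could be sketched as below. The input format, the 0.5 threshold, and the language names are all hypothetical; the real analysis would take its per-paragraph sentence counts from the sentence tokenizer.

```python
def single_sentence_ratio(sentence_counts):
    """Fraction of paragraphs for which exactly one sentence was detected."""
    if not sentence_counts:
        return 0.0
    single = sum(1 for n in sentence_counts if n == 1)
    return single / len(sentence_counts)


def flag_languages(counts_by_language, threshold=0.5):
    """Return languages whose single-sentence ratio exceeds the threshold."""
    return sorted(
        lang
        for lang, counts in counts_by_language.items()
        if single_sentence_ratio(counts) > threshold
    )


# Made-up per-paragraph sentence counts for three placeholder languages.
counts_by_language = {
    "lang_a": [3, 2, 4, 1],
    "lang_b": [1, 1, 1, 2],
    "lang_c": [1, 1],
}
flagged = flag_languages(counts_by_language)
```

A high ratio is only a hint that a sentence terminator may be missing from the tokenizer's list, which is why the flagged languages were then inspected manually.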

Oct 15 2023

AKhatun_WMF added a comment to T346798: Fix issues in mwtokenizer package blocking usage in link recommendation model.

Update week 9/10/2023 - 15/10/2023:

  • Working on updating sentence tokenizer for non white space languages
Oct 15 2023, 6:58 PM · Research (FY2023-24-Research-October-December)

Oct 6 2023

AKhatun_WMF added a comment to T346798: Fix issues in mwtokenizer package blocking usage in link recommendation model.

Update week 2/10/2023 - 8/10/2023:

  • 1st MR merged
  • Discussed how evaluation was set up and next steps.
  • Researched and took notes for non-whitespace separated languages. Doc.
  • Wikimedia Connect!!
Oct 6 2023, 5:04 PM · Research (FY2023-24-Research-October-December)

Sep 30 2023

AKhatun_WMF added a comment to T346798: Fix issues in mwtokenizer package blocking usage in link recommendation model.

Update week 25/09/2023 - 1/10/2023:

Sep 30 2023, 1:38 PM · Research (FY2023-24-Research-October-December)

Sep 28 2023

AKhatun_WMF updated AKhatun_WMF.
Sep 28 2023, 10:39 PM
AKhatun_WMF updated AKhatun_WMF.
Sep 28 2023, 10:37 PM
AKhatun_WMF added a comment to T346796: Requesting access to analytics-privatedata-users for Aisha Khatun.
Sep 28 2023, 1:45 AM · SRE, SRE-Access-Requests
AKhatun_WMF added a comment to T346796: Requesting access to analytics-privatedata-users for Aisha Khatun.

Thanks @colewhite. I'm all set!

Sep 28 2023, 1:45 AM · SRE, SRE-Access-Requests

Sep 25 2023

AKhatun_WMF added a comment to T346798: Fix issues in mwtokenizer package blocking usage in link recommendation model.

Update week 18/09/2023 - 24/09/2023:

  • Read onboarding docs to familiarize myself with mwtokenizer and add-a-link tasks: link-recommendation-2023_work-plan
  • Going over mwtokenizer code base
  • Issues setting up:
    • Don’t have kinit access yet
    • Cannot access Jupyter as well
    • Issues setting up venv in personal pc (couldn’t solve)
    • Created venv and error installing in stat1008. [resolved. needed proxy]
Sep 25 2023, 1:26 AM · Research (FY2023-24-Research-October-December)
AKhatun_WMF placed T288266: Better understand the makeup of specific Wikidata object types that probably can't be dropped up for grabs.
Sep 25 2023, 1:22 AM · Epic, Wikidata, Wikidata-Query-Service

Sep 23 2023

AKhatun_WMF added a comment to T346796: Requesting access to analytics-privatedata-users for Aisha Khatun.

I am getting this error when I kinit
kinit: Client 'akhatun@WIKIMEDIA' not found in Kerberos database while getting initial credentials
Am I supposed to get a temporary password through email?

Sep 23 2023, 12:17 AM · SRE, SRE-Access-Requests

Sep 18 2023

AKhatun_WMF placed T288259: Get estimates for how many Wikidata items don't have at least 3 backlinks up for grabs.
Sep 18 2023, 5:24 PM · Wikidata, Wikidata-Query-Service
AKhatun_WMF placed T288260: Get estimates for size of non-normalized values in Wikidata up for grabs.
Sep 18 2023, 5:24 PM · Wikidata, Wikidata-Query-Service
AKhatun_WMF placed T288261: Determine if there are consistently used top ranked Wikidata statements, and how many of them are there up for grabs.
Sep 18 2023, 5:22 PM · Wikidata, Wikidata-Query-Service
AKhatun_WMF placed T288264: Get estimates for all Wikidata statements of a specific datatype up for grabs.
Sep 18 2023, 5:21 PM · Wikidata, Wikidata-Query-Service
AKhatun_WMF placed T288265: Get estimates for Wikidata items without hot properties that are being queried up for grabs.
Sep 18 2023, 5:20 PM · Wikidata, Wikidata-Query-Service

Jun 30 2023

AKhatun_WMF closed T328742: Generate list of common misspellings from wiktionary as Resolved.
Jun 30 2023, 8:08 PM · Research (FY2022-23-Research-April-June)
AKhatun_WMF closed T328742: Generate list of common misspellings from wiktionary, a subtask of T293034: [EPIC] Research support for Copyediting as a structured tasks, as Resolved.
Jun 30 2023, 8:08 PM · Research, Epic
AKhatun_WMF added a comment to T328742: Generate list of common misspellings from wiktionary.

Week 26/6/23 - 2/7/23 Update:

  • Listed and analyzed redirects in Wiktionary.
Jun 30 2023, 7:58 PM · Research (FY2022-23-Research-April-June)

Jun 29 2023

AKhatun_WMF added a comment to T328742: Generate list of common misspellings from wiktionary.

Week 19/6/23 - 25/6/23 Update:

Jun 29 2023, 12:37 PM · Research (FY2022-23-Research-April-June)

Jun 19 2023

AKhatun_WMF added a comment to T328742: Generate list of common misspellings from wiktionary.

Week 12/6/23 - 18/6/23 Update:

  • Finished report (Images not added yet)
  • MR11 sent for readme and figures
Jun 19 2023, 5:14 PM · Research (FY2022-23-Research-April-June)

Jun 11 2023

AKhatun_WMF added a comment to T328742: Generate list of common misspellings from wiktionary.

Week 5/6/23 - 11/6/23 Update:

  • Fixed template parsing error, MR10 got merged.
  • Started working on report
Jun 11 2023, 11:22 PM · Research (FY2022-23-Research-April-June)

Jun 9 2023

AKhatun_WMF updated the task description for T328742: Generate list of common misspellings from wiktionary.
Jun 9 2023, 2:29 AM · Research (FY2022-23-Research-April-June)

Jun 3 2023

AKhatun_WMF added a comment to T328742: Generate list of common misspellings from wiktionary.

Week 29/5/23 - 4/6/23 Update:

  • Fixed template parsing to accommodate the use of lang param in template
  • Parsed and saved all language wiktionary misspelling.
  • Did some analysis, de-duplication, and saved all_wiki combined wiktionary misspellings.
  • More errors found in template parsing (named params occur before un-named params causing incorrect parsing)
Jun 3 2023, 5:23 AM · Research (FY2022-23-Research-April-June)
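
The parsing problem noted above (named params appearing before un-named ones) can be avoided by collecting positional and named parameters separately. The sketch below is a deliberately naive illustration, not the project's parser: `parse_template` is a hypothetical helper, the example template call is made up, and nested templates or links containing "|" are not handled.

```python
def parse_template(template_text):
    """Split '{{name|a|k=v|b}}' into (name, positional params, named params).

    Positional params keep their relative order wherever named params appear.
    """
    inner = template_text.strip()[2:-2]  # drop the surrounding {{ and }}
    name, *params = inner.split("|")
    positional, named = [], {}
    for param in params:
        if "=" in param:
            key, value = param.split("=", 1)
            named[key.strip()] = value.strip()
        else:
            positional.append(param.strip())
    return name.strip(), positional, named


# Illustrative call with a named param placed before the un-named ones.
name, positional, named = parse_template("{{misspelling of|nocap=1|en|receive}}")
# positional == ["en", "receive"], named == {"nocap": "1"}
```

Reading the language and the correct spelling from `positional` rather than from fixed indices in the raw split keeps them right when a named param comes first.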

May 28 2023

AKhatun_WMF added a comment to T328742: Generate list of common misspellings from wiktionary.

Week 22/5/23 - 28/5/23 Update:

  • Updated MR9 with summary
  • Created Issue 14. Changed wiktionary parser script to make it work with all languages. Need to figure out some changes in template params.
May 28 2023, 11:41 PM · Research (FY2022-23-Research-April-June)

May 22 2023

AKhatun_WMF added a comment to T328742: Generate list of common misspellings from wiktionary.

Week 15/5/23 - 21/5/23 Update:

  • Checked example use of misspelling of templates in all the collected 16 languages. All languages look similar to enwiktionary except trwiktionary (small change) and viwiktionary
May 22 2023, 3:32 AM · Research (FY2022-23-Research-April-June)

May 12 2023

AKhatun_WMF added a comment to T328742: Generate list of common misspellings from wiktionary.

Week 8/5/23 - 14/5/23 Update:

  • Created Issues 12 and 13. Started working on them: identify misspelling of templates in other languages and find usage of these templates. The templates would be collected from Q50368067, misspelling of named templates in other languages, and their redirects.
May 12 2023, 10:29 PM · Research (FY2022-23-Research-April-June)
AKhatun_WMF added a comment to T328742: Generate list of common misspellings from wiktionary.

Week 1/5/23 - 7/5/23 Update:

  • Incorporated feedback and had MR7 merged (refactor repo)
  • Analysis done on extracted misspellings, sent MR8. Based on feedback, some more analysis done.
May 12 2023, 3:05 AM · Research (FY2022-23-Research-April-June)

Apr 30 2023

AKhatun_WMF added a comment to T328742: Generate list of common misspellings from wiktionary.

Week 24/4/23 - 30/4/23 Update:

  • Incorporated Isaac's feedback for MR5 and 6. All MRs merged after some editing and discussion.
  • Created MR7 to refactor the repo
  • Extracted misspellings from all language wikipedias.
    • Todo: analysis on extracted data
Apr 30 2023, 11:48 PM · Research (FY2022-23-Research-April-June)

Apr 24 2023

AKhatun_WMF added a comment to T328742: Generate list of common misspellings from wiktionary.

Week 10/4/23 - 16/4/23 Update:

  • Add info on language detection (language, confidence, text sent to model). Analyze examples.
  • Add proxy tables: tables that were not detected by mwparserfromhell.
  • Separate cell data of tables: each cell in table is now a node. Stuck with cell data/paragraph text to send to model.
Apr 24 2023, 4:19 PM · Research (FY2022-23-Research-April-June)

Apr 8 2023

AKhatun_WMF added a comment to T328742: Generate list of common misspellings from wiktionary.

Week 3/4/23 - 9/4/23 Update:

  • Pushed revised code that includes all additional formatting as a list (as discussed).
  • Fixed quotations detected. Added fasttext language detection.
  • Analysed collected misspellings from context. Some work needs to be done to increase precision of detected language.
Apr 8 2023, 8:06 PM · Research (FY2022-23-Research-April-June)

Apr 1 2023

AKhatun_WMF added a comment to T328742: Generate list of common misspellings from wiktionary.

Week 27/3/23 - 2/4/23 Update:

  • Apply additional filter information to extracted misspellings: Capitalization, word length, part of a list item, inside of quotations (in any language)
  • Still need to figure out the data's structure and add fasttext detected language information
Apr 1 2023, 7:26 PM · Research (FY2022-23-Research-April-June)

Mar 25 2023

AKhatun_WMF added a comment to T328742: Generate list of common misspellings from wiktionary.

Week 20/3/23 - 26/3/23 Update:

Mar 25 2023, 7:30 PM · Research (FY2022-23-Research-April-June)

Mar 18 2023

AKhatun_WMF added a comment to T328742: Generate list of common misspellings from wiktionary.

Week 13/3/23 - 19/3/23 Update:

Mar 18 2023, 4:16 AM · Research (FY2022-23-Research-April-June)

Mar 11 2023

AKhatun_WMF added a comment to T328742: Generate list of common misspellings from wiktionary.

Week 6/3/23 - 12/3/23 Update:

  • Compared collected en and fr misspellings with AutowikiBrowser Typo list. Merge requested. Summary here
  • Started working on extracting wikipedia text to find the ratio of misspellings
Mar 11 2023, 2:46 AM · Research (FY2022-23-Research-April-June)
AKhatun_WMF updated the task description for T328742: Generate list of common misspellings from wiktionary.
Mar 11 2023, 2:42 AM · Research (FY2022-23-Research-April-June)

Mar 4 2023

AKhatun_WMF added a comment to T328742: Generate list of common misspellings from wiktionary.

Week 27/2/23 - 5/3/23 Update:

  • Address comments for Issue #5
    • Parse sections line by line, consider templates in # items (numbered list)
    • Count the number of definitions by # count, excluding ## #: #; and #*
    • Also change the data format a bit to make it more readable
  • To address Issue 6: get list of misspellings from another Language and compare the collected lists to existing approaches
    • Collected bnwiktionary templates. It does not have many Bangla words; it's the same as what is present in enwiktionary. Will work with the existing collected Spanish misspellings instead.
    • for English, compared collected list with enwiki Lists_of_common_misspellings
Mar 4 2023, 1:30 AM · Research (FY2022-23-Research-April-June)
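
The definition-counting rule above (count lines starting with #, excluding ##, #:, #; and #*) can be sketched as follows. The function name and the sample section are hypothetical; only the counting rule itself comes from the update.

```python
def count_definitions(section_lines):
    """Count top-level numbered-list lines in a wikitext section.

    A line counts when it starts with '#' and the next character is not
    one of '#', ':', ';' or '*' (those prefixes are the excluded forms).
    """
    count = 0
    for line in section_lines:
        if not line.startswith("#"):
            continue
        if len(line) > 1 and line[1] in "#:;*":
            continue
        count += 1
    return count


# Made-up section content for illustration.
section = [
    "===Noun===",
    "# {{misspelling of|en|receive}}",
    "#: example usage line",
    "## nested sub-definition",
    "#* quotation line",
    "# second definition",
]
n_definitions = count_definitions(section)
```

In the sample section only the two plain # lines are counted, so `n_definitions` is 2.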

Mar 3 2023

AKhatun_WMF updated the task description for T328742: Generate list of common misspellings from wiktionary.
Mar 3 2023, 1:59 AM · Research (FY2022-23-Research-April-June)

Feb 25 2023

AKhatun_WMF updated the task description for T328742: Generate list of common misspellings from wiktionary.
Feb 25 2023, 6:22 AM · Research (FY2022-23-Research-April-June)
AKhatun_WMF added a comment to T328742: Generate list of common misspellings from wiktionary.

Week 20/2/23 - 26/2/23 Update:

Feb 25 2023, 6:20 AM · Research (FY2022-23-Research-April-June)

Feb 18 2023

AKhatun_WMF updated the task description for T328742: Generate list of common misspellings from wiktionary.
Feb 18 2023, 4:40 AM · Research (FY2022-23-Research-April-June)
AKhatun_WMF added a comment to T328742: Generate list of common misspellings from wiktionary.

Week 13/2/23 - 19/2/23 Update:

Feb 18 2023, 4:40 AM · Research (FY2022-23-Research-April-June)

Feb 16 2023

AKhatun_WMF updated the task description for T328742: Generate list of common misspellings from wiktionary.
Feb 16 2023, 9:12 PM · Research (FY2022-23-Research-April-June)
AKhatun_WMF updated the task description for T328742: Generate list of common misspellings from wiktionary.
Feb 16 2023, 6:50 PM · Research (FY2022-23-Research-April-June)
AKhatun_WMF added a comment to T328742: Generate list of common misspellings from wiktionary.

Week 6/2/23 - 12/2/23 Update:

  • Set up jupyter notebook (fix issues with getting spark3)
  • Get list of enwiktionary pages that use the misspelling_of template using the following tables:
    • mediawiki_templatelinks, mediawiki_linktarget, mediawiki_wikitext_current
  • Parsed enwiktionary pages to get heading name (typically POS: Noun, Adj, etc), language of misspelling, and the correct spelling from the template
  • Some analysis on parsed wikis to get language and heading distribution
Feb 16 2023, 6:45 PM · Research (FY2022-23-Research-April-June)

Feb 9 2023

AKhatun_WMF added a comment to T328733: Requesting access to analytics-privatedata-users for Aisha Khatun.

Thank you, accessed!

Feb 9 2023, 5:41 PM · SRE, SRE-Access-Requests

Feb 6 2023

AKhatun_WMF added a comment to T328742: Generate list of common misspellings from wiktionary.

Week 1/2/23 - 5/2/23 Update:

  • Caught up on previous work on copy editing both in research team and growth team
  • Learned about templates in Wiktionary in different languages and the possible categories they may be in
Feb 6 2023, 10:57 PM · Research (FY2022-23-Research-April-June)

Feb 3 2023

AKhatun_WMF updated the task description for T328733: Requesting access to analytics-privatedata-users for Aisha Khatun.
Feb 3 2023, 9:21 PM · SRE, SRE-Access-Requests

Jul 11 2022

AKhatun_WMF added a project to T279416: Deploy Image content filtration model for Wikimedia Commons: WMF-Inspiration-Week-2022-ML-Collab.
Jul 11 2022, 8:58 AM · WMF-Inspiration-Week-2022-ML-Collab, artificial-intelligence

Jul 8 2022

AKhatun_WMF added a comment to T303831: Productionize Wikidata subgraph analysis.

In terms of the exact code causing this, spark is terrible at telling us exactly where but trying to infer from the SparkUI output i think it's this join:

def getTopSubgraphItems(topSubgraphs: DataFrame): DataFrame = {
  wikidataTriples
    .filter(s"predicate='<$p31>'")
    .selectExpr("object as subgraph", "subject as item")
    .join(topSubgraphs.select("subgraph"), Seq("subgraph"), "right")
Jul 8 2022, 5:47 AM · Wikidata, Wikidata-Query-Service, Discovery-Search (Current work)

Jul 7 2022

AKhatun_WMF added a comment to T303831: Productionize Wikidata subgraph analysis.

Update:
I tested a few options in the statbox, I am not sure how much this will represent the prod env, but here goes:

Jul 7 2022, 12:20 PM · Wikidata, Wikidata-Query-Service, Discovery-Search (Current work)
AKhatun_WMF added a comment to T303831: Productionize Wikidata subgraph analysis.

the airflow patch is deployed but i only turned on *_init dags and subgraph_mapping_weekly today (ran out of time, will do rest tomorrow).

subgraph_mapping_weekly failed the first time through. I updated executor memory from 8g to 12g but the second execution is still failing. something is quite unbalanced about the topSubgraphItems, of the 8 shards they have inputs varying from 100MB to 450MB giving execution times of ~30s on the small ones and ~8m before the final one fails.

Not specifically related to this patch, but i wonder if we could change up the SparkUtils.saveTables method to somehow take parameters in the path to specify coalesce vs repartition and the number of partitions to save by, so we only have to update the airflow invocation and not the jar as well to test variations there.

Jul 7 2022, 7:41 AM · Wikidata, Wikidata-Query-Service, Discovery-Search (Current work)

Jun 5 2022

AKhatun_WMF placed T271400: Collect analytics data such as pageview up for grabs.
Jun 5 2022, 12:56 PM · Abstract Wikipedia team

Mar 15 2022

AKhatun_WMF moved T303831: Productionize Wikidata subgraph analysis from Incoming to In Progress on the Discovery-Search (Current work) board.
Mar 15 2022, 2:10 PM · Wikidata, Wikidata-Query-Service, Discovery-Search (Current work)
AKhatun_WMF moved T303831: Productionize Wikidata subgraph analysis from Incoming to Current work on the Wikidata-Query-Service board.
Mar 15 2022, 2:10 PM · Wikidata, Wikidata-Query-Service, Discovery-Search (Current work)
AKhatun_WMF created T303831: Productionize Wikidata subgraph analysis.
Mar 15 2022, 2:08 PM · Wikidata, Wikidata-Query-Service, Discovery-Search (Current work)
AKhatun_WMF placed T299921: Estimate benefits of splitting and federating Wikidata subgraphs up for grabs.
Mar 15 2022, 1:52 PM · Wikidata, Wikidata-Query-Service

Feb 10 2022

AKhatun_WMF updated the task description for T299453: Coordinate Wikimedia's participation in GSoC 2022 and Outreachy Round 24.
Feb 10 2022, 8:45 AM · Developer-Advocacy (Oct-Dec 2022), Outreachy (Round 24), Google-Summer-of-Code (2022)

Jan 31 2022

AKhatun_WMF moved T299921: Estimate benefits of splitting and federating Wikidata subgraphs from Analysis to Current work on the Wikidata-Query-Service board.
Jan 31 2022, 2:02 PM · Wikidata, Wikidata-Query-Service

Jan 20 2022

AKhatun_WMF moved T288262: Estimate how many Wikidata items have low/no ORES score from In Progress to Needs Reporting on the Discovery-Search (Current work) board.

The analysis is done here (for Q-ids): Wikidata_Item_ORES_Score_Analysis

Jan 20 2022, 3:24 PM · Machine-Learning-Team, ORES, Discovery-Search (Current work), Wikidata, Wikidata-Query-Service

Jan 18 2022

AKhatun_WMF added a comment to T288262: Estimate how many Wikidata items have low/no ORES score.

@AKhatun_WMF: You mention on the wiki that some Items don't have an ORES score. All Items should have one 😬 Do you have an example of one that does not?

Jan 18 2022, 5:44 PM · Machine-Learning-Team, ORES, Discovery-Search (Current work), Wikidata, Wikidata-Query-Service
AKhatun_WMF added a comment to T288262: Estimate how many Wikidata items have low/no ORES score.

@AKhatun_WMF , sorry, it's been a while since I wrote this, but I think what I meant when I wrote the question about "optimal separation" is given some distribution of ORES scores (e.g. a normal distribution), is it clear what the threshold is for what qualifies as a "high" vs "low" score: e.g. anything over .75 is a high score. But that's assuming the scores are continuous. I guess it's moot if they're binary (I don't actually know).

If this isn't a sensible way of thinking about the issue, let me know if there's a better way.

Jan 18 2022, 3:16 PM · Machine-Learning-Team, ORES, Discovery-Search (Current work), Wikidata, Wikidata-Query-Service
AKhatun_WMF updated subscribers of T288262: Estimate how many Wikidata items have low/no ORES score.

@MPhamWMF Hi, could you please clarify the question "Is there an optimal separation between high/low ORES scores?" Separation in what respect? What comes to my mind is the separation of items with respect to the subgraph they are part of.

Jan 18 2022, 6:52 AM · Machine-Learning-Team, ORES, Discovery-Search (Current work), Wikidata, Wikidata-Query-Service

Jan 12 2022

AKhatun_WMF added a comment to T288262: Estimate how many Wikidata items have low/no ORES score.

@ACraze Indeed! I was confusing the models for revision (item quality) with edits (damaging/good faith). The latest revision is all I will need. Thank you!

Jan 12 2022, 4:02 AM · Machine-Learning-Team, ORES, Discovery-Search (Current work), Wikidata, Wikidata-Query-Service

Jan 10 2022

AKhatun_WMF moved T288262: Estimate how many Wikidata items have low/no ORES score from Incoming to In Progress on the Discovery-Search (Current work) board.
Jan 10 2022, 7:28 AM · Machine-Learning-Team, ORES, Discovery-Search (Current work), Wikidata, Wikidata-Query-Service
AKhatun_WMF moved T288262: Estimate how many Wikidata items have low/no ORES score from Analysis to Current work on the Wikidata-Query-Service board.
Jan 10 2022, 7:28 AM · Machine-Learning-Team, ORES, Discovery-Search (Current work), Wikidata, Wikidata-Query-Service

Jan 6 2022

AKhatun_WMF moved T288257: Get estimates for size of astronomical objects and queries in Wikidata graph from Incoming to Needs Reporting on the Discovery-Search (Current work) board.

Counts of queries and triples for astronomical objects were done here: Wikidata_Subgraph_Query_Analysis, along with the top ~300 large subgraphs.
For the specific case of Astronomical objects (and only astronomical objects), a list of all its subclasses was obtained and manually inspected for relevance to astronomical objects. The subclass list also consists of subclasses of subclasses and so on.

Jan 6 2022, 6:12 AM · Discovery-Search (Current work), Wikidata, Wikidata-Query-Service
AKhatun_WMF moved T295188: Create aggregate list of potential Blazegraph data deletion sources in case of catastrophic failure from In Progress to Needs Reporting on the Discovery-Search (Current work) board.

Details can be found here: Wikidata_Subgraph_Query_Analysis

Jan 6 2022, 5:48 AM · Discovery-Search (Current work), Wikidata, Wikidata-Query-Service