Clean and upload main and secondary wiki results. Add single language result for comparison.
Prepare and run script to gather data for all languages (anchor dicts etc)
Clean code to push as an intermediate stage of language agnostic modeling

Mar 22 2024, 3:00 PM · Research (FY2023-24-Research-January-March)

Mar 17 2024

AKhatun_WMF added a comment to T354659: Exploratory work on language-agnostic model for link recommendation for add-a-link.

Update 11/3/2024 - 17/3/2024:

Scale the model to 50 languages: ran the pipeline for 50 languages and trained a model with at most 100k samples per language. Used language fallback chains to select languages at the center.
Test on a different set of 50 languages (randomly chosen) in a 0-shot manner and compare performance.
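
The per-language cap mentioned above could look roughly like the following. This is a minimal sketch, assuming training rows arrive as (wiki_db, sample) pairs; the function name cap_per_language, the default seed, and the row layout are hypothetical and not taken from the link-recommendation code.

```python
import random
from collections import defaultdict


def cap_per_language(rows, max_per_lang=100_000, seed=0):
    """Keep at most max_per_lang randomly chosen samples for each wiki_db.

    rows is an iterable of (wiki_db, sample) pairs (an assumed layout).
    """
    by_lang = defaultdict(list)
    for wiki_db, sample in rows:
        by_lang[wiki_db].append(sample)

    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    capped = {}
    for wiki_db, samples in by_lang.items():
        if len(samples) > max_per_lang:
            samples = rng.sample(samples, max_per_lang)
        capped[wiki_db] = samples
    return capped
```

Languages with fewer samples than the cap are kept whole, so small wikis are not dropped from training.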

Mar 17 2024, 4:37 PM · Research (FY2023-24-Research-January-March)

Mar 8 2024

AKhatun_WMF added a comment to T354659: Exploratory work on language-agnostic model for link recommendation for add-a-link.

Update 4/3/2024 - 10/3/2024:

Finished grid search and stored best fit model
Add mwtokenizer in one more place, fix code to accommodate wiki_db feature, fix label encoder, push draft MR

Mar 8 2024, 7:12 PM · Research (FY2023-24-Research-January-March)

Mar 3 2024

AKhatun_WMF added a comment to T354659: Exploratory work on language-agnostic model for link recommendation for add-a-link.

Update 26/2/2024 - 3/3/2024:

Trained a combined model with all data and with stratified split
Hyperparameter tuning in progress

Mar 3 2024, 3:26 PM · Research (FY2023-24-Research-January-March)

Feb 23 2024

AKhatun_WMF added a comment to T354659: Exploratory work on language-agnostic model for link recommendation for add-a-link.

Update 19/2/2024 - 25/2/2024:

Replace w2v with outlink embedding, created baseline and ran 11 test wikis, MR sent
Training and evaluating a combined model with all test language wikis
- with and without wiki_db feature

Feb 23 2024, 5:18 PM · Research (FY2023-24-Research-January-March)

Feb 18 2024

AKhatun_WMF added a comment to T354659: Exploratory work on language-agnostic model for link recommendation for add-a-link.

Update 29/01/2024 - 04/02/2024:

Understanding the feature generation and model training component of link-recommendation model
Tested language performance of several models by changing the model's language. Tested multilingual models 2 languages at a time.

Feb 18 2024, 2:57 PM · Research (FY2023-24-Research-January-March)

Jan 29 2024

AKhatun_WMF closed T347696: Improving language-dependent models for add-a-link, a subtask of T309263: Support languages whose add-a-link models were not published, as Resolved.

Jan 29 2024, 2:22 PM · MoveComms-Support (Oct-Dec-2023), Chinese-Sites, Machine-Learning-Team, Growth-Team, Add-Link

AKhatun_WMF closed T347696: Improving language-dependent models for add-a-link, a subtask of T342526: Improving multilingual support for link recommendation model for add-a-link task, as Resolved.

Jan 29 2024, 2:22 PM · address-knowledge-gaps, Epic, Research

AKhatun_WMF closed T347696: Improving language-dependent models for add-a-link as Resolved.

Jan 29 2024, 2:22 PM · Research (FY2023-24-Research-January-March)

Jan 27 2024

AKhatun_WMF added a comment to T347696: Improving language-dependent models for add-a-link.

Update week 22/1/2024 - 28/1/2024:

Fixed a regex that was causing many of the models to have low recall
MR sent

Jan 27 2024, 5:40 AM · Research (FY2023-24-Research-January-March)

Jan 19 2024

AKhatun_WMF added a comment to T347696: Improving language-dependent models for add-a-link.

Results of evaluations after solving this ticket can be found here: https://meta.wikimedia.org/wiki/Research:Improving_multilingual_support_for_link_recommendation_model_for_add-a-link_task#Results.

Jan 19 2024, 4:17 PM · Research (FY2023-24-Research-January-March)

AKhatun_WMF updated the task description for T347696: Improving language-dependent models for add-a-link.

Jan 19 2024, 4:16 PM · Research (FY2023-24-Research-January-March)

AKhatun_WMF updated the task description for T347696: Improving language-dependent models for add-a-link.

Jan 19 2024, 4:04 PM · Research (FY2023-24-Research-January-March)

AKhatun_WMF added a comment to T347696: Improving language-dependent models for add-a-link.

Update week 15/1/2024 - 21/1/2024:

MR sent to fix unicode errors. Multiple languages tested.
Tested all previously failed languages. wikipedia2vec==2.0.0 introduces a new IndexError that occurs in several languages.
- Reverted to 2 venvs. This time conda has w2v==2.0.0 for jawiki and fywiki; venv has w2v==1.0.5 for the rest of the languages.
- Sent MR4

Jan 19 2024, 3:41 PM · Research (FY2023-24-Research-January-March)

Jan 14 2024

AKhatun_WMF added a comment to T347696: Improving language-dependent models for add-a-link.

Update week 8/1/2024 - 14/1/2024:

Test and fix jawiki error by adding required dependencies.
Attempt to fix Unicode errors in zhwiki and fywiki (using different version of Wikipedia2Vec)

Jan 14 2024, 6:50 PM · Research (FY2023-24-Research-January-March)

Jan 7 2024

AKhatun_WMF added a comment to T347696: Improving language-dependent models for add-a-link.

Update week 1/1/2024 - 7/1/2024:

mwtokenizer MR merged, new version released
link-recommendation MR updated and refactored to integrate new mwtokenizer
Ran non-WS languages and some previously failed languages. There were some improvements. More debugging required.
MR merged

Jan 7 2024, 7:02 PM · Research (FY2023-24-Research-January-March)

Dec 23 2023

AKhatun_WMF added a comment to T347696: Improving language-dependent models for add-a-link.

Update week 19/12/2023 - 24/12/2023:

Make changes in mwtokenizer
- replace ▁ with " " in the tokenizer
- separate punctuation from tokens.
- Sent MR
In progress:
- Use the updated mwtokenizer to improve link-recommendation.
- Refactor and consolidate ngram functions in link-recommendation code.
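
The two tokenizer changes listed above can be illustrated as a post-processing step over a token list. This is a minimal sketch, not the mwtokenizer implementation: it assumes "▁" (U+2581) is the SentencePiece-style word-boundary marker, and the function name normalize_tokens is hypothetical.

```python
import unicodedata

SP_MARKER = "\u2581"  # "▁", assumed to be the SentencePiece word-boundary marker


def normalize_tokens(tokens):
    """Replace the marker with a space and emit punctuation as separate tokens."""
    out = []
    for token in tokens:
        token = token.replace(SP_MARKER, " ")
        current = ""
        for ch in token:
            # Unicode general categories starting with "P" are punctuation.
            if unicodedata.category(ch).startswith("P"):
                if current:
                    out.append(current)
                    current = ""
                out.append(ch)
            else:
                current += ch
        if current:
            out.append(current)
    return out
```

Using the Unicode category rather than a fixed ASCII list is a design choice for this sketch, so that punctuation such as "。" or "।" is separated as well.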

Dec 23 2023, 12:34 AM · Research (FY2023-24-Research-January-March)

Dec 20 2023

AKhatun_WMF added a comment to T347696: Improving language-dependent models for add-a-link.

Update week 11/12/2023 - 17/12/2023:

Fix sentence tokenization errors in link-recommendation. Send MR. Improves bowiki, but no improvement in mywiki. WS languages remain the same.
Some analysis into the cause of the issues above.

Dec 20 2023, 6:18 AM · Research (FY2023-24-Research-January-March)

Dec 10 2023

AKhatun_WMF added a comment to T347696: Improving language-dependent models for add-a-link.

Update week 27/11/2023 - 3/12/2023:

mwtokenizer issues resolved. MR merged.
Read through link recommendation docs
Pull code, set up dev env, run code for test wikis.
Some errors reported and fixed (T352525)

Dec 10 2023, 4:09 AM · Research (FY2023-24-Research-January-March)

Nov 28 2023

AKhatun_WMF closed T346798: Fix issues in mwtokenizer package blocking usage in link recommendation model as Resolved.

Nov 28 2023, 5:27 PM · Research (FY2023-24-Research-October-December)

AKhatun_WMF closed T346798: Fix issues in mwtokenizer package blocking usage in link recommendation model, a subtask of T342526: Improving multilingual support for link recommendation model for add-a-link task, as Resolved.

Nov 28 2023, 5:27 PM · address-knowledge-gaps, Epic, Research

Nov 24 2023

AKhatun_WMF added a comment to T346798: Fix issues in mwtokenizer package blocking usage in link recommendation model.

Update week 20/11/2023 - 26/11/2023:

Revised MR for issue 38. Merged after few iterations.
Working on issue 32. Pushed.

Nov 24 2023, 8:47 PM · Research (FY2023-24-Research-October-December)

Nov 18 2023

AKhatun_WMF added a comment to T346798: Fix issues in mwtokenizer package blocking usage in link recommendation model.

Update week 13/11/2023 - 19/11/2023:

Discussed Issue 32 and 38.
Pushed code for Issue 38
Make diff for 38
Create new issues for edge cases found while solving 38

Nov 18 2023, 10:18 PM · Research (FY2023-24-Research-October-December)

Nov 12 2023

AKhatun_WMF added a comment to T346798: Fix issues in mwtokenizer package blocking usage in link recommendation model.

Update week 6/11/2023 - 12/11/2023:

Issue 37 solved and merged.
Start looking at Issue 38 and 32

Nov 12 2023, 1:47 AM · Research (FY2023-24-Research-October-December)

Nov 5 2023

AKhatun_WMF added a comment to T346798: Fix issues in mwtokenizer package blocking usage in link recommendation model.

Update week 30/10/2023 - 5/11/2023:

Working on Issue 37. Dissecting regex.

Nov 5 2023, 7:39 PM · Research (FY2023-24-Research-October-December)

Oct 30 2023

AKhatun_WMF added a comment to T346798: Fix issues in mwtokenizer package blocking usage in link recommendation model.

Update week 23/10/2023 - 29/10/2023:

Done finding sentence terminator symbols using the sentence per paragraph method. Also listed more symbols from Terminal_punctuation. See DOC for details. MR 21 sent, merged.
Discussed Issue 37 and 9. MR merged for Issue 9.
Started Issue 37

Oct 30 2023, 12:00 PM · Research (FY2023-24-Research-October-December)

Oct 20 2023

AKhatun_WMF added a comment to T346798: Fix issues in mwtokenizer package blocking usage in link recommendation model.

Update week 16/10/2023 - 22/10/2023:

Done Issue 41 to move Cree off of non-WS language list
Worked on Issue 40: Find list of languages where most paragraphs have 1 sentence, analyze few random wiki pages, detect missing sentence ending punctuation, if any.
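
The check in Issue 40 can be sketched as a ratio of paragraphs that split into exactly one sentence under a given set of terminators. This is an illustrative sketch only: the function name single_sentence_ratio and the default terminator set are assumptions, and the real analysis may split sentences differently.

```python
import re


def single_sentence_ratio(paragraphs, terminators=".!?"):
    """Fraction of non-empty paragraphs that split into exactly one sentence."""
    splitter = re.compile("[" + re.escape(terminators) + "]+")
    counts = []
    for paragraph in paragraphs:
        sentences = [s for s in splitter.split(paragraph) if s.strip()]
        if sentences:
            counts.append(len(sentences))
    if not counts:
        return 0.0
    return sum(1 for c in counts if c == 1) / len(counts)
```

A ratio close to 1.0 for a language hints that its sentence-ending symbol is missing from the terminator set; adding the symbol (for example "။" for Burmese) should bring the ratio down.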

Oct 20 2023, 7:04 PM · Research (FY2023-24-Research-October-December)

Oct 15 2023

AKhatun_WMF added a comment to T346798: Fix issues in mwtokenizer package blocking usage in link recommendation model.

Update week 9/10/2023 - 15/10/2023:

Working on updating sentence tokenizer for non white space languages

Oct 15 2023, 6:58 PM · Research (FY2023-24-Research-October-December)

Oct 6 2023

AKhatun_WMF added a comment to T346798: Fix issues in mwtokenizer package blocking usage in link recommendation model.

Update week 2/10/2023 - 8/10/2023:

1st MR merged
Discussed how evaluation was set up and next steps.
Researched and took notes for non-whitespace separated languages. Doc.
Wikimedia Connect!!

Oct 6 2023, 5:04 PM · Research (FY2023-24-Research-October-December)

Sep 30 2023

AKhatun_WMF added a comment to T346798: Fix issues in mwtokenizer package blocking usage in link recommendation model.

Update week 25/09/2023 - 1/10/2023:

Sep 30 2023, 1:38 PM · Research (FY2023-24-Research-October-December)

Sep 28 2023

AKhatun_WMF updated AKhatun_WMF.

Sep 28 2023, 10:39 PM

AKhatun_WMF updated AKhatun_WMF.

Sep 28 2023, 10:37 PM

AKhatun_WMF added a comment to T346796: Requesting access to analytics-privatedata-users for Aisha Khatun.

Thanks @colewhite. I'm all set!

Sep 28 2023, 1:45 AM · SRE, SRE-Access-Requests

Sep 25 2023

AKhatun_WMF added a comment to T346798: Fix issues in mwtokenizer package blocking usage in link recommendation model.

Update week 18/09/2023 - 24/09/2023:

Read onboarding docs to familiarize with mwtokenizer and add-a-link tasks: link-recommendation-2023_work-plan
Going over mwtokenizer code base
Issues setting up:
- Don’t have kinit access yet
- Cannot access Jupyter either
- Issues setting up venv on personal PC (couldn’t solve)
- Created venv on stat1008 but hit an installation error [resolved: a proxy was needed]

Sep 25 2023, 1:26 AM · Research (FY2023-24-Research-October-December)

AKhatun_WMF placed T288266: Better understand the makeup of specific Wikidata object types that probably can't be dropped up for grabs.

Sep 25 2023, 1:22 AM · Epic, Wikidata, Wikidata-Query-Service

Sep 23 2023

AKhatun_WMF added a comment to T346796: Requesting access to analytics-privatedata-users for Aisha Khatun.

I am getting this error when I kinit
kinit: Client 'akhatun@WIKIMEDIA' not found in Kerberos database while getting initial credentials
Am I supposed to get a temporary password through email?

Sep 23 2023, 12:17 AM · SRE, SRE-Access-Requests

Sep 18 2023

AKhatun_WMF placed T288259: Get estimates for how many Wikidata items don't have at least 3 backlinks up for grabs.

Sep 18 2023, 5:24 PM · Wikidata, Wikidata-Query-Service

AKhatun_WMF placed T288260: Get estimates for size of non-normalized values in Wikidata up for grabs.

Sep 18 2023, 5:24 PM · Wikidata, Wikidata-Query-Service

AKhatun_WMF placed T288261: Determine if there are consistently used top ranked Wikidata statements, and how many of them are there up for grabs.

Sep 18 2023, 5:22 PM · Wikidata, Wikidata-Query-Service

AKhatun_WMF placed T288264: Get estimates for all Wikidata statements of a specific datatype up for grabs.

Sep 18 2023, 5:21 PM · Wikidata, Wikidata-Query-Service

AKhatun_WMF placed T288265: Get estimates for Wikidata items without hot properties that are being queried up for grabs.

Sep 18 2023, 5:20 PM · Wikidata, Wikidata-Query-Service

Jun 30 2023

AKhatun_WMF closed T328742: Generate list of common misspellings from wiktionary as Resolved.

Jun 30 2023, 8:08 PM · Research (FY2022-23-Research-April-June)

AKhatun_WMF closed T328742: Generate list of common misspellings from wiktionary, a subtask of T293034: [EPIC] Research support for Copyediting as a structured tasks, as Resolved.

Jun 30 2023, 8:08 PM · Research, Epic

AKhatun_WMF added a comment to T328742: Generate list of common misspellings from wiktionary.

Week 26/6/23 - 2/7/23 Update:

Listed and analyzed redirects in Wiktionary.

Jun 30 2023, 7:58 PM · Research (FY2022-23-Research-April-June)

Jun 29 2023

AKhatun_WMF added a comment to T328742: Generate list of common misspellings from wiktionary.

Week 19/6/23 - 25/6/23 Update:

Jun 29 2023, 12:37 PM · Research (FY2022-23-Research-April-June)

Jun 19 2023

AKhatun_WMF added a comment to T328742: Generate list of common misspellings from wiktionary.

Week 12/6/23 - 18/6/23 Update:

Finished report (Images not added yet)
MR11 sent for readme and figures

Jun 19 2023, 5:14 PM · Research (FY2022-23-Research-April-June)

Jun 11 2023

AKhatun_WMF added a comment to T328742: Generate list of common misspellings from wiktionary.

Week 5/6/23 - 11/6/23 Update:

Fixed template parsing error, MR10 got merged.
Started working on report

Jun 11 2023, 11:22 PM · Research (FY2022-23-Research-April-June)

Jun 9 2023

AKhatun_WMF updated the task description for T328742: Generate list of common misspellings from wiktionary.

Jun 9 2023, 2:29 AM · Research (FY2022-23-Research-April-June)

Jun 3 2023

AKhatun_WMF added a comment to T328742: Generate list of common misspellings from wiktionary.

Week 29/5/23 - 4/6/23 Update:

Fixed template parsing to accommodate the use of lang param in template
Parsed and saved all language wiktionary misspelling.
Did some analysis, de-duplication, and saved all_wiki combined wiktionary misspellings.
More errors found in template parsing (named params occur before un-named params causing incorrect parsing)
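
The ordering problem noted above (named params appearing before unnamed ones) goes away if positional slots are numbered by counting unnamed params only. The following is a minimal sketch under that assumption; split_template_params is a hypothetical helper working on already-split parameter strings, not the parser used in the repository, and it assumes positional values do not themselves contain "=".

```python
def split_template_params(raw_params):
    """Split template params into positional and named ones.

    Positional slots are numbered by counting unnamed params only, so a
    named param such as lang=en placed first does not shift them.
    """
    positional, named = [], {}
    for param in raw_params:
        if "=" in param:
            key, value = param.split("=", 1)
            named[key.strip()] = value.strip()
        else:
            positional.append(param.strip())
    return positional, named
```

With this split, the first unnamed param is the same value regardless of where lang=... appears in the template call.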

Jun 3 2023, 5:23 AM · Research (FY2022-23-Research-April-June)

May 28 2023

AKhatun_WMF added a comment to T328742: Generate list of common misspellings from wiktionary.

Week 22/5/23 - 28/5/23 Update:

Updated MR9 with summary
Created Issue 14. Changed wiktionary parser script to make it work with all languages. Need to figure out some changes in template params.

May 28 2023, 11:41 PM · Research (FY2022-23-Research-April-June)

May 22 2023

AKhatun_WMF added a comment to T328742: Generate list of common misspellings from wiktionary.

Week 15/5/23 - 21/5/23 Update:

Checked example uses of "misspelling of" templates in all 16 collected languages. All languages look similar to enwiktionary except trwiktionary (small change) and viwiktionary.

May 22 2023, 3:32 AM · Research (FY2022-23-Research-April-June)

May 12 2023

AKhatun_WMF added a comment to T328742: Generate list of common misspellings from wiktionary.

Week 8/5/23 - 14/5/23 Update:

Created Issues 12 and 13. Started working on them: identify misspelling of templates in other languages and find usage of these templates. The templates would be collected from Q50368067, misspelling of named templates in other languages, and their redirects.

May 12 2023, 10:29 PM · Research (FY2022-23-Research-April-June)

AKhatun_WMF added a comment to T328742: Generate list of common misspellings from wiktionary.

Week 1/5/23 - 7/5/23 Update:

Incorporated feedback and had MR7 merged (refactor repo)
Analysis done on extracted misspellings, sent MR8. Based on feedback, some more analysis done.

May 12 2023, 3:05 AM · Research (FY2022-23-Research-April-June)

Apr 30 2023

AKhatun_WMF added a comment to T328742: Generate list of common misspellings from wiktionary.

Week 24/4/23 - 30/4/23 Update:

Incorporated Isaac's feedback for MR5 and MR6. All MRs merged after some editing and discussion.
Created MR7 to refactor the repo
Extracted misspellings from all language Wikipedias.
- Todo: analysis on extracted data

Apr 30 2023, 11:48 PM · Research (FY2022-23-Research-April-June)

Apr 24 2023

AKhatun_WMF added a comment to T328742: Generate list of common misspellings from wiktionary.

Week 10/4/23 - 16/4/23 Update:

Add info on language detection (language, confidence, text sent to model). Analyze examples.
Add proxy tables: tables that were not detected by mwparserfromhell.
Separate cell data of tables: each cell in table is now a node. Stuck with cell data/paragraph text to send to model.

Apr 24 2023, 4:19 PM · Research (FY2022-23-Research-April-June)

Apr 8 2023

AKhatun_WMF added a comment to T328742: Generate list of common misspellings from wiktionary.

Week 3/4/23 - 9/4/23 Update:

Pushed revised code that includes all additional formatting as a list (as discussed).
Fixed quotations detected. Added fasttext language detection.
Analysed collected misspellings from context. Some work needs to be done to increase the precision of the detected language.

Apr 8 2023, 8:06 PM · Research (FY2022-23-Research-April-June)

Apr 1 2023

AKhatun_WMF added a comment to T328742: Generate list of common misspellings from wiktionary.

Week 27/3/23 - 2/4/23 Update:

Apply additional filter information to extracted misspellings: Capitalization, word length, part of a list item, inside of quotations (in any language)
Still need to figure out the data's structure and add fasttext detected language information
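
The filter information listed above could be attached to each extracted misspelling as a small set of flags. This is a rough sketch, not the project's code: filter_flags and its field names are hypothetical, and the quotation check only handles ASCII double quotes, whereas the update mentions quotations in any language.

```python
def filter_flags(word, line):
    """Return simple context flags for a misspelling found in a wikitext line."""
    stripped = line.lstrip()
    idx = line.find(word)
    before = line[:idx] if idx >= 0 else ""
    return {
        "is_capitalized": word[:1].isupper(),
        "length": len(word),
        # Wikitext list items start with "*" (bulleted) or "#" (numbered).
        "in_list_item": stripped.startswith(("*", "#")),
        # An odd number of double quotes before the word suggests it sits
        # inside a quotation (ASCII quotes only in this sketch).
        "in_quotation": before.count('"') % 2 == 1,
    }
```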

Apr 1 2023, 7:26 PM · Research (FY2022-23-Research-April-June)

Mar 25 2023

AKhatun_WMF added a comment to T328742: Generate list of common misspellings from wiktionary.

Week 20/3/23 - 26/3/23 Update:

Mar 25 2023, 7:30 PM · Research (FY2022-23-Research-April-June)

Mar 18 2023

AKhatun_WMF added a comment to T328742: Generate list of common misspellings from wiktionary.

Week 13/3/23 - 19/3/23 Update:

Mar 18 2023, 4:16 AM · Research (FY2022-23-Research-April-June)

Mar 11 2023

AKhatun_WMF added a comment to T328742: Generate list of common misspellings from wiktionary.

Week 6/3/23 - 12/3/23 Update:

Compared collected en and fr misspellings with AutowikiBrowser Typo list. Merge requested. Summary here
Started working on extracting wikipedia text to find the ratio of misspellings

Mar 11 2023, 2:46 AM · Research (FY2022-23-Research-April-June)

AKhatun_WMF updated the task description for T328742: Generate list of common misspellings from wiktionary.

Mar 11 2023, 2:42 AM · Research (FY2022-23-Research-April-June)

Mar 4 2023

AKhatun_WMF added a comment to T328742: Generate list of common misspellings from wiktionary.

Week 27/2/23 - 5/3/23 Update:

Address comments for Issue #5
- Parse sections line by line, consider templates in # items (numbered list)
- Count the number of definitions by # count, excluding ## #: #; and #*
- Also change the data format a bit to make it more readable
To address Issue 6: get list of misspellings from another Language and compare the collected lists to existing approaches
- collected bnwiktionary templates. It does not have many Bangla words; the content is the same as in enwiktionary. Will work with existing collected Spanish misspellings instead.
- for English, compared collected list with enwiki Lists_of_common_misspellings
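
The definition-counting rule described above (count lines starting with #, excluding ##, #:, #; and #*) can be written as a single regular expression. A minimal sketch; the names DEFINITION_LINE and count_definitions are hypothetical.

```python
import re

# A definition line starts with a single "#" that is not followed by
# another "#", ":", ";" or "*".
DEFINITION_LINE = re.compile(r"^#(?![#:;*])")


def count_definitions(section_text):
    """Count top-level numbered-list items in a Wiktionary section."""
    return sum(
        1 for line in section_text.splitlines() if DEFINITION_LINE.match(line)
    )
```

The negative lookahead keeps nested items, example lines and quotation lines out of the count without needing a separate pass.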

Mar 4 2023, 1:30 AM · Research (FY2022-23-Research-April-June)

Mar 3 2023

AKhatun_WMF updated the task description for T328742: Generate list of common misspellings from wiktionary.

Mar 3 2023, 1:59 AM · Research (FY2022-23-Research-April-June)

Feb 25 2023

AKhatun_WMF updated the task description for T328742: Generate list of common misspellings from wiktionary.

Feb 25 2023, 6:22 AM · Research (FY2022-23-Research-April-June)

AKhatun_WMF added a comment to T328742: Generate list of common misspellings from wiktionary.

Week 20/2/23 - 26/2/23 Update:

Feb 25 2023, 6:20 AM · Research (FY2022-23-Research-April-June)

Feb 18 2023

AKhatun_WMF updated the task description for T328742: Generate list of common misspellings from wiktionary.

Feb 18 2023, 4:40 AM · Research (FY2022-23-Research-April-June)

AKhatun_WMF added a comment to T328742: Generate list of common misspellings from wiktionary.

Week 13/2/23 - 19/2/23 Update:

Feb 18 2023, 4:40 AM · Research (FY2022-23-Research-April-June)

Feb 16 2023

AKhatun_WMF updated the task description for T328742: Generate list of common misspellings from wiktionary.

Feb 16 2023, 9:12 PM · Research (FY2022-23-Research-April-June)

AKhatun_WMF updated the task description for T328742: Generate list of common misspellings from wiktionary.

Feb 16 2023, 6:50 PM · Research (FY2022-23-Research-April-June)

AKhatun_WMF added a comment to T328742: Generate list of common misspellings from wiktionary.

Week 6/2/23 - 12/2/23 Update:

Set up jupyter notebook (fix issues with getting spark3)
Get list of enwiktionary pages that use the misspelling_of template using the following tables:
- mediawiki_templatelinks, mediawiki_linktarget, mediawiki_wikitext_current
Parsed enwiktionary pages to get heading name (typically POS: Noun, Adj, etc), language of misspelling, and the correct spelling from the template
Some analysis on parsed wikis to get language and heading distribution

Feb 16 2023, 6:45 PM · Research (FY2022-23-Research-April-June)

Feb 9 2023

AKhatun_WMF added a comment to T328733: Requesting access to analytics-privatedata-users for Aisha Khatun.

Thank you, accessed!

Feb 9 2023, 5:41 PM · SRE, SRE-Access-Requests

Feb 6 2023

AKhatun_WMF added a comment to T328742: Generate list of common misspellings from wiktionary.

Week 1/2/23 - 5/2/23 Update:

Caught up on previous work on copy editing both in research team and growth team
Learned about templates in Wiktionary in different languages and the possible categories they may be in

Feb 6 2023, 10:57 PM · Research (FY2022-23-Research-April-June)

Feb 3 2023

AKhatun_WMF updated the task description for T328733: Requesting access to analytics-privatedata-users for Aisha Khatun.

Feb 3 2023, 9:21 PM · SRE, SRE-Access-Requests

Jul 11 2022

AKhatun_WMF added a project to T279416: Deploy Image content filtration model for Wikimedia Commons: WMF-Inspiration-Week-2022-ML-Collab.

Jul 11 2022, 8:58 AM · WMF-Inspiration-Week-2022-ML-Collab, artificial-intelligence

Jul 8 2022

AKhatun_WMF added a comment to T303831: Productionize Wikidata subgraph analysis.

In T303831#8063021, @EBernhardson wrote:
In terms of the exact code causing this, spark is terrible at telling us exactly where but trying to infer from the SparkUI output i think it's this join:
def getTopSubgraphItems(topSubgraphs: DataFrame): DataFrame = {
  wikidataTriples
    .filter(s"predicate='<$p31>'")
    .selectExpr("object as subgraph", "subject as item")
    .join(topSubgraphs.select("subgraph"), Seq("subgraph"), "right")

Jul 8 2022, 5:47 AM · Wikidata, Wikidata-Query-Service, Discovery-Search (Current work)

Jul 7 2022

AKhatun_WMF added a comment to T303831: Productionize Wikidata subgraph analysis.

Update:
I tested a few options in the statbox, I am not sure how much this will represent the prod env, but here goes:

Jul 7 2022, 12:20 PM · Wikidata, Wikidata-Query-Service, Discovery-Search (Current work)

AKhatun_WMF added a comment to T303831: Productionize Wikidata subgraph analysis.

In T303831#8058159, @EBernhardson wrote:

the airflow patch is deployed but i only turned on *_init dags and subgraph_mapping_weekly today (ran out of time, will do rest tomorrow).

subgraph_mapping_weekly failed the first time through. I updated executor memory from 8g to 12g but the second execution is still failing. something is quite unbalanced about the topSubgraphItems, of the 8 shards they have inputs varying from 100MB to 450MB giving executions times of ~30s on the small ones and ~8m before the final one fails.

Not specifically related to this patch, but i wonder if we could change up the SparkUtils.saveTables method to somehow take parameters in the path to specify coalesce vs repartition and the number of partitions to save by, so we only have to update the airflow invocation and not the jar as well to test variations there.

Jul 7 2022, 7:41 AM · Wikidata, Wikidata-Query-Service, Discovery-Search (Current work)

Jun 5 2022

AKhatun_WMF placed T271400: Collect analytics data such as pageview up for grabs.

Jun 5 2022, 12:56 PM · Abstract Wikipedia team

Mar 15 2022

AKhatun_WMF moved T303831: Productionize Wikidata subgraph analysis from Incoming to In Progress on the Discovery-Search (Current work) board.

Mar 15 2022, 2:10 PM · Wikidata, Wikidata-Query-Service, Discovery-Search (Current work)

AKhatun_WMF moved T303831: Productionize Wikidata subgraph analysis from Incoming to Current work on the Wikidata-Query-Service board.

Mar 15 2022, 2:10 PM · Wikidata, Wikidata-Query-Service, Discovery-Search (Current work)

AKhatun_WMF created T303831: Productionize Wikidata subgraph analysis.

Mar 15 2022, 2:08 PM · Wikidata, Wikidata-Query-Service, Discovery-Search (Current work)

AKhatun_WMF placed T299921: Estimate benefits of splitting and federating Wikidata subgraphs up for grabs.

Mar 15 2022, 1:52 PM · Wikidata, Wikidata-Query-Service

Feb 10 2022

AKhatun_WMF updated the task description for T299453: Coordinate Wikimedia's participation in GSoC 2022 and Outreachy Round 24.

Feb 10 2022, 8:45 AM · Developer-Advocacy (Oct-Dec 2022), Outreachy (Round 24), Google-Summer-of-Code (2022)

Jan 31 2022

AKhatun_WMF moved T299921: Estimate benefits of splitting and federating Wikidata subgraphs from Analysis to Current work on the Wikidata-Query-Service board.

Jan 31 2022, 2:02 PM · Wikidata, Wikidata-Query-Service

Jan 20 2022

AKhatun_WMF moved T288262: Estimate how many Wikidata items have low/no ORES score from In Progress to Needs Reporting on the Discovery-Search (Current work) board.

The analysis is done here (for Q-ids): Wikidata_Item_ORES_Score_Analysis

Jan 20 2022, 3:24 PM · Machine-Learning-Team, ORES, Discovery-Search (Current work), Wikidata, Wikidata-Query-Service

Jan 18 2022

AKhatun_WMF added a comment to T288262: Estimate how many Wikidata items have low/no ORES score.

In T288262#7629267, @Lydia_Pintscher wrote:

@AKhatun_WMF: You mention on the wiki that some Items don't have an ORES score. All Items should have one 😬 Do you have an example of one that does not?

Jan 18 2022, 5:44 PM · Machine-Learning-Team, ORES, Discovery-Search (Current work), Wikidata, Wikidata-Query-Service

AKhatun_WMF added a comment to T288262: Estimate how many Wikidata items have low/no ORES score.

In T288262#7628599, @MPhamWMF wrote:

@AKhatun_WMF , sorry, it's been a while since I wrote this, but I think what I meant when I wrote the question about "optimal separation" is given some distribution of ORES scores (e.g. a normal distribution), is it clear what the threshold is for what qualifies as a "high" vs "low" score: e.g. anything over .75 is a high score. But that's assuming the scores are continuous. I guess it's moot if they're binary (I don't actually know).

If this isn't a sensible way of thinking about the issue, let me know if there's a better way.

Jan 18 2022, 3:16 PM · Machine-Learning-Team, ORES, Discovery-Search (Current work), Wikidata, Wikidata-Query-Service

AKhatun_WMF (Aisha Khatun), Research Data Scientist (NLP) @ Research Team
