T328742 Generate list of common misspellings from wiktionary

		Status	Subtype	Assigned	Task
		Resolved		MGerlach	T293034 [EPIC] Research support for Copyediting as a structured tasks
		Resolved		AKhatun_WMF	T328742 Generate list of common misspellings from wiktionary

MGerlach created this task. Feb 3 2023, 9:42 AM

Week 1/2/23 - 5/2/23 Update:

Caught up on previous work on copy editing both in research team and growth team
Learned about templates in Wiktionary in different languages and the possible categories they may be in

MGerlach updated the task description. Feb 9 2023, 11:10 AM

Week 6/2/23 - 12/2/23 Update:

Set up Jupyter notebook (fixed issues with getting spark3)
Got list of enwiktionary pages that use the misspelling_of template using the following tables:
- mediawiki_templatelinks, mediawiki_linktarget, mediawiki_wikitext_current
Parsed enwiktionary pages to get heading name (typically POS: Noun, Adj, etc.), language of misspelling, and the correct spelling from the template
Some analysis on parsed wikis to get language and heading distribution

Notes:

There are other templates like deliberate_misspelling_of
There are template redirects: missp is used to call misspelling_of
This template is also used in non-enwiktionary pages.

AKhatun_WMF updated the task description. Feb 16 2023, 6:50 PM

AKhatun_WMF updated the task description. Feb 16 2023, 9:12 PM

Week 13/2/23 - 19/2/23 Update:

Issue 1

Added redirects of misspelling_of using redirect and page tables.
- some debugging was required to realize that the number of templates remains the same, but we will need the list of redirects anyway to match the name of the template in wikitext.
Added _ and space variations of the same template
Parsed and saved enwiktionary misspellings as tsv file (Merge requested)
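
The redirect and `_`/space handling above could look roughly like the following. This is only a sketch under assumptions: the function name `name_variants` is made up for illustration, and the input list stands in for whatever the redirect and page tables actually return.

```python
def name_variants(names):
    """Return every template name in both its space and underscore spelling.

    `names` is assumed to hold the main template name plus its redirects,
    e.g. as collected from the redirect and page tables.
    """
    variants = set()
    for name in names:
        name = name.strip()
        variants.add(name.replace("_", " "))
        variants.add(name.replace(" ", "_"))
    return variants


# Hypothetical input: the main template and one redirect mentioned in the notes.
templates = name_variants(["misspelling_of", "missp"])
print(sorted(templates))
```

Matching wikitext against the full set of variants avoids missing a template call only because it was written with a space instead of an underscore, or through a redirect name.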

Issue 2

Went through Nazia's code. Adapted it to parse enwiki articles and collected frequency of misspelled words. Then calculated the ratio of freq_misspelling/freq_correct_spelling
- Code running, takes quite some time.
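
For reference, the freq_misspelling/freq_correct_spelling ratio can be sketched on plain text as below. This is not the adapted code from the repo; the tokenization (lowercased `\w+` runs) and the name `spelling_ratios` are assumptions made for the example.

```python
import re
from collections import Counter


def spelling_ratios(text, pairs):
    """Compute freq_misspelling / freq_correct_spelling for each pair.

    `pairs` is an iterable of (misspelling, correct_spelling) tuples in
    lowercase. Pairs whose correct spelling never occurs are skipped,
    because the ratio is undefined there.
    """
    counts = Counter(re.findall(r"\w+", text.lower()))
    ratios = {}
    for wrong, right in pairs:
        if counts[right] == 0:
            continue
        ratios[(wrong, right)] = counts[wrong] / counts[right]
    return ratios


text = "They recieve mail. We receive mail and receive parcels."
print(spelling_ratios(text, [("recieve", "receive")]))
```

A low ratio suggests the misspelling is rare relative to the correct form, which is one way to rank candidates.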

AKhatun_WMF updated the task description. Feb 18 2023, 4:40 AM

Week 20/2/23 - 26/2/23 Update:

Re-prioritization of issues. We decided on doing Issue #5 first, after Issue #1.
Followed and fixed review comments for Issue #1.
Completed Issue #5 - Parse wikitext by L2 (language) sections and get number of definitions and templates per section. Some comments left.

AKhatun_WMF updated the task description. Feb 25 2023, 6:22 AM

AKhatun_WMF updated the task description. Mar 3 2023, 1:59 AM

Week 27/2/23 - 5/3/23 Update:

Address comments for Issue #5
- Parse sections line by line, consider templates in # items (numbered list)
- Count the number of definitions by # count, excluding ## #: #; and #*
- Also change the data format a bit to make it more readable
To address Issue 6: get list of misspellings from another Language and compare the collected lists to existing approaches
- collected bnwiktionary templates. It does not have many Bangla words; they are the same as those present in enwiktionary. Will work with the existing collected Spanish misspellings instead.
- for English, compared collected list with enwiki Lists_of_common_misspellings
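
The line-by-line counting rule for Issue #5 (count `#` items per L2 section, excluding `##`, `#:`, `#;` and `#*`) can be illustrated with a small sketch. The regular expressions and the function name `definitions_per_language` are assumptions for this example, not the code in the merge request.

```python
import re

# L2 headings look like "==English=="; deeper headings ("===Noun===") must not match.
L2_HEADING = re.compile(r"^==([^=]+)==\s*$")
# A definition line starts with a single "#" not followed by "#", ":", ";" or "*".
DEFINITION = re.compile(r"^#(?![#:;*])")


def definitions_per_language(wikitext):
    """Count top-level numbered-list items under each L2 (language) section."""
    counts = {}
    language = None
    for line in wikitext.splitlines():
        heading = L2_HEADING.match(line)
        if heading:
            language = heading.group(1).strip()
            counts.setdefault(language, 0)
        elif language is not None and DEFINITION.match(line):
            counts[language] += 1
    return counts


sample = "\n".join([
    "==English==",
    "===Noun===",
    "# first sense",
    "#: usage example",
    "## subsense",
    "# {{misspelling of|en|receive}}",
    "#* quotation",
    "==French==",
    "# sense",
])
print(definitions_per_language(sample))
```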

AKhatun_WMF updated the task description. Mar 11 2023, 2:42 AM

Week 6/3/23 - 12/3/23 Update:

Compared collected en and fr misspellings with the AutoWikiBrowser typo list. Merge requested. Summary here
Started working on extracting wikipedia text to find the ratio of misspellings

Week 13/3/23 - 19/3/23 Update:

MR4 (misspelling comparison) comments addressed. Merged.
Created Issue 7. Parsed simplewiki, extracted misspellings, saved as tsv. Code pushed.

Week 20/3/23 - 26/3/23 Update:

Continue working on Issue 7: collected edit node type from Isaac's updated code. Analyze and report the correctly and incorrectly identified misspellings by manually analyzing 100 examples from each node type.

Week 27/3/23 - 2/4/23 Update:

Apply additional filter information to extracted misspellings: Capitalization, word length, part of a list item, inside of quotations (in any language)
Still need to figure out the data's structure and add fasttext detected language information
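
As a rough illustration of the filter information listed above (capitalization, word length, list item, inside quotations), a per-word feature record could be built as follows. Everything here is an assumption for the sketch: the function names, the returned keys, the list of quotation-mark pairs, and the simplification that quotations open and close on the same line.

```python
# Assumed set of quotation-mark pairs; real text in other languages may need more.
QUOTE_PAIRS = [('"', '"'), ("“", "”"), ("«", "»"), ("„", "“"), ("「", "」")]


def inside_quotes(line, start):
    """Return True if position `start` lies between an opening and closing quote."""
    before, after = line[:start], line[start:]
    for open_q, close_q in QUOTE_PAIRS:
        if open_q == close_q:
            # Symmetric quotes: an odd number before the word means one is open.
            opened = before.count(open_q) % 2 == 1
        else:
            # Asymmetric quotes: the last opener must come after the last closer.
            opened = before.rfind(open_q) > before.rfind(close_q)
        if opened and close_q in after:
            return True
    return False


def word_features(line, word):
    """Describe the first occurrence of `word` in a wikitext line."""
    start = line.find(word)
    if start == -1:
        return None
    return {
        "capitalized": word[:1].isupper(),
        "length": len(word),
        "in_list_item": line.lstrip().startswith(("*", "#")),
        "in_quotes": inside_quotes(line, start),
    }


print(word_features('* He wrote "I recieve it" once.', "recieve"))
```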

Miriam moved this task from FY2022-23-Research-January-March to FY2022-23-Research-April-June on the Research board. Apr 4 2023, 4:05 PM

Miriam edited projects, added Research (FY2022-23-Research-April-June); removed Research (FY2022-23-Research-January-March).

Week 3/4/23 - 9/4/23 Update:

Pushed revised code that includes all additional formatting as a list (as discussed).
Fixed quotations detected. Added fasttext language detection.
Analysed collected misspellings from context. Some work needs to be done to increase precision of detected language.

Week 10/4/23 - 16/4/23 Update:

Add info on language detection (language, confidence, text sent to model). Analyze examples.
Add proxy tables: tables that were not detected by mwparserfromhell.
Separate cell data of tables: each cell in table is now a node. Stuck with cell data/paragraph text to send to model.

Week 17/4/23 - 23/4/23 Update:

After meeting with Martin, decided on sending each paragraph to the language detection model, and also sending cell data to the model. These only happen if misspellings are detected.
Refactor a bit, add docs for functions.
Scale to enwiki, frwiki, and dewiki and do some basic analysis.

Week 24/4/23 - 30/4/23 Update:

Incorporated Isaac's feedback for MR5 and 6. All MRs merged after some editing and discussion.
Created MR7 to refactor the repo
Extracted misspellings from all language Wikipedias.
- Todo: analysis on extracted data

Week 1/5/23 - 7/5/23 Update:

Incorporated feedback and had MR7 merged (refactor repo)
Analysis done on extracted misspellings, sent MR8. Based on feedback, some more analysis done.

Week 8/5/23 - 14/5/23 Update:

Created Issues 12 and 13. Started working on them: identify "misspelling of" templates in other languages and find usage of these templates. The templates would be collected from Q50368067, templates named "misspelling of" in other languages, and their redirects.

Week 15/5/23 - 21/5/23 Update:

Checked example use of "misspelling of" templates in all 16 collected languages. All languages look similar to enwiktionary except trwiktionary (small change) and viwiktionary

Week 22/5/23 - 28/5/23 Update:

Updated MR9 with summary
Created Issue 14. Changed wiktionary parser script to make it work with all languages. Need to figure out some changes in template params.

Week 29/5/23 - 4/6/23 Update:

Fixed template parsing to accommodate the use of lang param in template
Parsed and saved all language wiktionary misspelling.
Did some analysis, de-duplication, and saved all_wiki combined wiktionary misspellings.
More errors found in template parsing (named params occur before un-named params causing incorrect parsing)
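
The two parsing problems noted above (language given through a lang param, and named params appearing before un-named ones) can be handled by separating named from positional parameters before reading positions. The sketch below assumes a flat template with no nested templates or links containing `|`; the function name and the `nocap` parameter in the usage line are hypothetical.

```python
def parse_misspelling_template(template):
    """Extract language and correct spelling from a flat template call.

    Named params (containing "=") are set aside first, so they cannot shift
    the positions of the un-named params. Assumes no nested templates.
    """
    inner = template.strip()[2:-2]  # drop the surrounding "{{" and "}}"
    name, *params = [part.strip() for part in inner.split("|")]
    named, positional = {}, []
    for param in params:
        if "=" in param:
            key, value = param.split("=", 1)
            named[key.strip()] = value.strip()
        else:
            positional.append(param)
    if "lang" in named:
        # Language passed as lang=..., so the first positional is the spelling.
        language, rest = named["lang"], positional
    else:
        language = positional[0] if positional else None
        rest = positional[1:]
    correct = rest[0] if rest else None
    return {"template": name, "language": language, "correct": correct}


# "nocap=1" is a made-up named param placed before the un-named ones.
print(parse_misspelling_template("{{misspelling of|nocap=1|en|receive}}"))
```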

AKhatun_WMF updated the task description. Jun 9 2023, 2:29 AM

Week 5/6/23 - 11/6/23 Update:

Fixed template parsing error, MR10 got merged.
Started working on report

Week 12/6/23 - 18/6/23 Update:

Finished report (Images not added yet)
MR11 sent for readme and figures

Week 19/6/23 - 25/6/23 Update:

Final edits in report, uploaded images. Report.
Final edits in readme. MR11 Merged.
Redirect links work. Find misspellings using Template:R_from_misspelling in Wikipedia, perform basic analysis. Sent MR12.

Week 26/6/23 - 2/7/23 Update:

Listed and analyzed redirects in Wiktionary.

AKhatun_WMF closed this task as Resolved. Jun 30 2023, 8:08 PM

AKhatun_WMF updated the task description.
