Generate list of common misspellings from wiktionary
Closed, ResolvedPublic

Description

Background: In previous work (T315086) we manually evaluated a sample of copyedits in 5 languages. We observed that using custom lists of common misspellings yields high-accuracy copyedits. For example, in the case of Bengali this approach had an accuracy >90%, compared to ~0% when using spellcheckers. Therefore, we believe that custom lists of common misspellings are a promising approach for surfacing copyedits which can be scaled, in principle, across many languages.

Challenge: One of the main challenges is curating such lists of common misspellings. For a few languages, such lists have been compiled by the communities (see for example English or German). For most languages, such lists are not readily available. Therefore, we would like to explore how we can automatically generate such a list for different languages from existing resources. Note: this list will most likely not be perfect, but will constitute a first version, which could then be refined.

Idea: We use the Wiktionary projects to find common misspellings in different languages. For example, English Wiktionary contains many entries about misspellings in different languages (category). These entries are captured in a structured way via a template (example: tripple). Other Wiktionary projects will likely contain additional entries: the corresponding template exists in 11 different Wiktionaries.

Tasks:

  • Get the relevant templates and all their redirects (in enwiktionary the template is misspelling_of, and here are its redirects)
  • Identify list of wiktionary articles containing the relevant templates from templatelinks table
  • Pick a small set of wiktionary articles and parse wikitext to extract template's location (section contains information about language, subsections contain information about word-forms)
  • parse full English wiktionary dump to extract misspellings from wikitext
  • implement different filters for the misspellings, for example: make sure the word is a misspelling across all word-forms
  • To ensure this method of collecting misspellings is working, compare the collected list with existing approaches:
  • make sure the supposed misspelling is not too common by counting the number of occurrences in the respective Wikipedia
  • parse other wiktionaries to extract misspellings (identify similar templates, annotations, etc)
    • Only the misspelling_of template has been considered so far. Other similar templates exist but have not yet been explored (e.g. obsolete_spelling)
  • count number of misspellings per language
  • extract all potential copyedits in Wikipedia articles using the list of misspellings in the respective languages
  • manually evaluate the accuracy of the extracted copyedits in some selected languages

Done:

Additional work done:

Event Timeline

Week 1/2/23 - 5/2/23 Update:

  • Caught up on previous work on copy editing both in research team and growth team
  • Learned about templates in Wiktionary in different languages and the possible categories they may be in

Week 6/2/23 - 12/2/23 Update:

  • Set up jupyter notebook (fix issues with getting spark3)
  • Get list of enwiktionary pages that use the misspelling_of template using the following tables:
    • mediawiki_templatelinks, mediawiki_linktarget, mediawiki_wikitext_current
  • Parsed enwiktionary pages to get heading name (typically POS: Noun, Adj, etc), language of misspelling, and the correct spelling from the template
  • Some analysis on the parsed pages to get the language and heading distributions
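
The parsing step above can be sketched roughly as follows. This is a minimal, stdlib-only illustration and not the project's actual parser (the log mentions mwparserfromhell elsewhere). The function name `extract_misspellings`, the record fields, and the sample wikitext are all hypothetical, and the regex is a simplification that does not handle nested templates or redirect names.

```python
import re

# Matches wikitext headings such as "==English==" or "===Noun===".
HEADING_RE = re.compile(r"^(={2,6})\s*(.+?)\s*\1\s*$")
# Simplified match for {{misspelling of|...}} / {{misspelling_of|...}}.
# Assumption: no nested templates inside the parameters.
TEMPLATE_RE = re.compile(r"\{\{\s*misspelling[ _]of\s*\|([^{}]*)\}\}")


def extract_misspellings(title, wikitext):
    """Return one record per misspelling template found on a page."""
    records = []
    language = None  # name of the enclosing level-2 (language) section
    heading = None   # name of the nearest deeper heading (typically POS)
    for line in wikitext.splitlines():
        match = HEADING_RE.match(line)
        if match:
            level, name = len(match.group(1)), match.group(2)
            if level == 2:
                language, heading = name, None
            else:
                heading = name
            continue
        for template in TEMPLATE_RE.finditer(line):
            params = [p.strip() for p in template.group(1).split("|")]
            positional = [p for p in params if "=" not in p]
            # Assumed layout: first positional = language code,
            # second positional = correct spelling.
            if len(positional) >= 2:
                records.append({
                    "misspelling": title,
                    "language": language,
                    "heading": heading,
                    "lang_code": positional[0],
                    "correct": positional[1],
                })
    return records


# Hypothetical page content, written for illustration only.
SAMPLE = """==English==

===Noun===
{{en-noun}}

# {{misspelling of|en|triple}}
"""

records = extract_misspellings("tripple", SAMPLE)
```

Tracking the level-2 heading separately from deeper headings is what lets each record carry both the language of the section and the word-form heading it appeared under.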

Notes:

  • There are other templates like deliberate_misspelling_of
  • There are template redirects, e.g. missp is used to call misspelling_of
  • This template is also used in non-enwiktionary pages.

Week 13/2/23 - 19/2/23 Update:

Issue 1

  • Added redirects of misspelling_of using redirect and page tables.
    • Some debugging was required to realize that the number of templates remains the same, but we will need the list of redirects anyway to match the name of the template in wikitext.
  • Added _ and space variations of the same template
  • Parsed and saved enwiktionary misspellings as tsv file (Merge requested)
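
Matching template names in wikitext against the canonical name, its redirects, and their `_`/space variants could look something like the sketch below. The helper names are hypothetical, and the only redirect listed is `missp`, which is the one mentioned in the notes above; the real redirect list would come from the redirect and page tables.

```python
def normalize_template_name(name):
    """Treat underscores and runs of whitespace as a single space."""
    return " ".join(name.replace("_", " ").split())


def build_name_lookup(canonical, redirects):
    """Map every normalized spelling of a template name to its canonical name."""
    canonical = normalize_template_name(canonical)
    lookup = {canonical: canonical}
    for redirect in redirects:
        lookup[normalize_template_name(redirect)] = canonical
    return lookup


def resolve_template(name, lookup):
    """Return the canonical template name, or None if the name is unrelated."""
    return lookup.get(normalize_template_name(name))


# Only "missp" is taken from the notes; a full list would be queried.
LOOKUP = build_name_lookup("misspelling_of", ["missp"])
```

Normalizing both the lookup keys and the queried name means the parser does not need to enumerate every `_`/space variant explicitly.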

Issue 2

  • Went through Nazia's code. Adapted it to parse enwiki articles and collected frequency of misspelled words. Then calculated the ratio of freq_misspelling/freq_correct_spelling
    • Code running, takes quite some time.
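
As a rough illustration of the freq_misspelling/freq_correct_spelling ratio, the sketch below counts tokens over plain article text. It is not the adapted code referred to above: the tokenizer, the lowercasing, and the function names are assumptions, and a real run over enwiki would be distributed rather than done in memory.

```python
import re
from collections import Counter

# Assumption: a token is a run of word characters; case is ignored.
TOKEN_RE = re.compile(r"\w+")


def count_tokens(texts):
    """Count lowercased tokens over an iterable of plain-text articles."""
    counts = Counter()
    for text in texts:
        counts.update(token.lower() for token in TOKEN_RE.findall(text))
    return counts


def misspelling_ratio(counts, misspelling, correct):
    """Return freq_misspelling / freq_correct_spelling, or None if undefined."""
    freq_correct = counts[correct.lower()]
    if freq_correct == 0:
        return None  # the correct spelling never occurs, so no ratio
    return counts[misspelling.lower()] / freq_correct


# Made-up example sentences.
texts = [
    "The triple jump and the tripple jump.",
    "A triple crown. Triple play.",
]
counts = count_tokens(texts)
ratio = misspelling_ratio(counts, "tripple", "triple")
```

A ratio close to or above 1 would suggest the supposed misspelling is too common in the corpus to be flagged safely.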

Week 20/2/23 - 26/2/23 Update:

  • Re-prioritization of issues. We decided on doing Issue #5 first, after Issue #1.
  • Followed and fixed review comments for Issue #1.
  • Completed Issue #5 - Parse wikitext by L2 (language) sections and get number of definitions and templates per section. Some comments left.

Week 27/2/23 - 5/3/23 Update:

  • Address comments for Issue #5
    • Parse sections line by line, consider templates in # items (numbered list)
    • Count the number of definitions by # count, excluding ## #: #; and #*
    • Also change the data format a bit to make it more readable
  • To address Issue 6: get list of misspellings from another Language and compare the collected lists to existing approaches
    • Collected bnwiktionary templates. It does not have many Bangla words, and those it has are the same as the ones present in enwiktionary. Will work with the already collected Spanish misspellings instead.
    • for English, compared collected list with enwiki Lists_of_common_misspellings
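
The definition-counting rule from Issue #5 (count `#` items, excluding `##`, `#:`, `#;` and `#*`) can be expressed as a short sketch. The function name and the sample section are hypothetical.

```python
# Second characters that mark a non-definition line in a numbered list:
# "##" sub-senses, "#:" usage examples, "#;" and "#*" quotations or other notes.
EXCLUDED_SECOND_CHARS = {"#", ":", ";", "*"}


def count_definitions(section_text):
    """Count top-level numbered-list items in a wikitext section."""
    count = 0
    for line in section_text.splitlines():
        if not line.startswith("#"):
            continue
        if line[1:2] in EXCLUDED_SECOND_CHARS:
            continue
        count += 1
    return count


# Hypothetical section content, for illustration only.
SECTION = """===Noun===
# {{misspelling of|en|triple}}
#* 2010, some quotation
#: an example sentence
# a second sense
## a sub-sense
#; another note
"""

n_definitions = count_definitions(SECTION)
```

Comparing this count with the number of misspelling templates in the same section is one way to check whether a word is a misspelling across all of its senses.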

Week 6/3/23 - 12/3/23 Update:

  • Compared collected en and fr misspellings with the AutoWikiBrowser typo list. Merge requested. Summary here
  • Started working on extracting wikipedia text to find the ratio of misspellings

Week 13/3/23 - 19/3/23 Update:

  • MR4 (misspelling comparison) comments addressed. Merged.
  • Created Issue 7. Parsed simplewiki, extracted misspellings, saved as tsv. Code pushed.

Week 20/3/23 - 26/3/23 Update:

  • Continue working on Issue 7: collected edit node type from Isaac's updated code. Analyze and report the correctly and incorrectly identified misspellings by manually analyzing 100 examples from each node type.

Week 27/3/23 - 2/4/23 Update:

  • Apply additional filter information to extracted misspellings: Capitalization, word length, part of a list item, inside of quotations (in any language)
  • Still need to figure out the data's structure and add fasttext detected language information
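
The filter information listed above (capitalization, word length, list item, inside quotations) could be attached to each match along these lines. This sketch works on raw wikitext lines, which is an assumption; the log describes working with node types instead. The class, function names, and the set of quotation-mark pairs are hypothetical.

```python
from dataclasses import dataclass

# Assumed opening/closing quotation-mark pairs for a few scripts.
PAIRED_QUOTES = [("“", "”"), ("«", "»"), ("„", "“")]


@dataclass
class MatchFeatures:
    is_capitalized: bool
    word_length: int
    in_list_item: bool
    in_quotation: bool


def inside_quotation(line, start, end):
    """Heuristically decide whether line[start:end] sits inside quotes."""
    # Straight quotes: an odd number before the word means one is still open.
    if line.count('"', 0, start) % 2 == 1 and '"' in line[end:]:
        return True
    for open_q, close_q in PAIRED_QUOTES:
        last_open = line.rfind(open_q, 0, start)
        if last_open == -1:
            continue
        if close_q in line[last_open + 1:start]:
            continue  # that quotation was closed before the word
        if close_q in line[end:]:
            return True
    return False


def describe_match(line, word):
    """Return filter features for the first occurrence of word in line."""
    start = line.find(word)
    if start == -1:
        return None
    end = start + len(word)
    return MatchFeatures(
        is_capitalized=word[:1].isupper(),
        word_length=len(word),
        in_list_item=line.startswith(("*", "#")),
        in_quotation=inside_quotation(line, start, end),
    )


features = describe_match('* He said "a tripple jump" today.', "tripple")
```

Keeping the features as separate fields, rather than filtering immediately, leaves the choice of thresholds to the later analysis.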

Week 3/4/23 - 9/4/23 Update:

  • Pushed revised code that includes all additional formatting as a list (as discussed).
  • Fixed quotations detected. Added fasttext language detection.
  • Analysed collected misspellings in context. Some work needs to be done to increase the precision of the detected language.

Week 10/4/23 - 16/4/23 Update:

  • Add info on language detection (language, confidence, text sent to model). Analyze examples.
  • Add proxy tables: tables that were not detected by mwparserfromhell.
  • Separate cell data of tables: each cell in table is now a node. Stuck with cell data/paragraph text to send to model.

Week 17/4/23 - 23/4/23 Update:

  • After meeting with Martin, decided on sending each paragraph to the language detection model, and also sending cell data to the model. This only happens if misspellings are detected.
  • Refactor a bit, add docs for functions.
  • Scale to enwiki, frwiki, and dewiki and do some basic analysis.
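
The decision above (run language detection on a paragraph or table cell only when it contains a candidate misspelling) can be sketched as follows. The detector is passed in as a plain callable so that the example stays self-contained; in the project this role is played by a fasttext model, whose API is deliberately not reproduced here. All names and the record fields are hypothetical.

```python
import re
from typing import Callable, Iterable, List, Set, Tuple

# A detector takes text and returns (language, confidence).
Detector = Callable[[str], Tuple[str, float]]

TOKEN_RE = re.compile(r"\w+")


def annotate_text_units(
    units: Iterable[str], misspellings: Set[str], detect: Detector
) -> List[dict]:
    """Run language detection only on units that contain a misspelling."""
    results = []
    for text in units:
        tokens = {token.lower() for token in TOKEN_RE.findall(text)}
        found = sorted(tokens & misspellings)
        if not found:
            continue  # skip the (comparatively expensive) model call
        language, confidence = detect(text)
        results.append({
            "misspellings": found,
            "language": language,
            "confidence": confidence,
            "text": text,  # the exact text that was sent to the model
        })
    return results


# Stand-in detector for illustration; it records what it was asked about.
calls = []


def fake_detect(text):
    calls.append(text)
    return ("en", 0.9)


annotated = annotate_text_units(
    ["A tripple jump.", "A triple jump."], {"tripple"}, fake_detect
)
```

Storing the text that was sent to the model alongside the language and confidence mirrors the fields listed in the earlier update and makes low-confidence cases easier to inspect.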

Week 24/4/23 - 30/4/23 Update:

  • Incorporated Isaac's feedback for MR5 and 6. All MRs merged after some editing and discussion.
  • Created MR7 to refactor the repo
  • Extracted misspellings from all language Wikipedias.
    • Todo: analysis on extracted data

Week 1/5/23 - 7/5/23 Update:

  • Incorporated feedback and had MR7 merged (refactor repo)
  • Analysis done on extracted misspellings, sent MR8. Based on feedback, some more analysis done.

Week 8/5/23 - 14/5/23 Update:

  • Created Issues 12 and 13. Started working on them: identify misspelling of templates in other languages and find usage of these templates. The templates would be collected from Q50368067, misspelling of named templates in other languages, and their redirects.

Week 15/5/23 - 21/5/23 Update:

  • Checked example uses of the misspelling of templates in all 16 collected languages. All languages look similar to enwiktionary except trwiktionary (small change) and viwiktionary

Week 22/5/23 - 28/5/23 Update:

  • Updated MR9 with summary
  • Created Issue 14. Changed wiktionary parser script to make it work with all languages. Need to figure out some changes in template params.

Week 29/5/23 - 4/6/23 Update:

  • Fixed template parsing to accommodate the use of lang param in template
  • Parsed and saved misspellings from all language Wiktionaries.
  • Did some analysis and de-duplication, and saved the combined all_wiki Wiktionary misspellings.
  • More errors found in template parsing (named params occurring before unnamed params cause incorrect parsing)
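
The parsing error described above arises when parameters are read purely by position. One way to avoid it is to separate named from unnamed parameters first, as in the sketch below. The parameter layout is an assumption (either a named `lang` parameter with the correct spelling as the first unnamed parameter, or the language code and correct spelling as the first two unnamed parameters), `nocap=1` is only an illustrative named parameter, and the function names are hypothetical. Nested templates and piped links inside parameters are not handled.

```python
def split_template_params(body):
    """Split 'name|a|k=v|b' into (name, positional list, named dict)."""
    parts = [part.strip() for part in body.split("|")]
    name, raw_params = parts[0], parts[1:]
    positional, named = [], {}
    for param in raw_params:
        key, sep, value = param.partition("=")
        if sep:
            named[key.strip()] = value.strip()
        else:
            positional.append(param)
    return name, positional, named


def parse_misspelling_template(body):
    """Extract language code and correct spelling regardless of param order."""
    name, positional, named = split_template_params(body)
    if "lang" in named:
        # Assumed layout: lang is named, correct spelling is first unnamed.
        lang = named["lang"]
        correct = positional[0] if positional else None
    else:
        # Assumed layout: language code first, correct spelling second.
        lang = positional[0] if positional else None
        correct = positional[1] if len(positional) > 1 else None
    return {"template": name, "lang": lang, "correct": correct}


parsed_plain = parse_misspelling_template("misspelling of|en|triple")
parsed_named = parse_misspelling_template("misspelling of|nocap=1|lang=es|hacer")
```

Because unnamed parameters are indexed only among themselves, a named parameter appearing first no longer shifts the position of the correct spelling.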

Week 5/6/23 - 11/6/23 Update:

  • Fixed template parsing error, MR10 got merged.
  • Started working on report

Week 12/6/23 - 18/6/23 Update:

  • Finished report (Images not added yet)
  • MR11 sent for readme and figures

Week 19/6/23 - 25/6/23 Update:

  • Final edits in report, uploaded images. Report.
  • Final edits in readme. MR11 Merged.
  • Work on redirect links: found misspellings using Template:R_from_misspelling in Wikipedia and performed basic analysis. Sent MR12.

Week 26/6/23 - 2/7/23 Update:

  • Listed and analyzed redirects in Wiktionary.
AKhatun_WMF updated the task description.