Change Details

**Background**: In previous work (T315086), we manually evaluated a sample of copyedits in 5 languages. We observed that using custom lists of common misspellings yields high-accuracy copyedits. For example, for Bengali this approach had an accuracy of >90%, compared to ~0% when using spellcheckers. Therefore, we believe that custom lists of common misspellings are a promising approach for surfacing copyedits, one that can in principle be scaled across many languages.

**Challenge**: One of the main challenges is curating such lists of common misspellings. For a few languages, the communities have compiled such lists (see for example [[ https://en.wikipedia.org/wiki/Wikipedia:Lists_of_common_misspellings/For_machines#The_Machine-Readable_List | English ]] or [[ https://de.wikipedia.org/wiki/Wikipedia:Liste_von_Tippfehlern/F%C3%BCr_Maschinen | German ]]). For most languages, such lists are not readily available. Therefore, we would like to explore how to generate such lists automatically for different languages from existing resources. Note: a generated list will most likely not be perfect, but it will constitute a first version which could then be refined.

**Idea**: We use the Wiktionary projects to find common misspellings in different languages. For example, English Wiktionary contains many entries about misspellings in different languages ([[ https://en.wiktionary.org/wiki/Category:Misspellings_by_language | category ]]). These entries are captured in a structured way via a [[ https://en.wiktionary.org/wiki/Template:misspelling_of | template ]] (example: [[ https://en.wiktionary.org/wiki/tripple | tripple ]]). Other Wiktionary projects will likely contain additional entries; the corresponding template [[ https://www.wikidata.org/wiki/Q50368067 | exists in 11 different wiktionaries ]].

**Tasks**:
[x] Identify the list of Wiktionary articles containing the relevant templates from the [[ https://www.mediawiki.org/wiki/Manual:Templatelinks_table/en | templatelinks table ]]
[x] Pick a small set of Wiktionary articles and parse the wikitext to extract the template's location (the section contains information about the language, the subsections contain information about word-forms)
[x] Parse the full English Wiktionary dump to extract misspellings from wikitext
[ ] Implement different filters for the misspellings (for example, i) make sure the word is a misspelling across all word-forms; ii) make sure the supposed misspelling is not too common by counting the number of occurrences in the respective Wikipedia)
  - We decided to use only (ii) for filtering, because the Wiktionary sections are not consistent enough to be parsed reliably and to identify whether a word is a misspelling in all forms
[ ] Parse other wiktionaries to extract misspellings (identify similar templates, annotations, etc.)
[ ] Count the number of misspellings per language
[ ] Extract all potential copyedits in Wikipedia articles using the list of misspellings in the respective languages
[ ] Manually evaluate the accuracy of the extracted copyedits in some selected languages

**Background**: In previous work (T315086), we manually evaluated a sample of copyedits in 5 languages. We observed that using custom lists of common misspellings yields high-accuracy copyedits. For example, for Bengali this approach had an accuracy of >90%, compared to ~0% when using spellcheckers. Therefore, we believe that custom lists of common misspellings are a promising approach for surfacing copyedits, one that can in principle be scaled across many languages.

**Challenge**: One of the main challenges is curating such lists of common misspellings. For a few languages, the communities have compiled such lists (see for example [[ https://en.wikipedia.org/wiki/Wikipedia:Lists_of_common_misspellings/For_machines#The_Machine-Readable_List | English ]] or [[ https://de.wikipedia.org/wiki/Wikipedia:Liste_von_Tippfehlern/F%C3%BCr_Maschinen | German ]]). For most languages, such lists are not readily available. Therefore, we would like to explore how to generate such lists automatically for different languages from existing resources. Note: a generated list will most likely not be perfect, but it will constitute a first version which could then be refined.

**Idea**: We use the Wiktionary projects to find common misspellings in different languages. For example, English Wiktionary contains many entries about misspellings in different languages ([[ https://en.wiktionary.org/wiki/Category:Misspellings_by_language | category ]]). These entries are captured in a structured way via a [[ https://en.wiktionary.org/wiki/Template:misspelling_of | template ]] (example: [[ https://en.wiktionary.org/wiki/tripple | tripple ]]). Other Wiktionary projects will likely contain additional entries; the corresponding template [[ https://www.wikidata.org/wiki/Q50368067 | exists in 11 different wiktionaries ]].

**Tasks**:
[ ] Get the relevant templates and all their redirects (in enwiktionary the template is `misspelling_of`; see [[ https://en.wiktionary.org/w/index.php?title=Special:WhatLinksHere/Template:misspelling_of&hidelinks=1&hidetrans=1 | its redirects ]])
[x] Identify the list of Wiktionary articles containing the relevant templates from the [[ https://www.mediawiki.org/wiki/Manual:Templatelinks_table/en | templatelinks table ]]
[x] Pick a small set of Wiktionary articles and parse the wikitext to extract the template's location (the section contains information about the language, the subsections contain information about word-forms)
[x] Parse the full English Wiktionary dump to extract misspellings from wikitext
[ ] Implement different filters for the misspellings (for example, i) make sure the word is a misspelling across all word-forms; ii) make sure the supposed misspelling is not too common by counting the number of occurrences in the respective Wikipedia)
  - We decided to use only (ii) for filtering, because the Wiktionary sections are not consistent enough to be parsed reliably and to identify whether a word is a misspelling in all forms
[ ] Parse other wiktionaries to extract misspellings (identify similar templates, annotations, etc.)
[ ] Count the number of misspellings per language
[ ] Extract all potential copyedits in Wikipedia articles using the list of misspellings in the respective languages
[ ] Manually evaluate the accuracy of the extracted copyedits in some selected languages
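
**Sketch**: To make the extraction and filtering tasks above more concrete, here is a rough, illustrative sketch in Python (standard library only). It is not the code used for this task. It assumes that the template call looks like `{{misspelling of|<lang code>|<correct form>}}` and that the language is given by the enclosing level-2 heading (e.g. `==English==`). The function names, the regular expressions, and the `max_count` threshold are hypothetical; redirects of the template and nested templates are not handled.

```python
import re
from collections import Counter

# Assumed template syntax: {{misspelling of|en|triple}} (space or underscore).
# Redirects of the template are NOT covered by this pattern.
TEMPLATE_RE = re.compile(r"\{\{\s*misspelling[ _]of\s*\|([^{}]*)\}\}", re.IGNORECASE)

# Level-2 headings such as "==English==" (assumed to name the language).
# Deeper headings like "===Noun===" do not match.
LANG_HEADING_RE = re.compile(r"^==\s*([^=]+?)\s*==\s*$")


def extract_misspellings(title, wikitext):
    """Return (language_section, misspelling, correct_form) tuples for one page."""
    results = []
    language = None
    for line in wikitext.splitlines():
        heading = LANG_HEADING_RE.match(line)
        if heading:
            language = heading.group(1)
            continue
        for match in TEMPLATE_RE.finditer(line):
            # Keep positional parameters only; named ones (key=value) are skipped.
            params = [p.strip() for p in match.group(1).split("|") if "=" not in p]
            if len(params) >= 2:
                # params[0] is the language code, params[1] the correct spelling.
                results.append((language, title, params[1]))
    return results


def filter_by_frequency(candidates, corpus_tokens, max_count):
    """Filter (ii): drop candidates that occur more than max_count times.

    corpus_tokens stands in for tokenized text of the respective Wikipedia;
    max_count is a hypothetical threshold that would need tuning per language.
    """
    counts = Counter(token.lower() for token in corpus_tokens)
    return [c for c in candidates if counts[c[1].lower()] <= max_count]
```

For a page titled `tripple` whose wikitext contains `# {{misspelling of|en|triple}}` under `==English==`, `extract_misspellings` would return one tuple pairing `tripple` with `triple`. The frequency filter then removes entries whose supposed misspelling is suspiciously common in the corpus, which suggests a legitimate word rather than a typo.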