Background: In previous work (T315086) we manually evaluated a sample of copyedits in 5 languages. We observed that using custom lists of common misspellings yields high-accuracy copyedits. For example, for Bengali this approach had an accuracy of >90%, compared to ~0% when using spellcheckers. Therefore, we believe that custom lists of common misspellings are a promising approach for surfacing copyedits which can be scaled, in principle, across many languages.
Challenge: One of the main challenges is curating such lists of common misspellings. For a few languages, such lists have been compiled by the communities (see for example English or German). For most languages, however, such lists are not readily available. Therefore, we would like to explore how to generate such lists automatically for different languages from existing resources. Note: such a list will most likely not be perfect, but it will constitute a first version, which could then be refined.
Idea: We use the Wiktionary projects to find common misspellings in different languages. For example, English Wiktionary contains many entries about misspellings in different languages (category). These entries are captured in a structured way via a template (example: tripple). Other Wiktionary projects will likely contain additional entries; the corresponding template exists in 11 different Wiktionaries.
- Get the relevant template and all its redirects (it is misspelling_of in enwiktionary and here are its redirects)
- Identify the list of Wiktionary articles containing the relevant templates from the templatelinks table
- Pick a small set of Wiktionary articles and parse their wikitext to extract the template's location (the section contains information about the language, subsections contain information about word-forms)
- Parse the full English Wiktionary dump to extract misspellings from wikitext
- implement different filters for the misspellings, for example: make sure the word is a misspelling across all word-forms
- To ensure this method of collecting misspellings is working, compare the collected list with existing approaches:
- make sure the supposed misspelling is not too common by counting the number of occurrences in the respective Wikipedia
- parse other Wiktionaries to extract misspellings (identify similar templates, annotations, etc.)
- Only the misspelling_of template has been considered for now. Other similar templates exist but have not yet been explored (e.g. obsolete_spelling)
- count the number of misspellings per language
- extract all potential copyedits in Wikipedia articles using the list of misspellings in the respective languages
- manually evaluate the accuracy of the extracted copyedits in some selected languages
- Code: https://gitlab.wikimedia.org/repos/research/copyedit-common-misspellings
- Writeup: https://meta.wikimedia.org/wiki/Research:Copyediting_as_a_structured_task/Common_misspellings_wiktionary
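The wikitext-parsing step in the list above can be illustrated with a minimal Python sketch. It is not the code from the repository linked above. It assumes the template is called as `{{misspelling of|<language code>|<correct word>}}`, with the language code and the correct spelling as the first two positional parameters, and that the entry title is the misspelled form. The alias `missp` is a hypothetical stand-in for the template's real redirects. A plain regular expression is used to keep the example dependency-free; a proper wikitext parser would handle nested templates and piped links more robustly.

```python
import re

# Assumed template names, normalised to lowercase with spaces.
# "missp" is a hypothetical redirect alias used only for illustration.
TEMPLATE_NAMES = {"misspelling of", "missp"}

# Matches innermost {{...}} template calls (no nested braces inside).
TEMPLATE_RE = re.compile(r"\{\{([^{}]*)\}\}")


def extract_misspellings(title, wikitext):
    """Return (lang_code, misspelling, correct_word) tuples for one entry.

    `title` is the entry title, assumed here to be the misspelled form.
    """
    results = []
    for match in TEMPLATE_RE.finditer(wikitext):
        parts = [p.strip() for p in match.group(1).split("|")]
        name = parts[0].replace("_", " ").lower()
        if name not in TEMPLATE_NAMES:
            continue
        # Keep positional parameters only; named ones contain "=".
        positional = [p for p in parts[1:] if "=" not in p]
        if len(positional) < 2:
            continue  # malformed call: language or target missing
        lang, correct = positional[0], positional[1]
        results.append((lang, title, correct))
    return results


sample = (
    "==English==\n\n===Adjective===\n{{head|en|misspelling}}\n\n"
    "# {{misspelling of|en|triple}}\n"
)
print(extract_misspellings("tripple", sample))
```

Grouping the resulting tuples by language code gives the per-language lists that the later filtering and counting steps operate on.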
Additional work done:
- Wikipedia Template:R_from_misspelling analysis.
- Wiktionary redirects analysis.
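
The frequency filter and the copyedit-extraction step listed earlier can be sketched in the same spirit. This is an illustration under stated assumptions, not the project's implementation: the misspelling list is assumed to be a dict mapping lowercase misspellings to their corrections, the corpus is a list of plain-text strings (wikitext markup already stripped), and `max_count` is a hypothetical threshold that would need tuning per language. Tokenisation with `\w+` is a simplification that will not suit every script.

```python
import re
from collections import Counter

# Simplistic word tokeniser; an assumption, not suitable for all scripts.
WORD_RE = re.compile(r"\w+")


def filter_rare(misspellings, corpus_texts, max_count):
    """Drop supposed misspellings that occur more than `max_count` times.

    A very frequent "misspelling" is likely a valid word in some context.
    """
    counts = Counter()
    for text in corpus_texts:
        counts.update(w.lower() for w in WORD_RE.findall(text))
    return {
        wrong: right
        for wrong, right in misspellings.items()
        if counts[wrong.lower()] <= max_count
    }


def find_copyedits(text, misspellings):
    """Return (offset, found_word, suggested_correction) for each hit."""
    hits = []
    for match in WORD_RE.finditer(text):
        word = match.group(0)
        correct = misspellings.get(word.lower())
        if correct is not None:
            hits.append((match.start(), word, correct))
    return hits


candidates = {"tripple": "triple", "the": "thee"}
corpus = ["the cat and the dog", "a tripple jump", "the end"]
kept = filter_rare(candidates, corpus, max_count=2)
print(kept)
print(find_copyedits("He won the Tripple jump.", kept))
```

The hits returned by `find_copyedits` are the candidate copyedits that would then go to manual evaluation.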