
Create a list with examples of automatically suggested copyedits for manual evaluation
Closed, ResolvedPublic

Description

In previous research T305180 we showed that some tools (such as LanguageTool) can surface copyedits, however, automatic evaluation in the context of Wikipedia is difficult due to lack of ground truth data.

In this task, we want to generate a short list of copyedits with one of the previously discussed methods (LanguageTool, spellcheckers, etc) for manual evaluation in order to assess whether the suggested copyedits are any good (correspond to genuine copyedit-errors). Ideally, the manual evaluation would label each suggested copyedit as good/bad from which we will calculate the fraction of good suggestions (precision).
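
For reference, a minimal sketch of how the precision could be computed once each suggestion has been labeled (the good/bad label values are an assumption about how the evaluation sheet might encode the judgments):

```python
# Minimal sketch: precision of suggested copyedits from manual labels.
# The label values ("good"/"bad") are assumptions, not an agreed-upon format.
def precision(labels):
    """Fraction of suggested copyedits labeled as good."""
    good = sum(1 for label in labels if label == "good")
    return good / len(labels) if labels else 0.0

# Example: 7 of 10 suggestions judged good -> precision = 0.7
print(precision(["good"] * 7 + ["bad"] * 3))
```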

Specifically, we want:

  • each copyedit in the list contains: wiki_db, page_title, the sentence of the text, the word/substring that contains the error, (if possible) a suggestion for improvement
  • the list of copyedits should be short enough to be manageable by a human; this means copyedits for no more than 100 articles (probably fewer)
  • start with copyedits for one or all of the four pilot wikis: ar, bn, cs, es

Event Timeline

KStoller-WMF renamed this task from Create a list with examples of automcatially suggested copyedits for manual evaluation to Create a list with examples of automatically suggested copyedits for manual evaluation.Aug 12 2022, 5:08 PM

Update week 2022-08-15:

Summary: I created a list of 100 samples of potential copyedits in Wikipedia articles for arwiki, bnwiki, cswiki, eswiki (pilot-wikis) and enwiki (as a test-case to debug because I am not familiar with the other languages) with two different methods:

  • LanguageTool (bnwiki and cswiki are not supported so no copyedits in these wikis)
  • Hunspell-spellchecker

The results are visible in this spreadsheet: https://docs.google.com/spreadsheets/d/1XgFqZNZ0-YWmQJkBkXXmawNEJPc1tz54j2AJGOLGt_8/edit#gid=1019257284

In detail, here is what I did to create this list:

  • I started with a subset of the first 10,000 articles from the HTML dumps, using the 20220801 snapshot of the respective wiki, and extracted the plain text from the HTML version of each article (trying to remove any tables, images, etc.)
  • I then ran LanguageTool and the Hunspell spellchecker on the plain text. Looking at the results in enwiki I found many false positives, so I applied a range of filters (these are heuristics, but are inspired by similar approaches used in, e.g., the moss-tool).
  • Filters for LanguageTool:
    • Remove error if the first letter of the matched word is uppercase
    • Remove error if the matched word is a substring of any of the strings extracted from annotated strings (anchor-text in links, strings in italics, etc)
    • Remove error if the matched word consists only of whitespace (e.g. when there are two whitespaces instead of just one)
    • Remove error if position of the error overlaps with any text that has an annotation (e.g. links, italics)
    • Remove error if position of the error overlaps with any text in quotes
    • Remove error if position of the error overlaps with any text in brackets
  • Filters for Spellchecker -- same filters as for LanguageTool, plus additional filters due to the large number of false positives (see the sketch after this list):
    • Remove error if matched word is 3 or fewer characters (symbols etc)
    • Remove error if character before or after the matched word is a hyphen (“-”)
    • Remove error if the Levenshtein-distance between matched word and top-suggestion is larger than 3 (only if there is a suggestion)
    • Remove error if the matched word exists in any of the spellchecking dictionaries of the same language but different locale (i.e. dialect such as en_US, en_GB, etc).
    • Remove error if the matched word exists in a Personal Word List (pwl). The pwl for each language is created from all page-titles of its language-version of Wikipedia and Wiktionary. For example for enwiki, the pwl contains >10M words.
  • I then selected the first 100 articles for which there is at least one error left after the filtering. We only consider articles that have not been edited in at least 1 year. For each article I picked only one error randomly such that we have 100 errors from 100 different articles.
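
A simplified sketch of the spellchecker post-filters listed above. This is an illustration, not the actual code: the error representation (matched word plus surrounding characters), the personal word list and per-locale dictionaries as plain Python sets, and the small Levenshtein helper are all assumptions.

```python
# Simplified sketch of the spellchecker post-filters described above (heuristics only).
def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming edit distance (no external library)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def keep_error(word, before, after, suggestion, pwl, locale_dicts, annotated_strings):
    """Return True if a spellchecker match survives all heuristic filters."""
    if word[:1].isupper():                                  # likely proper noun
        return False
    if word.isspace() or len(word) <= 3:                    # whitespace runs, symbols, abbreviations
        return False
    if before.endswith("-") or after.startswith("-"):       # part of a hyphenated compound
        return False
    if suggestion and levenshtein(word, suggestion) > 3:    # top suggestion too far from the word
        return False
    if any(word in d for d in locale_dicts):                # valid in another locale (en_GB, ...)
        return False
    if word in pwl:                                         # in personal word list built from page titles
        return False
    if any(word in s for s in annotated_strings):           # inside links, italics, etc.
        return False
    return True
```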

Update week 2022-09-05:

  • met with ambassadors to discuss the manual evaluation of the sample of copyedits from LanguageTool and the Hunspell spellchecker (results/comments in this spreadsheet)
  • main takeaways for me are that:
    • precision for LanguageTool's copyedits was judged >50% across all wikis. The comments from ambassadors gave good starting points for further improving the filters to decrease the number of false positives
    • the precision for the spellchecker's copyedits was judged <50% across all wikis (best case was English with 39% but Czech had only 19% and Bengali yielded 0%!). This suggests to me that it will be difficult to filter the extremely large number of false positives from running the spellcheckers on Wikipedia articles, especially for non-English languages. I believe that it would be better to look for a narrower but more specific set of typos, e.g. specified via lists such as List of common misspellings in English. The main challenge would be how to generate such lists.

Next steps:

  • go through ambassador's comments in detail and identify improvements to filtering of LanguageTool's copyedits.

Weekly update:

  • inspected in more detail the challenges of the spellchecker in Bengali.
    • spent some time double-checking that my results are correct. For this I copy-pasted some of the samples into LibreOffice; LibreOffice's spellchecker (which uses the same hunspell dictionary) yields a similarly high number of false positives.
      Screenshot from 2022-09-16 17-18-30.png (521×1 px, 115 KB)
    • reached out to Santhosh (Language Team) to discuss potential improvements. It seems that spellchecking via dictionary-based approaches for highly inflected and agglutinative languages, such as Bengali, is not a solved problem. As a result, a functional spellchecking system that we can just use is not readily available.
  • one promising alternative to a spellchecker that checks all text is to only look for very specific spelling errors and thus, hopefully, improve the accuracy.
    • I spent some time identifying existing lists of spelling errors, such as English Wikipedia's list of common misspellings. Similar lists exist for 20+ wikis (Q10957404). These lists are actually used by (semi-)automatic tools such as WPCleaner or AutoWikiBrowser to help editors with maintenance or repetitive tasks. An interesting alternative is Template:misspelling of in English Wiktionary, which provides thousands of examples of common misspellings in more than 100 languages.
    • as a test-case, I used one of the lists in English containing 4291 common misspellings. Checking 10,000 articles in enwiki (from the 2022-08 snapshot) I found 5 spelling errors (after applying some filters such as ignoring text in quotes etc.) -- all of which were genuine spelling errors (campagin → campaign, verison → version, prefered → preferred, extremly → extremely, meaing → meaning). However, inspecting the current version of each article, 4 out of 5 errors had already been fixed by editors since the snapshot was created. (A simplified sketch of such a scan follows this list.)
  • Summary:
    • fixing dictionary-based spellcheckers for highly inflected/agglutinative languages (such as Bengali) is extremely difficult (not feasible)
    • checking for specific spelling errors (using lists of common misspellings) yields high-precision copyedit suggestions. The main challenges with this approach are: i) for most languages, we currently don't have such lists of common misspellings; ii) for languages in which such lists already exist, editors seem to be quick to fix those errors using (semi-)automatic tools such as WPCleaner or AutoWikiBrowser.
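
To illustrate the misspelling-list approach, here is a minimal sketch of such a scan; the tiny MISSPELLINGS mapping and the plain-text input are assumptions (the real lists are much longer and the text comes from the HTML dumps):

```python
import re

# Sketch: scan plain article text against a (misspelling -> correction) list.
# MISSPELLINGS is a tiny hypothetical mapping; a real run would load one of
# the community-maintained lists of common misspellings.
MISSPELLINGS = {"verison": "version", "prefered": "preferred", "extremly": "extremely"}

def find_misspellings(text):
    """Yield (word, correction, surrounding context) for every hit."""
    for match in re.finditer(r"\w+", text):
        word = match.group(0)
        if word.lower() in MISSPELLINGS:
            start, end = match.span()
            context = text[max(0, start - 40):end + 40]
            yield word, MISSPELLINGS[word.lower()], context

for hit in find_misspellings("The new verison was extremly popular."):
    print(hit)
```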

Weekly update:

  • compiled a new sample of copyedit errors in this spreadsheet (v2) https://docs.google.com/spreadsheets/d/1ponuT-jwEM4KF9XCG1Q86mkDYEGm8-F66PaG1U0XQ3M/edit#gid=0
    • following the same approach as before, I used the 20220801 snapshot of the HTML-dumps. I then selected the first 100 articles for which there is at least one error left after the filtering. We only consider articles that have not been edited in at least 1 year. For each article I picked only one error randomly such that we have 100 errors from 100 different articles.
    • I improved some of the details of the pre-processing of the text and the post-processing of the errors.
      • pre-processing: we now keep the paragraph-id from the HTML where the error was found. This will make it easier to locate the error again downstream in the application
      • post-processing: improved the filtering of errors, now also removing i) errors where the only correction is a hyphen (the previous corrections seemed too strict and debatable); ii) errors that are marked with sic; iii) errors that appear in quotes; iv) errors that relate to proper nouns (filtering words where the first letter is capitalized)
  • List of common misspellings: I looked for misspelled words in ar, bn, cs, es, en using the lists of misspelled words compiled by the ambassadors (thank you!). These lists each contain 20 misspelled words. Roughly 1 in 1,000 articles has at least one of these spelling mistakes (an indication that we might need to extend these lists of common misspellings in some cases)
    • ar: 100 articles with errors after checking 94,994 articles
    • bn: 100 articles with errors after checking 54,189 articles
    • cs: 100 articles with errors after checking 124,892 articles
    • es: 56 articles with errors after checking all 1,732,127 articles in the dump
    • en: 100 articles with errors after checking 1,277,730 articles (for this I actually used a much longer [[ https://en.wikipedia.org/wiki/Wikipedia:Lists_of_common_misspellings/For_machines#The_Machine-Readable_List | list of common spelling mistakes ]] since the original list did not yield errors in any article)
  • LanguageTool: I looked for errors from LanguageTool in ar, es, en (unfortunately not supported in bn and cs). I filtered out a set of error categories/types/rules based on the most common false positives from the evaluation in the previous round (v1); a filtering sketch follows this list. Note that, in principle, these rules can be added to or removed from the filter separately for each wiki:
    • PUNCTUATION (category): feedback mentioned many false positives related to adding commas
    • STYLE (category): feedback mentioned many false positives where the flagged text isn't clearly wrong or right
    • REDUNDANCY (category): feedback mentioned many false positives where the flagged text isn't clearly wrong or right
    • HUNSPELL_RULE_AR (rule): rule related to typos in Arabic which caused many false positives
    • MORFOLOGIK_RULE_ES (rule): rule related to typos in Spanish which caused many false positives
    • UPPERCASE_SENTENCE_START (rule): rule requiring capital letter at sentence start; this yielded many false positives due to abbreviations causing wrong identification of the beginning of a sentence
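
A minimal sketch of this per-category/per-rule filtering, assuming the language_tool_python wrapper (the actual client used here may be different, so treat the attribute names as assumptions):

```python
# Sketch: drop LanguageTool matches from noisy categories/rules.
# Assumes the language_tool_python wrapper; attribute names (ruleId, category)
# follow that package and may differ in other clients. The ignore sets could
# be configured separately per wiki.
import language_tool_python

IGNORED_CATEGORIES = {"PUNCTUATION", "STYLE", "REDUNDANCY"}
IGNORED_RULES = {"HUNSPELL_RULE_AR", "MORFOLOGIK_RULE_ES", "UPPERCASE_SENTENCE_START"}

def filtered_matches(text, lang="en-US"):
    tool = language_tool_python.LanguageTool(lang)
    try:
        matches = tool.check(text)
    finally:
        tool.close()
    return [m for m in matches
            if m.category not in IGNORED_CATEGORIES and m.ruleId not in IGNORED_RULES]
```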


Weekly update:

  • created a more balanced sample of copyedits from lists of common misspellings ([[ copyedits_v2_common-misspellings-balanced | spreadsheet ]])
    • some misspellings occur very often while most occur rarely, which leads to an overrepresentation of a few misspellings in the sample
    • I parse the whole dump of the respective Wikipedias (ar, bn, cs, es, en) and collect all occurrences of each misspelling
    • I then keep at most 5 occurrences of each misspelling (see the sketch after this list)
  • more generally, this suggests we would need to figure out ways to extend the list of common misspellings in order to surface a more diverse set of misspellings
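
A minimal sketch of this downsampling, assuming the occurrences have already been collected as (misspelling, page_title, context) tuples:

```python
import random
from collections import defaultdict

# Sketch: cap each misspelling at 5 sampled occurrences so that very frequent
# misspellings do not dominate the evaluation sample.
def balanced_sample(occurrences, per_misspelling=5, seed=0):
    """occurrences: iterable of (misspelling, page_title, context) tuples."""
    by_word = defaultdict(list)
    for occ in occurrences:
        by_word[occ[0]].append(occ)
    rng = random.Random(seed)
    sample = []
    for occs in by_word.values():
        sample.extend(rng.sample(occs, min(per_misspelling, len(occs))))
    return sample
```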

weekly updates:

  • Kirsten shared results from 2nd round of manual evaluation (spreadsheet)
  • refinements for LanguageTool substantially improve the accuracy. My guess is that filtering [[ https://community.languagetool.org/rule/list?lang=en | certain types of rules ]] is the main driver of this improvement.
  • surfacing copyedits from a list of common misspellings seems very effective for obtaining high-precision copyedits. The main challenge will then be to curate such lists so that there are enough copyedits and sufficient diversity (and not just one particular misspelling again and again)

weekly update:

  • went through feedback from ambassadors about sources of errors. The main issue seemed to be text that is directly quoted (often text in a foreign language or from hundreds of years ago), which produced many false positives
  • added improvements to the pre- and post-processing of the errors (both for LanguageTool and for the custom list of misspellings)
    • ignoring text paragraphs that indicate blockquotes (example) or multicolumn tables that often capture parallel multilingual texts of poems or lyrics (example) (commit)
    • better handling of text that appears in explicit quotes, especially single quotes, and disambiguating them from apostrophes (example) (commit); a simplified sketch follows
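
A simplified sketch of the quote handling, assuming errors are dropped when their character offsets fall inside a quoted span; the regex only covers straight and curly double quotes, so it is much cruder than the actual pre-/post-processing:

```python
import re

# Sketch: test whether an error's offsets fall inside quoted text so that it
# can be filtered out. The real handling also skips blockquote paragraphs and
# disambiguates single quotes from apostrophes.
QUOTE_RE = re.compile(r'"[^"]*"|“[^”]*”')

def inside_quotes(text, start, end):
    return any(m.start() <= start and end <= m.end()
               for m in QUOTE_RE.finditer(text))

text = 'He wrote "teh olde worlde" in the letter.'
print(inside_quotes(text, text.index("teh"), text.index("teh") + 3))  # True
```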

weekly update:

  • current tasks have been solved
  • there are no further improvements planned at the moment for manual evaluation
  • therefore closing the task; feel free to re-open in the future if the work is picked up again
  • research around open questions on models for copyedits will be captured in other tasks under the same parent-task