
Evaluate copyedits from LanguageTool (Q4)
Closed, Resolved, Public


In T293035 we identified LanguageTool as a possible candidate to surface copyedits across several languages. Inspecting the copyedits for Wikipedia articles, we identified a potential problem: LanguageTool might raise many copyedits that are false positives (see the example described in T284550#7802765). Therefore, we want to quantitatively evaluate the copyedits from LanguageTool. Specifically, we want to investigate the rate of false positives and how different filters might decrease it.

  • Evaluate LanguageTool on an English benchmark dataset from, e.g., Grammatical Error Correction
  • If possible, evaluate LanguageTool on a benchmark dataset in a language other than English
  • Find a ground-truth dataset with copyedits in Wikipedia articles in at least one language (e.g. through hand-labeling)

Event Timeline

Update week 2022-04-04:

  • Started evaluating the performance of LanguageTool on benchmark datasets from grammatical error correction. We started with the most commonly used one for English (BEA 2019 Shared Task). We have already identified a similar benchmark dataset that covers a few other languages (Lang-8). We will have to adapt the standard evaluation metrics, as we are mostly interested in error detection and less in automatic error correction (the correction will be done by the editors).
  • Started to brainstorm how we could obtain a ground-truth dataset of copyedit errors in Wikipedia articles. One idea is to start from sentences in high-quality articles (e.g. featured articles) as extreme cases with no errors to assess sensitivity.

Update week 2022-04-25:

  • Familiarized myself with ERRANT, a tool for the evaluation of grammatical error detection. Starting with English, we can now evaluate LanguageTool’s precision and recall on a benchmark corpus.

Update week 2022-05-02:

  • I evaluated LanguageTool on an English benchmark dataset containing annotated grammatical errors, specifically the Write&Improve (W&I) corpus of the BEA2019 Shared Task. Write & Improve is an online platform that assists non-native English students with their writing. Specifically, students from around the world submit letters, stories, articles and essays in response to various prompts, and the W&I system provides instant feedback. Since W&I went live in 2014, W&I annotators have manually annotated some of these submissions and assigned them a CEFR level. Thus, we have annotated errors at three different levels: A (beginner), B (intermediate), C (advanced). My interpretation is that these classes contain errors of increasing complexity.
  • The tables below report results for error detection (only highlighting an error) and error correction (highlighting the error and proposing a correction) in terms of true positives (TP), false positives (FP), false negatives (FN), and the aggregated statistics precision (TP/(TP+FP), i.e. the fraction of raised errors that are correct), recall (TP/(TP+FN), i.e. the fraction of annotated errors that are found), and F0.5 (which weights precision more heavily than recall).

Error detection
dataset | #sentences | metrics
A.train | 10880 | {'TP': 2338, 'FP': 2045, 'FN': 26734, 'Prec': 0.5334, 'Rec': 0.0804, 'F0.5': 0.2508}
B.train | 13202 | {'TP': 1363, 'FP': 1954, 'FN': 22854, 'Prec': 0.4109, 'Rec': 0.0563, 'F0.5': 0.1818}
C.train | 10667 | {'TP': 516, 'FP': 1362, 'FN': 9140, 'Prec': 0.2748, 'Rec': 0.0534, 'F0.5': 0.1503}

Error correction
dataset | #sentences | metrics
A.train | 10880 | {'TP': 1898, 'FP': 2481, 'FN': 26264, 'Prec': 0.4334, 'Rec': 0.0674, 'F0.5': 0.2078}
B.train | 13202 | {'TP': 1175, 'FP': 2136, 'FN': 22490, 'Prec': 0.3549, 'Rec': 0.0497, 'F0.5': 0.1592}
C.train | 10667 | {'TP': 461, 'FP': 1415, 'FN': 9017, 'Prec': 0.2457, 'Rec': 0.0486, 'F0.5': 0.1357}
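
The aggregated statistics in the tables above can be recomputed from the raw counts. The following is a minimal sketch, not the evaluation code used for this task (that evaluation was run with ERRANT); the function name `detection_metrics` is hypothetical, and the F0.5 formula is the standard F-beta definition with beta = 0.5.

```python
def detection_metrics(tp: int, fp: int, fn: int, beta: float = 0.5) -> dict:
    """Compute precision, recall and F-beta from raw TP/FP/FN counts."""
    # Precision: fraction of raised errors that match an annotated error.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: fraction of annotated errors that were raised.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F-beta: beta < 1 weights precision more heavily than recall.
    b2 = beta ** 2
    denom = b2 * precision + recall
    f_beta = (1 + b2) * precision * recall / denom if denom else 0.0
    return {
        "Prec": round(precision, 4),
        "Rec": round(recall, 4),
        f"F{beta}": round(f_beta, 4),
    }


# Counts taken from the A.train row of the error-detection table.
a_train = detection_metrics(tp=2338, fp=2045, fn=26734)
print(a_train)
```

With the A.train detection counts this should reproduce the values reported in the table (Prec 0.5334, Rec 0.0804, F0.5 0.2508).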

  • LanguageTool is able to detect meaningful grammatical errors
    • The precision for detection is 53% on A.train; this means that a (slight) majority of detected errors correspond to ground-truth annotated errors
    • The recall is low (around 8% on A.train); this means that it misses a substantial share of the annotated errors. However, in our context this is probably ok.
  • LanguageTool can even provide a good correction
    • While the precision for error correction is lower than for error detection, the decrease is moderate (a drop from 53% to 43% on A.train)
    • For harder text levels, precision and recall drop further. My interpretation is that LanguageTool performs worse at detecting/correcting more difficult and complex errors. But if we can address simple errors, this will probably also be fine in our context.
  • code for evaluation:
  • Next step: There is anecdotal evidence that LanguageTool creates many false positives when run on Wikipedia due to, e.g., entity names (T284550#7802765). There is no benchmark corpus with a complete annotation of grammatical errors for Wikipedia, so an analysis like the one above is not directly feasible in the context of Wikipedia. However, we can investigate the propensity for false positives by evaluating sentences without any errors. As a proxy, we can sample sentences from articles with the highest quality rating, assuming that they meet copyediting standards in Wikipedia. We can then approximate the average number of false positives per sentence generated by LanguageTool. If this rate is indeed higher than in the benchmark corpora, we can develop filtering strategies to reduce it.

Update week 2022-05-09:

  • Evaluated error-detection of LanguageTool in Wikipedia articles focusing on the problem of False Positives
  • The main problem is that we don't have an annotated dataset with copyedit errors in Wikipedia articles that we could use as ground truth. Therefore, I focused on "featured articles" (the highest quality class), assuming that these articles are free of errors (at least in the sense that we should highlight these errors to be corrected). In this line of thinking, any error raised by LanguageTool is considered a false positive. This will give us an upper bound on the false-positive rate, since some of these errors might still be genuine.
  • The dataset consists of 6090 featured articles in enwiki with 1,192,369 sentences. (code)
  • When using the language code "en-US", we get 0.804 errors per sentence, i.e. almost every sentence will yield one false positive. This is consistent with the qualitative observations reported in T284550#7802765 when using LanguageTool's web interface, which uses "en-US" as a default.
  • In contrast, when using the language code "en", we get only 0.065 errors per sentence, i.e. only about 1 false positive in every 15 sentences. This means that we have a more than 10-fold reduction in the number of false positives when switching from "en-US" to "en".
  • Thus, using the "en" language code for LanguageTool will substantially reduce false positives in the context of Wikipedia. When using the LanguageTool website (instead of the API), the default choice is en-US. In fact, there is no option to select "en", only variants such as en-GB. This explains the anecdotal observation of many false positives when copy-pasting text from Wikipedia articles into the web interface of LanguageTool (T284550#7802765).
  • When comparing the different choices of language variant on the annotated benchmark corpus (T305180#7909253), using "en" instead of "en-US" admittedly causes LanguageTool to detect fewer errors (a reduction in recall, e.g., from 0.14 to 0.08 in A.train), but it still detects thousands of errors. Considering the reduction in the number of false positives, this trade-off seems well worth it.
  • code:
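
The errors-per-sentence figures above can be approximated along the following lines. This is a minimal sketch, not the code used for this analysis. The names `languagetool_check` and `errors_per_sentence` are hypothetical, and the server URL is an assumption (a LanguageTool HTTP server running locally); the sketch relies only on the `/v2/check` endpoint with its `text` and `language` parameters and the `matches` list in the JSON response.

```python
import json
import urllib.parse
import urllib.request
from typing import Callable, Iterable, List


def languagetool_check(text: str, language: str = "en",
                       url: str = "http://localhost:8081/v2/check") -> List[dict]:
    """Send text to a LanguageTool HTTP server and return its matches.

    The default URL is an assumption (a locally running server).
    """
    data = urllib.parse.urlencode({"text": text, "language": language}).encode("utf-8")
    with urllib.request.urlopen(url, data=data) as response:
        return json.load(response)["matches"]


def errors_per_sentence(sentences: Iterable[str],
                        check: Callable[[str], List[dict]]) -> float:
    """Average number of matches raised per sentence by the given checker."""
    n_sentences = 0
    n_matches = 0
    for sentence in sentences:
        n_sentences += 1
        n_matches += len(check(sentence))
    if n_sentences == 0:
        raise ValueError("need at least one sentence")
    return n_matches / n_sentences


# Offline demonstration with a stub checker (hypothetical) instead of a server:
# it flags any sentence containing the misspelling "teh".
def stub_check(sentence: str) -> List[dict]:
    return [{"message": "possible typo"}] if "teh" in sentence else []


demo_rate = errors_per_sentence(
    ["This is fine.", "This is teh problem.", "Also fine.", "Fine too."],
    stub_check,
)
print(demo_rate)
```

To compare language codes, one would pass, for example, `lambda s: languagetool_check(s, language="en-US")` and `lambda s: languagetool_check(s, language="en")` as the `check` argument over the same sample of featured-article sentences.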

Updates week 2022-05-16:

  • With this, we can now approximate the volume and precision of LanguageTool's copyedits in Wikipedia without a detailed ground-truth dataset of error annotations, relying only on article-level annotations (featured-article badge vs. copyedit template):
    • For articles with suspected errors (copyedit template), we find 0.15 errors per sentence, that is, around 1 error every 6-7 sentences.
    • We approximate that at least 57% of these errors are genuine errors. This number comes from the following estimation. We assume that all errors found in the featured articles (T305180#7927447) are false positives. This means that there is a baseline rate of 0.065 errors per sentence which are false positives. We further assume that this baseline rate also applies to any other article. We then conclude that of the 0.15 errors per sentence in copyedit-template articles, the remaining 0.15-0.065=0.085 errors per sentence are genuine errors (true positives). Thus, we get a precision of 0.085/(0.085+0.065)=57%. This is likely a lower bound for the precision, as some of the errors in the featured articles will probably correspond to genuine errors.
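
The estimation above can be written out as a small calculation. This is a minimal sketch under the same assumptions stated in the text (all errors in featured articles are false positives, and that baseline rate applies to every article); the function name `precision_lower_bound` is hypothetical.

```python
def precision_lower_bound(rate_flagged: float, rate_baseline: float) -> float:
    """Estimate precision from two errors-per-sentence rates.

    rate_flagged:  errors per sentence in articles with a copyedit template
    rate_baseline: errors per sentence in featured articles, all of which
                   are assumed to be false positives
    """
    if rate_flagged <= 0:
        raise ValueError("rate_flagged must be positive")
    # Errors beyond the baseline are attributed to genuine errors.
    genuine = max(rate_flagged - rate_baseline, 0.0)
    return genuine / rate_flagged


# Rates reported for enwiki with language code "en".
enwiki_precision = precision_lower_bound(rate_flagged=0.15, rate_baseline=0.065)
print(f"{enwiki_precision:.0%}")
```

The same function can be applied to any other wiki once the two rates have been measured there.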

Updates week 2022-05-23:

  • Extending the evaluation on Wikipedia to other languages (i.e. getting featured articles and articles with copyedit templates)

Updates week 2022-06-03:

  • Evaluated the performance of LanguageTool for detecting copyedits in more than 20 different language versions of Wikipedia, using the approach of comparing error rates in articles with a featured-article badge and articles with a copyedit template
  • I showed how we can apply an additional post-processing step in which we filter errors using the rich annotations of the text contained in the HTML version of the articles; this leads to a substantial improvement in performance in almost all wikis
  • I added a detailed write-up of the evaluation of LanguageTool to meta:
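
One way such a post-processing filter could look is sketched below. This is an illustration only, not the filter used in the evaluation: the update does not say which HTML annotations were used, so the sketch assumes that character ranges covered by annotated elements (for example links) have already been extracted as `protected_spans`. The function name and the example data are hypothetical; the `offset` and `length` fields are those of a LanguageTool match.

```python
from typing import List, Tuple


def filter_matches(matches: List[dict],
                   protected_spans: List[Tuple[int, int]]) -> List[dict]:
    """Drop matches that overlap any protected (start, end) character span.

    Spans are half-open intervals [start, end) over the same plain text
    that was sent to LanguageTool.
    """
    kept = []
    for match in matches:
        start = match["offset"]
        end = start + match["length"]
        overlaps = any(start < span_end and span_start < end
                       for span_start, span_end in protected_spans)
        if not overlaps:
            kept.append(match)
    return kept


# Hypothetical example: "Zqx" is a linked entity name, "studdied" is a typo.
text = "Zqx studdied physics."
matches = [
    {"offset": 0, "length": 3},   # flags "Zqx"
    {"offset": 4, "length": 8},   # flags "studdied"
]
protected_spans = [(0, 3)]        # assumed to come from a link in the HTML
kept = filter_matches(matches, protected_spans)
print([text[m["offset"]:m["offset"] + m["length"]] for m in kept])
```

In this example only the match on "studdied" survives, since the match on the entity name falls inside a protected span.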

Updates week 2022-06-06:

  • Presented results to the Growth Team (slidedeck). They will discuss how best to proceed and get back to us
  • Started some additional analysis comparing the performance of spellcheckers with that of LanguageTool; will add results next week