
Integrate LanguageTools suggestion as structured task
Open, Needs Triage, Public

Description

We've discussed this as a team but I don't think there's a task, so here's one.

The proposal is to integrate spelling and grammar checks from https://languagetool.org into the suggested edits module on Special:Homepage as a structured task.

Conceptually, this would be similar to the Add-Link project, except that there isn't a service we would need to maintain and deploy, as we could use https://languagetool.org's infrastructure (perhaps/probably with some agreement on API usage with them).

Like Add-Link, we would have a script that iterates over batches of pages and sends them to a LanguageTool API, gets suggestions, stores them in an on-wiki cache, and updates the wiki's search index so that hasrecommendation:languagetool yields those articles. Then the user clicks through the task on Special:Homepage and can interact with the LanguageTool suggestions.
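For illustration, here is a minimal Python sketch of that batch flow, assuming the public LanguageTool /v2/check endpoint and plain-text extracts from the MediaWiki API; the on-wiki cache and search index steps are only placeholders, not the actual Add-Link machinery.

```python
import requests

LT_API = "https://api.languagetool.org/v2/check"   # or our own instance

def get_plain_text(lang, title):
    """Plain-text extract of an article via the MediaWiki API (TextExtracts)."""
    resp = requests.get(
        f"https://{lang}.wikipedia.org/w/api.php",
        params={"action": "query", "prop": "extracts", "explaintext": 1,
                "titles": title, "format": "json"},
    )
    pages = resp.json()["query"]["pages"]
    return next(iter(pages.values())).get("extract", "")

def get_suggestions(text, lang):
    """LanguageTool matches: each has offset, length, message, replacements."""
    resp = requests.post(LT_API, data={"text": text, "language": lang})
    return resp.json()["matches"]

def process_batch(lang, titles):
    for title in titles:
        matches = get_suggestions(get_plain_text(lang, title), lang)
        if matches:
            # Placeholder steps: store the suggestions in an on-wiki cache and
            # flag the page so hasrecommendation:languagetool finds it in search.
            print(f"{title}: {len(matches)} candidate suggestions")

process_batch("en", ["Trani", "Bisceglie"])
```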

LanguageTool already provides a browser extension that handles the interactions with a sentence:

[screenshot: image.png]
But I suspect we would want to implement our own VisualEditor widget for handling the interactions, again very similar to Add-Link.

Some related tasks:

Event Timeline

Restricted Application added a subscriber: Aklapper.

@kostajh -- thank you for filing this and including that useful list. @MGerlach is going to be doing research in the coming months about this whole space, and so these links might be useful for him. Here's the project page we created about it: https://www.mediawiki.org/wiki/Growth/Personalized_first_day/Structured_tasks/Copyedit

In order to directly query LanguageTool's suggestions for Wikipedia articles, I have created an experimental API on Toolforge. One needs to specify the language of the Wikipedia (e.g. “en”) and the page title. Some example calls for different wikis/articles:
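A hedged illustration of the call shape only: the actual Toolforge URL and parameter names are not given in this comment, so both are made-up stand-ins below.

```python
import requests

# Hypothetical base URL and parameters, for illustration only.
TOOLFORGE_API = "https://example.toolforge.org/api/v1/languagetool"

resp = requests.get(TOOLFORGE_API, params={"lang": "en", "title": "Bisceglie"})
for match in resp.json().get("errors", []):
    print(match)
```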

This allows for easier exploration of the possible challenges. One of the main difficulties I encountered is that LanguageTool is very sensitive and thus yields many false positives (flagging an error when the text is in fact correct). To mitigate that, the tool applies aggressive filtering to reduce the number of false positives. Using the HTML version of the article makes this much easier: i) identify only plain text (avoiding tables, infoboxes, or transcluded content from templates); ii) remove errors that overlap with annotated text such as links, bold, italics, etc., which often yields spurious errors. While in this case I query the HTML for individual articles via the MediaWiki API, we could also do this in bulk thanks to the amazing HTML dumps.
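A rough sketch of that overlap filtering as I understand it (not the tool's actual code), using BeautifulSoup over the article HTML and the standard LanguageTool /v2/check response fields; filtering out transcluded content is omitted here.

```python
import requests
from bs4 import BeautifulSoup

ANNOTATED = ("a", "b", "i")   # links, bold, italics

def paragraph_text_and_spans(p):
    """Concatenate the paragraph's text nodes, recording the character
    ranges that come from link/bold/italic elements."""
    text, spans = "", []
    for node in p.descendants:
        if isinstance(node, str):
            start = len(text)
            text += node
            if any(parent.name in ANNOTATED for parent in node.parents):
                spans.append((start, len(text)))
    return text, spans

def overlaps(match, spans):
    lo, hi = match["offset"], match["offset"] + match["length"]
    return any(lo < end and start < hi for start, end in spans)

def filtered_matches(html, lang="en"):
    """Check only <p> paragraphs (skips tables and infoboxes) and drop
    matches whose span overlaps annotated text."""
    soup = BeautifulSoup(html, "html.parser")
    kept = []
    for p in soup.find_all("p"):
        text, spans = paragraph_text_and_spans(p)
        if not text.strip():
            continue
        resp = requests.post("https://api.languagetool.org/v2/check",
                             data={"text": text, "language": lang})
        kept += [m for m in resp.json()["matches"] if not overlaps(m, spans)]
    return kept
```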

The tool queries an endpoint on Cloud VPS running our own instance of LanguageTool, so that we don't rely on the limitations/restrictions of LanguageTool's public-facing API. The endpoint could easily feed into other tools as well.
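A self-hosted instance exposes the same /v2/check interface as the public API, so switching only means changing the base URL; the host below is a placeholder, not the actual Cloud VPS endpoint.

```python
import requests

# Placeholder host; the real Cloud VPS instance URL is not given here.
LT_URL = "http://languagetool.example.wmcloud.org:8081/v2/check"

resp = requests.post(LT_URL, data={"text": "This are a test.", "language": "en-US"})
for m in resp.json()["matches"]:
    print(m["offset"], m["length"], m["message"],
          [r["value"] for r in m["replacements"]])
```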

This allows for easier exploration of the possible challenges. One of the main difficulties I encountered is that LanguageTool is very sensitive and thus yields many false positives (flagging an error when the text is in fact correct).

Hmm, I was a little surprised to read this, as anecdotally, in my day-to-day use of the tool, I rarely see false positives. Are there configurable thresholds for the suggestions?

@kostajh at this point it is also an anecdotal observation from my side. For example, copying the lead section of the enwiki article Roman Catholic Diocese of Bisceglie (a random article) into the LanguageTool interface yields 7 errors in only 3 sentences -- all of which are false positives (the errors flagged by LanguageTool are in bold):

The Diocese of Bisceglie (Latin: Dioecesis Vigiliensis) was a Roman Catholic diocese located in the town of Bisceglie on the Adriatic Sea in the province of Barletta-Andria-Trani, Apulia in southern Italy. It is five miles south of Trani. In 1818, it was united with the Archdiocese of Trani to form the Archdiocese of Trani-Bisceglie.[1][2]

Not all of the errors fall on links; the italicized Latin spelling, for example, is also flagged. Some false positives are recurring, e.g. Trani appears as plain text and is marked as an error. Thus, it is not trivial to exclude them, and some effort is needed to filter out the false positives.

The aim is to get a quantitative evaluation of the false-positive rate. However, this is difficult due to a lack of good ground-truth data, for two reasons: i) we don't have a complete annotated sample of true errors in Wikipedia articles, and ii) we have even less data for languages other than English.

There are some configuration options, though I haven't played with them yet (for more details see the documentation of the endpoint: https://github.com/wikimedia/research-api-endpoint-template/tree/language-tool).