[AOI] Investigation: Can we improve spell checking capabilities on Wikipedia?
Closed, Resolved, Public

Description

Per http://www.allourideas.org/wikimediaaccesorios/results?locale=es, what could we do to improve spell checking on Wikipedia? This could include spell-checker support from within the editing interface or dedicated tools on Tool Labs for identifying and fixing spelling errors. Consideration should be given to supporting multiple languages. First we should survey what tools already exist, what their capabilities are, and how well they work. Then we should identify some concrete development tasks that we can work on to improve things.

GSoC project task: T89107: Unified language proofing tools integration framework

Please answer the following questions:

  • Are there high priority bugs or features that the Community Tech team could address in a short period of time?
    • Not immediately. The best course of action here would be to wait at least till the 21st of August when GSoC officially ends to see what the status of the project is then and whether we can help push it any faster.
  • If so, is the maintainer amenable to working with us and is the code publicly available?
    • Yes and yes.
  • Would this be a good tool to convert into a MediaWiki extension or add as functionality to an existing extension?
    • Already being created as an extension.

Event Timeline

kaldari raised the priority of this task from to Needs Triage.
kaldari updated the task description.
kaldari added a project: Community-Tech.
kaldari subscribed.
kaldari renamed this task from "[AOI] Spike: Can we improve spell checking capabilities?" to "[AOI] Spike: Can we improve spell checking capabilities on Wikipedia?". Aug 10 2015, 9:09 PM
kaldari set Security to None.

Here's a demo for the GSoC project under development: http://tools.wmflabs.org/languageproofing-ui/ It'll be available for about 20 languages only.

kaldari renamed this task from "[AOI] Spike: Can we improve spell checking capabilities on Wikipedia?" to "[AOI] Investigation: Can we improve spell checking capabilities on Wikipedia?". Aug 12 2015, 1:55 AM

The backend is a free software project and it can be extended - https://www.languagetool.org/ . Whatever that tool gets, we will get. (On the world scale, the number of languages that have good coverage of tools for checking spelling and grammar is not much larger than 20 anyway.)

And I very much hope to get this deployed as a beta feature in projects that have VisualEditor and are written in languages that this tool supports in 2015.

I chatted with @Amire80 about this. There are still a few rough edges that need to be sorted. This will further warrant a review from the VE editors and James. And of course, security reviews before it can get to beta features stage.

Available on Gerrit at: https://gerrit.wikimedia.org/r/mediawiki/extensions/LanguageTool
Dependency - VisualEditor

@NiharikaKohli: That link doesn't work for me.

Looks like the extension is documented at https://www.mediawiki.org/wiki/Extension:LanguageTool.

@NiharikaKohli: Can you give us a rough idea of how far along this extension is? Is it functional at the moment? Would it need a design pass? If it's actually working, could you post a screenshot or two? The demo on Tool Labs doesn't work for me.

I tried to make it work yesterday but no luck. Amir mentioned that critical design changes are being worked on and will hopefully be completed by the next week. I've added a screenshot of the existing design as a mock.

Other attempts:

  1. SpellCheckerBot: Does not edit articles but outputs suspected incorrectly spelled words along with suggestions. Does not seem to be actively used or maintained.
  2. Extension:Spellcheck: Status: unmaintained, no localisation. Last updated in 2007. Source code at: https://code.google.com/p/wikimediaspellchecker/
  3. There is a pywikibot script for spellchecking pages https://git.wikimedia.org/blob/pywikibot%2Fcompat.git/HEAD/spellcheck.py whose status is unknown. It seems to have been last updated in 2013.

@Amire80, @Jdforrester-WMF: What are the plans for integrating LanguageTool into VisualEditor? Is there a Phabricator task for this?

It sounds like there is also some kind of LanguageTool server or service running somewhere on Tool Labs, but I couldn't find any documentation about this. Anyone know anything about that?

http://tools.wmflabs.org/languageproofing/

Does this expose an API for other tools to use? Is there a code repo for this somewhere?

Code repo: https://git.wikimedia.org/tree/mediawiki%2Fextensions%2FLanguageTool.git/5b7a9a24e961f1b32cd00ca7d9fc89ab0d6f99d3
I am not sure I understand what you mean by an API here. This tool works off https://www.languagetool.org. The idea is to make it work with VisualEditor.

@Amire80 said that they will be holding a demo for JamesF sometime this (next?) week and we can attend as well.

@NiharikaKohli: For integrating this tool into VisualEditor, I think it makes sense to leave that to the Language Engineering and Editing teams. Since that might be a longer-term project, I'm wondering if there's anything quick and dirty that we can do to make this functionality available in the meantime as a Tool Labs tool (similar to the demo at http://tools.wmflabs.org/languageproofing-ui/). For example, we could add some UI to http://tools.wmflabs.org/languageproofing-ui/ to allow the user to import an existing Wikipedia article to be checked (rather than cutting and pasting text). I know this isn't what the GSoC project was intended to deliver, but I think even a very basic tool like that would be useful to Wikipedia editors, especially since it supports 20 languages. Is the code for http://tools.wmflabs.org/languageproofing-ui/ in a code repo?

We may also want to advertise the API that is exposed at http://tools.wmflabs.org/languageproofing/ in case other bot or tool writers are interested in using it.

@kaldari, the code for http://tools.wmflabs.org/languageproofing-ui/ is provided here: http://wiki.languagetool.org/integration-on-websites

The idea for a tool to import an article and proofread it sounds interesting. I'll look into the possibility of doing that.

In case you just want to import an article to proofread it, something neat already exists. Take a look at this: http://community.languagetool.org/wikiCheck/pageCheck/index?lang=en

Thanks @Ankita-ks. That looks pretty much like what we were looking for.

@Ankita-ks: That's exactly what I had in mind. Is there any possibility that we could either fork that code or improve http://community.languagetool.org/wikiCheck/pageCheck/index directly? There are several false-positives that should be fixed. For example, it currently flags piped wikilinks as grammar errors. It would also be nice if you could choose to check just spelling or just grammar rather than always checking both.

@kaldari: Hi, I'm the author of that tool (http://community.languagetool.org/wikiCheck/pageCheck/index) and any help improving it is very welcome. False alarms caused by wikitext exist because we use http://sweble.org, which isn't perfect when it comes to extracting plain text. Switching to parsoid should solve those problems. (There will still be some false alarms not caused by text extraction issues.)
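To illustrate the piped-wikilink false positive mentioned above, here is a minimal sketch of what a text extractor needs to do with such links before the text reaches a grammar checker. It is not how sweble or Parsoid work; the regular expression and the function name `strip_wikilinks` are assumptions made up for this example, and it deliberately ignores templates, file links with several pipes, and nested markup.

```python
import re

# Matches [[Target]] and [[Target|label]]; group 1 is the text a reader sees.
# Links with more than one pipe (e.g. file links) are left untouched.
SIMPLE_LINK_RE = re.compile(r"\[\[(?:[^\[\]|]*\|)?([^\[\]|]*)\]\]")


def strip_wikilinks(wikitext):
    """Replace simple wikilinks with their visible label (hypothetical helper)."""
    return SIMPLE_LINK_RE.sub(r"\1", wikitext)


if __name__ == "__main__":
    print(strip_wikilinks("See [[United States|the US]] and [[France]]."))
```

A checker fed the stripped text would see an ordinary sentence instead of brackets and a pipe character. Note that this simple substitution discards the position information needed to write fixes back to the wikitext.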

Thanks for the info dnaber! I agree switching to the parsoid output (via the RESTBase API) would probably help, although I imagine that would complicate feeding the results back to the WikiText editor. Do you have any thoughts on that?

Also, is the tool currently on GitHub or any other code repo?

The problem of feeding back error fixes to wikitext is solved, as long as we have a character position mapping between wikitext and plain text. Currently this is provided by sweble, but as I understand it, parsoid also offers this, or the parsoid result can easily be parsed to get it?
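As a rough illustration of the character position mapping described above, the sketch below uses a toy extractor that only drops bold and italic quote markup. It is an assumption-laden example, not the sweble or Parsoid mechanism; `extract_plain_text` and `apply_fix` are hypothetical names invented here.

```python
def extract_plain_text(wikitext):
    """Toy extractor: drop ''' and '' markup, recording where each kept character came from."""
    plain_chars = []
    positions = []  # positions[i] = index in wikitext of plain-text character i
    i = 0
    while i < len(wikitext):
        if wikitext.startswith("'''", i):
            i += 3
        elif wikitext.startswith("''", i):
            i += 2
        else:
            plain_chars.append(wikitext[i])
            positions.append(i)
            i += 1
    return "".join(plain_chars), positions


def apply_fix(wikitext, positions, plain_start, plain_end, replacement):
    """Replace the non-empty plain-text span [plain_start, plain_end) in the wikitext."""
    wiki_start = positions[plain_start]
    wiki_end = positions[plain_end - 1] + 1
    return wikitext[:wiki_start] + replacement + wikitext[wiki_end:]


if __name__ == "__main__":
    source = "The '''quick''' borwn fox"
    plain, positions = extract_plain_text(source)
    start = plain.index("borwn")
    print(apply_fix(source, positions, start, start + len("borwn"), "brown"))
```

The checker only ever sees `plain`, and the `positions` list is enough to translate an error span back to the wikitext. If an error span straddles markup, this simple version would overwrite the markup inside it, so a real tool would need to handle that case.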

The code is here:
Backend:
https://github.com/languagetool-org/languagetool/tree/master/languagetool-wikipedia/src/main/java/org/languagetool/dev/wikipedia
Frontend:
https://github.com/languagetool-org/languagetool-wikicheck

@dnaber: Also, it looks like currently spell-checking in the LanguageTool WikiCheck is disabled due to too many false positives. Personally, I would like to have it available as an option at least even if it isn't that reliable. How bad is it? I know that some spell-checkers for Wikipedia suffer from only supporting American or British spelling. Does LanguageTool allow either or is it specific to one variation?

@kaldari: British and American English are both supported (https://languagetool.org/languages/). The problem with the false alarms is that Wikipedia is full of proper nouns and rare words which the spell checker doesn't know. One could maybe ignore words from the title and words that occur more than once in the document (or more than n times in the whole Wikipedia if we have a fast lookup for that).
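The filtering idea above (skip words from the title and words that recur in the document) could be sketched roughly as follows. This is only an illustration under assumptions: `filter_spelling_flags` and its `min_repeats` parameter are hypothetical, the input is assumed to be a list of words a spell checker has already flagged, and the Wikipedia-wide frequency lookup is left out.

```python
import re
from collections import Counter

WORD_RE = re.compile(r"\w+")


def filter_spelling_flags(flagged_words, title, text, min_repeats=2):
    """Drop flagged words that appear in the title or recur in the text."""
    title_words = {w.lower() for w in WORD_RE.findall(title)}
    counts = Counter(w.lower() for w in WORD_RE.findall(text))
    kept = []
    for word in flagged_words:
        key = word.lower()
        if key in title_words:
            continue  # probably part of the article subject's name
        if counts[key] >= min_repeats:
            continue  # repeated use suggests a deliberate term, not a typo
        kept.append(word)
    return kept


if __name__ == "__main__":
    title = "Zorbak Nebula"
    text = "The Zorbak Nebula contains zorbium. Zorbium is rare. It was discoverd in 1990."
    print(filter_spelling_flags(["Zorbak", "zorbium", "discoverd"], title, text))
```

The trade-off is that a typo repeated consistently throughout an article would also be suppressed, so the threshold is a judgment call.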

There is another spell checker on Wikipedia that is installed on de.wiki, es.wiki, gl.wiki, and others. The info is here: https://gl.wikipedia.org/wiki/Wikipedia:Revisor_ortogr%C3%A1fico, the code is here: https://gl.wikipedia.org/wiki/MediaWiki:Gadget-RevisorOrtografico.js, and it detects errors using lists that are created on the wikis, like this one: https://gl.wikipedia.org/wiki/Wikipedia:Revisor_ortogr%C3%A1fico/Listaxe. It was created by the user APPER of de.wiki.

Perhaps you could reuse the lists of errors on wikis for your tool, or something.

Here is the updated link for VisualEditor extension for LanguageTool : http://languagetoolextension.wmflabs.org/wiki/Main_Page?veaction=edit

It has some bugs which I am in the process of fixing. But please feel free to find more bugs.

Thanks for the information, @Elisardojm. I'm not very good at Spanish and German, could you explain a bit more on how the script works? Does it generate a new page listing all the errors in a page and suggested corrections?

The outcome of this investigation was T110156. Related work is already being handled by the VisualEditor team under the VisualEditor-LanguageTool project.

Hi @NiharikaKohli, I don't know exactly how the script works, but I can translate the Spanish translation :) "When you load a page in your browser, a script is executed. The script sends the article's URL to the ToolServer, which checks whether any word in the article matches any of the words in the list of errors. All detected errors are marked with a red background, and a tooltip appears over them to suggest the correct word. It only works in view mode; you have to edit the article to correct the mistakes."

You can try it on https://gl.wikipedia.org/wiki/Usuario:Elisardojm/Probas_do_revisor; you only have to activate it first in your preferences (Tools>Spell-checker or "Ortografía" in gl.wiki).

In T108628#1631362, @NiharikaKohli wrote:

Thanks for the information, @Elisardojm. I'm not very good at Spanish and German, could you explain a bit more on how the script works? Does it generate a new page listing all the errors in a page and suggested corrections?

This one is really simple. It's not a real spell check - it doesn't have a positive list but a negative list: a list of known common spelling errors. After a page is loaded, a script starts (on the Tool Labs project "spellcheck"), which downloads the raw article text and does some preparation (for example, quotes are removed, and there is special treatment for Swiss articles on de.wikipedia because the spelling differs slightly there). Then all words are checked against the list of wrong words (which is maintained in the wiki and loaded once a day), and if a wrong word is found, it is marked with a red background. When hovering over it, the correct spelling is shown. So this one is for reading only (it doesn't work in edit mode). I wrote it years ago to correct some of the most common spelling errors while doing other stuff on articles without reading the whole article.
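The negative-list approach described above can be sketched in a few lines. This is not the gadget's actual code: the `KNOWN_ERRORS` entries, the quote handling, and the function name `find_known_errors` are assumptions for illustration, and the Swiss-spelling special case and the daily list refresh are omitted.

```python
import re

# Hypothetical negative list: known misspelling -> suggested correction.
KNOWN_ERRORS = {
    "recieve": "receive",
    "seperate": "separate",
    "occured": "occurred",
}

WORD_RE = re.compile(r"\w+")
QUOTE_RE = re.compile(r'"[^"]*"')


def find_known_errors(text, known_errors=KNOWN_ERRORS):
    """Return (offset, word, suggestion) for each word found on the negative list."""
    # Blank out quoted passages with spaces so offsets still match the input text.
    cleaned = QUOTE_RE.sub(lambda m: " " * len(m.group(0)), text)
    hits = []
    for match in WORD_RE.finditer(cleaned):
        word = match.group(0)
        suggestion = known_errors.get(word.lower())
        if suggestion is not None:
            hits.append((match.start(), word, suggestion))
    return hits


if __name__ == "__main__":
    sample = 'They recieve mail. He wrote "seperate" here. It occured twice.'
    for offset, word, suggestion in find_known_errors(sample):
        print(offset, word, "->", suggestion)
```

Because only listed misspellings are ever flagged, unknown proper nouns and rare words never cause false alarms, at the cost of missing any error that is not on the list.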

Thanks, @APPER, @Elisardojm - that is helpful and the script looks pretty cool.