
Implement hunspell dictionary for euwiki article quality model
Open, Normal, Public


Word from @Theklan is that there is a good dictionary for Basque. Let's implement a set of features.

Also, there are some paragraphs of English and Spanish that we might want to catch.

English or Spanish in a <ref> tag is OK, but not in the rest of the content.
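One way to read this requirement, as a minimal sketch only: remove `<ref>` elements from the wikitext before checking the remaining content for English or Spanish words. The regular expression and the `strip_refs` name are illustrative assumptions, not part of revscoring.

```python
import re

# Hypothetical pattern: matches self-closing <ref ... /> tags and
# paired <ref ...>...</ref> elements (case-insensitive, across newlines).
REF_RE = re.compile(
    r"<ref\b[^>]*?/>|<ref\b[^>]*>.*?</ref\s*>",
    re.IGNORECASE | re.DOTALL,
)


def strip_refs(wikitext):
    """Return the wikitext with <ref> elements replaced by a space."""
    return REF_RE.sub(" ", wikitext)
```

Text that survives `strip_refs` is the "rest of the content" where English or Spanish words would count against the article; a `<references />` tag is left alone because it is not a `<ref>` element.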

Event Timeline

Halfak created this task. May 19 2019, 9:52 AM
Restricted Application added a subscriber: Aklapper. May 19 2019, 9:52 AM

I think Hunspell is available in Basque. There are also dictionaries made by IXA Taldea (@Ksarasola) available. The most prominent one is Xuxen, which is maintained by @ElhuyarFundazioa but can be downloaded for use.

Hello @Halfak
I'm currently building some statistics on the ORES assessments of our project article list, and I'm seeing that the Start-C-B part is quite inconsistent. We can have a Start article with 5.7 points and a C with 4.9; we can even have a good article with 5.9.

Would it be possible to run a set of articles again to get a better assessment?


Harej triaged this task as Normal priority. Jun 4 2019, 9:21 PM
Harej moved this task from Untriaged to New development on the Scoring-platform-team board.

Just for the record, there is a campaign with more labels for this dataset up here: See T215351 for completing that.

For this task, I'd like to focus on using dictionaries to make the predictions better.

So we have revscoring.languages.english.dictionary.dict_words and a similar one for Spanish. We could incorporate those into the feature list of the Basque model by including raw counts and a proportion, e.g. revscoring.languages.english.dictionary.revision.dict_words / revscoring.features.wikitext.revision.words
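To make the "raw count plus proportion" idea concrete, here is a minimal plain-Python sketch. It does not use the revscoring API: a small hand-written word set stands in for a real Hunspell dictionary, and `WORD_RE` and `dict_word_features` are hypothetical names chosen for illustration.

```python
import re

# Assumption: a "word" is a run of Unicode letters (no digits or underscores).
WORD_RE = re.compile(r"[^\W\d_]+")


def dict_word_features(text, dictionary):
    """Return (raw count, proportion) of words in `text` found in `dictionary`.

    `dictionary` is any container of lowercase words; a real setup would
    back this with a Hunspell dictionary instead of a plain set.
    """
    words = [w.lower() for w in WORD_RE.findall(text)]
    count = sum(1 for w in words if w in dictionary)
    # Guard against empty text so the proportion is defined (0.0).
    proportion = count / max(len(words), 1)
    return count, proportion


# Toy stand-in for an English dictionary.
toy_english = {"the", "quick", "brown", "fox"}
count, proportion = dict_word_features("Kaixo the quick fox", toy_english)
```

Feeding both values to the model lets it distinguish a long article with a few stray English words from a short one that is mostly English.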

I've already got a work-in-progress PR for adding hunspell dictionary support for Basque. See

Restricted Application added a project: artificial-intelligence. Jul 18 2019, 4:09 PM