Implement hunspell dictionary for euwiki article quality model
Open, Normal · Public

Description

Word from @Theklan is that there is a good dictionary for Basque. Let's implement a set of features.

Also, there are some paragraphs of English and Spanish that we might want to catch.

English or Spanish in a <ref> tag is OK, but not in the rest of the content.
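
A rough sketch of how <ref> content could be excluded before looking for English or Spanish words. This is plain Python for illustration only, not revscoring code; `REF_RE` and `strip_refs` are names made up for the example, and the regex is an assumption that handles only simple, non-nested <ref> tags.

```python
import re

# Hypothetical pattern: self-closing <ref ... /> tags first, then paired
# <ref ...>...</ref> tags. Nested or malformed tags are not handled.
REF_RE = re.compile(
    r"<ref\b[^>]*?/>|<ref\b[^>]*>.*?</ref>",
    re.IGNORECASE | re.DOTALL,
)


def strip_refs(wikitext):
    """Return wikitext with <ref> tags and their contents replaced by a space."""
    return REF_RE.sub(" ", wikitext)
```

Words would then be counted on the stripped text only, so an English or Spanish source title quoted inside a reference does not count against the article.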


Proposal:

  1. Merge https://github.com/wikimedia/revscoring/pull/400
  2. Edit https://github.com/wikimedia/articlequality/blob/master/articlequality/feature_lists/euwiki.py
    • Add new features for the proportion of words that match Basque, English, and Spanish dictionaries.
  3. Train and test to compare results.

Details

Related Gerrit Patches:
[operations/puppet@production] Adds hunspell-eu to ores/manifests/base.pp

Event Timeline

Halfak created this task. May 19 2019, 9:52 AM
Restricted Application added a subscriber: Aklapper. May 19 2019, 9:52 AM

I think Hunspell is available in Basque. There are also dictionaries made by IXA Taldea (@Ksarasola) available: http://ixa.si.ehu.es/produktuak. The most prominent one is Xuxen: http://xuxen.eus/es/versiones. It is maintained by @ElhuyarFundazioa, but can be downloaded for use.

Hello @Halfak
I'm currently building some statistics on the ORES assessments of our project article list and I'm seeing that the Start-C-B part is quite inconsistent. We can have a Start article with 5.7 points and a C with 4.9; we can even have a good article with 5.9.

Would it be possible to run a set of articles again to get a better assessment?

Thanks

Harej triaged this task as Normal priority. Jun 4 2019, 9:21 PM
Harej moved this task from Untriaged to New development on the Scoring-platform-team board.

Just for the record, there is a campaign with more labels for this dataset up here: https://labels.wmflabs.org/stats/euwiki/ See T215351 about completing that.

For this task, I'd like to focus on using dictionaries to make the predictions better.

So we have revscoring.languages.english.dictionary.dict_words and a similar feature for Spanish. We could incorporate those into the feature list of the Basque model by including raw counts and a proportion, e.g. revscoring.languages.english.dictionary.revision.dict_words / revscoring.features.wikitext.revision.words

I've already got a work-in-progress PR for adding hunspell dictionary support for Basque. See https://github.com/wikimedia/revscoring/pull/400
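
To make the idea concrete, here is a minimal sketch of a raw count plus a proportion in plain Python. It deliberately does not call revscoring; `WORD_RE`, `ENGLISH_WORDS`, and `dict_word_stats` are hypothetical names, and the tiny word set only stands in for a real hunspell dictionary.

```python
import re

# Runs of letters only (Unicode-aware); digits and underscores are excluded.
WORD_RE = re.compile(r"[^\W\d_]+")

# Toy stand-in for an English dictionary (assumption, for illustration only).
ENGLISH_WORDS = {"the", "series", "of", "elements"}


def dict_word_stats(text, dictionary):
    """Return (dictionary word count, total word count, proportion)."""
    words = [w.lower() for w in WORD_RE.findall(text)]
    dict_words = sum(1 for w in words if w in dictionary)
    # Guard the denominator so empty text gives 0.0 instead of an error.
    proportion = dict_words / max(len(words), 1)
    return dict_words, len(words), proportion
```

The same function could be called once per language (Basque, English, Spanish), giving one count and one proportion each. Whether the model benefits more from the counts or the proportions is something the train-and-test step would have to show.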

Restricted Application added a project: artificial-intelligence. Jul 18 2019, 4:09 PM

Is this live?

This is live. I'm hoping to make this one of the intro tasks for our new engineer, @kevinbazira. He's still just getting his accounts and access together though.

Halfak updated the task description. Oct 7 2019, 6:56 PM
Halfak added a comment. Oct 7 2019, 6:59 PM

@Theklan, do you think it's likely we'll find examples of articles that are long enough to be high quality but contain English/Spanish language and are thus lower quality in our labeled data?

I see that we still have some labels needed for the most recent campaign: https://labels.wmflabs.org/stats/euwiki/

If we don't have many good examples, we could work from some cherry-picked examples. I think 10 or 15 examples of articles (they can be historical versions of articles) with English/Spanish or other non-Basque language content could work.

Hello! I have a queue of other things to get through before I start the labelling campaign, but I will follow up as soon as possible.

We have this category of articles that need corrections: https://eu.wikipedia.org/wiki/Kategoria:Zuzentzeko
In this category we have texts in other languages inside the articles, mostly in citations: https://eu.wikipedia.org/wiki/Kategoria:Artikulu_itzuligabeak

The article Lantanoide (https://eu.wikipedia.org/wiki/Lantanoide) is a good example: structurally perfect, but with lots of wrong wording.

I have finished the labelling campaign. There was a redirect in the list, so I labelled it as a stub, because I couldn't finish without it.

Halfak added a comment. Oct 8 2019, 1:50 PM

Thanks Theklan. I looked through https://eu.wikipedia.org/wiki/Lantanoide but my stupid American monolingual eyes couldn't see any clear instances of English or Spanish words. :) Are there a lot of Basque language typos in the article? I'm trying to think through how we'll make the best use of these new features.

Either way, I'm curious how you would rate the quality of such an article that is structurally perfect but has wrong wording. After all, we'll need to figure out how to teach the model to make better predictions about articles like this.

Halfak reassigned this task from Halfak to kevinbazira. Mon, Oct 28, 4:11 PM
Halfak moved this task from Pending deployment to Active on the Scoring-platform-team (Current) board.

It looks like we have the dictionary working. The next step is to re-train the model.

How could we train this? I'm adding @Ksarasola to this topic; maybe he has some great ideas.

@Theklan! We've got training in progress right now. But, I'm very interested in working with new volunteers. @Ksarasola, I'm sure we could find some other ways to make ORES for euwiki and related languages better :)

Change 547285 had a related patch set uploaded (by Halfak; owner: Halfak):
[operations/puppet@production] Adds hunspell-eu to ores/manifests/base.pp

https://gerrit.wikimedia.org/r/547285

Change 547285 merged by Dzahn:
[operations/puppet@production] Adds hunspell-eu to ores/manifests/base.pp

https://gerrit.wikimedia.org/r/547285

Halfak added a subscriber: Dzahn. Wed, Oct 30, 8:07 PM

@kevinbazira, thanks to @Dzahn, you should be unblocked. Please try to rebuild the euwiki model when you can and we can review/iterate when I get online tomorrow.