Implement hunspell dictionary for euwiki article quality model
Closed, Resolved (Public)

Assigned To

Authored By

	Halfak
	May 19 2019, 9:52 AM

Description

Word from @Theklan is that there is a good dictionary for Basque. Let's implement a set of features.

Also, there are some paragraphs of English and Spanish that we might want to catch.

English or Spanish inside a <ref> tag is OK, but not in the rest of the content.
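A rough sketch of that rule, assuming the article is available as raw wikitext: strip the `<ref>` spans before counting words, so that English or Spanish citations do not count against the article. The function name `strip_refs` and the regular expression are illustrative assumptions, not part of revscoring or articlequality.

```python
import re

# Matches self-closing <ref ... /> tags and paired <ref ...>...</ref> tags.
# The \b keeps <references /> from being treated as a <ref> tag.
REF_RE = re.compile(
    r"<ref\b[^>]*/>|<ref\b[^>]*>.*?</ref\s*>",
    re.IGNORECASE | re.DOTALL,
)


def strip_refs(wikitext):
    """Return the wikitext with every reference span replaced by a space."""
    return REF_RE.sub(" ", wikitext)


sample = (
    'Lantanoideak elementu kimikoak dira.'
    '<ref name="a">The rare earth elements, 2010.</ref>'
    ' Bigarren esaldia.<ref name="a" />'
)
print(strip_refs(sample))
```

A regular expression is only an approximation here; nested templates or malformed tags would need a real wikitext parser.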

Proposal:

- Merge https://github.com/wikimedia/revscoring/pull/400
- Edit https://github.com/wikimedia/articlequality/blob/master/articlequality/feature_lists/euwiki.py
  - Add new features for the proportion of words that match Basque, English, and Spanish dictionaries.
- Train and test to compare results.

Details

	Subject	Repo	Branch	Lines +/-
	Adds hunspell-eu to ores/manifests/base.pp	operations/puppet	production	+5 -4

Related Objects

Status	Assigned	Task
Open	None	T223228 The Great Basque Redesign
Resolved	Halfak	T234222 Onboarding Kevin Bazira -- Accounts and Access
Resolved	kevinbazira	T238839 ORES deploy -- Late November, 2019
Resolved	kevinbazira	T223788 Implement hunspell dictionary for euwiki article quality model

Event Timeline

Halfak created this task. May 19 2019, 9:52 AM

Restricted Application added a subscriber: Aklapper. May 19 2019, 9:52 AM

I think Hunspell is available in Basque. There are also dictionaries made by IXA Taldea (@Ksarasola) available: http://ixa.si.ehu.es/produktuak. The most prominent one is Xuxen: http://xuxen.eus/es/versiones. It is maintained by @ElhuyarFundazioa, but can be downloaded for use.

Xuxen Hunspell is here: http://xuxen.eus/static/hunspell/xuxen_5.1_hunspell.zip

Reedy added a project: Machine-Learning-Team. May 19 2019, 11:52 AM

Hello @Halfak
I'm currently building some statistics on the ORES assessments of our project article list and I'm seeing that the Start-C-B range is quite inconsistent. We can have a Start article with 5.7 points and a C with 4.9; we can even have a good article with 5.9.

Would it be possible to run a set of articles again to get a better assessment?

Thanks

Harej triaged this task as Medium priority. Jun 4 2019, 9:21 PM

Harej moved this task from Unsorted to New development on the Machine-Learning-Team board.

Just for the record, there is a campaign with more labels for this dataset up here: https://labels.wmflabs.org/stats/euwiki/ (see T215351 for completing it).

For this task, I'd like to focus on using dictionaries to make the predictions better.

So we have revscoring.languages.english.dictionary.dict_words and a similar one for Spanish. We could incorporate those into the feature list of the Basque model by including raw counts and a proportion, e.g. revscoring.languages.english.dictionary.revision.dict_words / revscoring.features.wikitext.revision.words

I've already got a work-in-progress PR for adding hunspell dictionary support for Basque. See https://github.com/wikimedia/revscoring/pull/400
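To make the "raw count plus proportion" idea concrete, here is a minimal, self-contained sketch. It does not use revscoring or hunspell: the tiny word sets in `TOY_DICTS`, the tokenizer, and the function names are all stand-in assumptions for illustration, and the real features would come from the dictionary modules named above.

```python
import re

# Runs of Unicode letters; digits, underscores and punctuation split words.
WORD_RE = re.compile(r"[^\W\d_]+")

# Toy word lists standing in for real hunspell dictionaries (assumption).
TOY_DICTS = {
    "basque": {"hau", "euskarazko", "testua", "da", "eta"},
    "english": {"this", "is", "english", "text", "the"},
    "spanish": {"esto", "es", "texto", "en", "español"},
}


def dict_word_features(text, dictionary):
    """Return (matching word count, proportion of all words that match)."""
    words = [w.lower() for w in WORD_RE.findall(text)]
    matches = sum(1 for w in words if w in dictionary)
    # max(..., 1) avoids dividing by zero on empty text.
    return matches, matches / max(len(words), 1)


def language_features(text):
    """Compute the count and proportion for each toy dictionary."""
    return {
        lang: dict_word_features(text, words)
        for lang, words in TOY_DICTS.items()
    }


print(language_features("Hau euskarazko testua da. This is English text."))
```

Dividing by the total word count keeps the feature comparable between short and long articles, which is presumably why a proportion is proposed alongside the raw count.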

Halfak added a project: editquality-modeling. Jul 18 2019, 4:09 PM

Restricted Application added a project: artificial-intelligence. Jul 18 2019, 4:09 PM

Is this live?

This is live. I'm hoping to make this one of the intro tasks for our new engineer, @kevinbazira. He's still just getting his accounts and access together though.

Halfak updated the task description. Oct 7 2019, 6:56 PM

@Theklan, do you think our labeled data is likely to include articles that are long enough to be high quality but contain English/Spanish text and are thus lower quality?

I see that we still need some labels for the most recent campaign: https://labels.wmflabs.org/stats/euwiki/

If we don't have many good examples, we could work from some cherry-picked examples. I think 10 or 15 examples of articles (they can be historical versions of articles) with English/Spanish or other non-Basque language content could work.

Hello! I have a backlog to get through before I can start on the labelling campaign, but I will follow up as soon as possible.

We have this category of articles that need corrections: https://eu.wikipedia.org/wiki/Kategoria:Zuzentzeko
In this category we have articles containing text in other languages, mostly in citations: https://eu.wikipedia.org/wiki/Kategoria:Artikulu_itzuligabeak

The article Lantanoide (https://eu.wikipedia.org/wiki/Lantanoide) is a good example: structurally perfect, but with lots of wrong wording.

I have finished the labelling campaign. There was a redirect in the list, so I labelled it as a stub, because I couldn't finish without it.

Thanks Theklan. I looked through https://eu.wikipedia.org/wiki/Lantanoide but my stupid American monolingual eyes couldn't see any clear instances of English or Spanish words. :) Are there a lot of Basque language typos in the article? I'm trying to think through how we'll make the best use of these new features.

Either way, I'm curious how you would rate the quality of such an article that is structurally perfect but has wrong wording. After all, we'll need to figure out how to teach the model to make better predictions about articles like this.

Halfak added a parent task: T234222: Onboarding Kevin Bazira -- Accounts and Access. Oct 9 2019, 8:44 PM

Halfak mentioned this in T234222: Onboarding Kevin Bazira -- Accounts and Access. Oct 9 2019, 8:55 PM

Halfak claimed this task. Oct 15 2019, 3:06 PM

Halfak edited projects, added Machine-Learning-Team (Active Tasks); removed Machine-Learning-Team.

Halfak moved this task from Parked to Pending deployment on the Machine-Learning-Team (Active Tasks) board.

It looks like we have the dictionary working. The next step is to re-train the model.

How could we train this? I'm adding @Ksarasola to this topic; maybe he has some great ideas.

@Theklan! We've got training in progress right now. But, I'm very interested in working with new volunteers. @Ksarasola, I'm sure we could find some other ways to make ORES for euwiki and related languages better :)

Change 547285 had a related patch set uploaded (by Halfak; owner: Halfak):
[operations/puppet@production] Adds hunspell-eu to ores/manifests/base.pp

https://gerrit.wikimedia.org/r/547285

gerritbot added a project: Patch-For-Review. Oct 30 2019, 7:16 PM

Change 547285 merged by Dzahn:
[operations/puppet@production] Adds hunspell-eu to ores/manifests/base.pp

https://gerrit.wikimedia.org/r/547285

@kevinbazira, thanks to @Dzahn, you should be unblocked. Please try to rebuild the euwiki model when you can and we can review/iterate when I get online tomorrow.