Greek language assets
Closed, Resolved, Public

Description

  • generate TF-IDF badword lists
  • review and aggregation of badwords/informal words by a native speaker
  • implement revscoring.language.Language (Language utility)
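
As a rough illustration of the first item, here is a minimal sketch of TF-IDF-style scoring for badword candidates. It assumes each edit has already been tokenized into a list of added words and labeled as reverted or kept; the function name `tfidf_candidates` and this input shape are hypothetical and are not taken from BWDS or revscoring.

```python
import math
from collections import Counter


def tfidf_candidates(reverted_docs, kept_docs, top_n=10):
    """Rank tokens by (frequency in reverted edits) x (inverse document frequency).

    Each "document" is the list of tokens added by one edit (an assumption
    made for this sketch, not necessarily how BWDS builds its input).
    """
    all_docs = reverted_docs + kept_docs

    # Document frequency: in how many edits does each token appear at all?
    doc_freq = Counter()
    for doc in all_docs:
        doc_freq.update(set(doc))

    # Term frequency: how often is each token added in reverted edits?
    term_freq = Counter()
    for doc in reverted_docs:
        term_freq.update(doc)

    n_docs = len(all_docs)
    scores = {
        token: tf * math.log(n_docs / doc_freq[token])
        for token, tf in term_freq.items()
    }

    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_n]


# Toy data with placeholder tokens.
reverted = [["badword", "the"], ["badword", "is"]]
kept = [["the", "is"], ["the", "cat"], ["is", "the"]]
candidates = tfidf_candidates(reverted, kept)
```

Tokens that show up mostly in reverted edits rank first, while common words that appear everywhere are pushed down. A list produced this way would still need the native-speaker review described in the second item.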

Event Timeline

ToAruShiroiNeko raised the priority of this task from to Needs Triage.
ToAruShiroiNeko updated the task description.
ToAruShiroiNeko subscribed.

@ToAruShiroiNeko, did you have a contact from elwiki who can help us push this forward?

Halfak triaged this task as Lowest priority. Jul 28 2016, 2:15 PM
Halfak set Security to None.

I'd like to help with this.
The bot-generated list seems to cut words in half at their diacritics. Almost all Greek words contain diacritics (some people omit them only when they type on a touch screen or in capital letters).
So the list is currently useless: most of the strings in it are fragments of common words.
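
To illustrate the kind of splitting described above, here is a minimal sketch. It assumes the cause is a word pattern that covers only the unaccented Greek lowercase range; that is a guess for illustration, not a confirmed diagnosis of the bot or of the revscoring tokenizer. Both function names are hypothetical.

```python
import re
import unicodedata

# Assumed failure mode: a range covering only unaccented lowercase Greek
# (U+03B1 alpha to U+03C9 omega). Letters with tonos, such as U+03AD,
# fall outside this range and act as word breaks.
NAIVE_WORD = re.compile(r"[\u03b1-\u03c9]+")

# In Python 3, \w on str patterns matches any Unicode word character.
UNICODE_WORD = re.compile(r"\w+")


def naive_tokens(text):
    """Tokenize with the too-narrow range; accented letters split words."""
    return NAIVE_WORD.findall(text.lower())


def unicode_tokens(text):
    """Compose to NFC first so base letter + combining accent become one
    code point, then match whole Unicode words."""
    composed = unicodedata.normalize("NFC", text)
    return UNICODE_WORD.findall(composed.lower())


word = "καλημ\u03adρα"  # "good morning", with an accented epsilon
fragments = naive_tokens(word)
whole = unicode_tokens(word)
```

With the narrow range the word comes back as two fragments, while the Unicode-aware version returns it intact. The NFC step matters because decomposed input (base letter followed by a combining accent) would otherwise still split under `\w+`.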

Hi @geraki, we just had a pull request merged with this data: https://github.com/wiki-ai/revscoring/pull/317/files

How does it look?

We can certainly work on the Greek diacritics issues. I'll do some work on our tokenizer to handle that right now.

Halfak renamed this task from Greek language utilities to Greek language assets. May 23 2017, 7:00 AM

Just checked, and it looks like our tokenizer handles it, but the elwiki words came from an old run of BWDS. @Ladsgroup, do you think we could re-run BWDS with the most recent version of deltas and then use that to extend what we have for Greek language assets? I'll file a new task.

Halfak claimed this task.