- generate TFiDF badword lists
- review and aggregation of badwords/informal words by native speaker
- implement revscoring.language.Language (Language utility)
|Resolved||Halfak||T166045 Scoring platform team FY18 Q1|
|Resolved||Catrope||T170723 Deploy ORES Review Tool & ORES-based RCFilters for Romanian & Albanian Wikipedia|
|Resolved||awight||T170485 ORES deployment - Mid July, 2017|
|Resolved||Halfak||T170491 Train reverted model for Greek Wikipedia|
|Resolved||Halfak||T166049 Deploy reverted model for elwiki|
|Resolved||Halfak||T166050 Train/test reverted model for elwiki|
|Resolved||Halfak||T122727 Greek language assets|
I'd like to help with this.
The bot generated list seems to cut words in half, separated by their diacritics: almost all greek words contain diacritics (some people omit them only when they type in touch screen or capital letters)
So the list is currently useless, most strings there are parts of common words.
Hi @geraki, we just had a pull request merged this data. https://github.com/wiki-ai/revscoring/pull/317/files
How does it look?
We can certainly work on the greek diacritics issues. I'll do some work on our tokenizer to handle that right now.