- generate TF-IDF badword lists
- review and aggregation of badwords/informal words by native speaker
- implement revscoring.language.Language (Language utility)
Description
Status | Assigned | Task
---|---|---
Resolved | Halfak | T166045 Scoring platform team FY18 Q1
Resolved | Catrope | T170723 Deploy ORES Review Tool & ORES-based RCFilters for Romanian & Albanian Wikipedia
Resolved | awight | T170485 ORES deployment - Mid July, 2017
Resolved | Halfak | T170491 Train reverted model for Greek Wikipedia
Resolved | Halfak | T166049 Deploy reverted model for elwiki
Resolved | Halfak | T166050 Train/test reverted model for elwiki
Resolved | Halfak | T122727 Greek language assets
Event Timeline
@ToAruShiroiNeko, did you have a contact from elwiki who can help us push this forward?
Looks like the TF-IDF box got checked, but I don't see evidence of a Bad-Words-Detection-System run. https://meta.wikimedia.org/wiki/Research:Revision_scoring_as_a_service/Word_lists/el
I'd like to help with this.
The bot-generated list seems to cut words in half, splitting them at their diacritics. Almost all Greek words contain diacritics (some people omit them only when typing on a touch screen or in capital letters).
So the list is currently useless: most of the strings in it are fragments of common words.
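For illustration, here is a minimal Python sketch of how this kind of splitting can arise. It is not the revscoring or BWDS tokenizer; the two patterns and the function names `naive_tokens` and `unicode_tokens` are hypothetical. It assumes the failure comes from a letter pattern that covers only the unaccented Greek range, so accented vowels such as έ and ό act as word boundaries.

```python
import re
import unicodedata

# Hypothetical "naive" pattern: only the basic unaccented Greek letters.
# Accented vowels (e.g. U+03AD, U+03CC) fall outside these ranges.
NAIVE = re.compile(r"[α-ωΑ-Ω]+")

# Any run of Unicode letters: word characters minus digits and underscore.
LETTERS = re.compile(r"[^\W\d_]+")


def naive_tokens(text):
    """Tokenize with the narrow range; words break at accented letters."""
    return NAIVE.findall(text)


def unicode_tokens(text):
    """Normalize to NFC so base letter + combining accent becomes one
    precomposed letter, then match whole runs of letters."""
    return LETTERS.findall(unicodedata.normalize("NFC", text))


if __name__ == "__main__":
    sample = unicodedata.normalize("NFC", "καλημέρα κόσμε")
    print(naive_tokens(sample))    # words are cut at the accented vowels
    print(unicode_tokens(sample))  # whole words are kept
```

The NFC step matters because a decomposed accent (a base letter followed by a combining mark) is not matched by `\w` in Python, so decomposed input would be split even by the broader pattern.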
Hi @geraki, we just had a pull request merged with this data: https://github.com/wiki-ai/revscoring/pull/317/files
How does it look?
We can certainly work on the Greek diacritics issues. I'll do some work on our tokenizer to handle that right now.
Just checked, and it looks like our tokenizer handles it, but the elwiki words came from an old run of BWDS. @Ladsgroup, do you think we could re-run BWDS with the most recent version of deltas and then use that to extend what we have for Greek language assets? I'll file a new task.