
Improve Czech Language assets
Open, Lowest, Public

Description

https://github.com/wikimedia/revscoring/blob/master/revscoring/languages/czech.py contains our current list of "Stop words" (words that carry little meaning but glue sentences together), "Badwords" (racial slurs and curse words), and "Informals" (casual language that doesn't belong in articles). Let's review it and make it better.

To get started, modify the tests first. See https://github.com/wikimedia/revscoring/blob/master/tests/languages/test_czech.py

Event Timeline

Halfak created this task. May 15 2019, 2:30 PM
Harej triaged this task as Lowest priority. Jun 4 2019, 9:25 PM
Harej moved this task from Untriaged to Blocked on community input on the Scoring-platform-team board.
Zoranzoki21 added a subscriber: Zoranzoki21.

This should be OK for GCI; it is very easy. It just needs two files updated and a pull request created on GitHub. My patch https://github.com/wikimedia/revscoring/pull/456 can be shown as an example.

Awesome! Thanks for picking this up!

Also note that we'd be happy with anyone taking on this task in any language -- Serbian, Czech, French, etc.

@Halfak To make sure I understand this correctly: the task is about a) verifying the current lists and b) adding any missing words to them? If so, I can offer my Czech language skills to mentor it. However, I would appreciate you being available in case I have questions about what each list means. If you can prepare a general explanation of why a student should do this (what benefit it brings), in a form understandable to anyone with zero knowledge of Wikimedia, that would be great. Thanks!

Sure!

We're using these word lists to build artificial intelligences (AIs) for Wikipedia. These AIs help Wikipedia editors catch vandalism, measure the quality of articles, and sort articles by topic. By extending these word lists, you're helping the AIs get a better sense of what's happening in a given language.

Badwords: These words are often used as insults or are otherwise crass. Curses and racial slurs belong in this list. We also want to include words that are often used as insults but aren't inherently bad, such as "gay" and "pig". The AI can often figure out what it needs to from context.

Informals: These words are perfectly welcome in a casual conversation, but they are unlikely to be appropriate in an encyclopedia article. Greetings such as "hello", shorthand such as "lol", and other informal language ("hahaha", "woo hoo" and "weeee") all belong in this list.

Stopwords: These words are very common in the language. They are the glue that holds sentences together, but they don't carry much meaning themselves. In English, words like "the", "and", "for", and "because" are stopwords. We need to make sure this list doesn't contain a meaningful word like "truck", "electric" or "think".
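
To make the idea concrete, here is a rough sketch of how a word list can be turned into a matcher and checked against example text. It is only an illustration: the lists are the English examples from this comment, `build_matcher` and `find_matches` are made-up helper names, and none of this is revscoring's actual code or the contents of czech.py. The same approach would apply to a badwords list.

```python
import re

# Hypothetical, tiny word lists taken from the English examples above.
# These are NOT the real lists in revscoring's czech.py.
STOPWORDS = ["the", "and", "for", "because"]
INFORMALS = ["hello", "lol", "hahaha"]


def build_matcher(words):
    """Compile one case-insensitive regex matching any listed word as a whole word."""
    pattern = r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b"
    return re.compile(pattern, re.IGNORECASE)


def find_matches(words, text):
    """Return every occurrence of a listed word in `text`, in order of appearance."""
    return build_matcher(words).findall(text)


text = "Hello, the truck is electric because lol"
print(find_matches(STOPWORDS, text))  # ['the', 'because']
print(find_matches(INFORMALS, text))  # ['Hello', 'lol']
```

Note that "truck" and "electric" are not matched by either list, which is the kind of thing a reviewer would want to check for each language: meaningful words should stay out of the stopwords list, and whole-word matching keeps "for" from matching inside "forever".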