Improve Czech Language assets
Open, Lowest, Public

Description

https://github.com/wikimedia/revscoring/blob/master/revscoring/languages/czech.py contains our current list of "Stop words" (words that carry little meaning but glue sentences together), "Badwords" (racial slurs and curse words), and "Informals" (casual language that doesn't belong in articles). Let's review it and make it better.

To get started, modify the tests first. See https://github.com/wikimedia/revscoring/blob/master/tests/languages/test_czech.py

Glossary

  • Badwords: These words are often used as insults or are otherwise crass. Curses and racial slurs belong in this list. We also want to include words that are often used as insults but aren't inherently bad, such as "gay" and "pig". The AI can often figure out what it needs to from context.
  • Informals: These words are perfectly welcome in a casual conversation, but they are unlikely to be appropriate in an encyclopedia article. Greetings such as "hello", shorthand such as "lol", and other informal language ("hahaha", "woo hoo" and "weeee") all belong in this list.
  • Stopwords: These words are really common in the language. They are the glue that holds sentences together, but they don't carry much meaning themselves. In English, words like "the", "and", "for", and "because" are stopwords. We need to make sure this list doesn't contain a meaningful word like "truck", "electric" or "think".
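
To make the three categories concrete, here is a minimal sketch of how word lists like these could be matched against text. It is not the revscoring API: the list contents, `compile_list`, and `count_matches` are hypothetical names chosen for illustration, and the sample entries are the English examples from the glossary rather than the real Czech lists.

```python
import re

# Hypothetical sample lists (English, taken from the glossary examples).
# The real Czech lists live in revscoring/languages/czech.py.
BADWORDS = [r"pig", r"gay"]
INFORMALS = [r"hello", r"lol", r"(?:ha)+h?", r"we+"]
STOPWORDS = [r"the", r"and", r"for", r"because"]


def compile_list(patterns):
    """Combine patterns into one case-insensitive, whole-word regex."""
    return re.compile(r"\b(?:" + "|".join(patterns) + r")\b", re.IGNORECASE)


REGEXES = {
    "badwords": compile_list(BADWORDS),
    "informals": compile_list(INFORMALS),
    "stopwords": compile_list(STOPWORDS),
}


def count_matches(text):
    """Return how many whole-word matches each list has in the text."""
    return {name: len(regex.findall(text)) for name, regex in REGEXES.items()}


if __name__ == "__main__":
    print(count_matches("Hello, the pig said hahaha and lol because weee"))
```

Writing entries as patterns rather than fixed strings lets a single entry cover variants such as "haha" and "hahaha", while the word boundaries keep a short entry from matching inside a longer, meaningful word.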

Event Timeline

Harej triaged this task as Lowest priority. Jun 4 2019, 9:25 PM
Kizule added a subscriber: Kizule.

Should be OK for GCI, this is so easy. It just needs updating two files and creating a pull request on GitHub. My patch https://github.com/wikimedia/revscoring/pull/456 can be shown as an example.

Awesome! Thanks for picking this up!

Also note that we'd be happy with anyone taking on this task in any language -- Serbian, Czech, French, etc.

@Halfak To ensure I understand this correctly: is it about a) verifying the current lists and b) adding new words to the lists if they are missing? I can offer my Czech language skills to mentor this then. However, I would appreciate you being available in case I have any questions about what the lists mean. If you can prepare a general explanation of why a student should do this (what the benefit is) in a form understandable to anyone with zero knowledge of Wikimedia, it would be great. Thanks!

Sure!

We're using these word lists to build artificial intelligences (AIs) for Wikipedia. These AIs help Wikipedia editors catch vandalism, measure the quality of articles, and sort articles by topic. By extending these word lists, you're helping the AIs get a better sense of what's happening in a given language.

Badwords: These words are often used as insults or are otherwise crass. Curses and racial slurs belong in this list. We also want to include words that are often used as insults but aren't inherently bad, such as "gay" and "pig". The AI can often figure out what it needs to from context.

Informals: These words are perfectly welcome in a casual conversation, but they are unlikely to be appropriate in an encyclopedia article. Greetings such as "hello", shorthand such as "lol", and other informal language ("hahaha", "woo hoo" and "weeee") all belong in this list.

Stopwords: These words are really common in the language. They are the glue that holds sentences together, but they don't carry much meaning themselves. In English, words like "the", "and", "for", and "because" are stopwords. We need to make sure this list doesn't contain a meaningful word like "truck", "electric" or "think".

Thank you. I've added that to the description and filed it as https://codein.withgoogle.com/dashboard/tasks/5199253656305664/. @Halfak Could you please review the task to make sure that is what you want? You should be able to access it after accepting the invite I sent you.

Feel free to add yourself as a co-mentor of this task at GCI, so you are notified about students' activity and are able to review their contribs.

Looks like I can't access the task. I've registered with GCI using my wikimedia email address (ahalfaker@wikimedia.org)

Could you try once again, please?