Improve Czech Language assets
Closed, Declined (Public)

Description

https://github.com/wikimedia/revscoring/blob/master/revscoring/languages/czech.py contains our current list of "Stop words" (words that carry little meaning but glue sentences together), "Badwords" (racial slurs and curse words), and "Informals" (casual language that doesn't belong in articles). Let's review it and make it better.

To get started, modify the tests first. See https://github.com/wikimedia/revscoring/blob/master/tests/languages/test_czech.py
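A minimal sketch of what "tests first" could look like, assuming the word lists are written as regex fragments. The patterns, the helper names `compile_word_regex` and `check_extraction`, and the example words are all hypothetical placeholders for illustration; they are not the contents of czech.py or the helpers used in test_czech.py.

```python
import re

# Hypothetical placeholder patterns (English, for illustration only);
# the real lists in czech.py are different.
INFORMAL_PATTERNS = [r"lol", r"ha(ha)+", r"hello"]


def compile_word_regex(patterns):
    """Join word patterns into one case-insensitive, whole-word regex."""
    return re.compile(r"\b(?:" + "|".join(patterns) + r")\b", re.IGNORECASE)


def check_extraction(regex, should_match, should_not_match):
    """Return the expectations that fail; an empty list means all pass."""
    failures = []
    for word in should_match:
        if not regex.fullmatch(word):
            failures.append(("missed", word))
    for word in should_not_match:
        if regex.search(word):
            failures.append(("false positive", word))
    return failures


informals_regex = compile_word_regex(INFORMAL_PATTERNS)
```

The idea is to write down the words you expect to match (and a few you expect not to match) before touching the list itself, then extend the patterns until `check_extraction` returns an empty list.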

Glossary

  • Badwords: These words are often used as insults or are otherwise crass. Curses and racial slurs belong in this list. We also want to include words that are often used as insults but aren't inherently bad, such as "gay" and "pig". The AI can often figure out what it needs to from context.
  • Informals: These words are perfectly welcome in a casual conversation, but they are unlikely to be appropriate in an encyclopedia article. Greetings such as "hello", shorthand such as "lol", and other informal language ("hahaha", "woo hoo" and "weeee") all belong in this list.
  • Stopwords: These words are really common in the language. They are the glue that holds sentences together, but they don't carry much meaning themselves. In English, words like "the", "and", "for", and "because" are stopwords. We need to make sure this list doesn't contain a meaningful word like "truck", "electric" or "think".
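To make the three categories concrete, here is a rough sketch that sorts the tokens of a text into them and flags meaningful words that have slipped into a stopword list. It uses the English examples from the glossary above as stand-in lists; the function names `count_categories` and `misplaced_stopwords` are made up for this illustration and are not part of revscoring.

```python
import re
from collections import Counter

# Stand-in lists built from the English examples in the glossary;
# these are assumptions for illustration, not the Czech lists.
BADWORDS = {"gay", "pig"}
INFORMALS = {"hello", "lol", "hahaha"}
STOPWORDS = {"the", "and", "for", "because"}


def count_categories(text):
    """Count how many tokens of `text` fall into each word list."""
    counts = Counter()
    for token in re.findall(r"\w+", text.lower()):
        if token in BADWORDS:
            counts["badwords"] += 1
        elif token in INFORMALS:
            counts["informals"] += 1
        elif token in STOPWORDS:
            counts["stopwords"] += 1
        else:
            counts["other"] += 1
    return counts


def misplaced_stopwords(stopwords, meaningful_words):
    """Return meaningful words that wrongly appear in a stopword list."""
    return sorted(set(stopwords) & set(meaningful_words))
```

For example, `misplaced_stopwords({"the", "truck"}, ["truck", "electric", "think"])` reports `"truck"`, which is the kind of entry a reviewer would want to remove from the stopword list.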

Event Timeline

Harej triaged this task as Lowest priority. Jun 4 2019, 9:25 PM
Harej moved this task from Unsorted to Blocked on community input on the Machine-Learning-Team board.
Kizule subscribed.

Should be OK for GCI; this is easy. It just needs two files updated and a pull request created on GitHub. My patch https://github.com/wikimedia/revscoring/pull/456 can be shown as an example.

Awesome! Thanks for picking this up!

Also note that we'd be happy with anyone taking on this task in any language -- Serbian, Czech, French, etc.

@Halfak To make sure I understand this correctly: it is about a) verifying the current lists and b) adding new words to the lists if any are missing? I can offer my Czech language skills to mentor this, but I would appreciate you being available in case I have any questions about what the lists mean. If you can prepare a general explanation of why a student should do this (what benefit it brings) in a form understandable to anyone with zero knowledge of Wikimedia, that would be great. Thanks!

Sure!

We're using these word lists to build artificial intelligences (AIs) for Wikipedia. These AIs help Wikipedia editors catch vandalism, measure the quality of articles, and sort articles by topic. By extending these word lists, you're helping the AIs get a better sense of what's happening in a given language.

Badwords: These words are often used as insults or are otherwise crass. Curses and racial slurs belong in this list. We also want to include words that are often used as insults but aren't inherently bad, such as "gay" and "pig". The AI can often figure out what it needs to from context.

Informals: These words are perfectly welcome in a casual conversation, but they are unlikely to be appropriate in an encyclopedia article. Greetings such as "hello", shorthand such as "lol", and other informal language ("hahaha", "woo hoo" and "weeee") all belong in this list.

Stopwords: These words are really common in the language. They are the glue that holds sentences together, but they don't carry much meaning themselves. In English, words like "the", "and", "for", and "because" are stopwords. We need to make sure this list doesn't contain a meaningful word like "truck", "electric" or "think".

Thank you. I've added that to the description and filed it as https://codein.withgoogle.com/dashboard/tasks/5199253656305664/. @Halfak Could you please review the task to confirm that it is what you want? You should be able to access it after accepting the invite I sent you.

Feel free to add yourself as a co-mentor of this task at GCI, so that you are notified about students' activity and are able to review their contributions.

Looks like I can't access the task. I've registered with GCI using my wikimedia email address (ahalfaker@wikimedia.org)

Could you try once again, please?

Hey, I worked on this issue, but for the Hindi language, and created a PR for it. Can you please review it?
https://github.com/wikimedia/revscoring/pull/512

@dgsahethi: Thanks! This ticket is about the Czech language. Please create a separate Phab ticket for Hindi if one doesn't exist yet. Thanks.

@Aklapper Oh, will do that!
Also, I wanted to ask whether there is a dedicated chat channel for this project, on IRC, Slack, or anything else?

Aklapper edited projects, added revscoring; removed good first task, ORES.

Declining, as this task lacks clear criteria for when to call it done (it made sense in a Google Code-in context, though).