https://github.com/wikimedia/revscoring/blob/master/revscoring/languages/czech.py contains our current list of "Stop words" (words that carry little meaning but glue sentences together), "Badwords" (racial slurs and curse words), and "Informals" (casual language that doesn't belong in articles). Let's review it and make it better.
To get started, modify the tests first. See https://github.com/wikimedia/revscoring/blob/master/tests/languages/test_czech.py
Glossary
- Badwords: These words are often used as insults or are otherwise crass. Curses, racial slurs, belong in this list. We also want to include words that are often used as insults but aren't inherently bad. Such as "gay" and "pig". The AI can often figured out what it needs to from context.
- Informals: These words are perfectly welcome in a casual conversation, but they are unlikely to be appropriate in an encyclopedia article. Greetings such as "hello", shorthand such as "lol", and other informal language ("hahaha", "woo hoo" and "weeee") all belong in this list.
- Stopwords: These words are really common to the language. They are the glue that holds sentences together. But they don't carry much meaning themselves. In english, words like "the", "and", "for", and "because" are stopwords. We need to make sure this list doesn't contain a meaningful word like "truck", "electric" or "think".