- generate TF-IDF badword lists
- review and aggregation of badwords/informal words by a native speaker
- implement revscoring.Language (Language utility)
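For reference, the first step above can be sketched roughly as follows. This is only a minimal illustration of TF-IDF scoring, not the actual script used for this task: the function names `tfidf_scores` and `top_terms`, the smoothed IDF formula, and the idea of comparing one bag of tokens against a set of background documents are all assumptions made for the example.

```python
import math
from collections import Counter


def tfidf_scores(target_tokens, background_docs):
    """Score each term in target_tokens by TF * smoothed IDF.

    target_tokens: list of tokens (hypothetically, words added in reverted edits).
    background_docs: list of token lists used to estimate document frequency.
    """
    if not target_tokens:
        return {}
    tf = Counter(target_tokens)
    n_docs = len(background_docs)
    doc_sets = [set(doc) for doc in background_docs]
    scores = {}
    for term, count in tf.items():
        # Number of background documents that contain the term.
        df = sum(1 for doc in doc_sets if term in doc)
        # Smoothed IDF so that unseen terms do not divide by zero.
        idf = math.log((1 + n_docs) / (1 + df)) + 1
        scores[term] = (count / len(target_tokens)) * idf
    return scores


def top_terms(target_tokens, background_docs, k=10):
    """Return the k highest-scoring terms, ties broken alphabetically."""
    scores = tfidf_scores(target_tokens, background_docs)
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]
```

Terms that are frequent in the target tokens but rare in the background documents rank highest, which is why the generated list still needs review by a native speaker.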
You need to either download the files and open them in Notepad (or gedit, or any other suitable editor), or, in your browser, find the encoding option (probably in the View menu) and choose "UTF-8" or "Unicode".
There's nothing wrong with these files regarding encoding.
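If it helps, the files can also be read programmatically with an explicit encoding. The snippet below is just a sketch: the helper name `read_word_list` and the one-word-per-line layout are assumptions, not something the files are confirmed to follow.

```python
def read_word_list(path):
    """Read a word list as UTF-8 and return its non-empty, stripped lines."""
    # Passing encoding explicitly avoids falling back to the platform default,
    # which is a common cause of garbled non-Latin text.
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]
```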
I fixed the encoding and copied it to the wiki here: https://meta.wikimedia.org/wiki/Research:Revision_scoring_as_a_service/Word_lists/ur
@MuhammadShuaib, could you take another look? We need you to move words from "list-generated" to "list-badwords" (racial slurs, curse words, offensive language) and "list-informals" (casual talk, e.g., "hello", "haha", "lol", "wat"). Please let me know if you have any questions.