Started the bot to analyze Swahili Wikipedia:
tools.dexbot@tools-bastion-03:~/pywikibot-core$ jsub -once -N sw_bwds -mem 7g -l release=trusty /data/project/dexbot/pywikibot-core/p3_2/bin/python /data/project/dexbot/pywikibot-core/pwb.py /data/project/dexbot/pywikibot-core/scripts/dump_based_detection_beta.py /public/dumps/public/swwiki/20170301/swwiki-20170301-pages-meta-history.xml.bz2
Your job 3887264 ("sw_bwds") has been submitted
The results should show up at https://meta.wikimedia.org/wiki/Research:Revision_scoring_as_a_service/Word_lists/sw in several hours, ready for review.
Looks like this is ready for your attention @Baba_Tabita. See https://meta.wikimedia.org/wiki/Objective_Revision_Evaluation_Service/BWDS_review for instructions on what we need you to do.
Update (based on email exchange with Halfak in May 2017):
I had a look at the BWDS list, and almost 100% of the words in it are English rather than Swahili. The handful of Swahili words in it are not bad words; they just happen to occur in text that was reverted. Sorry!
Thanks for posting here. It looks like Swahili uses Latin characters, so we won't be able to use a Unicode character range to limit the search. Maybe we can limit the results using an English dictionary.
Either way, it seems there's a lot of signal in just matching English words. We should include English dictionary word counts in the models we build.
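A rough sketch of both ideas, in case it helps the discussion: dropping English entries from a candidate list, and counting English dictionary words in a piece of text. This is not the BWDS script or revscoring code; the function names (`tokenize`, `count_english_words`, `drop_english`) and the tiny word set are made up for illustration, and a real run would load a full English word list.

```python
import re


def tokenize(text):
    """Split text into lowercase runs of letters (digits and punctuation are dropped)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def count_english_words(text, english_words):
    """Count tokens in `text` that appear in `english_words` (a set of lowercase words)."""
    return sum(1 for token in tokenize(text) if token in english_words)


def drop_english(candidates, english_words):
    """Return the candidate words that are NOT in the English word set."""
    return [word for word in candidates if word.lower() not in english_words]


# Toy data only; a real English dictionary would be loaded from a file.
ENGLISH = {"stupid", "the", "is"}

remaining = drop_english(["stupid", "the", "habari"], ENGLISH)
english_count = count_english_words("Habari, the page is STUPID", ENGLISH)
```

The count could be fed to a model as a feature alongside whatever the Swahili word lists produce, while the filter only cleans up the candidate list before human review.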
@Baba_Tabita, do you think you could help us by generating a curated list of bad words (curses, slurs, etc.) and informal terms ("haha", "lol", "stupid", etc.)? There are often lists in Wikipedia articles. It seems like http://www.youswear.com/index.asp?language=Swahili and http://www.youswear.com/index.asp?language=Kiswahili would get you started on curses. Note that it is helpful if you can include the spelling variations people commonly use to circumvent swear filters, e.g. "shit" == "sh1t" == "shiiit".
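To illustrate the kind of variation meant here, a minimal sketch of turning one base word into a regex that also catches repeated letters and a few look-alike substitutions. The `SUBSTITUTIONS` table and the `variant_pattern` helper are hypothetical and not part of revscoring; the substitutions that matter for Swahili would need to come from someone who knows the language.

```python
import re

# Hypothetical look-alike table: each letter maps to the characters that may stand in for it.
SUBSTITUTIONS = {
    "a": "a4@",
    "e": "e3",
    "i": "i1!",
    "o": "o0",
    "s": "s5$",
}


def variant_pattern(word):
    """Build a case-insensitive regex matching `word` with repeated or substituted letters."""
    parts = []
    for ch in word.lower():
        chars = SUBSTITUTIONS.get(ch, ch)
        # One or more of any allowed character for this position.
        parts.append("[" + re.escape(chars) + "]+")
    # Lookarounds instead of \b so that variants starting with "$" or "@" still match.
    return re.compile(r"(?<!\w)" + "".join(parts) + r"(?!\w)", re.IGNORECASE)


pattern = variant_pattern("shit")
matches = [bool(pattern.search(s)) for s in ["shit", "sh1t", "shiiit", "$hit", "shirt"]]
```

Here the first four strings match and "shirt" does not. A curated list would still need a human to decide which base words go in; the pattern only saves listing every spelling by hand.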
@kevinbazira, could you take a look at this task? I figure you might be cut out for it. Right now, we need to generate lists of "badwords" and "informals". To get a sense for what kind of words belong in each list, consult the English Language word lists:
And tests here: https://github.com/wikimedia/revscoring/blob/master/tests/languages/test_english.py