
Add language support for Swahili (sw)
Open, Stalled, Low, Public


Event Timeline

Restricted Application added a subscriber: Aklapper. Apr 5 2017, 2:38 PM
Halfak assigned this task to Ladsgroup. Apr 12 2017, 10:41 PM
Restricted Application added a project: User-Ladsgroup. Apr 12 2017, 10:41 PM

Started the bot to analyze Swahili Wikipedia:

tools.dexbot@tools-bastion-03:~/pywikibot-core$ jsub -once -N sw_bwds -mem 7g -l release=trusty /data/project/dexbot/pywikibot-core/p3_2/bin/python /data/project/dexbot/pywikibot-core/ /data/project/dexbot/pywikibot-core/scripts/ /public/dumps/public/swwiki/20170301/swwiki-20170301-pages-meta-history.xml.bz2 
Your job 3887264 ("sw_bwds") has been submitted

The results should be ready for review in several hours.

Looks like this is ready for your attention @Baba_Tabita. See for instructions on what we need you to do.

Halfak updated the task description. Apr 14 2017, 5:49 PM
Halfak triaged this task as Low priority. May 11 2017, 2:41 PM
Halfak moved this task from Untriaged to New development on the Scoring-platform-team board.
Restricted Application added a project: artificial-intelligence. Jul 21 2017, 11:08 AM

Update (based on email exchange with Halfak in May 2017):
I had a look at the BWDS list, and almost 100% of the words in it are English rather than Swahili. The handful of Swahili words in it are not bad words; they just happen to occur in text that was reverted. Sorry!

Thanks for posting here. It looks like Swahili uses Latin characters, so we won't be able to use a Unicode character range to limit the search. Maybe we can limit the results using an English dictionary.

Either way, it seems there's a lot of signal in just matching English words. We should include English dictionary word counts in the models we build.
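
For illustration only, here is a minimal Python sketch of the two ideas above: dropping English dictionary words from a candidate list, and counting English dictionary words in a piece of text as a model feature. The names `ENGLISH_WORDS`, `drop_english`, and `english_word_count` are hypothetical, and the tiny word set stands in for a real dictionary. This is not the code the bot or the models actually use.

```python
import re

# Hypothetical stand-in for a full English dictionary (assumption: a real run
# would load a proper word list instead of this tiny set).
ENGLISH_WORDS = {"stupid", "hello", "the", "is"}

# Runs of letters only (no digits, no underscores), Unicode-aware.
TOKEN_RE = re.compile(r"[^\W\d_]+")


def tokenize(text):
    """Split text into lowercase alphabetic tokens."""
    return [t.lower() for t in TOKEN_RE.findall(text)]


def drop_english(candidates, english_words=ENGLISH_WORDS):
    """Keep only candidate words that are not in the English dictionary."""
    return [w for w in candidates if w.lower() not in english_words]


def english_word_count(text, english_words=ENGLISH_WORDS):
    """Count tokens in text that appear in the English dictionary."""
    return sum(1 for t in tokenize(text) if t in english_words)
```

With a full dictionary, `drop_english` would trim a generated candidate list down to the non-English entries for manual review, and `english_word_count` could be computed per edit as one input feature.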

@Baba_Tabita, do you think you could help us by generating a curated list of bad words (curses, slurs, etc.) and informal terms ("haha", "lol", "stupid", etc.)? There are often lists in Wikipedia articles. It seems like, would get you started on curses. Note that it is helpful if you can modify the words in ways that are commonly used to circumvent swear filters, e.g. "shit" == "sh1t" == "shiiit".
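
As a rough sketch of what matching such variants could look like, the Python snippet below builds one regular expression per word that tolerates repeated letters and a few common character substitutions. The `SUBSTITUTIONS` table and the `variant_pattern` helper are assumptions made up for this example, not the project's actual word-list format.

```python
import re

# Hypothetical table of look-alike characters (assumption; extend as needed).
SUBSTITUTIONS = {"a": "a4@", "e": "e3", "i": "i1!", "o": "o0", "s": "s5$"}


def variant_pattern(word):
    """Build a regex matching a word with repeated or substituted letters."""
    parts = []
    for ch in word.lower():
        chars = SUBSTITUTIONS.get(ch, ch)
        # One or more of any accepted character for this position.
        parts.append("[" + re.escape(chars) + "]+")
    # Lookarounds keep the match from starting or ending inside a longer word.
    return re.compile(r"(?<!\w)" + "".join(parts) + r"(?!\w)", re.IGNORECASE)


pattern = variant_pattern("shit")
for sample in ["shit", "sh1t", "shiiit", "shift"]:
    print(sample, bool(pattern.search(sample)))
```

A pattern like this trades precision for recall, so any list built this way would still need review by a native speaker.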

Halfak removed Ladsgroup as the assignee of this task. Aug 7 2017, 4:59 PM
Halfak added a subscriber: Ladsgroup.

Hi @Baba_Tabita. As you can see in my last comment, I linked to a set of curse words available online. Could you review that list and filter it down to a list that makes sense to you? I'd also be very interested in racial slurs and other totally inappropriate language.

Hi @Halfak. Thanks for the link, and sorry for the delay. It's still on my to-do list ...

Ladsgroup changed the task status from Open to Stalled. Jan 28 2019, 11:12 AM

@kevinbazira, could you take a look at this task? I figure you might be cut out for it. Right now, we need to generate lists of "badwords" and "informals". To get a sense for what kind of words belong in each list, consult the English Language word lists:
And tests here:

Thanks @Halfak! I've looked at the BWDS list and, as @Baba_Tabita said, the Swahili words on the list are not bad words.

I am manually generating two lists, "badwords" and "informals", which should be ready by COB today.