Add language support for Swahili (sw)
Open, Low, Public

Description

Event Timeline

Started the bot to analyze Swahili Wikipedia:

tools.dexbot@tools-bastion-03:~/pywikibot-core$ jsub -once -N sw_bwds -mem 7g -l release=trusty /data/project/dexbot/pywikibot-core/p3_2/bin/python /data/project/dexbot/pywikibot-core/pwb.py /data/project/dexbot/pywikibot-core/scripts/dump_based_detection_beta.py /public/dumps/public/swwiki/20170301/swwiki-20170301-pages-meta-history.xml.bz2 
Your job 3887264 ("sw_bwds") has been submitted

The results should be available for review at https://meta.wikimedia.org/wiki/Research:Revision_scoring_as_a_service/Word_lists/sw in several hours.

Looks like this is ready for your attention @Baba_Tabita. See https://meta.wikimedia.org/wiki/Objective_Revision_Evaluation_Service/BWDS_review for instructions on what we need you to do.

Halfak moved this task from Unsorted to New development on the Machine-Learning-Team board.

Update (based on email exchange with Halfak in May 2017):
I had a look at the BWDS list, and almost 100% of the words in it are English rather than Swahili. The handful of Swahili words in it are not bad words; they just happen to occur in text that was reverted. Sorry!

Thanks for posting here. It looks like Swahili uses Latin characters, so we won't be able to use a Unicode character range to limit the search. Maybe we can limit the results using an English dictionary.

Either way, it seems there's a lot of signal in just matching English words. We should include English dictionary word counts in the models we build.
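
A minimal Python sketch of both ideas, assuming a plain set of English words is available: drop English words from a candidate list, and count English dictionary words in a piece of text as a possible model feature. The names `ENGLISH_WORDS`, `tokenize`, `drop_english`, and `english_word_count` are hypothetical and are not part of BWDS or revscoring; the tiny word set stands in for a real dictionary.

```python
import re

# Hypothetical stand-in for a real English dictionary (e.g. loaded from a word list file).
ENGLISH_WORDS = {"the", "stupid", "hello", "is", "you"}

# Runs of letters only (no digits or underscores), Unicode-aware.
TOKEN_RE = re.compile(r"[^\W\d_]+")


def tokenize(text):
    """Split text into lowercase alphabetic tokens."""
    return [t.lower() for t in TOKEN_RE.findall(text)]


def drop_english(candidates, english_words=ENGLISH_WORDS):
    """Keep only candidate words that are NOT in the English dictionary."""
    return [w for w in candidates if w.lower() not in english_words]


def english_word_count(text, english_words=ENGLISH_WORDS):
    """Count tokens in the text that appear in the English dictionary."""
    return sum(1 for t in tokenize(text) if t in english_words)


if __name__ == "__main__":
    print(drop_english(["stupid", "habari", "Hello"]))
    print(english_word_count("You is stupid, habari!"))
```

With a real dictionary, `drop_english` would be applied to the generated candidate list before review, and `english_word_count` would be computed per revision.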

@Baba_Tabita, do you think you could help us by generating a curated list of bad words (curses, slurs, etc.) and informal terms ("haha", "lol", "stupid", etc.)? There are often lists in Wikipedia articles. It seems like http://www.youswear.com/index.asp?language=Swahili and http://www.youswear.com/index.asp?language=Kiswahili would get you started on curses. Note that it is helpful if you can modify the words in ways that are commonly used to circumvent swear filters, e.g. "shit" == "sh1t" == "shiiit".
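
As a rough illustration of the filter-circumvention variants mentioned above, here is a small Python sketch that turns a plain word into a regular expression that tolerates repeated letters and a few look-alike characters. The `SUBSTITUTIONS` table and the `variant_pattern` helper are assumptions made up for this example; they are not the patterns used in revscoring, and a real list would be hand-tuned per word.

```python
import re

# Illustrative look-alike substitutions; not exhaustive and not taken from any real filter.
SUBSTITUTIONS = {"a": "a4@", "e": "e3", "i": "i1!", "o": "o0", "s": "s5$"}


def variant_pattern(word):
    """Build a case-insensitive regex matching `word` with repeated
    letters and simple character substitutions."""
    parts = []
    for ch in word.lower():
        chars = SUBSTITUTIONS.get(ch, ch)
        # One or more of any allowed character at this position.
        parts.append("[" + re.escape(chars) + "]+")
    # Lookarounds instead of \b so that a variant starting with "$" still matches.
    return re.compile(r"(?<!\w)" + "".join(parts) + r"(?!\w)", re.IGNORECASE)


if __name__ == "__main__":
    pattern = variant_pattern("shit")
    for text in ["shit", "sh1t", "shiiit", "$h1t", "shirt"]:
        print(text, bool(pattern.search(text)))
```

Generated patterns like this over-match in some cases (every letter may repeat), so they are a starting point for review rather than a finished word list.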

Halfak added a subscriber: Ladsgroup.

Hi @Baba_Tabita. As you can see in my last comment, I linked to a set of curse words available online. Could you review that list and filter it down to a list that makes sense to you? I'd also be very interested in racial slurs and other totally inappropriate language.

Hi @Halfak. Thanks for the link, and sorry for the delay. It's still on my to-do list ...

Ladsgroup changed the task status from Open to Stalled. Jan 28 2019, 11:12 AM

@kevinbazira, could you take a look at this task? I figure you might be cut out for it. Right now, we need to generate lists of "badwords" and "informals". To get a sense for what kind of words belong in each list, consult the English Language word lists:
Here: https://github.com/wikimedia/revscoring/blob/master/revscoring/languages/english.py
And tests here: https://github.com/wikimedia/revscoring/blob/master/tests/languages/test_english.py

Thanks @Halfak! I've looked at the BWDS list and as @Baba_Tabita said, the Swahili words on the list are not bad words.

I am manually generating two lists, "badwords" and "informals", which should be ready by COB today.

@kevinbazira, did the "badwords" and "informals" lists ever get merged? Or did we wind up skipping this?

Aklapper changed the task status from Stalled to Open. Jun 15 2021, 11:33 AM

The previous comments don't explain who or what (task?) exactly this task is stalled on ("If a report is waiting for further input (e.g. from its reporter or a third party) and can currently not be acted on"). Hence resetting task status, as tasks should not be stalled (and then potentially forgotten) for years for unclear reasons.