
Don't use "ha" as an informal in Hungarian
Closed, Resolved (Public)

Event Timeline

Background: the word "ha" means "if" in Hungarian and is widely used in written text.

Can we use the BVDS on a wiki basis? A list of words per language, or something like that?

@Ladsgroup, what do you think about storing badwords in Wikidata? It seems like having well curated lists in an external repo could be very useful and a better way to maintain the data.

Well, we had the suggestion to store it in the MediaWiki namespace of the wikis, which is nicer because it can't be vandalized (easily), but we would still face the problem of rebuilding the models after each change.

The PR is merged now; we need to rebuild the model for Hungarian.

It is not clear to me from the code whether this patch is a general solution or works only for the word 'ha'.

It would be useful to have an option to build a bad-word dictionary per language (for example on a local MediaWiki page). There may be other false-positive words on the English list, and I am sure the Hungarian list is missing several words that would make an edit very likely vandalism.

We do have a badword list for each language. If you could review the English list, we can make sure to exclude others.

See https://github.com/wiki-ai/revscoring/blob/master/revscoring/languages/tests/test_english.py

Here's the words we match in Hungarian: https://github.com/wiki-ai/revscoring/blob/master/revscoring/languages/tests/test_hungarian.py

Edit: Warning! There's a bunch of offensive stuff in those links! That should go without saying, but better to say it than not. :)

Oh thanks, I see :)

I checked the English list.
"dada" has several meanings in Hungarian, all of them good. (Even in English, see https://en.wikipedia.org/wiki/Dada )
I would definitely exclude "ok" from the list in Hungarian (its meaning is "reason" or "cause").
There are good words in Hungarian that contain "fart" as a part (though it is not a good word on its own).
"hu" is often used for example as part of a web address (.hu is the domain for Hungary).
I can imagine "terrorist" is used in good faith in Wikipedia articles.
I don't understand why "association" is in the list (in the Other category, line 166).

I checked the Hungarian list.
I don't see anything wrong with "infosarok".
I have no idea why the text (the whole paragraph between lines 131 and 136) is in the OTHER part. It is perfectly good article content.

Question: is there a plan to use ORES outside of the article (0) namespace? (There are several words on the lists that shouldn't be in the article namespace but are fine in other namespaces. Just one example: "user" is on the list.)

Can you give me some examples of good "fart" words in Hungarian?

I don't understand why "association" is in the list (in the Other category, line 166).

OTHER is a set of words we want to make sure *do not* get matched :) That's why we paste some good content there. None of it should get picked up. If it does, our tests will fail and we'll know we made a big mistake.
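To make the idea concrete, here is a minimal sketch of that kind of test, using only Python's standard `re` module. The patterns, the word lists, and the names `INFORMAL_PATTERNS`, `SHOULD_MATCH`, `find_matches` and `check_lists` are made up for illustration; this is not the actual revscoring code or its real lists.

```python
import re

# Hypothetical patterns, for illustration only (not the real revscoring list).
INFORMAL_PATTERNS = [
    r"(?:ha){2,}",  # "haha", "hahaha", ... but never a single "ha"
    r"lol+",        # "lol", "loll", ...
]

# One case-insensitive regex that only matches whole words.
informal_re = re.compile(
    r"\b(?:" + "|".join(INFORMAL_PATTERNS) + r")\b", re.IGNORECASE
)

# Words that must be matched.
SHOULD_MATCH = ["haha", "hahaha", "LOL"]

# Good content that must NOT be matched, including a whole sentence.
OTHER = ["ha", "association", "Ha esik az eső, otthon maradok."]


def find_matches(text):
    """Return every informal word found in the text."""
    return [m.group(0) for m in informal_re.finditer(text)]


def check_lists():
    """Fail loudly if a listed word is missed or good content is matched."""
    for word in SHOULD_MATCH:
        assert find_matches(word) == [word], word
    for text in OTHER:
        assert find_matches(text) == [], text


check_lists()
```

If someone later adds a pattern that accidentally catches the Hungarian "ha", the second loop fails, which is exactly the signal described above.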

There are several words on the lists, which shouldn't be in article namespace, but are good in other namespaces.

Perfect. The "Informals" list should be words that are OK in conversation, but not in an article.

Can you give me some examples of good "fart" words in Hungarian?

Two examples: "fartő" (a part of cattle) or "fartőke" (a part of a boat).
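A small sketch of why these two words matter for matching, assuming the badword is compiled as a whole-word regex (an assumption about the approach, not a statement about how revscoring builds its patterns). The names `whole_word` and `substring` are hypothetical.

```python
import re

# Whole-word match: "fart" must be delimited by word boundaries.
whole_word = re.compile(r"\bfart\b", re.IGNORECASE)

# Naive substring match, shown only for comparison.
substring = re.compile(r"fart", re.IGNORECASE)

for word in ["fart", "fartő", "fartőke"]:
    print(
        word,
        "whole-word:", bool(whole_word.search(word)),
        "substring:", bool(substring.search(word)),
    )
```

In Python 3, `\b` is Unicode-aware for `str` patterns, so "ő" counts as a word character and the whole-word pattern leaves "fartő" and "fartőke" alone, while the substring pattern would flag both.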

OK I'm rebuilding the model now. Thanks for your help. :)