Change Details

Once we have a set of [[ https://phabricator.wikimedia.org/T299428 | word tokens ]] identified in the wikitext, we want to classify these word tokens into dictionary words, non-dictionary words, non_safe words (bad words), and informal words. This will allow us to build the credibility signals for these classes of words.

{F34923602}

Implementation details:

[1] This utility will live under structured-data/packages.

[2] is_dictionary_word(//word_token//) -> True/False
If a word token exists in the dictionary for any language, we return true. Language-specific dictionary checks are implemented using enchant/aspell/myspell in [[ https://github.com/wikimedia/revscoring/tree/773f9cd8029de7ef5c7713addd2f6661bce94b4e#macos | revscoring ]]. You will need to find similar libraries for dictionary checks in Golang. Please refer to the check for [[ https://github.com/wikimedia/revscoring/blob/773f9cd8029de7ef5c7713addd2f6661bce94b4e/revscoring/languages/english.py#L17 | English here ]]. Similarly, browse through the directory for [[ https://github.com/wikimedia/revscoring/tree/773f9cd8029de7ef5c7713addd2f6661bce94b4e/revscoring/languages | other languages ]].

[3] If [2] returns false, the token is a non_dictionary word.

[4] is_non_safe_word(//word_token//) -> True/False
is_informal_word(//word_token//) -> True/False
You will need to copy/paste the lists of bad_words and informals for each language. Refer to the [[ https://github.com/wikimedia/revscoring/blob/773f9cd8029de7ef5c7713addd2f6661bce94b4e/revscoring/languages/english.py#L142 | non_safe words ]] and [[ https://github.com/wikimedia/revscoring/blob/773f9cd8029de7ef5c7713addd2f6661bce94b4e/revscoring/languages/english.py#L206 | informal ]] lists for English in the links. Similarly, browse through the directory for [[ https://github.com/wikimedia/revscoring/tree/773f9cd8029de7ef5c7713addd2f6661bce94b4e/revscoring/languages | other languages ]].
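
For discussion, here is a minimal Go sketch of the four checks described above. It is an illustration under assumptions, not the final design: the `Dictionary` interface, the `WordSet` and `Classifier` types, and `newExampleClassifier` are hypothetical names, the in-memory word sets stand in for a real spell-checking backend that still has to be chosen, and the sample words are placeholders rather than copies of the revscoring lists. revscoring expresses bad words and informals as regular expressions, so exact-match sets may turn out to be too simple.

```go
package main

import (
	"fmt"
	"strings"
)

// Dictionary reports whether a word is known in one language.
// Hypothetical interface: a real implementation would wrap whichever
// spell-checking library is chosen for Go.
type Dictionary interface {
	Contains(word string) bool
}

// WordSet is a toy in-memory word list keyed by lowercased words.
type WordSet map[string]struct{}

// NewWordSet builds a WordSet, lowercasing every entry.
func NewWordSet(words ...string) WordSet {
	s := make(WordSet, len(words))
	for _, w := range words {
		s[strings.ToLower(w)] = struct{}{}
	}
	return s
}

// Contains does a case-insensitive exact match.
func (s WordSet) Contains(word string) bool {
	_, ok := s[strings.ToLower(word)]
	return ok
}

// Classifier holds per-language resources keyed by language code.
type Classifier struct {
	Dictionaries map[string]Dictionary
	NonSafe      map[string]WordSet
	Informal     map[string]WordSet
}

// IsDictionaryWord returns true if the token exists in the dictionary
// of any configured language.
func (c *Classifier) IsDictionaryWord(token string) bool {
	for _, d := range c.Dictionaries {
		if d.Contains(token) {
			return true
		}
	}
	return false
}

// IsNonDictionaryWord is the negation of IsDictionaryWord.
func (c *Classifier) IsNonDictionaryWord(token string) bool {
	return !c.IsDictionaryWord(token)
}

// inAnyList checks the token against every language's list.
func inAnyList(lists map[string]WordSet, token string) bool {
	for _, l := range lists {
		if l.Contains(token) {
			return true
		}
	}
	return false
}

// IsNonSafeWord returns true if the token is in any bad-word list.
func (c *Classifier) IsNonSafeWord(token string) bool {
	return inAnyList(c.NonSafe, token)
}

// IsInformalWord returns true if the token is in any informal-word list.
func (c *Classifier) IsInformalWord(token string) bool {
	return inAnyList(c.Informal, token)
}

// newExampleClassifier wires up placeholder data for two languages.
func newExampleClassifier() *Classifier {
	return &Classifier{
		Dictionaries: map[string]Dictionary{
			"en": NewWordSet("house", "stupid"),
			"fr": NewWordSet("maison"),
		},
		NonSafe:  map[string]WordSet{"en": NewWordSet("stupid")},
		Informal: map[string]WordSet{"en": NewWordSet("lol", "haha")},
	}
}

func main() {
	c := newExampleClassifier()
	for _, tok := range []string{"house", "maison", "xqzt", "lol", "stupid"} {
		fmt.Printf("%-7s dictionary=%t non_safe=%t informal=%t\n",
			tok, c.IsDictionaryWord(tok), c.IsNonSafeWord(tok), c.IsInformalWord(tok))
	}
}
```

Keeping the dictionary lookup behind an interface means the spell-checking library can be swapped later without touching the classification logic. Note that the classes are not mutually exclusive in this sketch: a token can be both a dictionary word and a non_safe word.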