Once we have identified a set of [[ https://phabricator.wikimedia.org/T299428 | word tokens ]] in the wikitext, we want to classify each token as a dictionary word, a non-dictionary word, a non_safe word (bad word), or an informal word. This will allow us to build credibility signals for each of these classes of words.
Implementation details:
[1] This utility will live under structured-data/packages.
[2] is_dictionary_word(//word_token//) -> True/False
If a word token exists in the dictionary for any language, we return true. In [[ https://github.com/wikimedia/revscoring/tree/773f9cd8029de7ef5c7713addd2f6661bce94b4e#macos | revscoring ]], the language-specific dictionary checks are implemented using enchant/aspell/myspell. You will need to find comparable dictionary-check libraries for Go. Please refer to the check for [[ https://github.com/wikimedia/revscoring/blob/773f9cd8029de7ef5c7713addd2f6661bce94b4e/revscoring/languages/english.py#L17 | English here ]], and browse through the directory for [[ https://github.com/wikimedia/revscoring/tree/773f9cd8029de7ef5c7713addd2f6661bce94b4e/revscoring/languages | other languages ]].
[3] If #2 returns false, the token is a non_dictionary word.
[4] is_non_safe_word(//word_token//) -> True/False
is_informal_word(//word_token//) -> True/False
You will need to copy the lists of bad_words and informals for each language. Refer to the [[ https://github.com/wikimedia/revscoring/blob/773f9cd8029de7ef5c7713addd2f6661bce94b4e/revscoring/languages/english.py#L142 | non_safe words ]] and [[ https://github.com/wikimedia/revscoring/blob/773f9cd8029de7ef5c7713addd2f6661bce94b4e/revscoring/languages/english.py#L206 | informal ]] lists for English at these links. Similarly, browse through the directory for [[ https://github.com/wikimedia/revscoring/tree/773f9cd8029de7ef5c7713addd2f6661bce94b4e/revscoring/languages | other languages ]].
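
The four checks above could be sketched in Go roughly as follows. This is only an illustration under assumptions, not the agreed design: the `Dictionary` interface, the `Classifier` type, and the in-memory `setDictionary` are hypothetical names, the word lists are placeholders rather than the revscoring lists, and a real implementation would back `Dictionary` with whichever spell-check library is chosen for Go. The sketch assumes the copied bad_words and informals are regular expressions that Go's `regexp` package (RE2 syntax, no lookarounds or backreferences) can compile; any pattern that relies on unsupported features would need adapting.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Dictionary is a hypothetical abstraction over a per-language dictionary
// check (for example, a wrapper around a spell-check library).
type Dictionary interface {
	Contains(word string) bool
}

// setDictionary is an in-memory stand-in used only for this sketch.
type setDictionary map[string]struct{}

func (d setDictionary) Contains(word string) bool {
	_, ok := d[strings.ToLower(word)]
	return ok
}

// Classifier holds one dictionary per language plus the compiled
// non_safe and informal patterns (hypothetical layout).
type Classifier struct {
	dictionaries []Dictionary
	nonSafe      []*regexp.Regexp
	informal     []*regexp.Regexp
}

// compileWordPatterns anchors each pattern so it must match the whole
// token, case-insensitively.
func compileWordPatterns(patterns []string) []*regexp.Regexp {
	out := make([]*regexp.Regexp, 0, len(patterns))
	for _, p := range patterns {
		out = append(out, regexp.MustCompile(`(?i)^(?:`+p+`)$`))
	}
	return out
}

func matchesAny(res []*regexp.Regexp, token string) bool {
	for _, re := range res {
		if re.MatchString(token) {
			return true
		}
	}
	return false
}

// IsDictionaryWord reports whether the token is in the dictionary of any language.
func (c *Classifier) IsDictionaryWord(token string) bool {
	for _, d := range c.dictionaries {
		if d.Contains(token) {
			return true
		}
	}
	return false
}

// IsNonDictionaryWord is the negation of IsDictionaryWord (step [3]).
func (c *Classifier) IsNonDictionaryWord(token string) bool {
	return !c.IsDictionaryWord(token)
}

func (c *Classifier) IsNonSafeWord(token string) bool {
	return matchesAny(c.nonSafe, token)
}

func (c *Classifier) IsInformalWord(token string) bool {
	return matchesAny(c.informal, token)
}

// newExampleClassifier wires up placeholder data; these are NOT the revscoring lists.
func newExampleClassifier() *Classifier {
	english := setDictionary{"article": {}, "the": {}, "word": {}}
	return &Classifier{
		dictionaries: []Dictionary{english},
		nonSafe:      compileWordPatterns([]string{`stupid`, `idiot+s?`}),
		informal:     compileWordPatterns([]string{`lol+`, `ha(ha)+`}),
	}
}

func main() {
	c := newExampleClassifier()
	for _, tok := range []string{"Article", "qwzx", "stupid", "lol"} {
		fmt.Printf("%-8s dictionary=%t non_safe=%t informal=%t\n",
			tok, c.IsDictionaryWord(tok), c.IsNonSafeWord(tok), c.IsInformalWord(tok))
	}
}
```

The predicates are kept independent in this sketch, so a single token can fall into more than one class (for example, a non_safe word that is also a dictionary word); whether the classes should instead be mutually exclusive is left open here.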