Once we have identified a set of [[ https://phabricator.wikimedia.org/T299428 | word tokens ]] in the wikitext, we want to classify each token as either a dictionary word or a non-dictionary word. This will allow us to build the credibility signals related to dictionary and non-dictionary words.
{F34923602}
Implementation details:
[1] This utility will live under structured-data/packages.
[2] is_dictionary_word(//word_token//) -> True/False
If a word token exists in the dictionary for any language, we return true. In [[ https://github.com/wikimedia/revscoring/tree/773f9cd8029de7ef5c7713addd2f6661bce94b4e#macos | revscoring ]], the language-specific dictionary checks are implemented using enchant/aspell/myspell. You will need to find comparable dictionary-check libraries for Go. For reference, see the check for [[ https://github.com/wikimedia/revscoring/blob/773f9cd8029de7ef5c7713addd2f6661bce94b4e/revscoring/languages/english.py#L17 | English here ]], and browse the directory for the [[ https://github.com/wikimedia/revscoring/tree/773f9cd8029de7ef5c7713addd2f6661bce94b4e/revscoring/languages | other languages ]].
[3] If #2 returns false, the token is a non-dictionary word.
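
The classification described above could be sketched in Go roughly as follows. This is only an illustration under stated assumptions, not the final design. The `Dictionary` interface, `setDictionary`, `newSetDictionary` and `IsDictionaryWord` are hypothetical names. The in-memory word set stands in for whichever spell-check library is eventually chosen. Lowercasing before lookup and treating an empty token as a non-dictionary word are also assumptions that this task does not specify.

```go
package main

import (
	"fmt"
	"strings"
)

// Dictionary is a hypothetical abstraction over one language's dictionary.
// A real implementation would wrap whichever Go spell-check library is chosen.
type Dictionary interface {
	Contains(word string) bool
}

// setDictionary is an in-memory stand-in used only for illustration.
type setDictionary map[string]struct{}

// newSetDictionary builds a setDictionary from a word list.
// Words are lowercased on insert (an assumption, not a requirement of the task).
func newSetDictionary(words ...string) setDictionary {
	d := make(setDictionary, len(words))
	for _, w := range words {
		d[strings.ToLower(w)] = struct{}{}
	}
	return d
}

// Contains reports whether the lowercased word is in the set.
func (d setDictionary) Contains(word string) bool {
	_, ok := d[strings.ToLower(word)]
	return ok
}

// IsDictionaryWord returns true if the token is found in the dictionary
// of any language, and false otherwise (a non-dictionary word).
// An empty token is treated as a non-dictionary word (assumption).
func IsDictionaryWord(token string, dicts []Dictionary) bool {
	if token == "" {
		return false
	}
	for _, d := range dicts {
		if d.Contains(token) {
			return true
		}
	}
	return false
}

func main() {
	// Two tiny sample dictionaries, one per language.
	dicts := []Dictionary{
		newSetDictionary("house", "credible"),
		newSetDictionary("maison", "fiable"),
	}
	for _, tok := range []string{"House", "maison", "asdfgh"} {
		fmt.Printf("%q dictionary word: %t\n", tok, IsDictionaryWord(tok, dicts))
	}
}
```

Keeping the per-language check behind a small interface means the library choice can change later without touching callers, and the non-dictionary case from [3] is simply the negation of `IsDictionaryWord`.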