We need to create a utility method that returns a set of tokens of interest from wikitext. These tokens will be the data sources for computing credibility signals' values.
The implementation is essentially regex matching with some text formatting.
Input (string): "That's that :: :) > n95 Latin. follows 3rd:"
Output (dictionary): {"wordTokens": ["That's", "that", "n95", "3rd", "follows", "Latin"]}
This is the complete set of tokens. However, at this point all of our credibility signals are based on word tokens, so this utility will only return word tokens for now.
Implementation details:
[1] A word token is any run of English/Arabic/Devanagari/Bengali characters, possibly including apostrophes and certain symbols and punctuation.
[2] Chinese, Japanese, and Korean characters are not included in word tokens in the current revscoring implementation (please refer to this). We will do the same for now and not include CJK words in word tokens; we will discuss and enhance this later.
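To make the CJK exclusion concrete, here is a minimal, hypothetical check (not from revscoring itself) that flags text containing CJK ideographs, Hiragana, Katakana, or Hangul, so those chunks can be skipped for now:

```python
import re

# Hypothetical sketch: common Unicode ranges for CJK ideographs
# (U+4E00-U+9FFF), Hiragana/Katakana (U+3040-U+30FF), and Hangul
# syllables (U+AC00-U+D7AF). This is not exhaustive.
CJK_RE = re.compile(r"[\u4E00-\u9FFF\u3040-\u30FF\uAC00-\uD7AF]")

def contains_cjk(text):
    """Return True if the text contains any CJK character."""
    return bool(CJK_RE.search(text))
```

Tokens matching this check would simply be omitted from `wordTokens` until we decide how to handle CJK properly.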
[3] The following are valid word tokens. We extract word tokens by stripping off leading and trailing punctuation/symbols. Here are some examples:
| Raw text | Returned word token |
|----------|---------------------|
| 2nd      | 2nd                 |
| m80.     | m80                 |
| we'll    | we'll               |
| follows: | follows             |
| REST     | REST                |
| total-   | total               |
| "Oh"     | Oh                  |
Please refer to some more examples here.
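The edge-stripping step above can be sketched as follows. This is a hypothetical helper (the name `strip_edges` is ours, not part of any existing package) that removes leading/trailing ASCII punctuation and quote characters from a whitespace-separated chunk while leaving internal apostrophes intact:

```python
import string

def strip_edges(raw):
    """Strip leading/trailing punctuation/symbols from a raw chunk.

    Internal characters (e.g. the apostrophe in "we'll") survive,
    since str.strip only removes from the edges.
    """
    return raw.strip(string.punctuation + "\u201c\u201d\u2018\u2019")
```

Note that a chunk made entirely of punctuation (e.g. `::`) strips down to the empty string, so the caller needs to filter out empties.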
[4] This utility will live under structured-data/packages
[5] The utility will receive wikitext as an argument and return a dictionary of {token type: list of values}, as follows. This will allow us to extend the token dictionary with other token types as needed in the future.
tokenize(wikitext) -> {"wordTokens" : [ "That's", "that", "n95", "3rd", "follows", "Latin"]}
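Putting the pieces together, a minimal sketch of `tokenize` might look like the following. The regex and character ranges are assumptions, not the final spec: the character class covers Latin letters, digits, and the Arabic, Devanagari, and Bengali blocks, allows apostrophes inside a token, and excludes CJK. Note the sample output above reads like an unordered set; this sketch returns tokens in order of appearance.

```python
import re

# Assumed character ranges: Latin letters/digits plus the Arabic
# (U+0600-U+06FF), Devanagari (U+0900-U+097F), and Bengali
# (U+0980-U+09FF) blocks. CJK is deliberately not matched.
WORD_CHAR = r"A-Za-z0-9\u0600-\u06FF\u0900-\u097F\u0980-\u09FF"

# A token is one or more word characters, optionally joined by
# apostrophes (straight or curly), e.g. "we'll", "That's".
WORD_RE = re.compile(rf"[{WORD_CHAR}]+(?:['\u2019][{WORD_CHAR}]+)*")

def tokenize(wikitext):
    """Return a dictionary of {token type: list of values}.

    Only "wordTokens" is populated for now; other token types can
    be added as further credibility signals need them.
    """
    return {"wordTokens": WORD_RE.findall(wikitext)}
```

Because the regex only matches word characters, surrounding punctuation (`m80.`, `follows:`, `"Oh"`) is never captured, so no separate stripping pass is needed in this variant.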