We need to create a utility method that returns a set of tokens of interest from wikitext. These tokens will be the data sources for computing the values of credibility signals.
This is the complete [[ https://github.com/halfak/deltas/blob/03f64e9c26c6480e458e2defd515a31d846ff667/deltas/tokenizers/lexicon.py#L127 | set of tokens ]]. However, at this point our credibility signals all relate to the [[ https://github.com/halfak/deltas/blob/03f64e9c26c6480e458e2defd515a31d846ff667/deltas/tokenizers/lexicon.py#L143 | word token ]], so this utility will only return word tokens for now.
Details:
**[1]** The [[ https://github.com/halfak/deltas/blob/03f64e9c26c6480e458e2defd515a31d846ff667/deltas/tokenizers/lexicon.py#L80 | definition ]] of a word token allows any number of English/Arabic/Devanagari/Bengali characters, along with apostrophes and some symbols and punctuation.
**[2]** Chinese, Korean, and Japanese characters are not included in word tokens in the current version of revscoring (please refer to [[ https://github.com/halfak/deltas/blob/03f64e9c26c6480e458e2defd515a31d846ff667/deltas/tokenizers/lexicon.py#L63 | this ]]). We will do the same for now and not include //CJK// words in word tokens. We will revisit and enhance this later.
**[3]** The following are valid word tokens. We extract word tokens by stripping preceding and trailing punctuation/symbols. Here are some examples:
| Raw text | Returned word token |
| -------- | ------------------- |
| 2nd | 2nd |
| m80. | m80 |
| we'll | we'll |
| follows: | follows |
| REST | REST |
| total- | total |
| "Oh" | Oh |
Please refer to more examples [[ https://github.com/wikimedia/revscoring/blob/275302c7b103513b51cf63b89e81ea051fba4786/tests/features/wikitext/tests/test_tokenized.py#L27 | here ]].
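For illustration, here is a minimal sketch of how the stripping behavior falls out of regex matching. The pattern below is a deliberately simplified stand-in for the lexicon regex linked above, not the real definition:

```lang=python
import re

# Simplified stand-in for the deltas "word" pattern: runs of word
# characters, optionally joined by apostrophes.  Preceding/trailing
# punctuation and symbols are dropped because they are never part
# of the match.
WORD_RE = re.compile(r"\w+(?:'\w+)*")

for raw in ["2nd", "m80.", "we'll", "follows:", "REST", "total-", '"Oh"']:
    print(raw, "->", WORD_RE.search(raw).group(0))
# e.g. m80. -> m80, we'll -> we'll, follows: -> follows, "Oh" -> Oh
```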
**[4]** This utility will live under **structured-data/packages**.
**[5]** The utility will receive wikitext as an argument and return a dictionary of //token type//: //list of values//, as follows. This will allow us to extend the returned dictionary with other token types as needed in the future.
```lang=python
tokenize(wikitext) -> {"word": ["That's", "that", "n95", "3rd", "follows", "Latin"]}
```
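A minimal sketch of what this could look like, assuming we reuse the `deltas` tokenizer that revscoring itself builds on (`wikitext_split` and the token `.type` attribute come from deltas; the `TYPES_OF_INTEREST` name is our own):

```lang=python
from deltas.tokenizers import wikitext_split

# Token types we currently extract; extend this set later to expose
# more of the lexicon's types (numbers, urls, ...).
TYPES_OF_INTEREST = {"word"}

def tokenize(wikitext):
    """Map each token type of interest to its list of tokens."""
    tokens = {token_type: [] for token_type in TYPES_OF_INTEREST}
    for token in wikitext_split.tokenize(wikitext):
        if token.type in TYPES_OF_INTEREST:
            tokens[token.type].append(str(token))
    return tokens
```

With this approach the stripping described in **[3]** comes for free: `follows:` tokenizes as the word token `follows` followed by a separate punctuation token, which is simply not collected.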