We need to create a utility method that returns a set of tokens of interest from wikitext. These tokens will be the data sources for computing the values of credibility signals.
This is the complete [[ https://github.com/halfak/deltas/blob/03f64e9c26c6480e458e2defd515a31d846ff667/deltas/tokenizers/lexicon.py#L127 | set of tokens ]]. However, at this point our credibility signals all relate to the [[ https://github.com/halfak/deltas/blob/03f64e9c26c6480e458e2defd515a31d846ff667/deltas/tokenizers/lexicon.py#L143 | word token ]], so this utility will only return word tokens for now.
Details:
**[1]** The [[ https://github.com/halfak/deltas/blob/03f64e9c26c6480e458e2defd515a31d846ff667/deltas/tokenizers/lexicon.py#L80 | definition ]] of a word token allows any number of English/Arabic/Devanagari/Bengali characters, along with apostrophes and some symbols and punctuation.
**[2]** Chinese, Korean, and Japanese characters are not included in word tokens in the current version of revscoring (please refer to [[ https://github.com/halfak/deltas/blob/03f64e9c26c6480e458e2defd515a31d846ff667/deltas/tokenizers/lexicon.py#L63 | this ]]). We will do the same for now and not include //CJK// words in word tokens. We will revisit and enhance this later.
**[3]** The following are valid word tokens. We extract word tokens by stripping preceding and trailing punctuation/symbols. Here are some examples:
| Raw text | Returned word token |
| -------- | ------------------- |
| 2nd | 2nd |
| m80. | m80 |
| we'll | we'll |
| follows: | follows |
| REST | REST |
| total- | total |
| "Oh" | Oh |
Please refer to more examples [[ https://github.com/wikimedia/revscoring/blob/275302c7b103513b51cf63b89e81ea051fba4786/tests/features/wikitext/tests/test_tokenized.py#L27 | here ]].
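For illustration, here is a minimal sketch of how the stripping behavior falls out of regex matching. The pattern below is a deliberately simplified stand-in for the lexicon regex linked above, not the real definition:

```lang=python
import re

# Simplified stand-in for the deltas "word" pattern: runs of word
# characters, optionally joined by apostrophes.  Preceding/trailing
# punctuation and symbols are dropped because they are never part
# of the match.
WORD_RE = re.compile(r"\w+(?:'\w+)*")

for raw in ["2nd", "m80.", "we'll", "follows:", "REST", "total-", '"Oh"']:
    print(raw, "->", WORD_RE.search(raw).group(0))
# e.g. m80. -> m80, we'll -> we'll, follows: -> follows, "Oh" -> Oh
```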
**[4]** This utility will live under **structured-data/packages**.
**[5]** The utility will receive wikitext as an argument and return a dictionary of //token type//: //list of values//, as follows. This will allow us to extend the returned dictionary with other token types as needed in the future.
```lang=python
tokenize(wikitext) -> {"word": ["That's", "that", "n95", "3rd", "follows", "Latin"]}
```
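A minimal sketch of what this could look like, assuming we reuse the `deltas` tokenizer that revscoring itself builds on (`wikitext_split` and the token `.type` attribute come from deltas; the `TYPES_OF_INTEREST` name is our own):

```lang=python
from deltas.tokenizers import wikitext_split

# Token types we currently extract; extend this set later to expose
# more of the lexicon's types (numbers, urls, ...).
TYPES_OF_INTEREST = {"word"}

def tokenize(wikitext):
    """Map each token type of interest to its list of tokens."""
    tokens = {token_type: [] for token_type in TYPES_OF_INTEREST}
    for token in wikitext_split.tokenize(wikitext):
        if token.type in TYPES_OF_INTEREST:
            tokens[token.type].append(str(token))
    return tokens
```

With this approach the stripping described in **[3]** comes for free: `follows:` tokenizes as the word token `follows` followed by a separate punctuation token, which is simply not collected.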