Create A Utility To Get Word Tokens From Wikitext
Closed, Resolved | Public | 8 Estimated Story Points

Description

We need a utility method that returns a set of tokens of interest from wikitext. These tokens will be the data sources for computing the values of the credibility signals.

The implementation is essentially regex matching plus some text formatting.
Input (string): "That's that :: :) > n95 Latin. follows 3rd:"
Output (dictionary): {"wordTokens" : [ "That's", "that", "n95", "3rd", "follows", "Latin"]}

This is the complete set of tokens. However, all of our current credibility signals relate to word tokens, so this utility will return only word tokens for now.

Implementation details:
[1] The definition of a word token includes any number of English/Arabic/Devanagari/Bengali characters, along with apostrophes, some symbols, and punctuation.
[2] Chinese, Korean, and Japanese characters are not included in word tokens in the current revscoring (please refer to this). We will do the same for now and not include CJK words in word tokens. We will discuss this later and enhance it.
[3] The following are valid word tokens. We extract word tokens by stripping off leading and trailing punctuation/symbols. Here are some examples:

| Raw text | Returned word token |
| -------- | ------------------- |
| 2nd      | 2nd                 |
| m80.     | m80                 |
| we'll    | we'll               |
| follows: | follows             |
| REST     | REST                |
| total-   | total               |
| "Oh"     | Oh                  |

Please refer to some more examples here
[4] This utility will live under structured-data/packages
[5] We will receive wikitext as an argument and return a dictionary of {token: list of values} as follows. This will allow us to extend the token dictionary with other token types as needed in the future.

tokenize(wikitext) -> {"wordTokens" : [ "That's", "that", "n95", "3rd", "follows", "Latin"]}
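
A minimal Python sketch of the behaviour described in [1]-[5], for illustration only. The character ranges, the `_WORD_CHARS` and `_WORD_RE` names, and the choice to allow only apostrophes inside a token are assumptions, not the agreed implementation; the real utility may accept more symbols. The ranges are rough Unicode block boundaries (Arabic, Devanagari, Bengali), so they also admit some script-specific punctuation that a production regex would likely exclude. Tokens come back in order of appearance, so the list contains the same tokens as the sample above but not necessarily in the same order.

```python
import re

# Assumed character set for word tokens: ASCII letters and digits plus the
# Arabic, Devanagari and Bengali Unicode blocks. CJK ranges are deliberately
# left out, per [2].
_WORD_CHARS = r"0-9A-Za-z\u0600-\u06FF\u0900-\u097F\u0980-\u09FF"

# A token is one or more word characters, optionally followed by groups of
# "apostrophe + word characters" (e.g. "we'll"). Because a match must start
# and end on a word character, leading and trailing punctuation is dropped.
_WORD_RE = re.compile(rf"[{_WORD_CHARS}]+(?:['’][{_WORD_CHARS}]+)*")


def tokenize(wikitext):
    """Return a dictionary of token type -> list of tokens found in wikitext."""
    return {"wordTokens": _WORD_RE.findall(wikitext)}


if __name__ == "__main__":
    print(tokenize("That's that :: :) > n95 Latin. follows 3rd:"))
    # {'wordTokens': ["That's", 'that', 'n95', 'Latin', 'follows', '3rd']}
```

Returning a dictionary keyed by token type keeps the extension path from [5] open: another token type would be one more key alongside "wordTokens".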

Event Timeline

prabhat updated the task description.
prabhat updated the task description.
prabhat updated the task description.
prabhat triaged this task as High priority. (Jan 18 2022, 5:02 PM)
prabhat updated the task description.
prabhat updated the task description.
Lena.Milenko changed the task status from Open to In Progress. (Feb 28 2022, 11:51 AM)
Lena.Milenko changed the task status from In Progress to Open. (Apr 13 2022, 9:28 PM)
Lena.Milenko changed the status of subtask T299584: Create A Utility For Informal Words Check from In Progress to Open.
Lena.Milenko changed the status of subtask T299582: Create Utility For Non-Safe Words Check from In Progress to Open.