We need to create a utility method that returns a set of tokens of interest from wikitext. These tokens will be the data sources for computing credibility signals' values.
The implementation is essentially regex matching with some text formatting.
Input (string): "That's that :: :) > n95 Latin. follows 3rd:"
Output (dictionary): {"wordTokens": ["That's", "that", "n95", "3rd", "follows", "Latin"]}
This is the complete set of tokens. However, at this point all of our credibility signals are based on word tokens, so this utility will only return word tokens for now.
Implementation details:
[1] A word token is any run of English/Arabic/Devanagari/Bengali characters, possibly including apostrophes and certain symbols and punctuation.
[2] Chinese, Japanese, and Korean characters are not included in word tokens in the current revscoring implementation (please refer to this). We will do the same for now and not include CJK words in word tokens; we will discuss and enhance this later.
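To make the CJK exclusion concrete, here is a minimal, hypothetical check (not from revscoring itself) that flags text containing CJK ideographs, Hiragana, Katakana, or Hangul, so those chunks can be skipped for now:

```python
import re

# Hypothetical sketch: common Unicode ranges for CJK ideographs
# (U+4E00-U+9FFF), Hiragana/Katakana (U+3040-U+30FF), and Hangul
# syllables (U+AC00-U+D7AF). This is not exhaustive.
CJK_RE = re.compile(r"[\u4E00-\u9FFF\u3040-\u30FF\uAC00-\uD7AF]")

def contains_cjk(text):
    """Return True if the text contains any CJK character."""
    return bool(CJK_RE.search(text))
```

Tokens matching this check would simply be omitted from `wordTokens` until we decide how to handle CJK properly.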
[3] The following are valid word tokens. We extract word tokens by stripping off leading and trailing punctuation/symbols. Here are some examples:
| Raw text | Returned word token |
|----------|---------------------|
| 2nd      | 2nd                 |
| m80.     | m80                 |
| we'll    | we'll               |
| follows: | follows             |
| REST     | REST                |
| total-   | total               |
| "Oh"     | Oh                  |
Please refer to some more examples here.
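The edge-stripping step above can be sketched as follows. This is a hypothetical helper (the name `strip_edges` is ours, not part of any existing package) that removes leading/trailing ASCII punctuation and quote characters from a whitespace-separated chunk while leaving internal apostrophes intact:

```python
import string

def strip_edges(raw):
    """Strip leading/trailing punctuation/symbols from a raw chunk.

    Internal characters (e.g. the apostrophe in "we'll") survive,
    since str.strip only removes from the edges.
    """
    return raw.strip(string.punctuation + "\u201c\u201d\u2018\u2019")
```

Note that a chunk made entirely of punctuation (e.g. `::`) strips down to the empty string, so the caller needs to filter out empties.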
[4] This utility will live under structured-data/packages
[5] The utility will receive wikitext as an argument and return a dictionary of {token type: list of values}, as follows. This will allow us to extend the token dictionary with other token types as needed in the future.
tokenize(wikitext) -> {"wordTokens" : [ "That's", "that", "n95", "3rd", "follows", "Latin"]}
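Putting the pieces together, a minimal sketch of `tokenize` might look like the following. The regex and character ranges are assumptions, not the final spec: the character class covers Latin letters, digits, and the Arabic, Devanagari, and Bengali blocks, allows apostrophes inside a token, and excludes CJK. Note the sample output above reads like an unordered set; this sketch returns tokens in order of appearance.

```python
import re

# Assumed character ranges: Latin letters/digits plus the Arabic
# (U+0600-U+06FF), Devanagari (U+0900-U+097F), and Bengali
# (U+0980-U+09FF) blocks. CJK is deliberately not matched.
WORD_CHAR = r"A-Za-z0-9\u0600-\u06FF\u0900-\u097F\u0980-\u09FF"

# A token is one or more word characters, optionally joined by
# apostrophes (straight or curly), e.g. "we'll", "That's".
WORD_RE = re.compile(rf"[{WORD_CHAR}]+(?:['\u2019][{WORD_CHAR}]+)*")

def tokenize(wikitext):
    """Return a dictionary of {token type: list of values}.

    Only "wordTokens" is populated for now; other token types can
    be added as further credibility signals need them.
    """
    return {"wordTokens": WORD_RE.findall(wikitext)}
```

Because the regex only matches word characters, surrounding punctuation (`m80.`, `follows:`, `"Oh"`) is never captured, so no separate stripping pass is needed in this variant.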