Text normalization as part of regex extractor
Open, LowPublic

Description

For example, "ch" and "sh" keep appearing in reverted revisions on tr.wikipedia. They are typed as stand-ins for the characters ç and ş, which is never done in normal Turkish writing.

I have been thinking about this problem for some time. I think the best approach is to check whether the words written with ch and sh are English, and to offer this as a feature for tr.wikipedia only.

Note that ch and sh legitimately appear on tr.wikipedia when English words are quoted, which is why a simple regex would not be optimal.
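
A minimal sketch of that idea, not an existing revscoring feature: find words containing ch or sh and keep only those that are not known English words. The names `ENGLISH_WORDS` and `suspicious_ch_sh_words` are hypothetical, and the tiny word set stands in for a real English dictionary lookup.

```python
import re

# Hypothetical stand-in for a real English dictionary; a real feature
# would need a much larger word list.
ENGLISH_WORDS = {"chat", "church", "shell", "english", "show"}

# Any whole word that contains "ch" or "sh", in either case.
CH_SH_WORD = re.compile(r"\b\w*(?:ch|sh)\w*\b", re.IGNORECASE)


def suspicious_ch_sh_words(text, english_words=ENGLISH_WORDS):
    """Return words containing ch/sh that are not known English words."""
    return [
        word
        for word in CH_SH_WORD.findall(text)
        if word.lower() not in english_words
    ]
```

Called on "chok shey show", this returns the first two words and skips "show", because only "show" is in the word set. The length of the returned list could then serve as a numeric feature for tr.wikipedia.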


Another example is the use of the Cyrillic "е", which is a different character from the Latin letter "e" but looks nearly identical.

>>> "e" == "е"
False
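
One way such look-alike characters could be detected, offered as a sketch and not as part of the regex extractor: read the script from each letter's Unicode name and flag words that mix scripts. The helper names `scripts_in` and `is_mixed_script` are hypothetical, and taking the first word of the Unicode name as the script is a simplifying assumption.

```python
import unicodedata


def scripts_in(word):
    """Return the set of scripts used by the letters in word.

    Assumption: the first word of a letter's Unicode name is its script,
    e.g. "LATIN SMALL LETTER E" or "CYRILLIC SMALL LETTER IE".
    """
    scripts = set()
    for char in word:
        if char.isalpha():
            name = unicodedata.name(char, "UNKNOWN")
            scripts.add(name.split(" ")[0])
    return scripts


def is_mixed_script(word):
    """True if the letters of word come from more than one script."""
    return len(scripts_in(word)) > 1
```

For instance, `is_mixed_script("h\u0435llo")` is true because `\u0435` is the Cyrillic "е", while the all-Latin "hello" is not flagged.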

Event Timeline

ToAruShiroiNeko claimed this task.
ToAruShiroiNeko raised the priority of this task from to Medium.
ToAruShiroiNeko updated the task description.
ToAruShiroiNeko added a subscriber: ToAruShiroiNeko.
Restricted Application added a subscriber: Aklapper. · Oct 15 2015, 6:29 AM

Hi @ToAruShiroiNeko. Please associate at least one project with this task, otherwise nobody can find this task when searching in the corresponding project(s). Thanks.

Halfak renamed this task from Sh an Ch use on Tr wikipedia to Text normalization as part of regex extractor. · Apr 5 2016, 3:23 PM
Halfak updated the task description.
Halfak lowered the priority of this task from Medium to Low. · Aug 18 2016, 2:48 PM
Halfak added a project: revscoring.
Halfak set Security to None.
Restricted Application added a project: artificial-intelligence. · Apr 3 2019, 4:45 AM
Aklapper removed ToAruShiroiNeko as the assignee of this task. · Jun 19 2020, 4:29 PM

This task has been assigned to the same task owner for more than two years. Resetting task assignee due to inactivity, to decrease task cookie-licking and to get a slightly more realistic overview of plans. Please feel free to assign this task to yourself again if you still realistically work or plan to work on this task - it would be welcome!

For tips on how to manage individual work in Phabricator (noisy notifications, lists of tasks, etc.), see https://phabricator.wikimedia.org/T228575#6237124 for available options.
(For the record, two emails were sent to assignee addresses before resetting assignees. See T228575 for more info and for potential feedback. Thanks!)