
Improve non-translatable content blacklisting mechanism in cxserver
Open, Normal, Public

Description

As a follow-up to T190254: Remove irrelevant sections from source article for translation, we need to improve how cxserver removes non-translatable meta content. Currently, a YAML file is used to blacklist templates, classes, and RDFa identifiers. Support for regular expressions and case-insensitive matching was added later.

@Nikerabbit observed:

I suspect we also need to normalize spaces vs. underscores (or better yet use the canonicalized name from the href?) Bunch of such examples in https://quarry.wmflabs.org/query/27460

Also, the current YAML configuration could grow very large if it is extended to cover all language pairs. We need to find a smarter way to do this.
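
To make the normalization idea in the quoted comment concrete, here is a minimal sketch in Python. It is not cxserver's actual code. The function names, the sample blacklist entries, and the assumption that template hrefs look like `./Template:Citation_needed` are illustrative only. It treats underscores and spaces as equivalent, collapses repeated whitespace, and compares names case-insensitively.

```python
import re
from urllib.parse import unquote


def normalize_template_name(name: str) -> str:
    """Return a comparison key for a template name (hypothetical helper)."""
    # Decode percent-escapes, then treat underscores as spaces.
    name = unquote(name).replace("_", " ")
    # Collapse runs of whitespace and trim the ends.
    name = re.sub(r"\s+", " ", name).strip()
    # Case-insensitive comparison; this is looser than MediaWiki's own
    # title rules, which only ignore the case of the first letter.
    return name.casefold()


def name_from_href(href: str) -> str:
    """Derive the comparison key from an href (assumed './Namespace:Title' shape)."""
    title = href.rsplit("/", 1)[-1]
    # Drop the (possibly localized) namespace prefix before the first colon.
    _, _, rest = title.partition(":")
    return normalize_template_name(rest or title)


def is_blacklisted(name: str, blacklist: list[str]) -> bool:
    """Check a template name against blacklist entries after normalization."""
    keys = {normalize_template_name(entry) for entry in blacklist}
    return normalize_template_name(name) in keys


if __name__ == "__main__":
    blacklist = ["citation needed", "Infobox_person"]  # illustrative entries
    print(is_blacklisted("Citation_needed", blacklist))
    print(name_from_href("./Template:Infobox_person"))
```

Normalizing both the blacklist entries and the incoming names with the same function means that a single entry can cover the space, underscore, and capitalization variants, which may also help keep the configuration smaller.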

Event Timeline

Restricted Application added a subscriber: Aklapper. · Jun 21 2018, 11:57 AM
Vvjjkkii renamed this task from Improve non-translatable content blacklisting mechanism in cxserver to 8iaaaaaaaa. · Jul 1 2018, 1:02 AM
Vvjjkkii triaged this task as High priority.
Vvjjkkii updated the task description.
Vvjjkkii removed a subscriber: Aklapper.
CommunityTechBot renamed this task from 8iaaaaaaaa to Improve non-translatable content blacklisting mechanism in cxserver. · Jul 2 2018, 7:50 AM
CommunityTechBot raised the priority of this task from High to Needs Triage.
CommunityTechBot updated the task description.
CommunityTechBot added a subscriber: Aklapper.
Pginer-WMF updated the task description. · Jul 20 2018, 8:25 AM
Pginer-WMF triaged this task as Normal priority. · Jul 20 2018, 8:47 AM
Pginer-WMF moved this task from Backlog to Other on the CX-cxserver board.