Segment by tags

Description

Cleaned text is now split at certain tags. These are specified by
the config variable WikispeechSegmentBreakingTags. By default, these
tags are h*, p, br and li. Removed ol and ul from
WikispeechRemoveTags, since lists are now recited reasonably well.

During cleaning, SegmentBreak objects are added where the specified
tags are encountered (the tags themselves are still removed). During
segmenting, when a SegmentBreak is encountered, a new segment is
created.
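
The two passes described above can be illustrated with a minimal Python
sketch. This is not the extension's actual code: the names CleanedText
and SegmentBreak come from this commit message, but Cleaner, clean,
segment and SEGMENT_BREAKING_TAGS are hypothetical, and placing a break
at both the opening and the closing tag is an assumption.

```python
from html.parser import HTMLParser

# Hypothetical stand-in for WikispeechSegmentBreakingTags, with h* expanded.
SEGMENT_BREAKING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6", "p", "br", "li"}


class CleanedText:
    """A run of text left over after the tags have been removed."""

    def __init__(self, string):
        self.string = string


class SegmentBreak:
    """Marker item: a segment must end here."""


class Cleaner(HTMLParser):
    """Cleaning pass: drops every tag, keeps text, marks breaking tags."""

    def __init__(self):
        super().__init__()
        self.items = []

    def _add_break(self, tag):
        if tag in SEGMENT_BREAKING_TAGS:
            self.items.append(SegmentBreak())

    def handle_starttag(self, tag, attrs):
        # Assumption: a break is added at both the start and the end tag.
        self._add_break(tag)

    def handle_endtag(self, tag):
        self._add_break(tag)

    def handle_data(self, data):
        if data.strip():
            self.items.append(CleanedText(data))


def clean(html):
    """Return the list of items (CleanedText and SegmentBreak) for html."""
    cleaner = Cleaner()
    cleaner.feed(html)
    cleaner.close()  # flush any text still buffered by the parser
    return cleaner.items


def segment(items):
    """Segmenting pass: start a new segment at every SegmentBreak."""
    segments = [[]]
    for item in items:
        if isinstance(item, SegmentBreak):
            if segments[-1]:  # consecutive breaks do not create empty segments
                segments.append([])
        else:
            segments[-1].append(item.string)
    return ["".join(parts) for parts in segments if parts]


html = "<h2>Title</h2><ul><li>One</li><li>Two <b>bold</b></li></ul><p>End</p>"
print(segment(clean(html)))
# → ['Title', 'One', 'Two bold', 'End']
```

Note that ul and b produce no breaks in this sketch; only the li items
inside the list do, so each list item ends up as its own segment.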

The elements of content (CleanedText and SegmentBreak) are now
called "items".

Bug: T149091
Change-Id: I688f20f6e4a662efb4a74eb2e3e94996b231445f

Details

Provenance
Sebastian_Berlin-WMSE authored on May 3 2017, 12:50 PM
Lokal_Profil committed on Jul 6 2017, 10:50 AM
Parents
rEWIS6c69045ada24: Add variable for content wrapper element
ChangeId
I688f20f6e4a662efb4a74eb2e3e94996b231445f