Segment by tags

Not On Permanent Ref: This commit is not an ancestor of any permanent ref.
This commit no longer exists in the repository: it is no longer reachable from any branch, tag, or ref. It may have been part of a branch which was deleted.


Cleaned text is now split into segments at certain tags. These are
specified by the config variable WikispeechSegmentBreakingTags. By
default, these tags are h*, p, br and li. Removed ol and ul from
WikispeechRemoveTags, since lists are now recited reasonably well.

During cleaning, SegmentBreak objects are added where the specified
tags are encountered (the tags themselves are still removed). During
segmenting, when a SegmentBreak is encountered, a new segment is
started.
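
The two steps described above can be sketched as follows. This is only
an illustration in Python, not the extension's actual code: the names
BREAKING_TAGS, Cleaner, clean and segment are hypothetical, the pattern
standing in for WikispeechSegmentBreakingTags is an assumed reading of
"h*, p, br and li", and sentence-level segmenting is left out so that
only the effect of SegmentBreak is shown.

```python
import re
from dataclasses import dataclass
from html.parser import HTMLParser

# Hypothetical stand-in for WikispeechSegmentBreakingTags; "h*" is
# assumed here to mean h1 through h6.
BREAKING_TAGS = re.compile(r"^(h[1-6]|p|br|li)$")


@dataclass
class CleanedText:
    string: str


@dataclass
class SegmentBreak:
    pass


class Cleaner(HTMLParser):
    """Drops all tags, but records a SegmentBreak at breaking tags."""

    def __init__(self):
        super().__init__()
        self.items = []

    def _add_break(self, tag):
        # Skip leading and consecutive breaks; they would only
        # produce empty segments.
        if (
            BREAKING_TAGS.match(tag)
            and self.items
            and not isinstance(self.items[-1], SegmentBreak)
        ):
            self.items.append(SegmentBreak())

    def handle_starttag(self, tag, attrs):
        self._add_break(tag)

    def handle_endtag(self, tag):
        self._add_break(tag)

    def handle_data(self, data):
        if data.strip():
            self.items.append(CleanedText(data))


def clean(html):
    """Return a list of items (CleanedText and SegmentBreak)."""
    cleaner = Cleaner()
    cleaner.feed(html)
    cleaner.close()
    return cleaner.items


def segment(items):
    """Start a new segment whenever a SegmentBreak is encountered."""
    segments, current = [], []
    for item in items:
        if isinstance(item, SegmentBreak):
            if current:
                segments.append(current)
                current = []
        else:
            current.append(item)
    if current:
        segments.append(current)
    return segments
```

With this sketch, the input
"<h2>Title</h2><p>First <b>bold</b> text.</p><ul><li>One</li><li>Two</li></ul>"
yields four segments: the heading, the paragraph (the b tag is removed
without causing a break) and one segment per list item.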

Renamed the "things" in content (CleanedText and SegmentBreak); they
are now called "items".

Change-Id: I688f20f6e4a662efb4a74eb2e3e94996b231445f
Bug: T149091


Sebastian_Berlin-WMSE authored on May 3 2017, 12:50 PM
