Use revision id for segmenting
Closed, Resolved, Public

Description

It should be possible to provide a revision_id to the Segmenter so that it returns the segmented version corresponding to that revision of the page.

The minimum implementation is that the revision_id is used to look for a segmented version of that revision in the cache (i.e. one that was stored while that revision was the latest). If none is found, it can simply return an error saying the revision is too old.

A more fully fledged implementation later would allow segmentation of historical revisions.

In either case, the Segmenter must check whether the requested revision is still accessible (it may have been hidden through RevDel).

Event Timeline

From the meeting notes (does not handle RevDel case). Page title is left as a required parameter to allow for the easiest possible entry point (defaulting to current revision).

public function segmentPage(
    $title,
    $removeTags,
    $segmentBreakingTags,
    $revisionId = null
) {
    $cache = MediaWikiServices::getInstance()->getMainWANObjectCache();
    $page = WikiPage::factory( $title );
    // Default to the current revision when none is requested.
    if ( $revisionId === null ) {
        $revisionId = $page->getLatest();
    }
    $cacheKey = $cache->makeKey( 'Wikispeech.segments', $revisionId );
    $segments = $cache->get( $cacheKey );
    // WANObjectCache::get() returns false on a miss; a strict check keeps
    // a cached empty result from being treated as missing.
    if ( $segments === false ) {
        // Only the latest revision can be segmented on demand.
        // Loose comparison, since the id may arrive as a string.
        if ( $revisionId != $page->getLatest() ) {
            throw new MWException( 'An outdated or invalid revision id was provided' );
        }
        $cleanedText =
            $this->cleanPage( $page, $removeTags, $segmentBreakingTags );
        $segments = $this->segmentSentences( $cleanedText );
        // Keep the segments for one hour.
        $cache->set( $cacheKey, $segments, 3600 );
    }
    return $segments;
}
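
The snippet above leaves out the RevDel case. As a rough, self-contained sketch of where such a check could sit, the following uses plain PHP with no MediaWiki dependencies. All names in it (ArraySegmentCache, lookupSegments, the $isAccessible and $segmentLatest callbacks) are hypothetical stand-ins and not part of Wikispeech or MediaWiki; how accessibility is actually determined is deliberately left to the callback.

```php
<?php
// Illustrative sketch only: none of these names come from the Wikispeech code base.

/** Hypothetical in-memory stand-in for the segment cache. */
class ArraySegmentCache {
    private $items = [];

    public function get( string $key ) {
        // Return false on a miss so an empty segment list stays distinguishable.
        return array_key_exists( $key, $this->items ) ? $this->items[$key] : false;
    }

    public function set( string $key, array $segments ): void {
        $this->items[$key] = $segments;
    }
}

/**
 * Look up segments for a revision, checking accessibility before the cache.
 *
 * @param ArraySegmentCache $cache
 * @param int $latestRevisionId Id of the page's current revision
 * @param int|null $revisionId Requested revision; null means the latest one
 * @param callable $isAccessible Hypothetical check: fn( int $revId ): bool
 * @param callable $segmentLatest Hypothetical: fn(): array, segments the latest revision
 * @return array
 */
function lookupSegments(
    ArraySegmentCache $cache,
    int $latestRevisionId,
    ?int $revisionId,
    callable $isAccessible,
    callable $segmentLatest
): array {
    $revisionId = $revisionId ?? $latestRevisionId;

    // Check first, so a cached copy of a since-hidden revision is never served.
    if ( !$isAccessible( $revisionId ) ) {
        throw new RuntimeException( 'The requested revision is not accessible' );
    }

    $key = "segments:$revisionId";
    $segments = $cache->get( $key );
    if ( $segments === false ) {
        // Older revisions are only served from the cache in this minimal version.
        if ( $revisionId !== $latestRevisionId ) {
            throw new RuntimeException( 'An outdated or invalid revision id was provided' );
        }
        $segments = $segmentLatest();
        $cache->set( $key, $segments );
    }
    return $segments;
}
```

Placing the accessibility check before the cache lookup is a design choice in this sketch: a revision that was segmented while public and hidden afterwards would otherwise still be returned from the cache until its entry expired.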

Change 622306 had a related patch set uploaded (by Lokal Profil; owner: Lokal Profil):
[mediawiki/extensions/Wikispeech@master] [WIP] Allow old revisions to be looked up in Segmenter cache

https://gerrit.wikimedia.org/r/622306

Change 622306 merged by Sebastian Berlin (WMSE):
[mediawiki/extensions/Wikispeech@master] Allow old revisions to be looked up in Segmenter cache

https://gerrit.wikimedia.org/r/622306

Lokal_Profil removed the point value for this task. Sep 3 2020, 8:10 AM