
Move transcription converter from Pronlex to a separate API
Closed, Resolved (Public)

Description

Actions:

  • Copy symbolset code to a separate repository
  • Write a new server for the new repository
  • Write setup scripts for the new repository
  • Create a new pronlex branch, and:
    • Change pronlex references to point to the new repository
    • Run tests
    • Remove the now-deprecated symbolset code from pronlex
    • Push new branch to master
  • Create a new pronlex branch (or re-use the old one), and:
    • Remove symbolset, mapper and converter from the lexserver API
  • Create a new wikispeech_mockup branch, and:
    • Adapt wikispeech_mockup to use the new server
  • Test the new setup using non-Docker builds (the Docker builds will change in future versions, so there is no point in testing them right now)
  • Merge new branches to master

Event Timeline

Definitions
SymbolSet - a symbol set definition for a given language - https://github.com/stts-se/pronlex/tree/master/symbolset
Mapper - map transcriptions between different symbol sets for a given language - https://github.com/stts-se/pronlex/tree/master/symbolset/mapper
Converter - map transcriptions between different languages - https://github.com/stts-se/pronlex/tree/master/symbolset/converter
Validation - component for validating transcriptions partly based on a pre-defined symbolset - https://github.com/stts-se/pronlex/tree/master/validation

When we say "transcription converter", we mean the mapper and the converter. They are both heavily dependent on the symbol set package, so that package will probably go with them. Or? [TODO: decide]


To move the transcription conversion calls from the pronlex API to a separate API, what do we do with the validation package? It depends on the symbol set (but not really on the mapper or the converter). Can it stay in the pronlex repository? [TODO: decide]

If we keep the validation package in pronlex, along with the command line tools for importing, converting, and validating lexicon files and entries, the pronlex repository will depend on the new package. Is this OK? [TODO: decide]

The symbolset has a dependency on lex.Entry but I think that could easily be removed if we put that logic somewhere else, and have the symbolset handle only single transcription strings (instead of full entries).
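
A minimal sketch of that decoupling, written in Python for brevity (pronlex itself is Go, so this only illustrates the idea). The class and function names here (SymbolSet, Entry, invalid_symbols_in_entry) are hypothetical stand-ins, not the actual pronlex types: the symbol set only ever sees single transcription strings, and the entry-level loop lives with the caller.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SymbolSet:
    """Hypothetical stand-in for a symbol set: a name plus its valid symbols."""
    name: str
    symbols: List[str]
    delimiter: str = " "

    def split_transcription(self, trans: str) -> List[str]:
        """Split one transcription string into its symbols."""
        return [s for s in trans.split(self.delimiter) if s]

    def invalid_symbols(self, trans: str) -> List[str]:
        """Return the symbols in trans that are not part of this symbol set."""
        known = set(self.symbols)
        return [s for s in self.split_transcription(trans) if s not in known]


@dataclass
class Entry:
    """Hypothetical stand-in for lex.Entry; it stays on the lexicon side."""
    orthography: str
    transcriptions: List[str] = field(default_factory=list)


def invalid_symbols_in_entry(ss: SymbolSet, entry: Entry) -> List[str]:
    """Entry-level logic kept outside the symbol set: loop over plain strings."""
    result: List[str] = []
    for trans in entry.transcriptions:
        result.extend(ss.invalid_symbols(trans))
    return result
```

With this split, the symbol set package would need no import of the lexicon entry type; only the caller knows about entries.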

In the pronlex repository, code related to the symbolset/mapper/converter packages is primarily used in the following components.

| Component used | Location/Package | Comment |
| --- | --- | --- |
| SymbolSet | converter/ | |
| SymbolSet | mapper/ | |
| SymbolSet | validation/ | |
| SymbolSet | dbapi/dbapi_test | used to prepare lexicon import (lexicon import is part of the dbapi) |
| SymbolSet | dbapi/validation_test | used to prepare validation tests (validation is part of the dbapi) |
| SymbolSet | lexserver API | API calls to mapper, symbolset, validation |
| Mapper | lexserver API | API calls to mapper |
| Converter | lexserver API | API calls to converter |
| Mapper | cmd/lexio/convert/ | converting transcriptions (and file format) from an external lexicon file to the Wikispeech default format |

A complete list of usages:

Use of package symbolset:

  • cmd/lexio/importLex/
  • cmd/test_validator/
  • cmd/validate_lex_file/
  • dbapi/dbapi_test.go
  • dbapi/validation_test.go
  • lexserver/lexserver.go
  • lexserver/mapper.go
  • lexserver/symbolset.go
  • lexserver/validation.go
  • symbolset/converter/
  • symbolset/mapper/
  • validation/

Use of package symbolset/mapper:

  • cmd/lexio/convert/CMU2WS/
  • cmd/lexio/convert/csCzPhword2WS/
  • cmd/lexio/convert/nbNoNST2WS/
  • cmd/lexio/convert/svSeNST2WS/
  • lexserver/mapper.go

Use of package symbolset/converter:

  • lexserver/converter.go
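
Lists like the ones above can be regenerated while the references are being changed. The following is a rough sketch, not the tool that was used for this task: it walks a checkout and reports which .go files mention the symbolset package paths. The import path prefix is an assumption inferred from the repository URL, and the function name is hypothetical. It matches any quoted occurrence of the path, which in practice should be import lines.

```python
import os
import re
from typing import Dict, List

# Assumed import path prefix, inferred from the repository URL; adjust if it differs.
OLD_PREFIX = "github.com/stts-se/pronlex/symbolset"

# Matches the quoted package path itself or any subpackage (mapper, converter, ...).
IMPORT_RE = re.compile(r'"(' + re.escape(OLD_PREFIX) + r'(?:/[^"]*)?)"')


def find_symbolset_imports(root: str) -> Dict[str, List[str]]:
    """Map each referenced symbolset package path to the .go files that mention it."""
    usages: Dict[str, List[str]] = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if not name.endswith(".go"):
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8") as f:
                text = f.read()
            for pkg in sorted(set(IMPORT_RE.findall(text))):
                usages.setdefault(pkg, []).append(os.path.relpath(path, root))
    return usages
```

Running it before and after the change gives a quick check that no file still refers to the old location.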

Use of the mapper/converter service in the current version of wikispeech_mockup:

| API URL | Called by component | Comment |
| --- | --- | --- |
| mapper/maptable | mapper_client.py | Used in initialization tests |
| mapper/map | mapper_client.py | Used for mapping between phonetic symbol sets |
| mapper/map | marytts_adapter.py | Used for mapping between phonetic symbol sets -- TODO: what differs from the mapper_client call? |
wikispeech_server/adapters/mapper_client.py:16:         self.base_url = "%s/mapper" % config.config.get("Services", "lexicon")
wikispeech_server/adapters/mapper_client.py:22:         url = "%s/%s/%s/%s" % (self.base_url, "maptable", self.from_symbol_set, self.to_symbol_set)
wikispeech_server/adapters/mapper_client.py:40:         url = "%s/%s/%s/%s/%s" % (self.base_url, "map", self.from_symbol_set, self.to_symbol_set, string)
wikispeech_server/adapters/marytts_adapter.py:21:       mapper_url = config.config.get("Services", "lexicon")
wikispeech_server/adapters/marytts_adapter.py:610:      url = mapper_url+"/mapper/map/%s/%s/%s" % (from_symbol_set, to_symbol_set, quote(trans))
wikispeech_server/adapters/marytts_adapter.py:649:      url = mapper_url+"/mapper/map/%s/%s/%s" % (from_symbol_set, to_symbol_set, quote(trans))
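
Judging only from the lines listed above, both components build the same mapper/map URL; the one visible difference is that marytts_adapter.py passes the transcription through quote() while the mapper_client.py line does not (whether the string is encoded elsewhere is not visible here). A possible way to keep the two in sync when pointing them at the new server is a pair of shared helpers. This is a sketch under assumptions: the function names are hypothetical, and it assumes the new server keeps the same /mapper/... paths, which should be checked against the new repository.

```python
from urllib.parse import quote


def maptable_url(base_url: str, from_symbol_set: str, to_symbol_set: str) -> str:
    """Build the maptable URL in the same shape as the mapper_client.py line."""
    return "%s/mapper/maptable/%s/%s" % (base_url, from_symbol_set, to_symbol_set)


def map_url(base_url: str, from_symbol_set: str, to_symbol_set: str, trans: str) -> str:
    """Build the map URL; quote() percent-encodes spaces and other unsafe characters."""
    return "%s/mapper/map/%s/%s/%s" % (
        base_url,
        from_symbol_set,
        to_symbol_set,
        quote(trans),
    )


# Host, port, and symbol set names below are made up for illustration.
print(map_url("http://localhost:8080", "from_set", "to_set", "a b"))
# prints http://localhost:8080/mapper/map/from_set/to_set/a%20b
```

With helpers like these, switching to the new server would only require changing the base URL read from the configuration.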

A subtask automatically inherits all project tags and subscribers. Just cleaning these up a bit.

HannaLindgren updated the task description.

@Sebastian_Berlin-WMSE @kalle @Lokal_Profil I found a note saying that we should inform you guys when this task is ready for testing from your side, so here we go: The updates have been pushed to master, and can be tested now (and docs updated).

The new repo: https://github.com/stts-se/symbolset

Our Wikispeech summary page: http://stts-se.github.io/wikispeech/