[wiki-nlp-tools] Tokenizer: update asset loading and initiations
Open, Needs Triage, Public

Description

Sentence Piece:

Currently, our tokenizer class loads the sentencepiece model by default, even when we are not planning to do any NWS word tokenization. In the future, we might have contexts where we *load multiple separate sentencepiece models for different languages*, so loading a single fixed model at initialization is not feasible.

Goal:

  • Update the tokenizer class
  • Accommodate dynamic loading of sentencepiece (SPC) models
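
One possible shape for the dynamic loading, as a minimal sketch rather than the actual wiki-nlp-tools design: a small registry that loads a model only the first time a language asks for it, then caches it. `LazyModelRegistry`, `model_paths`, and the `loader` parameter are hypothetical names introduced for illustration; the default loader assumes the `sentencepiece` Python package and its `SentencePieceProcessor(model_file=...)` constructor.

```python
from typing import Callable, Dict


def _load_sentencepiece(model_path: str):
    # Imported inside the function so that code paths which never need a
    # model do not pay for importing sentencepiece at all.
    import sentencepiece as spm

    return spm.SentencePieceProcessor(model_file=model_path)


class LazyModelRegistry:
    """Hypothetical helper: loads one model per language on first use."""

    def __init__(
        self,
        model_paths: Dict[str, str],
        loader: Callable[[str], object] = _load_sentencepiece,
    ):
        # Maps a language code to the path of its model file (assumed layout).
        self._model_paths = model_paths
        self._loader = loader
        self._models: Dict[str, object] = {}

    def get(self, lang: str):
        # Nothing is loaded at construction time; the first call for a
        # language loads its model, later calls reuse the cached object.
        if lang not in self._models:
            self._models[lang] = self._loader(self._model_paths[lang])
        return self._models[lang]
```

A Tokenizer could hold one such registry and call `get(lang)` only on the code path that actually needs NWS tokenization, so whitespace-delimited languages never trigger a model load.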

Abbreviation list:

Currently we load the entire list of abbreviations before filtering down to just the particular language. As a result, the start-up cost for a Tokenizer object is 5ms vs. 50µs when no abbreviation file is passed.

Goal:

  • Split up the abbreviation files into language-specific files so only the relevant set is loaded.

Adding to this: because we load the entire list of abbreviations before filtering down to just the particular language, our start-up cost for a Tokenizer object is 5ms vs. 50µs when no abbreviation file is passed. That is not huge, but it can add up if a Tokenizer is not re-used in code that does a lot of NLP processing. A clear alternative is to split up the abbreviation files into language-specific files so only the relevant set is loaded. The downside is that this requires 300+ files, but it does simplify updates to a specific language and should speed up start-up.
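
A rough sketch of what the per-language loading could look like. It assumes a hypothetical layout of one UTF-8 text file per language, named `<lang>.txt`, with one abbreviation per line; the function name and the file layout are illustrative and not taken from the current code.

```python
from pathlib import Path
from typing import FrozenSet


def load_abbreviations(abbrev_dir: str, lang: str) -> FrozenSet[str]:
    """Read only the abbreviation file for `lang` (assumed: <abbrev_dir>/<lang>.txt)."""
    path = Path(abbrev_dir) / f"{lang}.txt"
    if not path.is_file():
        # A language without an abbreviation file simply has no abbreviations.
        return frozenset()
    with path.open(encoding="utf-8") as f:
        # Skip blank lines and strip surrounding whitespace.
        return frozenset(line.strip() for line in f if line.strip())
```

With this layout, start-up only reads one small file, and updating a language means editing a single file.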

Maybe a single sqlite table that we can query without loading everything into memory (e.g. via sqlitedict)? Loading would be a query for all rows where lang matches some value. Don't know if that is faster than loading everything into memory first, though.
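
To make the sqlite idea concrete, here is a minimal sketch using Python's standard `sqlite3` module instead of sqlitedict, since a plain table with an index on `lang` is enough for a "all rows where lang matches" query. The table name, column names, and function names are assumptions for illustration. Whether this beats reading everything into memory would still need to be measured.

```python
import sqlite3
from typing import FrozenSet, Iterable, Tuple


def build_abbreviation_db(db_path: str, rows: Iterable[Tuple[str, str]]) -> None:
    """Create the (hypothetical) abbreviations table from (lang, abbrev) pairs."""
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "CREATE TABLE IF NOT EXISTS abbreviations "
                "(lang TEXT NOT NULL, abbrev TEXT NOT NULL)"
            )
            # The index lets sqlite find one language's rows without a full scan.
            conn.execute(
                "CREATE INDEX IF NOT EXISTS idx_abbrev_lang ON abbreviations (lang)"
            )
            conn.executemany(
                "INSERT INTO abbreviations (lang, abbrev) VALUES (?, ?)", rows
            )
    finally:
        conn.close()


def load_abbreviations_for(db_path: str, lang: str) -> FrozenSet[str]:
    """Fetch only the rows for one language."""
    conn = sqlite3.connect(db_path)
    try:
        cur = conn.execute(
            "SELECT abbrev FROM abbreviations WHERE lang = ?", (lang,)
        )
        return frozenset(row[0] for row in cur)
    finally:
        conn.close()
```

This keeps a single asset file instead of 300+, at the cost of opening a database connection at start-up.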