Performance review of Wikispeech
[Wikispeech](https://meta.wikimedia.org/wiki/Wikispeech) is a text-to-speech [extension](https://www.mediawiki.org/wiki/Extension:Wikispeech) for MediaWiki. Audio data is produced by a service backend we call Speechoid. Speechoid consists of [a suite of services](https://gerrit.wikimedia.org/r/admin/repos/q/filter:services%252Fwikispeech) (speech synthesizers, phonetic lexicons, etc) that are reached via a single interface, [wikispeech-server](https://gerrit.wikimedia.org/r/admin/repos/mediawiki/services/wikispeech/wikispeech-server).
The initial release includes support for Swedish, English and Arabic. There are several reasons for choosing these languages. We are the Swedish chapter of Wikimedia and the project is primarily funded by Swedish authorities. Swedish is a [relatively small language](https://en.wikipedia.org/wiki/Swedish_language) in number of speakers but is [well represented on Wikipedia](https://en.wikipedia.org/wiki/Swedish_Wikipedia). Arabic is written from right to left and uses non-Latin characters. English and Arabic are both major [world languages](https://en.wikipedia.org/wiki/World_language). In terms of infrastructure requirements, this mix lets us introduce the extension to the community in several ways, choosing between high- and low-end requirements.
#### Preview environment
Wikispeech is currently somewhat stuck between stages in the process of being deployed to the beta cluster; it has been suggested that a performance review may be needed prior to the security review (T180021), and thus prior to beta cluster deployment.
We do however have a [public demo installation available at wmflabs](https://wikispeech.wmflabs.org/). Admin access to the machine can be provided if needed.
#### Which code to review
All services are built with [Blubber](https://wikitech.wikimedia.org/wiki/Blubber) using [PipelineLib](https://wikitech.wikimedia.org/wiki/PipelineLib). Production images are available in the [WMF Docker Registry](https://dockerregistry.toolforge.org/). [Helm](https://wikitech.wikimedia.org/wiki/Helm) chart setup and Kubernetes deployment settings are pending a Service-ops response, which at the time of authoring this document was itself pending this performance review.
#### Performance assessment
##### Terminology and test data set
For complete terminology please see [the Wikispeech terminology page](https://meta.wikimedia.org/wiki/Wikispeech/Terminology).
- Segment: a sentence-sized unit of text that, when passed on to Speechoid, is turned into an utterance.
- Token: a word-sized unit of text, produced by Speechoid when breaking a segment into pieces.
- Utterance: a unit of continuous speech; a synthesized segment.
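As an illustration, the relationship between segments and tokens can be sketched as follows. This is a naive sketch for terminology purposes only; the actual segmentation is done by the Wikispeech extension and Speechoid and is more sophisticated than this.

```python
# Illustrative sketch only: not the extension's actual segmenter.
import re

text = "Barack Obama was born in Hawaii. He served two terms."

# A segment is a sentence-sized unit; split naively on sentence-ending punctuation.
segments = [s for s in re.split(r"(?<=[.!?])\s+", text) if s]

# A token is a word-sized unit produced by breaking up a segment into pieces.
tokens_per_segment = [re.findall(r"\w+", s) for s in segments]

print(segments)            # two segments
print(tokens_per_segment)  # e.g. six tokens in the first segment
```

Each segment would then be synthesized by Speechoid into one utterance.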
We often refer to the [Barack Obama page](https://en.wikipedia.org/wiki/Barack_Obama) on English Wikipedia. It is a very large page and serves as a realistic worst-case scenario.
When we refer to a single-letter-page-size article, we mean an article whose pure text content would fill a single printed letter or A4 sheet, for example the article on the small town of [Halmstad](https://en.wikipedia.org/wiki/Halmstad), Sweden.
##### Client web interface
The performance impact on the client web interface when Wikispeech is enabled but the //Listen// action has not yet been selected, or when Wikispeech is disabled, is next to none.
Any manipulation of the DOM is done in such a way that it is restored when no longer needed, to not interfere with other components.
Profiling with manually forced garbage collection clearly shows that we are not leaving behind zombie DOM elements or other memory leaks. The RAM footprint from temporarily modifying the DOM during highlighting is thus very small, but the constant modification of highlighted text causes small bursts of CPU usage as the garbage collector runs.
The client uses HTML5 to play Opus audio in an Ogg envelope. Utterances are requested one at a time, in advance. Ideas for a smoother playback experience are considered in T160149, including the cost of loading more data in advance and thus increasing the memory footprint for the client.
The audio data is received as JSON containing a Base64 encoded field with the audio. There is thus a small extra cost for downloading the complete audio and decoding the Base64, compared to streaming raw audio from the server. But utterances are generally small chunks of data, around 70kB each. As a further safeguard, it is possible to limit the maximum size of a segment (and thus utterance) to avoid performance problems for both the client and the servers.
There is further information about Base64 at the end of the next section, about the extension.
##### MediaWiki instance, extension and database
The CPU footprint on the MediaWiki instance comes mainly from segmentation of the page text. This is however a rather speedy procedure, and the result is cached to further minimize the impact of repeated executions. Segmenting the Barack Obama page on an i7-8650U CPU @ 1.90GHz takes about 600-900 milliseconds depending on current CPU load. That page is about 170k characters divided into 42k tokens, plus whitespace and punctuation. This translates roughly to 0.2-0.3 milliseconds per segment, or 0.003-0.005 milliseconds per character. When the result is already cached, these values drop close to zero. Segmentation is not a bottleneck.
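The per-segment figures can be sanity-checked from the numbers above, assuming a mean of roughly 15 tokens per segment (a figure measured later in this document):

```python
# Sanity check of the segmentation throughput figures for the Barack Obama page.
tokens = 42_000
characters = 170_000
mean_tokens_per_segment = 15                  # assumed mean, measured on English text
segments = tokens / mean_tokens_per_segment   # ~2800 segments

for total_ms in (600, 900):                   # measured wall-clock range
    print(round(total_ms / segments, 2), "ms/segment,",
          round(total_ms / characters, 4), "ms/character")
# 600 ms -> ~0.21 ms/segment, ~0.0035 ms/character
# 900 ms -> ~0.32 ms/segment, ~0.0053 ms/character
```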
Utterances representing unique text segments per page are generated in the backend and sent to the frontend. Utterances are cached in the MediaWiki instance using an extension database table for utterance metadata, and a [FileBackend](https://www.mediawiki.org/wiki/Manual:FileBackend.php) for utterance audio and synthesis metadata. This cache can be configured in a variety of ways to flush out data manually and automatically. By default, a job is queued every 30 minutes that flushes utterances older than 31 days. Besides keeping rarely requested utterances from growing the cache, this periodic flushing also ensures that improvements to Speechoid eventually reach the cache even for commonly requested utterances.
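The age-based flushing can be sketched as follows. This is a hedged illustration only: the real extension uses MediaWiki's job queue and FileBackend, and the table and column names below are invented for the example.

```python
# Hypothetical sketch of the periodic utterance-cache flush job.
# Table and column names are invented; the real schema differs.
import sqlite3
import time

TTL_DAYS = 31  # default: utterances older than 31 days are flushed

def flush_expired_utterances(db: sqlite3.Connection, now: float) -> int:
    """Delete utterance metadata rows older than TTL_DAYS. In the real job,
    the corresponding audio and synthesis metadata files would also be
    removed from the FileBackend."""
    cutoff = now - TTL_DAYS * 24 * 3600
    cur = db.execute("DELETE FROM utterance WHERE created < ?", (cutoff,))
    db.commit()
    return cur.rowcount

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE utterance (id INTEGER PRIMARY KEY, created REAL)")
now = time.time()
db.execute("INSERT INTO utterance (created) VALUES (?)", (now - 40 * 24 * 3600,))
db.execute("INSERT INTO utterance (created) VALUES (?)", (now - 1 * 24 * 3600,))
print(flush_expired_utterances(db, now))  # 1: only the 40-day-old row is flushed
```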
For each utterance, a metadata row of about 150 bytes is created in the utterance cache database table.
For each utterance, a synthesis metadata JSON file is stored in the FileBackend. Its size in bytes depends on the segmented text: about 25 bytes plus the token text itself for each token in the segment. As the mean segment contains 15 tokens and the mean token is 4.5 characters, this adds up to about 450 bytes per utterance.
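Plugging in those mean values reproduces the 450-byte figure:

```python
# Back-of-envelope estimate of per-utterance synthesis metadata size.
overhead_per_token = 25      # approximate JSON structure bytes per token
mean_token_length = 4.5      # mean token length in characters
mean_tokens_per_segment = 15

estimate = mean_tokens_per_segment * (overhead_per_token + mean_token_length)
print(estimate)  # 442.5 bytes, i.e. roughly 450 bytes per utterance
```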
The true size will however depend on the underlying filesystem used. For instance, ext4 commonly uses a 4kB allocation unit, meaning that each file actually consumes 4kB of the disk, while a compressed Swift backend might consume less than 450 bytes.
It is tough to give a good estimate of how much cache space the audio data of each utterance consumes, as this depends on the language and the complexity of the written text. However, about 4.5kB per word in an English page is a fairly good ballpark figure, which translates roughly to a 70kB utterance audio file per sentence.
Utterances are stored as Base64 in the FileBackend. The mean 70kB utterance audio file thus consumes 93kB when stored, since Base64 encodes 3 bytes of binary data as 4 bytes of text. This means 33% more disk use, but also keeps CPU requirements down a bit, as there is no need to Base64 encode the binary data before passing it on to the client. Again, the true amount of space consumed depends on the underlying filesystem.
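The 3-to-4 expansion is easy to verify:

```python
# Verify the Base64 size overhead for a typical ~70 kB utterance audio file.
import base64
import math

raw = bytes(70_000)
encoded = base64.b64encode(raw)

# Base64 emits 4 output bytes for every 3 input bytes, padded to a multiple of 4.
assert len(encoded) == 4 * math.ceil(len(raw) / 3)
print(len(encoded))  # 93336 bytes, about 33% larger than the raw audio
```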
Replacing the Base64 encoded utterances with binary data all the way from the cache to the player in the webclient via true streaming could be implemented but will require quite a bit of code and finding new solutions for both the webclient and the extension. It requires more investigations to find out whether this would actually save any resources for the client or if it just adds complexity.
Simply storing utterances as binary and encoding them on request would be easy to implement and would save 25% disk use, in exchange for more server-side CPU cycles (bandwidth to the client would stay the same, since the data is still sent as Base64).
##### Speechoid backend
Speechoid is where all the hardware resources are consumed. On our test VPS on wmflabs (1 vCPU, 2GB RAM) we see a ratio of synthesized audio length to synthesis time of around 1.5:1, at 100% CPU usage on one thread and 600MB RAM usage. In other words, such a machine synthesizes audio 1.5 times faster than it is played back, so synthesis stays ahead of real-time playback with some headroom to spare.
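The measured ratio translates into synthesis cost as follows. This is a rough sketch; real throughput varies with language and text complexity.

```python
# Converting the measured ratio on the 1-vCPU test VPS into synthesis cost.
ratio = 1.5  # seconds of synthesized audio per second of CPU time

def cpu_seconds_needed(audio_seconds: float) -> float:
    """CPU time required to synthesize a given length of audio."""
    return audio_seconds / ratio

print(cpu_seconds_needed(60))       # 40.0 s of CPU per minute of audio
print(cpu_seconds_needed(10 * 60))  # 400.0 s for a 10-minute listening session
```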
Speechoid is horizontally scalable. The simplest approaches we can think of are either to deploy multiple pods containing the complete suite of services, or to deploy multiple instances of the most resource-intensive services, balanced using simple round-robin scheduling.
In order to avoid overloading Mary TTS, our current main text-to-speech engine, we've added an [HAProxy](http://www.haproxy.org/) in front of [the service to queue requests](https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/services/wikispeech/mary-tts/+/refs/heads/master/blubber-entrypoint.sh). The number of requests passed down to the actual service needs to be [configured](https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/services/wikispeech/mary-tts/+/refs/heads/master/haproxy.cfg) to one per available native thread, to make sure all available CPU is spent on synthesis. Our assessment is that it is better to queue requests and spend the CPU synthesizing than to share the CPU between requests: requests that time out are dropped while still in the queue and never passed on to synthesis, rather than all requests being accepted and the CPU possibly being overloaded synthesizing text that is then dropped due to timeouts.
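The queuing approach can be illustrated with a minimal HAProxy configuration. This is a sketch only; the actual settings live in the haproxy.cfg linked above, and the port numbers here are assumptions.

```
# Sketch: queue requests in HAProxy, passing at most one request at a time
# to Mary TTS per available native thread (here: one thread, so maxconn 1).
frontend marytts_in
    bind *:8080
    mode http
    default_backend marytts

backend marytts
    mode http
    timeout queue 30s          # requests waiting longer are dropped in the queue
    server mary 127.0.0.1:59125 maxconn 1
```

With `maxconn 1`, excess requests wait in HAProxy's queue instead of competing for CPU inside the synthesizer, and `timeout queue` drops them before they ever reach synthesis.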
We have not yet been able to discuss available alternatives with the WMF Service ops team, but we are hoping it is possible to automate the scaling and queuing from within Kubernetes rather than from inside our deployed pod.
How much resources need to be allocated to Speechoid has to be evaluated under real user load. We have no such numbers at this point; we hope that deployment on the beta cluster will help in assessing this.
##### Pronunciation lexicon
Wikispeech offers users the possibility to update the pronunciation of words using [IPA](https://en.wikipedia.org/wiki/International_Phonetic_Alphabet).
During speech synthesis, Speechoid looks up all words of a segment in a database to find any user-supplied IPA, which is passed down to the speech synthesis engine. This database can be stored in either MariaDB/MySQL or a local SQLite database. As of this review, it is a local SQLite database populated with the default data shipped in the Speechoid distribution.
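Conceptually, the per-word lookup during synthesis looks like this. The schema, column names, and transcription below are invented for illustration; the real lexicon service has its own format.

```python
# Hypothetical sketch of the lexicon lookup Speechoid performs during synthesis.
# Schema and names are invented for illustration.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE lexicon (word TEXT PRIMARY KEY, ipa TEXT)")
db.execute("INSERT INTO lexicon VALUES ('Halmstad', 'ˈhalmsta')")

def lookup_ipa(db: sqlite3.Connection, tokens: list) -> dict:
    """Return user-supplied IPA for each token, or None to let the
    synthesizer fall back to its own pronunciation rules."""
    result = {}
    for token in tokens:
        row = db.execute(
            "SELECT ipa FROM lexicon WHERE word = ?", (token,)).fetchone()
        result[token] = row[0] if row else None
    return result

print(lookup_ipa(db, ["Halmstad", "Sweden"]))
```

Words without a lexicon entry simply fall through to the engine's default pronunciation, so the lookup adds one small query per token.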
Wikispeech also includes support for a second lexicon layer, to allow for revision history and rollbacks just like the revision history of any wiki page, since the Speechoid database does not support such a feature. The synchronization and handling of the two layers is implemented, but the actual second-level lexicon is not; we intend to implement it by storing lexicon entries as wiki pages in a new namespace. As of this review, this feature is not enabled, and lexicon operations are sent straight to the Speechoid database.
There is some cost when updating lexicon information, especially since it involves operations in multiple persistence layers. But lexicon edits are expected to be very rare; they will probably mainly concern non-standard [given names](https://en.wikipedia.org/wiki/Given_name) and [homographs](https://en.wikipedia.org/wiki/Homograph).
##### Benchmark results
We have developed [a script](https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/Wikispeech/+/4db6aac6516863cc3c080db8a7979ba0a2467bde/maintenance/benchmark.php) that benchmarks the performance of the Wikispeech extension and the Speechoid services. Here are the results from executing that script on the English Wikipedia Barack Obama page: