Performance review of Wikispeech
Description
Wikispeech is a text-to-speech extension for MediaWiki. Audio data is produced by a service backend we call Speechoid. Speechoid consists of a suite of services (speech synthesizers, phonetic lexicons, etc) that are reached via a single interface, wikispeech-server.
The initial release includes support for Swedish, English and Arabic. There are several reasons for choosing these languages. We are the Swedish chapter of Wikimedia and the project is primarily funded by Swedish authorities. Swedish is a relatively small language in terms of number of speakers but is well represented on Wikipedia. Arabic is written from right to left and uses a non-Latin script. English and Arabic are both major world languages. In terms of infrastructure requirements, this gives us several options for how to introduce the extension to the community, allowing us to choose between high-end and low-end requirements.
Preview environment
Wikispeech is currently stuck partway through the process of being deployed to the beta cluster, and it has been suggested that a performance review may be needed prior to the security review (T180021) and thus prior to beta cluster deployment.
We do, however, have a public demo installation available on wmflabs. Admin access to the machine can be provided if needed.
Which code to review
- https://gerrit.wikimedia.org/g/mediawiki/extensions/Wikispeech - the REW_0.1.8 branch
- https://gerrit.wikimedia.org/g/mediawiki/services/wikispeech/wikispeech-server
- https://gerrit.wikimedia.org/g/mediawiki/services/wikispeech/symbolset
- https://gerrit.wikimedia.org/g/mediawiki/services/wikispeech/pronlex
- https://gerrit.wikimedia.org/g/mediawiki/services/wikispeech/mishkal
- https://gerrit.wikimedia.org/g/mediawiki/services/wikispeech/mary-tts
All services are built with Blubber using PipelineLib. Production images are available in the WMF Docker Registry (see docker-compose.yml for a list of the current images). The Helm chart setup and Kubernetes deployment settings are pending a response from Service Ops, which at the time of writing was itself pending this performance review.
Performance assessment
Terminology and test data set
For complete terminology please see the Wikispeech terminology page.
- A segment is a sentence-sized unit of text that, when passed on to Speechoid, is turned into an utterance.
- A token is a word-sized unit of text, produced by Speechoid when breaking a segment into pieces.
- An utterance is a unit of continuous speech: a synthesized segment.
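To make the relationship between these terms concrete, here is a minimal illustrative sketch; the class and field names are ours, not the extension's actual data model:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Token:
    """A word-sized unit of text, produced when Speechoid breaks up a segment."""
    orthography: str               # the word as written in the page text

@dataclass
class Segment:
    """A sentence-sized unit of text; synthesizing it yields one utterance."""
    text: str
    tokens: List[Token] = field(default_factory=list)

@dataclass
class Utterance:
    """A unit of continuous speech: the synthesized audio for one segment."""
    segment: Segment
    audio_ogg_opus: bytes          # Opus audio in an Ogg container (see below)
```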
We often refer to the Barack Obama page on English Wikipedia. This should be considered a very large page, a realistic worst-case scenario.
When we refer to a single letter page size article, we mean one whose pure text content would fill a single printed letter or A4 sheet; for example, the article on the small town of Halmstad, Sweden.
Client web interface
CPU and RAM footprints in the browser were evaluated using the profiler built into Chrome. We studied the JavaScript and DOM-tree RAM footprints separately, while the CPU footprint was measured for both combined.
By default the extension only adds a Listen entry to the action menu on supported pages. Only once this action is selected are the audio player and the remaining JavaScript loaded. A "Pre-load Wikispeech" user setting is available to load the audio player directly on page load, and a "Disable Wikispeech" setting is available to also disable the action menu entry. When play is selected on the audio player, utterance playback starts and the corresponding text (segment and current token) is highlighted.
The performance impact on the user web interface when Wikispeech is enabled but before the Listen action is selected, or when Wikispeech is disabled, is next to none.
When enabled (Listen action selected), the player always requests the segmented version of the page. This occurs before the play button is hit. We believe that a user who chooses to enable the Wikispeech player prefers a faster response when hitting the play button over saving a bit of bandwidth and JavaScript memory in case the play button is never hit.
For a very large page such as the Barack Obama page, the segmentation DOM metadata, used to highlight what is currently being read, is represented by a 500kB JSON blob. We have not been able to profile the actual amount of RAM this consumes when deserialized into objects in the browser's JavaScript environment. For a more normal single letter page size article, the DOM metadata JSON blob is about 3kB.
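As an illustration of the request described above, here is a rough sketch of fetching the segmentation metadata and checking its size. We assume the segmenting API module is exposed as action=wikispeech-segment with a page parameter and that the result contains a list of segments; verify the exact module, parameter and result field names against the extension's API documentation before relying on them.

```python
import requests

# Placeholder URL; substitute the wiki under test (e.g. the wmflabs demo).
API_URL = "https://example.org/w/api.php"

# Assumed module and parameter names; check the extension's API docs.
response = requests.get(API_URL, params={
    "action": "wikispeech-segment",
    "page": "Barack Obama",
    "format": "json",
}, timeout=30)
response.raise_for_status()

data = response.json()
segments = data.get("wikispeech-segment", {}).get("segments", [])
print(f"{len(segments)} segments, {len(response.content) / 1000:.0f}kB of segmentation JSON")
```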
Any manipulation of the DOM is done in such a way that it is restored when no longer needed, so as not to interfere with other components.
Due to the garbage collection of modern browsers, it is rather difficult to evaluate the true RAM consumption caused by playing utterances (JavaScript memory footprint) and highlighting text (DOM memory footprint).
We can, from profiling and manually forcing garbage collection, clearly see that we are not causing any zombie DOM elements or other memory leaks. The RAM footprint impact of temporarily modifying the DOM during highlighting is thus very small, but it does cause small bursts of CPU processing as the garbage collector runs, since the highlighted text is constantly being modified.
The Barack Obama page consumes about 6MB of JavaScript RAM when loaded. When the play button is hit it quickly reaches 7MB. Due to garbage collection it stays below 7MB at all times throughout playing the complete page content from start to end. The results for a more normal single letter page size article are so similar to those of the Barack Obama case that we cannot quite tell them apart. It seems to us that the RAM footprint of Wikispeech execution is rather tiny compared to that of the other JavaScript libraries (jQuery, skins, etc.) that MediaWiki depends on.
The client uses HTML5 to play Opus audio in an Ogg container. Utterances are requested one at a time, in advance. Further thoughts on a smoother playback experience are collected in T160149, including the cost of loading more data in advance and thus increasing the client's memory footprint.
The audio data is received as JSON containing a Base64-encoded field with the audio. There is thus a small extra cost for downloading the complete audio and decoding the Base64, compared to streaming raw audio from the server. But utterances are generally small chunks of data, around 70kB each. As a further safeguard, it is possible to limit the maximum size of a segment (and thus of an utterance) in order to avoid performance problems for both the client and the servers.
There is further information about Base64 at the end of the next section, about the extension.
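Since the transfer format is standard Base64 inside JSON, the decode step and the size overhead can be illustrated offline with a short sketch; the "audio" field name is an assumption here, not necessarily the extension's actual response format:

```python
import base64
import json

# Illustrative only: we assume an utterance response saved as JSON with a
# Base64-encoded "audio" field holding the Ogg/Opus data (field name assumed).
with open("utterance.json") as f:
    utterance = json.load(f)

audio_b64 = utterance["audio"]
audio_bytes = base64.b64decode(audio_b64)

# Base64 stores 3 bytes of binary data in 4 characters, so the encoded
# payload is roughly a third larger than the decoded audio.
print(f"encoded: {len(audio_b64) / 1000:.0f}kB, decoded: {len(audio_bytes) / 1000:.0f}kB")

with open("utterance.ogg", "wb") as f:
    f.write(audio_bytes)
```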
MediaWiki instance, extension and database
The CPU footprint on the MediaWiki instance is mainly due to the segmentation of page text. This is, however, a rather speedy procedure, and the result is cached to further minimize the impact of repeated executions. Segmenting the Barack Obama page on an i7-8650U CPU @ 1.90GHz takes about 600-900 milliseconds depending on current CPU load. The page is about 170k characters divided into 42k tokens, plus whitespace and punctuation. This translates roughly to 0.2-0.3 milliseconds per segment, or 0.003-0.005 milliseconds per character. When the result is already cached, these values approach 0. Segmentation is not a bottleneck.
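The per-segment and per-character figures follow from the measured totals; a quick back-of-the-envelope check (the segment count is derived from the mean of 15 tokens per segment quoted further down):

```python
chars = 170_000          # characters on the Barack Obama page
tokens = 42_000          # tokens on the page
tokens_per_segment = 15  # mean segment length (see the utterance cache figures below)
segments = tokens / tokens_per_segment               # ~2,800 segments

for total_ms in (600, 900):                          # measured segmentation time
    print(f"{total_ms}ms total: "
          f"{total_ms / segments:.2f}ms per segment, "
          f"{total_ms / chars:.4f}ms per character")
# ~0.21-0.32ms per segment and ~0.0035-0.0053ms per character,
# matching the 0.2-0.3 and 0.003-0.005 ranges above.
```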
Utterances representing unique text segments per page are generated in the backend and sent to the frontend. Utterances are cached in the MediaWiki instance using an extension database table for utterance metadata and a FileBackend for utterance audio and synthesis metadata. This cache can be configured in a variety of ways to flush data manually and automatically. By default a job is queued every 30 minutes which flushes utterances that are older than 31 days. In addition to keeping the cache from growing due to rarely requested utterances, this periodic flushing also ensures that improvements to Speechoid eventually reach the cache even for commonly requested utterances.
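A minimal sketch of the flush criterion described above; the actual job, configuration names and storage calls in the extension differ, this only illustrates the policy:

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

FLUSH_JOB_INTERVAL = timedelta(minutes=30)   # how often the flush job is queued
MAX_UTTERANCE_AGE = timedelta(days=31)       # utterances older than this are flushed

def is_expired(created: datetime, now: Optional[datetime] = None) -> bool:
    """True if a cached utterance should be removed by the periodic flush job."""
    now = now or datetime.now(timezone.utc)
    return now - created > MAX_UTTERANCE_AGE
```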
For each utterance, a metadata row of about 150 bytes is created in the utterance cache database table.
For each utterance, a synthesis metadata JSON file is stored in the FileBackend. Its size depends on the size of the segmented text: about 25 bytes plus the token itself for each token in the segment. As the mean segment contains 15 tokens and the mean token is 4.5 characters long, this adds up to about 450 bytes of data.
The true size will however depend on the underlying filesystem used. For instance, ext4 commonly uses a 4kB allocation unit, meaning that each file actually consumes 4kB of the disk, while a compressed Swift backend might consume less than 450 bytes.
It is tough to give a good estimate of how much the audio data of each utterance consumes in the cache, as this depends on the language and the complexity of the written text. However, about 4.5kB per word in an English page is a fairly good ballpark figure. This translates roughly to a 70kB utterance audio file per sentence.
Utterances are stored as Base64 in the FileBackend. Thus, the mean 70kB utterance audio file consumes 93kB when stored, since Base64 uses 4 bytes to store 3 bytes of binary data. This means about 33% more disk use, but it also keeps CPU requirements down a bit, as there is no need to Base64 encode the binary data before passing it on to the client. Again, the true amount of space consumed depends on the underlying filesystem.
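Putting the figures from this section together gives a hedged back-of-the-envelope estimate of what caching every utterance of the Barack Obama page would cost; this is illustrative arithmetic only, not a measured result:

```python
# Ballpark cache size for the Barack Obama page, using the figures above.
tokens = 42_000
tokens_per_segment = 15
segments = tokens // tokens_per_segment       # ~2,800 utterances

db_row_bytes = 150                            # utterance metadata row
synthesis_json_bytes = 450                    # ~25 bytes + the token, per token
audio_bytes = 70_000                          # mean Ogg/Opus utterance
audio_base64_bytes = audio_bytes * 4 // 3     # 4 bytes stored per 3 bytes of audio

print(f"DB metadata:    {segments * db_row_bytes / 1e6:.1f}MB")         # ~0.4MB
print(f"Synthesis JSON: {segments * synthesis_json_bytes / 1e6:.1f}MB") # ~1.3MB
print(f"Base64 audio:   {segments * audio_base64_bytes / 1e6:.0f}MB")   # ~260MB

# On a filesystem with 4kB allocation units (e.g. ext4) each small JSON file
# still occupies at least 4kB on disk, so the JSON layer may in practice use
# around 2,800 * 4kB = ~11MB.
```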
Replacing the Base64-encoded utterances with binary data all the way from the cache to the player in the web client, via true streaming, could be implemented, but it would require quite a bit of code and new solutions for both the web client and the extension. More investigation is needed to find out whether this would actually save any resources for the client or just add complexity.
Simply storing utterances as binary data and Base64 encoding them on request would be easy to implement and would save about 25% of disk use and bandwidth, in exchange for more server-side CPU cycles.
Speechoid backend
Speechoid is where all the hardware resources are consumed. On our test VPS on wmflabs (1 VCPU, 2GB RAM) we see a ratio of synthesized voice length to time spent synthesizing of around 1.5:1, at 100% thread CPU usage and 600MB RAM usage. Thus there are enough resources on such a machine to simultaneously synthesize two segments faster than real time.
Speechoid is horizontally scalable. The simplest ways we can think of are either to deploy multiple pods containing the complete suite of services, or to deploy multiple instances of the most resource-intensive services, balanced using simple round-robin scheduling.
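As a rough capacity-planning sketch, assuming that throughput scales linearly with the number of pods and that each pod matches our 1-VCPU test VPS; this is a deliberately simple model that ignores the utterance cache, which in practice absorbs much of the load:

```python
import math

real_time_factor = 1.5   # seconds of audio synthesized per second of CPU (test VPS)
vcpus_per_pod = 1        # size of the test VPS

def pods_needed(concurrent_listeners: int) -> int:
    """Pods required to keep this many uncached listeners supplied in real time,
    assuming linear scaling across pods."""
    streams_per_pod = real_time_factor * vcpus_per_pod
    return math.ceil(concurrent_listeners / streams_per_pod)

for listeners in (10, 100, 1000):
    print(f"{listeners} concurrent listeners -> ~{pods_needed(listeners)} pods")
```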
In order to avoid overloading Mary TTS, our current main text-to-speech engine, we have added an HAProxy in front of the service to queue requests. The number of requests passed down to the actual service needs to be configured to one per available native thread, to make sure all available CPU is spent on synthesis. Our assessment is that we prefer queuing requests and spending the CPU on synthesizing rather than on sharing it between too many concurrent requests. This way, requests that time out are dropped while still in the queue and never passed on to synthesis, rather than accepting all requests and possibly overloading the CPU synthesizing text that might be discarded anyway due to timeouts.
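The queuing behaviour we want from HAProxy can be illustrated with a small sketch (this is not the actual HAProxy configuration): admit at most one request per native synthesis thread and drop requests whose deadline passes while they are still waiting in the queue.

```python
import asyncio
from typing import Optional

NATIVE_THREADS = 1       # concurrent requests passed to the synthesizer
QUEUE_TIMEOUT = 10.0     # seconds a request may wait before being dropped

semaphore = asyncio.Semaphore(NATIVE_THREADS)

async def synthesize(segment_text: str) -> bytes:
    # Placeholder for the actual call to the synthesizer behind the proxy.
    await asyncio.sleep(1.0)
    return b"ogg-opus-audio"

async def handle_request(segment_text: str) -> Optional[bytes]:
    try:
        # Wait for a free synthesis slot, but give up (drop the request)
        # if the deadline passes while we are still queued.
        await asyncio.wait_for(semaphore.acquire(), timeout=QUEUE_TIMEOUT)
    except asyncio.TimeoutError:
        return None      # dropped in the queue; no CPU spent on synthesis
    try:
        return await synthesize(segment_text)
    finally:
        semaphore.release()
```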
We have not yet been able to discuss with the WMF Service Ops team what alternatives are available, but we hope it is possible to automate the scaling and queuing from within Kubernetes rather than from inside our deployed pod.
How many resources need to be allocated to Speechoid has to be evaluated under real user load. We have no such numbers at this point. We hope that deployment on the beta cluster will be helpful in assessing this.
Pronunciation lexicon
Wikispeech offers users the possibility to update the pronunciation of words using IPA.
During speech synthesis, Speechoid looks up all words of a segment in a database to find any specific IPA and passes it down to the speech synthesis engine. This database can be stored either in MariaDB/MySQL or in a local SQLite database. As of this review, it is a local SQLite database populated with the default data shipped with the Speechoid distribution.
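Conceptually, the lookup amounts to something like the following sketch; the table and column names here are hypothetical and do not reflect the actual pronlex schema:

```python
import sqlite3
from typing import Dict, List

def ipa_overrides(db_path: str, tokens: List[str]) -> Dict[str, str]:
    """Map each token of a segment to a stored IPA transcription, if any.

    Hypothetical schema: lexicon(orthography TEXT PRIMARY KEY, ipa TEXT).
    The real pronlex database is considerably richer than this.
    """
    overrides: Dict[str, str] = {}
    with sqlite3.connect(db_path) as conn:
        for token in tokens:
            row = conn.execute(
                "SELECT ipa FROM lexicon WHERE orthography = ?", (token.lower(),)
            ).fetchone()
            if row is not None:
                overrides[token] = row[0]
    return overrides

# Tokens with an entry get their IPA passed down to the synthesis engine;
# the rest fall back to the engine's own pronunciation rules.
```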
Wikispeech also includes support for adding a second-layer lexicon within the wiki, in order to allow for revision history and rollbacks just as for any wiki page, since the Speechoid database does not support such a feature. The synchronization and handling of the two layers is implemented, but the actual second-level lexicon is not; we intend to implement it by storing lexicon entries as wiki pages in a new namespace. As of this review, this feature is not enabled, and lexicon operations are sent straight to the Speechoid database.
There is some I/O cost when updating lexicon information, especially since it involves operations in multiple persistence layers. But lexicon edits should be considered very rare; entries will probably mainly be edited for non-standard given names and for homographs.
It is hard to predict a future worst-case scenario for the lexicon stored in the wiki, but we only store words whose pronunciation has been changed compared to the lexicon distributed with Speechoid. So perhaps a few thousand words on a site such as English Wikipedia?
One important issue we have identified is how to handle synchronization conflicts between the local lexicon and the one in Speechoid, e.g. when redeploying a new version of Speechoid that comes with a new default lexicon. This is, however, a discussion for the future, as it would require having Wikispeech running on multiple wikis from which we can gather changes we want to redistribute.
Speechoid benchmark results
We have developed a script that benchmarks the performance of the Wikispeech extension and the Speechoid services. Here are the results from executing that script on the English Wikipedia Barack Obama page: