
Performance review of Wikispeech
Closed, ResolvedPublic




Wikispeech is a text-to-speech extension for MediaWiki. Audio data is produced by a service backend we call Speechoid. Speechoid consists of a suite of services (speech synthesizers, phonetic lexicons, etc) that are reached via a single interface, wikispeech-server.

The initial release includes support for Swedish, English and Arabic. There are several reasons for choosing these languages. We are the Swedish chapter of Wikimedia and the project is primarily funded by Swedish authorities. Swedish is a relatively small language in number of speakers but is well represented on Wikipedia. Arabic is written from right to left and uses a non-Latin script. English and Arabic are both major world languages. In terms of infrastructure requirements, this lets us introduce the extension to the community in a number of ways, choosing between high-end and low-end requirements.

Preview environment

Wikispeech is currently somewhat stuck between paths in the process of being deployed to the beta cluster, and it has been suggested that a performance review may be needed prior to the security review (T180021) and thus beta cluster deployment.

We do however have a public demo installation available at wmflabs. Admin access to the machine can be provided if needed.

Which code to review

All services are built with Blubber using PipelineLib. Production images are available in the WMF Docker Registry (see the docker-compose.yml for a list of the current images). Helm chart setup and Kubernetes deployment settings are pending a Service-ops response, which at the time of authoring this document was in turn pending this performance review.

Performance assessment
Terminology and test data set

For complete terminology please see the Wikispeech terminology page.

  • Segment is a sentence-size unit of text that, when passed on to Speechoid, will turn into an utterance.
  • Token is a word-size unit of text produced by Speechoid by breaking up a segment into pieces.
  • Utterance is a unit of continuous speech; a synthesized segment.
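As a rough illustration of how these units relate, here is a naive segmenter sketch in Python. This is a simplification for illustration only; the actual Wikispeech segmenter handles abbreviations, markup and language-specific rules:

```python
import re

def segment(text):
    """Naively split plain text into sentence-sized segments."""
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]

def tokenize(segment_text):
    """Split a segment into word-sized tokens."""
    return re.findall(r"\w+", segment_text)

text = "Barack Obama served as president. He was born in Hawaii."
segments = segment(text)        # two segments, one per sentence
tokens = tokenize(segments[0])  # word-sized tokens of the first segment
print(segments)
print(tokens)
```

Each segment would then be synthesized into one utterance, with the tokens driving the word-level highlighting.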

We often refer to the Barack Obama-page on English Wikipedia. This is to be considered a very large page, a realistic worst case scenario.

When we refer to a single letter page size article, we mean something that would fill a single printed US letter or A4 sheet when extracted to plain text content, for example the article on the small town of Halmstad, Sweden.

Client web interface

CPU and RAM footprints of the browser were evaluated using the profiler built into Chrome. We primarily studied the RAM footprints of the Javascript heap and the DOM tree separately, while the CPU footprint was measured for both together.

By default the extension only adds a Listen entry to the action menu on supported pages. Only once this action is selected are the audio player and remaining Javascript loaded. A Pre-load Wikispeech user setting is available to load the audio player directly on page load, and a Disable Wikispeech setting is available to also remove the action menu entry. When play is selected on the audio player, utterance playback starts and the corresponding text (segment and current token) is highlighted.

The performance impact on the user web interface when Wikispeech is enabled but before the Listen action is selected, or when Wikispeech is disabled, is next to none.

When enabled (Listen action selected), the player will always request the segmented version of the page. This occurs prior to hitting the play button. We believe that a user who chooses to enable the Wikispeech player prefers a faster response when hitting the play button over saving a bit of bandwidth and Javascript memory in case the play button is never hit.

For a very large page such as the Barack Obama-page the segmentation DOM metadata, used to highlight what is currently being read, is represented by a 500kB JSON-blob. We have not been able to profile the actual amount of RAM this consumes when deserialized to instances within the javascript environment of a browser. For a more normal single letter page size article, the DOM metadata JSON-blob is about 3kB.

Any manipulation of the DOM is done in such a way that it is restored when no longer needed, to not interfere with other components.

Due to the garbage collection of modern browsers, it is rather difficult to evaluate the true RAM consumption caused by playing utterances (Javascript memory footprint) and highlighting text (DOM memory footprint).

We can, from profiling and manually forcing garbage collection, clearly see that we are not causing any zombie DOM elements or other memory leaks. The RAM footprint impact from temporarily modifying the DOM during highlighting is thus very small, but will cause small bursts of CPU processing as the garbage collector is executed due to highlighted text being constantly modified.

The Barack Obama page will consume about 6MB of Javascript RAM when loaded. When hitting the play button it quickly reaches 7MB. Due to garbage collection it will stay below 7MB at all times throughout playing the complete page content from start to end. The results for a more normal single letter page size article are so similar to those of the Barack Obama case that we can't quite tell them apart. It seems to us that the RAM footprint of Wikispeech execution is rather tiny compared to that of the rest of the Javascript libraries (jQuery, skins, etc) MediaWiki depends on.

The client uses HTML5 to play Opus audio in an Ogg envelope. Utterances are requested one at a time in advance. Further thoughts on a smoother playback experience are considered in T160149, which includes the cost of loading more data in advance and thus increasing the memory footprint for the client.

The audio data is received as JSON containing a Base64 encoded field with the audio. There is thus a small extra cost for downloading the complete audio and decoding the Base64 compared to streaming raw audio from the server. But utterances are generally small chunks of data, around 70kB each. For further protection it’s possible to limit the maximum size of a segment (and thus utterance) in order to avoid performance problems for both the client and servers.
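For illustration, the client-side handling amounts to decoding one Base64 field out of the JSON response. A sketch in Python (the field name is an assumption, and the extension itself does this in Javascript):

```python
import base64
import json

# Hypothetical response shape; the real Wikispeech API field names may differ.
response_body = json.dumps({
    "audio": base64.b64encode(b"OggS fake opus audio").decode("ascii"),
})

# The client parses the JSON and Base64-decodes the audio field
# before handing the bytes to the HTML5 audio player.
payload = json.loads(response_body)
audio = base64.b64decode(payload["audio"])
print(audio[:4])  # the Ogg envelope's magic bytes, b'OggS'
```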

There is further information about Base64 in the end of the next section about the extension.

MediaWiki instance, extension and database

The CPU footprint on the MediaWiki instance is mainly due to the segmentation of page text. This is however a rather speedy procedure, which is also cached to further minimize the impact of repeated executions. Segmenting the Barack Obama-page on an i7-8650U CPU @ 1.90GHz takes about 600-900 milliseconds depending on current CPU load. That page is about 170k characters divided into 42k tokens, plus whitespace and punctuation. This translates roughly to 0.2-0.3 milliseconds per segment or 0.003-0.005 milliseconds per character. When already in cache, these values approach zero. Segmentation is not a bottleneck.
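The per-segment and per-character figures follow directly from the totals; a quick check of the arithmetic:

```python
# Back-of-the-envelope check of the segmentation figures above.
characters = 170_000          # approximate characters on the Barack Obama page
segments = 2824               # segment count from the benchmark results
total_ms_low, total_ms_high = 600, 900

print(total_ms_low / segments)     # ~0.21 ms per segment
print(total_ms_high / segments)    # ~0.32 ms per segment
print(total_ms_low / characters)   # ~0.0035 ms per character
print(total_ms_high / characters)  # ~0.0053 ms per character
```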

Utterances representing unique text segments per page are generated in the backend and sent to the frontend. Utterances are cached in the MediaWiki instance using an extension database table for utterance metadata, and a FileBackend for utterance audio and synthesis metadata. This cache can be configured in a variety of ways to manually and automatically flush out data. By default a job is queued every 30 minutes which flushes utterances older than 31 days. In addition to limiting the size of the cache by removing rarely requested utterances, this periodic flushing also ensures that improvements to Speechoid eventually reach the cache even for commonly requested utterances.

For each utterance, a metadata row of about 150 bytes is created in the utterance cache database table.

For each utterance a synthesis metadata JSON file is stored in the FileBackend. The size in bytes depends on the size of the segmented text: about 25 bytes plus the token text for each token in the segment. As the mean segment contains 15 tokens and the mean token is 4.5 characters, this adds up to about 450 bytes of data.
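A quick check of that estimate, using the mean figures quoted above:

```python
# Estimated synthesis metadata size per utterance.
bytes_per_token_overhead = 25   # approximate JSON overhead per token
mean_tokens_per_segment = 15
mean_token_length = 4.5         # characters; ~1 byte each for English text

estimate = mean_tokens_per_segment * (bytes_per_token_overhead + mean_token_length)
print(estimate)  # 442.5, i.e. roughly the 450 bytes quoted above
```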

The true size will however depend on the underlying filesystem used. For instance, ext4 commonly uses a 4kB allocation unit, meaning that each file actually consumes 4kB of the disk, while a compressed Swift backend might consume less than 450 bytes.

It is tough to give a good estimate of how much the audio data of each utterance consumes in the cache, as this depends on language and the complexity of the written text. However, about 4.5kB per word in an English page is a fairly good ballpark figure. This translates roughly to a 70kB utterance audio file per sentence.

Utterances are stored as Base64 in the FileBackend. Thus, the mean 70kB utterance audio file consumes 93kB when stored since it takes 4 bytes to store 3 bytes as Base64. This means 33% more disk use, but also keeps the CPU requirements down a bit as there will be no need to Base64 encode the binary data prior to passing it on to the client. Again, the true amount of space consumed depends on the underlying filesystem.
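The 4-bytes-for-3 overhead can be verified directly; for a mean-sized 70kB utterance:

```python
import base64

utterance = b"\x00" * 70_000  # stand-in for a mean 70 kB utterance audio file
encoded = base64.b64encode(utterance)

print(len(encoded))                             # 93336 bytes, ~93 kB
print(round(len(encoded) / len(utterance), 2))  # 1.33, i.e. 33% overhead
```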

Replacing the Base64 encoded utterances with binary data all the way from the cache to the player in the webclient via true streaming could be implemented but will require quite a bit of code and finding new solutions for both the webclient and the extension. It requires more investigations to find out whether this would actually save any resources for the client or if it just adds complexity.

Simply storing utterances as binary and Base64 encoding them on request would be a simple thing to implement and would save 25% of the disk use in exchange for more server-side CPU cycles.

Speechoid backend

Speechoid is where all the hardware resources are consumed. On our test VPS on wmflabs (1 VCPU, 2GB RAM) we see a ratio of synthesized voice length to time spent synthesizing of around 1.5:1, at 100% thread CPU usage and 600MB RAM usage. Thus such a machine has enough resources to simultaneously synthesize two segments faster than real time.

Speechoid is horizontally scalable. The simplest approaches we can think of are to either deploy multiple pods containing the complete suite of services, or to deploy multiple instances of the most resource-intensive services, balanced using simple round-robin scheduling.

In order to avoid overloading Mary TTS, our current main text-to-speech engine, we've added an HAProxy in front of the service to queue requests. The number of requests passed down to the actual service needs to be configured to one per available native thread to make sure we use all available CPU. Our assessment is that we prefer queuing requests and spending CPU on synthesis rather than sharing the CPU between too many concurrent requests. This way, requests that time out are dropped in the queue and never passed on to synthesis, rather than accepting all requests and possibly overloading the CPU synthesizing text that might be dropped due to timeout.
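As a sketch, the queuing setup described above could look roughly like the following HAProxy fragment. The backend name, address, port and timeout are illustrative assumptions, not the actual Wikispeech configuration:

```
# Illustrative sketch only -- names, port and timeout are assumptions.
backend marytts
    timeout queue 30s          # requests waiting longer than this are dropped
    # maxconn 1 per server: one in-flight synthesis per native thread;
    # excess requests wait in HAProxy's queue instead of overloading the TTS.
    server marytts1 127.0.0.1:59125 maxconn 1
```

With one `server` line per available native thread, HAProxy holds surplus requests in its queue and drops the ones that exceed the queue timeout before they ever reach the synthesizer.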

We have still not been able to communicate with the WMF service ops team to see what alternatives are available, but we’re hoping it’s possible to automate the scaling and queuing from within Kubernetes rather than from inside of our deployed pod.

How many resources need to be allocated to Speechoid has to be evaluated under user load. We have no such numbers here and now. We hope that deployment on the beta cluster will be helpful when assessing this.

Pronunciation lexicon

Wikispeech offers the possibility for users to update pronunciation of words using IPA.

During speech synthesis, Speechoid looks up all words of a segment in a database to find any user-specific IPA and passes it down to the speech synthesis engine. This database can be stored either in MariaDB/MySQL or in a local SQLite database. As of this review, it is a local SQLite database populated with default data in the Speechoid distribution.
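A minimal sketch of that lookup, assuming a simple (word, ipa) table; the actual Speechoid lexicon schema is more elaborate than this:

```python
import sqlite3

# Hypothetical single-table lexicon for illustration purposes.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE lexicon (word TEXT PRIMARY KEY, ipa TEXT)")
db.execute("INSERT INTO lexicon VALUES ('tomato', 'təˈmɑːtoʊ')")

def lookup(words):
    """Return user-supplied IPA overrides for the given segment tokens."""
    overrides = {}
    for word in words:
        row = db.execute(
            "SELECT ipa FROM lexicon WHERE word = ?", (word,)
        ).fetchone()
        if row:
            overrides[word] = row[0]
    return overrides

print(lookup(["i", "like", "tomato"]))  # only words with overrides are returned
```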

Wikispeech also includes support for adding a second-layer lexicon within the wiki in order to allow for revision history and rollbacks, just as with the revision history of any wiki page, since the Speechoid database does not support such a feature. The synchronization and handling of the two layers is implemented, but the actual second-level lexicon is not. We intend to implement it by storing lexicon entries as wiki pages in a new namespace. As of this review, this feature is not enabled; lexicon operations are sent straight to the Speechoid database.

There is a bit of I/O cost when it comes to updating lexicon information, especially since it involves operations in multiple persistence layers. But lexicon edits are expected to be very rare. They will probably mainly concern non-standard given names and homographs.

It is hard to predict a future worst case scenario for the lexicon stored in the wiki, but we only store words that have been changed compared to the lexicon distributed with Speechoid. So perhaps a few thousand words on a site such as English Wikipedia?

One important issue we have identified is how to handle synchronization problems with conflicts between the local lexicon and the one in Speechoid, e.g. when redeploying a new version of Speechoid that comes with a new default lexicon. This is however a discussion for the future, as it would require having Wikispeech running on multiple wikis from which we can gather changes we want to redistribute.

Speechoid benchmark results

We have developed a script that benchmarks the performance of the Wikispeech extension and the Speechoid services. Here are the results from executing that script on the English Wikipedia Barack Obama-page:

Benchmark results

Number of segments: 2824
Milliseconds spent segmenting: 694.09912109375
Mean milliseconds spent segmenting per segment: 0.24578580775274
Mean milliseconds spent segmenting per token synthesized: 0.016021122728597
Mean milliseconds spent segmenting per token character synthesized: 0.0038654262003595

Number of synthesized segments: 2824
Number of synthesized tokens: 43324
Number of synthesized token characters: 179566

Mean number of tokens per synthesized segment: 15.341359773371
Mean number of token characters per synthesized segment: 63.585694050991

Mean milliseconds synthesizing per token: 376.71673898994
Mean milliseconds synthesizing per token character: 90.890680863861
Mean bytes synthesized voice per token: 4404
Mean bytes synthesized voice per token character: 1062

Milliseconds of synthesized voice: 16320876
Seconds of synthesized voice: 16320
Minutes of synthesized voice: 272

Milliseconds spent synthesizing: 10745405
Seconds spent synthesizing: 10745
Minutes spent synthesizing: 179

Synthesized voice bytes: 190831440
Synthesized voice kilobytes: 186358
Synthesized voice megabytes: 181
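A couple of the headline figures can be cross-checked against each other:

```python
# Cross-checking the benchmark figures above.
ms_voice = 16_320_876    # milliseconds of synthesized voice
ms_spent = 10_745_405    # milliseconds spent synthesizing
voice_bytes = 190_831_440
tokens = 43_324

print(round(ms_voice / ms_spent, 2))  # ~1.52: matches the ~1.5:1 ratio cited
print(voice_bytes // tokens)          # 4404 bytes of audio per token
```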

Event Timeline

This is still a draft, hence the missing performance review tag.

Ping @Addshore, would you mind taking a glance at this text?

Looks nice and detailed but this will all need a review from the Performance-Team

Thank you for the extremely thorough work you've done analysing the performance of this new system.

I used the demo and noticed that API calls to get the audio are POST requests. Furthermore, when replaying an article from the beginning, they didn't reoccur. I assume that means that you're keeping the audio data in JS memory for the lifetime of the page.

Why did you choose POST requests, instead of GET requests for those API calls? GET would allow you to benefit from optional server-side and client-side caching of requests with those parameters. Server-side caching of the API responses means that the edge cache (Varnish/ATS) would be able to cache those responses, which improves the performance for visitors, as 2nd and beyond visitors of a page would get the response from their local Cache PoP rather than the active datacenter in the US. And client-side caching means that a user re-visiting an article would get the API responses from their browser cache.

Are the segment ids (eg. e37361e65b08c8aefcc8ecbab5e25818835496ecd398a26cdb582ff09edab543) hashes of the actual text? It would be ideal if they are, as it means that all caching strategies would be a lot more efficient when articles are edited, where usually the bulk of the content remains unchanged.

You mention the existence of a job to evict older versions of the data. If stored in Swift (which they would be in production for FileBackend), simply using an expiry header would be enough, as we now have swift-object-expirer enabled in production (T229584: Run swift-object-expirer as part of the swift cluster). This means that you wouldn't need to run a custom job to delete old entries; Swift would do it for you.

The HAProxy queuing and concurrency approach to maximize CPUs makes sense for a physical machine, this is exactly what we do for Thumbor. I don't know how that translates to Kubernetes, though, where I suspect Envoy would probably be recommended instead of HAProxy and auto-scaling might be worth considering. I defer to Service Ops on that topic.

On the subject of the lexicon, the part that's unclear to me is whether you intend to update all utterances where a word is present when that word's pronunciation is updated (via a job?), or simply wait for existing copies of the old pronunciation to expire after 31 days and be replaced by the new pronunciation when renewed.

It sounds like you've done a fair amount of profiling on the client side, it might be interesting to do some on the service as well, to figure out if it could benefit from internal in-memory caching of some repetitive tasks.

A few things will have to be investigated; here is what I can reply to right off the bat:

Segment hashes are computed from the segment text. Thus a text change on a page only requires re-synthesizing the updated segments (sentences). The hashes are however unique per wiki page. We came to the conclusion that segments shared between multiple pages are rather rare, so it is better to spend a bit of disk and ensure there are no hash conflicts across pages than to risk one. The number of segments across all wiki pages on e.g. English Wikipedia is enough for a potential conflict at some point, and having the completely wrong sentence read out would be really confusing for the user.
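A per-page content hash along these lines illustrates the idea; the exact fields the extension feeds into the hash are an assumption here, only the general approach is from the discussion:

```python
import hashlib

def segment_hash(page_title, segment_text):
    """Hash a segment's text together with its page, so identical
    sentences on different pages never share an utterance id."""
    data = (page_title + "\n" + segment_text).encode("utf-8")
    return hashlib.sha256(data).hexdigest()

a = segment_hash("Barack Obama", "He was born in Hawaii.")
b = segment_hash("Hawaii", "He was born in Hawaii.")
print(a != b)  # True: same sentence, different pages, different ids
```

Editing one sentence then only invalidates that segment's hash; the other segments on the page keep their cached utterances.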

Just evicting utterances in the FileBackend isn't quite enough for us. We need to keep track of which language, voice and page an utterance is associated with in order to evict per page, voice and language. We could rely on a filesystem directory structure such as /language/voice/page/segment, but that would mean traversing the structure to find what data to evict on manual or event-triggered eviction, such as via the maintenance script or when a user updates the lexicon. Also, keeping this metadata in the database allows for other future eviction strategies.

We do not plan to evict all segments of all pages when a word is updated in the lexicon. What we do instead is evict all segments of the page the lexicon edit was invoked from. We could tokenize the page and limit the eviction to only those segments that include the word, but we do not. Evicting the segments containing the word on all pages would require a compatible inverted index. We have not investigated whether ApiSearch can be used for this, but if its tokenization is compatible we could go that route. In the end we simply felt that evicting the page the user edited gives the impression of an instant change, even if only on that page, and we know that the background eviction job will cause eventual consistency. This is in many ways (programmatically, disk-resource-wise, system complexity, etc) a cheap solution. We would be happy to discuss this further if you want to.

We will get back to you on the rest.

Why did you choose POST requests, instead of GET requests for those API calls?

I think this is because of the limitations of request length when using GET. I don't remember if this was done as a precaution or if this caused any actual problems.

Early on, the requests were sent by a Javascript module, so in that case the browser could be a limiting factor for the request length. When we made the API I think we just copied it over and kept POST. Now that it's done with RequestFactory, that may not be an issue anymore.

Any URI length up to 2048 bytes is safe (IE being the bottleneck). I didn't see anything in the form POST data that got anywhere near that amount.

As for evictions, I understand that you want to be able to evict for other reasons than expiry and need to keep track of where things are stored in Swift, but for the expiry part you would be reinventing the wheel and creating a lot of needless jobs for something that could be handled at the Swift level. Less chatter between MediaWiki and the file storage would be a win.

I think I misunderstood what you referred to with GET/POST. There are (potentially) two requests to get the audio: one from the module to the MW API and one from MW to the service (if needed). I thought that you meant the latter. We'll have an answer regarding the former soon.

Thanks again for taking the time to check out the project. You’ve absolutely identified a couple of things that got past us!


We used POST because the request could change data. If an utterance already exists, it will just be retrieved, so no change there. However if there is no utterance for the given segment, a new one will be generated by Speechoid and saved to storage to be used later. If POST should not be used for this behaviour we are happy to use GET instead.

Re: Utterances in JS-heap

It turns out that we do keep utterances in the JS heap. We agree, this is not optimal. If we switch to GET requests we could remove the internal caching and run straight off the browser cache. If we stay with POST requests we will instead have to implement a small priority queue, as we do want a bit of cache if the user is going back and forth listening to some part of the article.

Re: Swift expiry of utterances

The biggest problem with switching to expiration of files in Swift is that not all MediaWiki file backends are Swift. Just dropping the extension in a standard MediaWiki-Vagrant installation will cause utterances to be stored in /srv/images/wikispeech_utterances. We could however detect whether the backend is Swift, and in that case add the TTL on file creation. We would still have to run our existing expiry job on the database (but skip the file delete action in case of a Swift backend) or the table will keep growing indefinitely.

This solution would not move complexity from the extension to Swift, it would rather increase the complexity within the extension. But it would indeed remove the delete actions in Swift currently invoked from the extension.

Re: Speechoid profiling

The short answer to possible improvements of the Speechoid services by adding more cache in that layer to lower their load, is that in most cases this would indicate a failure of caching in the extension. We are caching all repeat requests from the extension to Speechoid.

There are some instances where introducing caching at specific points in Speechoid might speed things up, e.g. when looking up words in the pronunciation lexicon. In this specific instance we would have to analyze how large a portion of the lookups is duplicated over a short time and weigh this against the overhead cost of the cache. This is however a tiny part of the resources spent while synthesizing utterances. Looking up pronunciations in the database is measured in milliseconds, while synthesizing an utterance in the actual speech synthesis implementation (read: Mary-TTS) is measured in seconds, or somewhere around 99.9% of the resources spent. Mary-TTS is a third-party product with a huge code base that is hard for us to dive into deeply enough to suggest possible improvements.

Our conclusion is that there is very little to win compared to the amount of time we need to spend in order to introduce improvements in the various Speechoid services.

if there is no utterance for the given segment, a new one will be generated by Speechoid

It sounds like the write operation is idempotent, in which case it's safe to use a GET request, which under a future multi-datacenter deployment of MediaWiki would be safe to re-run in a secondary datacenter. Given the browser cache benefits we discussed, please switch to using GET requests.

We could however detect if it is Swift, and in that case add the TTL on file creation. We would still have to run our existing expiry job on the database (but skip the FS delete action in case of a Swift backend) or the table will keep growing in infinity.

That sounds good to me!

Our conclusion is that there is very little to win compared to the amount of time we need to spend in order to introduce improvements in the various Speechoid services.

That sounds fair. Are there alternatives to Mary-TTS, in case its performance when under the full load of production is insufficient? It would be good to have a contingency plan at least, since it's hard to predict how that will behave under stress.

The subtask is the only actionable left from this review.

Just to clarify, does this mean that the code is good enough (from a performance point of view) that it could run on WMF servers? Given that the subtask is solved, that is.

From my perspective, yes, enough due diligence has been done on the performance front once that subtask has been completed. But the Performance Team doesn't decide what gets deployed to WMF production or not.

For confirmation, was this just review of the MW extension? Or does it include the various services etc? Or just the external interactions (GET etc) with those services?