Smarter utterance buffer
The current buffering is very basic: it loads only the next utterance when an utterance starts playing. This could be changed to load more utterances ahead, but it would be better to load based on the duration of audio that is ready to play, i.e. keep loading audio until the total loaded duration exceeds a threshold.
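A minimal sketch of the duration-based approach, written in Python for illustration only. The names `LoadedUtterance`, `fill_buffer`, `load`, and `threshold_s` are hypothetical and not taken from the existing code; the sketch also assumes each utterance's duration is known once it has been loaded.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, List


@dataclass
class LoadedUtterance:
    """Hypothetical container for an utterance whose audio is ready to play."""
    text: str
    duration_s: float  # length of the loaded audio, in seconds


def fill_buffer(
    pending: Deque[str],
    buffer: List[LoadedUtterance],
    load: Callable[[str], LoadedUtterance],
    threshold_s: float,
) -> None:
    """Load pending utterances until the buffered audio exceeds threshold_s.

    `load` stands in for whatever function fetches or synthesizes the audio
    (an assumption of this sketch).
    """
    buffered_s = sum(u.duration_s for u in buffer)
    while pending and buffered_s <= threshold_s:
        utterance = load(pending.popleft())
        buffer.append(utterance)
        buffered_s += utterance.duration_s
```

One way this might be used is to call `fill_buffer` once at startup and again each time an utterance finishes playing and is removed from `buffer`, so the amount of audio ready to play stays above the threshold for as long as utterances remain. Many short utterances then get loaded in a batch, while a single long utterance is enough on its own.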