Create Larynx phonos engine
Closed, Resolved · Public · 5 Estimated Story Points

Description

Extend the EngineInterface (similar to the espeak demo) to send API requests to Larynx.


Accepted SSML format
<?xml version="1.0"?>
<speak xmlns="http://www.w3.org/2001/10/synthesis" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="1.1" xsi:schemaLocation="http://www.w3.org/2001/10/synthesis http://www.w3.org/TR/speech-synthesis11/synthesis.xsd" xml:lang="en-US">
  <lexicon xml:id="ipaInput" alphabet="ipa">
    <lexeme>
      <grapheme>{word}</grapheme>
      <phoneme>{ipa}</phoneme>
    </lexeme>
  </lexicon>
  <lookup ref="ipaInput">
    <w>{word}</w>
  </lookup>
</speak>
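
As a rough illustration of how the {word} and {ipa} placeholders above could be filled, here is a minimal sketch in Python (not the extension's PHP). The names SSML_TEMPLATE and build_ssml are hypothetical, the xsi attributes are left out for brevity, and the escaping step is an assumption about what a safe implementation would want rather than a description of LarynxEngine.php.

```python
from xml.sax.saxutils import escape

# Hypothetical template mirroring the accepted SSML format
# (xsi:schemaLocation attributes omitted for brevity).
SSML_TEMPLATE = """<?xml version="1.0"?>
<speak xmlns="http://www.w3.org/2001/10/synthesis" version="1.1" xml:lang="{lang}">
  <lexicon xml:id="ipaInput" alphabet="ipa">
    <lexeme>
      <grapheme>{word}</grapheme>
      <phoneme>{ipa}</phoneme>
    </lexeme>
  </lexicon>
  <lookup ref="ipaInput">
    <w>{word}</w>
  </lookup>
</speak>"""


def build_ssml(word: str, ipa: str, lang: str = "en-US") -> str:
    """Fill the template, escaping XML special characters in user input."""
    return SSML_TEMPLATE.format(
        word=escape(word),
        ipa=escape(ipa),
        # lang lands inside a double-quoted attribute, so quotes are escaped too.
        lang=escape(lang, {'"': "&quot;"}),
    )


if __name__ == "__main__":
    print(build_ssml("hello", "/həˈləʊ/", lang="en"))
```

Escaping matters because the word and IPA come from wikitext input, so characters such as & or < would otherwise produce malformed SSML.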

Event Timeline

Change 809211 had a related patch set uploaded (by MusikAnimal; author: MusikAnimal):

[mediawiki/extensions/Phonos@master] Add LarynxEngine and basic tests

https://gerrit.wikimedia.org/r/809211

Input

{{#phonos: text=hello | ipa=/həˈləʊ/ | type=ipa | lang=en}}

API call

/api.php?action=phonos&format=json&ipa=h%C9%99%CB%88l%C9%99%CA%8A%22&text=&lang=en-gb
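
For reference, a minimal Python sketch of how such a query string can be assembled. The parameter names are taken from the URL above; the helper name phonos_api_url is hypothetical, and nothing here is confirmed as the full set of parameters the API accepts. Note that the logged URL ends in %22, an encoded double quote, and has no slashes around the IPA; the sketch does not try to reproduce that.

```python
from urllib.parse import urlencode


def phonos_api_url(ipa: str, text: str = "", lang: str = "en",
                   base: str = "/api.php") -> str:
    """Build a Phonos API URL (hypothetical helper; parameters as seen above)."""
    params = {
        "action": "phonos",
        "format": "json",
        "ipa": ipa,    # non-ASCII IPA symbols are percent-encoded as UTF-8
        "text": text,
        "lang": lang,
    }
    return f"{base}?{urlencode(params)}"


if __name__ == "__main__":
    print(phonos_api_url("həˈləʊ", lang="en-gb"))
```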

Generated SSML

<?xml version="1.0"?>
<speak xmlns="http://www.w3.org/2001/10/synthesis" version="1.1" xml:lang="en">
  <lexicon xml:id="ipaInput" alphabet="ipa">
    <lexeme>
      <grapheme>hello</grapheme>
      <phoneme>/h&#x259;&#x2C8;l&#x259;&#x28A;/</phoneme>
    </lexeme>
  </lexicon>

  <lookup ref="ipaInput">
    <w>hello</w>
  </lookup>
</speak>
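
The <phoneme> content above uses hexadecimal character references (&#x259; and so on) in place of the raw IPA symbols. A small Python sketch showing that the two forms are equivalent once parsed; the helper name to_char_refs is hypothetical and is only one way to produce this encoding.

```python
import xml.etree.ElementTree as ET


def to_char_refs(text: str) -> str:
    """Replace every non-ASCII character with a hex XML character reference."""
    return "".join(f"&#x{ord(ch):X};" if ord(ch) > 127 else ch for ch in text)


def phoneme_text(encoded: str) -> str:
    """Parse a <phoneme> element and return its decoded text content."""
    return ET.fromstring(f"<phoneme>{encoded}</phoneme>").text


if __name__ == "__main__":
    encoded = to_char_refs("/həˈləʊ/")
    print(encoded)
    print(phoneme_text(encoded))
```

An XML parser on the receiving side sees the same string either way, so the character references by themselves should not change what is synthesized.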

Changing the SSML to

<?xml version="1.0"?>
<speak xmlns="http://www.w3.org/2001/10/synthesis" version="1.1" xml:lang="en">
  <lexicon xml:id="ipaInput" alphabet="ipa">
    <lexeme>
      <phoneme>/h&#x259;&#x2C8;l&#x259;&#x28A;/</phoneme>
    </lexeme>
  </lexicon>

  <lookup ref="ipaInput">
    <w>hello</w>
  </lookup>
</speak>

(i.e. removing the grapheme from the lexeme) seems to produce a better-sounding result?

Change 809211 merged by jenkins-bot:

[mediawiki/extensions/Phonos@master] Add LarynxEngine and basic tests

https://gerrit.wikimedia.org/r/809211

QA notes: Depending on what IPA you give it, there are noticeable issues with audio. See discussion at https://gerrit.wikimedia.org/r/809211 and Sammy's notes at T311234#8055863. I think we're going to figure all of that out in a separate task/patch (maybe T311693), so for now you can test with anything that works for you, and sort of pretend it sounds decent :) Anything you find that is WAY off is probably worth documenting here, however.

dom_walden subscribed.

I submitted all the words currently in our test corpus to the Phonos API. Below are the results in HTML including the word, IPA, SSML and audio file. Hopefully you can open it in your browser.

Further to T311234#8055863 I repeated it with lines 84 and 85 of LarynxEngine.php commented out so that the <grapheme> tag is not added. It does improve audio in most cases, in my opinion:

Better without grapheme:

  • Xochimilco (still not 100%)
  • Tenochtitlan (still not 100%)
  • Albuquerque
  • Mexico (but has an "English" pronunciation)
  • anticonstitutionnellement (but has an "English" pronunciation)
  • subtle
  • jewellery
  • hamburger
  • awful
  • fly (still not 100%)
  • catnip (still not 100%)
  • vin blanc (but has an "English" pronunciation)
  • wean
  • llandudno (still not 100%)
  • louisville (still not 100%)
  • chocolate
  • sushi
  • supercalifragilisticexpialidocious (sounds better to my ears)
  • cornwall
  • thames
  • gunwalloe (I think, I don't know how to pronounce this myself)
  • somerset

(Some of these would probably be improved even more if we chose the correct voice language.)

Still wrong without grapheme:

  • peculiar
  • apt
  • namaste (no sound)
  • nanri (no sound)
  • montreal
  • ευχαριστώ (no sound)
  • спасибо (no sound, but this sounds good with grapheme)
  • porthleven

Hmm, thinking about this again. I suspect that removing the <grapheme> tag means Larynx ignores the <phoneme> tag (with the IPA), so it just tries to pronounce the word inside the <w> tags rather than pronouncing the IPA. It does a better job pronouncing the word from its normal spelling than it does interpreting the IPA.

To test this hypothesis, I commented out lines 80 to 89 to completely remove the <lexeme> tag so we just have the <w> tag and the word in its normal spelling. The audio was very similar[1] to the audio produced without the <grapheme> tag. I uploaded it below.
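
The three SSML variants compared in this thread (full lexeme, lexeme without <grapheme>, and no <lexeme> at all) can also be derived from a single document instead of commenting out lines in the engine. This is a minimal Python sketch under that assumption; ssml_variants and the variant names are hypothetical, and leaving an empty <lexicon> in the third variant is a guess at what the commented-out code produced.

```python
import copy
import xml.etree.ElementTree as ET

NS = "http://www.w3.org/2001/10/synthesis"
ET.register_namespace("", NS)  # serialize SSML elements without a prefix


def _q(tag: str) -> str:
    """Return the namespace-qualified tag name used by ElementTree."""
    return f"{{{NS}}}{tag}"


def _remove_children(root: ET.Element, parent_tag: str, child_tag: str) -> None:
    """Remove every <child_tag> that is a direct child of a <parent_tag>."""
    for parent in list(root.iter(_q(parent_tag))):
        for child in parent.findall(_q(child_tag)):
            parent.remove(child)


def ssml_variants(ssml: str) -> dict:
    """Return the original SSML plus the two reduced variants, as strings."""
    full = ET.fromstring(ssml)

    no_grapheme = copy.deepcopy(full)
    _remove_children(no_grapheme, "lexeme", "grapheme")

    no_lexeme = copy.deepcopy(full)  # assumption: the <lexicon> stays, but empty
    _remove_children(no_lexeme, "lexicon", "lexeme")

    trees = {"full": full, "no_grapheme": no_grapheme, "no_lexeme": no_lexeme}
    return {name: ET.tostring(tree, encoding="unicode") for name, tree in trees.items()}
```

Sending all three variants for the same word and comparing the audio would show whether the second and third really behave alike, which is what the hypothesis predicts.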

Tagging @TheresNoTime @MusikAnimal.

Notes

  1. When testing Larynx locally I found a certain amount of unreliability in the audio when I repeated the same request for a pronunciation multiple times. Each time could have a slight variation in pronunciation.
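
A crude way to quantify the variation described in note 1 is to repeat the same request and count how many distinct audio payloads come back. The Python sketch below assumes a caller-supplied fetch_audio function (hypothetical; it would wrap whatever request returns the audio bytes) and compares payloads by hash. Byte-level differences do not necessarily mean an audible difference, so this only flags cases worth listening to.

```python
import hashlib
from collections import Counter
from typing import Callable


def audio_fingerprints(fetch_audio: Callable[[], bytes], attempts: int = 5) -> Counter:
    """Call fetch_audio repeatedly and count occurrences of each SHA-256 digest.

    fetch_audio is a hypothetical callable that performs one synthesis
    request and returns the raw audio bytes.
    """
    return Counter(
        hashlib.sha256(fetch_audio()).hexdigest() for _ in range(attempts)
    )


if __name__ == "__main__":
    # Stand-in fetcher for demonstration only; a real one would call the engine.
    counts = audio_fingerprints(lambda: b"fake audio bytes", attempts=3)
    print(f"{len(counts)} distinct output(s) across {sum(counts.values())} requests")
```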

I guess we can have the debate about how to make the Larynx engine better at some point in the future.

For now, I have nothing more to add here.

@NRodriguez Actually I can't create a Patch Demo since the Phonos extension is not available on it yet. I can make a pull request to get that added, but considering we aren't planning to use the Larynx engine in production, in addition to its known problems (see comments above), perhaps it's not worth the hassle. If you still want to test it let me know, but otherwise this task can probably be closed.

MusikAnimal set the point value for this task to 5. Jul 27 2022, 2:34 PM

I'm going to call this resolved, if only because the scope of the task (create this engine) has been completed and tested, and we are otherwise not focusing on Larynx.