Investigate Google SSML/IPA rendering issues
Closed, Resolved | Public | 2 Estimated Story Points

Description

As raised by Dom in T311233: Create Google phonos engine, Google is not correctly rendering audio for some SSML/IPA. We should investigate whether there is a common denominator between these issues (e.g. encoding of IPA Unicode characters, differing language phoneme support).

Event Timeline

Re T311233#8100597: (cc @dom_walden) It seems Google wants the language code to match up with the language of the actual word. I didn't test everything, but take Xochimilco for an example. This is a Spanish name, and even in the wikitext editors are using {{IPA-es}} to reinforce that it should be Spanish. If I pass the Spanish language code to Google, sotʃiˈmilko is pronounced correctly. Thus, perhaps we should be letting the editors (in this case the {{IPA-es}} template) dictate which language is used. Phonos currently lets you pass in the language code to override the default, which is the content language.

I couldn't get some of the examples from our corpus to work, but I think we're on to something in thinking the language code takes precedence in Google's algorithm.

To add to this, one thing we at least have going for us is that there's a reasonable expectation that editors (or the templates they use) will pass in the correct text to go with the IPA, so Google should usually still get it right. But it does seem odd that the IPA is completely ignored, and of course there will be cases where the plain text isn't pronounced correctly either.
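To make the interplay of text, IPA and language code concrete, here is a minimal Python sketch of the kind of request that could be sent to Google. It is not Phonos code (Phonos is PHP), and the helper names `build_ssml` and `build_request_body` are made up for illustration. The SSML `<phoneme>` element and the `input.ssml`, `voice.languageCode` and `audioConfig.audioEncoding` fields follow my understanding of the Cloud Text-to-Speech REST API, so treat the exact shape as an assumption.

```python
from xml.sax.saxutils import escape, quoteattr


def build_ssml(ipa: str, text: str) -> str:
    """Wrap the plain text in an SSML <phoneme> element carrying the IPA.

    The plain text is what gets spoken if the IPA is not used.
    """
    # quoteattr() returns the value already wrapped in quotes and escaped.
    return (
        "<speak>"
        f"<phoneme alphabet=\"ipa\" ph={quoteattr(ipa)}>{escape(text)}</phoneme>"
        "</speak>"
    )


def build_request_body(ipa: str, text: str, lang: str) -> dict:
    """Hypothetical helper: assemble the JSON body for a synthesize call."""
    return {
        "input": {"ssml": build_ssml(ipa, text)},
        # The language code comes from the editor (or template), not the wiki.
        "voice": {"languageCode": lang},
        "audioConfig": {"audioEncoding": "MP3"},
    }


body = build_request_body("sotʃiˈmilko", "Xochimilco", "es")
```

With `lang` set to `es` the IPA above is the case reported as working; the same IPA with the content language as `lang` is the case that was not.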

I might be thinking about this wrongly, but are we looking to pronounce IPA, or just pronounce the words? I was reading this old comment on Reddit and wondered…

It seems that Google gives pretty good results if you just give it text and a language code. For example (if we were to make an option for only supplying the text param) Google gives the following, where the first two are pretty much the same:

{{#phonos: text=Xochimilco |lang=es}}

{{#phonos: ipa=sotʃiˈmilko |lang=es}}

{{#phonos: text=Xochimilco |lang=en}}

I think that requiring editors to supply a lang code to Phonos is pretty reasonable. It'd be similar to the usage of the Use English templates, wouldn't it? i.e. on an Australian article, en-au could be set and so the pronunciation would be closer to regionally correct.

For example, this works {{#phonos: text=car park |lang=en-au}} where this {{#phonos: ipa=ˈkɑːpɑːk |lang=en-au }} fails to produce any output.

One advantage of only supplying text is that the Google quota usage will be much lower.

And as an added bonus, if you don't know how to write IPA, you could still add the pronunciation to the article.

Re T311233#8100597: (cc @dom_walden) It seems Google wants the language code to match up with the language of the actual word. I didn't test everything, but take Xochimilco for an example. ...

I repeated my tests but this time I passed the language. I took the language from the data.json file, only looking at the first two characters.

There are still some words it didn't pronounce correctly:

  • Tenochtitlan (tenoːt͡ʃˈtit͡ɬan)
  • Hyderabad (ˈɦaɪ̯daraːbaːd) (this is in English, perhaps it needs to be Hindi?)
  • Hasan Minhaj (ˈhʌsən ˈmɪnhɑː(d)ʒ)
  • Smørrebrød (ˈsmɶɐ̯ˌpʁœðˀ)
  • subtle (ˈsʌt(ə)l)
  • awful (ˈɔːfɫ̩)
  • fly (flaɪ̯)
  • catnip (ˈkætⁿnɪp)
  • apt (ˈæp̚t)
  • spotless (ˈspɒtˡlɨs)
  • peculiar (pʰə̥ˈkj̊uːliɚ)
  • key (k̟ʰi)
  • ευχαριστώ (ef.xa.ɾiˈsto)
  • chocolate (ˈt͡ʃɔk(ə)lɪt)
  • спасибо (spɐˈsʲibə)

These words returned the error below:

  • Ibibio (ɪbɪˈbiːəʊ) (perhaps it does not recognise the language code ib or ibb?)
  • wean (ˈwɪən) (lang code sco)
  • llandudno (ɬanˈdɨdno) (lang code cy)
{
    "error": {
        "code": "internal_api_error_MediaWiki\\Extension\\Phonos\\Exception\\PhonosException",
        "info": "[cf32e4f9e737a085a599afcb] Exception caught: Unable to retrieve audio using the Google engine: There was a problem during the HTTP request: 400 Bad Request",
        "errorclass": "MediaWiki\\Extension\\Phonos\\Exception\\PhonosException",
        "*": "MediaWiki\\Extension\\Phonos\\Exception\\PhonosException at /var/www/html/w/extensions/Phonos/includes/Engine/GoogleEngine.php(60)\nfrom /var/www/html/w/extensions/Phonos/includes/Engine/GoogleEngine.php(60)\n#0 /var/www/html/w/extensions/Phonos/includes/PhonosApi.php(37): MediaWiki\\Extension\\Phonos\\Engine\\GoogleEngine->getAudioData(string, string, string)\n#1 /var/www/html/w/includes/api/ApiMain.php(1901): MediaWiki\\Extension\\Phonos\\PhonosApi->execute()\n#2 /var/www/html/w/includes/api/ApiMain.php(875): ApiMain->executeAction()\n#3 /var/www/html/w/includes/api/ApiMain.php(846): ApiMain->executeActionWithErrorHandling()\n#4 /var/www/html/w/api.php(90): ApiMain->execute()\n#5 /var/www/html/w/api.php(45): wfApiMain()\n#6 {main}"
    },
    "servedby": "369413a69bdf"
}

Here is the output:

With @MPhamWMF's help, we were able to figure out that some of the examples Google is getting wrong are due to incorrect Unicode characters. In particular, from the list at T313711#8104497, the IPA that starts with an apostrophe ' should actually start with U+02C8 (ˈ). Google also seems to get confused by the optional notation using parentheses, which we understand we can safely remove. So in the case of Chocolate, for example, the IPA ˈt͡ʃɔk(ə)lɪt should be ˈt͡ʃɔkəlɪt. Using the latter, Google pronounces it correctly. Finally, there's also a colon that has a Unicode doppelgänger. If I understand correctly, ː should be : (the normal colon typed from your keyboard).

There are still some it gets wrong apparently, but I think we can safely say we need to do a find/replace server side for the lookalike apostrophes, colons, and strip out any parentheses. Our understanding is this should be safe to do. In addition, we could keep track of the fixes Phonos makes automatically and list them in the API response. This way, Phonos also serves as sort of an IPA "validator".
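A rough Python sketch of that find/replace, including the list of automatic fixes that could be echoed back in the API response. The function name `normalize_ipa` and the wording of the fix messages are invented here, and the real change would live in the PHP GoogleEngine. Only the apostrophe and parentheses rules are shown, since those are the two we are reasonably sure about.

```python
from typing import List, Tuple

PRIMARY_STRESS = "\u02c8"  # the IPA primary stress mark


def normalize_ipa(ipa: str) -> Tuple[str, List[str]]:
    """Return the normalized IPA and a list of the fixes that were applied."""
    fixes: List[str] = []

    # The ASCII apostrophe is a lookalike for the IPA stress mark.
    if "'" in ipa:
        ipa = ipa.replace("'", PRIMARY_STRESS)
        fixes.append("replaced apostrophe with U+02C8")

    # Parentheses mark optional sounds; drop the brackets, keep the sound.
    if "(" in ipa or ")" in ipa:
        ipa = ipa.replace("(", "").replace(")", "")
        fixes.append("removed parentheses")

    return ipa, fixes


normalized, fixes = normalize_ipa("'t\u0361\u0283\u0254k(\u0259)l\u026at")
```

Returning the fixes alongside the string is what would let Phonos double as a validator: an empty list means the input was already in the form Google accepts, at least as far as these two rules go.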

With respect to the :, I'm not sure which one Google wants. IPA will use the one that is NOT the normal colon typed from the keyboard, unless somebody is taking some shortcuts. Google may be fine with the IPA symbol.

Change 822464 had a related patch set uploaded (by MusikAnimal; author: MusikAnimal):

[mediawiki/extensions/Phonos@master] GoogleEngine: Normalize apostrophes and remove parentheses

https://gerrit.wikimedia.org/r/822464

The above patch probably doesn't fix everything, but it fixes what we know about so far with apostrophes and parentheses. I did not include a replacement for the colons, since Google doesn't seem to care which one is used.

Change 822464 merged by jenkins-bot:

[mediawiki/extensions/Phonos@master] GoogleEngine: Normalize apostrophes and remove parentheses

https://gerrit.wikimedia.org/r/822464

I repeated my previous experiment with all the IPA from our corpus, but setting the text parameter to "foo". When Google does not understand an IPA character, it ignores all the IPA and just pronounces the text, in this case "foo". (I used this wikitext P32560).
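For anyone wanting to repeat this, the test wikitext can be generated rather than typed by hand. The sketch below is in Python and is only a guess at the shape of such a page: the function name `phonos_test_wikitext` and the (IPA, language) pair format of the corpus are assumptions, while the `ipa`, `text` and `lang` parameters are the ones used earlier in this task.

```python
from typing import Iterable, Tuple


def phonos_test_wikitext(corpus: Iterable[Tuple[str, str]],
                         placeholder: str = "foo") -> str:
    """Emit one {{#phonos:}} call per corpus entry.

    The text is always the placeholder, so hearing it means the IPA was ignored.
    """
    lines = []
    for ipa, lang in corpus:
        # Doubled braces in the f-string produce literal {{ and }}.
        lines.append(f"{{{{#phonos: ipa={ipa} |text={placeholder} |lang={lang}}}}}")
    return "\n".join(lines)


page = phonos_test_wikitext([
    ("sotʃiˈmilko", "es"),
    ("flaɪ", "en"),
])
```

Any entry that plays back as the placeholder is one where Google fell back to the text.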

There are still some words it didn't pronounce correctly:

  • Tenochtitlan (tenoːt͡ʃˈtit͡ɬan)
  • Hyderabad (ˈɦaɪ̯daraːbaːd) (this is in English, perhaps it needs to be Hindi?)
  • Hasan Minhaj (ˈhʌsən ˈmɪnhɑː(d)ʒ)
  • Smørrebrød (ˈsmɶɐ̯ˌpʁœðˀ)
  • subtle (ˈsʌt(ə)l)
  • awful (ˈɔːfɫ̩)
  • fly (flaɪ̯)
  • catnip (ˈkætⁿnɪp)
  • apt (ˈæp̚t)
  • spotless (ˈspɒtˡlɨs)
  • peculiar (pʰə̥ˈkj̊uːliɚ)
  • key (k̟ʰi)
  • ευχαριστώ (ef.xa.ɾiˈsto)
  • chocolate (ˈt͡ʃɔk(ə)lɪt)
  • спасибо (spɐˈsʲibə)

It still pronounces all the above as "Foo", with the exception of the below which are now pronounced correctly:

  • Hasan Minhaj (ˈhʌsən ˈmɪnhɑː(d)ʒ)
  • subtle (ˈsʌt(ə)l)
  • chocolate (ˈt͡ʃɔk(ə)lɪt)

So there still seem to be IPA characters that Google does not understand, or that we need to convert to equivalent characters it does understand.

Test environment: local docker Phonos 0.1.0 (3ccf24e) 23:53, 17 August 2022.

For tenoːt͡ʃˈtit͡ɬan, I worked backwards removing a character at a time until Google correctly tried to use the IPA — tenoːt͡ʃˈtit "works", so it's something to do with

I guess we could do this per word and find the problematic characters?
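If we did want to do this per word, the trimming could be automated along these lines. This is a Python sketch only: `accepted` is a stand-in for however we decide that Google used the IPA (so far that has meant listening to the result), and `longest_accepted_prefix` is a made-up name.

```python
from typing import Callable, Optional


def longest_accepted_prefix(ipa: str,
                            accepted: Callable[[str], bool]) -> Optional[str]:
    """Drop one code point at a time from the end until `accepted` passes.

    Python strings are indexed by code point, so a base letter and its
    combining mark are removed in separate steps.
    """
    for end in range(len(ipa), 0, -1):
        candidate = ipa[:end]
        if accepted(candidate):
            return candidate
    return None  # no non-empty prefix was accepted


# Toy check standing in for the real engine: reject anything containing "x".
prefix = longest_accepted_prefix("abxcd", lambda s: "x" not in s)
```

The first code point after the returned prefix is the first suspect, although a word can of course contain more than one problematic character.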

I'll do a couple below, and if @MPhamWMF has any input that'd be appreciated!


  • tenoːt͡ʃˈtit͡ɬan works, pronounced almost correctly, as tenoːtʃˈtitɬan
  • ˈɔːfɫ̩ works, pronounced correctly, as ˈɔːfl
  • ˈkætⁿnɪp works, pronounced correctly, as ˈkætnɪp
  • flaɪ̯ works, pronounced correctly, as flaɪ
  • pʰə̥ˈkj̊uːliɚ works, pronounced correctly, as phəˈkjuːliɚ

I'm starting to see a pattern — the issue words have Combining Diacritical Marks, which when removed are rendered correctly...
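A quick way to test that pattern is to strip every combining mark before sending the IPA. The Python sketch below uses the Unicode general category `Mn` (nonspacing mark) as the definition of "combining", which is my assumption about what should be dropped, and `strip_combining_marks` is a made-up name. It does not decompose the string first, so precomposed IPA letters such as ç are left alone.

```python
import unicodedata


def strip_combining_marks(ipa: str) -> str:
    """Remove nonspacing combining marks, keeping every base character."""
    return "".join(ch for ch in ipa if unicodedata.category(ch) != "Mn")


# The non-syllabic mark (U+032F) under the last vowel is dropped.
fly = strip_combining_marks("fla\u026a\u032f")

# The tie bars (U+0361) are dropped; the length and stress marks are
# modifier letters rather than combining marks, so they survive.
tenochtitlan = strip_combining_marks(
    "teno\u02d0t\u0361\u0283\u02c8tit\u0361\u026can"
)
```

This only covers part of the list above: superscript letters such as ⁿ and ʰ, and the ɫ in awful, are not combining marks, so they would need their own replacement rules.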

I've been half-working on a node.js IPA validator/normalizer (and a composer package too!) — running the corpus through this prior to sending the IPA to Google (along with the text "foo" to test, and leaving the lang blank) results in all words pronouncing the IPA (mostly correctly) except for the following normalized IPA:

  • sotʃiˈmilko (works if you set lang to es)
  • paɾanɡaɾikutiɾiˈmikwaɾo (works if you set lang to es)
  • mexiko (works if you set lang to es)
  • ˈwɪən (works if you set lang to en-gb)
  • ˈsmɶɐˌpʁœðˀ (did not normalize fully, will fix!)
  • ef.xa.ɾiˈsto

https://phonos.theresnotime.io/w/index.php?title=T313711 has the demo (they're all cached)

This mostly makes sense to me. Lots of the diacritics are specifying some sort of modification to an existing sound -- e.g. has voicing or not; tongue touches teeth instead of the alveolar ridge behind the top teeth; etc -- and are generally not contrastive. Which is to say that rendering the un-diacriticked symbol should usually not change the meaning of the word in the target language. It may at times not sound fully accurate, but this was always going to be an issue due to what we talked about before about how narrow/broad IPA transcriptions decide to go. I think this is within an acceptable scope of "close enough".
My guess is that because each of these diacritics adds an extra dimension to an existing sound, trying to create a library for each possible combination would create a huge combinatorial space, of which many sounds may not be attested yet in real languages, so there would be no way of actually recording it properly anyway -- so people who create the sound recording libraries probably just skip them altogether unless they are very common/required for a language.

@TheresNoTime Very cool you made a composer package! Shall we start using that (or copy the code over) in Phonos? I think ideally end users who are using valid IPA shouldn't have to manipulate it to appease Google. Rather, any normalization should silently happen behind the scenes.

Thank you! It'd be very cool to say https://packagist.org/packages/theresnotime/ipa-validator was being used on Wikipedia, but having another external package probably isn't going to do us many favours in the security review?

I was thinking the same thing. For security review alone it probably makes more sense to just migrate the code to Phonos, which I guess technically might require first re-licensing it to GPL-2.0-or-later, or leaving a comment above the copied code in Phonos, ...or do neither! I have my doubts there will be any sort of litigation regarding you copying your own code :)

Discussed in last RTL, unsure if this should be closed as Resolved or if any tickets should be cut for follow-up action items.

@TheresNoTime mind letting me know?

Sorry, didn't get the ping! I've relicensed https://github.com/theresnotime/php-ipa-validator as GPL-2.0-or-later (GPL-3.0 I guess), so it can be migrated into Phonos (or, we can just lock theresnotime/ipa-validator to 1.0.5 which will be fairly safe security review wise?)

Great, thanks for the update! Any other outstanding work from this task or is it good to resolve?

Noting we talked to the Security team and they consider the new package to be low-risk, and thus we can use it \o/

I guess I'll code this up today since the task is already assigned to me, but the underlying credit for the work goes to Sammy as she wrote the package! :)

Change 829068 had a related patch set uploaded (by MusikAnimal; author: MusikAnimal):

[mediawiki/extensions/Phonos@master] GoogleEngine: use php-ipa-validator to normalize IPA input

https://gerrit.wikimedia.org/r/829068

Change 831998 had a related patch set uploaded (by MusikAnimal; author: MusikAnimal):

[mediawiki/vendor@master] Add theresnotime/ipa-validator

https://gerrit.wikimedia.org/r/831998

Since https://gerrit.wikimedia.org/r/831998 is stalled until the associated security review is completed (T316913), we are left with the option of either not doing this bit of normalization for the time being, or copying/pasting the code. I'm going to go with the former for now, and just put this ticket in Needs Attention so we don't forget about it. The patch to make use of ipa-validator is already +2'd and should merge on its own once https://gerrit.wikimedia.org/r/831998 is merged (which I assume the Security team will do).

  • catnip (ˈkætⁿnɪp)
  • apt (ˈæp̚t)
  • peculiar (pʰə̥ˈkj̊uːliɚ)
  • key (k̟ʰi)

These shouldn't be converted willy-nilly unless the engine is capable of reproducing the exact phonetic (= physical) realizations these symbols represent. [p̚], for example, means there's no popping noise arising from exhalation and pressure change when the lips are opened. Unless the engine can reliably output audio with and without such subtleties at command, these phonetic transcriptions should never be clickable.

/ˈmɪnhɑː(d)ʒ/ means /ˈmɪnhɑːdʒ/ or /ˈmɪnhɑːʒ/. If it doesn't return two distinct iterations then it should be considered a fail.
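To illustrate what returning two distinct iterations would involve, here is a small Python sketch that expands each parenthesized optional segment into a variant with it and a variant without it. It is not something Phonos does; `expand_optional` is a made-up name and the sketch assumes the parentheses are balanced and not nested.

```python
import re
from typing import List

# One parenthesized group with no parentheses inside it.
_OPTIONAL = re.compile(r"\(([^()]*)\)")


def expand_optional(ipa: str) -> List[str]:
    """Expand optional segments into every with/without combination."""
    match = _OPTIONAL.search(ipa)
    if match is None:
        return [ipa]
    before, after = ipa[:match.start()], ipa[match.end():]
    with_segment = before + match.group(1) + after
    without_segment = before + after
    # Recurse so that later optional segments are expanded as well.
    return expand_optional(with_segment) + expand_optional(without_segment)


variants = expand_optional("ˈmɪnhɑː(d)ʒ")
```

Each variant would then need its own request to the engine, so a transcription with n optional segments costs 2^n calls instead of one.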

Change 831998 abandoned by MusikAnimal:

[mediawiki/vendor@master] Add theresnotime/ipa-validator 1.1.1

Reason:

https://gerrit.wikimedia.org/r/831998

Change 829068 abandoned by Samtar:

[mediawiki/extensions/Phonos@master] GoogleEngine: use php-ipa-validator to normalize IPA input

Reason:

Upstream change abandoned

https://gerrit.wikimedia.org/r/829068