Example (title pronunciation for Neptune): https://en.wikipedia.org/api/rest_v1/page/mobile-html-sections-lead/Neptune
- Expected: https://upload.wikimedia.org/wikipedia/commons/f/f3/En-us-Neptune.ogg
- Actual: https://en.wikipedia.org/wiki/File:En-us-Neptune.ogg
The Parsoid output at https://en.wikipedia.org/api/rest_v1/page/html/Neptune gives us:
<span typeof="mw:Image" data-mw='{"caption":"Listen"}'><a href="//upload.wikimedia.org/wikipedia/commons/f/f3/En-us-Neptune.ogg"><img resource="./File:Speakerlink-new.svg" src="//upload.wikimedia.org/wikipedia/commons/thumb/3/3b/Speakerlink-new.svg/11px-Speakerlink-new.svg.png" data-file-width="11" data-file-height="11" data-file-type="drawing" height="11" width="11"></a></span>
Ideas:
- We could look for the first occurrence of <span typeof="mw:Image" data-mw='{"caption":"Listen"}'> and take the href of the link inside it.
A selector like 'span[typeof*=Image][data-mw*=caption][data-mw*=Listen] a' should work. I ran into issues writing a selector that matches the exact values of these attributes. If anyone knows how, please let me know.
- Make an extra request to the MW API: https://en.wikipedia.org/w/api.php?action=query&prop=imageinfo&format=json&iiprop=url&titles=File%3AEn-us-Neptune.ogg
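
To make both ideas concrete, here is a rough sketch in Python using only the standard library. It is not how the service does it today. All function and class names are hypothetical. The first part assumes the Parsoid markup keeps the shape shown above (a span with typeof "mw:Image" whose data-mw JSON has caption "Listen"). The second part assumes the imageinfo response nests the result under query.pages.<id>.imageinfo[].url, which should be checked against a live response.

```python
import json
from html.parser import HTMLParser
from urllib.parse import urlencode


class ListenLinkParser(HTMLParser):
    """Idea 1: grab the href of the first <a> inside the 'Listen' image span."""

    def __init__(self):
        super().__init__()
        self.span_depth = 0  # > 0 while we are inside the matching span
        self.href = None

    @staticmethod
    def _is_listen(data_mw):
        # Compare the parsed JSON instead of the raw attribute string, so
        # quoting and entity escaping in the markup do not matter.
        try:
            return json.loads(data_mw or "").get("caption") == "Listen"
        except (ValueError, AttributeError):
            return False

    def handle_starttag(self, tag, attrs):
        if self.href is not None:
            return  # only the first occurrence is of interest
        attrs = dict(attrs)
        if tag == "span":
            if self.span_depth:
                self.span_depth += 1  # nested span inside the match
            elif attrs.get("typeof") == "mw:Image" and self._is_listen(attrs.get("data-mw")):
                self.span_depth = 1
        elif tag == "a" and self.span_depth and attrs.get("href"):
            self.href = attrs["href"]

    def handle_endtag(self, tag):
        if tag == "span" and self.span_depth:
            self.span_depth -= 1


def pronunciation_url_from_html(html):
    """Return the pronunciation file URL found in Parsoid HTML, or None."""
    parser = ListenLinkParser()
    parser.feed(html)
    parser.close()
    href = parser.href
    if href and href.startswith("//"):
        href = "https:" + href  # Parsoid emits protocol-relative URLs
    return href


def imageinfo_request_url(file_title, api="https://en.wikipedia.org/w/api.php"):
    """Idea 2: build the extra MW API request for a File: title."""
    params = {
        "action": "query",
        "prop": "imageinfo",
        "format": "json",
        "iiprop": "url",
        "titles": file_title,
    }
    return api + "?" + urlencode(params)


def url_from_imageinfo(response):
    """Pull the first file URL out of a decoded imageinfo response (assumed shape)."""
    for page in response.get("query", {}).get("pages", {}).values():
        for info in page.get("imageinfo", []):
            if "url" in info:
                return info["url"]
    return None
```

The first approach needs no extra request but depends on the caption text staying "Listen". The second costs one more round trip but only depends on the file title we already have.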