
AI/ML Model Request: Text-to-Speech
Open, Needs Triage, Public

Description

Please respond to the following questions, and provide as much detail as possible for each.

Scoping details

  • Use case: Describe the user-facing experience(s) that this model will serve. Who is the intended audience? Where and how will the model outputs be surfaced to users? Please feel free to link to any demos, prototypes, design files, etc.

Readers are looking for low-friction ways to get information, and we are interested in hands-free audio features to meet that opportunity. Two use cases we envision: (1) As a reader, I can tap/click a button and have Wikipedia read an article to me, with navigation controls (including skip ahead/back [10s], skip to section, and [0.75-2]x speed). (2) As a reader, I can speak a question into the in-article search and have Wikipedia read back the answer to me as it jumps to the highlighted answer within the article.

  • Model purpose: What should the model do? What does it need to predict or generate?

The model should generate audio (speech) at varying speeds from article content, as well as timestamps that will allow navigation.
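For illustration, the per-section output could pair the generated audio with a timestamp manifest used by the navigation controls. This is a hypothetical sketch; the field names, URL, and sentence-level granularity are assumptions, not a settled schema.

```python
# Hypothetical manifest pairing generated audio with navigation timestamps.
# All field names and values are illustrative placeholders.
manifest = {
    "article": "Ada Lovelace",
    "audio_url": "https://audio.example.org/ada_lovelace/lead.mp3",
    "speed": 1.0,                   # base narration speed
    "timestamps": [                 # sentence-level offsets, in seconds
        {"start": 0.0, "text": "Augusta Ada King, Countess of Lovelace ..."},
        {"start": 7.4, "text": "She was the first to recognise ..."},
    ],
}

def seek_offset(manifest, query):
    """Return the start time of the first sentence containing `query`,
    supporting the 'jump to the highlighted answer' use case."""
    for entry in manifest["timestamps"]:
        if query.lower() in entry["text"].lower():
            return entry["start"]
    return None

print(seek_offset(manifest, "first to recognise"))  # 7.4
```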

  • Goal: What's the goal of this user experience? What patterns in user behavior do we want to impact? What metrics will let us know we're successful?

We want to help readers see that Wikipedia is an easy place to find information, and thereby encourage them to come back more often.
North star metric is 21d logged-out reader retention. We'd also look at usage metrics, e.g., click-through.

  • Prior art: How much of the UI for this experience has already been developed and/or tested? Are there any previous models or manually-created rules that we can learn from?

I think Apps previously did an early exploration with ElevenLabs on a version of this feature that didn't pass muster.

Prioritization details

  • Timing: When are you hoping to launch an experiment or feature using this model? How flexible is your timeline? Is there any other planned work that's blocked by this experiment or feature?

FY26-27. The timeline is flexible, and we would love to partner on determining work-back timing together.

  • KR impact: Which KRs are enabled by this project, and how critical is this project for moving the needle on those KRs?

OW3.1

Other comments

  • [Optional] Model requirements: If you have any specific concerns around model performance (latency, cost, etc.) or model output quality (likelihood of false positives, ability to detect all possible instances, etc.), please note them here.
  • [Optional] Is there anything else you'd like to share?

Event Timeline

added info in description above, thanks so much!

Cross-posting some thoughts that @Dbrant shared on Slack:

Here are some thoughts I had about how this might work architecturally:

  • All of the audio would be pregenerated and stored as static .mp3 files. i.e. we wouldn't want to do any kind of real-time or on-demand audio generation.
  • For every article, there would be an audio file per section of the article, so that it would be possible to jump to a section, and then seek within the audio of that section.
    • It would make it simple to navigate from one section to another, and would enable features that work better with per-section audio (e.g., a playlist of the lead sections of the most popular articles).
    • It would also improve the efficiency of audio regeneration: if we know the section where an edit was made, we only need to regenerate that section, instead of the whole article.
  • The audio for all articles would be re-generated on a rolling ~weekly basis, to account for new edits made to articles. I'm imagining a service that runs continuously and (a) provides an API endpoint that serves the .mp3 URLs based on article/section title, and (b) schedules audio regeneration for articles that are out of date.
  • Given the sizeable storage space requirements of such a system, we could consider limiting coverage to the top ~100 or 1000 most-visited articles, and scale from there.
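The service described above can be sketched minimally: an endpoint that serves pregenerated MP3 URLs keyed by article/section, plus a scheduler that flags audio older than the rolling refresh window. This is a rough sketch under stated assumptions; the in-memory index, URL, and field names are placeholders, not a proposed implementation.

```python
import time

REGEN_INTERVAL = 7 * 24 * 3600  # rolling ~weekly refresh, in seconds

# In-memory stand-in for the real audio index: (article, section) -> metadata.
audio_index = {
    ("Ada Lovelace", "Early life"): {
        "url": "https://audio.example.org/ada_lovelace/early_life.mp3",
        "generated_at": time.time(),
    },
}

def get_audio_url(article, section):
    """(a) API endpoint behaviour: serve the MP3 URL for a section."""
    entry = audio_index.get((article, section))
    return entry["url"] if entry else None

def stale_sections(now=None):
    """(b) Scheduler behaviour: find sections due for regeneration."""
    now = time.time() if now is None else now
    return [key for key, entry in audio_index.items()
            if now - entry["generated_at"] > REGEN_INTERVAL]

print(get_audio_url("Ada Lovelace", "Early life"))
```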

Just jotting some requirements down to make sure we're all aligned.

Model Requirements
  • Natural, expressive, human-like speech with conversational flow
  • Multi-language coverage — suggest starting with English and expanding as needed
Some Candidate Models for initial exploration (will need to be enriched with more options)
Architecture & Infrastructure Considerations
  • Generate MP3 files per article section, store them (Ceph? Wikimedia Commons?)
  • REST interface needed for retrieval — need to investigate how Wikimedia Commons handles this
  • Audio and text must be indexed together for text-to-sound navigation
  • Schedule re-generation for sections that have changed (weekly rolling basis, as mentioned above). After the experiment, this would most likely need to be stream-based
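The change-driven regeneration above amounts to comparing the revision a section's audio was generated from against the section's current revision, and regenerating only on mismatch. A minimal sketch, assuming per-section revision tracking is available (the data shapes here are illustrative):

```python
def sections_to_regenerate(stored, current):
    """Return the (article, section) keys whose audio is out of date.

    stored:  {(article, section): revision the audio was generated from}
    current: {(article, section): latest revision of that section}
    """
    return [key for key, rev in current.items()
            if stored.get(key) != rev]

# Only the edited section is flagged, not the whole article.
stored = {("Cat", "Lead"): 100, ("Cat", "History"): 100}
current = {("Cat", "Lead"): 100, ("Cat", "History"): 105}
print(sections_to_regenerate(stored, current))  # [('Cat', 'History')]
```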
Proposed Initial Experiment Scope
  • Limit to top 100–1000 most-visited articles
  • English only to start
  • Generate per-section MP3s, storage to be decided (probably Ceph)
  • Build a small service with an endpoint to retrieve audio by article/section
Next Steps
  1. Investigate Ceph access patterns and whether Wikimedia Commons has existing infrastructure we can reuse
  2. Run a quick PoC with the candidate models above to compare quality
  3. Estimate storage costs for 100–1000 articles × average sections
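For step 3, a back-of-envelope estimate can be parameterized and refined once real numbers are in. All figures below are assumptions for illustration: average sections per article, minutes of audio per section, and MP3 bitrate are guesses, not measurements.

```python
def storage_estimate_gb(num_articles, sections_per_article=8,
                        minutes_per_section=3, kbps=64):
    """Rough MP3 storage estimate in GB (bitrate in kilobits/second)."""
    bytes_per_section = minutes_per_section * 60 * kbps * 1000 / 8
    total_bytes = num_articles * sections_per_article * bytes_per_section
    return total_bytes / 1e9

for n in (100, 1000):
    print(f"{n} articles: ~{storage_estimate_gb(n):.1f} GB")
```

Under these assumptions, the initial 100-1000 article scope stays in the low tens of GB, i.e. well within what a single bucket or share could hold.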
Things to figure out
  • Storage: where will the MP3s be stored? On Commons, or in Ceph directly?
  • Retrieval: how will they be retrieved? We'll likely need an article+section+revision index