Share/sync with Mozilla's "Common Voice" project?
Open, Needs Triage, Public

Description

No idea if this makes sense / is in scope, but the "Please share your feedback here" link on https://lingualibre.fr/wiki/Help:Main brought me here.

https://voice.mozilla.org/en/data has voice data sets - you may want to also get listed or investigate if one could work together.
Just a heads-up; feel free to close. :)

Event Timeline

No idea if this makes sense / is in scope

Yes it is :)

https://voice.mozilla.org/en/data has voice data sets - you may want to also get listed or investigate if one could work together.

We had a talk about that just yesterday with @Pamputt and @Unuaiga. From our point of view, having Lingua Libre listed among their datasets could definitely be a good idea, but T196500 is currently blocking this.

Vvjjkkii renamed this task from Share/sync with Mozilla's "Common Voice" project? to w4aaaaaaaa. (Jul 1 2018, 1:04 AM)
Vvjjkkii triaged this task as High priority.
Vvjjkkii updated the task description.
Vvjjkkii removed a subscriber: Aklapper.
CommunityTechBot renamed this task from w4aaaaaaaa to Share/sync with Mozilla's "Common Voice" project?. (Jul 2 2018, 1:56 PM)
CommunityTechBot raised the priority of this task from High to Needs Triage.
CommunityTechBot updated the task description.
CommunityTechBot added a subscriber: Aklapper.

@Poslovitch, you seem knowledgeable on this issue. What do you think?

  • Aren't they already following us via the lingualibre.org/datasets/ page?
  • Should we send them periodic updates? (As I do for INALCO Université.)

Mozilla CommonVoice does not use our datasets; they are creating their own. However, one of their developers (if I'm not mistaken) is working on a DeepSpeech French model (and maybe others!) which uses our French dataset alongside the CommonVoice one and others.

I can't see a way in which we could "work together". In some ways, we might actually already be doing so.
Lingua Libre lets people record user-generated word lists and uploads the recordings to Commons; CommonVoice lets people record pre-defined sentences, and those recordings are then peer-reviewed and shipped in a somewhat curated dataset. Both projects aim at creating a library of sound recordings, yet CommonVoice is clearly aimed at providing training data for speech-to-text machine-learning algorithms.

Our primary goals differ, but the ways we're achieving them are kinda similar.

So, in a nutshell, I don't think there's a need to "partner" with CommonVoice. We are making datasets, and they are making their own. We are not listing CommonVoice's datasets on our datasets page, and they don't list ours either.

However, we could do the following things:

  • As on the Wiktionnaire homepage, we could list CommonVoice among other projects that are creating datasets of sound recordings and/or share goals similar to ours.
  • Advertise our datasets: the only "entry point" to our datasets seems to be word of mouth. How come we don't have a dedicated link in the homepage header (next to "Statistics"?)?! And what about having a nice, user-friendly page like CommonVoice's to download the datasets?