
Microphone upload from browser for reading tutoring with pronunciation assessment
Open, LowPublic


As suggested on the to-do page of the Quiz extension, sound clips can be used for assessment.
The upload from browser feature isn't completed yet.
More information about microphone upload is available here

2018 version:
2017 paper:
2018 blueprint:
2020 multi-phrasal version:

Event Timeline

Hmm, is this really a quiz extension thing? Doesn't *sound* like it. (pun intended).

Mvolz changed the task status from Open to Stalled.Jun 3 2017, 1:46 PM
Mvolz changed the task status from Stalled to Open.Jun 3 2017, 1:59 PM

Maybe the task is for embedding sound files inside a quiz? Does that work? E.g., Wiktionary specifies that audio can be embedded. I think the reported feature is about creating a web interface that would allow a user's microphone to record a file, for which a proposal has been made (link).

It is a Wikiversity thing rather than a Quiz thing.

Mvolz renamed this task from Microphone upload from browser for reading tutoring with pronunciation assessment to Microphone upload from browser for reading tutoring with pronunciation assessment in Wikiversity .Jun 3 2017, 3:06 PM
Mvolz removed a project: MediaWiki-extensions-Quiz.

Not sure what to tag this then...

As suggested on the to-do page of the Quiz extension,

Links generally welcome so anyone can look up potential previous discussion.

More information about microphone upload is available here

Just pointing out (for anybody visiting it) that the page was last updated in 2010 and hence some info there might be outdated.

The upload from browser feature isn't completed yet.

Does that mean there is some code somewhere that you could point to? Or does "not completed" rather mean "not existing at all yet"? :)

@Harjotsingh: I assume you plan to work on fixing this, as the task is assigned to you and has a higher ("normal") priority set?

Would this end up as a MediaWiki extension (MediaWiki-extension-requests) which automatically converts to accepted file formats? Does "upload from browser" mean reusing Special:Upload somehow?
Or is this all too early to ask? Curious about your plans. :)

If you allow me to make a comment, audio files can be embedded inside a quiz as you can see here.

The page linked in the description is about allowing MediaWiki to record directly from the browser, create a file, and upload it to Commons, mostly to help with creating audio files for the Wiktionary. That would be a feature separate from the Quiz extension (I guess).

If you want to implement what is mentioned in the title of this item, I think that would look like: allow audio capture from the browser as part of a question, compare it with a specified file to figure out if it is close enough, report a result and discard the recording. That would help with projects trying to use the pronunciation files in Commons to teach languages.
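A minimal sketch of the "close enough" comparison step described above, assuming both recordings have already been reduced to per-frame feature values. This is a hypothetical illustration: a real system would compare MFCC vectors or phoneme posteriors rather than single floats per frame, and the threshold is a placeholder.

```python
def dtw_distance(a, b):
    """Length-normalized dynamic time warping distance between two
    feature sequences (lists of floats, one value per audio frame)."""
    INF = float("inf")
    n, m = len(a), len(b)
    d = [[INF] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            d[i][j] = cost + min(d[i - 1][j],      # insertion
                                 d[i][j - 1],      # deletion
                                 d[i - 1][j - 1])  # match
    return d[n][m] / (n + m)

def close_enough(learner_frames, reference_frames, threshold=0.1):
    """Compare the learner's recording with the reference, report a
    pass/fail result; the recording itself can then be discarded."""
    return dtw_distance(learner_frames, reference_frames) <= threshold
```

DTW is used here because the learner will not speak at exactly the reference tempo; it aligns the two sequences before summing frame costs.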

It would be nice but I think it is a complicated feature since it requires some sort of library to analyze and compare audio and I am not aware of anyone looking for it, at least in the en and es Wikiversities. Not sure about the other languages.

The suggestion was given at the archived to-do list.

Currently the proposal has the status "doing", but no link to code is available.

As @Lsanabria suggested, this would require a feature that allows MediaWiki to record directly from the browser, create a file, and upload it to Commons, mostly to help with creating audio files for the Wiktionary.

But it seems to be out of scope for my GSoC project, which primarily involves upgrading the Quiz extension and adding data storage. This would require building another extension/feature and then integrating it with Quiz.

Aklapper lowered the priority of this task from Medium to Lowest.Jun 12 2017, 3:16 PM

...and note that according to (but that's English Wikiversity only) only the OGG format seems to be allowed.

Jsalsman renamed this task from Microphone upload from browser for reading tutoring with pronunciation assessment in Wikiversity to Microphone upload from browser for reading tutoring with pronunciation assessment in Wiktionary .Sep 7 2017, 10:34 AM
srishakatux added a subscriber: srishakatux.

Removing the Possible-Tech-Projects tag as we are planning to kill it soon! This project does not seem to fit in the Outreach-Programs-Projects category in its current state, so I am not adding that tag right now!

@Brijsri and I are working on this. Our paper from last year just got cited by some speech language pathologist instructional designers at Texas A&M and Sydney:

Also, the 2011 bug fix from Dr. Nakagawa we are including is tremendously important, with social impacts on thousands of migrants:

Someone who wishes to remain anonymous offered to review and interface with the Wiktionary admin community last year. I've not forgotten that kindness and hope to accept it soon. We're building a freemium site doing adaptive learning for those who want to try words in context, and are independently fundraising for the pure javascript wiktionaries' solution.

@Jsalsman, I don't recall offering to review and interface w/ Wiktionary admin community. I do not edit on Wiktionary.

BAMyers, please accept my apologies for confusing you with someone who wishes to remain anonymous. Thankfully my mistake prevented a larger one.

and I are working on this. [...] We're building a freemium site

@Jsalsman: Could you please provide a URL where your work-in-progress code can be found? What does the word "freemium" mean?

@Aklapper sure, is the GSoC 4-features-per-phoneme feature extraction code from last year, as published. Since then we've added five more features per phoneme, as per slide 13 of , and soon we will have 10 features, adding the nasal flap. We're converting that from Python to Google Firebase, or at least we were before I started having latency problems with it, so we might just stick to Python Flask.

Freemium means a sliding scale. We want the adaptive site to be self-supporting, so occasional interstitial qualification (multiple-choice) questions are delivered to the learners, and if they have the financial means they are asked to contribute. The interstitial qualification system is also used for referrals. If they need to learn conversational English in a short time, we give up on them and try to get them to register with e.g. Berlitz or EF for an immersion class. That might be a source of revenue at greater volumes, too. All of these billing options are theoretical at present and not yet implemented.

Thanks, though I do not understand many words, like what "referrals" are here or what "phoneme" is or what an "adaptive site" is. Maybe too complicated for me.

Looks like your project uses to track tasks and might have a slightly different scope, if I understand correctly.

I forgot to include Brij's single-line widget for Wiktionary:

Referrals in this case would be people who want to learn a language faster than non-immersion, non-brick-and-mortar, or non-instructor-led-online teaching can manage. We have a pretty good chance at doing better than those last two (at class sizes above ~1.9 students per teacher, according to e.g. and ), so we "refer" those students in greater need of help than we can provide, and some proportion of them theoretically sign up for immersion classes, and some proportion of their tuition keeps the adaptive site running for the less affluent.

If we stick with the Python architecture on a Google Cloud instance like is now, we may be less than ten user stories from completion, working into the new attached database schema.

@Aklapper suppose for the sake of argument that the Foundation hires Brij as a contractor for the month he has before starting his Ph.D. at INRIA. Are there any downsides? The upside is that we need to build the freemium adaptive system before we can collect enough data to get the accuracy to levels appropriate for Wiktionary. (We've been saying 95%, and we're at about 82%, whereas the state of the art is 58% as per this 2018 SpeechRater 5.0 stat from ETS: )

@Jsalsman: That seems very offtopic for this task and I do not know why you ask me such questions.

@klove who is the correct resource to ask about this?

@Jsalsman: Please see and follow - again, this is off-topic for this very task.

@Brijsri: I am resetting the assignee of this task because there has not been progress lately (please correct me if I am wrong!).
Resetting the assignee avoids the impression that somebody is already working on this task. It also allows others to potentially work towards fixing this task.
Please claim this task again when you plan to work on it (via Add Action...Assign / Claim in the dropdown menu) - it would be welcome! Thanks for your understanding!

Jsalsman added subscribers: LucasWerkmeister, Halfak.

Thanks @Aklapper, and for your and @LucasWerkmeister's help on wikitech-l with e.g. and and my old strategy proposal. I am reviewing the first two.

There has been considerable progress beyond what is documented at , such as using the cubic (or perhaps harmonic) mean of the attached Microsoft SAPI 5(.1?) SDK C++ program's scores for unconstrained responses (free-form spoken-response questions instead of read-off-the-screen only), allowing temporal keyword spotting as a front end to the assessment interaction process:
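For reference, the two pooling options mentioned (cubic vs. harmonic mean) can be sketched as follows; the per-keyword confidence scores here are hypothetical inputs, not the SAPI program's actual output format:

```python
def cubic_mean(scores):
    """Cubic (power-3) mean: pulled upward by high-scoring keywords."""
    return (sum(s ** 3 for s in scores) / len(scores)) ** (1 / 3)

def harmonic_mean(scores):
    """Harmonic mean: dragged down sharply by any low-scoring keyword.
    Assumes all scores are strictly positive."""
    return len(scores) / sum(1 / s for s in scores)
```

The choice between the two decides how forgiving the keyword-spotting front end is: the cubic mean lets a few well-pronounced keywords carry the response, while the harmonic mean penalizes any badly missed one.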

Here is my latest attempt to load a database from my flat file collection:

I will keep you apprised of my progress there, which I expect to be rapid from this point on, thanks in large part to your help. Let me know if you would like me to hold a 30-minute workshop (plus Q&A) at the Foundation offices in San Francisco, please.

@Halfak thanks also for your help. (And, I'm sorry I didn't catch the toxicity sniffer bugs, those are some doozies!)

@Brijsri are those recorders better than the old WebRTC code we used two years ago?

@LucasWerkmeister thank you. is apparently the demo pointed to there. I am a huge fan of

A 0x0-pixel telephony version for the visually disabled is planned, which is just a chatbot asking people to say things and talking to them about what they got wrong when they get something wrong, e.g., the top one or two mistakes. We can use as a starting point, initially with a non-neural, procedural shell inside which we may later turn on the actual neural nets described in and implemented at -- for the time being, it will just be a glorified ELIZA. Technically it is a variety of LUNAR until we actually train a neural net of some kind with it, but it is probably better just to forward unknown responses to a more advanced chatbot. The non-visual version can use the Twilio API:

But of course the wikipedias, wiktionaries, and Wikiversity are probably not going to be interested in the non-visual version, so your help is spectacular.

@Halfak can you find someone who wants to make a chatbot summarizing and critiquing automatic parses of various speech recognition engines' transcription results using, for example, the LOGON parser?

@Brijsri can you use

to continue work on ?

@Brijsri here is the full 987-word file:

@Aklapper is there a graphics standard saying that visual representations should use heat-map color palettes which convey the same information in greyscale as they do in color, such as Viridis?

viridis.jpg (720×814 px, 55 KB)

@Jsalsman: Why do you ask me specifically? Why do you think that I could answer your question?

@Aklapper if you aren't familiar with the answer, your idea of who would be most likely to know is someone with whom I want to talk about accessibility.

@Jsalsman: My question was why you asked me specifically and why you think that I could answer your question, which you did not answer.
It feels like you've added a bunch of people to this task and some of those added people do not know why you're doing that.

@Aklapper do you want to be able to train dozens of languages or hundreds? I want to know if you know about graphics standards, and if you do, then I would like to talk to you about accessibility. If you don't, I would like to talk to the person you think is most likely to know who might know, about accessibility.

@Brijsri how's this?

import dataset  # https://dataset.readthedocs.io/
from sqlalchemy import Text
from sqlalchemy.dialects.postgresql import ARRAY

def load_homophones_and_phonemes(spelling_schema=True):
    db = dataset.connect(connect_string)  # connect_string defined elsewhere
    if spelling_schema:
        t = db['words']
        t.create_column('homops', ARRAY(Text))
    wt = db['words']
    pt = db['prompts']
    ut = db['utterances']
    with open('words.txt', 'r') as f:
        for l in f:  # was "while l in f", which never iterates
            # note: assumes every line contains the word "also"
            p = l.strip().split('#')[0].split('also')[0].split()
            h = l.strip().split('#')[0].split('also')[1].strip().split(', ')

@Jsalsman: Could you please simply answer my question instead? Again, based on which criteria do you ask random people about random things?

@Aklapper I can think of nobody other than you who would be more likely to know about graphics standards for colorblindness compatibility. I can think of nobody more likely than @Halfak who would know if there are people who want to work on an interactive speech-in and voice-out chatbot compatible with 0x0 pixel accessibility to the blind. I'm trying to get @Brijsri to transition our pronunciation assessment and intelligibility remediation system from Firebase to something more appropriate for Toolforge Labs, in a way it can work with both single wiktionary words as well as phrases in which they appear.

However, none of you are random unless you consider the stochastic Darwinian history of the universe, unlike my use of .split('also')[1] in the homophone parsing code above, because it assumes that each line has the word "also", which is incorrect. There is also a spurious semicolon that I was using to separate alternate pronunciations based on our collected exemplar database, in a duplicated entry for "live" -- which we have as almost entirely rhyming with "give" in our exemplars, even though rhyming with "dive" won a recent informal poll:
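A guarded version of that parsing step, assuming the same flat-file layout (a `#` comment delimiter, pronunciation fields, then an optional `also` clause listing comma-separated homophones), might look like:

```python
def parse_line(l):
    """Split one flat-file line into (pronunciation fields, homophones).
    Unlike the unconditional .split('also')[1], this tolerates lines
    with no 'also' clause. Layout assumptions are hypothetical."""
    body = l.strip().split('#')[0]  # drop trailing comment, if any
    if 'also' in body:
        head, _, tail = body.partition('also')
        prons = head.split()
        homophones = [h.strip() for h in tail.split(', ') if h.strip()]
    else:
        prons = body.split()
        homophones = []
    return prons, homophones
```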

@Jsalsman: I know absolutely nothing about "graphics standards for colorblindness compatibility", I wonder why you think that I would, and I do not understand most of the stuff that you've been writing here (which also makes me wonder if there is a big XY problem in this task). Cheers :)

@Aklapper Thanks anyway; if you find out let me know.

@Halfak if there is interest in using educational chatbots in any manner, there are various ethical considerations involved, e.g. this is located in West China, where corpora such as are not:

Screen Shot 2019-07-05 at 2.56.08 PM.png (402×502 px, 155 KB)
Please see also

@Brijsri I am at Google asking about from

@Halfak I have been unable to recruit but am trying again with a new approach.

Can you please suggest an appropriate consideration for receipt of (600 MB) under CC-BY-SA? It is: 986 words, 1150 prompts, 34623 utterances, and 84878 transcriptions in Postgres from

My preferred form of compensation is, "sure we can hire three or four speech/phonology, ML, telephony, QA, and DevOps people to try to get this to work, and here's a huge cash reward so you can pay off all your bills and take a long enough vacation that you can help mentor the project to completion ."

I am back to because I wish I were able to use React components without having to learn React. I still haven't sent anything out to the list yet, so if you want me to ask them or my ~600 volunteer learners, Brij and I can help with those too. For reference, we are trying to make an 80%-telephony system now, which will still have a web interface, like . I need to work on remixing the exemplars, by diphone this time instead of by phoneme. That will solve the problem with quiet consonants.

@Brijsri at this juncture we need to decide about whether to include anything from (which will get us pitch for Chinese, Vietnamese, etc.) and

Prompts and user interactions can be added to an intelligibility assessment and remediation system using Tolchirp, the Topic-Lesson-Choice-Response-Prompt (TLCRP) format, mediatype text/tlcrp. Tolchirp (spelled "tolchorp" in previous versions) is not YAML but is similar.

// example

topic: tolchirp example
 level: 1 // CEFR A1
 lesson: format
   level: 6 // CEFR C2
   choice: Are you enjoying the demonstration?
     media: beep.mp3
     mediatype: audio/mp3;text=filename
     response: {affirmative}
       result: good
     response: {negative}
       result: bad
     response: what
       result: Are you enjoying the demonstration?
       freeform: true
   choice: good
     media: Excellent!
   choice: bad
     media: I'm sorry to hear that.
prompt: {affirmative}
  words: yes|yeah|sure|ok|fine|ja|si
prompt: {negative}
  words: no|nah
Rendered, the example plays out as: topic "Tolchirp example" → lesson "Format" (level 6) → choice "Are you enjoying the demonstration?"; an {affirmative} response (yes, yeah) leads to "Good" → "Excellent!", a {negative} response leads to "Bad" → "I'm sorry to hear that.", and "what" repeats the question.
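Since text/tlcrp is not YAML, a dedicated reader would be needed. A hypothetical minimal first pass, assuming `key: value` lines, significant leading spaces, and `//` comments as in the example above (building the nested topic/lesson/choice tree from the indent levels is left to the caller):

```python
def parse_tlcrp(text):
    """Tokenize tolchirp-style text into (indent, key, value) tuples.
    A sketch only: strips `//` comments (so values containing `//`,
    e.g. URLs, are not supported), skips blank lines."""
    entries = []
    for line in text.splitlines():
        code = line.split("//")[0].rstrip()
        if not code.strip():
            continue
        indent = len(code) - len(code.lstrip())
        key, _, value = code.lstrip().partition(":")
        entries.append((indent, key.strip(), value.strip()))
    return entries
```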
Jsalsman renamed this task from Microphone upload from browser for reading tutoring with pronunciation assessment in Wiktionary to Microphone upload from browser for reading tutoring with pronunciation assessment.Mar 8 2020, 5:22 AM
Jsalsman claimed this task.
Jsalsman raised the priority of this task from Lowest to Medium.
Jsalsman updated the task description. (Show Details)
Aklapper removed Jsalsman as the assignee of this task.

Removing assignee as that account is disabled.

Aklapper lowered the priority of this task from Medium to Low.

Uh I did not want to change task status, sorry