
Avoid duplicate recording: compare list to speaker's previous records in target language to hide words previously recorded
Closed, ResolvedPublic

Description

Ideally, when in the Record wizard > Studio, facing a word list and ready to record, the list on screen could be compared with the speaker's previously recorded words. Via a checkbox, words already recorded could be toggled grey and skipped, or hidden completely. This feature request emerged recently as users are coming back <3 to record different word lists again.

Note also:

I had recorded 20 words "autour de moi" (around me). Now I've just started 20 more... and they are the same ones.
It could be interesting to add an option to avoid recording the same thing several times (my accent doesn't change from one day to the next).
Exilexi (talk) 05:36, 11 October 2018 (UTC)

Event Timeline

Yug renamed this task from Compare list to speaker's previous records in target language so to hide words previously recorded to Avoid duplicate recording : compare list to speaker's previous records in target language so to hide words previously recorded.Dec 23 2018, 9:52 PM
Yug updated the task description. (Show Details)

This could be a cool feature, but I fear all technical implementations would be resource-expensive and time-consuming. So before choosing a solution, we should test its scalability.

  1. Doing a SPARQL request could be a good idea, but when I tried to fetch all records of Davidgrosclaude (currently the biggest contributor to Lingua Libre, with ~20 000 audio recordings) it took up to 9s, which is way too much.
  2. Fetching the current user's contributions via the MediaWiki API could be even more expensive, due to the API limit of 500 pages per request, and there is no ability to filter per speaker or per language.
  3. Saving a static list somewhere would produce many edge cases and desyncs to manage.
0x010C triaged this task as High priority.Mar 15 2019, 11:03 AM

Maybe using LDF instead of SPARQL would be an option?

@MichaelSchoenitzer Yes, it could definitely be an option as it is really quick, but again we have a limit issue: the number of triples returned is limited to 100 per request. So each request is fast, but we would have to do several dozen of them to get all the data we need, losing a lot of time in network round-trips. I've dug into the subject a little bit but didn't see any way to raise this limit server-side, is there one?

Hello, I'm thinking of an out-of-Wikibase solution.

Given that, in a recording session, we want to avoid duplicates either against the current speaker's already-recorded items or against the current language's already-recorded items.

Given that it is possible to automatically store and list these previously recorded items either in the current speaker's userspace (for example user:Yug/fra-recorded or user:Yug/fra-history) or under the language's overall history under /fra-history.

Wouldn't it be a viable option to use a server-side approach: [[ http://man7.org/linux/man-pages/man1/sort.1.html | sort -u ]] to sort and uniquify both the history and the ongoing session's words, then the [[ http://man7.org/linux/man-pages/man1/comm.1.html | shell command comm ]] to compare them and return the ones never recorded:

comm - compare two sorted files line by line.
Compare sorted files FILE1 and FILE2 line by line.
With no options, produce three-column output.  Column one contains
lines unique to FILE1, column two contains lines unique to FILE2, and
column three contains lines common to both files.
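To make the proposal concrete, here is a minimal sketch of that pipeline. The file names and word lists are made up for illustration; `history.txt` stands for the speaker's stored history page and `session.txt` for the words queued in the current session:

```shell
#!/bin/sh
# Hypothetical inputs: the speaker's recording history and the
# word list of the ongoing session (order and duplicates arbitrary).
printf '%s\n' chat chien loup > history.txt
printf '%s\n' loup ours chat renard > session.txt

# comm requires sorted input, so sort-and-uniquify both lists first.
sort -u history.txt -o history.txt
sort -u session.txt -o session.txt

# comm's column 2 holds lines unique to the second file, i.e. the
# session words never recorded before; -13 suppresses columns 1 and 3.
comm -13 history.txt session.txt
# → ours
# → renard
```

Only `ours` and `renard` survive, since `chat` and `loup` already appear in the history.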

@Yug the real issue here is not how to compare the history and the current session, it is how to efficiently get (= in a way that will handle a load rise nicely) the recording history of a speaker. Storing everything in a wiki page is a good idea, but (as I said in my first comment) it would produce many edge cases and desyncs to manage. For example, during a recording session, if we save the history page before sending the files to Commons and a crash occurs, we would end up with words in the history that aren't on Commons. And conversely, if we save the history page after sending the files to Commons and a crash occurs, we would have files on Commons that aren't in the history.

tl;dr: A new API endpoint that runs a tricky MySQL query on the database should do the job.

I spent the whole day yesterday thinking about how to implement this feature efficiently and benchmarking potential solutions, and I think I have found a good way to do it.
I didn't mention in my first comment the possibility of querying our database directly. This is because Wikibase items are stored serialized in the text table (which is a particularly large table), so we would have had to run regex searches on it, which took on average 14s during my tests. This is really bad and could endanger the whole database.
But I've just come up with a new idea: unlike Wikidata, the data structure of Lingua Libre's items is fixed. If a language item is used on the item of an audio recording, it will always be with the P4 (language) property. The same applies to speakers. Knowing this, we can use the pagelinks table to get all items which have both a link to a specific language item and a link to a specific speaker item, and the wb_terms table to retrieve the label of an item (which for audio recordings is its transcription). After tests, this takes only ~50ms.

SELECT page.page_title, wb_terms.term_text
FROM pagelinks
INNER JOIN page ON page.page_id = pl_from
INNER JOIN wb_terms ON page.page_title = wb_terms.term_full_entity_id
WHERE
    pagelinks.pl_from_namespace = 0
    AND pagelinks.pl_namespace = 0
    AND pagelinks.pl_title = "Q42" -- the speaker's Qid
    AND pagelinks.pl_from IN (SELECT pl_from FROM pagelinks WHERE pl_from_namespace = 0 AND pl_namespace = 0 AND pagelinks.pl_title = "Q21") -- the language's Qid
    AND wb_terms.term_type = "label"

(I'm aware of T221764, but that will only be a small change to do at the version switch)
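The join logic above can be illustrated with a throwaway SQLite database. The table layouts below are drastically simplified stand-ins for MediaWiki's pagelinks / page / wb_terms tables, and all Qids and labels are made up: Q100 is an audio item linked to both speaker Q42 and language Q21, while Q101 is linked to the speaker only and should therefore be excluded:

```shell
#!/bin/sh
# Toy reproduction of the query's join logic (assumes sqlite3 is installed).
result=$(sqlite3 demo.db <<'EOF'
CREATE TABLE page (page_id INTEGER, page_title TEXT);
CREATE TABLE pagelinks (pl_from INTEGER, pl_from_namespace INTEGER,
                        pl_namespace INTEGER, pl_title TEXT);
CREATE TABLE wb_terms (term_full_entity_id TEXT, term_type TEXT,
                       term_text TEXT);

-- Q100 links to both speaker Q42 and language Q21; Q101 to Q42 only.
INSERT INTO page VALUES (100, 'Q100'), (101, 'Q101');
INSERT INTO pagelinks VALUES (100, 0, 0, 'Q42'), (100, 0, 0, 'Q21'),
                             (101, 0, 0, 'Q42');
INSERT INTO wb_terms VALUES ('Q100', 'label', 'bonjour'),
                            ('Q101', 'label', 'merci');

SELECT page.page_title, wb_terms.term_text
FROM pagelinks
INNER JOIN page ON page.page_id = pl_from
INNER JOIN wb_terms ON page.page_title = wb_terms.term_full_entity_id
WHERE pagelinks.pl_from_namespace = 0
  AND pagelinks.pl_namespace = 0
  AND pagelinks.pl_title = 'Q42'
  AND pagelinks.pl_from IN (SELECT pl_from FROM pagelinks
                            WHERE pl_from_namespace = 0 AND pl_namespace = 0
                              AND pl_title = 'Q21')
  AND wb_terms.term_type = 'label';
EOF
)
echo "$result"
# → Q100|bonjour
rm -f demo.db
```

Only Q100 comes back with its label (the transcription, in Lingua Libre's case); Q101 is filtered out by the `IN` subquery on the language item.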

I can now build a new MediaWiki action API endpoint that will handle this MySQL query, integrate it in the RecordWizard, and we should be done!

0x010C claimed this task.

Done in the commits dc6538b5, d3571387 and 39ca9e92.