Create database for utterance data
Closed, Resolved; Public; 0 Estimated Story Points

Description

Create a database for storing data about utterances with the parameters below.

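| Field | Type | Null | Key | Default | Extra |
| --- | --- | --- | --- | --- | --- |
| speech_id | int(10) | NO | PRI | NULL | auto_increment |
| lang | varbinary(35) | NO | | NULL | |
| seg_hash | ? | NO | MUL | NULL | |
| voice | varchar(30) | NO | | NULL | |
| dates_stored | binary(14) | NO | MUL | ? | |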

speech_id: Primary key of the table.
lang: Language of the segment (often that of the page). The type varbinary(35) is suggested to match the Page table; it can probably be a lot smaller.
seg_hash: Unique hash of the segment/sentence. To be investigated; appending the length to the end of the hash might make sense.
voice: The selected voice. Could also make use of the NameTableStore to make the type an int instead.
date_stored (could also be expiry_date): Indication of when to expire the entry. Having an expiration ensures that audio for sentences that have since been edited is phased out and that lexicon improvements are phased in. The exact logic for this is to be investigated.
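
As a starting point for the expiry logic, here is a minimal sketch. It assumes that the binary(14) column holds a 14-digit UTC timestamp of the form YYYYMMDDHHMMSS, which the task does not state explicitly. The function names are hypothetical and not part of Wikispeech.

```php
<?php
// Sketch only: toTimestamp14() and isExpired() are hypothetical helpers.

/**
 * Format a Unix time as a 14-character UTC timestamp (YYYYMMDDHHMMSS).
 */
function toTimestamp14( int $unixTime ): string {
	return gmdate( 'YmdHis', $unixTime );
}

/**
 * Decide whether an entry stored at $dateStored is older than
 * $maximumAgeHours, as seen from the Unix time $now.
 */
function isExpired( string $dateStored, int $maximumAgeHours, int $now ): bool {
	$cutoff = toTimestamp14( $now - $maximumAgeHours * 3600 );
	// Fixed-width digit strings sort chronologically,
	// so a plain string comparison is enough.
	return strcmp( $dateStored, $cutoff ) < 0;
}

$now = gmmktime( 12, 0, 0, 6, 11, 2020 ); // 2020-06-11 12:00:00 UTC
var_dump( isExpired( '20200610110000', 24, $now ) ); // stored 25 hours earlier
var_dump( isExpired( '20200611000000', 24, $now ) ); // stored 12 hours earlier
```

Because the comparison works on the stored string directly, the same cutoff value could be used in a range condition on an indexed date column. Whether the column should hold the storage time or a precomputed expiry time is still the open question above.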

For adding database tables for extensions see Manual:Developing extensions.

Event Timeline

If we choose SHA1 as the hash, then seg_hash is a 20-byte binary. I suggest renaming the field to seg_sha1 (or whatever hash we choose) to make it clear what the data is. That also leaves a bit more room for a future change of hash function, with a fallback to the previous hash.

kalle set the point value for this task to 5. Apr 2 2020, 10:24 AM

Need to investigate table creation scripts, read a bit of docs and what not.

New suggestion.

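| Field | Type | Null | Key | Default | Extra |
| --- | --- | --- | --- | --- | --- |
| speech_id | int(10) | NO | PRI | NULL | auto_increment |
| lang | varbinary(35) | NO | | NULL | |
| seg_hash | ? | NO | MUL | NULL | |
| page_id | int(10) | NO | MUL | NULL | |
| voice | varchar(30) | NO | | NULL | |
| dates_stored | binary(14) | NO | MUL | ? | |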

The hash is unique per page, not for the whole site, i.e. page_id must be added.
seg_hash is SHA256: 256 bits (32 bytes) as raw binary, 64 lower-case hex characters, or 44 base64 characters. It could also be a blob. I'd go for the hex string.
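
The sizes of the candidate encodings can be checked with PHP's built-in hash functions. This is only an illustration: the segment text is a made-up example, and the SHA1 line is included for comparison with the earlier suggestion.

```php
<?php
// Illustrative only: the segment text is a made-up example.
$segment = 'An example sentence.';

$raw = hash( 'sha256', $segment, true ); // raw binary digest
$hex = hash( 'sha256', $segment );       // lower-case hexadecimal string
$base64 = base64_encode( $raw );         // base64 with padding
$sha1Raw = sha1( $segment, true );       // raw SHA1 digest, for comparison

printf( "sha256 raw:    %d bytes\n", strlen( $raw ) );
printf( "sha256 hex:    %d characters\n", strlen( $hex ) );
printf( "sha256 base64: %d characters\n", strlen( $base64 ) );
printf( "sha1 raw:      %d bytes\n", strlen( $sha1Raw ) );
```

The hex string is twice the size of the raw digest but is readable in logs and query output, which is the trade-off behind preferring it here.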

@kalle Digging a bit into how the file hash is handled, it looks like it's converted to a different base very early on. This makes use of base_convert. See the function making use of it.
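
To show what such a conversion involves, here is a rough sketch of turning a hex digest into another base. It assumes base 36 as the target, since the comment does not say which base is used, and hexToBase36() is a hypothetical helper, not the function referred to above. PHP's built-in base_convert() is documented to lose precision on large numbers, so the sketch uses digit-by-digit long division instead.

```php
<?php
/**
 * Hypothetical helper: convert a hexadecimal string to base 36.
 * Assumes $hex contains only valid hexadecimal digits.
 */
function hexToBase36( string $hex, int $padTo = 0 ): string {
	$alphabet = '0123456789abcdefghijklmnopqrstuvwxyz';
	// The current value as base-16 digits, most significant first.
	$digits = array_map( 'hexdec', str_split( strtolower( $hex ) ) );
	$out = '';
	while ( $digits !== [] ) {
		// Divide the whole number by 36 and keep the remainder.
		$remainder = 0;
		$quotient = [];
		foreach ( $digits as $digit ) {
			$acc = $remainder * 16 + $digit;
			$q = intdiv( $acc, 36 );
			$remainder = $acc % 36;
			// Skip leading zeros in the quotient.
			if ( $quotient !== [] || $q !== 0 ) {
				$quotient[] = $q;
			}
		}
		// Each remainder is the next least significant base-36 digit.
		$out = $alphabet[$remainder] . $out;
		$digits = $quotient;
	}
	return str_pad( $out, $padTo, '0', STR_PAD_LEFT );
}

$hex = hash( 'sha256', 'An example sentence.' );
$base36 = hexToBase36( $hex );
printf( "%d hex characters -> %d base 36 characters\n", strlen( $hex ), strlen( $base36 ) );
```

The optional padding argument gives a fixed-width result, which matters if the value is stored in a fixed-length column.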

Change 596674 had a related patch set uploaded (by Karl Wettin (WMSE); owner: Karl Wettin (WMSE)):
[mediawiki/extensions/Wikispeech@master] Create database for utterance data

https://gerrit.wikimedia.org/r/596674

Aw man, did I just try to review to master rather than a new branch? Let me abandon that patch and do it again!

Change 596674 abandoned by Karl Wettin (WMSE):
Create database for utterance data

Reason:
Managed to create the patch on master rather than in a new branch. ABORTING

https://gerrit.wikimedia.org/r/596674

Change 596676 had a related patch set uploaded (by Karl Wettin (WMSE); owner: Karl Wettin (WMSE)):
[mediawiki/extensions/Wikispeech@master] Create database for utterance data

https://gerrit.wikimedia.org/r/596676

These are functions we need to discuss. They are WiP. Perhaps I shouldn't have included them yet, but it was just too tempting.

	/**
	 * Clears database of utterances older than a given age.
	 *
	 * @since 0.1.3
	 * @param $maximumAgeHours
	 */
	public function flushUtterancesByExpirationAge ( $maximumAgeHours ) {
		// todo flush items from file storage here?
		// todo return identity of flushed utterances?
	}

	/**
	 * Clears database of all utterances for a given page.
	 *
	 * @since 0.1.3
	 * @param $page_id
	 * @param $language
	 * @param $voice
	 */
	public function flushUtterancesByPage ( $page_id, $language, $voice ) {
		// todo flush items from file storage here?
		// todo return identity of flushed utterances?
	}

	/**
	 * Clears database of all utterances for a language and voice.
	 *
	 * @since 0.1.3
	 * @param $language
	 * @param $voice
	 */
	public function flushUtterancesByLanguageAndVoice ( $language, $voice ) {
		// todo flush items from file storage here?
		// todo return identity of flushed utterances?
	}

Lokal_Profil changed the point value for this task from 5 to 1. May 28 2020, 10:48 AM
Lokal_Profil changed the point value for this task from 1 to 0. Jun 11 2020, 8:22 AM

Change 596676 merged by jenkins-bot:
[mediawiki/extensions/Wikispeech@master] Create database for utterance data

https://gerrit.wikimedia.org/r/596676