Create segment hasher
Closed, Resolved
Public
1 Estimated Story Points

Description

Each hash should be unique for a given segment (typically a sentence) and will be used as an identifier in the database and in filenames. Appending the length of the segment to the hash may help reduce the risk of collisions.
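As a rough illustration of the idea above, the following Python sketch hashes a segment and appends its length. The function name `segment_id`, the choice of SHA-256, the hyphen separator, and the use of the UTF-8 byte length are all assumptions made for this example; they are not taken from the Wikispeech patch.

```python
import hashlib


def segment_id(segment: str) -> str:
    """Build an identifier from a segment: hex digest plus byte length.

    Hypothetical helper: the algorithm (SHA-256) and the "-<length>"
    suffix are illustrative choices, not the extension's actual format.
    """
    data = segment.encode("utf-8")
    digest = hashlib.sha256(data).hexdigest()
    # The result contains only hex digits, a hyphen and decimal digits,
    # so it can be used as a database key or a filename.
    return f"{digest}-{len(data)}"


if __name__ == "__main__":
    print(segment_id("This is a sentence."))
```

Two different segments of different lengths can never share such an identifier, even if their digests were to collide, which is the motivation for the length suffix.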

Details

	Subject	Repo	Branch	Lines +/-
	Segment hasher	mediawiki/extensions/Wikispeech	master	+67 -8


Related Objects

Status	Assigned	Task
Resolved	Lokal_Profil	T246080 ☂Re-architecture storage of utterances
Resolved	• kalle	T249200 Connect API to utterance store
Duplicate	• kalle	T249203 Read utterance audio and data as files
Resolved	• kalle	T248469 Create database for utterance data
Resolved	• kalle	T248472 Create segment hasher

Event Timeline

Sebastian_Berlin-WMSE created this task. Mar 25 2020, 1:55 PM

I suggest SHA-1 (160 bits). Its padding encodes the message length. PHP has supported it natively through sha1() since PHP 4.3.

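SHA-1 is no longer considered collision resistant: a well-funded organization can construct two different inputs that share the same hash, and a practical collision was demonstrated in 2017. This does not mean the original data can be reconstructed from the hash; the risk is that deliberately crafted inputs could collide. In practice such an attack takes something like thousands of modern machines running for weeks or longer. If this is a problem we could upgrade to SHA-2 or even SHA-3, both of which are available through PHP's hash() function (SHA-2 since PHP 5.1.2, SHA-3 since PHP 7.1).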
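To make the trade-off between the algorithms mentioned here concrete, this small sketch compares the digest sizes of SHA-1, SHA-256 (a SHA-2 variant) and SHA3-256. It is written in Python purely for illustration; the function name `digest_bits` is made up for this example and nothing here reflects the extension's PHP code.

```python
import hashlib


def digest_bits(segment: str) -> dict:
    """Return the digest size in bits for a few candidate algorithms.

    Illustrative only: the algorithm list is an assumption based on the
    options discussed in this task (SHA-1, SHA-2, SHA-3).
    """
    data = segment.encode("utf-8")
    sizes = {}
    for name in ("sha1", "sha256", "sha3_256"):
        # hashlib.new() looks the algorithm up by name.
        sizes[name] = hashlib.new(name, data).digest_size * 8
    return sizes


if __name__ == "__main__":
    for name, bits in digest_bits("This is a sentence.").items():
        print(f"{name}: {bits} bits, {bits // 4} hex characters")
```

A longer digest means longer identifiers and filenames (64 hex characters instead of 40), which is the main practical cost of moving away from SHA-1.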

Sebastian_Berlin-WMSE moved this task from Unsorted to Code hygiene and deployment requirements [MW Extension] on the Wikispeech-Text-to-Speech board. Apr 2 2020, 9:53 AM

Sebastian_Berlin-WMSE moved this task from Incoming to Backlog on the Wikispeech-Jobrunner board.

Although hashing a segment is simple to implement, we need to spend time understanding (and documenting) how data is passed between Speechoid, the user, the database and everything else, so that we can store utterances in the database, let users request those utterances by hash and revision, and so on.

• kalle moved this task from Backlog to Proposed for next sprint on the Wikispeech-Jobrunner board. Apr 2 2020, 10:43 AM

Sebastian_Berlin-WMSE moved this task from Proposed for next sprint to Sprint on the Wikispeech-Jobrunner board. Apr 2 2020, 10:46 AM

Sebastian_Berlin-WMSE edited projects, added Wikispeech-Jobrunner (Sprint); removed Wikispeech-Jobrunner.

• kalle claimed this task. Apr 27 2020, 11:03 AM

• kalle mentioned this in T248469: Create database for utterance data.

• kalle added a project: User-kalle.

• kalle moved this task from 🥴 Backlog to 🤠 This week on the User-kalle board.

• kalle moved this task from Backlog to In progress on the Wikispeech-Jobrunner (Sprint) board. Apr 30 2020, 8:26 AM

• kalle changed the point value for this task from 20 to 4. Apr 30 2020, 8:32 AM

Change 593643 had a related patch set uploaded (by Karl Wettin (WMSE); owner: Karl Wettin (WMSE)):
[mediawiki/extensions/Wikispeech@master] Segment hasher

https://gerrit.wikimedia.org/r/593643

gerritbot added a project: Patch-For-Review. May 1 2020, 12:07 AM

• kalle moved this task from 🤠 This week to 😘 Review on the User-kalle board. May 1 2020, 12:20 AM

Sebastian_Berlin-WMSE added a project: User-Sebastian_Berlin-WMSE. May 4 2020, 7:56 AM

Sebastian_Berlin-WMSE moved this task from Backlog to Reviewing on the User-Sebastian_Berlin-WMSE board.

• kalle moved this task from 😘 Review to 🤔 Awaits action on the User-kalle board. May 6 2020, 9:27 AM

• kalle moved this task from 🤔 Awaits action to 🤠 This week on the User-kalle board. May 12 2020, 9:59 AM

• kalle moved this task from 🤠 This week to 🤔 Awaits action on the User-kalle board. May 12 2020, 11:28 AM

Lokal_Profil changed the point value for this task from 4 to 1. May 14 2020, 8:35 AM

Change 593643 merged by jenkins-bot:
[mediawiki/extensions/Wikispeech@master] Segment hasher

https://gerrit.wikimedia.org/r/593643

• kalle mentioned this in rEWISb95140cb245f: Segment hasher. May 19 2020, 2:42 PM