
Create segment hasher
Closed, Resolved | Public | 1 Estimated Story Points

Description

The hashes should be unique for a given segment (typically a sentence) and will be used as identifiers in the database and as filenames. Appending the length of a segment to the hash may be useful to avoid collisions.
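A minimal sketch of the idea described above, in Python for illustration only (the Wikispeech extension itself is PHP). The function name `segment_id`, the choice of SHA-256 as the default, and the `<hex digest>-<byte length>` layout are all assumptions made for this example, not something this task specifies.

```python
import hashlib


def segment_id(text: str, algorithm: str = "sha256") -> str:
    """Return a hypothetical identifier for a segment.

    The identifier is the hex digest of the UTF-8 encoded text, followed
    by a hyphen and the length of the encoded text in bytes. It contains
    only [0-9a-f-], so it can be used as a database key or a filename.
    """
    data = text.encode("utf-8")
    digest = hashlib.new(algorithm, data).hexdigest()
    return f"{digest}-{len(data)}"


if __name__ == "__main__":
    print(segment_id("This is a sentence."))
```

With this layout, two segments can only share an identifier if they have both the same digest and the same byte length, which is the extra collision guard the description suggests.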

Event Timeline

I suggest SHA1 (160 bits). The message length is already included in its padding. Supported in PHP through sha1() since PHP 4.3.0.

SHA1 is no longer considered a secure hash. The known weakness is in collision resistance: well-funded organizations can construct two different inputs with the same hash, although this does not let them reconstruct the original data. In practice that means something like running 10000 nodes of modern computers for a couple of weeks. If this is a problem we could upgrade to SHA2 or even SHA3, both of which are available through PHP's hash() function (SHA2 since PHP 5.1.2, SHA3 since PHP 7.1).
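To make the trade-off between the candidates concrete, here is a small sketch that reports the digest size of each algorithm mentioned in this discussion. It is written in Python for illustration (the extension is PHP), and the helper names `digest_bits` and `hex_length` are hypothetical. The hex length matters here because the hash is meant to serve as a database identifier and as a filename.

```python
import hashlib

# Algorithms mentioned in the discussion: SHA1, one SHA2 variant, one SHA3 variant.
CANDIDATES = ("sha1", "sha256", "sha3_256")


def digest_bits(algorithm: str) -> int:
    """Size of the digest in bits."""
    return hashlib.new(algorithm).digest_size * 8


def hex_length(algorithm: str) -> int:
    """Number of characters in the hex-encoded digest."""
    return hashlib.new(algorithm).digest_size * 2


if __name__ == "__main__":
    for name in CANDIDATES:
        print(f"{name}: {digest_bits(name)} bits, {hex_length(name)} hex characters")
```

Moving from SHA1 to a 256-bit digest lengthens the hex identifier from 40 to 64 characters, which would need to be accounted for in any column width or filename limit.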

kalle set the point value for this task to 20. Apr 2 2020, 10:43 AM

Although hashing a segment is simple to implement, we need to spend time understanding (and documenting) how data is passed between Speechoid, the user, the database and everything else, so that we can store utterances in the database, let users request these utterances by hash and revision, and so on.

kalle changed the point value for this task from 20 to 4. Apr 30 2020, 8:32 AM

Change 593643 had a related patch set uploaded (by Karl Wettin (WMSE); owner: Karl Wettin (WMSE)):
[mediawiki/extensions/Wikispeech@master] Segment hasher

https://gerrit.wikimedia.org/r/593643

Lokal_Profil changed the point value for this task from 4 to 1. May 14 2020, 8:35 AM

Change 593643 merged by jenkins-bot:
[mediawiki/extensions/Wikispeech@master] Segment hasher

https://gerrit.wikimedia.org/r/593643