The hash should be unique for a given segment (typically a sentence) and will be used as an identifier in the database and in filenames. Appending the length of the segment to the hash may help avoid collisions.
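As a rough illustration of the idea above, here is a minimal Python sketch that derives an identifier from a segment and appends the segment length. The function name `segment_id`, the choice of SHA-256, the UTF-8 encoding, and the `-` separator are all assumptions made for this example; the extension itself is written in PHP, and this is not its implementation.

```python
import hashlib


def segment_id(segment: str) -> str:
    """Return a hex digest of the segment with its byte length appended.

    Hypothetical helper: algorithm, encoding and separator are assumptions.
    """
    data = segment.encode("utf-8")
    # Hex digits and "-" are safe in filenames and database keys.
    digest = hashlib.sha256(data).hexdigest()
    # Two segments would have to share both digest and length to collide.
    return f"{digest}-{len(data)}"


if __name__ == "__main__":
    print(segment_id("This is a sentence."))
```

The same segment always yields the same identifier, so the identifier can double as a lookup key and as a filename stem.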
Details
Subject | Repo | Branch | Lines +/-
---|---|---|---
Segment hasher | mediawiki/extensions/Wikispeech | master | +67 -8

Status | Assigned | Task
---|---|---
Resolved | Lokal_Profil | T246080 ☂ Re-architecture storage of utterances
Resolved | kalle | T249200 Connect API to utterance store
Duplicate | kalle | T249203 Read utterance audio and data as files
Resolved | kalle | T248469 Create database for utterance data
Resolved | kalle | T248472 Create segment hasher
Event Timeline
I suggest SHA1 (160 bits). Its padding includes the message length, and it has been supported in PHP since PHP 4.
SHA1 is no longer considered a secure one-way hash: it is possible for well-funded organizations to brute-force it and reconstruct the original data. In practice that means something like running 10,000 modern computers for a couple of weeks. If this is a problem we could upgrade to SHA2 or even SHA3, but there is no native support for this in PHP.
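To make the trade-off concrete, the sketch below hashes the same segment with SHA1, SHA2 (SHA-256) and SHA3 (SHA3-256) and reports the digest size of each. It is written in Python purely for illustration; `hash_segment` is a hypothetical name, and nothing here reflects how the extension's PHP code selects an algorithm.

```python
import hashlib


def hash_segment(segment: str, algorithm: str = "sha1") -> str:
    """Hash a segment with the named algorithm (hypothetical helper)."""
    return hashlib.new(algorithm, segment.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    for name in ("sha1", "sha256", "sha3_256"):
        digest = hash_segment("This is a sentence.", name)
        # Each hex character encodes 4 bits of the digest.
        print(f"{name}: {len(digest) * 4} bits, {digest}")
```

Keeping the algorithm as a parameter means a later switch only changes the identifier length (40 hex characters for SHA1, 64 for the 256-bit variants), which matters if the hash is stored in a fixed-width database column.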
Although hashing a segment is simple to implement, we need to spend time understanding (and documenting) how data is passed between Speechoid, the user, the database and everything else, so that we can store utterances in the database, allow users to request these utterances based on hash and revision, and so on.
Change 593643 had a related patch set uploaded (by Karl Wettin (WMSE); owner: Karl Wettin (WMSE)):
[mediawiki/extensions/Wikispeech@master] Segment hasher
Change 593643 merged by jenkins-bot:
[mediawiki/extensions/Wikispeech@master] Segment hasher