we don't need the key (i.e. the normalized title) as a string at all. We just need a hash of the normalized title. A 64 bit hash can be represented as an integer. You may still want to have a table for titles, so you don't have to store the same title 20 times if the title exists on 20 wikis. If you do this, I suggest to again use the (unnormalized) title's hash as the numeric representation of the title. This ensures consistency between wikis, and reduces the need for lookups.
|mediawiki/extensions/Cognate : master||New schema & normalizing keys|
- Mentioned In
- rECOGcfa170323877: Update patch set 2
T183737: Cognate's StringHasher requires 64-bit PHP, doesn't work with 32-bit
rECOG69601da63f1b: New schema & normalizing keys
rECOG877fb74078c2: New schema & normalizing keys
rECOGe75a2b6f18a9: New schema & normalizing keys
rECOGf5cf24073fb4: New schema & normalizing keys
T149082: Security review for Cognate Extension
rECOG68ab41ce3259: New schema & normalizing keys
rECOGe49f8179e87c: New schema & normalizing keys
rECOGb4ee52e60f1f: New schema & normalizing keys
rECOGcdd13cf07eab: New schema & normalizing keys
rECOGc82e4144e32d: New schema & normalizing keys
rECOG1b6c674dee49: New schema & normalizing keys
rECOG181cf98c085d: New schema & normalizing keys
rECOGfce7cf173b38: New schema & normalizing keys
rECOG569ed8e0b31a: New schema & normalizing keys
rECOGb50be0a77e95: New schema & normalizing keys
rECOG9d51cb9d44ba: New schema & normalizing keys
rECOG021f241d79d7: New schema & normalizing keys
rECOGd47e145876e9: New schema & normalizing keys
rECOG4873c0d005bf: New schema & normalizing keys
rECOGe5acac5e46cf: New schema & normalizationg keys
rECOG647ac1ce9e2c: New schema & normalizing keys
Quick note: the "unnormalized" title would not be normalized according to Cognate rules, but it would still undergo regular title normalization (ucfirst, space to underscore, etc), as well as unicode normalization. The Title class takes care of this, but be careful when processing title strings from other sources.
For use with a BIGINT column in the database, you need a 16 digit hex string to represent a 64 bit number. The simples way to get a 16 digit hex string based on a hash is to take the 16 first digits of an sha1 hash:
substr( sha1( "Hello Worlds!" ), 0, 16 );
This is pretty inefficient though, lots of cycles are spend computing a 160 bit hash, though we only need 64 bit. I'll check if I can find something standard and more efficient.
You could also use two crc32-sums to generate the hash - one for the characters at even indexes, and one for the characters at odd indexes, and then concatenate them. But I suspect doing this in PHP is going to be slower than relying on a highly optimized implementation of SHA1 in the standard library...
According to some very quick research, truncated SHA1 or MD5 is the best solution in practice. Here's an overview of the hash algorithms supported by PHP, along with a benchmark:
SHA1 and MD5 are really fast, a lot faster than doing CRC32 twice. Any pure PHP implementation is going to be a *lot* slower. Stack Overflow agrees:
Here's a blog post that matches our use case nearly 100%:
They end up using truncated MD5.
In any case, whatever you use, make sure it is very clearly documented, not just in the code but also in the DB schema.
Btw, if we really want, we could also use a 128 bit hash by having two BIGINT columns, and an index spanning both of them. But I don't think that's worth the trouble.
One thing we should think about though: what exactly should happen if we do hit a collision? Ignore? Overwrite? Fail? Can we even detect it?
I think the collision case is rare enough that we don't have to worry much about it, and we can fail pretty hard if we hit it. But the behavior in such a case should be well-defined.