we don't need the key (i.e. the normalized title) as a string at all. We just need a hash of the normalized title. A 64 bit hash can be represented as an integer. You may still want to have a table for titles, so you don't have to store the same title 20 times if the title exists on 20 wikis. If you do this, I suggest to again use the (unnormalized) title's hash as the numeric representation of the title. This ensures consistency between wikis, and reduces the need for lookups.
Quick note: the "unnormalized" title would not be normalized according to Cognate rules, but it would still undergo regular title normalization (ucfirst, space to underscore, etc), as well as unicode normalization. The Title class takes care of this, but be careful when processing title strings from other sources.
For use with a BIGINT column in the database, you need a 16 digit hex string to represent a 64 bit number. The simples way to get a 16 digit hex string based on a hash is to take the 16 first digits of an sha1 hash:
substr( sha1( "Hello Worlds!" ), 0, 16 );
This is pretty inefficient though, lots of cycles are spend computing a 160 bit hash, though we only need 64 bit. I'll check if I can find something standard and more efficient.
You could also use two crc32-sums to generate the hash - one for the characters at even indexes, and one for the characters at odd indexes, and then concatenate them. But I suspect doing this in PHP is going to be slower than relying on a highly optimized implementation of SHA1 in the standard library...
According to some very quick research, truncated SHA1 or MD5 is the best solution in practice. Here's an overview of the hash algorithms supported by PHP, along with a benchmark:
SHA1 and MD5 are really fast, a lot faster than doing CRC32 twice. Any pure PHP implementation is going to be a *lot* slower. Stack Overflow agrees:
Here's a blog post that matches our use case nearly 100%:
They end up using truncated MD5.
In any case, whatever you use, make sure it is very clearly documented, not just in the code but also in the DB schema.
Btw, if we really want, we could also use a 128 bit hash by having two BIGINT columns, and an index spanning both of them. But I don't think that's worth the trouble.
One thing we should think about though: what exactly should happen if we do hit a collision? Ignore? Overwrite? Fail? Can we even detect it?
I think the collision case is rare enough that we don't have to worry much about it, and we can fail pretty hard if we hit it. But the behavior in such a case should be well-defined.