
Use hashes to identify matching page titles in Cognate DB schema
Closed, Resolved · Public

Description

From https://phabricator.wikimedia.org/T148988#2782161 :

we don't need the key (i.e. the normalized title) as a string at all. We just need a hash of the normalized title. A 64 bit hash can be represented as an integer. You may still want to have a table for titles, so you don't have to store the same title 20 times if the title exists on 20 wikis. If you do this, I suggest to again use the (unnormalized) title's hash as the numeric representation of the title. This ensures consistency between wikis, and reduces the need for lookups.

Event Timeline

Quick note: the "unnormalized" title would not be normalized according to Cognate rules, but it would still undergo regular title normalization (ucfirst, space to underscore, etc.), as well as Unicode normalization. The Title class takes care of this, but be careful when processing title strings from other sources.

@daniel did you have a method in mind for getting a numeric hash of a string in order to store it in the DB?

For use with a BIGINT column in the database, you need a 16 digit hex string to represent a 64 bit number. The simplest way to get a 16 digit hex string from a hash is to take the first 16 digits of an SHA1 hash:

substr( sha1( "Hello Worlds!" ), 0, 16 );

This is pretty inefficient, though: lots of cycles are spent computing a 160 bit hash when we only need 64 bits. I'll check if I can find something standard and more efficient.
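
For illustration, here is one way the truncated hex digest could be mapped to a signed 64 bit integer suitable for a BIGINT column; the titleHash function is just a sketch, not code from the patch:

// Sketch only: map a title to a signed 64 bit integer by truncating its SHA1 hash.
function titleHash( $title ) {
	$hex = substr( sha1( $title ), 0, 16 ); // 16 hex digits = 64 bits
	// unpack( 'J' ) reads an unsigned 64 bit big-endian value; PHP integers are
	// signed, so values above PHP_INT_MAX come back negative, which is fine for
	// storage in a signed BIGINT column.
	return unpack( 'J', hex2bin( $hex ) )[1];
}

var_dump( titleHash( "Hello Worlds!" ) );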

You could also use two CRC32 sums to generate the hash: one over the characters at even indexes and one over the characters at odd indexes, then concatenate them. But I suspect doing this in PHP is going to be slower than relying on a highly optimized implementation of SHA1 in the standard library...
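
A rough sketch of that idea, just to make it concrete (not proposing this exact code):

// Sketch only: combine two CRC32 sums into one 64 bit value.
function crcTitleHash( $s ) {
	$even = '';
	$odd = '';
	for ( $i = 0, $len = strlen( $s ); $i < $len; $i++ ) {
		if ( $i % 2 === 0 ) {
			$even .= $s[$i];
		} else {
			$odd .= $s[$i];
		}
	}
	// One 32 bit sum in the high half, one in the low half; the result may be
	// negative, which is fine for a signed BIGINT.
	return ( crc32( $even ) << 32 ) | ( crc32( $odd ) & 0xFFFFFFFF );
}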

According to some very quick research, truncated SHA1 or MD5 is the best solution in practice. Here's an overview of the hash algorithms supported by PHP, along with a benchmark:

http://php.net/manual/de/function.hash.php#89574
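
For anyone who wants to verify locally, a quick and unscientific timing loop (numbers will vary with PHP version and hardware):

// Sketch only: rough timing of candidate hash functions via the hash() extension.
$title = "Hello_Worlds!";
foreach ( [ 'sha1', 'md5', 'crc32b' ] as $algo ) {
	$start = microtime( true );
	for ( $i = 0; $i < 1000000; $i++ ) {
		hash( $algo, $title );
	}
	printf( "%-6s %.1f ms\n", $algo, ( microtime( true ) - $start ) * 1000 );
}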

SHA1 and MD5 are really fast, a lot faster than doing CRC32 twice. Any pure PHP implementation is going to be a *lot* slower. Stack Overflow agrees:

Here's a blog post that matches our use case nearly 100%:

They end up using truncated MD5.

In any case, whatever you use, make sure it is very clearly documented, not just in the code but also in the DB schema.

Btw, if we really want, we could also use a 128 bit hash by having two BIGINT columns, and an index spanning both of them. But I don't think that's worth the trouble.
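
If we ever did that, splitting the digest would be straightforward; a sketch with hypothetical column names (cgti_hash_high / cgti_hash_low, not part of any actual schema):

// Sketch only: split a 128 bit MD5 into two signed 64 bit integers for two
// hypothetical BIGINT columns.
$title = "Hello Worlds!";
$parts = unpack( 'J2', md5( $title, true ) ); // two 64 bit big-endian halves
$high = $parts[1]; // e.g. cgti_hash_high
$low = $parts[2];  // e.g. cgti_hash_low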

One thing we should think about though: what exactly should happen if we do hit a collision? Ignore? Overwrite? Fail? Can we even detect it?

I think the collision case is rare enough that we don't have to worry much about it, and we can fail pretty hard if we hit it. But the behavior in such a case should be well-defined.
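
To make the failure mode explicit, one option is to keep the raw title next to its hash and compare on insert. A sketch with hypothetical table and column names, using plain PDO rather than the actual Cognate code:

// Sketch only: fail hard on a detected hash collision instead of overwriting.
// Assumes a hypothetical table cognate_titles( cgti_hash BIGINT, cgti_title VARBINARY ).
function insertTitle( PDO $db, int $hash, string $title ) {
	$stmt = $db->prepare( 'SELECT cgti_title FROM cognate_titles WHERE cgti_hash = ?' );
	$stmt->execute( [ $hash ] );
	$existing = $stmt->fetchColumn();
	if ( $existing !== false && $existing !== $title ) {
		// Two different titles hashed to the same 64 bit value: fail hard.
		throw new RuntimeException( "Hash collision: '$existing' vs '$title'" );
	}
	if ( $existing === false ) {
		$db->prepare( 'INSERT INTO cognate_titles (cgti_hash, cgti_title) VALUES (?, ?)' )
			->execute( [ $hash, $title ] );
	}
}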

daniel renamed this task from "Adjust Cognate DB schema" to "Use hashes to identify matching page titles in Cognate DB schema". Nov 10 2016, 2:26 PM

Change 320743 had a related patch set uploaded (by Addshore):
New schema & normalizing keys

https://gerrit.wikimedia.org/r/320743

Change 320743 merged by jenkins-bot:
New schema & normalizing keys

https://gerrit.wikimedia.org/r/320743

Addshore moved this task from Currently in sprint to Done on the WMDE-TechWish board.
Addshore moved this task from Active 🚁 to Closing ✔️ on the User-Addshore board.