⚓ T150404 Use hashes to identify matching page titles in Cognate DB schema

	Subject	Repo	Branch	Lines +/-
	New schema & normalizing keys	mediawiki/extensions/Cognate	master	+720 -391

Status	Subtype	Assigned	Task
Open	Feature	None	T13996 A way to select which parts of Wiktionary articles to show
Open	Feature	None	T14213 Following a link to a language entry in Wiktionary should display only that entry
Open	Feature	None	T13998 A way to show only those languages on Wiktionary that the user is interested in
Open	Feature	None	T38881 Wiktionary needs usable API
Open		None	T31229 Extension to provide access via the dict protocol
Resolved		Lydia_Pintscher	T146637 Wikidata 2016 Q4 goals
Resolved		Lydia_Pintscher	T150179 Wikidata 2017 Q1 goals
Open		None	T109579 [Epic] Give more sister projects access to Wikidata
Resolved		Lydia_Pintscher	T986 Use structured data on Wiktionary
Resolved		Lydia_Pintscher	T988 Phase 1: Represent Wiktionary lexicon using structured data
Resolved		Lydia_Pintscher	T987 [Story] Phase 0: Automate interwiki language links for Wiktionary
Resolved		Lydia_Pintscher	T169708 Wikidata 2017 Q3 goals
Resolved		Lydia_Pintscher	T150178 Wikidata 2017 Q2 goals
Resolved		Lydia_Pintscher	T159316 Enable arbitrary access on English Wiktionary
Resolved		aude	T158323 enable sitelinks on Wikidata for Wiktionary pages outside main namespace (phase 1)
Resolved		Addshore	T150182 Deploy Cognate extension to production
Resolved		Addshore	T156241 Deploy Cognate extension to beta
Resolved		Addshore	T149082 Security review for Cognate Extension
Resolved		Addshore	T150404 Use hashes to identify matching page titles in Cognate DB schema
Resolved		Addshore	T160503 Wikimedia\Assert\ParameterTypeException from line 89 of /srv/mediawiki-staging/php-master/vendor/wikimedia/assert/src/Assert.php: Bad value for parameter $namespace: must be a integer
Resolved		aude	T158324 Enable phase 1 for Wiktionary at beta cluster
Resolved		None	T158325 Enable phase 1 for Wiktionary at test Wikidata
Open		None	T166765 Investigate the case of custom namespaces

Addshore created this task.Nov 10 2016, 9:13 AM

Restricted Application added a subscriber: Aklapper.Nov 10 2016, 9:13 AM

Addshore moved this task from Proposed to Sprint ready on the WMDE-TechWish board.Nov 10 2016, 9:13 AM

Addshore moved this task from Unsorted 💣 to Back Burner 🏛️ on the User-Addshore board.

Addshore added a parent task: T150182: Deploy Cognate extension to production.

Quick note: the "unnormalized" title would not be normalized according to Cognate rules, but it would still undergo regular title normalization (ucfirst, space to underscore, etc.), as well as Unicode normalization. The Title class takes care of this, but be careful when processing title strings from other sources.

@daniel did you have a method in mind for getting a numeric hash of a string in order to store it in the DB?

For use with a BIGINT column in the database, you need a 16 digit hex string to represent a 64 bit number. The simplest way to get a 16 digit hex string based on a hash is to take the first 16 digits of a SHA1 hash:

substr( sha1( "Hello Worlds!" ), 0, 16 );

This is pretty inefficient though: lots of cycles are spent computing a 160 bit hash when we only need 64 bits. I'll check if I can find something standard and more efficient.
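For illustration, here is a minimal sketch of turning the truncated SHA1 into an integer that fits a signed BIGINT. The function name `titleHash64` is made up for this example and is not taken from the Cognate code. It assumes 64-bit PHP, since the `J` format code of `unpack()` is not available on 32-bit builds. Taking the first 8 raw bytes is equivalent to taking the first 16 hex digits.

```php
<?php
/**
 * Hypothetical helper (not from Cognate): map a string to a signed
 * 64-bit integer using the first 8 bytes of its SHA1 digest.
 * Assumes 64-bit PHP.
 */
function titleHash64( string $title ): int {
	// sha1( ..., true ) returns the raw 20-byte digest instead of hex.
	$first8 = substr( sha1( $title, true ), 0, 8 );

	// 'J' reads an unsigned 64-bit big-endian value. PHP integers are
	// signed, so values with the top bit set come back negative, which
	// still fits a signed BIGINT column.
	return unpack( 'J', $first8 )[1];
}

echo titleHash64( 'Hello Worlds!' ), "\n";
```

Going through `hexdec()` instead would not work reliably here, because it returns a float (and loses precision) for values above `PHP_INT_MAX`.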

You could also use two crc32-sums to generate the hash - one for the characters at even indexes, and one for the characters at odd indexes, and then concatenate them. But I suspect doing this in PHP is going to be slower than relying on a highly optimized implementation of SHA1 in the standard library...
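A rough sketch of that double-CRC32 idea, just to make it concrete. The function name `doubleCrc32Hex` is hypothetical, and the split is done on bytes rather than Unicode characters, which is an assumption on my part; the comment above does not say which is meant.

```php
<?php
/**
 * Hypothetical helper: build a 16 digit hex string from two CRC32 sums,
 * one over the bytes at even offsets and one over the bytes at odd offsets.
 */
function doubleCrc32Hex( string $s ): string {
	$even = '';
	$odd = '';
	$len = strlen( $s );

	// This loop runs in PHP userland, which is the part suspected
	// to be slower than a single native SHA1 call.
	for ( $i = 0; $i < $len; $i++ ) {
		if ( $i % 2 === 0 ) {
			$even .= $s[$i];
		} else {
			$odd .= $s[$i];
		}
	}

	// hash( 'crc32b', ... ) returns 8 hex digits and uses the same
	// checksum as crc32().
	return hash( 'crc32b', $even ) . hash( 'crc32b', $odd );
}

echo doubleCrc32Hex( 'Hello Worlds!' ), "\n";
```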

According to some very quick research, truncated SHA1 or MD5 is the best solution in practice. Here's an overview of the hash algorithms supported by PHP, along with a benchmark:

http://php.net/manual/de/function.hash.php#89574

SHA1 and MD5 are really fast, a lot faster than doing CRC32 twice. Any pure PHP implementation is going to be a *lot* slower. Stack Overflow agrees:

Here's a blog post that matches our use case nearly 100%:

https://www.sitepoint.com/create-unique-64bit-integer-string/

They end up using truncated MD5.

In any case, whatever you use, make sure it is very clearly documented, not just in the code but also in the DB schema.

Btw, if we really want, we could also use a 128 bit hash by having two BIGINT columns, and an index spanning both of them. But I don't think that's worth the trouble.
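To show what the 128 bit variant could look like on the PHP side, here is a sketch that splits an MD5 digest into two 64-bit halves, one per BIGINT column. The function name and the column names `hash_hi` and `hash_lo` are invented for this example. As with any use of the `J` format code, it assumes 64-bit PHP.

```php
<?php
/**
 * Hypothetical helper: split the 128-bit MD5 digest of a string into
 * two signed 64-bit integers, one for each of two BIGINT columns.
 */
function titleHash128( string $title ): array {
	// md5( ..., true ) returns the raw 16-byte digest.
	// 'J2' unpacks two consecutive 64-bit big-endian values into keys 1 and 2.
	$parts = unpack( 'J2', md5( $title, true ) );

	return [
		'hash_hi' => $parts[1], // first 8 bytes of the digest
		'hash_lo' => $parts[2], // last 8 bytes of the digest
	];
}

print_r( titleHash128( 'Hello Worlds!' ) );
```

Lookups would then need to match on both columns, which is what the index spanning both of them is for.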

One thing we should think about though: what exactly should happen if we do hit a collision? Ignore? Overwrite? Fail? Can we even detect it?

I think the collision case is rare enough that we don't have to worry much about it, and we can fail pretty hard if we hit it. But the behavior in such a case should be well-defined.
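One way to make the "fail hard" behavior well-defined is sketched below. It relies on keeping the original string next to its hash, because a collision can only be detected if there is something to compare against. Everything here is hypothetical: the class names, the in-memory array standing in for the DB table, and the injectable hasher (which is only there so a collision can be forced in a test).

```php
<?php
/** Hypothetical exception for two different strings sharing one hash. */
class HashCollisionException extends RuntimeException {
}

/** Hypothetical in-memory stand-in for a table keyed by hash. */
class TitleHashRegistry {
	/** @var callable maps string to int */
	private $hasher;
	/** @var string[] stored strings, keyed by hash */
	private $stringsByHash = [];

	public function __construct( callable $hasher = null ) {
		// Default: first 8 bytes of SHA1 as a signed 64-bit int (64-bit PHP only).
		$this->hasher = $hasher ?: function ( string $s ): int {
			return unpack( 'J', substr( sha1( $s, true ), 0, 8 ) )[1];
		};
	}

	public function register( string $title ): int {
		$hash = ( $this->hasher )( $title );
		if ( isset( $this->stringsByHash[$hash] ) && $this->stringsByHash[$hash] !== $title ) {
			// Same hash, different string: fail loudly instead of
			// silently ignoring or overwriting the existing entry.
			throw new HashCollisionException(
				"Hash $hash already used by '{$this->stringsByHash[$hash]}'"
			);
		}
		$this->stringsByHash[$hash] = $title;
		return $hash;
	}
}

$registry = new TitleHashRegistry();
$registry->register( 'Hello Worlds!' );
$registry->register( 'Hello Worlds!' ); // same string again: not a collision
```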

daniel renamed this task from Adjust Cognate DB schema to Use hashes to identify matching page titles in Cognate DB schema.Nov 10 2016, 2:26 PM

Addshore moved this task from Sprint ready to Currently in sprint on the WMDE-TechWish board.Nov 10 2016, 6:46 PM

Addshore moved this task from Back Burner 🏛️ to Active 🚁 on the User-Addshore board.

Change 320743 had a related patch set uploaded (by Addshore):
New schema & normalizationg keys

https://gerrit.wikimedia.org/r/320743

gerritbot added a project: Patch-For-Review.Nov 10 2016, 6:46 PM

Addshore mentioned this in rECOG647ac1ce9e2c: New schema & normalizing keys.Nov 10 2016, 6:47 PM

Addshore mentioned this in rECOGe5acac5e46cf: New schema & normalizationg keys.

Addshore mentioned this in rECOG4873c0d005bf: New schema & normalizing keys.Nov 11 2016, 1:03 PM

Addshore mentioned this in rECOGd47e145876e9: New schema & normalizing keys.Nov 11 2016, 5:34 PM

Addshore mentioned this in rECOG021f241d79d7: New schema & normalizing keys.Nov 14 2016, 12:16 PM

Addshore mentioned this in rECOG9d51cb9d44ba: New schema & normalizing keys.

Addshore mentioned this in rECOGb50be0a77e95: New schema & normalizing keys.Nov 14 2016, 12:28 PM

Addshore mentioned this in rECOG569ed8e0b31a: New schema & normalizing keys.Nov 14 2016, 5:29 PM

Addshore mentioned this in rECOGfce7cf173b38: New schema & normalizing keys.Nov 16 2016, 2:09 PM

Addshore mentioned this in rECOG181cf98c085d: New schema & normalizing keys.Nov 18 2016, 11:23 AM

Addshore mentioned this in rECOG1b6c674dee49: New schema & normalizing keys.Nov 18 2016, 12:43 PM

Addshore mentioned this in rECOGc82e4144e32d: New schema & normalizing keys.Nov 18 2016, 4:36 PM

Addshore mentioned this in rECOGcdd13cf07eab: New schema & normalizing keys.Nov 22 2016, 8:19 PM

Addshore mentioned this in rECOGb4ee52e60f1f: New schema & normalizing keys.Nov 24 2016, 3:28 PM

Addshore mentioned this in rECOGe49f8179e87c: New schema & normalizing keys.Nov 30 2016, 3:36 PM

Addshore mentioned this in rECOG68ab41ce3259: New schema & normalizing keys.

daniel added a project: User-Daniel.Dec 7 2016, 7:52 PM

daniel added a parent task: T149082: Security review for Cognate Extension.Dec 7 2016, 10:15 PM

daniel mentioned this in T149082: Security review for Cognate Extension.

Addshore mentioned this in rECOGf5cf24073fb4: New schema & normalizing keys.Dec 8 2016, 12:35 PM

daniel mentioned this in rECOGe75a2b6f18a9: New schema & normalizing keys.Dec 9 2016, 3:22 PM

daniel mentioned this in rECOG877fb74078c2: New schema & normalizing keys.Dec 9 2016, 4:03 PM

daniel mentioned this in rECOG69601da63f1b: New schema & normalizing keys.Dec 9 2016, 4:06 PM

Change 320743 merged by jenkins-bot:
New schema & normalizing keys

https://gerrit.wikimedia.org/r/320743

Addshore closed this task as Resolved.Dec 11 2016, 3:58 PM

Addshore moved this task from Currently in sprint to Done on the WMDE-TechWish board.

Addshore moved this task from Active 🚁 to Closing ✔️ on the User-Addshore board.

Addshore moved this task from Done to Demoed on the WMDE-TechWish board.Dec 13 2016, 1:09 PM

Tobi_WMDE_SW removed a project: WMDE-TechWish.Jun 7 2017, 12:01 PM

thiemowmde mentioned this in T183737: Cognate's StringHasher requires 64-bit PHP, doesn't work with 32-bit.Dec 28 2017, 10:12 AM

thiemowmde mentioned this in rECOGcfa170323877: Update patch set 2.Jun 8 2018, 11:56 PM
