
Remove reference duplicates from RDF dump
Closed, Resolved · Public

Description

Right now when the dump is generated, references are identified by content hash. This means a reference to the German Wikipedia always produces "ref:004ec6fbee857649acdbdbad4f97b2c8571df97". Since there are many such references, the data for this reference is repeated over and over, potentially creating thousands of copies of the same information. We need to remove the duplicates from the dump - or change the way the hash is generated (how?)

Additionally, we may encounter the same problem when importing updates, so we must account for this when designing the update procedure.

Event Timeline

Smalyshev claimed this task.
Smalyshev raised the priority of this task from to High.
Smalyshev updated the task description.
Restricted Application added a subscriber: Aklapper. · Mar 13 2015, 12:17 AM

Same may be necessary for values when we store them as nodes.

Possible solution:

After generating the value and reference nodes, put their hash in a "seen" array (as key).
Before generating the value and reference nodes, check whether they have been "seen" already.
The "seen" list would be a member of RdfBuilder, so no problems arise with changing options.
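The steps above can be sketched as follows. This is a minimal Python sketch of the proposed de-duplication; the actual code lives in Wikibase's RdfBuilder (PHP), and the class and method names here are illustrative, not the real API.

```python
class RdfBuilder:
    """Sketch: skip reference/value nodes whose hash was already emitted."""

    def __init__(self):
        self.seen = set()   # hashes of nodes already written to the dump
        self.triples = []   # stand-in for the dump output stream

    def write_reference(self, ref_hash, triples):
        # Before generating the node, check whether it was "seen" already.
        if ref_hash in self.seen:
            return  # duplicate: this node is already in the dump
        # After generating it, record the hash so later copies are skipped.
        self.seen.add(ref_hash)
        self.triples.extend(triples)
```

Because the `seen` set belongs to the builder instance, it is reset whenever a new builder is constructed, so changing options between dumps causes no stale-state problems.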

To reduce memory consumption of the approach I suggested above, use part (the first few digits) of the hash as the key in the "seen" array, keep the full hash as the value. For instance, using 4 hex digits would limit the size of the "seen" list to 2^16 entries.

When looking up $x:

  • !isset( $seen[ key($x) ] ) -> not seen
  • isset( $seen[ key($x) ] ) && $seen[ key($x) ] === $x -> seen
  • isset( $seen[ key($x) ] ) && $seen[ key($x) ] !== $x -> probably not seen

Finally, set $seen[ key($x) ] = $x
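The truncated-key lookup above can be sketched in Python (function names are illustrative; the real implementation would be PHP inside RdfBuilder). The key property is that a prefix collision can only cause a "probably not seen" answer for a hash that was in fact seen, never a false "seen" for an unseen hash, so the worst case is an occasional re-emitted duplicate, not a missing node.

```python
def check_and_mark(seen, full_hash, prefix_len=4):
    """Return True iff full_hash was definitely seen before.

    `seen` maps a hash prefix (the first `prefix_len` hex digits) to the
    last full hash stored under that prefix, bounding the map to at most
    16**prefix_len entries (2**16 for prefix_len=4).
    """
    key = full_hash[:prefix_len]
    # "seen" only when the stored full hash matches exactly; a differing
    # value means a prefix collision, i.e. "probably not seen".
    seen_before = seen.get(key) == full_hash
    seen[key] = full_hash  # record the latest hash for this prefix
    return seen_before
```

A collision evicts the previous occupant of the slot, so a hash can later report "not seen" again and be emitted a second time - harmless for the dump, which merely gains one redundant copy.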

Hat Tip to http://www.somethingsimilar.com/2012/05/21/the-opposite-of-a-bloom-filter/ and https://news.ycombinator.com/item?id=4251313

I'll try full hash first and see how much memory it consumes. If it's anything substantial then I'll go with what @daniel proposed.

Smalyshev closed this task as Resolved.Mar 20 2015, 5:23 PM