
Remove reference duplicates from RDF dump
Closed, ResolvedPublic


Right now, when the dump is generated, references are identified by content hash. This means a reference to German Wikipedia always produces "ref:004ec6fbee857649acdbdbad4f97b2c8571df97". However, since there are many such references, the data for this reference is repeated over and over, potentially creating thousands of copies of the same information. We need to remove the duplicates from the dump - or change the way the hash is generated (how?)

Additionally, we may encounter the same problem when importing updates, so we must account for this when designing the update procedure.

Event Timeline

Smalyshev claimed this task.
Smalyshev raised the priority of this task from to High.
Smalyshev updated the task description. (Show Details)

Possible solution:

  • After generating the value and reference nodes, put their hash in a "seen" array (as key).
  • Before generating the value and reference nodes, check whether they have been "seen" already.
  • The "seen" list would be a member of RdfBuilder, so no problems arise with changing options.
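The "seen" array idea could be sketched roughly as follows. This is only an illustration, not the actual Wikibase code: the class and method names here are hypothetical, and the real RdfBuilder would call such a helper before emitting a reference or value node.

```php
<?php
// Minimal sketch of a "seen" set for deduplicating reference/value nodes.
// RdfDedupBag and alreadySeen() are illustrative names, not the real API.
class RdfDedupBag {
	/** @var array hash => true for every node already written */
	private $seen = array();

	/**
	 * Returns false the first time a given hash is passed in (so the
	 * caller should write the node's triples) and true on repeats.
	 */
	public function alreadySeen( $hash ) {
		if ( isset( $this->seen[$hash] ) ) {
			return true;
		}
		$this->seen[$hash] = true;
		return false;
	}
}

$bag = new RdfDedupBag();
var_dump( $bag->alreadySeen( 'ref:004ec6fb' ) ); // bool(false) - write the node
var_dump( $bag->alreadySeen( 'ref:004ec6fb' ) ); // bool(true)  - skip duplicate
```

Since the bag lives on the builder instance, its lifetime matches one dump run, so nodes are deduplicated across all entities in that run.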

To reduce memory consumption of the approach I suggested above, use part (the first few digits) of the hash as the key in the "seen" array, and keep the full hash as the value. For instance, using 4 hex digits would limit the size of the "seen" list to 2^16 entries.

When looking up $x:

  • !isset( $seen[ key($x) ] ) -> not seen
  • isset( $seen[ key($x) ] ) && $seen[ key($x) ] === $x -> seen
  • isset( $seen[ key($x) ] ) && $seen[ key($x) ] !== $x -> probably not seen (prefix collision)

Finally, set $seen[ key($x) ] = $x
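The truncated-key lookup above could look like this. Again a sketch with hypothetical names; here key() is assumed to take the first 4 hex digits of the hash. A prefix collision can only produce a false "not seen", which merely re-emits a node; duplicate triples are harmless in RDF, so correctness is preserved.

```php
<?php
// Sketch of the truncated-key "seen" map: key is a 4-hex-digit prefix
// of the hash, value is the full hash last seen under that prefix.
class TruncatedDedupBag {
	/** @var array prefix => full hash */
	private $seen = array();

	private function key( $hash ) {
		return substr( $hash, 0, 4 ); // 4 hex digits => at most 2^16 entries
	}

	public function alreadySeen( $hash ) {
		$k = $this->key( $hash );
		// Seen only if the stored full hash matches exactly; a differing
		// value means a collision, which we treat as "probably not seen".
		$hit = isset( $this->seen[$k] ) && $this->seen[$k] === $hash;
		$this->seen[$k] = $hash; // remember the full hash for this prefix
		return $hit;
	}
}
```

The trade-off is that a collision overwrites the stored hash, so an earlier node sharing the prefix may be emitted again later; memory stays bounded regardless of dump size.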

Hat Tip to and

I'll try full hash first and see how much memory it consumes. If it's anything substantial then I'll go with what @daniel proposed.