
Design and implement an algorithm to provide stable element ids
Open, MediumPublic

Description

Stable element ID assignments would be desirable for many use cases:

  • Stable links to sections.
  • A modern alternative for Labeled Section Transclusion (see T100139)
  • Annotations like blame maps.
  • Fine-grained dependency tracking for change propagation.
  • Content section tracking for translations: T90187

We have been discussing ways to implement stable IDs for a while now, but haven't found the time to really flesh out a plan yet.

Idea 1): Leverage DOM diffing to transfer IDs for unmodified content

After each parse, DOM-diff against the previous revision. For all unchanged content, transfer the old element ID assignments.

Issues:

  • Our DOM diff algorithm is not very sophisticated. For example, it lacks move detection. However, move detection could be added (using the xydiff algorithm or some tricks from hypothesis, for example), which would also improve selective serialization and possible future blame map generation.
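
A minimal sketch of idea 1) in Python, assuming the diff step can report which node pairs are unchanged between revisions (the diff itself, including any move detection, is the hard part and is not shown; the data shapes are illustrative):

import uuid

def transfer_stable_ids(new_nodes, unchanged_pairs):
    # new_nodes: node dicts for the new revision (each dict holds the node's attributes)
    # unchanged_pairs: (old_node, new_node) pairs the DOM diff reported as unchanged
    for old_node, new_node in unchanged_pairs:
        if 'id' in old_node:
            new_node['id'] = old_node['id']          # reuse the existing stable id
    for node in new_nodes:
        node.setdefault('id', uuid.uuid4().hex[:8])  # mint a fresh id for new or modified nodes
    return new_nodes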


Event Timeline

GWicke raised the priority of this task from to Needs Triage.
GWicke updated the task description.
GWicke added a project: Parsoid.
GWicke added subscribers: GWicke, ssastry.
GWicke edited subscribers, added: cscott, Arlolra, tstarling; removed: Aklapper.

We've discussed this several times, including during the offsite.

There are a few reasons why this hasn't been done so far:

  • (a) There has been no compelling use case that required this as a high-priority thing. At one point we thought we would use this for wt <-> html editing in VE, but we figured that temporary storage is a more reliable solution for now.
  • (b) We have more things to do than we have people to work on them, so without a compelling use case that forces us to get to this, it will continue to get deprioritized.
  • (c) There is the other important bit about the 'stability' part of the stable ids, i.e. we cannot really guarantee stability across a series of edits. So, if there is any application that requires hard guarantees about stability, then we need a solution where we can provide such guarantees, and a domdiff-based solution may or may not cut it.

GWicke renamed this task from Design an implement an algorithm to provide stable element ids to Design and implement an algorithm to provide stable element ids.Oct 23 2015, 3:43 AM

@ssastry, ContentTranslation has the translation restore feature, which allows translators to continue a translation they saved earlier. While doing that, we load the source article (latest revision) and use the section ids to properly align the translated sections to the source sections (sections are paragraphs, headers, etc.). We have multiple ways to show the orphan translations (translation sections for which we don't find a source section when we restore), but better stable ids for sections will certainly help translators. We have upcoming feature development that will use this section-id matching to develop a parallel corpus aligned at the section level.

So any improvements in this direction are much appreciated.

Arlolra triaged this task as Medium priority.Nov 18 2015, 1:47 AM

It is possible to use Locality-sensitive hashing to create a digest, and it is often stable enough that it can be linked to as-is. On a link miss (no exact match), it is usually possible to calculate a distance metric and redirect the request to the closest hit (reload with the closest digest).

The calculation of the digests is heavy, but it is done when storing the revision. Several algorithms are O(n), so even though they are heavy they are quite efficient for large texts.

Simple 16- and 32-bit digests should be sufficient for Wikipedia.
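
A small sketch of the link-miss fallback described above, in Python: on an exact miss, redirect to the section whose stored digest has the smallest Hamming distance to the requested one (digests are treated as plain integers; the names are illustrative):

def closest_digest(requested, stored):
    # requested: integer digest taken from the (possibly stale) link
    # stored: iterable of integer digests for the current revision's sections
    def hamming(a, b):
        return bin(a ^ b).count('1')   # number of differing bits
    return min(stored, key=lambda d: hamming(requested, d))

# A distance of 0 is an exact hit; anything else redirects to the closest available section.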

Notes from a Jan 6, 2016 conversation between @santhosh, @Arlolra, and me at the WMF dev summit.

There have been followup discussions over email and in person with others, but the general gist of this hasn't fundamentally changed, i.e. stability requirements seem to be application-specific and more clarity is needed as to what a generic stable id solution is going to accomplish.

I am dumping these notes here so this can spark future discussion rather than be stuck in an email thread.


Stable ids not very well understood:
The notion of a stable id is poorly specified so far, i.e. given a DOM A and a DOM B, if nodes nA (from A) and nB (from B) have the same id, what does that tell us in terms of the properties of nA and nB?

For example, consider a paragraph p in A. In the course of the edits that take A to B, p has had a bunch of typos fixed, grammatical errors corrected, Oxford commas inserted, whatever. Intuitively, it makes sense that p gets the same id in A and B.

However, what if 50% of the content of p has been modified (new sentences added, some deleted, etc.)? How about if 80% of the content has been modified? So, this is the first part of the vagueness of stable ids.

There seems to be some application-specific aspects that determine what id-stability needs to guarantee.

For example, if stable ids are used to assign annotations to DOM nodes (say comments, or other properties), it does seem that comments attached to a node should carry over only if the paragraph has not been wholesale modified (i.e. beyond some content similarity threshold, the identical id assignment feels useless / stale).

Also, in the general case, when ids are used to retrieve properties about the attached DOM nodes (and hence the associated wikitext, because of the DSR mapping that Parsoid maintains), it seems that false positives are going to be worse than false negatives. An id assignment that errs toward false negatives is probably better and more conservative, although in the general case this once again seems application-specific.

CX-specific requirements:

All of these considerations apply to what CX is trying to do. CX has very specific requirements.

They only need to track the immediate children of <body>. I.e. given revisions R1 and R2 (the later revision), for each child c2 of R2, they want to find a child c1 of R1 that is "similar" to c2 (this is so that they can re-present the previously saved translation of c1 against c2 and have the editor update the previous translation).

So, the problem for CX reduces to: What is the most efficient way to compute similarity between top-level children of 2 dom nodes?

  • it seems there are lots of approaches to doing this:
    • some edit distance metric normalized to node size
    • https://en.wikipedia.org/wiki/W-shingling
    • probably any of the many techniques to detect plagiarism. The naive way of doing any of these would result in O(N^2) or worse complexity. But, in combination with dom-diff, or by using a simple linear walk of the top-level nodes, you can probably do similarity detection a lot more efficiently.

      In my mind, I am intrigued by the idea of assigning signatures / fingerprints to nodes which can be used as proxies for computing / detecting similarity. I.e. instead of comparing strings X and Y, you compare sig(X) and sig(Y) instead. This eliminates the need to process full documents; you can instead process and compare a set of signatures. md5 would be a bad sig function for most applications.
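
A minimal sketch of the shingling/signature idea in Python, using w-shingling with Jaccard similarity as the proxy for node similarity; the shingle width, the 0.5 threshold, and the greedy matching are illustrative choices, not anything that has been agreed on:

def shingles(text, w=4):
    # the set of all w-word shingles of the node's text
    words = text.split()
    if len(words) <= w:
        return {tuple(words)}
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}

def similarity(text_a, text_b, w=4):
    # Jaccard similarity of the two shingle sets: 1.0 = identical, 0.0 = disjoint
    a, b = shingles(text_a, w), shingles(text_b, w)
    return len(a & b) / len(a | b)

def match_children(old_children, new_children, threshold=0.5):
    # Greedily pair each new top-level child with the most similar old child
    # above the threshold; unmatched children are the "orphans" mentioned earlier.
    matches = {}
    for i, new_text in enumerate(new_children):
        best, best_sim = None, threshold
        for j, old_text in enumerate(old_children):
            s = similarity(new_text, old_text)
            if s > best_sim:
                best, best_sim = j, s
        matches[i] = best
    return matches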

Related questions:

  • Who should compute these document similarities? At first blush, it seems that this is better located in CX rather than in Parsoid unless we find a useful generalization of this (see next section below) that is more broadly applicable.
  • If we go with the signature-based approach, these node signatures should be considered metadata and could be stored somewhere that clients (like CX) would fetch them from to map nodes between two documents.

General problem:
Bloom filters use a set of hash functions for efficient membership tests but they have the problem of false positives.

We are more interested in an id assignment solution that is biased towards false negatives (at first glance).

But whatever the case, it seems like a set of k (for a very small value of k) signatures might meet the needs of different applications. In addition, Parsoid itself might use some similarity cut-off to preserve id assignments across revisions independent of edits.
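
A small sketch of what such a set of k signatures could look like in Python; the particular signature functions are placeholders, and requiring all of them to agree is what biases the scheme toward false negatives:

import hashlib

def _h(s):
    return hashlib.sha1(s.encode('utf-8')).hexdigest()[:8]

def node_signatures(text):
    # k = 3 cheap signatures per node; the choices below are illustrative only
    words = text.lower().split()
    return (
        _h(' '.join(sorted(set(words)))),   # bag-of-words: ignores ordering and repetition
        _h(' '.join(words[:10])),           # leading words: sensitive to how the node starts
        len(words) // 20,                   # coarse length bucket
    )

def same_node(sig_a, sig_b):
    # conservative: every signature must match, so mismatches (false negatives)
    # are far more likely than spurious matches (false positives)
    return all(a == b for a, b in zip(sig_a, sig_b))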

I came across the https://en.wikipedia.org/wiki/Locality-sensitive_hashing family of hashing and, at a quick look, this matches our problem description.

I have used Locality-sensitive hashing as a method in several languages/projects and it is pretty straightforward. I usually hash and bin the text from a sliding window. Some years back I also proposed to store such hash digests together with the revisions, but it was rejected.

As long as you only need to use this within an article for resynchronization purposes, the hashed strings should work pretty well, and you should also be able to reduce the length of the digests from the 128 bits I have used previously.

There is already a PHP version in an old Gerrit commit that I withdrew. It was proposed as a dissimilarity measure for labels and descriptions in Wikidata. (Dissimilarity is a weak indication that something is wrong in the interlinking of items.)

The code is PrettyDarnFast™, especially compared to edit distances, but you might add stuff that slows it down. It is O(n), where n is the length of the text. Better hashing algorithms are heavier, but unless you want to compare huge text corpora with a lot of similar texts, it should not matter.

The general algorithm I have used is

let n be 5
let m be 128
let w be a string of length n
let s be the string to be hashed
while s
  remove one char from s and prepend it to w
  shorten w to length n
  let h be the hash of the window
  increment the bin at h modulo m
let v be the median value over all bin counts
let d be a digest of length m
for each bit of d, let the value be true if the corresponding bin count is larger than v

Sometimes I use the vector as-is, but often the digest.

Comparison of digests is done by a bean-counter running over all the bits set in both digests. That is, for two strings a and b with digests a' and b', you would run the bean-counter as count(a' & b') and then normalize over the digest length m.
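
A rough, runnable rendering of this algorithm in Python; the window length n, digest length m, and crc32 as the window hash are stand-ins for whatever a real implementation would pick:

import statistics
import zlib

def lsh_digest(s, n=5, m=128):
    # bin the hash of every (up to) n-character window of s,
    # then threshold the bin counts at their median to get an m-bit digest
    bins = [0] * m
    for i in range(len(s)):
        window = s[max(0, i - n + 1):i + 1]
        h = zlib.crc32(window.encode('utf-8'))   # deterministic stand-in hash
        bins[h % m] += 1
    median = statistics.median(bins)
    digest = 0
    for count in bins:
        digest = (digest << 1) | (1 if count > median else 0)
    return digest

def digest_similarity(a, b, m=128):
    # the bean-counter: bits set in both digests, normalized by the digest length m
    return bin(a & b).count('1') / m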

I believe a better approach to this problem is T149667: Build an article annotation service. I don't believe that a single definition of "stable IDs" will work for all the use cases considered; as @ssastry wrote above, different uses want different notions of stability. It's best to decouple this from the parser or content representation.