The following parameters were introduced for T324803: Better Diffs: Wikidiff2 revise algorithm. They should be tuned based on real-world test data. The default values are rough intuitive guesses.
maxSplitSize
The maximum number of lines in the RHS which may be considered for a word-level diff against a single line of the LHS.
| Lower value | Smaller splits |
| Higher value | Larger splits |
This was originally intended as a complexity limit for performance reasons. However, in testing, it was very difficult to find a test case which would actually have performance problems with a high maxSplitSize. So I think it can be set based on user expectations. Suggestion: 10.
initialSplitThreshold
The minimum similarity which must be maintained during a split detection search. The search terminates when the similarity falls below this level. Default: 0.1.
| Lower value | More sensitive. More diffs are detected as splits. |
| Higher value | Less sensitive. Fewer diffs are detected as splits. |
Setting this to 0.1 means at least 10% of the LHS line must appear in the corresponding RHS lines.
An excessively low value would cause split detection to compete with move detection. Consider an obvious paragraph move:
| LHS | RHS | split similarity |
|---|---|---|
| aaa | bbb | 0 |
| aaa | 50% | |
Comparing "aaa" to "bbb" the similarity is 0. If the initialSplitThreshold was too low (say 0), we might continue the split search to the next line. Now we compare "aaa" with "bbb aaa" and find that the similarity is 50%. We don't want to detect this case as a split so we set initialSplitThreshold high enough to ignore very bad initial matches.
Setting initialSplitThreshold too high would cause the algorithm to ignore splits where the first RHS line only has a short prefix of the LHS line.
| LHS | RHS | split similarity |
|---|---|---|
| 1 2 3 4 5 6 7 8 9 0 | 1 | 10% |
| 2 | 20% | |
| 3 | 30% | |
In this example, the split similarity increases with increasing split size, but the initial split similarity must be satisfied to continue the search. If initialSplitThreshold is greater than 10%, the split would not be detected.
initialSplitThreshold also helps to terminate the search when text was deleted. So increasing initialSplitThreshold improves performance. For example:
| LHS | RHS | split similarity |
|---|---|---|
| 1 2 3 4 5 6 7 8 9 0 | 1 2 3 | 30% |
| a | 27% | |
| b c d e | 20% | |
The similarity decreases as more irrelevant text is added. Eventually the similarity becomes less than initialSplitThreshold and the search stops.
finalSplitThreshold
The minimum similarity which must be achieved in order to display the comparison between one line and several lines as a split. Default 0.6.
| Lower value | More sensitive. More diffs are detected as splits. |
| Higher value | Less sensitive. Fewer diffs are detected as splits. |
This parameter controls how similar the joined lines must be at the optimal split size. For example:
| LHS | RHS | split similarity |
|---|---|---|
| 1 2 3 4 5 6 7 8 9 0 | 1 2 3 4 | 40% |
| 5 | 50% | |
| a | 45% | |
| b c d e f g h i j | 25% | |
The best split was achieved at a size of 2, with a similarity of 50%. If this similarity exceeds finalSplitThreshold then it will be presented to the user as a split.