Page MenuHomePhabricator

Wikidiff2 paragraph split parameter tuning
Closed, DeclinedPublic

Description

The following parameters were introduced for T324803: Better Diffs: Wikidiff2 revise algorithm. They should be tuned based on real-world test data. The default values are rough intuitive guesses.

maxSplitSize

The maximum number of lines in the RHS which may be considered for a word-level diff against a single line of the LHS.

Lower valueSmaller splits
Higher valueLarger splits

This was originally intended as a complexity limit for performance reasons. However, in testing, it was very difficult to find a test case which would actually have performance problems with a high maxSplitSize. So I think it can be set based on user expectations. Suggestion: 10.

initialSplitThreshold

The minimum similarity which must be maintained during a split detection search. The search terminates when the similarity falls below this level. Default: 0.1.

Lower valueMore sensitive. More diffs are detected as splits.
Higher valueLess sensitive. Fewer diffs are detected as splits.

Setting this to 0.1 means at least 10% of the LHS line must appear in the corresponding RHS lines.

An excessively low value would cause split detection to compete with move detection. Consider an obvious paragraph move:

LHSRHSsplit similarity
aaabbb0
aaa50%

Comparing "aaa" to "bbb" the similarity is 0. If the initialSplitThreshold was too low (say 0), we might continue the split search to the next line. Now we compare "aaa" with "bbb aaa" and find that the similarity is 50%. We don't want to detect this case as a split so we set initialSplitThreshold high enough to ignore very bad initial matches.

Setting initialSplitThreshold too high would cause the algorithm to ignore splits where the first RHS line only has a short prefix of the LHS line.

LHSRHSsplit similarity
1 2 3 4 5 6 7 8 9 0110%
220%
330%

In this example, the split similarity increases with increasing split size, but the initial split similarity must be satisfied to continue the search. If initialSplitThreshold is greater than 10%, the split would not be detected.

initialSplitThreshold also helps to terminate the search when text was deleted. So increasing initialSplitThreshold improves performance. For example:

LHSRHSsplit similarity
1 2 3 4 5 6 7 8 9 01 2 330%
a27%
b c d e20%

The similarity decreases as more irrelevant text is added. Eventually the similarity becomes less than initialSplitThreshold and the search stops.

finalSplitThreshold

The minimum similarity which must be achieved in order to display the comparison between one line and several lines as a split. Default 0.6.

Lower valueMore sensitive. More diffs are detected as splits.
Higher valueLess sensitive. Fewer diffs are detected as splits.

This parameter controls how similar the joined lines must be at the optimal split size. For example:

LHSRHSsplit similarity
1 2 3 4 5 6 7 8 9 01 2 3 440%
550%
a45%
b c d e f g h i j25%

The best split was achieved at a size of 2, with a similarity of 50%. If this similarity exceeds finalSplitThreshold then it will be presented to the user as a split.

Event Timeline

I did something like this for a small variety of parameter options when I was testing: https://tools-static.wmflabs.org/betterdiffsreport/report.html

It could in theory be repeated for a larger variety of parameters.

@TheresNoTime we've discussed this and said to wait for feedback before we pick this up.

Aklapper renamed this task from Wikdiff2 paragraph split parameter tuning to Wikidiff2 paragraph split parameter tuning.Sep 13 2023, 1:37 PM
MusikAnimal subscribed.

The project has been deemed complete, with no tuning needed! Closing as declined.