Investigate what the ideal change detection threshold would be
Closed, Resolved


The change detection threshold determines whether two paragraphs are treated as the same paragraph with changes, or as two different paragraphs where the first was deleted and the second added. T180259 shows several scenarios where the current threshold does not always give the best results.
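To make the decision concrete, here is a minimal Python sketch of a threshold-based classification. It is only an illustration: the similarity metric (`difflib.SequenceMatcher`), the function names, and the assumption that a pair scoring above the threshold counts as "changed" are stand-ins, not wikidiff2's actual implementation.

```python
from difflib import SequenceMatcher


def char_similarity(old: str, new: str) -> float:
    """Return a similarity score in [0, 1].

    Assumption: this stand-in metric is NOT the one wikidiff2 uses.
    """
    return SequenceMatcher(None, old, new, autojunk=False).ratio()


def classify(old: str, new: str, threshold: float = 0.2) -> str:
    """Classify a paragraph pair against the threshold.

    Assumption: pairs scoring above the threshold are shown as one
    changed paragraph; all others as a deletion plus an addition.
    """
    if char_similarity(old, new) > threshold:
        return "changed"
    return "deleted+added"


print(classify("The quick brown fox", "The quick red fox"))
print(classify("aaaa", "zzzz"))
```

Under this assumption, lowering the threshold makes more pairs count as "changed", which is the trade-off explored in this task.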

Investigate whether other threshold values give better results, and document the process, e.g. on the test wiki or any other persistent page that can be edited at a later stage and linked.
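One way to run such an investigation is to score several candidate thresholds against hand-labelled paragraph pairs. The Python sketch below is hypothetical throughout: the example pairs, their labels, the `difflib`-based similarity metric, and the "above threshold means changed" rule are all assumptions for illustration, not wikidiff2 behaviour or real test data.

```python
from difflib import SequenceMatcher


def similarity(old: str, new: str) -> float:
    # Stand-in metric in [0, 1]; not the one wikidiff2 uses.
    return SequenceMatcher(None, old, new, autojunk=False).ratio()


def accuracy(pairs, threshold: float) -> float:
    """Fraction of pairs whose classification matches the expected label."""
    hits = 0
    for old, new, expected_changed in pairs:
        predicted_changed = similarity(old, new) > threshold
        hits += predicted_changed == expected_changed
    return hits / len(pairs)


# Hypothetical hand-labelled pairs: (old, new, should count as "changed").
PAIRS = [
    ("The cat sat on the mat.", "The cat sat on the rug.", True),
    ("Born in 1950 in Berlin.", "Born in 1951 in Berlin.", True),
    ("aaaa bbbb", "xyz", False),
]

# Candidate thresholds to compare.
CANDIDATES = [0.015, 0.08, 0.145, 0.2, 0.25]

scores = {t: accuracy(PAIRS, t) for t in CANDIDATES}
for t, s in scores.items():
    print(f"threshold={t}: accuracy={s:.2f}")
```

A real evaluation would replace `PAIRS` with the diffs collected on the test wiki and the stand-in metric with the one the diff engine actually computes.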

T181404: Make change detection threshold configurable from php, T182571: Create a test set for wikidiff2 and T183352: Investigate if character runs are better than character counts for threshold are prerequisites for this task.

I tried to fix some of the regressions from the regression ticket. With a value of 0.2 (instead of 0.25) I fixed:

It's harder for the next example, where 0.145 seems to be the magic border (see line 90):

Also, the threshold alone does not seem to be enough to handle what is going on here:
(Setting it to 0.015 fixes this completely and 0.08 fixes most of the cases, but then the threshold might be completely useless for most other cases.)

We imported all diffs linked here and on other pages in our test-wiki:

You can see the old and new version there!

Feel free to add more!

For testing, see the automated testing environment.

WMDE-Fisch set the point value for this task to 8. Feb 6 2018, 4:37 PM

@jkroll can you please summarize your findings? You said 0.2 would be the best compromise for the threshold.

I would say a default value of 0.2 looks pretty good for English. Some slightly annoying edge cases exist for any value, which could be fixed by special-case code. I also investigated "character runs" as an alternative to the character-based similarity but found no improvement.