
Investigate if character runs are better than character counts for threshold
Closed, Resolved (Public)

Description

Motivation
Currently, we decide whether a changed paragraph should instead be treated as a removed paragraph plus an added paragraph by applying a threshold to the percentage of changed characters relative to unchanged characters.

Generally, we believe that people consider paragraphs unrelated if the changes are very fragmented. If the change is, for example, a single very big addition at the end, people are still likely to consider it a change to the same paragraph.
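To make the character-count measure concrete, here is a minimal Python sketch. It is not the current implementation: the function name `changed_char_ratio`, the use of Python's `difflib`, and the choice to count characters across both versions are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def changed_char_ratio(old: str, new: str) -> float:
    """Return the share of characters (old + new) that lie outside matching blocks.

    Hypothetical helper: 0.0 means the paragraphs are identical,
    1.0 means they share no characters in common subsequences.
    """
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    # Each matching block covers `size` characters in BOTH versions.
    unchanged = sum(block.size for block in matcher.get_matching_blocks())
    total = len(old) + len(new)
    if total == 0:
        return 0.0
    return 1.0 - (2 * unchanged) / total


if __name__ == "__main__":
    base = "The quick brown fox."
    appended = base + " It jumps over the lazy dog every single day."
    # A large addition at the end pushes the ratio up even though
    # the original text is completely intact.
    print(changed_char_ratio(base, appended))
```

A measure like this looks only at how much text changed, not at how the changes are distributed, which is the limitation the motivation describes.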

Task
Please investigate whether we can use character runs instead of a character count to determine the threshold. Also try to find out whether a normalized number or a fixed number (e.g. anything more than 5 runs is an unrelated paragraph) would work better.
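One possible reading of "character runs" is the number of maximal contiguous stretches of changed characters between the two versions. The Python sketch below follows that reading. All function names are hypothetical, `difflib` stands in for whatever diff engine is actually used, and the normalized limit of 4 runs per 100 characters is an arbitrary placeholder. Only the fixed limit of 5 comes from the example in this task.

```python
from difflib import SequenceMatcher


def count_changed_runs(old: str, new: str) -> int:
    """Count contiguous changed stretches (assumed definition of a 'run')."""
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    # Every non-'equal' opcode is one insert, delete or replace stretch
    # sitting between two unchanged stretches.
    return sum(1 for tag, *_ in matcher.get_opcodes() if tag != "equal")


def is_unrelated_fixed(old: str, new: str, max_runs: int = 5) -> bool:
    """Fixed variant: more than `max_runs` runs means unrelated paragraphs."""
    return count_changed_runs(old, new) > max_runs


def is_unrelated_normalized(
    old: str, new: str, max_runs_per_100_chars: float = 4.0
) -> bool:
    """Normalized variant: scale the run count by paragraph length.

    The default limit is a placeholder, not a tuned value.
    """
    length = max(len(old), len(new))
    if length == 0:
        return False
    runs = count_changed_runs(old, new)
    return runs * 100 / length > max_runs_per_100_chars


if __name__ == "__main__":
    # One big addition at the end is a single run.
    print(count_changed_runs("abc", "abcdef"))
    # Many small scattered edits produce many runs.
    print(count_changed_runs("a1b2c3d4e5f6g", "a-b-c-d-e-f-g"))
```

The fixed variant is simpler, but it treats a short and a long paragraph alike. The normalized variant tolerates more runs in longer paragraphs, at the cost of needing a tuned limit.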

If this is easy and the character-run-based algorithm is deemed better, you may also implement it.