During the HTML enrichment pipeline we are creating a unified_diff between the current HTML and the parent revision HTML. We are using the library difflib.
This library produces a unified_diff, but it has an issue reported in 2008 that isn't going to be solved. What happens is that if the text compared doesn't end on /n, the unified_diff is wrong.
We have merged an MR that fixes this issue by appending /n to the text, and removing it when it's being read.
The issue with this solution is that if there is a real text that ends on several /n, they will be removed. For example, if a text ends on several /n and the modified text is removing them, we'll skip it.
For example:
from_str = "one\ntwo\nthree\n\n\n\n\n\n" to_str = "one\ntwo\nthree\n\n"
Will produce a diff like:
--- from +++ to @@ -2,7 +2,3 @@ two three - - - -
Which is right, but our tool that rebuilds the original string will use strip() and will return
from_str = "one\ntwo\nthree"
Which isn't right.
The solutions for this library don't seem simple. Looking at some workarounds, it seems that using the added /n is one of the most common, although it generates this issue. Another solution is to use a custom marker for the end of line, but that means that the unified_diff won't be GNU compatible.
Looking at other solutions, we could probably use diff in a subprocess.