In the HTML enrichment pipeline we create a [[ https://en.wikipedia.org/wiki/Diff#Unified_format | unified_diff ]] between the current HTML and the parent revision HTML, using the Python [[ https://docs.python.org/es/3/library/difflib.html | library difflib ]].
This library produces a unified_diff, but it has an [[ https://bugs.python.org/issue2142 | issue reported in 2008 ]] that isn't going to be solved: if the compared text doesn't end with `\n`, the unified_diff is wrong. The last line is emitted without a line terminator and without the `\ No newline at end of file` marker.
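A minimal reproduction of the difflib behaviour described above, using illustrative two-line inputs (not taken from the pipeline) that lack a trailing `\n`:

```python
import difflib

# Illustrative inputs: neither string ends with a newline.
from_lines = "one\ntwo".splitlines(keepends=True)
to_lines = "one\nthree".splitlines(keepends=True)

diff_text = "".join(
    difflib.unified_diff(from_lines, to_lines, fromfile="from", tofile="to")
)

# The removed and added lines run together on a single line, and there is
# no "\ No newline at end of file" marker, so the result is not a valid patch.
print(repr(diff_text))
```

The removed line `-two` and the added line `+three` end up glued together because neither input line carries its own terminator.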
We have [[ https://gitlab.wikimedia.org/repos/data-engineering/mediawiki-event-enrichment/-/merge_requests/122 | merged an MR ]] that works around this issue by appending `\n` to the text, and removing it when the text is read back.
The problem with this solution is that when the real text ends with several `\n`, all of them are removed, not just the one we appended. In particular, if a revision only removes some of the trailing `\n` of its parent, that change is lost.
For example:
```
from_str = "one\ntwo\nthree\n\n\n\n\n\n"
to_str = "one\ntwo\nthree\n\n"
```
This produces a diff like:
```
--- from
+++ to
@@ -2,7 +2,3 @@
two
three
-
-
-
-
```
This diff is right, but our tool that rebuilds the original string uses `strip()` and returns:
```
from_str = "one\ntwo\nthree"
```
That isn't right: all the trailing `\n` of the original string are lost.
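The difference between stripping and undoing only the appended `\n` can be sketched as follows. The helper names `add_sentinel` and `remove_sentinel` are hypothetical and not taken from the MR, and the sketch assumes the rebuild step ends up with the text plus exactly one appended `\n`:

```python
def add_sentinel(text: str) -> str:
    """Append the single newline used to work around the difflib issue."""
    return text + "\n"


def remove_sentinel(text: str) -> str:
    """Remove exactly one trailing newline, leaving any real ones in place."""
    if not text.endswith("\n"):
        raise ValueError("expected the text to end with the appended newline")
    return text[:-1]


original = "one\ntwo\nthree\n\n\n\n\n\n"
stored = add_sentinel(original)

lossy = stored.strip()           # drops every trailing newline
exact = remove_sentinel(stored)  # drops only the appended one

print(repr(lossy))
print(repr(exact))
```

Whether this is enough depends on why the tool calls `strip()` in the first place, which this sketch does not cover.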
There doesn't seem to be a simple fix within this library. Among the workarounds we have looked at, appending `\n` is one of the most common, even though it causes the problem described here. Another option is to use a custom marker for the end of the text, but that means that the unified_diff won't be GNU compatible.
As an alternative, we could probably run `diff` in a subprocess.
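A rough sketch of the subprocess option, assuming a `diff` executable that accepts `-u` and `-L` (as GNU diff does) is available on the host. The function name `unified_diff_via_subprocess` is hypothetical:

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def unified_diff_via_subprocess(from_str: str, to_str: str) -> str:
    """Return a unified diff of two strings produced by the external `diff`."""
    if shutil.which("diff") is None:
        raise RuntimeError("no `diff` executable found on PATH")
    with tempfile.TemporaryDirectory() as tmp:
        from_path = Path(tmp) / "from"
        to_path = Path(tmp) / "to"
        # Write bytes so that Python neither translates nor adds newlines.
        from_path.write_bytes(from_str.encode("utf-8"))
        to_path.write_bytes(to_str.encode("utf-8"))
        result = subprocess.run(
            ["diff", "-u", "-L", "from", "-L", "to", str(from_path), str(to_path)],
            capture_output=True,
            check=False,  # diff exits with 1 when the inputs differ
        )
    if result.returncode > 1:  # 0 = identical, 1 = different, >1 = error
        raise RuntimeError(result.stderr.decode("utf-8", "replace"))
    return result.stdout.decode("utf-8")
```

With this approach no `\n` has to be appended or removed, since `diff` terminates each line itself and marks a missing final newline. The cost is one extra process and two temporary files per comparison.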