Page MenuHomePhabricator

Library for unified_diffs
Open, Needs TriagePublic

Description

During the HTML enrichment pipeline we are creating a unified_diff between the current HTML and the parent revision HTML. We are using the library difflib.
This library produces a unified_diff, but it has an issue reported in 2008 that isn't going to be solved. What happens is that if the text compared doesn't end on /n, the unified_diff is wrong.

We have merged an MR that fixes this issue by appending /n to the text, and removing it when it's being read.

The issue with this solution is that if there is a real text that ends on several /n, they will be removed. For example, if a text ends on several /n and the modified text is removing them, we'll skip it.

For example:

from_str = "one\ntwo\nthree\n\n\n\n\n\n"
to_str = "one\ntwo\nthree\n\n"

Will produce a diff like:

--- from
+++ to
@@ -2,7 +2,3 @@
 two
 three
 
-
-
-
-

Which is right, but our tool that rebuilds the original string will use strip() and will return

from_str = "one\ntwo\nthree"

Which isn't right.

The solutions for this library don't seem simple. Looking at some workarounds, it seems that using the added /n is one of the most common, although it generates this issue. Another solution is to use a custom marker for the end of line, but that means that the unified_diff won't be GNU compatible.

Looking at other solutions, we could probably use diff in a subprocess.

Details

Related Changes in GitLab:
TitleReferenceAuthorSource BranchDest Branch
Split line by \n to generate html diffrepos/data-engineering/mediawiki-event-enrichment!124akhatunakhatun/fix-html-diffmain
Customize query in GitLab

Event Timeline

and removing it when it's being read.

Yeah, and we want people to be able to easily apply the diff without access to our code. As is, they have to know and implement this workaround themselves.

Another problem, not with difflib specifically but with how we create lines before input into difflib. More here: https://gitlab.wikimedia.org/repos/data-engineering/mediawiki-event-enrichment/-/merge_requests/124

In short: splitlines() breaks on any Unicode line break: \n, \r\n, \r, U+2028, etc. If the only separator between two “lines” for splitlines is e.g. U+2028, they become two logical lines for difflib but one physical \n-terminated row in the file causing invalid unified diff format.

FWIW, if there turn out to be enough problems with unified diff, and enough advantages for diff-match-patch, we can still go with diff-match-patch. It is up to us! :)