See this session:
$ echo -e 'foo\n\n<!--bar-->\n\n[[Category:Baz]]' > /tmp/wt $ php bin/parse.php < /tmp/wt > /tmp/old.html $ sed 's/<\/p>/<\/p>\n<ul><li>x<\/li><\/ul>/g;' < /tmp/old.html > /tmp/new.html $ php bin/parse.php --selser --html2wt --oldtextfile /tmp/wt --oldhtmlfile /tmp/old.html < /tmp/new.html > /tmp/wt.new $ diff /tmp/wt /tmp/wt.new 3c3 < <!--bar--> --- > * x 4a5 > <!--bar--> $ cat /tmp/wt.new foo * x <!--bar--> [[Category:Baz]]
This mimics the most common dirty diff now seen on discussion tool edits (hiwiki and ptwiki being the wikis where this seems to be prevalent).
I originally thought this was because of DOMDiff (which marks a large part of the DOM modified) but even after I manually tweaked the diff algorithm, the diffs persisted. It turns out that the separator constraints code forces max-newlines to 2 to prevent empty paragraph-wrappers. However, in this particular case, the trailing newlines, comment, and the category link are not p-wrapped since they are all rendering transparent. So, this is somewhat of an edge case in the logic but something that is being repeated triggered when replies are being added close to the end of a document.
But, more pertinently, the code doesn't have smarts to realize that even though all the content is marked edited, all the content comes from source. I am not convinced we should add that logic since it just adds unnecessary complexity.
So, if there is a fix we should make, it might be to the newline constraints logic, if at all.