Page MenuHomePhabricator

Pywikibot diff highlights entire line when only text is removed
Closed, ResolvedPublic

Description

When the modification removes part of a word, such as replacing Pywikipediabot with Pywikibot, the removed text is shown as red in the old line, but the entire new line is highlighted in green. This can be visually distracting.

$ python pwb.py replace -family:mediawiki -lang:mediawiki -page:"Manual:Pywikibot/Compat/picasacopier.py" Pywikipediabot Pywikibot
setting username JVbot for mediawiki:mediawiki
The summary message for the command line replacements will be something like: Bot: Automated text replacement  (-Pywikipediabot +Pywikibot)
Press Enter to use this automatic message, or enter a description of the
changes your bot will make: 
Retrieving 1 pages from mediawiki:mediawiki.


>>> Manual:Pywikibot/Compat/picasacopier.py <<<
@@ -7 +7 @@
- Picasacopier is part of Pywikipedia. You first need to install pywikipedia. At [[Pywiki**pedia**bot/Basic use|Using the python wikipediabot]] you can find a manual on how to install pywikipedia. SVN install is recommended. 
+ **Picasacopier is part of Pywikipedia. You first need to install pywikipedia. At [[Pywikibot/Basic use|Using the python wikipediabot]] you can find a manual on how to install pywikipedia. SVN install is recommended. **

Do you want to accept these changes? ([y]es, [N]o, [e]dit, open in [b]rowser, [a]ll, [q]uit): N

An alternative is to not highlight any of the new line.
Are there any other ways of handling this scenario used by other colorized diff terminal interfaces?

Event Timeline

Change 278967 had a related patch set uploaded (by Mpaa):
diff: do not highlight entire line when only text is removed

https://gerrit.wikimedia.org/r/278967

The solution in Change 278967 is to highlight the same character positions in the amended line. So the highlights are in ** below

>>> Manual:Pywikibot/Compat/picasacopier.py <<<
@@ -7 +7 @@
- Picasacopier is part of Pywikipedia. You first need to install pywikipedia. At [[Pywiki**pedia**bot/Basic use|Using the python wikipediabot]] you can find a manual on how to install pywikipedia. SVN install is recommended. 
+ Picasacopier is part of Pywikipedia. You first need to install pywikipedia. At [[Pywiki**bot/B**asic use|Using the python wikipediabot]] you can find a manual on how to install pywikipedia. SVN install is recommended.

When testing on https://test.wikipedia.org/wiki/User:John_Vandenberg/diff_test, that change correctly handles when the amended line is shorter than the last position of the remove text.

>>> User:John Vandenberg/diff test <<<
@@ -1,2 +1,2 @@
- Pywiki**pedia**bot foo
+ Pywiki**bot f**oo
- Pywiki**pedia**bot
+ Pywiki**bot**

However it fails when the amended line is shorter than the starting position of the removed text, highlighting all of every line (this is actually a separate bug which also occurs in master).

$ python pwb.py replace -family:test -lang:test -page:"User:John_Vandenberg/diff_test" Pywikipediabot Pywiki
The summary message for the command line replacements will be something like: Bot: Automated text replacement  (-Pywikipediabot +Pywikibot)
Press Enter to use this automatic message, or enter a description of the
changes your bot will make: 
Retrieving 1 pages from test:test.


>>> User:John Vandenberg/diff test <<<
@@ -1,2 +1,2 @@
- **Pywikipediabot foo**
- **Pywikipediabot**
+ **Pywiki foo**
+ **Pywiki**

Now included in Change 278967 (I hope ...).

Hmm, PS4 indents one of the replaced lines, and marks all of both lines as red & green.

$ python pwb.py replace -family:test -lang:test -page:"User:John_Vandenberg/diff_test" Pywikipediabot Pywiki -user:JVbot-test
>>> User:John Vandenberg/diff test <<<
@@ -1,2 +1,2 @@
- Pywikipediabot foo
- Pywikipediabot
+ Pywiki foo
   + Pywiki

On the marking, I cannot do too much.
As you have changed the replacements, now the underline difflib gives:

'diff': [u'- Pywikipediabot foo\n', u'- Pywikipediabot\n', u'+ Pywiki foo\n', u'+ Pywiki\n']

instead of the previous case

[u'- Pywikipediabot foo\n', '?       -----\n', u'+ Pywikibot foo\n', u'- Pywikipediabot\n', '?       -----\n', u'+ Pywikibot\n']

so there is no format line to help, and I can't do anything but color them.

On the indenting, I will submit a fix.

On the recent patch I can still cause indents with two lines of original text

.foola.Pywikipediabot
.foo.Pywikipediabot.foo.
$ python pwb.py replace -family:test -lang:test -page:"User:John_Vandenberg/diff_test" Pywikipediabot Pywiki
The summary message for the command line replacements will be something like: Bot: Automated text replacement  (-Pywikipediabot +Pywiki)
Press Enter to use this automatic message, or enter a description of the
changes your bot will make:  
Retrieving 1 pages from test:test.


>>> User:John Vandenberg/diff test <<<
@@ -1,2 +1,2 @@
- .foola.Pywikipediabot
+ .foola.Pywiki
       - .foo.Pywikipediabot.foo.
+ .foo.Pywiki.foo.

Tried again to remove the indent

I cant find any more indent problems ;-)

There are lots of trailing whitespace problems, simply because there is attempt to indicate where the end of line is. I think that problem needs to be a separate design task. Not easy. I think we need to use a background color to solve that, and/or an illegal wikitext character for the end of line, if one exists.

However I have found one more strange coloring with the proposed algorithm.

With the following input, the old lines are completely red, when only minor changes were made.

{default}Foo bar Pywikipediabot foo bar
Pywikipediabot foo
$ python pwb.py replace -family:test -lang:test -page:"User:John_Vandenberg/diff_test" -regex " +" "   "
The summary message for the command line replacements will be something like: Bot: Automated text replacement  (- + +   )
Press Enter to use this automatic message, or enter a description of the
changes your bot will make: 
Retrieving 1 pages from test:test.


>>> User:John Vandenberg/diff test <<<
@@ -7,2 +7,2 @@
- {default}Foo bar Pywikipediabot foo bar
+ {default}Foo   bar   Pywikipediabot   foo   bar
- Pywikipediabot foo
+ Pywikipediabot   foo

The proposed algorithm doesnt highlight red other lines with whitespace only changes, so I think this scenario should also not highlight the entire line red. i.e. I think we need to leave whitespace only changes to a different task.

However if this last problem is not easy to solve, maybe we merge and create separate tasks for the remaining edge cases. And create tests for the merged solution, so we dont regress with future improvements.

It depends on how difflib tags the changes.
After the first line, there is no formatting line, so the only thing I can do is mark it red.
After the second, there is format indication, in this case it indicates that spaces have been added (which are colored in green but not visible ...)

- Foo bar Pywikipediabot foo bar\n
+ Foo   bar   Pywikipediabot   foo   bar\n
?              ++   ++                ++   ++\n
- Pywikipediabot foo\n'
+ Pywikipediabot   foo\n
?                ++\n

In the second case (which you have not reported, I include it here as ref), final whitespace have been trimmed.
Format information is provided and so a more precise coloring can be done:

- \\03  Pywikipediabot    \n', '
?                       -\n
+ \\03   Pywikipediabot   \n
?    +\n
- \\03  foo Pywikibot    foo\n
?                      -\n
+ \\03   foo   Pywikibot   foo\n
?      +   ++\n

Regarding the tests, the best approach IMO is to write tests for the Hunk class in diff.py, which is he class in charge of formatting changes.
Do you have any opinion?

Possible improvements can be done on how SequenceMatcher in difflib is used (parameters, word by word comparison, etc.)

I have done additional changes to (hopefully) improve, please give a look.

The proposed algorithm doesnt highlight red other lines with whitespace only changes, so I think this scenario should also not highlight the entire line red. i.e. I think we need to leave whitespace only changes to a different task.

However if this last problem is not easy to solve, maybe we merge and create separate tasks for the remaining edge cases. And create tests for the merged solution, so we dont regress with future improvements.

Support for background colors to support show diff: T135344

Change 278967 merged by jenkins-bot:
diff: do not highlight entire line when only text is removed

https://gerrit.wikimedia.org/r/278967

jayvdb assigned this task to Mpaa.