Introduce a configuration variable for the bailout threshold for moved-paragraph-detection in wikidiff2
Closed, ResolvedPublic

Description

Context: To enable detection of changes in moved paragraphs, a new (fourth) parameter is going to be added to wikidiff2_do_diff(). That new parameter controls the maximum number of paragraphs to compare, and will have a default value for backwards compatibility (probably 25 or so). Setting it to 0 will cause detection of moved paragraph detection to be disabled, returning wikidiff2_do_diff to the previous behavior.

We need a MediaWiki configuration variable to control this, named $wgWikiDiff2MoveParagraphDetectionCutoff or some such. That variable should default to (probably) 25.

For deployment, $wgWikiDiff2MoveParagraphDetectionCutoff should initially be set to 0, so on-wiki behavior does not change.

Once this has been done, the new version of wikidiff2 can be rolled out to the production servers.

The next step would be to set it to 25 or something on test.wikipedia.org to expose it to the community for testing. Once testing is complete, set it to 25 (or something) on all wikis.

daniel created this task.May 30 2017, 3:30 PM

@daniel you mentioned yesterday the spot in MediaWiki where this needs to be added. since I've forgotten it again, can you mention it here if you know it from the top of your head?

Nemo_bis added a subscriber: Nemo_bis.

This is precious work towards T15462 / T7072, right?

This is work towards T146781: Show changes in moved text chunks (C++), which looking at it seems to have a lot of overlap with T15462. I'm not sure if the tasks are identical, though.

daniel added a comment.Jun 2 2017, 2:18 PM

@Lea_WMDE T15462: Enhance line matching in diffs is about wikidiff2 "pairing" the wrong lines (which here means "paragraphs") into a "change", causing nearly all words in the paragraph to be marked as changed. This is the issue that Johannes discovered while trying to make a worst case benchmark.

In contrast, T139603: Show text changes when moving text chunks (#2) is about the inverse issue: wikidiff2 not "pairing" paragraphs into a "change" if the paragraph was moved.

These two issues are related, but not the same.

Tobi_WMDE_SW removed jkroll as the assignee of this task.
Tobi_WMDE_SW triaged this task as High priority.
Tobi_WMDE_SW claimed this task.

@daniel in reply to your comment on https://gerrit.wikimedia.org/r/#/c/357826/4/includes/diff/DifferenceEngine.php:

Wrap-up of the question: if we introduce a 4th parameter for the call to wikidiff2_do_diff, what would happen if an old version of wikidiff2 is deployed? Would this break? And how can we solve this?

IMO this would break (@jkroll please approve), so we need be careful when we deploy. One solution that comes to my mind is, that we could deploy a new version of wikidiff2 well beforehand that has the 4th parameter (maxMovedLines) set to 0 by default. This would then not change any behavior unless we deploy the core-change with the configuration value and set it to something else. The only case where we need to be careful here is when we would need to roll-back wikidiff2 to an old version that doesn't have the 4th parameter. But IMO this should not happen.

What do you think?

Well the solution would be to make MediaWiki work with both the old and the new version of Wikidiff2, right?

Options I can think of to make it work while the rollout is out of sync:

  • Test for the config var and only use the fourth parameter if it is set in LocalSettings.php. Otherwise call the function with 3 parameters, which will use the default value configured in Wikidiff2 (currently 25, can be changed) when the patch is rolled out.
  • Somehow check how many parameters the function has. I don't know if this is possible with the extension API or in PHP itself. Ok apparently it is possible: https://stackoverflow.com/a/7774743
  • Try calling it with 4 parameters, catching the exception it will throw if it takes only 3, then re-call it in the catch block with 3 args. Get extra points for confusing people. :D Again, not sure if this is possible in PHP. This is clearly the least desirable option.

All of those are ugly, but we need this at least in the interim, correct? Question is, can the hack be taken out later once the new Wikidiff2 is rolled out? Non-Wikimedia projects might still use old Wikidiff2.

"Test for the config var and only use the fourth parameter if it is set in LocalSettings.php."

This is not going to work, since at least the default setting for this variable will be deployed along with the code. So the config variable will always be set.

We could call the method with 3 parameters if the setting is false, and deploy that setting to the cluster before the code.
That's just as ugly, but a little less "magical", more explicit. The respective code should have a comment top this effect.

But before we go through all this trouble, please check that the new PHP code indeed has a problem with the old wikidiff2. PHP generally ignores extra parameters, so why not here? (If we rely on this, check on zend and hhvm, please).

But before we go through all this trouble, please check that the new PHP code indeed has a problem with the old wikidiff2. PHP generally ignores extra parameters, so why not here? (If we rely on this, check on zend and hhvm, please).

Good point. @jkroll can you check this for zend and hhvm?

This, of course, cannot work:

<?php
function hi($str) {
    print($str . "\n");
}
hi("hello");
hi("hello", "world"); # call the function with an invalid number of parameters
?>

But it does. The extra parameter is just ignored. PHP is even more broken than I thought.

I'll see if I can test whether it makes any difference for extensions. If it doesn't, let's just rely on this PHP not-a-bug. :)

That's pretty standard behavior for scripting languages :)

I just wasn't sure if it would work with functions implemented in C++-.

If it doesn't, let's just rely on this PHP not-a-bug. :)

Yah, and it's a wontfix: :D
https://bugs.php.net/bug.php?id=13892

Okay, so... This:

<?php
function hi($str) {
    print($str . "\n");
}
print wikidiff2_do_diff("hello\nworld\n", "world\nhello\n", 1, 1, "extra parameter that shouldn't be here");
hi("you should not see this string", "world");
?>

Outputs this when run from the command line:

PHP Warning:  wikidiff2_do_diff() expects at most 4 parameters, 5 given in /var/www/html/php-parameter-weirdness.php on line 5
you should not see this string

Note that there's no output from wikidiff2_do_diff(), so the function is just... Not called. Which means the extra parameter does get treated as an error somehow, despite the "Warning" output. The error doesn't seem to be erroneous enough to throw an exception or stop the script though, as a reasonable person might expect. Instead, the next line is executed -- where the extra parameter is obviously not treated as an error.

The same thing happens in hhvm. No output from wikidiff2_do_diff().

So we do need to check for this. Either with ReflectionFunction (https://stackoverflow.com/a/7774743) or per @daniel's way of setting the default value to false and checking for that. I don't have a strong preference.

I vote for using the 3 param call if the setting is false.

Change 357826 merged by jenkins-bot:
[mediawiki/core@master] Introduce config var for moved-paragraph-detection threshold

https://gerrit.wikimedia.org/r/357826

Tobi_WMDE_SW closed this task as Resolved.Jul 25 2017, 2:15 PM

Change 375784 had a related patch set uploaded (by Tobias Gritschacher; owner: Tobias Gritschacher):
[mediawiki/core@master] Check for Wikidiff2 version 0.3 NOT for 0.3.0

https://gerrit.wikimedia.org/r/375784

Change 375784 merged by jenkins-bot:
[mediawiki/core@master] Check for Wikidiff2 version 0.3 NOT for 0.3.0

https://gerrit.wikimedia.org/r/375784