Smart limit for word-level diffs
Closed, Resolved · Public

Description

Currently, if a paragraph's length exceeds 10000 bytes, you get no word-level diff, as illustrated by this example. The limit exists for performance reasons; however, we can certainly investigate this more deeply.

  • What are the worst-case examples of diffable text?
  • What's the impact of this limit on non-Latin wikis? What's the average paragraph length in bytes on Chinese Wikipedia compared to English, for example?
  • What affects performance more - paragraph length or number of words?
  • On modern hardware (as opposed to 15 years ago when DairikiDiff was created and 10 years ago for wikidiff2), can we afford to increase the limit?
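For context, the behaviour being discussed can be sketched roughly like this (a minimal Python illustration; the function names and the difflib stand-in are mine, not the actual MediaWiki/wikidiff2 code):

```python
import difflib

# The byte limit discussed in this task. All names here are illustrative,
# not the real MediaWiki / wikidiff2 API.
MAX_WORD_LEVEL_DIFF_BYTES = 10000

def word_level_diff(old: str, new: str) -> str:
    """Word-level diff via difflib, standing in for wikidiff2's algorithm."""
    old_words, new_words = old.split(), new.split()
    out = []
    matcher = difflib.SequenceMatcher(None, old_words, new_words)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            out.extend(old_words[i1:i2])
        else:
            out.extend("-" + w for w in old_words[i1:i2])
            out.extend("+" + w for w in new_words[j1:j2])
    return " ".join(out)

def diff_paragraph(old: str, new: str) -> str:
    """Skip word-level diffing when either side exceeds the byte limit."""
    if max(len(old.encode("utf-8")),
           len(new.encode("utf-8"))) > MAX_WORD_LEVEL_DIFF_BYTES:
        # Too long: the whole paragraph is simply marked as changed.
        return "WHOLE_PARAGRAPH_CHANGED"
    return word_level_diff(old, new)
```

Once a paragraph crosses the byte threshold, readers only see that the paragraph changed, not which words changed inside it.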
MaxSem created this task.Mar 3 2016, 2:52 AM
Restricted Application added a subscriber: Aklapper.Mar 3 2016, 2:52 AM
MaxSem updated the task description.Mar 3 2016, 2:56 AM
DannyH moved this task from Untriaged to Backlog on the Community-Tech board.
DannyH added a subscriber: DannyH.
Hinote added a subscriber: Hinote.Mar 4 2016, 8:24 PM

Hi there from the Russian Wikisource. The situation is even worse for Wikisource projects. Unlike Wikipedia projects, where users can adjust long paragraphs to make the text more readable, at Wikisource we are in most cases simply not allowed to insert arbitrary empty lines to break up long paragraphs, because these are the paragraphs in their original form from the books and SHALL NOT be shorter... Also, some issues in the current implementation of the ProofreadPage extension prevent us from inserting single line feeds inside a paragraph to split it across several lines... In many, many cases, the output of the current wikidiff is completely unusable for our patrolling users, since it marks the whole paragraph as changed...

I could collect statistics on paragraph lengths for our Wikisource pages (taken from a wikidump archive, for example) if that helps. What would be of interest here? The distribution of string lengths between '\n\n' separators?
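A sketch of the kind of statistics proposed here (assuming paragraphs are the chunks between blank lines, as in wikitext; the bucket boundaries are arbitrary choices, not anything mandated by MediaWiki):

```python
from collections import Counter

def paragraph_length_histogram(text, buckets=(500, 1000, 10000, 50000)):
    """Bucket the UTF-8 byte lengths of paragraphs separated by blank lines."""
    hist = Counter()
    for para in text.split("\n\n"):
        n = len(para.encode("utf-8"))
        # Find the first bucket the paragraph fits into, else the overflow bin.
        label = next((f"<= {b}" for b in buckets if n <= b),
                     f"> {buckets[-1]}")
        hist[label] += 1
    return hist
```

Run over the page text extracted from a dump, this gives the distribution directly (e.g. how many paragraphs exceed the 10000-byte limit).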

MaxSem added a comment.Mar 4 2016, 8:33 PM

Yes, examples of pages where this creates problems will be highly helpful.

Hinote added a comment.Mar 4 2016, 8:57 PM

Sure, we have a lot. Here is a change made by a new user today: see it. I don't know how to patrol it without pasting both versions into an external tool to see the actual changes... As you can see, the page is an encyclopedic article consisting of a single paragraph (the content was uploaded by a bot), and we have thousands of similar pages containing long paragraphs. Patrolling changes on such pages is currently very tedious...

I'll collect some statistics on our pages. This will take some time, though...

Hinote added a comment.Mar 4 2016, 9:45 PM

By the way, it would be very nice to add a wgDiffSomethingLimit setting passed to Diff() or wikidiff2, so that we could request a bigger value. A high limit may be a problem for busy projects like the Wikipedias, while we at the less crowded Wikisource could request a bigger value in the project configuration to meet our needs...
(This should probably be moved into a separate request.)

Change 275111 had a related patch set uploaded (by MaxSem):
Instrument diff timing

https://gerrit.wikimedia.org/r/275111

MaxSem claimed this task.Mar 4 2016, 10:48 PM
Nemo_bis triaged this task as Normal priority.Mar 5 2016, 7:31 AM
Nemo_bis added a subscriber: Nemo_bis.

What's the impact of this limit on non-Latin wikis? What's the average paragraph length in bytes on Chinese Wikipedia compared to English, for example?

This bash loop is very slow, but does the job with a POSIX-compliant wc:

pbzip2 -dc zhwiki-20160203-pages-articles-multistream.xml.bz2 | while IFS= read -r line; do if [[ $( echo "$line" | wc -c ) -gt 10000 ]]; then echo "long"; fi; done | grep -c long

Nemo_bis added a comment (edited).Mar 5 2016, 12:39 PM

Ah, gawk has a --characters-as-bytes option:

$ time pbzip2 -dc zhwiki-20160203-pages-articles-multistream.xml.bz2 | gawk --characters-as-bytes 'length($0) > 10000' | wc -l
264

real    0m31.643s
user    3m42.036s
sys     0m13.675s
$ pbzip2 -dc zhwiki-20160203-pages-articles-multistream.xml.bz2 | gawk 'length($0) > 10000' | wc -l
223
$ pbzip2 -dc zhwiki-20160203-pages-articles-multistream.xml.bz2 | gawk --characters-as-bytes 'length($0) > 1000' | wc -l
264668
$ pbzip2 -dc zhwiki-20160203-pages-articles-multistream.xml.bz2 | gawk --characters-as-bytes 'length($0) > 500' | wc -l
1111187
$ pbzip2 -dc zhwiki-20160203-pages-articles-multistream.xml.bz2 | wc -l
118101527

Even zh.wiki has only 16 lines over 50k bytes. It's indeed orders of magnitude worse on the Russian Wikisource:

$ pbzip2 -dc ruwikisource-latest-pages-articles-multistream.xml.bz2  | gawk --characters-as-bytes 'length($0) > 10000' | wc
  14127 5903002 250910722
$ pbzip2 -dc ruwikisource-latest-pages-articles-multistream.xml.bz2  | gawk --characters-as-bytes 'length($0) > 50000' | wc
    330  629496 23173681
$ pbzip2 -dc ruwikisource-latest-pages-articles-multistream.xml.bz2  | gawk --characters-as-bytes 'length($0) > 100000' | wc
     28  114429 3839683

On a "latin" Wikipedia of similar size to zh.wiki:

$ pbzip2 -dc ptwiki-20160203-pages-articles-multistream.xml.bz2  | gawk --characters-as-bytes 'length($0) > 10000' | wc -l
113
$ pbzip2 -dc ptwiki-20160203-pages-articles-multistream.xml.bz2  | gawk 'length($0) > 10000' | wc -l
105
$ pbzip2 -dc ptwiki-20160203-pages-articles-multistream.xml.bz2  | gawk --characters-as-bytes 'length($0) > 50000' | wc -l
2

On the English Wikipedia, in contrast, the difference between bytes and characters is negligible:

$ pbzip2 -dc enwiki-20160113-pages-articles-multistream.xml.bz2  | gawk --characters-as-bytes 'length($0) > 10000' | wc
21813 7931407 564202925
$ pbzip2 -dc enwiki-20160113-pages-articles-multistream.xml.bz2  | gawk 'length($0) > 10000' | wc
  21750 7886200 563542459
$ pbzip2 -dc enwiki-20160113-pages-articles-multistream.xml.bz2  | gawk 'length($0) > 50000' | wc
   1641 3124343 166409617
Hinote added a comment (edited).Mar 5 2016, 11:22 PM

Nemo_bis, thanks for the stats; I've got similar results (of course) for ruwikisource. We have more than ten thousand pages with long lines...

The first reason for the problem is that the limit of 10000 bytes corresponds to only about 5000-6000 Unicode codepoints for typical Russian text (counting Cyrillic letters as 2 bytes, and spaces, punctuation, and the occasional Latin letter as 1 byte). If you count the lines exceeding this limit in the de, en, and fr Wikisource projects, you'll also find thousands of occurrences. So the cause is not just our specific content profile, but language specifics in general. For CJK languages it's even worse: 10000 bytes is only about 3000-4000 codepoints...
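The byte-vs-codepoint arithmetic above is easy to check directly; in UTF-8, Cyrillic letters take 2 bytes each and CJK characters take 3:

```python
# Codepoint count vs UTF-8 byte count for short samples in each script.
samples = {
    "latin": "word",       # 1 byte per letter in UTF-8
    "cyrillic": "слово",   # 2 bytes per letter
    "cjk": "中文词语",      # 3 bytes per character
}
for name, s in samples.items():
    print(name, len(s), len(s.encode("utf-8")))
```

So a 10000-byte cap admits roughly half as many Cyrillic characters, and a third as many CJK characters, as Latin ones.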

Also, as I said, all Wikisources (not only the Russian one) contain far longer lines than the Wikipedias.

Of course, we could probably live fine with a limit of, say, 50000 bytes. The remaining pages exceeding that limit (a few hundred) could be manually reviewed and carefully split into smaller lines in the proper places. As far as I can see, this cannot be done automatically by bots, since that would produce many errors, for example when a line is split inside a tag, breaking the text layout.

At the same time, our preferred limit of 50000 bytes or higher looks too high for enwiki (for example) and could degrade server performance there. (Is there any performance data?) This is why a per-wiki setting for this limit seems to be the best solution, in addition to a global increase for all projects...

Change 275111 merged by jenkins-bot:
Instrument diff timing

https://gerrit.wikimedia.org/r/275111

Krinkle added a subscriber: Krinkle.Mar 6 2016, 8:56 PM

Yes, examples of pages where this creates problems will be highly helpful.

Another real example: here. I had doubts about the correctness of changes I made by hand in a series of similar edits on other pages, so I looked at the diff page. It's completely useless in this case...

What affects performance more - paragraph length or number of words?

As expected, P2752 demonstrates a quadratic dependency on word count:

$ hhvm benchmarkParagraphDiff.php
Running PHP version 5.6.99-hhvm (x86_64) on Linux 3.13.0-30-generic #55-Ubuntu SMP Fri Jul 4 21:40:53 UTC 2014

100 times: function BenchmarkParagraphDiff->test(5) :
   39850.19ms (398.50ms each)
100 times: function BenchmarkParagraphDiff->test(10) :
   11410.64ms (114.11ms each)
100 times: function BenchmarkParagraphDiff->test(20) :
   3259.90ms ( 32.60ms each)
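Reading the numbers above (assuming the benchmark argument is the average word length, so a fixed-size paragraph at test(5) holds twice as many words as at test(10)), quadratic behaviour in word count predicts roughly a 4x time increase each time the word length halves; the observed ratios are close to that:

```python
# Per-call times from the benchmark output above, in milliseconds.
times = {5: 398.50, 10: 114.11, 20: 32.60}

# Halving the word length doubles the word count; O(words^2) predicts ~4x time.
ratio_5_10 = times[5] / times[10]
ratio_10_20 = times[10] / times[20]
print(ratio_5_10, ratio_10_20)  # both ratios are roughly 3.5
```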
Hinote added a comment (edited).Mar 14 2016, 10:43 PM

As expected, P2752 demonstrates a quadratic dependency on word count:

It means that one could safely double the byte limit for all the Russian projects and get the same performance as for English-language projects, since Cyrillic codepoints are encoded as 2 bytes in UTF-8, so 10000 bytes currently holds roughly half as many words as in English text... (Plus, an average Russian word has slightly more letters than an average English word...) Correct?

Okay, so the first step would be to change the criteria for the per-paragraph limit. Lots of diffs are just a couple of words changed in the middle of a large paragraph, and our diff engine correctly handles such changes by skipping the parts that weren't changed. So the first step, quite easy and uncontroversial, would be to estimate complexity based on the number of potentially affected words (computable in linear time) instead of the overall paragraph length. I'd like to take care of T128896: Decide what to do with Wikidiff3 first, however.
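The idea of limiting on potentially affected words rather than total paragraph length can be sketched as follows (hypothetical names and threshold; not the actual wikidiff2 implementation). The common word prefix and suffix are stripped in linear time, and only the remainder counts against the limit:

```python
def affected_words(old_words, new_words):
    """Strip the common prefix and suffix (linear time) and return the
    middle portions that an O(n^2) word-level diff would actually see."""
    pre = 0
    while (pre < len(old_words) and pre < len(new_words)
           and old_words[pre] == new_words[pre]):
        pre += 1
    suf = 0
    while (suf < len(old_words) - pre and suf < len(new_words) - pre
           and old_words[-1 - suf] == new_words[-1 - suf]):
        suf += 1
    return (old_words[pre:len(old_words) - suf],
            new_words[pre:len(new_words) - suf])

def within_smart_limit(old, new, max_affected=1000):
    # Hypothetical check: apply the limit to the changed region only,
    # so a two-word edit in a huge paragraph still gets a word-level diff.
    o, n = affected_words(old.split(), new.split())
    return max(len(o), len(n)) <= max_affected
```

With this check, a small edit in the middle of a very long Wikisource paragraph no longer trips the limit, which is exactly the case the patrollers above are complaining about.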

Out of curiosity: given that diffs are cached after generation and rarely change, how big a deal is this added millisecond delay in practice?

How was the existing upper value defined?

Milliseconds? How about minutes? We already have diffs that take that long. It's more of a user experience issue than server load, though. Instead of bumping a limit, we need to improve things algorithmically.

Jay8g added a subscriber: Jay8g.Apr 9 2016, 6:41 AM

Another fresh example: how can we verify this change with the current diff tool, which marks the whole paragraph as changed? See https://ru.wikisource.org/w/index.php?title=%D0%9C%D1%91%D1%80%D1%82%D0%B2%D1%8B%D0%B5_%D0%B4%D1%83%D1%88%D0%B8_(%D0%93%D0%BE%D0%B3%D0%BE%D0%BB%D1%8C)/%D0%A2%D0%BE%D0%BC_I/%D0%93%D0%BB%D0%B0%D0%B2%D0%B0_X&curid=30427&diff=2259376&oldid=607212

Only by copying and pasting both versions into an external diff tool... This sucks, completely sucks...

Krinkle removed a subscriber: Krinkle.May 16 2016, 1:40 PM

Change 285567 had a related patch set uploaded (by MaxSem):
Rethink diff limits

https://gerrit.wikimedia.org/r/285567

The above change implements a better approach in PHP. An example of the diff from T121469 that's currently unreadable:


For this to work in WMF production, however, these changes need to be ported into wikidiff2 which I'm going to do now.

Change 285567 merged by jenkins-bot:
Rethink diff limits

https://gerrit.wikimedia.org/r/285567

For this to work in WMF production, however, these changes need to be ported into wikidiff2 which I'm going to do now.

I assume this is still on the to-do list.

MaxSem renamed this task from Investigate increasing limit for word-level diffs to Smart limit for word-level diffs.Jun 29 2016, 5:48 AM

Change 296503 had a related patch set uploaded (by MaxSem):
Redo paragraph-level diffs

https://gerrit.wikimedia.org/r/296503

Change 296503 merged by jenkins-bot:
Redo paragraph-level diffs

https://gerrit.wikimedia.org/r/296503

OK, this should now be done in production — thanks, Operations and Max. Worth a mention in Tech/news?

Jdforrester-WMF closed this task as Resolved.Aug 31 2016, 7:29 PM
DannyH moved this task from Backlog to Archive on the Community-Tech board.Sep 13 2016, 11:42 PM