
Comparing revisions can fatal (from wikidiff2 via TextSlotDiffRenderer or ApiComparePages)
Open, MediumPublic5 Estimated Story PointsPRODUCTION ERROR

Event Timeline

You can find the logs by going to https://logstash.wikimedia.org, selecting "Dev Tools" and entering the query

GET _search
{
  "query": {
    "query_string": {
      "fields": ["_type", "channel", "@timestamp", "fatal_exception.trace"],
      "query": "_type:mediawiki AND channel:fatal AND fatal_exception.trace:TextSlotDiffRenderer"
    }
  }
}

I don't use the dashboard interface of logstash anymore because it freezes my browser for tens of seconds (T189333)

Seeing quite a bit of this error in logstash:

Fatal error: entire web request took longer than 60 seconds and timed out in /srv/mediawiki/php-1.33.0-wmf.24/includes/diff/TextSlotDiffRenderer.php on line 219

Specifically, 40 occurrences in the past 15 minutes.

Krinkle renamed this task from Timeouts in wikidiff2 to Comparing revisions can fatal (timeout from wikidiff2 via TextSlotDiffRenderer).Apr 11 2019, 1:04 AM

Would it be possible (in a relatively easy, not-too-crazy-investment kind of way) to artificially limit the time spent in wikidiff2's main logic?

This would effectively be similar to what we do with ElasticSearch timeouts (which may be easier to cap, given it's over the network). The main point is that it would avoid exposing uncached URLs that produce HTTP 500 responses.

And it could still render the metadata of both revisions (timestamp, author, edit summary, etc.) and the complete wikitext of both revisions side-by-side (without differences highlighted), perhaps with a notice indicating that the diff could not be generated.

I believe the Visual Diff beta feature also has a timeout, after which a more basic version is used as a fallback. This would mean that the comparison API and HTML table format would still be rendered, just in a duller manner.

Possibly unrelated, but curious: I also get a timeout for the page in the first test case (the page itself, not the diff): https://de.wikipedia.org/w/index.php?title=Portal:Biologie/Fehlende_Artikel/Missing_Topics

PHP fatal error:
entire web request took longer than 60 seconds and timed out

Change 506148 had a related patch set uploaded (by Jkroll; owner: Jkroll):
[mediawiki/php/wikidiff2@master] [WiP] Correct AllowPrintMovedLineDiff to count lines, not edits

https://gerrit.wikimedia.org/r/506148

thiemowmde triaged this task as Medium priority.Apr 30 2019, 1:32 PM

My extremely unscientific sampling of diffs, using this patch (P8510), found 244 total chunks in benchmarks/revs.json.gz, with:

php benchmarks/bench.php | grep chunk | awk -e '{adds += $3} {dels += $6} END {print adds; print dels}'

505 added lines and 493 deleted lines, or (505 + 493) / 244 ≈ 4 lines per chunk. This is an estimate of the initial factor we should use to scale the chunk threshold when converting it to a line-count threshold.

Based on the factor above, I would suggest this patch:

--- a/DiffEngine.h
+++ b/DiffEngine.h
@@ -33,7 +33,7 @@
 // Default value for INI setting wikidiff2.moved_line_threshold
 #define WIKIDIFF2_MOVED_LINE_THRESHOLD_DEFAULT "0.4"
 // Default value for INI setting wikidiff2.moved_paragraph_detection_cutoff
-#define WIKIDIFF2_MOVED_PARAGRAPH_DETECTION_CUTOFF_DEFAULT "30"
+#define WIKIDIFF2_MOVED_PARAGRAPH_DETECTION_CUTOFF_DEFAULT "120"

 #ifdef DEBUG_MOVED_LINES
 inline void debugLog(const char *fmt, ...) {

Change 506148 merged by jenkins-bot:
[mediawiki/php/wikidiff2@master] Correct AllowPrintMovedLineDiff to count lines, not edits

https://gerrit.wikimedia.org/r/506148

You might want to investigate the algorithm in DiffEngine.php, formerly known as Wikidiff3. While wikidiff2 was based on the previous MediaWiki diff implementation, Wikidiff3 has drastically improved handling of worst-case scenarios at the expense of slightly decreased diff quality (in theory; in practice, in my limited user testing, people liked Wikidiff3 more).

mmodell changed the subtype of this task from "Task" to "Production Error".Aug 28 2019, 11:09 PM

Looking at TextSlotDiffRenderer.php(214)[1] these timeouts seem to come from wikidiff2. I just wonder why this is happening so often since 15th January... Some diff cache invalidation?

[1] https://gerrit.wikimedia.org/g/mediawiki/core/+/d1a7e4043f47d291eff516081916d0431e8ecb02/includes/diff/TextSlotDiffRenderer.php#211

I see. Yeah, I suppose that's hard to handle indeed. If the criterion we use to decide whether to "support" a diff is time spent, then maybe this could be incorporated into the diffing code itself. E.g. between its smaller internal steps, it could throw a diff-specific (not WMF-specific) error indicating that the budget was exceeded.
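A minimal sketch of what such a budget check might look like inside the diff engine; the names (DiffBudget, DiffBudgetExceeded) are hypothetical, not part of wikidiff2's actual code:

```cpp
#include <chrono>
#include <stdexcept>

// Hypothetical diff-specific (not WMF-specific) error the engine could
// throw when its time budget runs out; the caller (e.g.
// TextSlotDiffRenderer) would catch it and fall back to rendering the
// two revisions without a diff.
struct DiffBudgetExceeded : std::runtime_error {
    DiffBudgetExceeded() : std::runtime_error("diff time budget exceeded") {}
};

class DiffBudget {
    std::chrono::steady_clock::time_point deadline_;
public:
    explicit DiffBudget(std::chrono::milliseconds budget)
        : deadline_(std::chrono::steady_clock::now() + budget) {}

    // Cheap check to call between the algorithm's smaller internal steps.
    void check() const {
        if (std::chrono::steady_clock::now() > deadline_) {
            throw DiffBudgetExceeded();
        }
    }
};
```

The caller would wrap the whole diff operation in a try/catch and substitute a "diff unavailable" rendering when the budget error is caught, instead of letting the request hit the 60-second fatal.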

Alternatively, if we don't think there are other ways diffing can fail, and/or if those ways do not need to be reported at fatal level (e.g. if we're certain they won't be caused by MW deployments), then we could put a catch-all around them that informs the user the diff is unavailable.

@Anomie What direction would you prefer to take this in? Is it plausible for wikidiff2 to always produce a diff within a certain timeframe (within the limits of the text we allow to be fed to the diff engine), and/or are there other ways of falling back that you think would be useful?

I don't know much about wikidiff2, but I suspect it's similar to a parse in that there are probably complex cases that will always take a long time. I also don't know whether or not it has sufficient internal "steps" where a time check could be inserted.

Krinkle renamed this task from Comparing revisions can fatal (timeout from wikidiff2 via TextSlotDiffRenderer) to Comparing revisions can fatal (from wikidiff2 via TextSlotDiffRenderer or ApiComparePages).Mar 31 2021, 11:01 PM
Krinkle added subscribers: Umherirrender, Clarakosi, Krenair.