
WMFTimeoutException when loading some diffs on Wikidata [Timebox 8h]
Closed, Resolved · Public

Authored By
abian
Jul 20 2016, 10:33 AM

Description

Certain diff URLs in Wikidata production consistently time out and cause logspam.

At least one of the 2 links I tested:

These pages are generally giant and also include some amount of Lua access.
It could be that the timeout is caused by displaying the page content under the diff.

Acceptance Criteria: 🏕️🌟(September 2021)

  • Determine what is causing the timeouts on the pages linked above
  • Investigate ways to mitigate the problem (and implement if simple)

Original Description

Yesterday, when I visited the following URL...

https://www.wikidata.org/w/index.php?title=Wikidata:Database_reports/Constraint_violations/P570&curid=15087958&diff=358447430&oldid=358294930

... a 503 error was raised. I reported this issue via IRC, on #wikimedia-operations. For the rest of the day, I successfully accessed the same URL, and no errors were raised.

[19:39:10] <abian>	 I've got a 503 error on Wikidata
[19:39:14] <abian>	 Is this fixed?
[19:39:49] <jynus>	 abian, url?
[19:39:54] <abian>	 Okay, seems so
[19:40:08] <jynus>	 which url gave you the error?
[19:40:16] <abian>	 https://www.wikidata.org/w/index.php?title=Wikidata:Database_reports/Constraint_violations/P570&curid=15087958&diff=358294930&oldid=358041578
[19:41:21] <jynus>	 (I am getting a bit lost on the new kibanad interface :-P)
[19:44:05] <jynus>	 I see the error now: Memcached error for key "{memcached-key}" on server "{memcached-server}": SERVER ERROR
[19:46:10] <jynus>	 abian, you seem to have been lucky by getting a rare error code
[19:46:36] <jynus>	 I will monitor memcache in case it happens again, but it doesn't seem very frequent
[19:47:34] * abian throws confetti
[19:47:42] <urandom>	 !log Lowering compaction throughput from 60MB/s to 45MB/s on restbase1013-c.eqiad.wmnet : T134016
[19:47:47] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:48:01] <stashbot>	 T134016: RESTBase Cassandra cluster: Increase instance count to 3 - https://phabricator.wikimedia.org/T134016
[19:48:43] <jynus>	 it could be an actual software problem "item too big" to fit into a memcache value?
[19:49:01] <jynus>	 the page you mentioned was certainly quite large
[19:49:06] <abian>	 I still have the HTML content of the error page, but I think that won't be useful for you
[19:49:13] <jynus>	 although the error was not repatable
[19:49:25] <jynus>	 no need, I can see the errors on the log
[19:49:55] <jynus>	 if it doesn't happen again, just do not worry
[19:50:03] <jynus>	 I will keep monitoring it, just in case
[19:51:56] <abian>	 Great, thanks :)
[19:52:06] <jynus>	 thanks to you for reporting it!

Today, I've just gotten the same 503 error (the HTML content can be downloaded via F4290881) while visiting the same URL. More users have also reported experiencing this issue.

[10:04:05] <abian>	 Another 503 error on the same URL
[10:04:15] <abian>	 https://www.wikidata.org/w/index.php?title=Wikidata:Database_reports/Constraint_violations/P570&curid=15087958&diff=358447430&oldid=358294930
[10:04:35] <jynus>	 abian, at this point you should file a bug on phabricator
[10:04:51] <jynus>	 indicate the url and when those happened
[10:05:10] <jynus>	 and any other information that could be useful
[10:05:30] <abian>	 I cannot figure out exactly when, really
[10:05:43] <abian>	 Sometimes it happens, and sometimes it doesn't
[10:06:09] <jynus>	 put the most acurate time and that that you can say when they happened
[10:06:12] <hashar>	 do fill a bug and it will be triaged / figured out eventually :-]
[10:06:21] <icinga-wm>	 PROBLEM - puppet last run on mc2006 is CRITICAL: CRITICAL: puppet fail
[10:07:04] <hashar>	 abian: there must be an error on the servers, so if we get a task one will be able to look at the log and see what is happening there
[10:13:05] <_joe_>	 that page renders correctly for me atm
[10:13:28] <hashar>	 got me a 503
[10:13:36] <gehel>	 503 as well
[10:13:50] <jynus>	 I saw some memcache errors yesterday
[10:16:02] <Dereckson>	 on esams I've a 200

Once the 503 error has been raised, I can reload the page and read it as expected.

Potentially related: T221380, T221729

Event Timeline

jcrespo added a subscriber: jcrespo.
{
  "_index": "logstash-2016.07.20",
  "_type": "mediawiki",
  "_id": "AVYHyECFhDxk9Z8yut3T",
  "_score": null,
  "_source": {
    "message": "Memcached error for key \"wikidatawiki:pcache:idhash:15087958-0!*!0!!en!*!*\" on server \"/var/run/nutcracker/nutcracker.sock:0\": SERVER ERROR",
    "@version": 1,
    "@timestamp": "2016-07-20T10:09:38.000Z",
    "type": "mediawiki",
    "host": "mw1186",
    "level": "ERROR",
    "tags": [
      "syslog",
      "es",
      "es"
    ],
    "channel": "memcached",
    "normalized_message": "Memcached error for key \"{memcached-key}\" on server \"{memcached-server}\": SERVER ERROR",
    "url": "/w/index.php?title=Wikidata:Database_reports/Constraint_violations/P570&curid=15087958&diff=358447430&oldid=358294930",
    "ip": "10.64.32.105",
    "http_method": "GET",
    "server": "www.wikidata.org",
    "referrer": null,
    "unique_id": "V49MywpAIDgAAAVT6h0AAAAE",
    "wiki": "wikidatawiki",
    "mwversion": "1.28.0-wmf.10",
    "reqId": "V49MywpAIDgAAAVT6h0AAAAE",
    "memcached-server": "/var/run/nutcracker/nutcracker.sock:0",
    "memcached-key": "wikidatawiki:pcache:idhash:15087958-0!*!0!!en!*!*"
  },
  "fields": {
    "@timestamp": [
      1469009378000
    ]
  },
  "highlight": {
    "message": [
      "@kibana-highlighted-field@Memcached@/kibana-highlighted-field@ error for key \"wikidatawiki:pcache:idhash:15087958-0!*!0!!en!*!*\" on server \"/var/run/nutcracker/nutcracker.sock:0\": SERVER ERROR"
    ],
    "channel.raw": [
      "@kibana-highlighted-field@memcached@/kibana-highlighted-field@"
    ],
    "type": [
      "@kibana-highlighted-field@mediawiki@/kibana-highlighted-field@"
    ],
    "channel": [
      "@kibana-highlighted-field@memcached@/kibana-highlighted-field@"
    ],
    "normalized_message": [
      "@kibana-highlighted-field@Memcached@/kibana-highlighted-field@ error for key \"{@kibana-highlighted-field@memcached@/kibana-highlighted-field@-key}\" on server \"{@kibana-highlighted-field@memcached@/kibana-highlighted-field@-server}\": SERVER ERROR"
    ]
  },
  "sort": [
    1469009378000
  ]
}

The error happens on memcached, but it could just be that the request is taking so long because the diff/revisions are quite large. There seems to be an affinity between specific keys and the errors:

https://logstash.wikimedia.org/goto/71124720067e576a79a8fae4953f31d1
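As a side note on the "item too big" theory from the IRC log: memcached rejects values above its configured item size limit (1 MB by default), and the client then sees a server-side error. A minimal sketch of that failure mode with the PHP Memcached extension (server address and key are placeholders; this is not what MediaWiki's cache layer actually does):

<?php
// Minimal sketch of the "item too big" theory – placeholders only.
$mc = new Memcached();
$mc->addServer( '127.0.0.1', 11211 );
$mc->setOption( Memcached::OPT_COMPRESSION, false ); // don't let compression hide the size

$hugeValue = str_repeat( 'x', 2 * 1024 * 1024 ); // 2 MB, above memcached's default 1 MB limit

if ( !$mc->set( 'wikidatawiki:pcache:idhash:example', $hugeValue ) ) {
	// With a default server configuration this set() fails, and the client
	// reports a server-side error much like the SERVER ERROR in the log entry.
	echo $mc->getResultCode(), ': ', $mc->getResultMessage(), "\n";
}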

elukey triaged this task as High priority. Jul 20 2016, 2:59 PM

I've tested visiting today's URL without a curid parameter...

https://www.wikidata.org/w/index.php?title=Wikidata:Database_reports/Constraint_violations/P570&diff=358881096&oldid=358694531

The same error was raised, so this issue is not related to curid in any way.

For reference, this will now result in a slightly different error due to request limits.

PHP fatal error: 
entire web request took longer than 60 seconds and timed out
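
For the curious: that 60-second figure is a hard wall-clock limit on the whole web request. A rough sketch of how such a limit can be enforced (assuming the Excimer PHP extension's one-shot real-time timer; the actual production configuration lives in wmf-config and differs in the details):

<?php
// Rough sketch only – not the actual wmf-config code. The 60-second figure
// comes from the error message above; everything else here is an assumption.

class WMFTimeoutException extends RuntimeException {
}

$limit = 60; // seconds of real (wall-clock) time for the entire web request

$timer = new ExcimerTimer();
$timer->setEventType( EXCIMER_REAL ); // measure wall-clock time, not CPU time
$timer->setInterval( $limit );        // one-shot: fire once, $limit seconds from now
$timer->setCallback( static function () use ( $limit ) {
	throw new WMFTimeoutException(
		"entire web request took longer than $limit seconds and timed out"
	);
} );
$timer->start();

// ... the rest of the request (parsing, diffing, label lookups, ...) runs here ...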
Addshore renamed this task from 503 error raises again while trying to load a Wikidata page to Request timeout while loading Wikidata:Database_reports/Constraint_violations/P570&curid=15087958&diff=358447430&oldid=358294930. Oct 31 2018, 4:33 PM
abian renamed this task from Request timeout while loading Wikidata:Database_reports/Constraint_violations/P570&curid=15087958&diff=358447430&oldid=358294930 to Request timeout when loading diffs on Wikidata. Apr 24 2019, 10:02 AM
abian updated the task description.

Now we see a lovely:

[XQ6rOQpAMEsAAH3jMpAAAABQ] 2019-06-22 22:28:05: Fatal exception of type "WMFTimeoutException"
Addshore renamed this task from Request timeout when loading diffs on Wikidata to WMFTimeoutException when loading some diffs on Wikidata. Jun 22 2019, 10:28 PM

Without knowing much about how diffs are generated for Wikidata, I just wanted to mention the current issues with wikidiff2 leading to timeouts on big diffs, T204010: Comparing revisions can fatal (from wikidiff2 via TextSlotDiffRenderer or ApiComparePages). The problem is fixed in a newer version that will be deployed eventually (WIP).

Marostegui added a subscriber: Marostegui.

Is this still happening?

Is this still happening?

It would appear so, yes!
At least one of the 2 links I tested:

Perhaps we should investigate this a bit!

This could just be that this is a huge page being diffed and that we can't do much about it.
There are some tickets in the backlog relating to quality constraints and the future of these reports in general that might end up making these pages unnecessary.

Addshore renamed this task from WMFTimeoutException when loading some diffs on Wikidata to WMFTimeoutException when loading some diffs on Wikidata [Timebox 8h]. Sep 22 2021, 10:55 AM

Here’s the parser output comment for the /P570 page (current revision):

<!-- 
NewPP limit report
Parsed by mw1417
Cached time: 20210921103518
Cache expiry: 1814400
Reduced expiry: false
Complications: []
CPU time usage: 10.940 seconds
Real time usage: 13.201 seconds
Preprocessor visited node count: 36810/1000000
Post‐expand include size: 112150/2097152 bytes
Template argument size: 3542/2097152 bytes
Highest expansion depth: 9/40
Expensive parser function count: 5/500
Unstrip recursion depth: 0/20
Unstrip post‐expand size: 0/5000000 bytes
Lua time usage: 0.193/10.000 seconds
Lua memory usage: 3430383/52428800 bytes
Number of Wikibase entities loaded: 1/500
-->
<!--
Transclusion expansion time report (%,ms,calls,template)
100.00%  602.256      1 -total
 59.63%  359.107    113 Template:P
 52.19%  314.337    120 Template:Label
 26.38%  158.846      1 Template:Constraint_violations_report
  7.80%   46.964     17 Template:LangSwitch
  3.18%   19.155      1 Template:ISOdate
  3.15%   18.948      1 Template:Autodescription
  2.96%   17.854      6 Template:Q
  2.80%   16.868      2 Template:Ll
  2.37%   14.294    113 Template:Str_left
-->

Apparently this one only took 13 seconds of real time (far from the 60-second limit), but most of that time isn’t spent in Lua.

But it doesn’t look like most of the time is spent in diffing either: I built some &diffonly=yes URLs from random recent-ish revisions of the page (example) and they all render almost instantly.
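
(For reference, a rough way to repeat that comparison by hand – a sketch only, using the revision IDs from the task description; diffonly=yes tells MediaWiki to skip rendering the page content below the diff, so the first request is the one expected to time out:)

<?php
// Timing sketch, to be run outside production; not part of any MediaWiki code.
function timeGet( string $url ): float {
	$start = microtime( true );
	@file_get_contents( $url ); // a 503 after ~60 s still ends the request
	return microtime( true ) - $start;
}

$base = 'https://www.wikidata.org/w/index.php'
	. '?title=Wikidata:Database_reports/Constraint_violations/P570'
	. '&diff=358447430&oldid=358294930';

printf( "full diff view: %.1f s\n", timeGet( $base ) );
printf( "diffonly=yes:   %.1f s\n", timeGet( $base . '&diffonly=yes' ) );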

I looked at this a bit in XHGui. I’ll explain the requests I made chronologically, hopefully that’ll make sense.

First up is this XHGui run for a diff timeout (for the first URL in the task description). There are two functions taking up the majority of the wall time: Wikimedia\Rdbms\DatabaseMysqli::doQuery, and Preprocessor_Hash::buildDomTreeArrayFromText. doQuery() is called 5502 times, which feels a bit excessive.

Screenshot 2021-09-30 at 15-20-34 XHGui - Profile - 6826fe25-6fdb-4df6-ac06-cd54c59578ff w index php.png (700×676 px, 49 KB)

I then got this XHGui run for a non-diff non-timeout, for the latest version of the same page, with action=purge to make sure it did something. The same two functions are still on top of the wall time “leaderboard”, but with the Wikitext→DOM parsing well ahead of the database query; doQuery() also only has 309 calls this time, an order of magnitude less.

Screenshot 2021-09-30 at 15-23-44 XHGui - Profile - 313bd5cb-0c3b-45de-8df1-b07f0f666a01 wiki Wikidata Database_reports Con[...].png (700×676 px, 46 KB)

I then realized that comparing the latest version of the page to that old diff link wasn’t entirely representative, since the size of the page could’ve changed a lot. So I took the original URL again and removed the diff parameters while keeping the &oldid=. The result is this XHGui for a non-diff timeout (URL), where doQuery() suddenly gets called a lot less, only 253 times.

Screenshot 2021-09-30 at 15-30-09 XHGui - Profile - f06dafba-f0d7-4196-9556-a16513243495 w index php.png (700×676 px, 42 KB)

Finally, I wondered if the decreasing doQuery() counts were just because some results (e.g. item labels) were loaded from the database in the first request but then gotten from a cache afterwards, so I loaded the original diff URL again. This gave me another diff timeout XHGui – with, again, 5264 doQuery() calls taking a lot of the time.

Screenshot 2021-09-30 at 15-38-30 XHGui - Profile - 898c7663-7125-40b7-9726-aa984451cab4 w index php.png (700×676 px, 50 KB)

I think there are two different issues here:

  1. Rendering some revisions of these huge pages just takes a long time – as evidenced by the third request, not a diff, spending a majority of its time in Preprocessor_Hash before eventually timing out.
  2. Rendering some diffs of these pages not only takes a long time, but also makes an almost exorbitant number of database queries.

Rendering some diffs of these pages not only takes a long time, but also makes an almost exorbitant number of database queries.

And yet –

But it doesn’t look like most of the time is spent in diffing either: I built some &diffonly=yes URLs from random recent-ish revisions of the page (example) and they all render almost instantly.

And then I suddenly had an epiphany. You know this Wikibase feature where we show labels for entity links in edit summaries? As in, the edit summary just has a plain wikilink like [[Q463035]] or [[Property:P735]], but we show it like this:

given name (P735): Douglas (Q463035)

Screenshot 2021-09-30 at 15-45-26 Revision history of Douglas Adams (Q42) - Wikidata.png (263×1 px, 135 KB)

Screenshot 2021-09-30 at 15-45-51 Difference between revisions of Douglas Adams (Q42) - Wikidata.png (121×835 px, 21 KB)

I happen to remember how this feature is implemented (and you can look it up in the HtmlPageLinkRendererEndHookHandler class).

The feature isn’t: “add labels to entity links in edit summaries.”

The feature is: on diff and history pages, add labels to all links that don’t already have custom link text.

So what’s happening here is that we have a page which doesn’t use the {{Q}} template to render links to items, because that would make the page impossible to load – the page just has plain links to the thousands of items it mentions. But when we load the page content in a diff URL – and only then! – Wikibase tries to add labels to all those links, because of the way that the “entity labels in edit summaries” feature is implemented (the hook handler apparently doesn’t know whether the link comes from an edit summary or not, so it instead checks if the current request URL is one where you would expect to see an edit summary).
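
To illustrate the pattern (a much-simplified sketch, not the actual Wikibase HtmlPageLinkRendererEndHookHandler code; hasCustomLinkText() and lookupLabel() are hypothetical placeholders):

<?php
// Much-simplified sketch of an HtmlPageLinkRendererEnd handler – not the real code.
function onHtmlPageLinkRendererEnd( $linkRenderer, $target, $isKnown, &$text, &$extraAttribs, &$ret ) {
	$request = RequestContext::getMain()->getRequest();

	// The handler can't tell whether this particular link comes from an edit
	// summary; it only checks whether the current request looks like a page
	// that shows edit summaries: a diff or a history view.
	$looksLikeDiffOrHistory =
		$request->getCheck( 'diff' ) || $request->getVal( 'action' ) === 'history';

	if ( !$looksLikeDiffOrHistory || hasCustomLinkText( $text, $target ) ) {
		return true; // leave the link alone
	}

	// One lookup per link, at the moment the link is rendered – so a page body
	// full of plain [[Q…]] links means a lookup for every single one of them.
	$label = lookupLabel( $target ); // hypothetical helper
	if ( $label !== null ) {
		$text = new HtmlArmor( htmlspecialchars( $label ) );
	}
	return true;
}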

This also explains the thousands of database queries. This hook handler sees one link at a time – it has no chance to preload or batch any labels, so it makes one query at a time. (Or so I assume.)
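
Roughly, the difference between what a per-link hook can do and what a batched lookup could do (hypothetical helpers, for illustration only – not real Wikibase interfaces):

<?php
// What the hook handler effectively does: one lookup per entity link, at the
// moment each link happens to be rendered.
foreach ( $entityIdsOnPage as $id ) {
	$labels[$id] = lookupLabel( $id );             // one query ...
	$descriptions[$id] = lookupDescription( $id ); // ... and a second one per entity
}

// What a batched approach could do instead: collect all the IDs up front and
// fetch every label and description in one round trip (a single
// "WHERE id IN (...)"-style query), then serve the links from that buffer.
$terms = prefetchTerms( $entityIdsOnPage, [ 'label', 'description' ], [ 'en' ] );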

And if you sort that first XHGui report by inclusive wall time (i.e. how much time was spent in a function, including all the other functions it called), the HtmlPageLinkRendererEndHookHandler methods indeed take up some 35 seconds of wall time, at 2708 calls. (Which turns into 5347 calls to the “term lookup” layer, because the hook handler gets the label and description for each item, so the number of calls is doubled.)

Lucas_Werkmeister_WMDE claimed this task.

T292203 is fixed, and all the URLs mentioned in this task are working for me now. I think that’s good enough to close this.