Page MenuHomePhabricator

WMFTimeoutException when loading some diffs on Wikidata
Open, HighPublic

Description

Yesterday, when I visited the following URL...

https://www.wikidata.org/w/index.php?title=Wikidata:Database_reports/Constraint_violations/P570&curid=15087958&diff=358447430&oldid=358294930

... a 503 error raised. I reported this issue via IRC, on #wikimedia-operations. The rest of the day, I successfully accessed via the same URL, and no errors raised.

[19:39:10] <abian>	 I've got a 503 error on Wikidata
[19:39:14] <abian>	 Is this fixed?
[19:39:49] <jynus>	 abian, url?
[19:39:54] <abian>	 Okay, seems so
[19:40:08] <jynus>	 which url gave you the error?
[19:40:16] <abian>	 https://www.wikidata.org/w/index.php?title=Wikidata:Database_reports/Constraint_violations/P570&curid=15087958&diff=358294930&oldid=358041578
[19:41:21] <jynus>	 (I am getting a bit lost on the new kibanad interface :-P)
[19:44:05] <jynus>	 I see the error now: Memcached error for key "{memcached-key}" on server "{memcached-server}": SERVER ERROR
[19:46:10] <jynus>	 abian, you seem to have been lucky by getting a rare error code
[19:46:36] <jynus>	 I will monitor memcache in case it happens again, but it doesn't seem very frequent
[19:47:34] * abian throws confetti
[19:47:42] <urandom>	 !log Lowering compaction throughput from 60MB/s to 45MB/s on restbase1013-c.eqiad.wmnet : T134016
[19:47:47] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:48:01] <stashbot>	 T134016: RESTBase Cassandra cluster: Increase instance count to 3 - https://phabricator.wikimedia.org/T134016
[19:48:43] <jynus>	 it could be an actual software problem "item too big" to fit into a memcache value?
[19:49:01] <jynus>	 the page you mentioned was certainly quite large
[19:49:06] <abian>	 I still have the HTML content of the error page, but I think that won't be useful for you
[19:49:13] <jynus>	 although the error was not repatable
[19:49:25] <jynus>	 no need, I can see the errors on the log
[19:49:55] <jynus>	 if it doesn't happen again, just do not worry
[19:50:03] <jynus>	 I will keep monitoring it, just in case
[19:51:56] <abian>	 Great, thanks :)
[19:52:06] <jynus>	 thanks to you for reporting it!

Today, I've just got the same 503 error (HTML content can be downloaded via F4290881) while visiting the same URL. More users have also reported being experiencing this issue.

[10:04:05] <abian>	 Another 503 error on the same URL
[10:04:15] <abian>	 https://www.wikidata.org/w/index.php?title=Wikidata:Database_reports/Constraint_violations/P570&curid=15087958&diff=358447430&oldid=358294930
[10:04:35] <jynus>	 abian, at this point you should file a bug on phabricator
[10:04:51] <jynus>	 indicate the url and when those happened
[10:05:10] <jynus>	 and any other information that could be useful
[10:05:30] <abian>	 I cannot figure out exactly when, really
[10:05:43] <abian>	 Sometimes it happens, and sometimes it doesn't
[10:06:09] <jynus>	 put the most acurate time and that that you can say when they happened
[10:06:12] <hashar>	 do fill a bug and it will be triaged / figured out eventually :-]
[10:06:21] <icinga-wm>	 PROBLEM - puppet last run on mc2006 is CRITICAL: CRITICAL: puppet fail
[10:07:04] <hashar>	 abian: there must be an error on the servers, so if we get a task one will be able to look at the log and see what is happening there
[10:13:05] <_joe_>	 that page renders correctly for me atm
[10:13:28] <hashar>	 got me a 503
[10:13:36] <gehel>	 503 as well
[10:13:50] <jynus>	 I saw some memcache errors yesterday
[10:16:02] <Dereckson>	 on esams I've a 200

Once the 503 error has raised, I can reload the page and read it as expected.

Potentially related: T221380, T221729

Event Timeline

abian created this task.Jul 20 2016, 10:33 AM
Restricted Application added subscribers: Zppix, Aklapper. · View Herald TranscriptJul 20 2016, 10:33 AM
jcrespo added a subscriber: jcrespo.
{
  "_index": "logstash-2016.07.20",
  "_type": "mediawiki",
  "_id": "AVYHyECFhDxk9Z8yut3T",
  "_score": null,
  "_source": {
    "message": "Memcached error for key \"wikidatawiki:pcache:idhash:15087958-0!*!0!!en!*!*\" on server \"/var/run/nutcracker/nutcracker.sock:0\": SERVER ERROR",
    "@version": 1,
    "@timestamp": "2016-07-20T10:09:38.000Z",
    "type": "mediawiki",
    "host": "mw1186",
    "level": "ERROR",
    "tags": [
      "syslog",
      "es",
      "es"
    ],
    "channel": "memcached",
    "normalized_message": "Memcached error for key \"{memcached-key}\" on server \"{memcached-server}\": SERVER ERROR",
    "url": "/w/index.php?title=Wikidata:Database_reports/Constraint_violations/P570&curid=15087958&diff=358447430&oldid=358294930",
    "ip": "10.64.32.105",
    "http_method": "GET",
    "server": "www.wikidata.org",
    "referrer": null,
    "unique_id": "V49MywpAIDgAAAVT6h0AAAAE",
    "wiki": "wikidatawiki",
    "mwversion": "1.28.0-wmf.10",
    "reqId": "V49MywpAIDgAAAVT6h0AAAAE",
    "memcached-server": "/var/run/nutcracker/nutcracker.sock:0",
    "memcached-key": "wikidatawiki:pcache:idhash:15087958-0!*!0!!en!*!*"
  },
  "fields": {
    "@timestamp": [
      1469009378000
    ]
  },
  "highlight": {
    "message": [
      "@kibana-highlighted-field@Memcached@/kibana-highlighted-field@ error for key \"wikidatawiki:pcache:idhash:15087958-0!*!0!!en!*!*\" on server \"/var/run/nutcracker/nutcracker.sock:0\": SERVER ERROR"
    ],
    "channel.raw": [
      "@kibana-highlighted-field@memcached@/kibana-highlighted-field@"
    ],
    "type": [
      "@kibana-highlighted-field@mediawiki@/kibana-highlighted-field@"
    ],
    "channel": [
      "@kibana-highlighted-field@memcached@/kibana-highlighted-field@"
    ],
    "normalized_message": [
      "@kibana-highlighted-field@Memcached@/kibana-highlighted-field@ error for key \"{@kibana-highlighted-field@memcached@/kibana-highlighted-field@-key}\" on server \"{@kibana-highlighted-field@memcached@/kibana-highlighted-field@-server}\": SERVER ERROR"
    ]
  },
  "sort": [
    1469009378000
  ]
}

The error happens on memcache, but it could be just the request taking so long because the size of the diff/revisions is quite large. There seems to be an afinity between specific keys and the errors:

https://logstash.wikimedia.org/goto/71124720067e576a79a8fae4953f31d1

elukey triaged this task as High priority.Jul 20 2016, 2:59 PM
abian added a comment.Jul 21 2016, 9:43 AM

Today, this 503 error raises again with the corresponding URL (different diff and different oldid, but the same page)...

https://www.wikidata.org/w/index.php?title=Wikidata:Database_reports/Constraint_violations/P570&curid=15087958&diff=358694531&oldid=358447430

abian added a comment.Jul 22 2016, 9:57 AM

I've tested visiting today's URL without a curid parameter...

https://www.wikidata.org/w/index.php?title=Wikidata:Database_reports/Constraint_violations/P570&diff=358881096&oldid=358694531

The same error has raised, so this issue is not related to curid in any way.

Krinkle edited projects, added Wikidata; removed MediaWiki-General.Jun 23 2017, 1:11 AM

For reference this will now result in a slightly different error due to request limits.

PHP fatal error: 
entire web request took longer than 60 seconds and timed out
Addshore renamed this task from 503 error raises again while trying to load a Wikidata page to Request timeout while loading Wikidata:Database_reports/Constraint_violations/P570&curid=15087958&diff=358447430&oldid=358294930.Oct 31 2018, 4:33 PM
Addshore added a project: Performance.
abian renamed this task from Request timeout while loading Wikidata:Database_reports/Constraint_violations/P570&curid=15087958&diff=358447430&oldid=358294930 to Request timeout when loading diffs on Wikidata.Apr 24 2019, 10:02 AM
abian updated the task description. (Show Details)

Now we see a lovely.

[XQ6rOQpAMEsAAH3jMpAAAABQ] 2019-06-22 22:28:05: Fatal exception of type "WMFTimeoutException"
Addshore renamed this task from Request timeout when loading diffs on Wikidata to WMFTimeoutException when loading some diffs on Wikidata.Jun 22 2019, 10:28 PM

Without knowing much about how diffs are generated for Wikidata, I just wanted to mention current issues with wikidiff2 leading to timeouts on big diffs T204010: Comparing revisions can fatal (timeout from wikidiff2 via TextSlotDiffRenderer) - The problem is fixed in a newer version that will be deployed eventually (WIP).