
Wikidata Query Service REST endpoint returns truncated results
Closed, Resolved · Public

Description

This query is returning truncated results for me:

https://query.wikidata.org/sparql?query=SELECT%20%3Fqid%20%3Fimdb%20WHERE%20%7B%3Fqid%20wdt%3AP345%20%3Fimdb.%20FILTER%20regex%28str%28%3Fimdb%29%2C%20%22nm%7Ctt%22%29%7D%20ORDER%20BY%20%3Fqid

If I download it from the browser, the results are complete and the file size is 76.3 MB. However, if I download it with curl or Python, I receive an unreadable XML file that has been truncated at 3.6 MB (the same happens with JSON). Any idea why this might be?
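A quick way to confirm that a downloaded response was cut off, rather than merely slow, is to try parsing it. The following is a minimal sketch using only the Python standard library; the function names are made up for illustration, and it reads the whole body into memory, which is acceptable for a one-off check but not ideal for a 76 MB file.

```python
import json
import xml.etree.ElementTree as ET


def is_complete_json(body: bytes) -> bool:
    """Return True if the body parses as a complete JSON document."""
    try:
        json.loads(body)
    except ValueError:
        # Covers json.JSONDecodeError and decoding errors caused by a
        # body that was cut off in the middle of a multi-byte character.
        return False
    return True


def is_complete_xml(body: bytes) -> bool:
    """Return True if the body parses as a well-formed XML document."""
    try:
        ET.fromstring(body)
    except ET.ParseError:
        return False
    return True
```

A truncated download fails both checks because the closing brackets or tags never arrive, which matches the "unreadable" symptom described above.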

Event Timeline

Restricted Application added a subscriber: Aklapper.

Here is the SPARQL, for clarity:

SELECT ?qid ?imdb WHERE {?qid wdt:P345 ?imdb. FILTER regex(str(?imdb), "nm|tt")} ORDER BY ?qid
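For anyone reproducing this from a script, the URL in the description can be rebuilt from the SPARQL text by percent-encoding it into the `query` parameter. This is only a sketch: `build_query_url` is a hypothetical helper, and the endpoint is taken from the URL in the description.

```python
from urllib.parse import quote

ENDPOINT = "https://query.wikidata.org/sparql"

SPARQL = (
    'SELECT ?qid ?imdb WHERE {?qid wdt:P345 ?imdb. '
    'FILTER regex(str(?imdb), "nm|tt")} ORDER BY ?qid'
)


def build_query_url(sparql: str, endpoint: str = ENDPOINT) -> str:
    """Percent-encode the SPARQL text and append it as the query parameter."""
    # safe="" makes quote() encode every reserved character, so spaces
    # become %20 and "?" becomes %3F, as in the URL from the description.
    return endpoint + "?query=" + quote(sparql, safe="")


if __name__ == "__main__":
    print(build_query_url(SPARQL))
```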

A similar issue has been reported by Ivan A. Krestinin here: https://www.wikidata.org/wiki/Wikidata:Contact_the_development_team#Corrupted_response_end.
It happens to him with Wikidata API requests, not WQS, but it has oddly similar features:

  • Large response (in my case more than 70 MB)
  • Works fine in the browser
  • Fails through bots/Python/Curl etc.
  • Returns truncated results
  • Started about the same date (after 10 April)

I have noticed that if I set the --compressed option in curl, the file does not get truncated. I searched on Gerrit and found a Varnish configuration change from April 11 whose commit message says:

Note we still need to re-test assumptions about do_stream=false adding Content-Length to responses which lacked them in the common case (no gzip/gunzip on the fly on the response side).

Commit here: https://gerrit.wikimedia.org/r/#/c/282716. Related tasks: T128813, T131501, T131761
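The curl `--compressed` workaround mentioned above can be approximated from Python by asking for a gzip-encoded response and decompressing it on the client. This is a rough sketch using only the standard library; the helper names are hypothetical, and it requests gzip only, whereas curl may advertise further encodings.

```python
import gzip
import urllib.request
from typing import Optional


def build_compressed_request(url: str) -> urllib.request.Request:
    """Build a request that asks the server for a gzip-encoded response."""
    return urllib.request.Request(url, headers={"Accept-Encoding": "gzip"})


def decode_body(raw: bytes, content_encoding: Optional[str]) -> bytes:
    """Decompress the body if the server reports that it used gzip."""
    if content_encoding and content_encoding.strip().lower() == "gzip":
        return gzip.decompress(raw)
    return raw


def fetch_compressed(url: str) -> bytes:
    """Fetch url with gzip requested and return the decoded body."""
    request = build_compressed_request(url)
    with urllib.request.urlopen(request) as response:
        return decode_body(response.read(), response.headers.get("Content-Encoding"))
```

Calling `fetch_compressed` with the query URL from the description should then behave like the `--compressed` download, assuming the server honours the `Accept-Encoding` header.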

fgiunchedi triaged this task as Medium priority. Apr 28 2016, 9:46 AM
BBlack subscribed.

We now have some understanding of the mechanism of this bug (T133866#2275985). It should go away in the imminent Varnish 4 upgrade of the misc cluster in T131501.

Change 287633 had a related patch set uploaded (by BBlack):
cache_misc: remove all do_stream=true

https://gerrit.wikimedia.org/r/287633

Change 287633 merged by BBlack:
cache_misc: remove all do_stream=true

https://gerrit.wikimedia.org/r/287633

BBlack claimed this task.

This works now. Streaming is disabled as a workaround, so on a cache miss each cache layer has to load the complete response before it starts sending data onward. From the user's perspective that means a significant pause at the start of the transfer, but the transfer does function correctly. The non-streamed pause behavior will go away with T131501.