Wikidata Query Service REST endpoint returns truncated results
Closed, Resolved · Public


This query is returning truncated results for me:

If I download it from the browser, the results are complete and the file size is 76.3 MB. However, if I download it through curl or Python, I receive an unreadable XML file that has been truncated at 3.6 MB (the same happens with JSON). Any idea why this might be?

Event Timeline

Restricted Application added a subscriber: Aklapper.

Here is the SPARQL, for clarity:

SELECT ?qid ?imdb WHERE {?qid wdt:P345 ?imdb. FILTER regex(str(?imdb), "nm|tt")} ORDER BY ?qid
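
For anyone trying to reproduce this from Python, here is a minimal sketch of how the request could be built and how a truncated body could be detected. The endpoint URL, the User-Agent string and the helper names (`build_request`, `is_complete`) are my own assumptions for illustration, not something taken from this task. The check simply tries to parse the body, since a response cut off mid-stream will not be well-formed JSON or XML.

```python
import json
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Assumption: the public query service endpoint; the task does not state the URL used.
ENDPOINT = "https://query.wikidata.org/sparql"

QUERY = (
    'SELECT ?qid ?imdb WHERE {?qid wdt:P345 ?imdb. '
    'FILTER regex(str(?imdb), "nm|tt")} ORDER BY ?qid'
)


def build_request(query, accept="application/sparql-results+json"):
    """Build a GET request for the query (hypothetical helper)."""
    url = ENDPOINT + "?" + urllib.parse.urlencode({"query": query})
    headers = {
        "Accept": accept,
        # Placeholder client identifier, purely illustrative.
        "User-Agent": "truncation-repro/0.1 (example)",
    }
    return urllib.request.Request(url, headers=headers)


def is_complete(body, fmt):
    """Return True if the body parses as a whole JSON or XML document."""
    try:
        if fmt == "json":
            json.loads(body)
        else:
            ET.fromstring(body)
    except (ValueError, ET.ParseError):
        # JSONDecodeError and UnicodeDecodeError are both ValueError subclasses.
        return False
    return True
```

Usage would be along the lines of `body = urllib.request.urlopen(build_request(QUERY)).read()` followed by `is_complete(body, "json")`. Note that this downloads the full result set, so it is slow.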

A similar issue has been reported by Ivan A. Krestinin here:
It happens to him with Wikidata API requests, not WQS, but it has oddly similar features:

  • Large response (in my case more than 70 MB)
  • Works fine in the browser
  • Fails through bots/Python/Curl etc.
  • Returns truncated results
  • Started about the same date (after 10 April)

I have noticed that if I set the --compressed option in curl, the file does not get truncated. I searched on Gerrit and found a Varnish configuration change from April 11 whose commit message says:

Note we still need to re-test assumptions about do_stream=false adding Content-Length to responses which lacked them in the common case (no gzip/gunzip on the fly on the response side).

Commit here:
Related tasks: T128813, T131501, T131761
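
In case it helps others hitting this from Python, below is a rough equivalent of the curl --compressed workaround: ask for a gzip-encoded response and decompress it on the client. This is only a sketch under my own assumptions. The function names and the User-Agent string are hypothetical, and I have not verified that it avoids the truncation in every case.

```python
import gzip
import urllib.request


def decode_body(raw, content_encoding):
    """Decompress the body if the server reports it as gzip-encoded."""
    if content_encoding.lower() == "gzip":
        return gzip.decompress(raw)
    return raw


def fetch_compressed(url, accept="application/sparql-results+json"):
    """Fetch a URL while advertising gzip support (hypothetical helper)."""
    req = urllib.request.Request(
        url,
        headers={
            "Accept": accept,
            "Accept-Encoding": "gzip",
            # Placeholder client identifier, purely illustrative.
            "User-Agent": "truncation-repro/0.1 (example)",
        },
    )
    with urllib.request.urlopen(req) as resp:
        raw = resp.read()
        encoding = resp.headers.get("Content-Encoding", "")
    return decode_body(raw, encoding)
```

urllib does not decompress responses by itself, which is why the Content-Encoding header is checked explicitly before handing the bytes to a parser.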

fgiunchedi triaged this task as Medium priority. Apr 28 2016, 9:46 AM
BBlack added a subscriber: BBlack.

We now have some understanding of the mechanism of this bug (T133866#2275985). It should go away with the imminent Varnish 4 upgrade of the misc cluster in T131501.

Change 287633 had a related patch set uploaded (by BBlack):
cache_misc: remove all do_stream=true

Change 287633 merged by BBlack:
cache_misc: remove all do_stream=true

BBlack claimed this task.

This works now. If the request is not a cache hit, the user will see a significant pause at the start of the transfer, because streaming is disabled as a workaround: each cache layer has to load the complete response before it starts sending data onward. It does function correctly, though. The non-streamed pause behavior will go away with T131501.