
Can't download large datasets from datasets.wikimedia.org
Closed, Resolved (Public)

Description

I can't seem to get the download to start for large datasets on datasets.wikimedia.org. As far as I know this broke fairly recently, but it could have been months ago.

See example here: http://datasets.wikimedia.org/public-datasets/enwiki/etc/session_revisions.20131105.tsv.gz

Expected: The download starts right after clicking the link.

Actual: No download starts.

@Ottomata suspects that it could be due to varnish trying to add the dataset to its cache.

[11:00:46] <ottomata> ha, halfak, misc varnishes only have 8G memory allocated to them
[11:00:49] <ottomata> that file is 9G

Event Timeline

Halfak raised the priority of this task from to Needs Triage.
Halfak updated the task description.
Halfak added subscribers: Halfak, Ottomata.

Change 221139 had a related patch set uploaded (by Ottomata):
Don't cache datasets.wikimedia.org

https://gerrit.wikimedia.org/r/221139

Change 221139 merged by Dzahn:
Don't cache datasets.wikimedia.org

https://gerrit.wikimedia.org/r/221139
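
The merged change itself isn't quoted in this task. A minimal sketch of a host-based cache bypass like the one the commit title describes (Varnish 3 VCL; the hostname is taken from the commit title, everything else is illustrative) would be:

sub vcl_recv {
        /* Never cache datasets.wikimedia.org: send every request
           straight to the backend so large files are never pulled
           into the cache at all. */
        if (req.http.Host == "datasets.wikimedia.org") {
                return (pass);
        }
}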

Change 221177 had a related patch set uploaded (by Dzahn):
varnish-misc: fix syntax error for datasets config

https://gerrit.wikimedia.org/r/221177

Change 221177 merged by Dzahn:
varnish-misc: fix syntax error for datasets config

https://gerrit.wikimedia.org/r/221177

The patch should have been merged by now, but the problem persists.

Confirmed this is still a problem. I think what's happening is that we're no longer caching in Varnish, but it still tries to fetch the whole file from the backend before responding. In other words, we are not setting beresp.do_stream in vcl_fetch here, I think:

sub vcl_fetch {
        /* Don't cache private, no-cache, no-store objects */
        if (beresp.http.Cache-Control ~ "(private|no-cache|no-store)") {
                set beresp.ttl = 0s;
                /* This should be translated into hit_for_pass later */
        }
}
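
If that diagnosis is right, the missing piece would be something like the following (a sketch only, not the actual change): enable streaming on the same uncacheable branch so the body is forwarded to the client while it is still being fetched from the backend, instead of being buffered in full first.

sub vcl_fetch {
        /* Don't cache private, no-cache, no-store objects */
        if (beresp.http.Cache-Control ~ "(private|no-cache|no-store)") {
                set beresp.ttl = 0s;
                /* Forward the body to the client as it arrives from
                   the backend rather than buffering the whole object. */
                set beresp.do_stream = true;
        }
}
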
fgiunchedi set Security to None.
BBlack added subscribers: ArielGlenn, StudiesWorld.
BBlack subscribed.

Yeah, this is all the same issue and it's still present. I think @fgiunchedi is on the right track here about streaming. I'm going to write up a generic misc-cluster patch to work around this for datasets and for other services there that might run into similar issues. The recent misc-cluster switches to dual-layer and dual-tier have probably exacerbated this problem compared to the past.

Change 256705 had a related patch set uploaded (by BBlack):
cache_misc: stream and hit-for-pass for large objects

https://gerrit.wikimedia.org/r/256705

Change 256705 merged by BBlack:
cache_misc: stream and hit-for-pass for large objects

https://gerrit.wikimedia.org/r/256705
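
The merged patch isn't reproduced here either. A rough sketch of the approach the commit message describes, where large responses are both streamed and turned into hit-for-pass objects so later requests also skip the cache (the 256 MB threshold, the 120s hit-for-pass TTL, and the use of the std vmod are assumptions, not details from the patch):

import std;

sub vcl_fetch {
        /* Objects too large to cache sensibly are streamed to the
           client, and a short-lived hit-for-pass marker is stored so
           subsequent requests go straight to the backend as well. */
        if (std.integer(beresp.http.Content-Length, 0) > 268435456) {
                set beresp.do_stream = true;
                set beresp.ttl = 120s;
                return (hit_for_pass);
        }
}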

This should be resolved now with the change above applied. I've tested the files from this ticket and from the more recently merged duplicate ticket; both started streaming quickly and got through several hundred megs before I aborted the transfer (which would otherwise take forever!). Reopen if there's still an issue!