
Support Upload Resume via Server Side File Concatenation
Closed, ResolvedPublic

Description

Author: mdale

Description:
When uploading large video files via the upload API it would be very helpful to support uploading the file in chunks. That way, if your POST request gets reset, dies in the middle, or your process crashes, you don't need to start uploading from scratch. Additionally, this will avoid long-running large POSTs on the server side.

Initially we should support the Firefogg chunk upload method:
http://www.firefogg.org/dev/chunk_post.html

Basically we just give the "next chunk URL" back as a response until we get done=1 in the POST vars, then concatenate the pieces.
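
A minimal sketch of that flow on the server side, assuming a standalone endpoint (the chunksessionkey parameter, paths, and chunkUpload.php URL below are made-up placeholders, not Firefogg's or MediaWiki's actual code):

```php
<?php
// Sketch only: each POST carries one chunk; the response returns the URL for
// the next chunk until the client sends done=1, at which point the pieces
// have all been appended and the assembled file goes to the normal upload path.
$uploadDir  = '/tmp/chunk-uploads';                          // placeholder path
$sessionKey = preg_replace( '/[^a-zA-Z0-9]/', '',
    $_REQUEST['chunksessionkey'] ?? '' );                    // placeholder param
if ( $sessionKey === '' ) {
    http_response_code( 400 );
    die( json_encode( [ 'error' => 'missing chunksessionkey' ] ) );
}
$partialFile = "$uploadDir/$sessionKey.part";

// Append the uploaded chunk (if any) to the partial file.
if ( isset( $_FILES['chunk'] ) && is_uploaded_file( $_FILES['chunk']['tmp_name'] ) ) {
    $in  = fopen( $_FILES['chunk']['tmp_name'], 'rb' );
    $out = fopen( $partialFile, 'ab' );
    stream_copy_to_stream( $in, $out );
    fclose( $in );
    fclose( $out );
}

if ( !empty( $_REQUEST['done'] ) ) {
    // All chunks received: validate and record the concatenated file
    // (in MediaWiki terms this is where LocalFile::recordUpload2 would run).
    echo json_encode( [ 'result' => 'done', 'file' => $partialFile ] );
} else {
    // Hand back the "next chunk URL" for the client to POST to.
    echo json_encode( [
        'result'  => 'continue',
        'nextUrl' => "/chunkUpload.php?chunksessionkey=$sessionKey",
    ] );
}
```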


Version: unspecified
Severity: enhancement
URL: http://www.firefogg.org/dev/chunk_post.html

Details

Reference
bz17255

Event Timeline

bzimport raised the priority of this task to Medium. Nov 21 2014, 10:26 PM
bzimport set Reference to bz17255.
bzimport added a subscriber: Unknown Object (MLST).

Is that something we can reasonably have our PHP-based code understand, or would this take patching PHP's own HTTP upload-handling code?

Bryan.TongMinh wrote:

We should collect the different parts, concatenate them and then pass them on to LocalFile::recordUpload2. I think we can do this, but is it wise to invent our own standard for chunked upload?

mdale wrote:

In response to comment #1: Yeah, it's really simple: we just collect a set of POST files and concatenate them... no special HTTP upload-handling code. Maybe concatenate by running a shell command to avoid sending all that data through PHP, but it's not a big deal either way.
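
For the shell-command variant mentioned here, a hedged sketch (the chunk file naming and paths are invented for illustration):

```php
<?php
// Sketch: concatenate already-saved chunk files without streaming the bytes
// through PHP, by shelling out to cat. Paths and naming are hypothetical.
$chunks = glob( '/tmp/chunk-uploads/session123.chunk.*' );
natsort( $chunks );                                   // keep chunks in order
$target = '/tmp/chunk-uploads/session123.assembled';
$cmd = 'cat ' . implode( ' ', array_map( 'escapeshellarg', $chunks ) )
     . ' > ' . escapeshellarg( $target );
shell_exec( $cmd );
```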

In response to comment #2: Is there some standard that you would prefer we use? The only thing I am aware of is Google's proposed resumable upload code:
http://code.google.com/p/gears/wiki/ResumableHttpRequestsProposal

Notice that even Google (with their custom clients) has taken the one-meg-chunk-at-a-time approach because it avoids modifications to the HTTP protocol. They have proposed modifications to the HTTP protocol, and we can support that once it "gets out there".

(In reply to comment #3)

In response to comment #1: Yeah, it's really simple: we just collect a set of POST files and concatenate them... no special HTTP upload-handling code. Maybe concatenate by running a shell command to avoid sending all that data through PHP, but it's not a big deal either way.

Assigning this to Bryan; he's the one working on the upload API at the moment.

In response to comment #2: Is there some standard that you would prefer we use? The only thing I am aware of is Google's proposed resumable upload code:
http://code.google.com/p/gears/wiki/ResumableHttpRequestsProposal

Notice that even Google (with their custom clients) has taken the one-meg-chunk-at-a-time approach because it avoids modifications to the HTTP protocol. They have proposed modifications to the HTTP protocol, and we can support that once it "gets out there".

Modifications to the HTTP protocol shouldn't be supported by us, but by PHP.

IIRC the new API upload module supports Firefogg chunked uploading. Marking as FIXED.

(In reply to comment #5)

IIRC the new API upload module supports Firefogg chunked uploading. Marking as FIXED.

Based on comments at bug 25676, I don't think this bug is properly fixed currently. Re-opening for now.

(Copied from bug 25676 comment 12 by Neil)

(In reply to bug 25676 comment 11)

(In reply to bug 25676 comment 9)

Reopening since Tim's arguments for WONTFIX pertained mostly to the Firefogg
add-on (client-side) rather than the FirefoggChunkedUpload extension
(server-side support).

Actually I think the second paragraph in comment 5, where I explained why I
don't think the server-side extension should be enabled, was longer than the
first paragraph, which dealt with my objections to the client-side.

I've had a look at Google's Resumable Upload Protocol. They do things in a
reasonable manner, also very RESTy. We have never used HTTP Headers or Status
fields for application state signalling, but we can emulate most of this in
ordinary POST parameters and returned data.

http://code.google.com/apis/documents/docs/3.0/developers_guide_protocol.html#ResumableUpload

Okay, so if the following things were added to a chunked upload protocol, would
this be satisfactory?

  • Before starting an upload, the client tells the server the length of the file to be uploaded, in bytes.

  • With each upload chunk, the client also tells the server which range of bytes this corresponds to.

  • With each response to an upload chunk, the server indicates the largest range of contiguous bytes starting from zero that it thinks it has. (The client should use this information to set its file pointer for subsequent chunks.) N.b. this means it's possible for the client to send overlapping ranges; the server needs to be smart about this.

  • The server is the party that decides when the upload is done (by signaling that the full range of packets has been received, saying "ok, all done", and then returning the other usual information about how to refer to the reassembled file).

We could also add optional checksums here, at least for each individual chunk.
(A complete-file checksum would be nice, but it's not clear to me whether it is
practical for Javascript FileAPI clients to obtain them).

And each upload could return some error status, particularly if checksums or
expected length doesn't match.
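
A rough sketch of the server-side bookkeeping those points imply (purely illustrative; the class and method names below are invented, not an existing MediaWiki API):

```php
<?php
// Sketch: track which byte ranges of a declared-length upload have arrived,
// report the largest contiguous prefix starting from zero, and let the server
// (not the client) decide when the upload is complete.
class ChunkTracker {
    private $totalLength;   // declared by the client before the upload starts
    private $ranges = [];   // list of [start, endExclusive] ranges received

    public function __construct( $totalLength ) {
        $this->totalLength = $totalLength;
    }

    // Record a chunk; overlapping or duplicate ranges are tolerated.
    public function addRange( $start, $endExclusive ) {
        $this->ranges[] = [ $start, min( $endExclusive, $this->totalLength ) ];
    }

    // Largest contiguous byte count from offset zero that the server has;
    // the client should set its file pointer here for the next chunk.
    public function contiguousPrefix() {
        $ranges = $this->ranges;
        usort( $ranges, function ( $a, $b ) { return $a[0] - $b[0]; } );
        $have = 0;
        foreach ( $ranges as $r ) {
            if ( $r[0] > $have ) {
                break; // gap: a chunk is still missing
            }
            $have = max( $have, $r[1] );
        }
        return $have;
    }

    // The server signals "ok, all done" only once the full range is present.
    public function isComplete() {
        return $this->contiguousPrefix() >= $this->totalLength;
    }
}
```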

(Copied from bug 25676 comment 13 by Neil)

Okay, from some comments on IRC, it is apparently unclear why I just posted
some suggestions for changing the Firefogg protocol. I am trying to answer
Tim's objections that the way of uploading chunks is not robust enough.

In Firefogg, the client is basically POSTing chunks to append until it says
that it's done. The server has no idea when this process is going to end, and
has no idea if it missed any chunks. I believe this is what bothered Tim about
this.

There was some confusion about whether PLupload was any better. As far as I can
tell, it isn't, so I looked to the Google Resumable Upload Protocol for
something more explicit.

(Copied from bug 25676 comment 14 by Tim)

(In reply to bug 25676 comment 12)

Okay, so if the following things were added to a chunked upload protocol, would
this be satisfactory?

Yes, that is the sort of thing we need.

(In reply to bug 25676 comment 13)

In Firefogg, the client is basically POSTing chunks to append until it says
that it's done. The server has no idea when this process is going to end, and
has no idea if it missed any chunks. I believe this is what bothered Tim about
this.

Yes. For example, if the PHP process for a chunk upload takes a long time, the
Squid server may time out and return an error message, but the PHP process will
continue and the chunk may still be appended to the file eventually. In this
situation, Firefogg would retry the upload of the same chunk, resulting in it
being appended twice. Because the original request and the retry will operate
concurrently, it's possible to hit NFS concurrency issues, with the duplicate
chunks partially overwriting each other.

A robust protocol, which assumes that chunks may be uploaded concurrently,
duplicated or omitted, will be immune to these kinds of operational details.

Dealing with concurrency might be as simple as returning an error message if
another process is operating on the same file. I'm not saying there is a need
for something complex there.
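
Dealing with that simple case could look roughly like the following. This is only a sketch; the paths and error shape are invented, and plain file locks are themselves unreliable over NFS, which is part of the operational problem described above:

```php
<?php
// Sketch: refuse to append if another request already holds the lock for the
// same partial upload, instead of risking interleaved or duplicate appends.
function appendChunkExclusively( $partialFile, $chunkTmpName ) {
    $lock = fopen( $partialFile . '.lock', 'c' );
    if ( !$lock || !flock( $lock, LOCK_EX | LOCK_NB ) ) {
        // Another process is operating on the same file: return an error and
        // let the client retry, as suggested above.
        return [ 'result' => 'error', 'code' => 'upload-busy' ];
    }
    $in  = fopen( $chunkTmpName, 'rb' );
    $out = fopen( $partialFile, 'ab' );
    stream_copy_to_stream( $in, $out );
    fclose( $in );
    fclose( $out );
    flock( $lock, LOCK_UN );
    fclose( $lock );
    return [ 'result' => 'ok' ];
}
```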

mdale wrote:

So... the current consensus based on bug 25676 is transport-level chunk support. This points to the support being written into core and handled somewhat above the 'upload' API entry points.

The flow would look like the following, based on http://code.google.com/apis/documents/docs/3.0/developers_guide_protocol.html#ResumableUploadInitiate and http://code.google.com/p/gears/wiki/ResumableHttpRequestsProposal:

Your initial POST sets all the upload parameters:
filename
comment
text
token
stash
etc.

In addition to Content-Length (for the parameters), we set the "X-Upload-Content-Type" and "X-Upload-Content-Length" headers, which give the target file type and upload size, but we /DO NOT/ include any portion of the file in this initial request. These special X-Upload-Content-* headers indicate to the server that this is a resumable/chunked upload request. (Ideally we don't want to explicitly tag it with a MediaWiki-specific API parameter.) We may also need a way to initially communicate to the client that the server supports resumable uploads.

The server then checks the requested size, validates all the initial upload parameters (token, valid file name, etc.), then responds with a unique URL that only the current session can upload to.

HTTP/1.1 200 OK
Location: <upload_uri>

NOTE: We are slightly abusing the resume protocol, since normally you would send a request to upload the entire file (but because small chunks are friendlier to Wikimedia's back-end systems, we want clients to send things in smaller parts).

The client then starts to send the file in 1 MB chunks. The chunks are specified via the Content-Range header, i.e. something like:

Content-Length: 10
Content-Range: bytes 0-9/100

The server receives the Content-Range POSTs and checks that the chunk is authenticated via the session and unique URL; the chunks' byte ranges are checked and only valid, unseen, sequential byte ranges are appended to the file.

If there are no errors, the server responds with a header specifying the next chunk:
HTTP/1.1 308 Resume Incomplete
Content-Length: 0
Range: 10-19

The client then responds to the Resume Incomplete and sends the next chunk to the server. If the POST breaks or is incomplete, the client can query the server for where it left off with:

PUT <upload_uri> HTTP/1.1
Host: docs.google.com
Content-Length: 0
Content-Range: bytes */100

The client should only do this every 30 seconds for 5 minutes and then give up. The server should also "give up" after 30 minutes and invalidate any chunks that attempt to be appended to an old file. Likewise, partially uploaded files should be purged every so often, possibly with the same purge system used for stashed files?

Finally, if all is well, when the final chunk is sent the normal API response code is run, where the file is validated and stashed or added to the system.

If this sounds reasonable, all that's left to do is implementation ;)
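
To make that flow concrete, here is a hedged server-side sketch of the request handling (routing, the helper functions, and storage are placeholders, not MediaWiki's actual upload API; the Range semantics follow the "next chunk" reading described above):

```php
<?php
// Sketch of the three request types described above: the initiation POST with
// X-Upload-Content-* headers, chunk uploads carrying Content-Range, and the
// "bytes */total" query asking where to resume.
$contentRange = isset( $_SERVER['HTTP_CONTENT_RANGE'] ) ? $_SERVER['HTTP_CONTENT_RANGE'] : '';

if ( isset( $_SERVER['HTTP_X_UPLOAD_CONTENT_LENGTH'] ) && $contentRange === '' ) {
    // 1. Initiation: upload parameters only, no file bytes. Validate the token,
    //    file name, and declared size, then hand back a unique upload URL.
    $totalLength = (int)$_SERVER['HTTP_X_UPLOAD_CONTENT_LENGTH'];
    $uploadId = bin2hex( openssl_random_pseudo_bytes( 16 ) );
    // ...persist $totalLength and the validated parameters under $uploadId...
    header( 'HTTP/1.1 200 OK' );
    header( "Location: /upload.php?uploadId=$uploadId" );    // placeholder URL
    exit;
}

if ( preg_match( '!^bytes (\*|(\d+)-(\d+))/(\d+)$!', $contentRange, $m ) ) {
    $uploadId = isset( $_GET['uploadId'] ) ? $_GET['uploadId'] : '';
    $total    = (int)$m[4];
    $received = currentContiguousBytes( $uploadId );         // hypothetical helper

    if ( $m[1] !== '*' && (int)$m[2] === $received ) {
        // 2. A chunk that starts exactly where the file currently ends:
        //    append it; duplicates and out-of-order ranges are ignored.
        appendChunk( $uploadId, file_get_contents( 'php://input' ) ); // hypothetical
        $received = currentContiguousBytes( $uploadId );
    }

    if ( $received < $total ) {
        // 3. Not finished yet (or a "bytes */total" query): tell the client
        //    the next range to send, capped at 1 MB.
        header( 'HTTP/1.1 308 Resume Incomplete' );
        header( 'Range: ' . $received . '-' . ( min( $received + 1048576, $total ) - 1 ) );
        header( 'Content-Length: 0' );
        exit;
    }

    // 4. Final chunk received: run the normal API response code, where the
    //    file is validated and stashed or added to the system.
    echo json_encode( [ 'result' => 'Success' ] );
}
```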

(In reply to comment #10)

If this sounds reasonable, all that's left to do is implementation ;)

The parts about X-Foo HTTP headers and using PUT sound like they would make a JS client implementation difficult. Same for using HTTP headers to return data (Range header in the 308 response) and probably to a lesser degree for using status codes to convey information (308).

Supporting the Google protocol verbatim is probably a good idea for interoperability, but I think we should also implement a slightly modified version that uses only POST data (query params) to send data and only the response body to receive data, just like api.php.
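
A hedged sketch of what that POST-only variant could carry on the wire (the parameter and field names below are invented for illustration, not an agreed-upon API):

```php
<?php
// Sketch: the same information as the header/status-code protocol, expressed
// purely as POST parameters and a response body, api.php-style. All names
// here (filekey, offset, filesize, result) are illustrative placeholders.

// What the client would POST for one chunk:
$postParams = [
    'action'   => 'upload',
    'token'    => '<edit token>',
    'filekey'  => 'abc123',     // handle returned by the initiation request
    'offset'   => 1048576,      // where this chunk starts within the file
    'filesize' => 10485760,     // total size declared up front
    // plus a multipart 'chunk' part carrying the bytes themselves
];

// What the server would answer instead of "308 Resume Incomplete" + Range:
$responseBody = [
    'upload' => [
        'result'  => 'Continue',
        'filekey' => 'abc123',
        'offset'  => 2097152,   // largest contiguous prefix the server now has
    ],
];
```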

mdale wrote:

Modern XHR browser implementations do support reading and sending custom headers and reading custom response codes, etc., e.g. xhr.setRequestHeader('custom-header', 'value') and xhr.getAllResponseHeaders().

We won't be supporting non-modern XHR systems since the browser needs to support the Blob slice API... I don't think the issue would be on the client; my concern would be whether there are any foreseen issues on the back end.

I suppose we could support both... by also supporting these headers in the API request and response.

(In reply to comment #12)

Modern XHR browser implementations do support reading and sending custom headers and reading custom response codes, etc., e.g. xhr.setRequestHeader('custom-header', 'value') and xhr.getAllResponseHeaders().

We won't be supporting non-modern XHR systems since the browser needs to support the Blob slice API... I don't think the issue would be on the client; my concern would be whether there are any foreseen issues on the back end.

I suppose we could support both... by also supporting these headers in the API request and response.

Does jQuery AJAX support this?

mdale wrote:

(In reply to comment #13)

(In reply to comment #12)

Modern XHR browser implementations do support reading and sending custom headers and reading custom response codes, etc., e.g. xhr.setRequestHeader('custom-header', 'value') and xhr.getAllResponseHeaders().

We won't be supporting non-modern XHR systems since the browser needs to support the Blob slice API... I don't think the issue would be on the client; my concern would be whether there are any foreseen issues on the back end.

I suppose we could support both... by also supporting these headers in the API request and response.

Does jQuery AJAX support this?

Sure, jQuery returns the raw XHR object, but it would need to be a "plugin" that extended the AJAX support, i.e. the jQuery plugin would take an input[type=file] or parent form, "POST" it to an API target, and give you status updates via callbacks or polling.

If done at the protocol level, the jQuery plugin could be a general-purpose plugin for any PHP-based Google resumable upload implementation.

(In reply to comment #14)

Sure, jQuery returns the raw XHR object, but it would need to be a "plugin" that extended the AJAX support, i.e. the jQuery plugin would take an input[type=file] or parent form, "POST" it to an API target, and give you status updates via callbacks or polling.

If done at the protocol level, the jQuery plugin could be a general-purpose plugin for any PHP-based Google resumable upload implementation.

OK, as long as we're not making protocol design choices that make a jQuery implementation disproportionately harder to do, I'm fine with it.

So what's the deal with this bug? We have chunked upload support in core, and have had for years now. Is that sufficient, or is this bug really wanting some specific protocol instead of the code we already have?

If it's sufficient, let's close this as Resolved. If it's wanting some other protocol, based on what's written here about these other protocols I'd close this as Declined.

If someone wants to change the existing chunked upload provided by the action API, they should probably open a new task. Closing this.