Page MenuHomePhabricator

Chunked upload fails in UploadWizard with the server aborting the connection, and no errors in the server logs
Closed, ResolvedPublic

Description

https://www.fsf.org/blogs/community/user-liberation-watch-and-share-our-new-video#info

The video on this page (click "full resolution" to download it), when uploaded through UploadWizard (with chunking turned on, in Firefox) causes a weird behaviour:

  • UW says "Unknown error"
  • The file doesn't upload more than 10-20% of the way
  • No parameters are sent in the API request that I can see
  • The server doesn't respond in any way - no status code, no response headers, no response body

I have looked at the logs a little bit, but I cannot find anything related. Logstash is down, so you may need to look at fluorine - which is what I was doing, except that I had no success in finding what I needed.

Event Timeline

MarkTraceur raised the priority of this task from to High.
MarkTraceur updated the task description. (Show Details)
MarkTraceur added subscribers: MarkTraceur, Gilles, Tgr.

HttpFox reports a NS_ERROR_NET_RESET error.

I think that's roughly the same as what I said according to Mozilla

Fatal/OOM? The zend interpreter can die without even invoking the shutdown handler on certain fatal errors (not sure about hhvm), and OOM errors would probably be logged in syslog only.

Can't repro this on beta, although the behavior is weird there as well: it finishes the upload, says Unknown error: "unknown"., and the API response is the same help page you get when visiting api.php in a browser. (That was the second attempt - the first just crashed Firefox.)

Can reproduce in production although the API request parameters seem fine to me.

In T89018#1026574, @Tgr wrote:

Fatal/OOM? The zend interpreter can die without even invoking the shutdown handler on certain fatal errors (not sure about hhvm), and OOM errors would probably be logged in syslog only.

Which requires root access, so someone from SRE will have to look into this.

Steps to reproduce:

  1. fetch https://static.fsf.org/nosvn/FSF30-video/FSF_30_video.ogv
  2. install http://firefogg.org/ in Firefox
  3. visit Commons, log in
  4. open https://commons.wikimedia.org/wiki/Special:UploadWizard
  5. tail syslog
  6. click "Add more files", add the video
  7. wait for the upload to fail
  8. hopefully see a PHP OOM error or something similar in the syslog

Same just happened to me, with chunked uploads enabled, on both Chrome and Firefox, with a 150MB PDF file, on a strong connection (from the WMF office). uploads complete, but "unknown error".

I can reproduce this on https://test.wikipedia.org/wiki/Special:UploadWizard but can't actually get something relevant in the log files

gerritbot subscribed.

Change 189828 had a related patch set uploaded (by MarkTraceur):
Fix call to FormDataTransport

https://gerrit.wikimedia.org/r/189828

Patch-For-Review

Change 189828 merged by jenkins-bot:
Fix call to FormDataTransport

https://gerrit.wikimedia.org/r/189828

Change 189849 had a related patch set uploaded (by Gergő Tisza):
Fix call to FormDataTransport

https://gerrit.wikimedia.org/r/189849

Patch-For-Review

Change 189849 merged by jenkins-bot:
Fix call to FormDataTransport

https://gerrit.wikimedia.org/r/189849

Tgr renamed this task from Investigate server error when uploading an OGV to Chunked upload fails in UploadWizard with the server aborting the connection, and no errors in the server logs.Feb 10 2015, 11:33 PM
Tgr set Security to None.

For the record, @ori and @BBlack have investigated this and found that the nginx server handling the uploads killed them with 413 Request Entity Too Large; nginx had a 100M file size limit (client_max_body_size) and the UploadWizard logic which should have ensured the file is split into <100M chunks had a bug.

The nginx error log record was

2015/02/10 20:57:02 [error] 26175#0: *160941080 client intended to send too large body: 133365469 bytes, client: 66.41.61.126, server: wikimedia.org, request: "POST /w/api.php HTTP/1.1", host: "commons.wikimedia.org", referrer: "https://commons.wikimedia.org/wiki/Special:UploadWizard?debug=true"

If I understand correctly, the difficulties in tracking this down were caused by most boxes having an older version of nginx which did not respond/log properly, just aborted the request.

Tgr claimed this task.

Backported to wmf16; this should be fixed on Commons now.

(ori/BBlack: if you can fill in where/how exactly you looked for the logs, that would greatly decrease the chance of people bugging you next time a similar problem happens :-)

I pulled that log line from looking directly at /var/log/nginx/wikimedia.org.error.log on one of the text cache nodes in eqiad (cp1066).

Also, re nginx versions: both versions log the same error. My understanding is that the newer one in testing (1.6.x being tested on jessie) returns a 413 error to the user, whereas the older version (1.1.x on precise) simply aborted the connection to the client without response in the same scenario.

In my opinion, progress bar still work incorrectly.
It displays the upload progress for each part (0-100%) then processed and repeated from zero many times.
Creates the illusion of termination and re-upload.
I almost did not create a new ticket.

Errors in console when opening pages (see #T88958) also preserved. Is it right and good style?

In my opinion, progress bar still work incorrectly.
It displays the upload progress for each part (0-100%) then processed and repeated from zero many times.

That's unrelated and tracked at T89194.

Errors in console when opening pages (see T88958) also preserved. Is it right and good style?

Do you mean NS_ERROR_FILE_CORRUPTED? That's again unrelated, sounds like a ResourceLoader error; you should open a separate ticket under that component.

@BBlack, I have a different theory regarding the nginx 1.1.x vs 1.6.x behaviour. Well not my theory but found it in:

http://stackoverflow.com/questions/4947107/nginx-upload-client-max-body-size-issue

Note the:

nginx "fails fast" when the client informs it that it's going to send a body larger than the client_max_body_size by sending a 413 response and closing the connection.

Most clients don't read for responses until the entire request body is sent. Because nginx closes the connection, the client sends data to the closed socket, causing a TCP RST.

@akosiaris that sounds consistent with what we've been experiencing when reproducing this bug, i.e. nginx closed the connection way before the whole content was uploaded

@BBlack, I have a different theory regarding the nginx 1.1.x vs 1.6.x behaviour. Well not my theory but found it in:

http://stackoverflow.com/questions/4947107/nginx-upload-client-max-body-size-issue

Note the:

nginx "fails fast" when the client informs it that it's going to send a body larger than the client_max_body_size by sending a 413 response and closing the connection.

Most clients don't read for responses until the entire request body is sent. Because nginx closes the connection, the client sends data to the closed socket, causing a TCP RST.

Well, that may all be true. Whatever client ori was testing with that day, though, he actually got the 413 response from nginx 1.6.x, but not from 1.1.x.