Upload by URL should use the job queue, possibly chunked with range requests
Open, High, Public

Description

I would probably recommend retooling upload-by-URL to run via the job queue, like long-running video transcodes do. This would allow downloads to take longer if needed, without the risk of DoSing the app servers. This is a bigger task though...

But more to my point: what should change is the fact that upload-by-url can take up to 200 seconds to complete a single HTTP request.
HTTP requests should terminate reasonably quickly and the work should be done in the background via the jobqueue, and the client should have a way of polling it.
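
To illustrate the request/poll split described above, here is a minimal in-memory sketch. It is not MediaWiki code: `UploadJobQueue`, `submit`, `status`, and `poll_until_finished` are hypothetical names, and a background thread stands in for a real job runner. The only point is that the submit call returns a key immediately and the client polls for the outcome.

```python
import threading
import time
import uuid


class UploadJobQueue:
    """Toy stand-in for a job queue (hypothetical, not MediaWiki's JobQueue)."""

    def __init__(self):
        self._status = {}
        self._lock = threading.Lock()

    def submit(self, url, fetch):
        """Register the job and return a status key without waiting for the fetch."""
        key = uuid.uuid4().hex
        self._set(key, "queued")
        worker = threading.Thread(target=self._run, args=(key, url, fetch), daemon=True)
        worker.start()
        return key

    def _run(self, key, url, fetch):
        # Runs in the background; the submitting request has already returned.
        self._set(key, "fetching")
        try:
            self._set(key, "done", fetch(url))
        except Exception as exc:
            self._set(key, "failed", str(exc))

    def _set(self, key, stage, result=None):
        with self._lock:
            self._status[key] = {"stage": stage, "result": result}

    def status(self, key):
        """Cheap status lookup, suitable for a short polling request."""
        with self._lock:
            return dict(self._status[key])


def poll_until_finished(queue, key, interval=0.05, timeout=5.0):
    """Client side: poll the status until the job is done or has failed."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        state = queue.status(key)
        if state["stage"] in ("done", "failed"):
            return state
        time.sleep(interval)
    raise TimeoutError(f"job {key} did not finish within {timeout} seconds")
```

A client would call `submit()` once, keep the returned key, and then call `poll_until_finished()` (or its own loop around `status()`), so no single request has to stay open for the duration of the download.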

I also discussed this with @tstarling recently: if a single long-running job doesn't work, we could split it into multiple jobs using range requests, similar to how chunked uploads work.
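
As a rough sketch of the range-request idea, the helper below splits a file of known size into the `Range` header values that separate jobs could each fetch. The function name `range_headers` and the chunk size are assumptions made for illustration. Only the `bytes=start-end` syntax, with an inclusive end offset, comes from the HTTP specification.

```python
def range_headers(total_size, chunk_size):
    """Return HTTP Range header values that cover total_size bytes in order.

    Both offsets in "bytes=start-end" are inclusive, so each chunk ends at
    start + chunk_size - 1, capped at the last byte of the file.
    """
    if total_size < 0 or chunk_size <= 0:
        raise ValueError("total_size must be >= 0 and chunk_size must be > 0")
    headers = []
    start = 0
    while start < total_size:
        end = min(start + chunk_size, total_size) - 1
        headers.append(f"bytes={start}-{end}")
        start = end + 1
    return headers
```

Each value would be sent as the `Range` header of one request. This assumes the total size is known up front (for example from a `Content-Length` response header) and that the remote server honours range requests, which not every server does.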

Event Timeline

Joe triaged this task as High priority. Dec 6 2023, 3:50 PM

We intend to take a stab at this during next week's MediaWiki CodeJam.

If anything, it will be a chance for me to refresh my knowledge of MediaWiki internals and for others to mock my ability to interact with a web frontend :P

Change 982757 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[mediawiki/core@master] Add job for upload from UploadFromUrl.

https://gerrit.wikimedia.org/r/982757

Change 983196 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[mediawiki/core@master] Allow async upload by url via the Api

https://gerrit.wikimedia.org/r/983196

What's the status of this task?

Change #1007344 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[mediawiki/core@master] Switch Special:Upload to use async upload-by-url

https://gerrit.wikimedia.org/r/1007344

What's the status of this task?

Hi, sorry for not replying earlier, but this work is kind of a side hustle for me. As you might have noticed, this stuff is far from my area of expertise :)

I have completed the basic work to make this work both in Special:Upload and in the API. While I think we can merge the patches that add asynchronous behaviour to the API as soon as I find a reviewer, the Special:Upload part will need someone to help me polish the UI.

Frontend web development is really neither my job nor my area of expertise.

With the API change, you should be able to upload large files by URL (up to 4 GB IIRC) without running into timeouts; that will also allow us to move file processing to Shellbox, making our infrastructure more secure.

Here is an example of a script that works with the async api: P58902

Would this be enough to unblock people who want to upload large files, while I try to polish the Special:Upload patch?

With the API change, you should be able to upload large files by URL (up to 4 GB IIRC) without running into timeouts; that will also allow us to move file processing to Shellbox, making our infrastructure more secure.

@Joe: How about 5 GiB per https://gerrit.wikimedia.org/r/1002813 ? See https://phabricator.wikimedia.org/T191804#9363066 for a discussion about the capacity implications of this change.

Once we've gone completely async and we've confirmed it works, it might be a good idea to increase the allowed size for files that are uploaded by URL.

As it stands, the synchronous process doesn't allow you to upload files larger than 2-3 GB, because it often times out before a larger file is processed. So even getting to the current official limit of 4 GB and making it work reliably seems like a good first step.

Change #982757 merged by jenkins-bot:

[mediawiki/core@master] Add job for upload from UploadFromUrl

https://gerrit.wikimedia.org/r/982757

Change #983196 merged by jenkins-bot:

[mediawiki/core@master] Allow async upload by url via the Api

https://gerrit.wikimedia.org/r/983196

Change #1007344 merged by jenkins-bot:

[mediawiki/core@master] Switch Special:Upload to use async upload-by-url

https://gerrit.wikimedia.org/r/1007344