Upload by URL should use the job queue, possibly chunked with range requests
Closed, Resolved (Public)

Description

In T118887#2233280, @brion wrote:

I would probably recommend retooling the upload via URL to run via the job queue, like long-running video transcodes do. This would allow downloads to take longer if needed, without the possibility of DoSing the app servers. This is a bigger task though...

But more to my point: what should change is the fact that upload-by-url can take up to 200 seconds to complete a single HTTP request.
HTTP requests should terminate reasonably quickly and the work should be done in the background via the jobqueue, and the client should have a way of polling it.
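To make the "client polls for the result" idea concrete, here is a minimal sketch of the client side. It is not the MediaWiki API: `check_status` is a hypothetical callable standing in for whatever status request the client would make, and the `state` values (`queued`, `running`, `done`, `failed`) are assumptions chosen for illustration.

```python
import time
from typing import Callable, Dict


def poll_until_done(
    check_status: Callable[[], Dict[str, str]],
    interval: float = 5.0,
    timeout: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Dict[str, str]:
    """Call check_status() until it reports a terminal state.

    check_status is a hypothetical stand-in for a short HTTP request that
    asks the server how the background job is doing. Each call returns
    quickly; the long-running work happens in the job queue.
    """
    deadline = clock() + timeout
    while True:
        status = check_status()
        # "done" and "failed" are assumed terminal states for this sketch.
        if status.get("state") in ("done", "failed"):
            return status
        if clock() >= deadline:
            raise TimeoutError("background upload did not finish in time")
        sleep(interval)
```

The point of the design is that no single HTTP request stays open for the duration of the download: the client makes many short requests, each of which terminates quickly.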

I also discussed this with @tstarling recently: if a single long-running job doesn't work, we could split it into multiple jobs using range requests, similar to how chunked uploads work.
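As a rough illustration of the range-request idea, the sketch below splits a file of known size into inclusive byte ranges, one per job, and formats the standard HTTP `Range` header for each. The function names and the chunk size are hypothetical, and this assumes the remote server reports a size and honours range requests; it is not how the merged patches work.

```python
from typing import Iterator, Tuple


def byte_ranges(total_size: int, chunk_size: int) -> Iterator[Tuple[int, int]]:
    """Yield inclusive (start, end) byte offsets covering total_size bytes."""
    if total_size < 0 or chunk_size <= 0:
        raise ValueError("total_size must be >= 0 and chunk_size must be > 0")
    start = 0
    while start < total_size:
        # HTTP byte ranges are inclusive, so the last byte is end, not end + 1.
        end = min(start + chunk_size, total_size) - 1
        yield start, end
        start = end + 1


def range_header(start: int, end: int) -> str:
    """Format the value of an HTTP Range header for one chunk."""
    return f"bytes={start}-{end}"


if __name__ == "__main__":
    # Hypothetical example: a 10-byte file fetched in 4-byte chunks.
    for start, end in byte_ranges(10, 4):
        print(range_header(start, end))
```

Each yielded range could then be handed to its own job, much like the chunks of a chunked upload.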

Event Timeline

Joe triaged this task as High priority. Dec 6 2023, 3:50 PM

We intend to try to take a stab at this during next week's MediaWiki CodeJam.

If anything, it will be a chance for me to refresh my MediaWiki internals knowledge and for others to mock my ability to interact with a web frontend :P

Change 982757 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[mediawiki/core@master] Add job for upload from UploadFromUrl.

https://gerrit.wikimedia.org/r/982757

Change 983196 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[mediawiki/core@master] Allow async upload by url via the Api

https://gerrit.wikimedia.org/r/983196

What's the status of this task?

Change #1007344 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[mediawiki/core@master] Switch Special:Upload to use async upload-by-url

https://gerrit.wikimedia.org/r/1007344

What's the status of this task?

Hi, sorry for not replying earlier, but this work is kind of a side hustle for me. As you might have noticed, this stuff is far from my area of expertise :)

I have completed the basic work to make this work both in Special:Upload and in the API. While I think we can merge the patches that add asynchronous behaviour to the API as soon as I find a reviewer, the Special:Upload part will need someone to help me polish the UI.

Frontend web development is neither my job nor my area of expertise.

With the API change, you should be able to upload large files by URL (up to 4 GB IIRC) without running into timeouts; it will also allow us to move file processing to Shellbox, making our infrastructure more secure.

Here is an example of a script that works with the async api: P58902

Would this be enough to unblock people who want to upload large files, while I try to polish the Special:Upload patch?

With the API change, you should be able to upload large files by URL (up to 4 GB IIRC) without running into timeouts; it will also allow us to move file processing to Shellbox, making our infrastructure more secure.

@Joe: How about 5 GiB per https://gerrit.wikimedia.org/r/1002813 ? See https://phabricator.wikimedia.org/T191804#9363066 for a discussion about the capacity implications of this change.

Once we've gone completely async and we've confirmed it works, it might be a good idea to increase the allowed size for files that are uploaded by URL.

As-is, the synchronous process doesn't allow you to upload files larger than 2-3 GB because it often times out before a larger file is processed. So even getting to the current official limit of 4 GB and making it work reliably seems like a good first step.

Change #982757 merged by jenkins-bot:

[mediawiki/core@master] Add job for upload from UploadFromUrl

https://gerrit.wikimedia.org/r/982757

Change #983196 merged by jenkins-bot:

[mediawiki/core@master] Allow async upload by url via the Api

https://gerrit.wikimedia.org/r/983196

Change #1007344 merged by jenkins-bot:

[mediawiki/core@master] Switch Special:Upload to use async upload-by-url

https://gerrit.wikimedia.org/r/1007344

Let's enable this in the Beta Cluster, if anyone feels like it.

Change #1024731 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/mediawiki-config@master] Enable async upload-by-URL on testwiki

https://gerrit.wikimedia.org/r/1024731

Change #1024731 merged by jenkins-bot:

[operations/mediawiki-config@master] Enable async upload-by-URL on testwiki

https://gerrit.wikimedia.org/r/1024731

Change #1025790 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/mediawiki-config@master] Enable async upload-by-URL via jobqueue on commons

https://gerrit.wikimedia.org/r/1025790

Change #1025790 merged by jenkins-bot:

[operations/mediawiki-config@master] Enable async upload-by-URL via jobqueue on testwiki

https://gerrit.wikimedia.org/r/1025790

Mentioned in SAL (#wikimedia-operations) [2024-05-13T13:47:20Z] <logmsgbot> lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:1025790|Enable async upload-by-URL via jobqueue on testwiki (T295007)]]

Mentioned in SAL (#wikimedia-operations) [2024-05-13T13:49:46Z] <logmsgbot> lucaswerkmeister-wmde@deploy1002 hnowlan and lucaswerkmeister-wmde: Backport for [[gerrit:1025790|Enable async upload-by-URL via jobqueue on testwiki (T295007)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2024-05-13T14:12:30Z] <logmsgbot> lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:1025790|Enable async upload-by-URL via jobqueue on testwiki (T295007)]] (duration: 25m 09s)

Change #1031028 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/mediawiki-config@master] Enable async jobqueue-powered URL uploads on commons

https://gerrit.wikimedia.org/r/1031028

Testing of this feature on testwiki has been successful so far, and we are hoping to enable it on commons tomorrow in the UTC backport window. If there are any large URL uploads waiting on this rollout, it would be very handy to use them for testing.

Change #1031028 merged by jenkins-bot:

[operations/mediawiki-config@master] Enable async jobqueue-powered URL uploads on commons

https://gerrit.wikimedia.org/r/1031028

Mentioned in SAL (#wikimedia-operations) [2024-05-16T13:14:11Z] <jsn@deploy1002> Started scap: Backport for [[gerrit:1031028|Enable async jobqueue-powered URL uploads on commons (T295007)]]

Mentioned in SAL (#wikimedia-operations) [2024-05-16T13:16:52Z] <jsn@deploy1002> jsn and hnowlan: Backport for [[gerrit:1031028|Enable async jobqueue-powered URL uploads on commons (T295007)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2024-05-16T13:32:29Z] <jsn@deploy1002> Finished scap: Backport for [[gerrit:1031028|Enable async jobqueue-powered URL uploads on commons (T295007)]] (duration: 18m 18s)

This feature is now live on commons and appears to be functioning correctly. Please update ASAP if you notice anything amiss. I'd be particularly interested to know about success or failure of large file uploads.

hnowlan claimed this task.

I am going to resolve this ticket for now. Please reopen it if you notice any issues.

I’m not aware of any issues, but if this is working (great!), then we should probably enable it by default in MediaWiki, and everywhere in production? Right now AFAICT $wgEnableAsyncUploadsByURL still defaults to false in core and is only enabled in commonswiki, testwiki and the Beta Cluster in wmf-config.