
Asynchronous or chunked Special:Import
Open, Needs Triage, Public

Description

T383962 shows that Special:Import is not fit for purpose. It's not feasible to import thousands of revisions from Incubator in a single POST request. It's not safe to retry the request after it fails. The current situation has left us with an epic mess in the database of new wikis.

The solutions are either to use the job queue, or to segment the file on the client side and import it in chunks. The job queue solution has some scalability and performance challenges; the client-side solution appears to scale well.

To implement a pure job queue solution, we would need to accept a potentially large file in a single POST request, store it before the request times out, and then load the data into numerous jobs. The jobs could either download the whole file and filter it, or the initial POST request could segment the file and store it separately for each job. Segmenting and storing the file in a single request costs about as much as just importing it, so the asynchronous advantage is limited.

Client-side JS can read a file selected by the user with Blob.stream(), parse it, segment it, and post it in chunks to ApiImport. Importing a multi-gigabyte file would be feasible with this approach.
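A rough sketch of what the stream segmenter and upload controller could look like, assuming the existing action=import API would accept each chunk as a small standalone dump via its xml file parameter. The function names, the chunk size, and the naive string-based page splitting are all illustrative; a real implementation would use a streaming XML parser and carry over the dump's <siteinfo> header:

```
// Sketch only: read the selected file as a text stream, cut it into
// batches of complete <page> elements, and POST each batch in turn.
async function importInChunks( file, pagesPerChunk ) {
	const reader = file.stream()
		.pipeThrough( new TextDecoderStream() )
		.getReader();
	let buffer = '';
	let pages = [];
	for ( ;; ) {
		const { done, value } = await reader.read();
		if ( value !== undefined ) {
			buffer += value;
		}
		// Cut complete <page>...</page> elements out of the buffer.
		// (Stand-in for a proper streaming XML parser.)
		let end;
		while ( ( end = buffer.indexOf( '</page>' ) ) !== -1 ) {
			const start = buffer.indexOf( '<page' );
			pages.push( buffer.slice( start, end + '</page>'.length ) );
			buffer = buffer.slice( end + '</page>'.length );
			if ( pages.length >= pagesPerChunk ) {
				await postChunk( pages );
				pages = [];
			}
		}
		if ( done ) {
			break;
		}
	}
	if ( pages.length ) {
		await postChunk( pages );
	}
}

// POST one chunk to action=import, re-wrapped as a minimal dump. A real
// implementation would preserve the original <siteinfo> and pass any
// other parameters action=import requires (summary, interwiki prefix,
// etc.).
async function postChunk( pages ) {
	const xml = '<mediawiki>' + pages.join( '' ) + '</mediawiki>';
	const body = new FormData();
	body.append( 'action', 'import' );
	body.append( 'format', 'json' );
	body.append( 'token', mw.user.tokens.get( 'csrfToken' ) );
	body.append( 'xml', new Blob( [ xml ] ), 'chunk.xml' );
	const res = await fetch( mw.util.wikiScript( 'api' ), { method: 'POST', body } );
	return res.json();
}
```

Because each fetch() is awaited, the controller sends chunks strictly in sequence, which gives a natural hook for progress reporting and per-chunk retry.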

If the user's browser exits part-way through the import, we would want some way to recover, say by saving the current state to IndexedDB when each request is sent.
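For example, a minimal sketch of the resumable-session bookkeeping using raw IndexedDB; the database name, store name, and record shape are hypothetical:

```
// Open a hypothetical 'importSession' database with one 'sessions'
// store keyed by a session ID.
function openSessionDb() {
	return new Promise( ( resolve, reject ) => {
		const req = indexedDB.open( 'importSession', 1 );
		req.onupgradeneeded = () => {
			req.result.createObjectStore( 'sessions', { keyPath: 'id' } );
		};
		req.onsuccess = () => resolve( req.result );
		req.onerror = () => reject( req.error );
	} );
}

// Called just before each chunk is posted, so a reloaded page can see
// how far the previous attempt got.
function saveProgress( db, id, byteOffset, pagesDone ) {
	return new Promise( ( resolve, reject ) => {
		const tx = db.transaction( 'sessions', 'readwrite' );
		tx.objectStore( 'sessions' ).put( { id, byteOffset, pagesDone } );
		tx.oncomplete = () => resolve();
		tx.onerror = () => reject( tx.error );
	} );
}
```

On reload, the form could look up this record and offer to resume from the saved offset; since the page can't retain the File object across sessions, the user would re-select the same file.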

Components:

  • New form
  • Progress and message display UI
  • Resumable session
  • Upload controller
  • Stream segmenter
  • Limit revision count and file size in the legacy no-JS implementation

Event Timeline

Should we do it completely server side instead? e.g. reuse upload stash for this purpose.

> Should we do it completely server side instead? e.g. reuse upload stash for this purpose.

Why? I explained in the task description why I don't think it's a good idea.

Pppery subscribed.

Hopefully this will get done before the Flow uninstall, so I can use it to import the entire support desk, with history, to the archives, which I'm planning to do someday.

tstarling renamed this task from Asynchronous Special:Import to Asynchronous or chunked Special:Import. Jan 20 2025, 4:27 AM

(In theory it should be possible to support chunked file uploading in the import API, but then, rather than just reassembling the chunks as the upload API does, it would extract them into parallel jobs. Seems like a lot of effort for not much gain, though.)

MSantos added a project: patch-welcome.
MSantos added subscribers: Krinkle, daniel, MSantos.

We don't have a clear ownership definition for this module on the MediaWiki-Engineering side, so I'm adding more people to chime in on the discussion.

Proposals and patches you want us to review are welcome while we discuss this ownership matter internally.

Just a note that this has also been a recurring theme on Discord and on the mediawiki.org support page for third parties. New wikis in particular, which often have quite limited PHP resource usage limits defined, fall into this trap.