
Functional replacement for importImages.php on Kubernetes
Open, Needs Triage, Public

Description

The current system for server-side uploads involves getting media files onto the maintenance host, then running importImages.php to upload them from the local filesystem. On Kubernetes, that won't work.

The small reason is that presently mwscript-k8s doesn't support copying files into the MediaWiki pod for maintenance scripts to read. But even when it does (T376230) the big reason is that these media files are too large (up to 5 GB per file, sometimes up to dozens of gigabytes in a single import) for that to work. We'd have to get the media file onto the deployment host, then copy it from there to the worker, then complete the import -- if we can even set up a big enough volume backed by kubelet-managed disk. So we probably need to invent something different.

Event Timeline

sometimes dozens of gigabytes

The current maximum size of a file that MediaWiki/Swift can handle is 5GB. But see T191802 which proposes to increase it further.

Thanks -- that was in reference to:

The mediafiles can be very large – I've certainly uploaded files that had dozens of GBs in total. As long as mwmaint had enough space and sufficient sleep was allowed for videoscalers to catch up, things worked well.

On a closer look I think that was referring to a single batch of files that added up to dozens of gigs. Thanks for flagging.

sometimes dozens of gigabytes

The current maximum size of a file that MediaWiki/Swift can handle is 5GB. But see T191802 which proposes to increase it further.

Thanks for the note! You're correct that 5 GB is the maximum size per file. However, importImages.php supports importing an arbitrary number of files. If one imports 10 files of 4 GB each, they would need 40 GB of space (this is what was meant by "dozens of gigabytes" in my original message from T341553#10217525).

Of course, one solution to that is to only allow importing one file at a time. While that is possible, it would grossly degrade the user experience. Plus, it would still mean the file gets copied three times: once from its original location to the deployment host, once from the deployment host to the pod, and once from the pod to swift.

I have thought of a few options for this:

  • I think the easiest solution to this problem is a two-step process, though it involves changing the script we use quite a bit. We would need a program that first uploads the file(s) to a special swift container, then generates output we can pass to a modified importImages.php, which can then just copy the file over to the right swift container and/or download it locally for consumption. The original program would not need to be MediaWiki-based, or even written in PHP.
  • As an alternative, we can create a new version of "mwmaint" that is actually a k8s worker with specific labels, and allow mounting specific directories of it into a pod. This second solution has some limitations, but also the advantage that the workflow wouldn't change much.
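To make the first option more concrete, here is a minimal sketch of the staging side of that two-step flow. The container name, the manifest format, and the idea that a modified importImages.php would consume such a manifest are all assumptions for illustration, not an existing design:

```python
import json

# Hypothetical staging container name -- not an existing container.
STAGING_CONTAINER = "mw-import-staging"

def build_manifest(files):
    """Build the output the (hypothetical) modified importImages.php
    would read to locate each staged file in swift."""
    return json.dumps(
        {
            "container": STAGING_CONTAINER,
            "objects": [
                {"name": f, "swift_path": f"{STAGING_CONTAINER}/{f}"}
                for f in files
            ],
        },
        indent=2,
    )

if __name__ == "__main__":
    # In the real flow, each file would first be uploaded to the staging
    # container (e.g. with the swift CLI) before emitting the manifest.
    print(build_manifest(["Example_video.webm", "Scan_001.tiff"]))
```

The point of the split is that only the manifest (a few hundred bytes) has to travel through mwscript-k8s, while the media bytes go straight from the uploader to swift.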

And finally, by far my favourite option:

  • Given that uploads by URL are now async, just raise the file size limit for those to more than 5 GB and stop doing server-side uploads, an archaic idea that generates toil for a small and precious slice of our community members.

I should add, this is yet another example of how the unmaintained and substantially abandoned parts of MediaWiki, like the file uploads and manipulation stack, are a constant source of technical debt and slow down every single migration/change we want to make to MediaWiki.

By the time we've resolved this issue, we, as the SRE team migrating MediaWiki to k8s, will have dedicated more engineering to file uploads in core than anyone else has in a decade, and probably delayed the migration by at least a quarter.

[...]
And finally, by far my favourite option:

  • Given that uploads by URL are now async, just raise the file size limit for those to more than 5 GB and stop doing server-side uploads, an archaic idea that generates toil for a small and precious slice of our community members.

Can you clarify this, please? I do not understand how raising this limit would help avoid server side uploads. Unless I'm missing something, server side uploads aren't here to bypass the per-file size limit – that cannot be done, as swift-level limitations kick in regardless of how the file is uploaded. Instead, server side uploads are here to allow bypassing bugs that prevented the upload from finishing.

For example, video2commons (as one of the major sources of Server-side-upload-request) first attempts to upload the file itself. If it receives an error such as backend-fail-internal, it prompts the user to request a server side upload instead (see source for more details). Of course, if we can identify and fix all of those bugs, that would be amazing and far better than figuring out how to continue doing server side uploads.

[...]
And finally, by far my favourite option:

  • Given that uploads by URL are now async, just raise the file size limit for those to more than 5 GB and stop doing server-side uploads, an archaic idea that generates toil for a small and precious slice of our community members.

Can you clarify this, please? I do not understand how raising this limit would help avoid server side uploads.

One of the main limitations we currently have is that the maximum file size allowed via any method of upload is 2 GB, IIRC. If that weren't the case, why would we even have a server-side upload process? I'm fairly certain it wasn't born to "cover for bugs" in the interface.

Unless I'm missing something, server side uploads aren't here to bypass the per-file size limit – that cannot be done, as swift-level limitations kick in regardless of how the file is getting uploaded. Instead, server side uploads are here to allow bypassing bugs that prevented the upload from finishing.

I wasn't referring to the swift limits but to the ones in the interface: the goal of server side uploads is explicitly declared as being to overcome the per-file size limit, from the 2 GB currently allowed by the web interface to 5 GB: https://commons.wikimedia.org/wiki/Help:Server-side_upload.

That documentation isn't quite accurate. The goal of server-side uploads as they are used today is to work around the fact that uploads of large files are flaky for various reasons, and more likely to flake the larger the file gets.

That documentation isn't quite accurate. The goal of server-side uploads as they are used today is to work around the fact that uploads of large files are flaky for various reasons, and more likely to flake the larger the file gets.

What I'm failing to understand, and it might be for lack of knowledge of the server-side upload process, is how that could succeed now if upload-by-url fails. We've removed the biggest limitation that was specific to that process by making it asynchronous, so I'd like to understand what works server-side and doesn't work using upload by url from the web interface.

For example, video2commons (as one of the major sources of Server-side-upload-request) first attempts to upload the file itself. If it receives an error such as backend-fail-internal, it prompts the user to request a server side upload instead (see source for more details). Of course, if we can identify and fix all of those bugs, that would be amazing and far better than figuring out how to continue doing server side uploads.

I finally went to look at the code, and video2commons specifically checks that a file is under 5 GB, but doesn't check that it's below the current upload limit. So it is expected to fail if the file is too large, and I fully expect our API to return a not-very-informative error to the user in that case. I would investigate further, but I've seen enough of that code. To be very clear, though: the intent of the video2commons authors seems, as far as I can tell, to be overcoming the size limit, not generically fixing bugs in the software with manual labor, as you seem to suggest.

Again, it's not that I object to server side uploads because they require a lot of work from us; rather, I'd like to understand how much this process is really needed anymore, whether we're only still doing it because no one pays enough attention to media uploads, and whether we could just raise the limit in the public interface.

That documentation isn't quite accurate. The goal of server-side uploads as they are used today is to work around the fact that uploads of large files are flaky for various reasons, and more likely to flake the larger the file gets.

What I'm failing to understand, and it might be for lack of knowledge of the server-side upload process, is how that could succeed now if upload-by-url fails. We've removed the biggest limitation that was specific to that process by making it asynchronous, so I'd like to understand what works server-side and doesn't work using upload by url from the web interface.

Upload-by-url only allows one to upload files from an allowlist of sites. If you want to upload a file as a one-off, it's harder to make use of, as you need to upload the file somewhere, request allowlisting, and then upload the file (and possibly request delisting, although that is not strictly needed). I'm not 100% sure what the purpose of allowlisting is here.

I see @RoyZuo requested allowlisting video2commons, which was done. I'm not sure whether video2commons attempted to use upload-by-url at some point, and if so, whether there were any issues with that. Maybe @RoyZuo can share that information?

I finally went to look at the code, and video2commons specifically checks a file is under 5 GB but doesn't check it's below the current upload limit. So it is expected to fail if the file is too large. I fully expect our api to give back to the user a not-very informative error in that case.

I don't think that's true. In theory, MediaWiki's API supports uploads of any file size (up to 5 GB), even without using upload-by-url. The only limitation I'm aware of here is that only files up to 100 MB can be uploaded "at once" (in one piece). Files above 100 MB but below 5 GB can be uploaded as well, but they need to be uploaded in chunks; see the docs on chunked uploading.

Chunked uploading is something we're making use of in our own code (namely, UploadWizard), and it is a capability video2commons attempts to make use of as well. As far as I can see, video2commons works like this:

  1. If a file is <100 MB, upload it normally, that is without chunking and without server-side upload as a fallback (check in source code)
  2. If a file is >100 MB, but <5GB, attempt to upload it via chunked uploading (check for this is less visible, see determination of chunk size in video2commons and actual chunking happening in Pywikibot, a library video2commons uses internally to upload files).
  3. If chunked uploading fails, request a server side upload.
  4. If a file is >5GB, refuse it outright.
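The decision flow above amounts to something like the following. This is a paraphrase of the numbered list, not video2commons's actual code; the two threshold constants are the ones discussed in this thread:

```python
# Thresholds from the discussion: ~100 MB single-request cap,
# 5 GB MediaWiki/Swift hard limit per file.
DIRECT_LIMIT = 100 * 1024 ** 2
SWIFT_LIMIT = 5 * 1024 ** 3

def choose_upload_path(size_bytes, chunked_failed=False):
    """Return which path the flow above takes for a given file size."""
    if size_bytes > SWIFT_LIMIT:
        return "reject"           # step 4: refuse outright
    if size_bytes < DIRECT_LIMIT:
        return "direct"           # step 1: one-shot upload
    if chunked_failed:
        return "server-side"      # step 3: fallback request
    return "chunked"              # step 2: chunked uploading
```

Framed this way, the "server-side" branch is reachable only when chunked uploading has already failed, which is the point being made below.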

In theory, step 3 shouldn't be triggered at all, because MediaWiki is supposed to be able to upload files of any size (up to 5GB) when chunking is used. However, due to bugs in MediaWiki, this is not the case, and server side uploads are getting requested. Those server side uploads are requested not because there is no way for video2commons to upload the file by itself, but because the documented way of doing so has failed. Hence, the purpose of those server side uploads is indeed to overcome bugs, rather than to overcome file size limits. Considering chunked uploading is a generic technique that anything that uploads files to Commons can use, I'm fairly certain other server side uploads (besides those from video2commons) have the same purpose.

I would investigate further but I've seen enough of that code - but to be very clear here the intent of the authors of video2commons seems to be, as far as I can tell, being able to overcome the size limit and not generically fixing with manual labor the bugs in the software as you seem to suggest.

The intention of the video2commons authors indeed is to overcome the size limits. As I showed above, server side uploads are only used when chunking (as the documented way of overcoming the size limits) fails to do its job. That being said, the intention of the video2commons authors probably isn't to fix issues that originate in MediaWiki, and those issues are being overcome with server side uploads. Hope this clarifies.

Again, it's not that I object to server side uploads because they require a lot of work from us; rather, I'd like to understand how much this process is really needed anymore, whether we're only still doing it because no one pays enough attention to media uploads, and whether we could just raise the limit in the public interface.

Agreed. I would love any removable manual workflow to be removed, especially if said workflow requires server access. I only disagree on why server side uploads are currently needed. As far as I can see, they are needed to bypass MediaWiki bugs (which prevent chunked uploads from succeeding). You mention some kind of file limits, but I'm not aware of any limits (besides the swift-imposed 5 GB cap) that would prevent chunked uploading from working. If I'm missing the existence of such limits, I would appreciate being pointed to them.

I'm not 100% sure what the purpose of allowlisting is here.

I think the point is to allow people to upload only from sites known to have free licenses to prevent upload-by-url from being misused as a vector for copyvios.

I'm not 100% sure what the purpose of allowlisting is here.

I think the point is to allow people to upload only from sites known to have free licenses to prevent upload-by-url from being misused as a vector for copyvios.

It's a security measure, see T65961#679911. Not sure how many of those points are still relevant a decade later.

Change #1084279 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/puppet@production] scap: Exclude importImages from mwscript deprecation warning

https://gerrit.wikimedia.org/r/1084279

(For the avoidance of doubt: We'll need some form of solution to this problem before turning off the mwmaint hosts, but I'm not working on it as a mwscript-k8s feature.)

Change #1084279 merged by RLazarus:

[operations/puppet@production] scap: Exclude importImages from mwscript deprecation warning

https://gerrit.wikimedia.org/r/1084279

I see @RoyZuo requested allowlisting video2commons, which was done. I'm not sure whether video2commons attempted to use upload-by-url at some point, and if so, whether there were any issues with that. Maybe @RoyZuo can share that information?

It was a desperate attempt to see whether upload-by-url could work, because apparently the video file had been uploaded to the toolforge server but could not be transferred to the wiki servers.

I am old enough to remember the days when only Special:Upload existed. The only way to upload any relatively large file was a server-side import.

Nowadays, we have chunked uploading, upload-by-url, and it is even managed directly by commons admins (T300407, thanks @taavi !)

It _shouldn't_ be needed to bother sysadmins with server-side-uploads. That should really be exceptional (if someone shipped a physical drive full of files to a dc, maybe).

I think we should focus, for any request received, on _why_ those uploads are failing. Hopefully, that would allow us to fix a general problem, and make the feature no longer needed.

@RoyZuo: are those files still failing? If so, I would recommend opening a task to investigate for such failure.

Hi, I was told to add some comments here. Today I used importImages.php to recover 4 files. This is a common occurrence (the losing of files, that is, not the use of importImages.php): we lose files regularly (in particular, mw loses files, not Swift), as frequently as once every week. This is a summary of past identified cases: T289996. This is known and there are plans to address it: T271530 (the cause is unsafe practices whenever a file is moved, deleted, restored, etc. at the app level). There are approximately 100K lost files ATM (to be fair, there are far fewer actually lost files; many of those are just invalid references to files we never hosted). Thanks to backups, I don't think we have irrevocably lost any file since these incidents started happening.

Normally, the reupload workflow is preferred because I back up MediaWiki files (that was the design decision by the committee at the time, not something I decided), not Swift. So I can recover a file to mw, but don't have enough data to recover Swift (that would require a complete backup strategy). That means using mw to reupload the file as a new entry, rather than trying to attack Swift directly. Normally this can happen through the normal upload form, but in the case at T393049 that didn't work, because it said the file was a duplicate. Using the script worked. I know this is not an infrastructure issue, but sadly, as far as I am aware, no other team cares about losing data. I have no API to check deleted or archived files, no upload API, and no facilities to make my life easier in general. Hence the usage of this script. I know this is not what you wanted to hear, but this is the current state of things. :-/

I am happy to implement any recovery strategy required of me by the product owners (the Commons file upload maintainers), but this is what we have ATM. Give me an API to know where to put stuff and I can handle the swift recovery myself; what I refuse to do is manually handle recoveries and blindly write to swift expecting "it works", given how dynamic mw file handling is (e.g. a lost file won't be on the same path as a week later or a month earlier).

this is yet another example of how the unmaintained and substantially abandoned parts of MediaWiki, like the file uploads and manipulation stack, are a constant source of technical debt and slow down every single migration/change we want to make to MediaWiki

I cannot agree more – now imagine having to back that up :-D. Trying to get some sympathy here. But I don't need the script; I just need a way to recover stuff, because for me mw is an opaque box that I cannot hope to understand.

[A brief aside: data-persistence sometimes need to use this script for restoring images we had to fish out of backups (or one of the ms clusters if an image was only uploaded to one); we could "just" upload the image to ms directly with the swift CLI tool, but have avoided doing so in the past because it's not clear what (if any) metadata (either in swift or in MW) would also need to be updated for this to work.]

Hello hello! I heard @Clement_Goubert saying that "The "old" foreachwiki/mwscript wrappers WILL be deprecated completely soon". As far as I understand things, removing the mwscript wrapper (and the ability to run maint. scripts directly on deploy1003) will make it impossible to use importImages.php. What are the plans with this task?

(For the avoidance of doubt: We'll need some form of solution to this problem before turning off the mwmaint hosts, but I'm not working on it as a mwscript-k8s feature.)

It seems @RLazarus previously said this will need to be somehow-solved before turning off the mwmaint hosts (which has already happened). Is there any kind of timeline/solution on hand? I would hate to lose the capability to run that script altogether, as it is very useful from time to time.

Hello hello! I heard @Clement_Goubert saying that "The "old" foreachwiki/mwscript wrappers WILL be deprecated completely soon". As far as I understand things, removing the mwscript wrapper (and the ability to run maint. scripts directly on deploy1003) will make it impossible to use importImages.php. What are the plans with this task?

(For the avoidance of doubt: We'll need some form of solution to this problem before turning off the mwmaint hosts, but I'm not working on it as a mwscript-k8s feature.)

It seems @RLazarus previously said this will need to be somehow-solved before turning off the mwmaint hosts (which has already happened). Is there any kind of timeline/solution on hand? I would hate to lose the capability to run that script altogether, as it is very useful from time to time.

The solution that was chosen in order to turn off mwmaint was to allow running non-k8s mwscript on deployment hosts.

Given the risk of leaving foreachwiki available outside of the mwscript-k8s flow (the creation of huge amounts of kubernetes objects, causing saturation of the internal k8s etcd database), and the fact that, despite communication, I still see old-style mwscript invocations from time to time, I think we need to enforce a little more strictly that, except for importImages.php, scripts be run with mwscript-k8s.

I'll need to discuss timeline and possible solutions with @RLazarus