
Produce regular public dumps of Commons media files
Open, Needs Triage, Public

Description

Commons is a multimedia project; its primary dump should be its media.
This is not currently the case, which is confusing; and the absence of a media dump is glossed over everywhere dumps are discussed (or readers are linked to ancient dumps from ~2013).

A Commons media dump should be made again on a regular basis.

  1. It is urgent to have a current dump made and mirrored (_there is a deadline_, for our systems too)
  2. It is important to have a periodic update, at least annually
  3. Many Commons files are already in public archives (such as the Internet Archive or the Library of Congress) and include links to those archival sources in their metadata. The most useful dump would be of images that are not already sourced to one of those archives (a rough sketch of how such files might be identified follows this list).
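
On the third point, here is a minimal sketch, in Python against the public MediaWiki API, of how one might check whether a file's metadata already points at a known archive. The archive-domain list and the reliance on the extmetadata fields are illustrative assumptions, not an established scheme.

```python
# Rough sketch: check whether a Commons file's extended metadata already
# references a known public archive (Internet Archive, Library of Congress).
# The domain list and the string-matching heuristic are assumptions for
# illustration only.
import requests

API = "https://commons.wikimedia.org/w/api.php"
ARCHIVE_DOMAINS = ("archive.org", "loc.gov")  # hypothetical filter list

def is_archive_sourced(file_title: str) -> bool:
    """Return True if the file's extended metadata mentions a known archive."""
    params = {
        "action": "query",
        "titles": file_title,       # e.g. "File:Example.jpg"
        "prop": "imageinfo",
        "iiprop": "extmetadata",
        "format": "json",
    }
    pages = requests.get(API, params=params).json()["query"]["pages"]
    for page in pages.values():
        for info in page.get("imageinfo", []):
            meta = info.get("extmetadata", {})
            blob = " ".join(str(v.get("value", "")) for v in meta.values())
            if any(domain in blob for domain in ARCHIVE_DOMAINS):
                return True
    return False

if __name__ == "__main__":
    print(is_archive_sourced("File:Example.jpg"))
```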

As of 2021 we have regular backups for Commons media working (T262668#7332883 - thanks @jcrespo @ArielGlenn and others!) but no public dumps of them.
These backups are not world-readable, nor accessible to would-be mirrors.

Event Timeline

Peachey88 renamed this task from "Produce regular dumps of Commons media files." to "Produce regular public dumps of Commons media files.". Dec 30 2021, 11:38 PM
Reedy renamed this task from "Produce regular public dumps of Commons media files." to "Produce regular public dumps of Commons media files". Dec 31 2021, 12:44 AM

See also: T73405 (requesting a dump of resized images) and T53001 (Image tarball dumps on your.org are not being generated).

What is the limiting factor here? Is it that the backups are large, so they are hard to host? If backups are already being made, is it then just a question of pushing them somewhere public? If somebody offered storage for those backups, would that help move this issue forward?

What is the limiting factor here?

Mitar: I am not in charge of dumps, but, from an external (though I believe technically informed) perspective, there is at the moment one big issue with many ramifications.

The one I would personally point to as the main cause is the lack of any team at Wikimedia supporting the multimedia stack/Commons. When I worked on the backup setup, I found no one whose job was maintaining the Commons/MediaWiki file stack and who could help me achieve it (except people from other teams and volunteers giving their time outside of their duties), figure out certain things, or overcome the challenges found, such as corruption (T289996). When I tried to get someone to help me fix apparently simple issues that were leading to data corruption, I found out that no known team was in charge of that: see T290462.

I reflected this gap in support at https://www.mediawiki.org/wiki/Developers/Maintainers#MediaWiki_core to make it apparent to everyone.

Taking a "simple backup" took many more months than it should because this lack of support.

However, a backup is "easy" compared to a public dump- while I just get to "store everything" in a private place all together, public dumps need to discriminate between public and private images; public and private wikis, versioning, etc, (must be context-aware) and a lot of measures should be put in place to make sure people's privacy and safety is ensured (think of vandals uploading bad files or ones with private information). There also needs to be some kind of classification and guidance given on how to use those (export metadata). That's certainly not impossible, but it requires work- which is made very very very difficult (and to some extent, impossible) with the current (old) mediawiki metadata file architecture, which has been known for years that it is far from great- and there is a plan to improve it (T28741). For non-technical people- not having a primary key means there is currently no reliable way to identify individual files- meaning that creating a dump in a highly dynamic environment (where files are constantly being uploaded, deleted and renamed) is very very difficult. Data loss was mitigated thanks to prioritizing backups. But backups are not dumps, we are aware of it- and we also want to generate those. The steps are clear (even if long) to move forward- but we need the people to work on it!

Having said that, the lack of support for the media stack is not the only challenge. Our full backups at the moment hold 108 million files (there may be duplicates among them) and are almost 400 TB in size, and that is only the originals! Generating and distributing 400 TB of data among the many consumers likely to be interested in it will require some serious architectural design (compare this with serving 356 KB pages, or the 2.5 TB Wikidata exports). We have the in-house know-how to handle small files, but we will need to invest in new solutions for this (and we usually cannot, or choose not to, offload that to an external provider due to privacy concerns). We had to deploy new technologies for backups, but, for the reasons stated above, they are not public, and their performance needs are very limited (rarely read, usually by a single client). Even our media production cluster would have serious limitations if we were not able to heavily cache the most used images at our CDN locations. A dedicated project needs to be allocated to this (as was done for backups, which I have been working on intermittently since 2020).
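For a sense of scale, a back-of-the-envelope calculation; the sustained link speeds and the assumption of one full copy per consumer are illustrative only:

```python
# Rough numbers for distributing the ~400 TB of originals mentioned above.
# The sustained link speeds are assumptions for illustration.
TOTAL_BYTES = 400e12  # ~400 TB of original files

def days_per_full_copy(gbit_per_s: float) -> float:
    """Time to transfer one full copy at a sustained link speed, in days."""
    bytes_per_s = gbit_per_s * 1e9 / 8
    return TOTAL_BYTES / bytes_per_s / 86400

for speed in (1, 10):  # Gbit/s, hypothetical sustained rates
    print(f"{speed:>2} Gbit/s sustained: ~{days_per_full_copy(speed):.0f} days per full copy")
# ~37 days at 1 Gbit/s, ~4 days at 10 Gbit/s, per consumer, before any
# caching, mirroring or seeding strategy is applied.
```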

Not everything is bad news: I believe the documentation work I did as part of implementing backups could speed up a future dumps implementation, and also inform a rearchitecture of file management.

I beg you to please not harass any employee about this (not saying you would, but others could, or have in the past), because this situation also hurts our ability to work effectively _a lot_ (e.g. if we are attending to outages caused by file-related bugs, we are not improving the site in other areas!). Plus, we regularly get (understandably) unhappy people reporting problems with files, which get no answer, and that personally disappoints me (I am personally invested in making Commons succeed). I invite you, however, to share our concerns and make your voice heard through official channels, so that someone soon makes Commons and file management in general a higher priority for the Foundation!

I see. Thank you so much for the detailed update. This helps a lot in understanding things.

I am not sure whether what you are describing is then an argument for making such dumps externally, by somebody else? E.g., by scraping the whole of Wikimedia Commons (downloading ~333 TB of data in the process) and uploading it somewhere. It looks like this could address the issues of public data (only what is public can be downloaded) and privacy. I am not sure, though, whether downloading everything is viable: that would be a lot of requests, and if one is limited to a single request at a time without parallelization (which I think is the policy for scraping Wikipedia content), it could take a while (years, by a quick estimate). I am also not sure whether one can apply for higher rate limits for this (as it is not really API use, just file downloading), or where/how to do so.
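
For what it is worth, a rough calculation along those lines, using the 108 million files / ~333 TB figures from above, lands in the same "years" range; the per-request overhead and download speed below are assumptions for illustration only:

```python
# Quick estimate for fetching ~108 million originals (~333 TB) strictly one
# request at a time. Overhead and speed figures are illustrative assumptions.
NUM_FILES = 108e6            # file count from the earlier comment
TOTAL_BYTES = 333e12         # ~333 TB of originals
REQUEST_OVERHEAD_S = 0.5     # assumed latency/politeness delay per request
DOWNLOAD_MBIT_S = 100        # assumed sustained single-stream speed

avg_file_bytes = TOTAL_BYTES / NUM_FILES                               # ~3 MB/file
per_file_s = REQUEST_OVERHEAD_S + avg_file_bytes / (DOWNLOAD_MBIT_S * 1e6 / 8)
total_years = NUM_FILES * per_file_s / (86400 * 365)
print(f"~{avg_file_bytes/1e6:.1f} MB/file, ~{per_file_s:.2f} s/file, "
      f"~{total_years:.1f} years total")
# Roughly 2.6 years under these assumptions, consistent with the estimate above.
```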