
Produce regular public dumps of Commons media files
Open, Needs Triage, Public

Description

Commons is a multimedia project; its primary dump should be its media.
This is not currently the case, which is confusing; and the lack of a media dump goes unmentioned everywhere dumps are discussed (or readers are pointed to ancient dumps from ~2013).

A Commons media dump should be made again on a regular basis.

  1. It is urgent to have a current dump made and mirrored (_there is a deadline_, for our systems too)
  2. It is important to have a periodic update, at least annually
  3. Many Commons files are already in public archives (such as the Internet Archive or the Library of Congress) and include links to those archival sources in their metadata. The most useful dump would be of images that are not already sourced to one of those archives.

As of 2021 we have regular backups for Commons media working (T262668#7332883 - thanks @jcrespo @ArielGlenn and others!) but no public dumps of them.
These backups are not world-readable, nor accessible to would-be mirrors.

Event Timeline

Peachey88 renamed this task from "Produce regular dumps of Commons media files." to "Produce regular public dumps of Commons media files.". (Dec 30 2021, 11:38 PM)
Reedy renamed this task from "Produce regular public dumps of Commons media files." to "Produce regular public dumps of Commons media files". (Dec 31 2021, 12:44 AM)

See also: T73405 (requesting a dump of resized images) and T53001 (Image tarball dumps on your.org are not being generated).

What is the limiting factor here? That the backups are large, so it is hard to host them? If backups are already being made, is it then just a question of pushing them somewhere? If somebody offered storage for those backups, would that help move this issue forward?

What is the limiting factor here?

Mitar: I am not in charge of dumps, but from an external (though, I believe, technically informed) perspective, there is at the moment one big issue that has many ramifications.

The one I would personally point to as the main cause is the lack of any team at Wikimedia supporting the multimedia stack/Commons. When I worked on the backup setup, I found no one whose job it was to maintain the Commons/MediaWiki file stack and who could help me achieve it (except people from other teams and volunteers offering their time outside their duties), figure certain things out, or overcome the challenges encountered, such as corruption (T289996). When I tried to get someone to help me fix apparently simple issues that were leading to data corruption, I found out that no known team was in charge of that: see T290462

I reflected this gap in support at https://www.mediawiki.org/wiki/Developers/Maintainers#MediaWiki_core to make it apparent to everyone.

Taking a "simple backup" took many more months than it should because this lack of support.

However, a backup is "easy" compared to a public dump: while I just get to "store everything" together in a private place, public dumps need to discriminate between public and private images, public and private wikis, versioning, etc. (they must be context-aware), and a lot of measures have to be put in place to make sure people's privacy and safety are ensured (think of vandals uploading bad files or files containing private information). There also needs to be some kind of classification and guidance given on how to use those (export metadata). That is certainly not impossible, but it requires work, which is made very, very difficult (and to some extent impossible) by the current (old) MediaWiki file metadata architecture, which has been known for years to be far from great; there is a plan to improve it (T28741). For non-technical people: not having a primary key means there is currently no reliable way to identify individual files, so creating a dump in a highly dynamic environment (where files are constantly being uploaded, deleted and renamed) is very, very difficult. Data loss was mitigated thanks to prioritizing backups. But backups are not dumps; we are aware of that, and we also want to generate dumps. The steps to move forward are clear (even if long), but we need the people to work on it!

Having said that, the lack of support for the media stack is not the only challenge: our full backups at the moment contain 108 million files (there may be duplicates among them) and are almost 400TB in size, and that is only the originals! Generating and distributing 400TB of data among the many consumers that will likely be interested in it will still require some serious architectural design (e.g. compared to serving 356 KB pages, or 2.5TB Wikidata exports). We have the in-house know-how for handling small files, but we will need to invest in new solutions for these (and we usually cannot, or choose not to, offload that to an external provider due to privacy concerns). We had to deploy new technologies for backups but, for the reasons stated above, they are not public, and their performance needs are very limited (rarely read, usually by a single client). Even our media production cluster could have serious limitations if we were not able to heavily cache the most used images at our CDN locations. A dedicated project needs to be allocated to this (like it was for backups, on which I have been working intermittently since 2020).

Not everything is bad news: I believe the documentation work I did as part of implementing backups could speed up a future dumps implementation, and also inform a file-management rearchitecture.

I beg you to please not harass any employee about this (not saying you would, but others could, or have in the past), because our ability to work effectively is also hurt _a lot_ by this situation (e.g. if we are attending to outages caused by file-related bugs, we are not improving the site in other areas!). Plus we regularly get (understandably) unhappy people reporting problems with files, which get no answer, and that personally disappoints me (I am personally invested in making Commons succeed). I invite you, however, to join in our concerns and make your voice heard through official channels, so that someone soon makes Commons and file management in general a higher priority for the Foundation!

I see. Thank you so much for the detailed update. It helps a lot in understanding things.

I am not sure if what you are describing is then an argument for such dumps being made externally by somebody else? E.g. by scraping the whole of Wikimedia Commons (downloading 333 TB of data in progress) and uploading it somewhere. It looks like this could address the issues of public data (only public data can be downloaded) and privacy. I am not sure, though, whether downloading everything is viable: that would be a lot of requests, and if one is limited to a single request at a time without parallelization (which is what I think the policy for scraping Wikipedia content is), it could take a while (years, by a quick estimation). I am also not sure whether one can apply for higher rate limits for this (as it is not really API use, but just file downloading), and where/how.
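A back-of-envelope sketch of that quick estimation (in Python; the throughput and per-request overhead figures are assumptions picked purely for illustration, and only the ~108 million files / ~400 TB totals come from the backup numbers quoted above):

```python
# Rough serial-scrape estimate. Only TOTAL_FILES and TOTAL_BYTES come from the
# backup figures mentioned in this task; the rest are illustrative assumptions.
TOTAL_FILES = 108_000_000            # ~108 million originals
TOTAL_BYTES = 400 * 10**12           # ~400 TB of originals

THROUGHPUT_BPS = 50 * 10**6          # assumed 50 MB/s sustained on one connection
PER_REQUEST_OVERHEAD_S = 0.5         # assumed latency + politeness delay per file

transfer_s = TOTAL_BYTES / THROUGHPUT_BPS
overhead_s = TOTAL_FILES * PER_REQUEST_OVERHEAD_S
total_s = transfer_s + overhead_s

print(f"transfer alone:    {transfer_s / 86_400:6.0f} days")
print(f"per-file overhead: {overhead_s / 86_400:6.0f} days")
print(f"total:             {total_s / (86_400 * 365):6.1f} years")  # roughly 2 years
```

Even with generous assumptions, a strictly serial scrape lands in the multi-year range, and it is dominated by per-file overhead rather than raw bandwidth.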

Hello @jcrespo and @Mitar -- thanks, this does clarify.

  1. @jcrespo wrote:

I invite you, however, to join in our concerns and make your voice heard through official channels, so that someone soon makes Commons and file management in general a higher priority for the Foundation!

How are we doing on this front? What more pushing is needed around Commons prioritization?

  2. How can we do the easy 10% here and produce
  • an index of filenames (on Commons) and URLs to their latest revision, run through some sanity check to avoid vandalism (files older than X days, last unreverted revision older than Y days, &c)
  • this index chunked into subsets (per the initial two letters of the filename? that could be ~100 subsets, each under 1TB, where you can tell just by looking at a filename which chunk it would be in, so it's not terrible if you have chunks 1-50 from one year and chunks 51-100 from another)? A rough sketch of such an index builder follows this list.
  3. What sorts of processes are needed to do the last 90%:
  • snapshots of each chunk, with metadata and one size of each file (not necessarily the original size; maybe some XXL files get a smaller format in the dump, and maybe someone wants to make XXS dumps with only thumbnails), generated periodically (once a year?)
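A rough sketch of how the index in point 2 could be built against the public API (the list=allimages module and its parameters are the standard MediaWiki action API; the two-letter chunking rule and the simple age cut-off are only the ideas from the list above, not an agreed design, and real vandalism sanity checks would need more than an age filter):

```python
# Sketch: enumerate Commons originals via the action API and assign each to a
# chunk keyed by the first two characters of the file name.
import time
import requests

API = "https://commons.wikimedia.org/w/api.php"
SESSION = requests.Session()
SESSION.headers["User-Agent"] = "commons-index-sketch/0.1 (example only)"

def iter_files(min_age_days=30):
    """Yield (name, url, size, sha1) for files uploaded more than min_age_days ago."""
    cutoff = time.strftime("%Y-%m-%dT%H:%M:%SZ",
                           time.gmtime(time.time() - min_age_days * 86400))
    params = {
        "action": "query", "format": "json", "list": "allimages",
        "aisort": "timestamp", "aidir": "ascending", "aiend": cutoff,
        "aiprop": "url|size|sha1", "ailimit": "500",
    }
    while True:
        data = SESSION.get(API, params=params, timeout=60).json()
        for img in data["query"]["allimages"]:
            yield img["name"], img["url"], img["size"], img["sha1"]
        cont = data.get("continue")
        if not cont:
            return
        params.update(cont)   # standard API continuation
        time.sleep(0.1)       # be polite; one request at a time

def chunk_of(name):
    """Map a file name to a chunk id by its first two characters."""
    return name[:2].upper()

# Demo: print the chunk assignment for the first few entries of the index.
for i, (name, url, size, sha1) in enumerate(iter_files()):
    print(chunk_of(name), name, url, size, sha1)
    if i >= 4:
        break
```

The full index (~108M rows of name, URL, size and hash) would be on the order of tens of GB uncompressed, which is itself small enough to publish alongside the existing dumps.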

The whole thing should be under 1PB, distribution of which is relatively solved and can be done by WM partners.
If each slice is around 1TB, torrents should work well enough with 1-2 dozen seeders.
IA regularly passes around its PB-scale archives, including via IPFS and torrents, iirc.
NOAA has a 25PB dataset, and sees 5PB of traffic a month.
Without leaving the US there are plenty of other public-service datasets with similar issues (and which are downloaded more often than our full dumps would be).

Given periodic enterprise interest (T316618) and suggestions that they are already hitting existing API endpoints to do this (in an inefficient and costly way?), could Enterprise help normalize this? @FNavas-foundation, what do you think?

@Sj howdy -- Enterprise doesn't handle any non-text data. Commons, we decided some time ago, was too expensive to serve in our APIs.

That said, should we find our customers need Commons through us, we may reconsider. That seems possible given image-generating AI, etc. But for clarity, there is no thinking at all about changing this at the moment.

Is this helpful?

Yes, that's helpful. Thank you. When a group like Amazon talks to (non-Enterprise) WMF folk about technical changes that would simplify their use of the site, I wonder if there is any mechanism for converting that to a free-knowledge-coalition membership, formalizing the link and providing a bit of funding.

Some time ago I filed T300907 to ask for an Enterprise HTML dump of Wikimedia Commons. I think that could be seen as the 10% of what this issue asks for. The images namespace is already dumped for English Wikipedia, and having the same done for Commons would allow one to use it for AI training: you would get the image description and other information, plus links to the image/media. This is similar to how other AI training datasets look; they just point to media on the Internet but generally do not include the media itself (also for copyright reasons). So it is something people are familiar with.

My use case is different: I am researching ways to improve search on Commons (developing an open source search engine for Wikipedia data). But still, I would be interested in having such a dump of at least just the HTML/text content.

My use case is different: I am researching ways to improve search on Commons (developing an open source search engine for Wikipedia data). But still, I would be interested in having such a dump of at least just the HTML/text content.

We do dumps of the text content on a monthly cadence: https://dumps.wikimedia.org/commonswiki
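For anyone who wants to try them: assuming the standard dumps.wikimedia.org layout (a "latest/commonswiki-latest-pages-articles.xml.bz2" file; check the directory listing above for the actual file names), the monthly XML dump can be streamed and, for example, the File: page titles pulled out like this:

```python
# Sketch: stream the monthly commonswiki text dump and print the first few
# page titles in the File: namespace. The exact dump file name is assumed to
# follow the usual dumps.wikimedia.org layout; verify it in the listing.
import bz2
import urllib.request
import xml.etree.ElementTree as ET

DUMP_URL = ("https://dumps.wikimedia.org/commonswiki/latest/"
            "commonswiki-latest-pages-articles.xml.bz2")

with urllib.request.urlopen(DUMP_URL) as resp:
    stream = bz2.BZ2File(resp)          # decompress on the fly
    printed = 0
    for _, elem in ET.iterparse(stream):
        # Match on the local tag name so the XML schema version does not matter.
        if elem.tag.endswith("}title") and elem.text and elem.text.startswith("File:"):
            print(elem.text)
            printed += 1
            if printed >= 10:
                break
        elem.clear()                    # discard element contents as we go
```

What each page carries in these dumps, though, is raw wikitext only.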

Those are largely useless for this because they contain raw wiki markup; that is why the Enterprise HTML dumps are so useful: they contain the rendered page. Wiki markup on its own is of limited use because a lot of content gets pulled in through templates and other mechanisms, and the HTML dumps contain all of that.