
Wikimedia Enterprise HTML dump for Wikimedia Commons
Open, Needs Triage, Public

Description

I would like to request a public HTML dump for Wikimedia Commons. For enwiki you already provide the File namespace; it would be great if there were also a dump for Wikimedia Commons, so that I could find the HTML descriptions of all media files used on Wikipedia.

Event Timeline

@ArielGlenn: Do you think dumps of file descriptions (so not the media files themselves, but the rendered wikitext) could be provided for Wikimedia Commons as part of the public Enterprise dumps? Given that dumps for so many other wikis are generated, why not Wikimedia Commons as well? This would help me obtain descriptions for files on Wikimedia Commons (and, given that there are already no other dumps for Wikimedia Commons, it would also let me hit its API less).

This would be something for the Enterprise team. Anything they provide, we'll grab twice a month and make available for public download.

Who could I ask from their team about this?

They are already tagged on the task, that should be enough.

Hm, there has been no response since February. :-( OK, I will wait.

I have now tried to use the API to fetch things myself, but it is going very slowly (also because the rate limit on the HTML REST API endpoint is 100 requests per second, not the documented 200 requests per second; see T307610). I would like to understand whether I should at least hope for this to be done at some point soon, or not at all. I find it surprising that so many dumps are made and just this one is missing. Would it be just a matter of flipping one switch to enable the dump for one more wiki?
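For reference, here is roughly what my fetching looks like (a minimal sketch; the REST endpoint, the User-Agent string, and the simple one-second pacing are my own choices, not anything prescribed by the API documentation):

```python
import time

import requests

# Minimal sketch: fetch rendered (Parsoid) HTML for Commons File pages via the
# public REST API, paced to stay under the ~100 requests/second limit observed
# in T307610. In practice the titles would come from the allpages API or the
# SQL/XML dumps; the short list at the bottom is only for illustration.
SESSION = requests.Session()
SESSION.headers.update({"User-Agent": "commons-file-descriptions-fetch/0.1 (example)"})

RATE_LIMIT = 100  # requests per second (observed limit; documented as 200)


def fetch_html(title: str) -> str:
    """Fetch the rendered HTML of one Commons page from the REST API."""
    url = "https://commons.wikimedia.org/api/rest_v1/page/html/" + requests.utils.quote(title, safe="")
    response = SESSION.get(url, timeout=30)
    response.raise_for_status()
    return response.text


def fetch_all(titles):
    """Yield (title, html) pairs, sleeping after each batch to pace requests."""
    for i, title in enumerate(titles, start=1):
        yield title, fetch_html(title)
        if i % RATE_LIMIT == 0:
            time.sleep(1)  # crude pacing: roughly RATE_LIMIT requests per second on average


if __name__ == "__main__":
    for title, html in fetch_all(["File:Example.jpg"]):
        print(title, len(html))
```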

So my main motivation here is to be able to obtain descriptions of all files on Wikimedia Commons in their HTML form.

@LWyatt Can you shed any light on this, please? Or anyone else?

Hello @Mitar - and sorry for the wait. I wasn't aware of this ticket; apologies on behalf of the team. I'll make sure others see it.

Hey @Mitar, sorry for the late response. We are working through some process-oriented work to make sure we triage this type of request more quickly.

Hmmm, we do not have plans at the moment to provide a dump for Wikimedia Commons - I'll cite mostly the challenges around handling its most sensitive content, plus the sheer size of the project and our ability to maintain it. However, what you're looking for seems different (quoted below) - are you looking for the media files themselves, or just the HTML with a link to the Commons file?

So my main motivation here is to be able to obtain descriptions of all files on Wikimedia Commons in their HTML form.

Just the HTML dumps. So what you already provide at https://dumps.wikimedia.org/other/enterprise_html/, but also for the Commons wiki. (You already provide namespace 6 for other wikis.)

I do not think this should be too large?

Sure - I'm not actually that sure about the size, so fair point. Some of my assumptions here came from @ArielGlenn's past experience working on those dumps (and since they're on the ticket, I'm tagging them). Do we have an idea of the number of HTML entities in Commons?

I think this is worth exploring, but we may not be able to prioritize it quite yet, by the way.

I don't quite understand the question about the number of HTML entities. However, I can tell you that the number of pages in the File namespace (namespace 6) for Commons that are dumped in our latest SQL/XML files is 86,213,972, so that's rather a lot.

A gentle ping on this. I understand that it would be a large dump, but on the other hand it is a very important one: Wikimedia Commons lacks any other substantial dump, so a dump of File pages would at least let one obtain descriptions of all files. That can be useful for many use cases, such as training AI models, search engines, etc.

I tried using the API, but it takes around 35 days to go through all files and obtain all file descriptions (while staying within the API limits); rough arithmetic is sketched below. That is a long time, not to mention a lot of API requests to WMF servers.
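The back-of-the-envelope arithmetic behind that 35-day estimate (the sustained request rate is my own rough number from running the fetcher; the page count is the figure quoted above):

```python
# Back-of-the-envelope check of the 35-day figure. The File-page count is the
# one quoted from the latest SQL/XML dumps; the sustained request rate is what
# I manage in practice (an estimate, not a documented number).
file_pages = 86_213_972        # Commons pages in namespace 6
sustained_rate = 28.5          # requests/second achieved in practice (assumption)
rest_api_limit = 100           # requests/second limit on the HTML REST endpoint

seconds_per_day = 86_400
print(file_pages / sustained_rate / seconds_per_day)  # ≈ 35 days at my throughput
print(file_pages / rest_api_limit / seconds_per_day)  # ≈ 10 days even at the full limit
```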

Is there any way I could help push this further? I really think it would be very useful to have coverage of Commons as well. I would guess that Enterprise users would also love to have these dumps, now that everyone is training AI models on images plus descriptions.