Aug 18 2023
BTW, there is a related issue T300124 that some categories and templates are not always extracted.
You are right. It is listed now in docs.
May 13 2023
Awesome! Thanks. This looks really amazing. I am not too convinced that we should introduce a different dump format, but changing compression seems to really be a low hanging fruit.
May 9 2023
Yes, great summary. Thanks.
May 8 2023
I think it would be useful to have a benchmark with more options: JSON with gzip, bzip (decompressed with lbzip2), and zstd. And then for QuickStatements the same. Could you do that?
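A rough sketch of how such a comparison could be timed, using only what ships with Python. This is an assumption-laden stand-in, not the benchmark asked for: it uses Python's built-in `gzip` and `bz2` modules rather than lbzip2, it leaves out zstd because that is not assumed to be available in the standard library here, and `time_decompression` is a made-up helper name.

```python
import bz2
import gzip
import time


def time_decompression(raw, repeats=3):
    """Return {codec name: (compressed size in bytes, best decompress seconds)}."""
    # Only standard-library codecs; a real run would add zstd and lbzip2.
    codecs = {
        "gzip": (gzip.compress, gzip.decompress),
        "bzip2": (bz2.compress, bz2.decompress),
    }
    results = {}
    for name, (compress, decompress) in codecs.items():
        blob = compress(raw)
        best = float("inf")
        restored = b""
        for _ in range(repeats):
            start = time.perf_counter()
            restored = decompress(blob)
            # Keep the fastest run to reduce noise from other processes.
            best = min(best, time.perf_counter() - start)
        if restored != raw:
            raise ValueError(f"{name} round trip changed the data")
        results[name] = (len(blob), best)
    return results
```

The same function could be fed a sample of the JSON dump and a sample of the QuickStatements export so that both formats are measured the same way.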
May 1 2023
To my knowledge it is. https://www.mediawiki.org/wiki/Wikimedia_REST_API#Terms_and_conditions still says that 200 requests/second per REST API endpoint is fine (unless documented to have less, for example the transform API endpoint), but the configuration says otherwise: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/varnish/templates/text-frontend.inc.vcl.erb#431.
Mar 17 2023
Yes, I made a library for processing those dumps in Go.
In large dumps there are multiple files inside one archive. So tar serves as a standard way to combine those multiple files into one file, and then compression is made over all of that.
The tar format is really made for streaming, so I am surprised that this is hard to do in your programming language. Seeking is the problem with tar, but streaming is really easy. It is really just a concatenation of files, so it is similar to any other buffered stream.
I will respond about tar layer in T332045 you made.
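To illustrate the streaming point about tar, here is a minimal Python sketch (not the Go library mentioned above). It assumes the archive is gzip-compressed and that each file inside holds one JSON object per line; `iter_json_lines` is a made-up name.

```python
import json
import tarfile


def iter_json_lines(fileobj):
    """Yield one parsed JSON object per line from every file in a .tar.gz stream."""
    # Mode "r|gz" reads the archive strictly front to back and never seeks
    # backwards, so fileobj can be a pipe or an HTTP response body.
    with tarfile.open(fileobj=fileobj, mode="r|gz") as archive:
        for member in archive:
            if not member.isfile():
                continue
            extracted = archive.extractfile(member)
            # Each member must be fully read before moving to the next one,
            # because a stream cannot go back to an earlier member.
            for line in extracted:
                if line.strip():
                    yield json.loads(line)
```

Because nothing is extracted to disk, memory use stays bounded by the longest single line rather than by the size of the archive.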
Feb 15 2023
In OIDC however the same data should also be returned alongside the access token
I agree. Let's close this issue then and track OIDC parameters in the issue you referenced.
Feb 14 2023
So there is an API call to get all categories/templates, which are then included in the dumps. The issue here is that sometimes that call seems to fail and the data is not included.
You should use a parallel gzip decompressor. Just using standard gzip (which is invoked by tar) is not that.
How can you parse all the templates used from Parsoid HTML? Some templates do not have any output into HTML?
Feb 13 2023
Since I opened that issue I have learned more about OIDC, and I think there is really no need for an authenticate endpoint anymore. The authorize and access_token endpoints are currently good enough. I think what threw me off is that (at least at the time, I have not checked recently) it looked like there was no way to get automatically redirected back to the app if the user is still signed in to the MediaWiki instance AND has already authorized the app. Other OIDC providers just redirect back. But MediaWiki (at the time at least) showed the authorization dialog every time. So the one-click sign-in flow was less fluid than one would like.
Dec 27 2022
Is this something a community contribution could help with?
So what is the plan about this? Is this something a community contribution could help with?
Sep 22 2022
A gentle ping on this. I understand that it would be a large dump. But on the other hand it is a very important one: Wikimedia Commons is lacking any other substantial dump so having a dump of file pages would help one obtain at least descriptions of all files through a dump. That can be useful for many use cases, like training AI models, search engines, etc.
Any luck finding a reason?
Aug 6 2022
A workaround is available in this StackOverflow question/answer: https://stackoverflow.com/questions/73223844/get-the-number-of-pages-in-a-mediawiki-wikipedia-namespace
Jul 13 2022
Just HTML dumps. So what you provide here https://dumps.wikimedia.org/other/enterprise_html/ but also for commons wiki. (You already provide namespace 6 for other wikis.)
I have now tried to use the API to fetch things myself, but it is going very slowly (also because the rate limit on the HTML REST API endpoint is 100 requests per second and not the documented 200 requests per second, see T307610). I would like to understand whether I should at least hope for this to be done at some point soon, or not at all. I find it surprising that so many dumps are made but just this one is missing. Would it be just one switch to enable the dump on one more wiki?
Jul 6 2022
Jul 5 2022
OK, it is not connected to characters in the filename. There are files in entities with above characters. But I do not get why not all files on Wikimedia Commons have entities.
Jul 4 2022
Oh, what a sad issue T149410. :-(
Jun 29 2022
Hm, I am pretty sure that I am doing rate limiting correctly on my side, but I am hitting 429s after a brief time when trying to do 1000/10s rate limit to the REST API endpoint. If I lower it to 500/10s then I do not hit 429s. No idea why, but I am doing many requests in parallel.
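For comparison, here is a minimal sketch of client-side pacing that spaces requests evenly instead of allowing a whole 10-second budget to be spent in one burst. Whether bursts are what triggers the 429s is only a guess and is not established here. `Pacer` is a hypothetical helper, and the rate of 50 per second in the usage note is simply 500/10s expressed per second.

```python
import threading
import time


class Pacer:
    """Allow at most `rate` calls per second, spaced evenly with no bursts."""

    def __init__(self, rate, clock=time.monotonic, sleep=time.sleep):
        self.interval = 1.0 / rate
        self.clock = clock
        self.sleep = sleep
        self.lock = threading.Lock()
        self.next_slot = clock()

    def wait(self):
        """Reserve the next free slot and sleep until it arrives."""
        # The lock makes slot reservation safe for parallel workers.
        with self.lock:
            now = self.clock()
            slot = max(now, self.next_slot)
            self.next_slot = slot + self.interval
        # Sleep outside the lock so other workers can reserve later slots.
        if slot > now:
            self.sleep(slot - now)
```

Each worker would call `Pacer(50).wait()` on a shared instance before issuing a request, which caps the combined rate across all parallel workers.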
Jun 28 2022
Hm, there was no response since February. :-( OK, I will wait.
Who could I ask from their team about this?
So if I understand correctly, those files have never been generated so that particular dump for that particular date will not be available?
@ArielGlenn: Do you think dumps of file descriptions (so not the media files themselves, but the rendered wikitext) could be provided for Wikimedia Commons as part of the public Enterprise dumps? Given that dumps are generated for so many other wikis, why not also for Wikimedia Commons? This could help me obtain descriptions for files on Wikimedia Commons (and given that there are already no other dumps for Wikimedia Commons, it would help me hit its API less).
@Protsack.stephan Was there any progress on this?
Jun 24 2022
I checked commons-20220620-mediainfo.json.bz2 and it contains title field (alongside other fields which are present in API).
I checked wikidata-20220620-all.json.bz2 and it contains now modified field (alongside other fields which are present in API).
Jun 13 2022
So for the next dump which will run, this will now be included? Or is there some deployment which is still necessary?
Jun 11 2022
Awesome. I will try to do so when you are online, but feel free also to just merge it without me. I do not know if I can be of much help being around anyway. :-)
Jun 9 2022
What is this subsetting you are talking about?
So what is the next step here?
Jun 8 2022
Yes, this change should fix both this issue and T278031.
Jun 7 2022
Thanks for testing!
Jun 5 2022
Done. Added it to June 7 puppet request window. Please review/advise if I did something wrong.
May 27 2022
Awesome. Thanks for explaining.
So fix to the dump script has been merged to the Wikibase extension. It is gated behind a CLI switch. What is the process that this gets turned on for dumps from Wikimedia Commons (and ideally also for Wikidata)?
May 22 2022
https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/793934 is ready for a review, it has both opt-in configuration option and a test.
May 21 2022
I think this might be related to T274359.
I think this might be related to T305407.
I made another pass, adding configuration option to not include page metadata (then dump is without title and other page metadata).
May 20 2022
I made a first pass. Feedback welcome.
So the plan is:
May 11 2022
Most of that is controlled by the SRE team at a level in front of the REST API, since the frontend caching layer is a shared resource across everything.
May 10 2022
Because our edge traffic code enforces a stricter limit of ~100/s (for responses that aren't frontend cache hits due to popularity), before the requests ever get to the Restbase service.
Sadly bulk downloads do not have HTML dumps, and Enterprise dumps do not offer them for template/module documentation (only articles, categories, and files). Also, there are no Enterprise dumps for Wikimedia Commons.
Hm, but documentation for REST API says I can use 200 requests per second? https://en.wikipedia.org/api/rest_v1/
May 4 2022
Even if you request a single title, I think you still might get continue param.
May 3 2022
Are you using the MediaWiki API to obtain categories and templates? I am betting you are not processing continue properly to merge multiple API responses when one batch of data is distributed across multiple responses. You have to merge the data, otherwise some pages look like they have no templates/categories. I just encountered that when I was using the API to populate templates/categories manually (because dumps are missing them randomly). I used the following API query and you can see, with some luck, that some of the returned pages are missing templates/categories, because you have to follow the continue params, but then other pages are missing them. Only when batchcomplete is true do you know you got everything (but you have to merge everything you got before that).
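The merging described above can be sketched roughly as follows. This is a minimal illustration, not the code used for the dumps: `fetch` is a hypothetical caller-supplied function that sends the parameters to the API and returns the decoded JSON, and the response is assumed to use the default JSON layout in which `query.pages` is an object keyed by page ID.

```python
def merge_page(target, update):
    """Merge one partial page record into the accumulated record for that page."""
    for key, value in update.items():
        if isinstance(value, list):
            # List properties such as "templates" and "categories" may be
            # split across responses, so append instead of overwriting.
            target.setdefault(key, []).extend(value)
        else:
            target[key] = value


def query_batch(fetch, params):
    """Follow continuation until "batchcomplete" and return the merged pages.

    fetch is a hypothetical function: dict of request params -> response dict.
    """
    pages = {}
    request = dict(params)
    while True:
        response = fetch(request)
        for page_id, page in response.get("query", {}).get("pages", {}).items():
            merge_page(pages.setdefault(page_id, {}), page)
        # Stop once the server marks the batch complete (or has nothing more).
        if "batchcomplete" in response or "continue" not in response:
            return pages
        # Send back every continuation value the server returned.
        request = {**params, **response["continue"]}
```

A page that appears without templates in one response is therefore not treated as having none until the whole batch has been merged.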
May 1 2022
I think I misunderstood in T301039 from documentation that those pointers are pointing to the text table.
Apr 28 2022
I would be interested in doing that, but I probably need a helping hand to do it. I have a programming background, but zero understanding of where and how this could be fixed. My understanding is that the hackathon would be suitable for this? Do I have to make a session? How do I find other people who might be able to help me?
Apr 27 2022
I added it to wikimedia-hackathon-2022. I think it would be a nice thing to fix as part of it.
Apr 22 2022
Apr 19 2022
Thanks for linking to that task.
I was interested in this primarily for my own self-approved app used only by me. There it should be trivial to just change grants.
Hm, should this be prioritized more? It is 8 years now.
Apr 6 2022
Is there any way I could help to push this further?
Apr 5 2022
I see. Thank you so much for detailed update. This helps a lot to understand things.
What is the limiting factor here? That backups are large, so it is hard to host them? So if backups are made, then it is just a question of pushing them somewhere? If somebody offered storage for those backups, would that help move this issue forward?
Apr 4 2022
I think all media files should be made available through IPFS. Then it would be easy to host a copy of the files, or to contribute to hosting part of a copy. You could pin the files you are interested in. And it would work like a torrent, except that it is dynamic (new files can be added as they are uploaded, removed files can be unpinned by Wikimedia and can be hosted by others, or get lost from IPFS). It could probably be made so that Wikimedia does not have to host files twice, with IPFS using the same files otherwise used for serving the web/API. This is something people behind IPFS are thinking about as well (https://filecoin.io/store/#foundation), so it could align. I think this could help with the fact that it is hard to make a static dump of all media files at the current size. So making this more distributed and fluid could help.
I think there are two actionable things to do here:
Mar 30 2022
So should we delete all except the Test_conductitivity.mpg files? Or should I re-code the first and third file as MPG and re-upload them?
Mar 1 2022
What was the condition you searched for? Because there are PDFs which have 0x0 in the database and also in the web interface. See T301291. At least in English Wikipedia and Wikimedia Commons I could not find any other PDF or DjVu which has 0x0 in the database but a reasonable number in the web interface.
Feb 28 2022
Given that this is the only row where this is the case (I went through the whole dump), could I suggest that somebody just writes the numbers in and that is it? :-) Investigating how this happened and why the website still reports correct numbers (maybe it is some cache?) might be more work.