I added it to wikimedia-hackathon-2022. I think it would be a nice thing to fix as part of it.
Apr 27 2022
Apr 22 2022
Apr 19 2022
Thanks for linking to that task.
I was interested in this primarily for my own self-approved app, used only by me. There it should be trivial to just change the grants.
Hm, should this be prioritized more? It has been 8 years now.
Apr 6 2022
Is there any way I could help to push this further?
Apr 5 2022
I see. Thank you so much for the detailed update. It helps a lot in understanding things.
What is the limiting factor here? That backups are large, so they are hard to host? So if backups are made, is it then just a question of pushing them somewhere? If somebody offered storage for those backups, would that help move this issue forward?
Apr 4 2022
I think all media files should be made available through IPFS. Then it would be easy to host a copy of the files, or to contribute to hosting part of a copy. You could pin the files you are interested in. And it would work like a torrent, just dynamic: new files can be added as they are uploaded, and removed files can be unpinned by Wikimedia and then either hosted by others or lost from IPFS. It could probably be arranged so that Wikimedia does not have to store files twice, with IPFS using the same files otherwise used for serving the web/API. This is something the people behind IPFS are thinking about as well, so it could align: https://filecoin.io/store/#foundation I think this could help with the fact that, at the current size, it is hard to make a static dump of all media files. Making this more distributed and fluid could help.
I think there are two actionable things to do here:
Mar 30 2022
So should we delete all except the Test_conductitivity.mpg files? Or should I re-encode the first and third files as MPG and re-upload them?
Mar 1 2022
What was the condition you searched for? There are PDFs which have 0x0 both in the database and in the web interface. See T301291. At least on English Wikipedia and Wikimedia Commons I could not find any other PDF or DjVu which has 0x0 in the database but a reasonable size in the web interface.
Feb 28 2022
Given that this is the only row where this is the case (I went through the whole dump), could I suggest that somebody just writes the numbers in and that is it? :-) Investigating how this happened and why the website still reports correct numbers (maybe it is some cache?) might be more work.
Feb 27 2022
I fixed 06.45 Management rep letter.pdf using mutool.
Feb 16 2022
Oh, what will happen then when you upgrade PHP? Is there a ticket to track it and the title-name issues related to it? The upgrade will then change the title of the page I linked above.
Feb 15 2022
Currently I am not able to upload mp3s (no autopatrol flag), so somebody else will have to look into this.
So what to do then here?
If I were to convert this to MP3 or Ogg and upload it as a new revision of this file, what would happen? Can the file type be changed with a new file revision?
@mau If you made this PDF yourself, could I recommend removing the first blank page? Otherwise the first thumbnail does not show anything.
So I fixed it using mutool clean. But the ones I listed above cannot be fixed this way, and this is what I am reporting. mutool clean does not fix them, looking at MediaBox values shows reasonable page sizes (including the first page), and even the metadata (example for the first file above) shows a page size available:
No, this one seems just a slightly broken PDF. I just fixed it.
I filed T301807 for two MPG files which are misclassified.
The files do provide this info; see the output of ffprobe (it includes both duration and width and height). But this is not detected correctly by the MediaWiki software. So it seems support for MPG files is incomplete and some are not handled correctly. This task is thus about supporting those files, too.
Interesting that even remuxing does not fix this. I will try re-encoding, given that FLAC is lossless.
Oh, and I wanted to do the same for mp3 files, but I could not because I do not have autopatrol flag which seems to be required for uploading (fixed) mp3 files.
OK. I made a pass over all application/ogg files. The ffmpeg -err_detect command I mentioned above detected some badly broken files, which I reported for deletion, and they got deleted.
From my observations of the imageinfo metadata available for these files (in the image MediaWiki table), I think an easy fix for many (if not all) of them would be to simply look at the existing metadata: if a file is application/ogg and has a width and height, mark it as video; if width and height are 0, mark it as audio. Similarly for audio/webm and video/webm. So existing files could be fixed now with a simple script going over the image table, while we wait for the perfect fix for new files.
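A minimal sketch of that heuristic (the function name is hypothetical; a real script would read width, height, and MIME type from the MediaWiki image table):

```python
# Classify existing ambiguous-container files as audio or video from the
# dimensions already stored in the image table, as suggested above.
def classify_media(mime: str, width: int, height: int) -> str:
    """Return 'VIDEO' or 'AUDIO' for ambiguous container MIME types."""
    ambiguous = {"application/ogg", "audio/webm", "video/webm"}
    if mime not in ambiguous:
        return "UNKNOWN"
    # Files with real dimensions are videos; 0x0 means audio-only.
    return "VIDEO" if width > 0 and height > 0 else "AUDIO"
```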
Thanks for fixing the status.
Oh, thanks. That is good to know. This resolves things for me.
I would say related, because I am asking for https://github.com/wikimedia/operations-mediawiki-config to be updated (it is just a static setting, no?), and not for an API. An API would be cool, too.
Feb 14 2022
Feb 12 2022
It would be great, though, if my bot got the apihighlimits flag, so that I can make more API requests in parallel. I made a bot request on Commons some time ago, but then I thought that dumps would be enough, so I did not pursue it further. I think I would like to request apihighlimits again now, as a workaround for this issue.
Feb 10 2022
No worries. For now I made it work by doing API requests.
It looks like there are about 2.7 million such items on Commons, and for each one of those we are talking about a separate DB query to get the metadata from the external store.
This worked very well for many files. But sadly there is a bug in ffmpeg, so files which include attachments (e.g., a cover image) cannot be easily remuxed: https://trac.ffmpeg.org/ticket/4591
Feb 9 2022
Interesting: I found a case where the file passes oggz validation and ffmpeg does not complain. I remuxed it with ffmpeg and re-uploaded it. But MediaWiki still complains that it cannot decode it: https://commons.wikimedia.org/wiki/File:2018-10-10_kreisende_stare.ogg
OK, it looks like a much better test is to run:
So I went through all Ogg files on Wikimedia Commons for which processing had failed. I downloaded them and ran oggz validate on them. Some of them (424) failed validation. But many (1082) passed validation, and those are the most curious: why did they fail processing? I think to make this issue more actionable, we should focus on the files which pass oggz validate but fail here, and on why and how we could improve that.
So in the metadata column of the image table there is now instead a value like tt:12345, where 12345 is the ID of a blob in the external storage. Generally every file has two such blobs: one for data (the metadata) and one for text (text extracted from documents).
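A tiny sketch of how such a pointer could be split (the function name is hypothetical; the tt: prefix format follows the example above):

```python
def parse_blob_pointer(pointer: str) -> tuple[str, int]:
    """Split an external-store pointer like 'tt:12345' into its
    store prefix and numeric blob ID (format as in the example above)."""
    prefix, _, blob_id = pointer.partition(":")
    return prefix, int(blob_id)
```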
I went through all PDF and DjVu files on Commons and made a list of those which are valid but are shown as 0x0 without thumbnails; the list is here: T301291
What is this wikimirror.org? Why change links to that?
Feb 8 2022
Feb 7 2022
I fixed the issue by using a chunk of a different size. I think the check should be done only on the first chunk, not on later chunks, because it seems it can produce false positives depending on how the upload is chunked.
Let's take this example: https://commons.wikimedia.org/wiki/File:17de_Mai_1872.djvu
Feb 6 2022
And it was accessible through a dump in the past, before the database structure change.
Feb 5 2022
Oh, POST already works instead of GET. This should be documented somewhere on https://www.mediawiki.org/wiki/API:Main_page, because documentation like https://www.mediawiki.org/wiki/API:Imageinfo explicitly says a "GET request".
I think the only viable solution here, instead of increasing limits, is to allow a payload in the request body. Elasticsearch uses a body in GET requests, which is not standard, and one could have issues somewhere along the path because of that; so instead maybe all GET API endpoints should also allow POST for read-only queries. Maybe with support for the X-HTTP-Method-Override header, so that one could make a POST with X-HTTP-Method-Override: GET and the query parameters in the body.
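To illustrate the proposed request shape (this is only a sketch of the idea; MediaWiki does not support the X-HTTP-Method-Override header today, and the function name is hypothetical):

```python
import urllib.parse
import urllib.request


def build_overridden_post(endpoint: str, params: dict) -> urllib.request.Request:
    """Build a POST whose body carries the query parameters and whose
    X-HTTP-Method-Override header asks the server to treat it as a GET.
    Illustrative only: servers must opt in to honoring this header."""
    body = urllib.parse.urlencode(params).encode()
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={
            "Content-Type": "application/x-www-form-urlencoded",
            "X-HTTP-Method-Override": "GET",
        },
        method="POST",
    )
```

This keeps arbitrarily long parameter lists out of the URL, avoiding URL-length limits along the path.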
I see. So those dumps would not be SQL but something else?
I made T301039 as a follow-up, because now that metadata has moved to blobs, it is no longer possible to access the metadata from the image table SQL dump alone.
Feb 3 2022
It is interesting to note that some files look valid; e.g., File:06.45 Management rep letter.pdf above does open for me in Firefox.
FYI, both example entries are close to the beginning of the first ndjson file.
Some of those files have been deleted since then. Maybe it would be useful to rerun the query?
Jan 26 2022
I would vote for simply including hashes in dumps. They would make dumps bigger, but they would be consistent with the output of EntityData, which currently includes hashes for all snaks.
Just a follow-up from somebody coming to Wikidata dumps in 2021: it is really confusing that dumps do not include hashes, especially because EntityData now seems to show them for all snaks (main, qualifiers, references). So when one is debugging this, using EntityData as a reference throws you off.
I checked enwiki-NS0-20220120-ENTERPRISE-HTML.json.tar.gz, it has the same issue. Example entry for: https://en.wikipedia.org/wiki/Egli_Trimi
Jan 24 2022
Have you also evaluated Zstandard? I tested it some time ago and it performs really well. I have not yet tested Go libraries, though.
Jan 23 2022
I see there is already a patch made to fix this. What is its progress? What remains to be done for this to be addressed?
Could I suggest that audio files be categorized as audio/ogg and video files as video/ogg, instead of both as application/ogg? All those media types are valid: https://en.wikipedia.org/wiki/Ogg But it would be really nice if it were possible to tell from the media type whether a file is audio or video.
Jan 22 2022
Another example: https://commons.wikimedia.org/wiki/File:The_Lost_Express_(1926).webm
I oppose this because I have a use case where it is useful to know, from Wikidata's commonsMedia reference, the file type/media type of the referenced file without having to call any API. I can currently determine this only by looking at the file extension of the commonsMedia value; otherwise I would have to call an API. If one is processing the whole Wikidata dump, that could be a lot of API calls.
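A rough sketch of that extension-based lookup (the extension sets are illustrative and incomplete, not an exhaustive list of what Commons accepts; note that .ogg itself is ambiguous, which is part of the problem discussed elsewhere in this thread):

```python
# Infer a coarse media kind from a Commons file name's extension alone,
# without any API call. Illustrative mapping only.
AUDIO_EXTS = {"mp3", "oga", "flac", "wav", "mid"}
VIDEO_EXTS = {"webm", "ogv", "mpg", "mpeg"}
IMAGE_EXTS = {"jpg", "jpeg", "png", "gif", "svg", "tiff"}


def media_kind(file_name: str) -> str:
    ext = file_name.rsplit(".", 1)[-1].lower()
    if ext in AUDIO_EXTS:
        return "audio"
    if ext in VIDEO_EXTS:
        return "video"
    if ext in IMAGE_EXTS:
        return "image"
    # .ogg and unrecognized extensions cannot be resolved this way.
    return "unknown"
```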
Jan 20 2022
Jan 18 2022
Interesting. I also use Go to parse those dumps (see the library here) (it would be nice if we could share a public Go struct representation of the JSON), and I use pbzip2 to decompress bzip2, which supports parallel decompression. But I also cannot find any parallel compression implementation for bzip2. So there is an interesting mismatch: pgzip has a parallel compression implementation but serial decompression, while pbzip2 has parallel decompression but no parallel compression yet.
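For illustration, here is a serial stdlib sketch of streaming such a dump (in Python rather than Go, and assuming the Wikidata layout of one JSON entity per line inside a top-level array; parallel decompression needs an external tool like pbzip2):

```python
import bz2
import json


def stream_dump_lines(path: str):
    """Iterate over entities of a bzip2-compressed JSON dump without
    loading it all into memory. Decompression here is serial."""
    with bz2.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            # Wikidata dumps wrap entities in a JSON array, one per line,
            # each followed by a trailing comma.
            line = line.strip().rstrip(",")
            if not line or line in ("[", "]"):
                continue
            yield json.loads(line)
```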
Jan 15 2022
In the meantime, is the code which produces the current JSON available somewhere, open source? I could use it to generate similar JSONs for myself while waiting for the API endpoint.