Apr 27 2022

Mitar added a comment to T301104: Wikimedia Commons structured data dump does not contain all fields, e..g, title.

I added it to wikimedia-hackathon-2022. I think it would be a nice thing to fix as part of it.

Apr 27 2022, 9:25 PM · Wikimedia-Hackathon-2022, Structured-Data-Backlog, Structured Data Engineering, Commons
Mitar added a project to T301104: Wikimedia Commons structured data dump does not contain all fields, e..g, title: Wikimedia-Hackathon-2022.
Apr 27 2022, 9:24 PM · Wikimedia-Hackathon-2022, Structured-Data-Backlog, Structured Data Engineering, Commons

Apr 22 2022

Mitar updated subscribers of T306409: Regression in processing PDFs on Wikimedia Commons: No width, height, no metadata.
Apr 22 2022, 10:45 AM · Wikimedia-maintenance-script-run, User-TheDJ, MediaWiki-File-management, Commons

Apr 19 2022

Mitar added a comment to T62380: OAuth developers should be able to change what grants their application asks for instead of having to submit a new application.

Thanks for linking to that task.

Apr 19 2022, 11:34 AM · MediaWiki-extensions-OAuth
Mitar added a comment to T62380: OAuth developers should be able to change what grants their application asks for instead of having to submit a new application.

I was interested in this primarily for my own self-approved app used only by me. There it should be trivial to just change grants.

Apr 19 2022, 10:19 AM · MediaWiki-extensions-OAuth
Mitar added a comment to T62380: OAuth developers should be able to change what grants their application asks for instead of having to submit a new application.
Apr 19 2022, 8:41 AM · MediaWiki-extensions-OAuth
Mitar added a comment to T62380: OAuth developers should be able to change what grants their application asks for instead of having to submit a new application.

Hm, should this be prioritized more? It is 8 years now.

Apr 19 2022, 8:12 AM · MediaWiki-extensions-OAuth
Mitar created T306409: Regression in processing PDFs on Wikimedia Commons: No width, height, no metadata.
Apr 19 2022, 5:24 AM · Wikimedia-maintenance-script-run, User-TheDJ, MediaWiki-File-management, Commons

Apr 6 2022

Mitar updated the task description for T305548: Better handling of orphan local file descriptions when a Wikimedia Commons file is renamed.
Apr 6 2022, 12:20 PM · Commons
Mitar created T305548: Better handling of orphan local file descriptions when a Wikimedia Commons file is renamed.
Apr 6 2022, 12:19 PM · Commons
Mitar added a comment to T301104: Wikimedia Commons structured data dump does not contain all fields, e..g, title.

Is there any way I could help to push this further?

Apr 6 2022, 11:57 AM · Wikimedia-Hackathon-2022, Structured-Data-Backlog, Structured Data Engineering, Commons

Apr 5 2022

Mitar added a comment to T298394: Produce regular public dumps of Commons media files.

I see. Thank you so much for the detailed update. This helps a lot to understand things.

Apr 5 2022, 10:03 PM · Datasets-Archiving, Internet-Archive, Dumps-Generation, Commons-Datasets, Commons
Mitar added a comment to T298394: Produce regular public dumps of Commons media files.

What is the limiting factor here? Is it that the backups are large, so it is hard to host them? So if backups are made, is it then just a question of pushing them somewhere? If somebody offered storage for those backups, would that help move this issue further?

Apr 5 2022, 12:22 PM · Datasets-Archiving, Internet-Archive, Dumps-Generation, Commons-Datasets, Commons

Apr 4 2022

Mitar added a comment to T53001: Image tarball dumps on your.org are not being generated.

I think all media files should be made available through IPFS. Then it would be easy to host a copy of the files, or to contribute to hosting part of a copy. You could pin the files you are interested in. It would work like a torrent, except that it is dynamic (new files can be added as they are uploaded, and removed files can be unpinned by Wikimedia and then either hosted by others or lost from IPFS). It could probably be set up so that Wikimedia does not have to host files twice, with IPFS using the same files that are otherwise used for serving the web/API. This is something the people behind IPFS are thinking about as well, so it could align: https://filecoin.io/store/#foundation I think this could help with the fact that it is hard to make a static dump of all media files at the current size. So making this more distributed and fluid could help.

Apr 4 2022, 3:06 PM · Dumps-Generation, Datasets-Archiving, Datasets-General-or-Unknown
Mitar added a comment to T73405: Medium-sized image dump.

I think there are two actionable things to do here:

Apr 4 2022, 12:59 PM · Dumps-Generation, Internet-Archive, Datasets-Archiving

Mar 30 2022

Mitar added a comment to T301788: Metadata issues with few .mpg files on Wikimedia Commons.

So should we delete all except the Test_conductitivity.mpg files? Or should I re-code the first and third file as MPG and re-upload them?

Mar 30 2022, 8:57 AM · TimedMediaHandler, Commons

Mar 1 2022

Mitar added a comment to T302677: Metadata of a PDF in image table dump does not match the website.

What was the condition you searched for? Because there are PDFs which have 0x0 in the database but also in the web interface. See T301291. At least in English Wikipedia and Wikimedia Commons I could not find any other PDF or Djvu which would have 0x0 but in the web interface a reasonable number.

Mar 1 2022, 5:15 PM · MediaWiki-File-management, Dumps-Generation, Commons

Feb 28 2022

Mitar added a comment to T302677: Metadata of a PDF in image table dump does not match the website.

Given that this is the only row where this is the case (I went through the whole dump), could I suggest that somebody just writes the numbers in and that is it? :-) Investigating how this happened and why the website still reports correct numbers (maybe it is some cache?) might be more work.

Feb 28 2022, 2:16 PM · MediaWiki-File-management, Dumps-Generation, Commons

Feb 27 2022

Mitar updated the task description for T301291: PDF and Djvu files on Commons failed to be processed (no thumbnails, zero pages) but otherwise valid.
Feb 27 2022, 5:23 PM · MediaWiki-extensions-PdfHandler, Commons
Mitar updated the task description for T301291: PDF and Djvu files on Commons failed to be processed (no thumbnails, zero pages) but otherwise valid.
Feb 27 2022, 5:02 PM · MediaWiki-extensions-PdfHandler, Commons
Mitar updated the task description for T301291: PDF and Djvu files on Commons failed to be processed (no thumbnails, zero pages) but otherwise valid.
Feb 27 2022, 5:01 PM · MediaWiki-extensions-PdfHandler, Commons
Mitar updated the task description for T301291: PDF and Djvu files on Commons failed to be processed (no thumbnails, zero pages) but otherwise valid.
Feb 27 2022, 4:57 PM · MediaWiki-extensions-PdfHandler, Commons
Mitar updated the task description for T301291: PDF and Djvu files on Commons failed to be processed (no thumbnails, zero pages) but otherwise valid.
Feb 27 2022, 4:56 PM · MediaWiki-extensions-PdfHandler, Commons
Mitar updated the task description for T301291: PDF and Djvu files on Commons failed to be processed (no thumbnails, zero pages) but otherwise valid.
Feb 27 2022, 3:32 PM · MediaWiki-extensions-PdfHandler, Commons
Mitar updated the task description for T301291: PDF and Djvu files on Commons failed to be processed (no thumbnails, zero pages) but otherwise valid.
Feb 27 2022, 2:49 PM · MediaWiki-extensions-PdfHandler, Commons
Mitar added a comment to T155741: img_metadata missing.

I fixed 06.45 Management rep letter.pdf using mutool.

Feb 27 2022, 2:49 PM · Wikimedia-database-issue (Bad data), Commons, MediaWiki-File-management, CommonsMetadata, WMF-General-or-Unknown, Multimedia
Mitar updated the task description for T301291: PDF and Djvu files on Commons failed to be processed (no thumbnails, zero pages) but otherwise valid.
Feb 27 2022, 2:45 PM · MediaWiki-extensions-PdfHandler, Commons
Mitar created T302677: Metadata of a PDF in image table dump does not match the website.
Feb 27 2022, 1:33 PM · MediaWiki-File-management, Dumps-Generation, Commons

Feb 16 2022

Mitar added a comment to T301758: OverrideUcfirstCharacters not in public settings.

Oh, what will happen then when you upgrade PHP? Is there a ticket to track that and the title-related issues it causes? So the upgrade will then change the title of the page I linked above.

Feb 16 2022, 9:03 AM · Wikimedia-Site-requests

Feb 15 2022

Mitar added a comment to T301807: Two MPG files are audio files, but are classified as video.

Currently I am not able to upload mp3s (no autopatrol flag), so somebody else will have to look into this.

Feb 15 2022, 11:32 PM · MediaWiki-File-management, Commons
Mitar added a comment to T301807: Two MPG files are audio files, but are classified as video.

So what to do then here?

Feb 15 2022, 9:07 PM · MediaWiki-File-management, Commons
Mitar added a comment to T301807: Two MPG files are audio files, but are classified as video.

If I were to convert this to mp3 or ogg and upload it as a new revision of this file, what would happen? Can file type be changed with a new file revision?

Feb 15 2022, 8:57 PM · MediaWiki-File-management, Commons
Mitar added a comment to T301291: PDF and Djvu files on Commons failed to be processed (no thumbnails, zero pages) but otherwise valid.

@mau If you made this PDF yourself, could I recommend removing the first blank page? Because otherwise the first thumbnail does not show anything.

Feb 15 2022, 8:52 PM · MediaWiki-extensions-PdfHandler, Commons
Mitar added a comment to T301291: PDF and Djvu files on Commons failed to be processed (no thumbnails, zero pages) but otherwise valid.

So I fixed it using mutool clean. But the ones I listed above cannot be fixed this way. And this is what I am reporting. So mutool clean does not fix them, the MediaBox values show reasonable page sizes (including for the first page), and even the metadata (example for the first file above) shows the page size is available:

Feb 15 2022, 8:52 PM · MediaWiki-extensions-PdfHandler, Commons
Mitar added a comment to T301291: PDF and Djvu files on Commons failed to be processed (no thumbnails, zero pages) but otherwise valid.

No, this one seems just a slightly broken PDF. I just fixed it.

Feb 15 2022, 7:48 PM · MediaWiki-extensions-PdfHandler, Commons
Mitar added a comment to T155320: Implement strict mime type detection and media type inferring of audio/video files.

I filed T301807 for two MPG files which are misclassified.

Feb 15 2022, 6:06 PM · TimedMediaHandler-Transcode, Commons, MediaWiki-File-management, Multimedia, Technical-Debt
Mitar added a parent task for T301807: Two MPG files are audio files, but are classified as video: T155320: Implement strict mime type detection and media type inferring of audio/video files.
Feb 15 2022, 6:05 PM · MediaWiki-File-management, Commons
Mitar added a subtask for T155320: Implement strict mime type detection and media type inferring of audio/video files: T301807: Two MPG files are audio files, but are classified as video.
Feb 15 2022, 6:05 PM · TimedMediaHandler-Transcode, Commons, MediaWiki-File-management, Multimedia, Technical-Debt
Mitar created T301807: Two MPG files are audio files, but are classified as video.
Feb 15 2022, 6:04 PM · MediaWiki-File-management, Commons
Mitar added a comment to T301788: Metadata issues with few .mpg files on Wikimedia Commons.

The files do provide this info, see the output of ffprobe (both duration and width and height are in there). But this is not detected correctly by the MediaWiki software. So it seems support for mpg files is not complete and some are not handled correctly. So this task is about supporting those files, too.

Feb 15 2022, 5:20 PM · TimedMediaHandler, Commons
Mitar added a subtask for T44725: Multimedia file format support (tracking): T301788: Metadata issues with few .mpg files on Wikimedia Commons.
Feb 15 2022, 3:47 PM · Tracking-Neverending, WMF-General-or-Unknown
Mitar added a parent task for T301788: Metadata issues with few .mpg files on Wikimedia Commons: T44725: Multimedia file format support (tracking).
Feb 15 2022, 3:47 PM · TimedMediaHandler, Commons
Mitar created T301788: Metadata issues with few .mpg files on Wikimedia Commons.
Feb 15 2022, 3:46 PM · TimedMediaHandler, Commons
Mitar added a comment to T301774: Multiple .flac files on Wikimedia Commons have zero reported duration despite not having them with other tools.

Interesting that even remuxing does not fix this. I will try recoding, given that flac is lossless.

Feb 15 2022, 2:56 PM · Commons
Mitar added a parent task for T301774: Multiple .flac files on Wikimedia Commons have zero reported duration despite not having them with other tools: T44725: Multimedia file format support (tracking).
Feb 15 2022, 1:41 PM · Commons
Mitar added a subtask for T44725: Multimedia file format support (tracking): T301774: Multiple .flac files on Wikimedia Commons have zero reported duration despite not having them with other tools.
Feb 15 2022, 1:41 PM · Tracking-Neverending, WMF-General-or-Unknown
Mitar created T301774: Multiple .flac files on Wikimedia Commons have zero reported duration despite not having them with other tools.
Feb 15 2022, 1:38 PM · Commons
Mitar added a comment to T63900: Invalid Ogg file: Stream Undecodable.

Oh, and I wanted to do the same for mp3 files, but I could not because I do not have the autopatrol flag, which seems to be required for uploading (fixed) mp3 files.

Feb 15 2022, 1:29 PM · TimedMediaHandler
Mitar added a comment to T63900: Invalid Ogg file: Stream Undecodable.

OK. I made a pass over all application/ogg files. The ffmpeg -err_detect command I mentioned above detected some badly broken files, which I reported for deletion, and they got deleted.

Feb 15 2022, 1:25 PM · TimedMediaHandler
Mitar added a comment to T226311: Some WebM video files are misdetected as audio files due to the MIME detector not scanning enough bytes.

From my observations looking at imageinfo metadata available for these files (as available in image Mediawiki table), I think an easy fix for many (if not all) of these files would be to simply look at existing imageinfo metadata: if a file is application/ogg and has width and height, mark it as video, if width and height is 0, mark it as audio. Similar for audio/webm and video/webm. So this could be fixed now for existing files with a simple script going over the image table. While we wait for the perfect fix for new files.
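
The heuristic described in this comment could be sketched roughly as follows. This is only an illustration, not MediaWiki code: the function name `infer_media_kind` and the sample rows are hypothetical, and it assumes the MIME type, width and height have already been read from the image table.

```python
from typing import Optional

# MIME types named in the comment as ambiguous between audio and video.
CONTAINER_TYPES = {"application/ogg", "audio/webm", "video/webm"}


def infer_media_kind(mime: str, width: int, height: int) -> Optional[str]:
    """Guess 'audio' or 'video' from the stored dimensions (hypothetical helper)."""
    if mime not in CONTAINER_TYPES:
        return None  # heuristic only applies to the ambiguous container types
    if width > 0 and height > 0:
        return "video"
    if width == 0 and height == 0:
        return "audio"
    return None  # only one dimension set: leave undecided


# Made-up rows of (name, mime, width, height), for illustration only.
rows = [
    ("Example_song.ogg", "application/ogg", 0, 0),
    ("Example_clip.ogg", "application/ogg", 640, 480),
    ("Example_film.webm", "audio/webm", 1280, 720),
]
for name, mime, width, height in rows:
    print(name, infer_media_kind(mime, width, height))
```

A script over the image table would apply the same check to each row and update only the rows where the inferred kind differs from the stored one.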

Feb 15 2022, 1:16 PM · User-TheDJ, MW-1.39-notes (1.39.0-wmf.9; 2022-04-25), MW-1.38-notes (1.38.0-wmf.23; 2022-02-21), MediaWiki-File-management, MediaWiki-Gallery, Multimedia, Commons
Mitar added a comment to T155320: Implement strict mime type detection and media type inferring of audio/video files.

From my observations looking at imageinfo metadata available for these files (as available in image Mediawiki table), I think an easy fix for many (if not all) of these files would be to simply look at existing imageinfo metadata: if a file is application/ogg and has width and height, mark it as video, if width and height is 0, mark it as audio. Similar for audio/webm and video/webm. So this could be fixed now for existing files with a simple script going over the image table. While we wait for the perfect fix for new files.

Feb 15 2022, 1:14 PM · TimedMediaHandler-Transcode, Commons, MediaWiki-File-management, Multimedia, Technical-Debt
Mitar added a comment to T301758: OverrideUcfirstCharacters not in public settings.

Thanks for fixing the status.

Feb 15 2022, 10:57 AM · Wikimedia-Site-requests
Mitar closed T301758: OverrideUcfirstCharacters not in public settings as Resolved.
Feb 15 2022, 9:46 AM · Wikimedia-Site-requests
Mitar added a comment to T301758: OverrideUcfirstCharacters not in public settings.

Oh, thanks. That is good to know. This resolves things for me.

Feb 15 2022, 9:45 AM · Wikimedia-Site-requests
Mitar added a comment to T301758: OverrideUcfirstCharacters not in public settings.
Feb 15 2022, 9:00 AM · Wikimedia-Site-requests
Mitar added a comment to T301758: OverrideUcfirstCharacters not in public settings.

I would say related, because I am asking for https://github.com/wikimedia/operations-mediawiki-config to be updated (it is just a static setting, no?), and not for an API. An API would be cool, too.

Feb 15 2022, 8:58 AM · Wikimedia-Site-requests
Mitar created T301758: OverrideUcfirstCharacters not in public settings.
Feb 15 2022, 8:38 AM · Wikimedia-Site-requests

Feb 14 2022

Mitar added a comment to P17424 settings/enwiki.json.
Feb 14 2022, 10:56 PM

Feb 12 2022

Mitar added a comment to T301039: Provide a dump of PDF/DjVU metadata.

It would be great, though, if my bot got the apihighlimits flag, so that I can make more API requests in parallel. I made a bot request on Commons some time ago, but then I thought that dumps would be enough, so I have not pursued it further. I think I would like to request apihighlimits again now, as a workaround for this issue.

Feb 12 2022, 11:47 AM · Dumps-Generation, Commons

Feb 10 2022

Mitar added a comment to T301039: Provide a dump of PDF/DjVU metadata.

No worries. For now I made it work with doing API requests.

Feb 10 2022, 3:15 PM · Dumps-Generation, Commons
Mitar added a comment to T301039: Provide a dump of PDF/DjVU metadata.

It looks like there's about 2.7 million such items on commons, and for each one of those we are talking about a separate db query to get the metadata from the external store.

Feb 10 2022, 3:06 PM · Dumps-Generation, Commons
Mitar added a comment to T63900: Invalid Ogg file: Stream Undecodable.

This worked very well for many files. But sadly there is a bug in ffmpeg so files which include attachments (e.g., a cover image) cannot be easily remuxed: https://trac.ffmpeg.org/ticket/4591

Feb 10 2022, 6:52 AM · TimedMediaHandler
Mitar created T301438: Support extracting cover from the ogg file and use it as a thumbnail.
Feb 10 2022, 1:45 AM · TimedMediaHandler, Thumbor, MediaWiki-File-management, Commons

Feb 9 2022

Mitar added a comment to T63900: Invalid Ogg file: Stream Undecodable.

Interesting, I found a case where the file passes oggz validation and ffmpeg does not complain. I remuxed it with ffmpeg and re-uploaded it. But Mediawiki still complains that it cannot decode it: https://commons.wikimedia.org/wiki/File:2018-10-10_kreisende_stare.ogg

Feb 9 2022, 11:57 PM · TimedMediaHandler
Mitar added a comment to T63900: Invalid Ogg file: Stream Undecodable.

OK, it looks like a much better test is to run:

Feb 9 2022, 11:29 PM · TimedMediaHandler
Mitar added a comment to T63900: Invalid Ogg file: Stream Undecodable.

So I went through all Ogg files on Wikimedia Commons for which processing has failed. I downloaded them and ran oggz validate on them. Some of them (424) failed validation. But many (1082) passed validation, and for those it is most curious why they failed processing. I think to make this issue more actionable, we should focus on those which pass oggz validate but fail here: why does that happen, and how could we improve it?

Feb 9 2022, 9:25 PM · TimedMediaHandler
Mitar created T301332: webp images on commons do not show duration if animated.
Feb 9 2022, 10:33 AM · Commons
Mitar created T301323: MIDI files on Commons do not have duration.
Feb 9 2022, 9:48 AM · Commons
Mitar added a comment to T301039: Provide a dump of PDF/DjVU metadata.

So in the metadata column of the image table there is instead now a value like tt:12345 where this 12345 is an ID of the blob in the external storage. Generally every file has two such blobs: for data (for metadata) and for text (extracted text from documents).
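
As a rough illustration of the reference format described here, the following sketch extracts the numeric blob ID from a value such as tt:12345. The helper name `parse_blob_reference` is hypothetical, and the exact layout of the column (for example, whether the reference is wrapped in other data) is an assumption; only the tt:<id> shape comes from the comment.

```python
import re
from typing import Optional

# Matches a bare reference such as "tt:12345" (shape taken from the comment above).
_BLOB_REF = re.compile(r"^tt:(\d+)$")


def parse_blob_reference(value: str) -> Optional[int]:
    """Return the blob ID from a 'tt:<id>' reference, or None if it is not one."""
    match = _BLOB_REF.match(value.strip())
    return int(match.group(1)) if match else None


print(parse_blob_reference("tt:12345"))
print(parse_blob_reference("not a reference"))
```

With only the SQL dump of the image table, the ID is as far as one can get: the blob contents live in the external storage and are not part of that dump.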

Feb 9 2022, 9:44 AM · Dumps-Generation, Commons
Mitar added a comment to T297942: Specific PDF on Commons has no image thumbnails, dimensions shown as 0x0 pixels.

I went through all PDF and Djvu files on Commons and made a list of those which are valid, but are shown as 0x0 without thumbnails, the list is here: T301291

Feb 9 2022, 6:52 AM · MediaWiki-extensions-PdfHandler, Commons
Mitar added a comment to T301291: PDF and Djvu files on Commons failed to be processed (no thumbnails, zero pages) but otherwise valid.

What is this wikimirror.org? Why change links to that?

Feb 9 2022, 6:51 AM · MediaWiki-extensions-PdfHandler, Commons

Feb 8 2022

Mitar created T301291: PDF and Djvu files on Commons failed to be processed (no thumbnails, zero pages) but otherwise valid.
Feb 8 2022, 9:15 PM · MediaWiki-extensions-PdfHandler, Commons

Feb 7 2022

Mitar added a comment to T301154: Unable to upload a PDF: Internet Explorer would detect it as "text/html".

I fixed the issue by using a different chunk size. I think the check should be done only on the first chunk, not on later chunks, because it seems it can produce false positives depending on how the upload is chunked.

Feb 7 2022, 8:52 PM · Commons
Mitar created T301154: Unable to upload a PDF: Internet Explorer would detect it as "text/html".
Feb 7 2022, 5:05 PM · Commons
Mitar added a comment to T301039: Provide a dump of PDF/DjVU metadata.

Let's take this example: https://commons.wikimedia.org/wiki/File:17de_Mai_1872.djvu

Feb 7 2022, 1:50 PM · Dumps-Generation, Commons
Mitar created T301104: Wikimedia Commons structured data dump does not contain all fields, e..g, title.
Feb 7 2022, 10:03 AM · Wikimedia-Hackathon-2022, Structured-Data-Backlog, Structured Data Engineering, Commons

Feb 6 2022

Mitar added a comment to T301039: Provide a dump of PDF/DjVU metadata.

And it was accessible through a dump in the past, before the database structure change.

Feb 6 2022, 6:33 PM · Dumps-Generation, Commons

Feb 5 2022

Mitar added a comment to T209590: HTTP/2 requests fail with too-long URLs.

Oh, POST works already instead of GET. This should be documented somewhere in https://www.mediawiki.org/wiki/API:Main_page. Because when I read documentation like https://www.mediawiki.org/wiki/API:Imageinfo it explicitly says a "GET request".
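
A minimal sketch of what such a POST could look like, using only the Python standard library. The parameters follow the documented API:Imageinfo query; the helper name `build_imageinfo_request` and the User-Agent string are hypothetical, and the request is only constructed here, not sent.

```python
from typing import List
from urllib.parse import urlencode
from urllib.request import Request

API_URL = "https://commons.wikimedia.org/w/api.php"


def build_imageinfo_request(titles: List[str]) -> Request:
    """Build a read-only imageinfo query with the parameters in the POST body."""
    params = {
        "action": "query",
        "prop": "imageinfo",
        "iiprop": "size|mime",
        "titles": "|".join(titles),  # many titles no longer lengthen the URL
        "format": "json",
    }
    body = urlencode(params).encode("utf-8")
    # Supplying data makes urllib issue a POST instead of a GET.
    return Request(
        API_URL,
        data=body,
        headers={"User-Agent": "example-script/0.1 (hypothetical)"},
    )


req = build_imageinfo_request(["File:17de Mai 1872.djvu"])
print(req.get_method(), len(req.data), "bytes in body")
```

Passing the request to `urllib.request.urlopen` would send it; because the titles travel in the body, the URL length stays constant regardless of how many files are queried.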

Feb 5 2022, 5:11 PM · Traffic, SRE
Mitar added a comment to T209590: HTTP/2 requests fail with too-long URLs.

I think the only viable solution here, instead of increasing limits, is to allow the payload in a request body. Elasticsearch uses a body in GET requests, which is not standard and could cause issues somewhere along the path, so instead maybe all GET API endpoints should also allow POST for read-only queries. Maybe with support for the X-HTTP-Method-Override header, so that one could make a POST with X-HTTP-Method-Override: GET and the query parameters in the body.

Feb 5 2022, 4:49 PM · Traffic, SRE
Mitar added a comment to T301039: Provide a dump of PDF/DjVU metadata.

I see. So those dumps would not be SQL but something else?

Feb 5 2022, 4:40 PM · Dumps-Generation, Commons
Mitar added a comment to T275268: Address "image" table capacity problems by storing pdf/djvu text outside file metadata.

I made T301039 as a followup, because now that metadata is moved to blobs, it is not possible anymore to access the metadata from the image table SQL dump alone.

Feb 5 2022, 12:15 PM · MW-1.39-notes (1.39.0-wmf.7; 2022-04-11), MW-1.38-notes (1.38.0-wmf.12; 2021-12-06), MW-1.37-notes (1.37.0-wmf.23; 2021-09-13), User-Ladsgroup, Wikimedia-Performance-publish, Performance-Team (Radar), MediaWiki-File-management, Patch-For-Review, DBA, Commons
Mitar updated the task description for T301039: Provide a dump of PDF/DjVU metadata.
Feb 5 2022, 12:13 PM · Dumps-Generation, Commons
Mitar created T301039: Provide a dump of PDF/DjVU metadata.
Feb 5 2022, 12:12 PM · Dumps-Generation, Commons

Feb 3 2022

Mitar created T300907: Wikimedia Enterprise HTML dump for Wikimedia Commons.
Feb 3 2022, 6:11 PM · Wikimedia Enterprise Volunteer Request, Wikimedia Enterprise
Mitar added a comment to T155741: img_metadata missing.

It is interesting to note that some files look valid, e.g., File:06.45 Management rep letter.pdf above does open for me in Firefox.

Feb 3 2022, 3:05 PM · Wikimedia-database-issue (Bad data), Commons, MediaWiki-File-management, CommonsMetadata, WMF-General-or-Unknown, Multimedia
Mitar added a comment to T300124: In Wikimedia Enterprise HTML Dumps, categories and templates are not always extracted.

FYI, both example entries are close to the beginning of the first ndjson file.

Feb 3 2022, 9:59 AM · Wikimedia Enterprise Volunteer Request, Wikimedia Enterprise
Mitar added a comment to T155741: img_metadata missing.

Some of those files got deleted since then. Maybe it would be useful to rerun the query?

Feb 3 2022, 9:48 AM · Wikimedia-database-issue (Bad data), Commons, MediaWiki-File-management, CommonsMetadata, WMF-General-or-Unknown, Multimedia

Jan 26 2022

Mitar added a comment to T174029: Two kinds of JSON dumps?.

I would vote for simply including hashes in dumps. They would make dumps bigger, but they would be consistent with output of EntityData which currently includes hashes for all snaks.

Jan 26 2022, 1:58 PM · patch-welcome, MediaWiki-extensions-Wikibase-Repo, Wikidata
Mitar added a comment to T171607: Main snak and reference snaks do not include hash in JSON output.

Just a followup from somebody coming to Wikidata dumps in 2021: it is really confusing that dumps do not include hashes, especially because EntityData seems to show them now for all snaks (main, qualifiers, references). So when one is debugging this, using EntityData as a reference throws you off.

Jan 26 2022, 1:57 PM · Wikidata-Former-Sprint-Board, MediaWiki-extensions-Wikibase-Repo, Wikidata
Mitar added a comment to T300124: In Wikimedia Enterprise HTML Dumps, categories and templates are not always extracted.

I checked enwiki-NS0-20220120-ENTERPRISE-HTML.json.tar.gz, it has the same issue. Example entry for: https://en.wikipedia.org/wiki/Egli_Trimi

Jan 26 2022, 11:10 AM · Wikimedia Enterprise Volunteer Request, Wikimedia Enterprise
Mitar created T300124: In Wikimedia Enterprise HTML Dumps, categories and templates are not always extracted.
Jan 26 2022, 10:49 AM · Wikimedia Enterprise Volunteer Request, Wikimedia Enterprise

Jan 24 2022

Mitar added a comment to T298436: Wikimedia Enterprise HTML dumps as bzip2 archive.

Have you also evaluated Zstandard? I tested it some time ago and it performs really well. I have not yet tested the Go libraries, though.

Jan 24 2022, 10:38 PM · Wikimedia Enterprise Volunteer Request, Wikimedia Enterprise

Jan 23 2022

Mitar added a comment to T226311: Some WebM video files are misdetected as audio files due to the MIME detector not scanning enough bytes.

I see there is a patch to fix this already made. What is the progress on it? What is yet to be done for this to be addressed?

Jan 23 2022, 3:32 PM · User-TheDJ, MW-1.39-notes (1.39.0-wmf.9; 2022-04-25), MW-1.38-notes (1.38.0-wmf.23; 2022-02-21), MediaWiki-File-management, MediaWiki-Gallery, Multimedia, Commons
Mitar added a comment to T155320: Implement strict mime type detection and media type inferring of audio/video files.

Could I suggest that audio files are categorized as audio/ogg and video files as video/ogg instead of both of them application/ogg? All those media types are valid ones: https://en.wikipedia.org/wiki/Ogg But it would be really nice if it was possible from the media type to know if it is audio or video.

Jan 23 2022, 3:30 PM · TimedMediaHandler-Transcode, Commons, MediaWiki-File-management, Multimedia, Technical-Debt

Jan 22 2022

Mitar added a comment to T226311: Some WebM video files are misdetected as audio files due to the MIME detector not scanning enough bytes.

Another example: https://commons.wikimedia.org/wiki/File:The_Lost_Express_(1926).webm

Jan 22 2022, 9:28 PM · User-TheDJ, MW-1.39-notes (1.39.0-wmf.9; 2022-04-25), MW-1.38-notes (1.38.0-wmf.23; 2022-02-21), MediaWiki-File-management, MediaWiki-Gallery, Multimedia, Commons
Mitar added a comment to T6421: Image file extension should not be part of the name.

I oppose this because I have a use case where it is useful to know from Wikidata's commonsMedia reference what the file type/media type of the referenced file is, without having to call any API. I can now determine this just by looking at the file extension of the commonsMedia reference, while otherwise I would have to call an API to determine it. If one is processing the whole Wikidata dump, this could require a lot of API calls.
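
The use case in this comment could look roughly like the sketch below: a lookup from the file extension of a commonsMedia value to a media type, with no API call. The mapping is a small illustrative subset chosen for this example, not a list taken from MediaWiki, and the helper name `media_type_from_filename` is hypothetical.

```python
import os.path
from typing import Optional

# Illustrative subset only; a real mapping would cover every extension Commons accepts.
EXTENSION_TO_MEDIA_TYPE = {
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".png": "image/png",
    ".svg": "image/svg+xml",
    ".pdf": "application/pdf",
    ".djvu": "image/vnd.djvu",
    ".webm": "video/webm",
    ".ogg": "application/ogg",
}


def media_type_from_filename(filename: str) -> Optional[str]:
    """Guess a media type from the extension of a commonsMedia file name."""
    _, extension = os.path.splitext(filename)
    return EXTENSION_TO_MEDIA_TYPE.get(extension.lower())


print(media_type_from_filename("17de Mai 1872.djvu"))
print(media_type_from_filename("06.45 Management rep letter.pdf"))
```

As other comments in this feed point out, the extension alone cannot tell audio from video for container formats such as Ogg and WebM, so this lookup is only a first approximation.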

Jan 22 2022, 12:40 AM · Commons, Multimedia, MediaWiki-File-management

Jan 20 2022

Mitar created T299621: Add to article schema the timestamp of the first revision.
Jan 20 2022, 3:12 AM · Wikimedia Enterprise Volunteer Request, Wikimedia Enterprise

Jan 18 2022

Mitar created T299464: Add to article schema number of revision the article has, at the current revision.
Jan 18 2022, 10:16 PM · Wikimedia Enterprise Volunteer Request, Wikimedia Enterprise
Mitar added a comment to T298436: Wikimedia Enterprise HTML dumps as bzip2 archive.

Interesting. I also use Go to parse those dumps (see library here) (it would be nice if we could use shared public Go struct representation of JSON) and am using pbzip2 to decompress bzip2 which supports parallel decompression. But I also cannot find any parallel compression for bzip2. So it is an interesting mismatch: pgzip has parallel compression implementation but serial decompression, while pbzip2 has parallel decompression, but not compression yet.

Jan 18 2022, 10:10 PM · Wikimedia Enterprise Volunteer Request, Wikimedia Enterprise

Jan 15 2022

Mitar added a comment to T298437: Provide a public pull API endpoint.

In the meantime, is the code which produces the current JSON available somewhere as open source? I could use that to generate similar JSONs for myself while waiting for the API endpoint.

Jan 15 2022, 9:37 PM · Wikimedia Enterprise Volunteer Request, Wikimedia Enterprise