
Index DjVu files as OFFICE instead of BITMAP?
Closed, Resolved · Public

Description

mime.info currently categorizes DjVu files with the BITMAP img_media_type, and that's also how CirrusSearch indexes them (https://commons.wikimedia.org/w/index.php?search=filetype%3Abitmap+filemime%3Aimage%2Fvnd.djvu&title=Special%3ASearch).

PDF files, on the other hand, are considered OFFICE.

Is there a reason DjVu files are not, like PDF files, also OFFICE?
If not: any other reason we wouldn't want to change that?
Who do we need to consult if we'd like to make that happen?

Event Timeline

Restricted Application added a subscriber: Aklapper. · Jan 3 2019, 1:21 PM
TheDJ added a subscriber: TheDJ. · Jan 3 2019, 2:11 PM

I think that's just how it was initially done, and no one ever thought about alignment between the two.

Added as BITMAP by @daniel in 715b1aa8f1a80309a17ee0e30822c30531d89bd5 in 2007; not sure what'd happen if we just changed that to OFFICE.

Adding @dcausse for his opinion

TheDJ added a comment. · Jan 3 2019, 3:39 PM

You'd have to purge all the DjVu files, because this is cached in the database as img_media_type.

> You'd have to purge all the DjVu files, because this is cached in the database as img_media_type.

Sure, but that's a relatively simple operation in comparison to enabling WikibaseMediaInfo, which we're also doing. ;-)

dcausse added a comment. · Edited · Jan 3 2019, 4:14 PM

Cirrus refreshes its documents constantly; if the entries in the DB are fixed, Cirrus will fix its index on its own.
In theory, all entries should have been updated after 2 months. If that's too long, we can certainly prepare a script to force-update only these specific files.

Change 482067 had a related patch set uploaded (by Jforrester; owner: Jforrester):
[mediawiki/core@master] MIME: Re-classify DjVu files as OFFICE, like PDFs, and not as BITMAP

https://gerrit.wikimedia.org/r/482067

Ramsey-WMF triaged this task as Normal priority. · Jan 3 2019, 6:36 PM
Ramsey-WMF moved this task from Untriaged to Triaged on the Multimedia board.

Change 482067 merged by jenkins-bot:
[mediawiki/core@master] MIME: Re-classify DjVu files as OFFICE, like PDFs, and not as BITMAP

https://gerrit.wikimedia.org/r/482067

Jdforrester-WMF removed a project: Patch-For-Review.

This will go live on 2019-02-05 on TestCommons and 2019-02-06 on Commons.

I'm thrilled to see that someone actually cares about img_media_type, and uses it for something :)

Confirmed that newly uploaded DjVu files are indexed as office. Will keep this ticket open for a bit longer while we see if Cirrus does reindex old files properly.

Ramsey-WMF moved this task from Triaged to Tracking on the Multimedia board.

> Confirmed that newly uploaded DjVu files are indexed as office. Will keep this ticket open for a bit longer while we see if Cirrus does reindex old files properly.

It is not enough to wait for CirrusSearch to catch up. CirrusSearch uses the fields in the image table, which are not updated for old files, so CirrusSearch cannot change the index on its own.

You can see on https://commons.wikimedia.org/wiki/Special:MediaStatistics that DjVu files are now listed under "Bitmap images" for the old files (106,637 at 06:58, 7 February 2019) and under "Office" for the new files (3 at 06:58, 7 February 2019).

I would say it needs a maintenance script, something like:

refreshImageMetadata.php --mediatype BITMAP --mime image/vnd.djvu
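
To illustrate why re-indexing alone is not enough, here is a toy Python model of the situation. It is only a sketch, not MediaWiki or CirrusSearch code: `ImageRow`, `build_search_doc`, `refresh_rows`, and the lookup table are hypothetical names, and the single `mime` field simplifies how the image table actually stores MIME types. It assumes, as described above, that the search document copies the cached media type from the row rather than re-running MIME detection.

```python
from dataclasses import dataclass

# Hypothetical stand-in for the MIME-to-media-type lookup after the re-classification.
MEDIA_TYPE_BY_MIME = {
    "image/vnd.djvu": "OFFICE",
    "application/pdf": "OFFICE",
    "image/png": "BITMAP",
}


@dataclass
class ImageRow:
    name: str
    mime: str
    media_type: str  # cached at upload time, like img_media_type


def build_search_doc(row):
    # The index copies the cached value; it does not re-detect the media type.
    return {"title": row.name, "filetype": row.media_type.lower()}


def refresh_rows(rows, mediatype, mime):
    """Recompute the cached media type for rows matching both filters."""
    updated = 0
    for row in rows:
        if row.media_type == mediatype and row.mime == mime:
            row.media_type = MEDIA_TYPE_BY_MIME[row.mime]
            updated += 1
    return updated


rows = [
    ImageRow("Old.djvu", "image/vnd.djvu", "BITMAP"),  # uploaded before the change
    ImageRow("New.djvu", "image/vnd.djvu", "OFFICE"),  # uploaded after the change
    ImageRow("Photo.png", "image/png", "BITMAP"),
]

# Re-indexing before the refresh still yields "bitmap" for the old DjVu file.
before = [build_search_doc(r)["filetype"] for r in rows]
updated = refresh_rows(rows, mediatype="BITMAP", mime="image/vnd.djvu")
# Only after the rows are rewritten does a re-index pick up "office".
after = [build_search_doc(r)["filetype"] for r in rows]
print(before, updated, after)
```

In this model, the filters mirror the `--mediatype` and `--mime` options in the command above: only stale DjVu rows are touched, and genuine bitmap files are left alone.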

I have created a task for WMF.
For third parties, this should be added to the updater, if you want.

For third parties, I've quickly thrown together https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/489337 but it's not ready to merge yet (amongst other things, it should be tested).